0. Data set description
    - Rock data: continuous variable
    - Housing data: categorical variable
1. Multiple Linear Regression
    - Continuous
    - Categorical
2. Performance evaluation
    - R^2
    - Adjusted R^2
3. Regression coefficient interpretation
    - No collinearity
        * (Cover independent features - hypothesis testing)
    - Multicollinearity
        * Unstable model parameters
        * Detection
        * Remedies
            - Leave as it is
            - PCA
            - Drop one of the variables
            - obtain more data
            - feature selection
4. Feature selection & ranking
    - Forward selection
    - Random forest feature importance
    - Recursive feature elimination?

We seldom encounter real-life problems that can be modeled by a single feature. In other words, you can't model them by visualizing the data in a simple 2D plane.

In [2]:
import pandas as pd

prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})

print(data.merge(mean_price, on='city', how='left'))

  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1


In [3]:
prices

  city  price
0    A      1
1    A      1
2    A      1
3    B      2
4    B      2
5    C      3


In [4]:
data.merge(mean_price, on='city', how='left')

  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1


Hacker's Guide: Flowchart

Many real-life problems are multivariate. Simple linear regression with only one feature is limited. It violates the assumption of linear regression - Independence of ??? 

1. Bivariate Linear regression
    * Pythonic Tip: aa
2. Multivariate linear regression. 
    There is a linear 2D plane. 
    * Pythonic Tip: aa
3. Feature Ranking
    Observations
    
    
$y = ax + b$

becomes

$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + b$

Multiple linear regression explains the relationship between one continuous dependent variable ($y$) and two or more independent variables ($x_1$, $x_2$, $x_3$, etc.).
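As a minimal sketch of fitting such a model, the cell below generates synthetic data with three features and known coefficients, then recovers them with scikit-learn's `LinearRegression`. The data, the coefficient values, and the variable names are assumptions made up for illustration; they are not the rock or housing data sets from the outline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = a1*x1 + a2*x2 + a3*x3 + b + noise
n_samples = 200
X = rng.normal(size=(n_samples, 3))
true_coefs = np.array([2.0, -1.0, 0.5])  # a1, a2, a3
true_intercept = 4.0                     # b
y = X @ true_coefs + true_intercept + rng.normal(scale=0.1, size=n_samples)

model = LinearRegression().fit(X, y)

print("estimated a1, a2, a3:", model.coef_)
print("estimated b:", model.intercept_)
print("R^2 on the training data:", model.score(X, y))
```

Because the features here are generated independently, each estimated coefficient should land close to the value used to generate the data.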

The model finds the best-fit line (a plane or hyperplane when there is more than one feature) that represents the effects of your predictors on the dependent variable. It does not include the effects of one predictor on another.

Linear regression assumes the features to be independent.

## Why use multiple linear regression?

Adding radio to the model leads to a substantial improvement in R-squared. This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertising.

## Does multicollinearity affect prediction?

Multicollinearity makes it hard to assess the relative importance of independent variables, but it does not affect the usefulness of the regression equation for prediction.

Linear regression usually serves one of two aims: identifying the effect of each predictor, or prediction. For prediction, multicollinearity has no “bad” effect, so you don’t need PCA or ridge regression to correct for it.

For identification, however, multicollinearity must be taken into account, and you can use PCA, ridge regression, or another approach.

Principal component analysis and ridge regression are popular choices.
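To make the ridge option concrete, here is a small sketch using the closed-form ridge solution on two nearly identical features. The data are synthetic, and `ridge_coefficients` and the penalty `alpha=1.0` are hypothetical choices for illustration, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic data (assumed): x2 is nearly identical to x1, both have a true effect of 1.
x1 = rng.normal(size=n_samples)
x2 = x1 + rng.normal(scale=0.01, size=n_samples)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n_samples)

# Center the data so no intercept is needed (and none is penalized).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

ols = ridge_coefficients(Xc, yc, alpha=0.0)    # alpha = 0 is ordinary least squares
ridge = ridge_coefficients(Xc, yc, alpha=1.0)

print("OLS coefficients:  ", ols.round(2))
print("ridge coefficients:", ridge.round(2))
```

The penalty shrinks the poorly determined direction (the difference between the two coefficients) much more than the well determined one (their sum), which is why the ridge estimates split the shared effect almost evenly.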

Linear dependence between features induces the multicollinearity problem and leads to instability of the model and redundancy of the feature set.


**Multicollinearity makes the individual model parameters unreliable**

In the presence of multicollinearity, common methods of regression analysis, such as least squares, build unstable models of excessive complexity.

### Multicollinearity

Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.
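The experiment described above can be sketched as follows. The data are synthetic and the setup is an assumption for illustration: `x2` is almost a copy of `x1`, `x3` is independent, and all three have a true coefficient of 1. `fit_ols` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 600

# Synthetic data (assumed): x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=n_samples)
x2 = x1 + rng.normal(scale=0.01, size=n_samples)
x3 = rng.normal(size=n_samples)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + rng.normal(scale=0.5, size=n_samples)

def fit_ols(X, y):
    """Least-squares slope coefficients (the intercept is fitted but not returned)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

# Estimate the coefficients on six disjoint subsets of 100 rows each.
subsets = np.array_split(rng.permutation(n_samples), 6)
coefs = np.array([fit_ols(X[idx], y[idx]) for idx in subsets])

coef_spread = coefs.std(axis=0)
print("coefficients per subset (x1, x2, x3):\n", coefs.round(2))
print("spread (std) across subsets:", coef_spread.round(2))
print("x1 + x2 coefficient per subset:", (coefs[:, 0] + coefs[:, 1]).round(2))
```

The coefficients of the two collinear features should swing widely from subset to subset while their sum and the coefficient of `x3` stay close to the true values, which is the sense in which only the individual estimates are unreliable.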

In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable $Y$ while controlling for the others tends to be less precise than if predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one unit change in an independent variable, $X_1$, holding the other variables constant. If $X_1$ is highly correlated with another independent variable, $X_2$, in the given data set, then we have a set of observations for which $X_1$ and $X_2$ have a particular linear stochastic relationship. We don't have a set of observations for which all changes in $X_1$ are independent of changes in $X_2$, so we have an imprecise estimate of the effect of independent changes in $X_1$.

Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.


Multicollinearity is a strong correlation between features that affect the target vector simultaneously.

The independent variables are no longer independent of one another.

Consider PCA

A typical symptom is that the t-tests for each of the individual slopes are non-significant, but the overall F-test is significant. This happens because multicollinearity makes some variables seem useless, lowering their t-statistics, but it has no effect on the F-statistic, which takes an overall view.

In regression, multicollinearity refers to the extent to which independent variables are correlated. Multicollinearity exists when:

 - One independent variable is correlated with another independent variable.
 - One independent variable is correlated with a linear combination of two or more independent variables.

When one independent variable is highly correlated with another independent variable (or with a combination of two or more other independent variables), the marginal contribution of that independent variable is influenced by other independent variables. As a result:

 - Estimates for regression coefficients can be unreliable.
 - Tests of significance for regression coefficients can be misleading.

This means that the analysis of regression coefficients should be preceded by an analysis of multicollinearity.
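One common way to run that preliminary check is the variance inflation factor (VIF), which regresses each feature on all the others. The sketch below computes it with plain NumPy on synthetic data; `variance_inflation_factors` is a hypothetical helper written for this example, and the often-quoted cut-offs of 5 or 10 are rules of thumb rather than hard limits.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic data (assumed): x3 is close to a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
x4 = rng.normal(size=500)  # unrelated to the other features

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3, x4]))
print(dict(zip(["x1", "x2", "x3", "x4"], vifs.round(1))))
```

Note that `x1` and `x2` are uncorrelated with each other, so a pairwise correlation matrix alone would understate the problem; the VIF picks up the second case in the list above, correlation with a linear combination of other variables.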


If you only want to predict the value of a dependent variable, you may not have to worry about multicollinearity. Multiple regression can produce a regression equation that will work for you, even when independent variables are highly correlated.

https://stattrek.com/multiple-regression/multicollinearity.aspx

https://towardsdatascience.com/super-simple-machine-learning-by-me-multiple-linear-regression-part-1-447800e8b624

https://www.kaggle.com/mashimo/features-selection-for-multiple-linear-regression

### Performance Evaluation

In case of more than 3D:

Researchers look at the magnitude of the coefficients and test their statistical significance.

R^2

If the coefficient for a particular variable is significantly greater than zero, researchers judge that the variable contributes to the predictive ability of the regression equation. In this way, it is possible to distinguish variables that are more useful for prediction from those that are less useful. **This approach works only when multi-collinearity is negligible.**

One thing to note is that R-squared never decreases when more variables are added to the model, even if those variables are only weakly associated with the response.
Therefore an adjusted R-squared is provided, which is R-squared penalized for the number of predictors.
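A short sketch of the two metrics, using the standard formula $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ for $n$ samples and $p$ predictors. The data are synthetic and `r2_and_adjusted_r2` is a hypothetical helper for this example.

```python
import numpy as np

def r2_and_adjusted_r2(X, y):
    """Fit ordinary least squares and return (R^2, adjusted R^2) on the same data."""
    n_samples, n_predictors = X.shape
    design = np.column_stack([np.ones(n_samples), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
    return r2, adj_r2

# Synthetic data (assumed): one useful predictor plus ten predictors unrelated to y.
rng = np.random.default_rng(0)
n_samples = 50
x_useful = rng.normal(size=(n_samples, 1))
y = 3.0 * x_useful[:, 0] + rng.normal(size=n_samples)
x_noise = rng.normal(size=(n_samples, 10))

r2_small, adj_small = r2_and_adjusted_r2(x_useful, y)
r2_big, adj_big = r2_and_adjusted_r2(np.column_stack([x_useful, x_noise]), y)

print(f"1 predictor:   R^2 = {r2_small:.4f}, adjusted R^2 = {adj_small:.4f}")
print(f"11 predictors: R^2 = {r2_big:.4f}, adjusted R^2 = {adj_big:.4f}")
```

Plain R-squared cannot go down when the noise columns are added, whereas the adjusted value pays a penalty for each extra predictor.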

# Feature Ranking

These methods are most often used for prediction, but looking at the feature importances can give you a sense of which of your variables have the most effect in these models. You can use that information to engineer new features, drop out features that look like noise, or just to inform you as you continue building models.

Prior to inspecting the feature importances, it is important to check that the model's predictive performance is high enough. Indeed, there is little point in inspecting the important features of a non-predictive model.

R^2 and adjusted R^2

One option is LASSO regression, but LASSO does not work well in the presence of high multicollinearity (see [this discussion](https://stats.stackexchange.com/questions/25611/how-to-deal-with-multicollinearity-when-performing-variable-selection)).

I'm not actually dead-set on using linear regression: the only thing I want is to be able to understand which of the 9 variables is truly driving the variation in the Score variable. Preferably, this would be some method that takes the strong potential for collinearity in these 9 variables into account.
 
For example, in ecology it is very common to calculate a correlation matrix between all the independent variables and, whenever a pair has a correlation bigger than 0.7, remove one of the two variables.
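A minimal pandas sketch of that rule is shown below. `drop_correlated` is a hypothetical helper, the data are synthetic, and always dropping the later column of a pair is an arbitrary choice made here for simplicity; which variable to keep is really a modelling decision.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.7):
    """Drop every column whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic data (assumed): b is nearly a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
features = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

reduced, dropped = drop_correlated(features, threshold=0.7)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

This only catches pairwise correlation, so a variable that is a combination of several others can still slip through.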
 
"Feature importance" means which predictors drive the target variable the most.

Training a model that accurately predicts outcomes is great, but most of the time you don't just need predictions; you want to be able to interpret your model. For example, if you build a model of house prices, knowing which features are most predictive of price tells you which features people are willing to pay for.

The intuition behind permutation importance is that if a feature is not useful for predicting an outcome, then altering or permuting its values will not result in a significant reduction in a model’s performance. This technique is commonly used with random forests.
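The idea can be sketched in a few lines of NumPy. The example below uses a plain least-squares model rather than a random forest, scores with R-squared on held-out rows, and runs on synthetic data; `permutation_drop_in_r2` and `predict` are hypothetical names for this illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_drop_in_r2(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - r2_score(y, predict(X_shuffled)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data (assumed): strong feature, weak feature, pure-noise feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Fit on the first 300 rows, measure importance on the remaining 200.
X_train, y_train, X_valid, y_valid = X[:300], y[:300], X[300:], y[300:]
design = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)

def predict(X_new):
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

importances = permutation_drop_in_r2(predict, X_valid, y_valid)
print("permutation importances:", importances.round(3))
```

Only the `predict` function is passed in, so the same helper would work unchanged with any fitted model that can produce predictions.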

**Random Forest Feature importance**

 - Random forest does not require normalization

There is actually a second way of computing feature importance, often called “Gini importance”. In the case of a random forest regressor, the importance is assessed by how much a feature decreases the weighted variance in a tree (in the case of classification it’s not variance but Gini impurity).

 In fact, the RF importance technique we'll introduce here (permutation importance) is applicable to any model, though few machine learning practitioners seem to realize this. Permutation importance is a common, reasonably efficient, and very reliable technique. It directly measures variable importance by observing the effect on model accuracy of randomly shuffling each predictor variable. This technique is broadly-applicable because it doesn't rely on internal model parameters, such as linear regression coefficients (which are really just poor proxies for feature importance).
 
 In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.

The mean decrease in impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within RFs.
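For reference, a minimal sketch of reading the impurity-based importances from a scikit-learn random forest. The data are synthetic, and the hyperparameters (`n_estimators=100`, `random_state=0`) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed): column 0 is strong, column 1 is weak, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# oob_score=True scores the forest on out-of-bag samples, so no separate validation set is needed.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # mean decrease in impurity, normalized to sum to 1
print("OOB R^2:", round(forest.oob_score_, 3))
print("impurity-based importances:", importances.round(3))
```

Checking the out-of-bag score first follows the earlier advice: the importances are only worth reading if the model is predictive.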

https://explained.ai/rf-importance/

Impurity-based importance inflates the importance of continuous and high-cardinality categorical variables.

The advantage of Random Forests, of course, is that they provide OOB samples by construction so users don't have to extract their own validation set and pass it to the feature importance function.

**Permutation feature importance**

Does not require normalization



## Shapley Value

The Shapley value, coined by Shapley (1953), is a method for assigning payouts to players depending on their contribution to the total payout. Players cooperate in a coalition and receive a certain profit from this cooperation.
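As a small worked example, the sketch below computes exact Shapley values for a three-player game by averaging each player's marginal contribution over every order in which the players could join the coalition. The players and payouts are made up for illustration, and `shapley_values` is a hypothetical helper; enumerating all orderings is only feasible for a handful of players.

```python
from itertools import permutations

def shapley_values(players, payout):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += payout(with_player) - payout(coalition)
            coalition = with_player
    return {player: total / len(orderings) for player, total in totals.items()}

# Hypothetical game: the payout of every possible coalition is made up.
payouts = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 30,
    frozenset("AB"): 40,
    frozenset("AC"): 50,
    frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

values = shapley_values(["A", "B", "C"], payouts.__getitem__)
print(values)  # {'A': 20.0, 'B': 30.0, 'C': 40.0}
```

The three values add up to 90, the payout of the full coalition, so the total is shared out completely among the players.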
 
## Recursive VIF elimination
 
