## Multicollinearity in Regression Analysis: Problems, Detection, and Solutions

Multicollinearity occurs when independent variables in a regression model are correlated. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the result.

### Why is Multicollinearity a Potential Problem?


The goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable.
A regression coefficient represents the mean change in the dependent variable for each one-unit change in an independent variable when you
*hold all of the other independent variables constant*.

When independent variables are correlated, changes in one variable are associated with shifts in another. The stronger the
correlation, the harder it is to change one variable without changing another, and the harder it is for the model to estimate each variable's effect independently.

There are two basic kinds of multicollinearity:
- **Structural multicollinearity**: Occurs when we create a model term from other terms. For example, if you square a term X to model curvature, then X
is correlated with $X^2$.
- **Data multicollinearity**: Present in the data itself rather than being an artifact of our model. Observational studies are more likely
to exhibit this kind of multicollinearity.
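
As a rough illustration of structural multicollinearity, the sketch below measures the correlation between a predictor and its square, before and after centering. The values in `x` are hypothetical and are not taken from the dataset used later in this article.

```python
import numpy as np

# Hypothetical predictor: 50 evenly spaced positive values (not the article's data).
x = np.linspace(1.0, 10.0, 50)

# Correlation between X and X^2 on the raw scale.
raw_corr = np.corrcoef(x, x**2)[0, 1]

# Center X first, then square the centered values.
x_centered = x - x.mean()
centered_corr = np.corrcoef(x_centered, x_centered**2)[0, 1]

print(f"corr(X, X^2) raw:      {raw_corr:.3f}")
print(f"corr(X, X^2) centered: {centered_corr:.3f}")
```

On this symmetric toy input the raw correlation is close to 1, while the centered correlation is essentially zero. Real data is rarely perfectly symmetric, so centering usually reduces the correlation rather than removing it.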

## What Problems Does Multicollinearity Cause?

- The coefficients become very sensitive to small changes in the model.
- It reduces the precision of the estimated coefficients, so you may not be able to trust the p-values to identify independent variables that are statistically significant.

As a result, you cannot be confident about the actual effect of each variable. This makes it difficult to specify the correct model, and difficult to justify the model if many of your p-values
are not statistically significant.
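
The loss of precision can be sketched numerically. In ordinary least squares, the variance of each coefficient is proportional to the matching diagonal entry of $(X^TX)^{-1}$, so comparing those entries for uncorrelated and correlated predictors shows the inflation directly. The simulated data, the seed, and the helper name `coef_variance_factors` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def coef_variance_factors(x1, x2):
    """Return diag((X'X)^-1) for the two slopes.

    OLS coefficient variances equal these values times the error variance.
    """
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))[1:]

x1 = rng.normal(size=n)
noise = rng.normal(size=n)

# Case A: the second predictor is unrelated to x1.
x2_independent = noise
# Case B: the second predictor is strongly correlated with x1
# (population correlation of 0.95 by construction).
x2_correlated = 0.95 * x1 + np.sqrt(1 - 0.95**2) * noise

var_independent = coef_variance_factors(x1, x2_independent)
var_correlated = coef_variance_factors(x1, x2_correlated)

print("variance factors, independent predictors:", var_independent)
print("variance factors, correlated predictors: ", var_correlated)
```

The factors in the correlated case come out several times larger, which translates into wider confidence intervals and larger p-values for the same amount of data.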

## Do I Have to Fix Multicollinearity?

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep in mind:
1. The severity of the problems increases with the degree of multicollinearity. So if you have only moderate multicollinearity,
you may not need to resolve it.
2. It affects only specific independent variables that are correlated. If multicollinearity is not present in independent variables
you are interested in, you may not need to resolve it.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions,
or the goodness-of-fit statistics. If your main goal is to make predictions without understanding the role of each independent variable,
you don't need to reduce even severe multicollinearity.

## Testing for Multicollinearity with Variance Inflation Factors (VIF)

Statistical software calculates a VIF for each independent variable. VIFs start at 1, which indicates no correlation between that independent variable and the others, and they have no upper limit.
VIFs between 1 and 5 suggest moderate correlation that is not severe enough to warrant corrective measures. VIFs greater than 5 indicate critical levels of multicollinearity, where the
coefficients are poorly estimated and the p-values are questionable.

Assessing VIFs is particularly important for observational studies, because these studies are more prone to multicollinearity.
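
To make the VIF less of a black box, here is a minimal sketch of the usual definition: regress each predictor on all of the others and compute $VIF_j = 1 / (1 - R_j^2)$. The function name `vif_by_auxiliary_regression` and the simulated data are assumptions for illustration. The worked example below relies on the `statsmodels` implementation instead.

```python
import numpy as np

def vif_by_auxiliary_regression(X):
    """VIF for each column of X (predictors only, no intercept column).

    Each predictor is regressed on all the others plus an intercept,
    and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        total_ss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - (residuals @ residuals) / total_ss
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical data: x2 is strongly related to x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + 0.2 * rng.normal(size=300)
x3 = rng.normal(size=300)

vifs = vif_by_auxiliary_regression(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two related predictors should show VIFs well above 5, while the unrelated one should stay close to 1.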


## Multicollinearity Example: Predicting Bone Density in the Femur

This example will show you how to detect multicollinearity as well as illustrate its effects.
It will also show you how to remove structural multicollinearity.

We'll use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage,
weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck).

## Steps for Implementing VIF

1. Run a multiple regression. 
2. Calculate the VIF factors.
3. Inspect the VIF for each predictor variable. If a VIF is 5 or higher, multicollinearity is likely present.

In [1]:
import pandas as pd 
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from scipy import stats


In [2]:
data = pd.read_csv('MulticollinearityExample.csv')
print(data.head())

   Femoral_Neck  Fat_percentage  Weight_kg  Activity  Fat_percentage_S  \
0         0.934            25.3  52.163126   3508.44         -3.265217   
1         0.888            29.3  61.801965   2773.54          0.734783   
2         0.933            37.7  93.440034   1738.97          9.134783   
3         0.757            32.8  59.874197   1665.29          4.234783   
4         1.031            24.6  50.348756   3982.95         -3.965217   

    Weight_S   Activity_S  
0  -1.765066   946.450435  
1   7.873772   211.550435  
2  39.511842  -823.019565  
3   5.946005  -896.699565  
4  -3.579436  1420.960435  


In [3]:
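# Keep only the raw (uncentered) columns; .copy() avoids a SettingWithCopyWarning when a column is added later.
data = data[['Femoral_Neck', 'Fat_percentage', 'Weight_kg', 'Activity']].copy()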
print(data.head())

   Femoral_Neck  Fat_percentage  Weight_kg  Activity
0         0.934            25.3  52.163126   3508.44
1         0.888            29.3  61.801965   2773.54
2         0.933            37.7  93.440034   1738.97
3         0.757            32.8  59.874197   1665.29
4         1.031            24.6  50.348756   3982.95


In [4]:
# Interaction term: elementwise product of body fat percentage and weight.
data["Fat_Weight_kg"] = data["Fat_percentage"] * data["Weight_kg"]

print("DataFrame after addition of new column")
print(data.head())

DataFrame after addition of new column
   Femoral_Neck  Fat_percentage  Weight_kg  Activity  Fat_Weight_kg
0         0.934            25.3  52.163126   3508.44    1319.727088
1         0.888            29.3  61.801965   2773.54    1810.797560
2         0.933            37.7  93.440034   1738.97    3522.689297
3         0.757            32.8  59.874197   1665.29    1963.873655
4         1.031            24.6  50.348756   3982.95    1238.579407


In [5]:
y, X = dmatrices("Femoral_Neck ~ Fat_percentage+Weight_kg+Activity+Fat_Weight_kg", data=data, return_type='dataframe')


In [6]:
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['variable'] = X.columns

print(vif)

          VIF        variable
0  321.088504       Intercept
1   14.931555  Fat_percentage
2   33.948375       Weight_kg
3    1.053005        Activity
4   75.059251   Fat_Weight_kg


The regression fitted on these variables (its summary appears later, in the model comparison) shows that Weight, Activity, and the interaction term are statistically significant, while percent body fat is not. However, the VIFs
indicate that our model has severe multicollinearity for some of the independent variables.

Notice that Activity has a VIF near 1, which shows that multicollinearity does not affect it and we can trust the coefficient and p-value with no further action.

Additionally, at least some of the multicollinearity in our model is the structural type. The term Fat_Weight_kg is the product of body fat and weight. Clearly, there is a correlation between the interaction term and both of the main effect terms.
The VIFs reflect these relationships.

## Center the Independent Variables to Reduce Structural Multicollinearity

In our model, the interaction term is at least partially responsible for the high VIFs, since interaction terms
include the main effects.

Centering the variables (a form of standardizing) by subtracting the mean reduces this structural multicollinearity. The process involves calculating
the mean for each continuous independent variable and then subtracting that mean from all observed values of the
variable.

The advantage of just subtracting the mean is that the interpretation of the coefficients remains the same.
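
The example dataset already includes centered `_S` columns, so the cells below simply select them. For data without such columns, centering could be sketched roughly as follows. The small DataFrame holds hypothetical values, and the `_S` suffix only mirrors the naming used in this article.

```python
import pandas as pd

# Hypothetical values for illustration only (not the article's dataset).
df = pd.DataFrame({
    "Fat_percentage": [25.0, 30.0, 35.0, 30.0],
    "Weight_kg": [50.0, 60.0, 90.0, 60.0],
})

# Subtract each column's mean from every value in that column.
centered = (df - df.mean()).add_suffix("_S")

# Build the interaction term from the centered columns, not the raw ones.
centered["Fat_Weight_kg_S"] = centered["Fat_percentage_S"] * centered["Weight_kg_S"]

print(centered)
```

The key design choice is the order of operations: center first, then multiply. Centering an interaction term that was built from raw values would not remove its correlation with the main effects.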

## Regression with Centered Variables

In [7]:
c_data = pd.read_csv('MulticollinearityExample.csv')
print(c_data.head())

   Femoral_Neck  Fat_percentage  Weight_kg  Activity  Fat_percentage_S  \
0         0.934            25.3  52.163126   3508.44         -3.265217   
1         0.888            29.3  61.801965   2773.54          0.734783   
2         0.933            37.7  93.440034   1738.97          9.134783   
3         0.757            32.8  59.874197   1665.29          4.234783   
4         1.031            24.6  50.348756   3982.95         -3.965217   

    Weight_S   Activity_S  
0  -1.765066   946.450435  
1   7.873772   211.550435  
2  39.511842  -823.019565  
3   5.946005  -896.699565  
4  -3.579436  1420.960435  


In [8]:
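# Keep only the centered columns; .copy() avoids a SettingWithCopyWarning when a column is added later.
c_data = c_data[['Femoral_Neck', 'Fat_percentage_S', 'Weight_S', 'Activity_S']].copy()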
print(c_data.head())

   Femoral_Neck  Fat_percentage_S   Weight_S   Activity_S
0         0.934         -3.265217  -1.765066   946.450435
1         0.888          0.734783   7.873772   211.550435
2         0.933          9.134783  39.511842  -823.019565
3         0.757          4.234783   5.946005  -896.699565
4         1.031         -3.965217  -3.579436  1420.960435


In [9]:
# Interaction term built from the centered variables.
c_data["Fat_Weight_kg_S"] = c_data["Fat_percentage_S"] * c_data["Weight_S"]

print("DataFrame after addition of new column")
print(c_data.head())

DataFrame after addition of new column
   Femoral_Neck  Fat_percentage_S   Weight_S   Activity_S  Fat_Weight_kg_S
0         0.934         -3.265217  -1.765066   946.450435         5.763324
1         0.888          0.734783   7.873772   211.550435         5.785511
2         0.933          9.134783  39.511842  -823.019565       360.932090
3         0.757          4.234783   5.946005  -896.699565        25.180037
4         1.031         -3.965217  -3.579436  1420.960435        14.193241


In [10]:
c_y, c_X = dmatrices("Femoral_Neck ~ Fat_percentage_S+Weight_S+Activity_S+Fat_Weight_kg_S", data=c_data, return_type='dataframe')

In [11]:
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(c_X.values, i) for i in range(c_X.shape[1])]
vif['variable'] = c_X.columns

print(vif)

        VIF          variable
0  1.753396         Intercept
1  3.323870  Fat_percentage_S
2  4.745648          Weight_S
3  1.053005        Activity_S
4  1.991063   Fat_Weight_kg_S


Notice that all VIFs are now less than 5. With the structural multicollinearity removed, we can see that there is still some multicollinearity in our data, but
it is not severe enough to warrant further corrective measures.

## Comparing Regression Models to Reveal Multicollinearity Effects

In [14]:
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     27.95
Date:                Wed, 09 Feb 2022   Prob (F-statistic):           6.24e-15
Time:                        11:45:47   Log-Likelihood:                 116.01
No. Observations:                  92   AIC:                            -222.0
Df Residuals:                      87   BIC:                            -209.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.1549      0.132      1.

In [15]:
c_X2 = sm.add_constant(c_X)
c_est = sm.OLS(c_y, c_X2)
c_est2 = c_est.fit()
print(c_est2.summary())

                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     27.95
Date:                Wed, 09 Feb 2022   Prob (F-statistic):           6.24e-15
Time:                        11:46:03   Log-Likelihood:                 116.01
No. Observations:                  92   AIC:                            -222.0
Df Residuals:                      87   BIC:                            -209.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.8216      0.010  

The first independent variable we'll look at is Activity. This variable was the only one to have
almost no multicollinearity in the first model. Compare the Activity coefficients and p-values between the
two models and you'll see that they are the same (coefficient = 0.000022, p-value = 0.003). This illustrates how
multicollinearity affects only the variables that are highly correlated.

Let's look at the variables that had high VIFs in the first model. The standard error of a coefficient measures
the precision of the estimate, and lower values indicate more precise estimates. The standard errors in the second model
are lower for both fat percentage and weight. Additionally, fat percentage is significant in the second model even though
it wasn't in the first model. Not only that, but the coefficient sign for fat percentage has changed from positive to negative!

Lower precision, switched signs, and a lack of statistical significance are typical problems associated with multicollinearity.

Now, take a look at the summary statistics at the top of the OLS Regression Results for both models. You'll notice that R-squared,
adjusted R-squared, the F-statistic, the log-likelihood, AIC, and BIC are all identical. Multicollinearity doesn't affect the predictions or goodness-of-fit.
If you just want to make predictions, the model with severe multicollinearity is just as good!
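
The reason both models fit equally well is that the centered terms span the same column space as the raw terms, so least squares produces the same fitted values either way. The sketch below checks this on simulated data. The variable names, seed, and coefficient values are assumptions and have no connection to the bone density dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Hypothetical predictors with nonzero means, plus an interaction effect.
a = rng.normal(loc=30.0, scale=5.0, size=n)
b = rng.normal(loc=60.0, scale=10.0, size=n)
y = 1.0 + 0.02 * a - 0.01 * b + 0.001 * a * b + rng.normal(scale=0.1, size=n)

def fit(a, b, y):
    """OLS of y on an intercept, a, b, and a*b; returns coefficients and fitted values."""
    X = np.column_stack([np.ones(len(a)), a, b, a * b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

beta_raw, fitted_raw = fit(a, b, y)
beta_centered, fitted_centered = fit(a - a.mean(), b - b.mean(), y)

max_diff = np.max(np.abs(fitted_raw - fitted_centered))
print("max difference in fitted values:", max_diff)
print("raw coefficients:     ", beta_raw)
print("centered coefficients:", beta_centered)
```

The fitted values agree up to floating-point error and the interaction coefficient is unchanged, while the intercept and main-effect coefficients differ between the two parameterizations.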

## How to Deal with Multicollinearity

What if you have severe multicollinearity in your data and you find that you must deal with it?
The answer isn't always clear but potential solutions include the following:
- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables (e.g. principal component analysis (PCA) or partial least squares regression).
- LASSO and Ridge regression are advanced forms of regression analysis that can handle multicollinearity.
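
As one hedged illustration of the last option, ridge regression adds a penalty `alpha` to the diagonal of $X^TX$ before solving, which stabilizes the estimates when predictors are nearly collinear. The closed-form sketch below uses simulated data and an arbitrary `alpha`. In practice you would typically use a library implementation and choose the penalty by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Hypothetical, strongly correlated predictors.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Center predictors and response so that no intercept term is needed.
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y_centered = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha * I)^-1 X'y; alpha=0 gives plain OLS."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

beta_ols = ridge_coefficients(X, y_centered, alpha=0.0)
beta_ridge = ridge_coefficients(X, y_centered, alpha=10.0)

print("OLS coefficients:  ", beta_ols)
print("ridge coefficients:", beta_ridge)
```

The ridge coefficients are shrunk toward zero compared with OLS. The trade-off is that they are biased, and the usual p-values no longer apply.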

Remember that all of these have downsides. If you can accept less precise coefficients, or a regression model with 
high R-squared but hardly any statistically significant variables, then not doing anything about the multicollinearity might 
be the best solution.