## Multicollinearity in Regression Analysis: Problems, Detection, and Solutions

Multicollinearity occurs when independent variables in a regression model are correlated with one another. If the correlation is strong enough, it can cause problems when you fit the model and interpret the results.

### Why is Multicollinearity a Potential Problem?


The goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable.
A regression coefficient represents the mean change in the dependent variable for each one-unit change in an independent variable
*when all of the other independent variables are held constant*.

When independent variables are correlated, changes in one variable are associated with shifts in another. The stronger the
correlation, the harder it is to change one variable without changing another, and the harder it is for the model to estimate each variable's effect separately.

There are two basic kinds of multicollinearity:
- **Structural multicollinearity**: Occurs when we create a model term from other terms. For example, if you square the term $X$ to model curvature, then $X$
is correlated with $X^2$.
- **Data multicollinearity**: Present in the data itself rather than being an artifact of our model. Observational studies are more likely
to exhibit this kind of multicollinearity.
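To make the structural case concrete, here is a minimal sketch on synthetic data (the variables `x` and `z` and their ranges are assumptions chosen for illustration, not the bone density data). It shows that a squared term and an interaction term are correlated with the terms they are built from, even when the underlying predictors are independent of each other.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Two hypothetical, independent, strictly positive predictors
x = rng.uniform(40, 100, size=500)
z = rng.uniform(15, 40, size=500)

# Correlation between the predictors themselves (should be close to 0)
corr_xz = np.corrcoef(x, z)[0, 1]

# Correlation between a term and the model terms derived from it
corr_square = np.corrcoef(x, x**2)[0, 1]    # X versus X^2
corr_inter_x = np.corrcoef(x, x * z)[0, 1]  # X versus the interaction X*Z
corr_inter_z = np.corrcoef(z, x * z)[0, 1]  # Z versus the interaction X*Z

print(f"corr(X, Z)   = {corr_xz:.3f}")
print(f"corr(X, X^2) = {corr_square:.3f}")
print(f"corr(X, X*Z) = {corr_inter_x:.3f}")
print(f"corr(Z, X*Z) = {corr_inter_z:.3f}")
```

The derived terms inherit their correlation from how they are constructed, which is why this kind of multicollinearity is an artifact of the model rather than of the data.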

## What Problems Does Multicollinearity Cause?

- The coefficient estimates become very sensitive to small changes in the model.
- It reduces the precision of the estimated coefficients. You may not be able to trust the p-values to identify independent variables that are statistically significant.

As a result, you cannot be confident about the actual effect of each variable. This makes it difficult to specify the correct model, and to justify the model if many of your coefficients
are not statistically significant.
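The loss of precision can be sketched with the standard OLS formula for coefficient standard errors, the square root of the diagonal of $s^2 (X^\top X)^{-1}$. The example below uses synthetic data; the function name `ols_standard_errors`, the true coefficients, and the correlation values are assumptions made for illustration.

```python
import numpy as np

def ols_standard_errors(rho, n=200, seed=0):
    """Fit OLS on two predictors with correlation `rho`; return the coefficient standard errors."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)

    design = np.column_stack([np.ones(n), X])          # constant, x1, x2
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # OLS estimates
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])     # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(design.T @ design)
    return np.sqrt(np.diag(cov_beta))                  # [se_const, se_x1, se_x2]

se_low = ols_standard_errors(rho=0.0)
se_high = ols_standard_errors(rho=0.95)

print("std errors with uncorrelated predictors:  ", se_low.round(3))
print("std errors with highly correlated predictors:", se_high.round(3))
```

With the same sample size and noise level, the standard errors of the two slope coefficients are noticeably larger when the predictors are highly correlated. Wider standard errors mean larger p-values for the same true effect.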

## Do I Have to Fix Multicollinearity?

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep in mind:
1. The severity of the problems increases with the degree of multicollinearity. So if you have only moderate multicollinearity,
you may not need to resolve it.
2. It affects only the specific independent variables that are correlated. If multicollinearity is not present in the independent variables
you are interested in, you may not need to resolve it.
3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions,
or the goodness-of-fit statistics. If your main goal is to make predictions and you don't need to understand the role of each independent variable,
you don't need to reduce even severe multicollinearity.
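The third point can be checked with a small simulation. The sketch below assumes synthetic data and a hypothetical helper named `spread`: it refits the same model on many simulated datasets and records how much one coefficient and one prediction vary from fit to fit.

```python
import numpy as np

def spread(rho, n=100, n_sims=300, seed=1):
    """Return (sd of the x1 coefficient, sd of one prediction) across simulated fits."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x_new = np.array([1.0, 1.0, 1.0])  # constant term, x1 = 1, x2 = 1
    coefs, preds = [], []
    for _ in range(n_sims):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        coefs.append(beta[1])        # estimated coefficient of x1
        preds.append(x_new @ beta)   # predicted response at x_new
    return float(np.std(coefs)), float(np.std(preds))

coef_sd_low, pred_sd_low = spread(rho=0.0)
coef_sd_high, pred_sd_high = spread(rho=0.95)

print(f"rho=0.00: coefficient sd={coef_sd_low:.3f}, prediction sd={pred_sd_low:.3f}")
print(f"rho=0.95: coefficient sd={coef_sd_high:.3f}, prediction sd={pred_sd_high:.3f}")
```

The coefficient becomes much less stable under high correlation while the prediction does not. One caveat: the prediction point here follows the same pattern as the data (x1 and x2 move together). A point that breaks that pattern, such as x1 = 1 and x2 = -1, would be an extrapolation and would not share this stability.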

## Testing for Multicollinearity with Variance Inflation Factors (VIF)

Statistical software calculates a VIF for each independent variable. VIFs start at 1, which indicates no correlation between that independent variable and the others, and they have no upper limit.
VIFs between 1 and 5 suggest a moderate correlation that is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity, where the
coefficients are poorly estimated and the p-values are questionable.
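For intuition about where these numbers come from: the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing predictor $j$ on all of the other predictors. The sketch below computes this by hand with NumPy on synthetic data; the function name `vif` and the columns `a`, `b`, `c` are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on a constant plus all other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
c = rng.normal(size=500)            # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

The two near-duplicate columns get very large VIFs, while the unrelated column stays close to 1. These values should agree with `variance_inflation_factor` from statsmodels when the matrix passed to it includes a constant column.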

Assessing VIFs is particularly important for observational studies.


## Multicollinearity Example: Predicting Bone Density in the Femur

This example will show you how to detect multicollinearity as well as illustrate its effects.
It will also show you how to remove structural multicollinearity.

We'll use regression analysis to model the relationship between the independent variables (physical activity, body fat percentage,
weight, and the interaction between weight and body fat) and the dependent variable (bone mineral density of the femoral neck).

## Steps for Implementing VIF

1. Run a multiple regression.
2. Calculate the VIF for each predictor variable.
3. Inspect the VIFs. Consistent with the thresholds above, a VIF greater than 5 indicates problematic multicollinearity; many practitioners treat values above 10 as severe.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [2]:
data = pd.read_csv('MulticollinearityExample.csv')
data.head()

|   | Femoral_Neck | Fat_percentage | Weight_kg | Activity | Fat_percentage_S | Weight_S | Activity_S |
|---|---|---|---|---|---|---|---|
| 0 | 0.934 | 25.3 | 52.163126 | 3508.44 | -3.265217 | -1.765066 | 946.450435 |
| 1 | 0.888 | 29.3 | 61.801965 | 2773.54 | 0.734783 | 7.873772 | 211.550435 |
| 2 | 0.933 | 37.7 | 93.440034 | 1738.97 | 9.134783 | 39.511842 | -823.019565 |
| 3 | 0.757 | 32.8 | 59.874197 | 1665.29 | 4.234783 | 5.946005 | -896.699565 |
| 4 | 1.031 | 24.6 | 50.348756 | 3982.95 | -3.965217 | -3.579436 | 1420.960435 |


In [3]:
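# Keep only the raw columns; .copy() makes this an independent DataFrame so new columns can be added safely
data = data[['Femoral_Neck', 'Fat_percentage', 'Weight_kg', 'Activity']].copy()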
data.head()

|   | Femoral_Neck | Fat_percentage | Weight_kg | Activity |
|---|---|---|---|---|
| 0 | 0.934 | 25.3 | 52.163126 | 3508.44 |
| 1 | 0.888 | 29.3 | 61.801965 | 2773.54 |
| 2 | 0.933 | 37.7 | 93.440034 | 1738.97 |
| 3 | 0.757 | 32.8 | 59.874197 | 1665.29 |
| 4 | 1.031 | 24.6 | 50.348756 | 3982.95 |


In [4]:
# Interaction term: the product of body fat percentage and weight
data["Fat_percentage_times_Weight_kg"] = data["Fat_percentage"] * data["Weight_kg"]
print("DataFrame after addition of new column")
data.head()

DataFrame after addition of new column


|   | Femoral_Neck | Fat_percentage | Weight_kg | Activity | Fat_percentage_times_Weight_kg |
|---|---|---|---|---|---|
| 0 | 0.934 | 25.3 | 52.163126 | 3508.44 | 1319.727088 |
| 1 | 0.888 | 29.3 | 61.801965 | 2773.54 | 1810.79756 |
| 2 | 0.933 | 37.7 | 93.440034 | 1738.97 | 3522.689297 |
| 3 | 0.757 | 32.8 | 59.874197 | 1665.29 | 1963.873655 |
| 4 | 1.031 | 24.6 | 50.348756 | 3982.95 | 1238.579407 |


In [5]:
y = data["Femoral_Neck"]
X = data[["Fat_percentage", "Weight_kg", "Activity", "Fat_percentage_times_Weight_kg"]]

# variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables
X = X.assign(constant=1)

In [6]:
X.shape

(92, 5)

In [7]:
X.head()

|   | Fat_percentage | Weight_kg | Activity | Fat_percentage_times_Weight_kg | constant |
|---|---|---|---|---|---|
| 0 | 25.3 | 52.163126 | 3508.44 | 1319.727088 | 1 |
| 1 | 29.3 | 61.801965 | 2773.54 | 1810.79756 | 1 |
| 2 | 37.7 | 93.440034 | 1738.97 | 3522.689297 | 1 |
| 3 | 32.8 | 59.874197 | 1665.29 | 1963.873655 | 1 |
| 4 | 24.6 | 50.348756 | 3982.95 | 1238.579407 | 1 |


In [8]:
model = sm.OLS(y, X).fit()
print(model.summary())

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['variable'] = X.columns
print(vif)


                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     27.95
Date:                Mon, 03 Jul 2023   Prob (F-statistic):           6.24e-15
Time:                        15:21:04   Log-Likelihood:                 116.01
No. Observations:                  92   AIC:                            -222.0
Df Residuals:                      87   BIC:                            -209.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Fat_percenta

These results show that Weight, Activity, and the interaction between body fat and weight are statistically significant. The percent body fat is not statistically significant. However, the VIFs
indicate that our model has severe multicollinearity for some of the independent variables.

Notice that Activity has a VIF near 1, which shows that multicollinearity does not affect it and we can trust the coefficient and p-value with no further action.

Additionally, at least some of the multicollinearity in our model is the structural type. The term Fat_percentage_times_Weight_kg is the product of body fat and weight, so it is necessarily correlated with both of the main-effect terms.
The VIFs reflect these relationships.

## Reducing Structural Multicollinearity: Regression with No Interaction

In [9]:
# Keep the main effects only; assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X_c = X[["Fat_percentage", "Weight_kg", "Activity"]].assign(const=1)


In [10]:
model = sm.OLS(y, X_c).fit()
print(model.summary())

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_c.values, i) for i in range(X_c.shape[1])]
vif['variable'] = X_c.columns
print(vif)

                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.520
Model:                            OLS   Adj. R-squared:                  0.504
Method:                 Least Squares   F-statistic:                     31.79
Date:                Mon, 03 Jul 2023   Prob (F-statistic):           5.14e-14
Time:                        15:21:04   Log-Likelihood:                 111.77
No. Observations:                  92   AIC:                            -215.5
Df Residuals:                      88   BIC:                            -205.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Fat_percentage    -0.0049      0.002     -2.

Notice that all VIFs are less than 5. By removing the structural multicollinearity, we can see that there is some multicollinearity in our data, but
it is not severe enough to warrant further corrective measures.

## Comparing Regression Models to Reveal Multicollinearity Effects

In [11]:
est = sm.OLS(y, X)
est = est.fit()
print(est.summary())

                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     27.95
Date:                Mon, 03 Jul 2023   Prob (F-statistic):           6.24e-15
Time:                        15:21:04   Log-Likelihood:                 116.01
No. Observations:                  92   AIC:                            -222.0
Df Residuals:                      87   BIC:                            -209.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Fat_percenta

In [12]:
c_est = sm.OLS(y, X_c)
c_est = c_est.fit()
print(c_est.summary())

                            OLS Regression Results                            
Dep. Variable:           Femoral_Neck   R-squared:                       0.520
Model:                            OLS   Adj. R-squared:                  0.504
Method:                 Least Squares   F-statistic:                     31.79
Date:                Mon, 03 Jul 2023   Prob (F-statistic):           5.14e-14
Time:                        15:21:04   Log-Likelihood:                 111.77
No. Observations:                  92   AIC:                            -215.5
Df Residuals:                      88   BIC:                            -205.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Fat_percentage    -0.0049      0.002     -2.

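Now compare the summary tables for the two models. Because the second model drops the interaction term rather than re-expressing it, these are two different models, and their fit statistics differ: R-squared falls from 0.562 to 0.520 and adjusted R-squared from 0.542 to 0.504. That difference reflects the missing term, not the multicollinearity. The goodness-of-fit statistics would be identical only if the structural multicollinearity were removed by a reparameterization that keeps the interaction, such as centering the predictors before multiplying them. As mentioned earlier, multicollinearity by itself doesn't affect the predictions or goodness-of-fit. If you just want to make predictions, the model with severe multicollinearity is still usable.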
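Centering is a common way to reduce structural multicollinearity without dropping the interaction term. The `_S` columns in the CSV appear to be centered versions of the raw variables, since each differs from its raw column by a constant. The sketch below uses synthetic data rather than the bone density file, and the coefficients and the helper `fit` are assumptions for illustration. It shows that the raw and centered models give the same predictions and R-squared, while the correlation between a main effect and the interaction term drops.

```python
import numpy as np

def fit(design, y):
    """Return (R-squared, fitted values) for an OLS fit of y on `design`."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    resid = y - fitted
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return r2, fitted

rng = np.random.default_rng(0)
n = 200
fat = rng.uniform(15, 40, size=n)      # hypothetical body fat percentage
weight = rng.uniform(45, 100, size=n)  # hypothetical weight in kg
y = (0.5 + 0.01 * fat + 0.005 * weight - 0.0002 * fat * weight
     + rng.normal(scale=0.05, size=n))

ones = np.ones(n)
raw = np.column_stack([ones, fat, weight, fat * weight])

# Center each predictor, then build the interaction from the centered values
fat_c = fat - fat.mean()
weight_c = weight - weight.mean()
centered = np.column_stack([ones, fat_c, weight_c, fat_c * weight_c])

r2_raw, pred_raw = fit(raw, y)
r2_centered, pred_centered = fit(centered, y)

corr_raw = np.corrcoef(fat, fat * weight)[0, 1]
corr_centered = np.corrcoef(fat_c, fat_c * weight_c)[0, 1]

print(f"R-squared: raw={r2_raw:.4f}, centered={r2_centered:.4f}")
print(f"max prediction difference: {np.max(np.abs(pred_raw - pred_centered)):.2e}")
print(f"corr(fat, interaction): raw={corr_raw:.3f}, centered={corr_centered:.3f}")
```

The two designs span the same column space, which is why the fit is unchanged. Only the interpretation and stability of the individual coefficients differ.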

## How to Deal with Multicollinearity

What if you have severe multicollinearity in your data and you find that you must deal with it?
The answer isn't always clear but potential solutions include the following:
- Remove some of the highly correlated independent variables (this should not be your first option).
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables (e.g. PCA or partial least squares regression).
- Use LASSO or ridge regression, which are regularized forms of regression analysis that can handle multicollinearity.
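As an illustration of the last option, here is a minimal sketch of ridge regression using its closed-form solution, $(X^\top X + \alpha I)^{-1} X^\top y$, on standardized predictors. The data are synthetic, and the function name `ridge_fit` and the penalty value `alpha=10.0` are assumptions chosen for illustration. In practice the penalty is usually tuned, for example by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on standardized predictors (alpha=0 gives OLS)."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # put predictors on a common scale
    y_c = y - y.mean()                            # centering y removes the intercept
    k = X_std.shape[1]
    return np.linalg.solve(X_std.T @ X_std + alpha * np.eye(k), X_std.T @ y_c)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # almost identical to x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

beta_ols = ridge_fit(X, y, alpha=0.0)
beta_ridge = ridge_fit(X, y, alpha=10.0)

print("OLS coefficients:  ", beta_ols.round(3))
print("Ridge coefficients:", beta_ridge.round(3))
```

The penalty shrinks the poorly determined direction (the difference between the two near-duplicate coefficients) much more than the well-determined one (their sum), so the ridge estimates for `x1` and `x2` end up closer to each other. The price is that ridge coefficients are biased and the usual OLS p-values no longer apply.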

Remember that all of these have downsides. If you can accept less precise coefficients, or a regression model with 
high R-squared but hardly any statistically significant variables, then not doing anything about the multicollinearity might 
be the best solution.