### What is Multicollinearity?

Multicollinearity occurs when two or more independent variables (predictors) in a regression model are highly correlated with each other, or more generally when one predictor can be linearly predicted from the others with a high degree of accuracy. Multicollinearity causes several issues in regression analysis, particularly when interpreting the individual coefficients.

### Problems Caused by High Multicollinearity:
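- Unstable Coefficients: small changes in the data can produce large changes in the estimated coefficients.
- Inflated Standard Errors: the coefficient estimates become less precise, which widens confidence intervals and lowers t-statistics.
- Difficulty in Determining the Effect of Each Predictor: when predictors move together, the model cannot cleanly separate their individual contributions.
- Misleading Interpretation: a coefficient may have an unexpected sign or appear insignificant even though the predictors are jointly informative.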
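
The inflated standard errors listed above can be illustrated with a small simulation. The following is a minimal sketch using only NumPy: the helper `ols_with_standard_errors`, the sample size, and the noise scale of 0.05 are all assumptions chosen for illustration, not part of the original analysis. It fits the same target twice, once with an independent second predictor and once with a second predictor that is almost a copy of the first.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Ordinary least squares; X must already contain the intercept column.

    Returns the coefficient estimates and their classical standard errors.
    """
    n_obs, n_params = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n_obs - n_params)  # residual variance
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)           # covariance of the estimates
    return beta, np.sqrt(np.diag(cov_beta))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)                 # unrelated to x1
x2_collinear = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1

# The true relationship is the same in both fits: y depends on x1 only.
y = 1 + 2 * x1 + rng.normal(size=n)

X_independent = np.column_stack([np.ones(n), x1, x2_independent])
X_collinear = np.column_stack([np.ones(n), x1, x2_collinear])

beta_independent, se_independent = ols_with_standard_errors(X_independent, y)
beta_collinear, se_collinear = ols_with_standard_errors(X_collinear, y)

print(f"x1 coefficient, independent x2: {beta_independent[1]:.2f} (SE {se_independent[1]:.2f})")
print(f"x1 coefficient, collinear x2:   {beta_collinear[1]:.2f} (SE {se_collinear[1]:.2f})")
```

In the collinear fit the standard error of the `x1` coefficient should come out many times larger, even though the data-generating process is unchanged.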

### How VIF Works:

- The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity.
- The formula for VIF is:

  $$
  \text{VIF}(X_i) = \frac{1}{1 - R^2_i}
  $$

  where:
  - $\text{VIF}(X_i)$ is the Variance Inflation Factor for predictor $X_i$.
  - $R^2_i$ is the R-squared value obtained when $X_i$ is regressed on all other predictors.

- A VIF value of 1 indicates no multicollinearity: $X_i$ is uncorrelated with the other predictors ($R^2_i = 0$).
- As a common rule of thumb, a VIF value between 1 and 5 suggests moderate multicollinearity.
- A VIF value greater than 5 ($R^2_i > 0.8$) indicates high multicollinearity, and a value greater than 10 ($R^2_i > 0.9$) is often considered very problematic.
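
To make the formula concrete, the sketch below computes the VIF directly from its definition: regress one predictor on all the others (with an intercept), take the resulting $R^2_i$, and apply $1 / (1 - R^2_i)$. The function name `vif_from_r_squared` and the synthetic columns `a`, `b`, `c`, `d` are assumptions made for this illustration; it uses plain NumPy rather than any particular library's VIF routine.

```python
import numpy as np

def vif_from_r_squared(X, i):
    """VIF of column i of X, computed as 1 / (1 - R^2_i).

    R^2_i comes from regressing column i on all other columns plus an intercept.
    """
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = residuals @ residuals
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(scale=0.3, size=n)  # nearly a linear combination of a and b
d = rng.normal(size=n)                     # unrelated to the other columns

X = np.column_stack([a, b, c, d])
vifs = [vif_from_r_squared(X, i) for i in range(X.shape[1])]

for name, vif in zip(["a", "b", "c", "d"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

Here `a`, `b`, and `c` should all land above the threshold of 5, while `d` should stay close to 1, because `c` can be predicted almost exactly from `a` and `b`, and each of `a` and `b` can be recovered from `c` and the other one.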


import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# Step 1: Generate random features
n = 100  # Number of samples
feature_1 = np.random.normal(0, 1, n)
feature_2 = np.random.normal(0, 1, n)
feature_3 = np.random.normal(0, 1, n)

# Step 2: Create a correlated feature (feature_4)
feature_4 = 2 * feature_1 + np.random.normal(0, 0.2, n) + feature_3  # Highly correlated with feature_1 and feature_3

# Create a DataFrame
df = pd.DataFrame({
    'feature_1': feature_1,
    'feature_2': feature_2,
    'feature_3': feature_3,
    'feature_4': feature_4  # Correlated with feature_1 and feature_3
})

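# Step 3: Add an intercept, fit a linear regression model, and calculate VIF
X = sm.add_constant(df)  # Add intercept so each VIF is based on a centered R^2
model = sm.OLS(np.random.normal(0, 1, n), X).fit()  # Placeholder fit on a random target; VIF depends only on X, not on the target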

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

     feature         VIF
0      const    1.030504
1  feature_1  106.201414
2  feature_2    1.020843
3  feature_3   40.291827
4  feature_4  170.058278


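> So, there is high multicollinearity among feature_1, feature_3 and feature_4: their VIFs are far above 10, because feature_4 was constructed as a near-linear combination of feature_1 and feature_3. feature_2, with a VIF close to 1, is unaffected.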
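
One way to check this conclusion is to look at the pairwise correlations and then recompute the VIFs without `feature_4`. The sketch below regenerates the same data with the same seed and draw order, and, as an assumption made to keep it dependency-free, computes VIFs from the diagonal of the inverse correlation matrix (a standard identity equivalent to $1 / (1 - R^2_i)$) instead of calling `variance_inflation_factor`. The helper name `vif_from_correlation` is hypothetical.

```python
import numpy as np

# Regenerate the data with the same seed and the same order of random draws
np.random.seed(42)
n = 100
feature_1 = np.random.normal(0, 1, n)
feature_2 = np.random.normal(0, 1, n)
feature_3 = np.random.normal(0, 1, n)
feature_4 = 2 * feature_1 + np.random.normal(0, 0.2, n) + feature_3

names = ["feature_1", "feature_2", "feature_3", "feature_4"]
data = np.column_stack([feature_1, feature_2, feature_3, feature_4])

def vif_from_correlation(columns):
    """VIFs as the diagonal of the inverse correlation matrix of the columns."""
    corr = np.corrcoef(columns, rowvar=False)
    return np.diag(np.linalg.inv(corr))

corr_full = np.corrcoef(data, rowvar=False)
vif_full = vif_from_correlation(data)
vif_reduced = vif_from_correlation(data[:, :3])  # feature_4 dropped

print("Pairwise correlations:")
print(np.round(corr_full, 2))

print("\nVIF with all four features:")
for name, vif in zip(names, vif_full):
    print(f"  {name}: {vif:.2f}")

print("\nVIF after dropping feature_4:")
for name, vif in zip(names[:3], vif_reduced):
    print(f"  {name}: {vif:.2f}")
```

Because feature_1 and feature_3 are generated independently, their pairwise correlation should be small, and their VIFs should fall back toward 1 once feature_4 is removed. This suggests the inflation comes from feature_4 being a near-linear combination of the two, which a pairwise correlation check alone would only partly reveal.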