# Multicollinearity

In this notebook, we will examine the impact of collinearity on the coefficients of a linear regression model.

In [None]:
# Packages

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

# the dataset for the demo
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import f_regression, r_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## Load data

In [None]:
# load the California House price data from Scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns = ["Latitude", "Longitude"])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

# Scale data
scaler = MinMaxScaler().set_output(transform="pandas").fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
print()

print('First 5 rows in X_train:')
X_train.head()

## Correlation

Let's determine the correlation among the predictors.

In [None]:
# We calculate the correlations using pandas corr()
# and we round the values to 2 decimals.
correlation_matrix = X_train.corr().round(2)

# Plot the correlation matrix using seaborn.
# We use annot = True to print the correlation values
# inside the squares.

figure = plt.figure(figsize=(8, 8))
sns.heatmap(data=correlation_matrix, annot=True);

We see that `AveRooms` and `AveBdrms` are highly correlated, as expected.
There is also some correlation between `AveRooms` and `MedInc`, and between `HouseAge` and `Population`.
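Reading pairs off a heatmap gets harder as the number of predictors grows. As a minimal sketch, the helper below lists every pair of columns whose absolute correlation exceeds a cutoff. The function name `correlated_pairs`, the 0.7 threshold, and the synthetic data frame are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons with NaN are False, so masked entries never pass the filter.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in data (hypothetical, not the California housing set).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                 # independent of "a" and "b"
})
print(correlated_pairs(demo))
```

In the notebook, the same helper could be called as `correlated_pairs(X_train)`; the threshold is a judgment call and should be tuned to the problem.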

## Train a linear regression model with StatsModels

In [None]:
# The model needs an intercept so we add a column of 1s:

X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

In [None]:
# Fit model

linreg = sm.OLS(y_train, X_train)
results = linreg.fit()
print(results.summary())

## Coefficient direction (sign)

In [None]:
# Coefficient values, with their standard errors as error bars

pd.Series(results.params, index=X_train.columns).plot.bar(
    yerr=pd.Series(results.bse))

plt.ylabel("Coefficient value")
plt.title("Coefficients")
plt.show()

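Even though `AveRooms` and `AveBdrms` are highly correlated, their coefficients point in opposite directions. This is a symptom of collinearity: when two predictors carry nearly the same information, the model cannot cleanly separate their individual contributions to the target, so the coefficient estimates become unstable and can take opposite signs that partly offset each other.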
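One common way to quantify this effect is the variance inflation factor (VIF), which measures how much the variance of a coefficient grows because its predictor can be explained by the other predictors. The sketch below computes it with NumPy on synthetic data, using the identity that the VIFs are the diagonal of the inverse correlation matrix. The variable names and the simulated data are assumptions for illustration only; they are not the notebook's results.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (rows are samples)."""
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    # on the remaining columns. This equals the j-th diagonal entry of the
    # inverse of the predictors' correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic stand-in data (hypothetical): two tightly linked predictors
# and one unrelated predictor.
rng = np.random.default_rng(0)
n = 1000
rooms = rng.normal(size=n)
bedrooms = rooms + rng.normal(scale=0.2, size=n)  # strongly tied to rooms
income = rng.normal(size=n)                       # unrelated predictor
X = np.column_stack([rooms, bedrooms, income])

vif_values = vif(X)
for name, value in zip(["rooms", "bedrooms", "income"], vif_values):
    print(f"{name}: {value:.1f}")
```

A VIF close to 1 indicates little collinearity, while values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a rule of thumb. StatsModels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which could be applied to the columns of `X_train` instead.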