# Multicollinearity

In this notebook, we will examine the impact of collinearity on the coefficients of a linear regression model.

In [None]:
# Packages

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

# the dataset for the demo
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import f_regression, r_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## Load data

In [None]:
# load the California House price data from Scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns = ["Latitude", "Longitude"])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

# Scale data
scaler = MinMaxScaler().set_output(transform="pandas").fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
print()

print('First 5 rows in X_train:')
X_train.head()

## Correlation

Let's determine the correlation among the predictors.

In [None]:
# We calculate the correlations using pandas corr()
# and we round the values to 2 decimals.
correlation_matrix = X_train.corr().round(2)

# Plot the correlation matrix using seaborn.
# We use annot = True to print the correlation values
# inside the squares.

figure = plt.figure(figsize=(8, 8))
sns.heatmap(data=correlation_matrix, annot=True);

We see that `AveRooms` and `AveBedrms`, as expected, are highly correlated.
There is also some correlation between `AveRooms` and `MedInc`, and between `HouseAge` and `Population`.
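Pairwise correlations only look at two predictors at a time. A common complementary diagnostic is the variance inflation factor (VIF), which regresses each predictor on all the others. The sketch below is not part of the original analysis: it uses synthetic stand-ins for the housing variables (the names `rooms`, `bedrooms` and `house_age` and the helper `vif` are hypothetical), and NumPy only.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R2_j), where R2_j is the R-squared obtained by
    regressing column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic data (an assumption for illustration, NOT the housing data)
rng = np.random.default_rng(0)
n = 1000
rooms = rng.normal(size=n)
bedrooms = rooms + 0.3 * rng.normal(size=n)  # strongly tied to rooms
house_age = rng.normal(size=n)               # unrelated to the others

vifs = vif(np.column_stack([rooms, bedrooms, house_age]))
for name, value in zip(["rooms", "bedrooms", "house_age"], vifs):
    print(f"{name}: {float(value):.2f}")
```

In this toy setup, the two tied predictors get a large VIF and the independent one stays close to 1. On the real data, the same function could be applied to `X_train` before adding the constant.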

## Train a linear regression model with StatsModels

In [None]:
# The model needs an intercept so we add a column of 1s:

X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

In [None]:
# Fit model

linreg = sm.OLS(y_train, X_train)
results = linreg.fit()
print(results.summary())

## Coefficients direction (sign)

In [None]:
# Coefficients value

pd.Series(results.params, index=X_train.columns).plot.bar(
    yerr=pd.Series(results.bse))

plt.ylabel("Coefficients' value")
plt.title("Coefficients")
plt.show()

Even though `AveRooms` and `AveBedrms` are highly correlated, their coefficients have opposite signs. This is a consequence of the correlation: the model cannot cleanly separate the individual contribution of each variable.
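To see how collinearity alone can produce opposite signs, we can run a small simulation. This is a sketch on synthetic data, under the assumption that both predictors have the same positive true effect (+1) and that one is almost a copy of the other. None of the numbers come from the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_repeats = 50, 500
opposite_signs = 0
coef_sums = []

for _ in range(n_repeats):
    x1 = rng.normal(size=n_obs)
    x2 = x1 + 0.05 * rng.normal(size=n_obs)  # nearly a copy of x1
    # Both true effects are +1 (assumption of this toy example)
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n_obs)

    # Ordinary least squares with an intercept
    design = np.column_stack([np.ones(n_obs), x1, x2])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    b1, b2 = beta[1], beta[2]

    opposite_signs += int(b1 * b2 < 0)
    coef_sums.append(b1 + b2)

frac_opposite = opposite_signs / n_repeats
mean_sum = float(np.mean(coef_sums))
print(f"fits with opposite signs: {frac_opposite:.0%}")
print(f"mean of b1 + b2: {mean_sum:.2f}")
```

The sum of the two coefficients is estimated well, but how that sum is split between the two predictors is very unstable, so a large share of the fits assign them opposite signs.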

## Scatter plots

Let's explore the true relationship between `AveRooms` and `AveBedrms` and the target.

In [None]:
plt.figure(figsize=(12, 4), dpi=80)

plt.subplot(1, 3, 1)
plt.scatter(X_train['AveRooms'], y_train, marker="o")
plt.xlabel('Number of rooms')
plt.ylabel('House price')

plt.subplot(1, 3, 2)
plt.plot(X_train['AveBedrms'], y_train, marker="o", ls="")
plt.xlabel('Number of bedrooms')
plt.ylabel('House price')

plt.subplot(1, 3, 3)
plt.plot(X_train['AveBedrms'], X_train['AveRooms'], marker="o", ls="")
plt.xlabel('Number of bedrooms')
plt.ylabel('Number of rooms')

plt.show()

It is hard to see a clear association between either `AveRooms` or `AveBedrms` and the target. If anything, by eye the two relationships look very similar.

## Correlation with the target

Let's then calculate the correlation coefficient and its significance.

In [None]:
# Pearson's correlation coefficient

r_regression(X_train[['AveRooms', 'AveBedrms']], y_train)

In [None]:
# Significance (p-values)

f_regression(X_train[['AveRooms', 'AveBedrms']], y_train)[1]

There is a positive and significant association between the number of rooms and the house price.
There is almost no association between the number of bedrooms and the house price.
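The correlation coefficient and its significance are two views of the same univariate test. As far as we understand scikit-learn's implementation, `f_regression` derives the F statistic from Pearson's r as `r**2 / (1 - r**2) * (n - 2)` when the data are centered. The sketch below checks that relationship on synthetic data with NumPy only; it does not compute the p-value, which would come from an F distribution with (1, n - 2) degrees of freedom.

```python
import numpy as np

# Synthetic data (assumption): y depends weakly on x
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Pearson's correlation and the F statistic derived from it
r = np.corrcoef(x, y)[0, 1]
dof = n - 2
f_from_r = r**2 / (1 - r**2) * dof

# Same test via simple linear regression:
# the squared t statistic of the slope equals that F statistic
xc, yc = x - x.mean(), y - y.mean()
slope = (xc @ yc) / (xc @ xc)
resid = yc - slope * xc
se_slope = np.sqrt((resid @ resid) / dof / (xc @ xc))
t_squared = (slope / se_slope) ** 2

print(f"r = {r:.3f}")
print(f"F from r = {f_from_r:.3f}, t squared = {t_squared:.3f}")
```

Because the test considers one predictor at a time, it says nothing about how that predictor behaves once the others are in the model.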

## Remove correlated variable

Let's remove one of the correlated variables and re-train the model.

In [None]:
X_train = X_train.drop(columns=['AveBedrms'])

In [None]:
# Train a new model, without a highly correlated feature - AveBedrms

linreg = sm.OLS(y_train, X_train)
results = linreg.fit()
print(results.summary())

In [None]:
# Visualize the coefficient values

pd.Series(results.params, index=X_train.columns).plot.bar(
    yerr=pd.Series(results.bse))

plt.ylabel("Coefficients' value")
plt.title("Coefficients")
plt.show()

We still see that the number of rooms (`AveRooms`) contributes negatively to the house price in the model, even though it is positively correlated with it.

This is probably because it is correlated with `MedInc`, and there is also some degree of correlation among the other variables.

## Correlation between predictors and target

Let's examine each variable individually.

In [None]:
plt.figure(figsize=(20, 4), dpi=80)

plt.subplot(1, 5, 1)
plt.scatter(X_train['MedInc'], y_train, marker="o")
plt.xlabel('Median income')
plt.ylabel('House price')

plt.subplot(1, 5, 2)
plt.plot(X_train['HouseAge'], y_train, marker="o", ls="")
plt.xlabel('House age')
plt.ylabel('House price')

plt.subplot(1, 5, 3)
plt.scatter(X_train['AveRooms'], y_train, marker="o")
plt.xlabel('Number of rooms')
plt.ylabel('House price')

plt.subplot(1, 5, 4)
plt.plot(X_train['Population'], y_train, marker="o", ls="")
plt.xlabel('Population')
plt.ylabel('House price')

plt.subplot(1, 5, 5)
plt.plot(X_train['AveOccup'], y_train, marker="o", ls="")
plt.xlabel('Occupancy')
plt.ylabel('House price');

In [None]:
# Coefficients

coeffs = r_regression(X_train.drop(columns=["const"]), y_train)
coeffs

In [None]:
# p values

p_values = f_regression(X_train.drop(columns=["const"]), y_train)[1]
p_values

In [None]:
coeff_df = pd.DataFrame(
    {"corr_coef": coeffs, "p_values": p_values}, index=X_train.columns[1:])

coeff_df

The contribution of `Population` and `AveOccup` to the house price seems to be negligible. If anything, they have a negative association with it.

However, in the model they show a positive contribution towards the target variable.

In [None]:
# Let's remove those variables

X_train = X_train.drop(columns=['Population', "AveOccup"])

linreg = sm.OLS(y_train, X_train)
results = linreg.fit()
print(results.summary())

We see no noticeable decrease in R-squared after removing those variables.
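The comparison of R-squared between a full and a reduced model can be reproduced on a toy example. The sketch below assumes synthetic data in which only `income` carries signal and the two extra columns are pure noise; the helper `r_squared` and all variable names are hypothetical.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


# Synthetic data (assumption): only income drives the price
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
price = 2.0 * income + rng.normal(size=n)

r2_full = r_squared(np.column_stack([income, noise_a, noise_b]), price)
r2_reduced = r_squared(income, price)

print(f"R2 full model:    {r2_full:.4f}")
print(f"R2 reduced model: {r2_reduced:.4f}")
```

Removing predictors can never increase the training R-squared of an OLS model, so what matters is how small the drop is. Here it is negligible because the removed columns carry no information about the target.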

In [None]:
# Coefficients value

pd.Series(results.params, index=X_train.columns).plot.bar(
    yerr=pd.Series(results.bse))

plt.ylabel("Coefficients' value")
plt.title("Coefficients")
plt.show()

In [None]:
plt.scatter(X_train['MedInc'], X_train["AveRooms"], marker="o")
plt.xlabel('Median income')
plt.ylabel('AveRooms');

What seems to be happening is that higher `MedInc` values, and hence higher house prices, do not come with a higher number of rooms. As a result, after accounting for the median income, the number of rooms seems to pull the price down.
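"After accounting for the median income" has a precise meaning in OLS: the coefficient of a predictor in a multiple regression equals the slope obtained after removing from that predictor everything the other predictors explain (the Frisch-Waugh-Lovell result). The sketch below illustrates it on synthetic data in which, by assumption, `rooms` is positively correlated with `price` overall but has a negative effect once `income` is held fixed. The names and coefficients are made up for illustration.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients of y on X, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


# Synthetic data (assumption, not the housing data)
rng = np.random.default_rng(0)
n = 5000
income = rng.normal(size=n)
rooms = 0.6 * income + 0.8 * rng.normal(size=n)
price = 2.0 * income - 0.5 * rooms + rng.normal(size=n)

# Marginal view: plain correlation between rooms and price
marginal_corr = np.corrcoef(rooms, price)[0, 1]

# Joint view: coefficient of rooms with income in the model
rooms_coef = ols(np.column_stack([income, rooms]), price)[2]

# Same coefficient from the part of rooms not explained by income
b = ols(income, rooms)
rooms_resid = rooms - (b[0] + b[1] * income)
rooms_coef_fwl = ols(rooms_resid, price)[1]

print(f"marginal correlation: {marginal_corr:.3f}")
print(f"coefficient in joint model: {rooms_coef:.3f}")
print(f"coefficient from residualized rooms: {rooms_coef_fwl:.3f}")
```

The marginal correlation is positive while the model coefficient is negative, and the last two numbers coincide. This is the same pattern we observe for `AveRooms` and `MedInc`, although the toy data do not prove that the same mechanism is at work in the housing data.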

In [None]:
plt.figure(figsize=(10, 8), dpi=80)

sns.scatterplot(
    x=X_train['MedInc'],
    y=y_train,
    hue=pd.qcut(X_train["AveRooms"], 5),
    );

In [None]:
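# jointplot creates its own figure, so we set the size with `height`
# instead of calling plt.figure() first (which would leave an empty figure).
sns.jointplot(
    x=X_train['MedInc'],
    y=y_train,
    hue=pd.qcut(X_train["AveRooms"], 3),
    height=8,
);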