# Linear Regression Coefficients Variability

This notebook focuses on understanding the information provided by the coefficients of a linear regression model, and on estimating their variability.


In [None]:
# Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import MinMaxScaler

import statsmodels.api as sm

## Load Data

In [None]:
# load the California housing data from scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

# scale the variables
X = MinMaxScaler().set_output(transform="pandas").fit_transform(X)

# display top 5 rows
X.head()

## Train 500 models

In [None]:
# Train 500 models on different train/test partitions of X

s = dict()
linreg = None
X_train, X_test, y_train, y_test = None, None, None, None

for i in range(1, 501):

    # Split data, using the iteration number as the random seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=i)

    # Train model
    linreg = LinearRegression().fit(X_train, y_train)

    # Store coefficients, keyed by the seed
    s[str(i)] = pd.Series(linreg.coef_)

In [None]:
# Put coefficients in a dataframe

df = pd.concat(s, axis=1)
df.index = linreg.feature_names_in_
df = df.T

df.head()

In [None]:
# Display variability of coefficients

df.hist(bins=30, figsize=(10,12))
plt.show()

In [None]:
# Summarize variability of coefficients

coeff_summary = df.agg(['mean', 'std'])
coeff_summary

## Coefficient direction (sign)

In [None]:
s = pd.Series(
    coeff_summary.loc['mean'],
    index=linreg.feature_names_in_
)

s.plot.bar(yerr=coeff_summary.loc['std'])
plt.ylabel('Beta')
plt.title('Coefficient value (slope)')
plt.show()

We can see that the coefficients do not all vary by the same amount. AveOccup seems to be the most variable. In other words, it has the biggest estimation error.
We can also see that the coefficients of AveRooms and AveBedrms have opposite signs. Intuitively, we'd expect them to have the same sign, and we'd also expect those two variables to be highly correlated.
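
The link between correlated features and unstable coefficients can be illustrated with a small sketch. It does not use the California data: the features `x1`, `x2`, `x3` and the target `y_syn` are hypothetical, synthetic values, with `x2` built as a noisy copy of `x1`. The model is refit on random 70% subsamples and the spread of each coefficient is compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical synthetic features: x2 is almost a copy of x1 (highly correlated)
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)  # independent feature, for comparison
X_syn = np.column_stack([x1, x2, x3])
y_syn = 1.0 * x1 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)

# Pearson correlation between the two collinear features
corr_12 = np.corrcoef(x1, x2)[0, 1]

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X_syn])

# Refit ordinary least squares on 200 random 70% subsamples
coefs = []
for _ in range(200):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)
    beta, *_ = np.linalg.lstsq(A[idx], y_syn[idx], rcond=None)
    coefs.append(beta[1:])  # drop the intercept
coefs = np.asarray(coefs)

# Spread of each coefficient across the refits
coef_std = coefs.std(axis=0)
print(f"corr(x1, x2) = {corr_12:.3f}")
print("std of coefficients (x1, x2, x3):", coef_std.round(3))
```

In this sketch the two correlated features should show a much larger coefficient spread than the independent one. In the notebook itself, the correlation between the two variables could be checked with `X[["AveRooms", "AveBedrms"]].corr()`.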

## Compare Absolute Values of Coefficients - Feature Importance

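The absolute value of a coefficient is commonly used as a measure of feature importance, because it reflects how much each feature contributes to the predicted target. The comparison is meaningful here because all features were scaled to the same range.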

In [None]:
# Plot mean coefficient and std

s = pd.Series(
    np.abs(coeff_summary.loc['mean']),
    index=linreg.feature_names_in_
)

s.plot.bar(yerr=coeff_summary.loc['std'])
plt.ylabel('Absolute coefficient values')
plt.title('Absolute coefficient values')
plt.show()

From the previous plot, we'd expect AveRooms and AveBedrms to be the variables with the highest importance. However, the coefficients of those variables also show more variability, that is, a bigger error. Consequently, we can trust them less.

## t (coefficient relative to its error)

In [None]:
# Estimate and plot t: absolute mean coefficient divided by its standard deviation

s = pd.Series(
    np.abs(coeff_summary.loc['mean'])/coeff_summary.loc['std'],
    index=linreg.feature_names_in_
)

s.plot.bar()
plt.ylabel('t')
plt.title('t')
plt.show()

After dividing each coefficient by its error, MedInc appears to be the most robust predictor of house price.
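
The ratio above uses the spread of the coefficients across resampled fits as the error. For comparison, the classical t statistic of ordinary least squares divides each coefficient by an analytical standard error obtained from a single fit. The sketch below computes that version with NumPy on hypothetical synthetic data (`X_syn`, `y_syn` are not from the notebook); the two approaches need not give identical numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical synthetic data: feature 0 has a real effect, feature 1 has none
X_syn = rng.normal(size=(n, 2))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(size=n)

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X_syn])

# Least-squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

# Residual variance, using n - p degrees of freedom
resid = y_syn - A @ beta
dof = n - A.shape[1]
sigma2 = resid @ resid / dof

# Standard errors: square root of the diagonal of sigma^2 * (A'A)^-1
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))

# t statistic: coefficient divided by its standard error
t_values = beta / se
print("coefficients:   ", beta.round(3))
print("standard errors:", se.round(3))
print("t:              ", t_values.round(2))
```

The fitted results of a statsmodels `OLS` model expose the same quantities as `bse` and `tvalues`, which may be a more convenient route in practice.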

## Cross Validation

We probably don't need to train 500 models to obtain the coefficient errors. Instead, we can estimate them using cross-validation.

In [None]:
# We use cross validation to estimate the coefficient variability

results = cross_validate(
    estimator=linreg,
    X=X_train,
    y=y_train,
    scoring='r2',
    cv=5,
    return_train_score=True,
    return_estimator=True,
)

pd.DataFrame(results)

In [None]:
# R2 in train set

print("R2 in train set: mean, std:")
print(f"{np.mean(results['train_score'])}, {np.std(results['train_score'])}")

In [None]:
# R2 in the held-out folds (reported by cross_validate as test_score)

print("R2 in test set: mean, std:")
print(f"{np.mean(results['test_score'])}, {np.std(results['test_score'])}")
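
Because `return_estimator=True` keeps the model fitted on each fold, the coefficient variability can be read from those models. The following is a minimal sketch of that step, run on hypothetical synthetic data (`X_syn`, `y_syn`, and the column names `a`, `b`, `c` are stand-ins for the notebook's `X_train` and `y_train`).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(2)

# Hypothetical synthetic stand-in for X_train and y_train
X_syn = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y_syn = 3.0 * X_syn["a"] - 1.0 * X_syn["b"] + rng.normal(size=300)

cv_results = cross_validate(
    estimator=LinearRegression(),
    X=X_syn,
    y=y_syn,
    scoring="r2",
    cv=5,
    return_estimator=True,
)

# One row of coefficients per fold, one column per feature
cv_coefs = pd.DataFrame(
    [est.coef_ for est in cv_results["estimator"]],
    columns=X_syn.columns,
)

# Mean and standard deviation of each coefficient across the folds
cv_summary = cv_coefs.agg(["mean", "std"])
print(cv_summary)
```

With only five folds, the standard deviation is a rough estimate, and the training portions of the folds overlap, so it should be read as an indication rather than a precise error.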