# Linear Regression Coefficients Variability

This notebook focuses on understanding the information provided by the coefficients of a linear regression model, and on calculating their variability.


In [None]:
# Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import statsmodels.api as sm

## Load Data

In [None]:
# Load the California housing data from scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Drop the geographic coordinates
X = X.drop(columns=["Latitude", "Longitude"])

# display top 5 rows
X.head()

## Train 500 models

In [None]:
# Train 500 models on different random train/test splits of X

s = dict()
linreg = None

# Each integer from 1 to 500 is used as the random seed of one split
for i in range(1, 501):

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=int(i))

    # Train model
    linreg = LinearRegression().fit(X_train, y_train)

    # Store coefficients
    s[str(int(i))] = pd.Series(linreg.coef_)

In [None]:
# Put coefficients in a dataframe

df = pd.concat(s, axis=1)
df.index = linreg.feature_names_in_
df = df.T

df.head()

In [None]:
# Display variability of coefficients

df.hist(bins=30, figsize=(10,12))
plt.show()

In [None]:
# Summarize variability of coefficients

coeff_summary = df.agg(['mean', 'std'])
coeff_summary
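
The standard deviation across resampled models is an empirical measure of coefficient variability. For comparison, ordinary least squares also has a closed-form standard error for each coefficient, the square root of the diagonal of sigma^2 (A'A)^-1. The block below is a minimal sketch of that formula using only NumPy. It runs on synthetic data, and the names `X_demo`, `y_demo` and `true_beta` are hypothetical and not part of this notebook. Because the 500 splits above are drawn from the same dataset and overlap heavily, the two measures are not expected to match exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption: stands in for a real training set)
n, p = 1000, 3
X_demo = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = 1.0 + X_demo @ true_beta + rng.normal(scale=1.0, size=n)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(n), X_demo])
beta_hat, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Residual variance, using n - (p + 1) degrees of freedom
residuals = y_demo - A @ beta_hat
sigma2 = residuals @ residuals / (n - A.shape[1])

# Standard errors: sqrt of the diagonal of sigma^2 * (A'A)^-1
cov_beta = sigma2 * np.linalg.inv(A.T @ A)
std_err = np.sqrt(np.diag(cov_beta))

# Index 0 is the intercept; the remaining entries are the slopes
print("coefficients:", beta_hat[1:].round(3))
print("std errors:  ", std_err[1:].round(3))
```

A coefficient whose standard error is large relative to its value is poorly determined by the data, which is the same reading as a wide histogram in the resampling approach.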

## Coefficient direction (sign)

In [None]:
# Mean coefficient per feature, with the standard deviation as error bars
mean_coef = coeff_summary.loc['mean']

mean_coef.plot.bar(yerr=coeff_summary.loc['std'])
plt.ylabel('Beta')
plt.title('Coefficient value (slope)')
plt.show()

We can see that the variability differs from one coefficient to another. AveOccup and AveBedrms seem to be the most variable; in other words, they have the largest estimation error.

We can also see that the coefficients of AveRooms and AveBedrms have opposite signs. Intuitively, we would expect them to have the same sign. We would also expect those two variables to be highly correlated.
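
Whether two predictors are highly correlated can be checked directly with a Pearson correlation. The block below is a hedged sketch on synthetic data: `rooms` and `bedrooms` are hypothetical stand-ins, not the notebook's columns, and the variance inflation factor is the two-predictor form 1 / (1 - r^2). On the notebook's own data, the equivalent check would be `X[["AveRooms", "AveBedrms"]].corr()`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic predictors (assumption: bedrooms is roughly a fixed share of rooms)
n = 2000
rooms = rng.normal(loc=5.0, scale=1.0, size=n)
bedrooms = 0.2 * rooms + rng.normal(loc=0.0, scale=0.05, size=n)

# Pearson correlation between the two predictors
corr = np.corrcoef(rooms, bedrooms)[0, 1]

# Variance inflation factor for a model with exactly two predictors
vif = 1.0 / (1.0 - corr**2)

print(f"correlation: {corr:.3f}")
print(f"VIF:         {vif:.1f}")
```

A high correlation inflates the variance of both coefficients, so their estimates can be unstable and may even take opposite signs although the variables move together.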