# Linear Regression Coefficients Variability
This notebook evaluates the variability of the coefficients of the linear regression models provided by statsmodels.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

# the dataset for the demo
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## Load Data

In [None]:
# Load the California housing data from scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

# Scale data
scaler = MinMaxScaler().set_output(transform="pandas").fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
print()

print('First 5 rows of X_train:')
X_train.head()

## Train a Linear Regression Model

In [None]:
# Our model needs an intercept so we add a column of 1s:

X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

In [None]:
# Fit model

linreg = sm.OLS(y_train, X_train)
results = linreg.fit()
print(results.summary())

## Coefficients direction (sign)

In [None]:
# Coefficient values, with their standard errors as error bars

s = pd.Series(
    results.params,
    index=X_train.columns,
)

s.plot.bar(yerr=results.bse)
plt.ylabel("Coefficient value")
plt.title("Coefficients")
plt.show()

The error bars show that the standard errors differ from one coefficient to another.

We also see that the coefficients of `AveRooms` and `AveBdrms` have opposite signs. Intuitively, we would expect them to have the same sign, and we would also expect these two variables to be highly correlated.
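The expected correlation between two such variables can be checked directly. The following is a minimal sketch on synthetic data, not on the California housing set: the arrays `rooms` and `bedrooms` are hypothetical stand-ins generated here so that the example runs on its own, and the numbers used to generate them are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic stand-ins (NOT the California housing data):
# `bedrooms` is built as a noisy fraction of `rooms`, so the two
# variables are strongly related by construction.
rooms = rng.normal(loc=5.0, scale=1.0, size=n_samples)
bedrooms = 0.2 * rooms + rng.normal(loc=0.0, scale=0.05, size=n_samples)

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix
corr = np.corrcoef(rooms, bedrooms)[0, 1]
print(f"Pearson correlation: {corr:.2f}")
```

On the notebook's own data, the analogous check would be something like `X_train[["AveRooms", "AveBdrms"]].corr()`.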

Note that the errors estimated by statsmodels are smaller than those observed previously with cross-validation.

## Coefficient absolute value - feature importance

In [None]:
# Plot absolute coefficient values with their standard errors

s = pd.Series(
    np.abs(results.params),
    index=X_train.columns,
)

s.plot.bar(yerr=results.bse)
plt.ylabel("Absolute coefficient value")
plt.title("Absolute coefficient value")
plt.show()

From the previous plot, we would expect `AveRooms` and `AveBdrms` to be the most important variables. However, the coefficients for those variables also show more variability, that is, a larger standard error, so we can trust them less.

## t-statistic

In [None]:
# Plot the absolute t-statistic of each coefficient

s = pd.Series(
    np.abs(results.tvalues),
    index=X_train.columns,
)

s.plot.bar()
plt.ylabel("Absolute t-statistic")
plt.title("Absolute t-statistic")
plt.show()

After dividing each coefficient by its standard error, which is what the t-statistic does, we see that `MedInc` is a more robust predictor of house price.
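To make the relationship between a coefficient, its standard error and its t-statistic concrete, here is a minimal sketch that computes all three by hand with NumPy using the textbook ordinary least squares formulas. It runs on synthetic data, not on the housing set: the variables `x1` and `x2` and the true coefficients used to generate `y` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (an assumption for this sketch):
# x1 has a strong effect on y, x2 a weak one.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.1 * x2 + rng.normal(scale=1.0, size=n)

# Design matrix with an explicit intercept column,
# mirroring what sm.add_constant does in the notebook.
X = np.column_stack([np.ones(n), x1, x2])

# OLS coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance, using n - p degrees of freedom
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof

# Standard errors: square root of the diagonal of sigma^2 * (X'X)^-1
cov = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov))

# t-statistic: coefficient divided by its standard error
t_values = beta / std_err

for name, b, se, t in zip(["const", "x1", "x2"], beta, std_err, t_values):
    print(f"{name}: coef={b:.3f}  se={se:.3f}  t={t:.2f}")
```

In the notebook, the same ratio should be recoverable as `results.params / results.bse`, which is expected to match `results.tvalues` for a default OLS fit.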