
In [3]:
import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=666)

# Let's create some fake data, just for practice
n = 2170
p = 4

X = rs.random(size=(n, p))

betas = np.arange(1, p+2)

y = betas[0] + (X @ betas[1:]) + rs.normal(size=n)

df = pd.DataFrame({f"X{i+1}": X[:, i] for i in range(p)}).assign(y=y)

model = " + ".join([f"{betas[i]} X{i}" for i in range(1, p+1)])
print(f"The model is y ~ {betas[0]} + {model}\n")

df

The model is y ~ 1 + 2 X1 + 3 X2 + 4 X3 + 5 X4



| | X1 | X2 | X3 | X4 | y |
|---|---|---|---|---|---|
| 0 | 0.700437 | 0.844187 | 0.676514 | 0.727858 | 10.665165 |
| 1 | 0.951458 | 0.012703 | 0.413588 | 0.048813 | 4.852854 |
| 2 | 0.099929 | 0.508066 | 0.200248 | 0.744154 | 7.034994 |
| 3 | 0.192892 | 0.700845 | 0.293228 | 0.774479 | 8.187246 |
| 4 | 0.005109 | 0.112858 | 0.110954 | 0.247668 | 1.624202 |
| ... | ... | ... | ... | ... | ... |
| 2165 | 0.653573 | 0.029482 | 0.412838 | 0.112104 | 5.135213 |
| 2166 | 0.910710 | 0.756870 | 0.218970 | 0.478947 | 6.195375 |
| 2167 | 0.212697 | 0.226373 | 0.239228 | 0.137604 | 4.501809 |
| 2168 | 0.736453 | 0.255201 | 0.887921 | 0.004492 | 7.268949 |


In [2]:
np.arange(1, p+2)

array([1, 2, 3, 4, 5])

# Practical approach to fitting OLS/LinearRegression models

Two popular modeling libraries in Python:

* `statsmodels`: As the name suggests, more geared towards statistics.
  * Its API, especially the formula interface, is modeled on `R`.
  * Provides lots of statistical summaries/scores.
  * Doesn't integrate well with the broader "ML" ecosystem.

* `sklearn`: A much broader ML library.
  * De facto standard for many ML libraries (via `Estimator`, `Transformer`, `Pipeline` APIs).
  * Not so "ergonomic" for statistical inference (e.g., the intercept and coefficients are split across attributes, and standard errors or p-values are not reported).
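
To make the `Estimator`/`Transformer`/`Pipeline` point concrete, here is a minimal sketch of a scaler chained with a linear regression. The toy data and the names `X_toy`, `y_toy`, and `pipe` are made up for illustration and are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, unrelated to the notebook's df
rs = np.random.RandomState(seed=0)
X_toy = rs.random(size=(200, 3))
y_toy = 1 + X_toy @ np.array([2.0, 3.0, 4.0]) + rs.normal(scale=0.1, size=200)

# A Pipeline chains Transformers and ends with an Estimator;
# the whole chain exposes the same fit/predict interface.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_toy, y_toy)

preds = pipe.predict(X_toy[:5])   # predictions for the first five rows
r2 = pipe.score(X_toy, y_toy)     # R^2 on the training data
print(preds.shape, round(r2, 3))
```

Because every step follows the same interface, the final `LinearRegression()` could be swapped for another estimator without touching the rest of the code.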

## `statsmodels`: Better summary of model properties; more _stats_ oriented (as the name suggests)

In [4]:
import statsmodels.formula.api as smf

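# Note that we _create_ a model with the data included.
# X4 is left out of the formula, so the intercept absorbs its average
# contribution (about 5 * 0.5 = 2.5 on top of the true intercept of 1).
model = smf.ols(formula="y ~ X1 + X2 + X3", data=df)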

# Calling fit() triggers a computation with the already-initialized
# data, resulting in a _different_ "fitted" object (cf. sklearn)
fitted_model = model.fit()

fitted_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.485 |
| Model: | OLS | Adj. R-squared: | 0.484 |
| Method: | Least Squares | F-statistic: | 678.8 |
| Date: | Sun, 10 Nov 2024 | Prob (F-statistic): | 4.7e-311 |
| Time: | 17:45:26 | Log-Likelihood: | -4282.9 |
| No. Observations: | 2170 | AIC: | 8574.0 |
| Df Residuals: | 2166 | BIC: | 8597.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.2809 | 0.115 | 28.413 | 0.000 | 3.054 | 3.507 |
| X1 | 2.0566 | 0.130 | 15.863 | 0.000 | 1.802 | 2.311 |
| X2 | 3.1216 | 0.129 | 24.271 | 0.000 | 2.869 | 3.374 |
| X3 | 4.2918 | 0.127 | 33.834 | 0.000 | 4.043 | 4.541 |

| | | | |
|---|---|---|---|
| Omnibus: | 37.236 | Durbin-Watson: | 1.999 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 21.488 |
| Skew: | 0.036 | Prob(JB): | 2.16e-05 |
| Kurtosis: | 2.518 | Cond. No. | 6.03 |


In [5]:
fitted_model.params

| | 0 |
|---|---|
| Intercept | 3.280898 |
| X1 | 2.056569 |
| X2 | 3.121597 |
| X3 | 4.291764 |
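
Beyond `params`, the fitted results object exposes the other columns of the summary table as attributes. The sketch below rebuilds the synthetic data so that it runs on its own and gathers a few of them into one DataFrame. The name `inference` and its column labels are chosen here for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the notebook's synthetic data so this block runs on its own
rs = np.random.RandomState(seed=666)
n, p = 2170, 4
X = rs.random(size=(n, p))
betas = np.arange(1, p + 2)
y = betas[0] + (X @ betas[1:]) + rs.normal(size=n)
df = pd.DataFrame({f"X{i+1}": X[:, i] for i in range(p)}).assign(y=y)

fitted_model = smf.ols(formula="y ~ X1 + X2 + X3", data=df).fit()

# Collect the main inference quantities into one DataFrame
ci = fitted_model.conf_int()  # columns 0 and 1 hold the 95% bounds
inference = pd.DataFrame({
    "coef": fitted_model.params,
    "std_err": fitted_model.bse,
    "p_value": fitted_model.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
print(inference)
print(f"R-squared: {fitted_model.rsquared:.3f}")
```

Each attribute is a pandas object indexed by term name, so individual values can be looked up with, for example, `fitted_model.bse["X1"]`.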


## `sklearn`: More geared towards blackbox ML-style "fit/predict". No "inference".

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
# Note that we create a data-agnostic LinearRegression model
# but then fit() with the data (cf. statsmodels)
lr = LinearRegression()
lr.fit(X=df[["X1", "X2", "X3"]], y=df.y)
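
The "fit/predict" pattern can be sketched on its own with a train/test split. Unlike `statsmodels`, `fit()` here returns the same estimator object, now holding the learned parameters. The toy data and the names `X_toy` and `y_toy` are hypothetical and independent of the notebook's `df`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data, unrelated to the notebook's df
rs = np.random.RandomState(seed=0)
X_toy = rs.random(size=(500, 3))
y_toy = 1 + X_toy @ np.array([2.0, 3.0, 4.0]) + rs.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

lr = LinearRegression()
returned = lr.fit(X_train, y_train)
print(returned is lr)  # fit() mutates the estimator and returns it

# Predict on rows the model has not seen and score the predictions
y_pred = lr.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
print(f"held-out R^2: {test_r2:.3f}")
```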

In [8]:
# Extracting the parameters takes a little more work:
# the intercept and the slopes live in separate attributes
lr.intercept_

3.2808978882438717

In [9]:
lr.coef_

array([2.05656928, 3.12159717, 4.2917639 ])
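
Since `sklearn` reports neither labels nor standard errors, one option is to assemble them by hand. The sketch below labels the coefficients with a pandas Series and computes classical (non-robust) OLS standard errors from the textbook formula `sigma^2 * (X'X)^-1`. This is a hand-rolled illustration, not an `sklearn` feature, and the names `features`, `design`, and `std_err` are introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the notebook's synthetic data so this block runs on its own
rs = np.random.RandomState(seed=666)
n, p = 2170, 4
X = rs.random(size=(n, p))
betas = np.arange(1, p + 2)
y = betas[0] + (X @ betas[1:]) + rs.normal(size=n)
df = pd.DataFrame({f"X{i+1}": X[:, i] for i in range(p)}).assign(y=y)

features = ["X1", "X2", "X3"]
lr = LinearRegression().fit(df[features], df["y"])

# Put the intercept and slopes into one labeled Series
params = pd.Series(
    np.concatenate([[lr.intercept_], lr.coef_]),
    index=["Intercept"] + features,
)

# Classical OLS standard errors: sqrt(diag(sigma^2 * (X'X)^-1)),
# where X includes a column of ones for the intercept
design = np.column_stack([np.ones(n), df[features].to_numpy()])
residuals = df["y"].to_numpy() - lr.predict(df[features])
dof = n - design.shape[1]              # residual degrees of freedom
sigma2 = residuals @ residuals / dof   # unbiased noise variance estimate
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = pd.Series(np.sqrt(np.diag(cov)), index=params.index)

print(pd.DataFrame({"coef": params, "std_err": std_err}))
```

For anything beyond a quick check, the `statsmodels` results object is the more convenient route, since it also provides t-statistics, p-values, and confidence intervals.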