# Testing `sklearn` and `statsmodels` functions using artificial data

1. [Testing how `sklearn` performs with high collinearity](#testing-how-sklearn-performs-with-high-collinearity)
1. [Testing `StandardScaler`](#testing-standardscaler)
1. [Testing ridge regression](#testing-ridge-regression)
1. [Testing LASSO](#testing-lasso)


In [244]:
# import modules

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import sklearn.linear_model as skl_lm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# ridge and lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

---

## Testing how `sklearn` performs with high collinearity

We generate a dataframe with 3 predictors:
$$Y=X_1 +2X_2 + 3X_3,$$

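where $X_1$ and $X_2$ are strongly correlated: $X_2$ is generated as $3X_1$ plus noise (the multiplier is the `correlation` parameter in the code below), and $Y$ also has noise added.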

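Results:
- `LinearRegression` returns the same coefficients as the pseudoinverse (least-squares) solution
- it gives no warning about the collinearity
- the fit is good when the predictors are not scaled; when they are scaled, the expected confounding between $X_1$ and $X_2$ shows up
- the "equal weightage distribution" is due to the scaler, not the OLS function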
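
Since `LinearRegression` gives no warning, collinearity has to be checked by hand. The sketch below shows one possible check on freshly simulated data: the correlation between the first two predictors and the condition number of the standardised design matrix. The seed, the variable names (`X_demo`, `corr_12`, `cond_number`) and the choice of the condition number as a diagnostic are assumptions of this example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) so the sketch is reproducible

n = 100
x1 = rng.integers(0, 100, size=n).astype(float)
x2 = 3 * x1 + rng.integers(-50, 50, size=n)   # strongly correlated with x1
x3 = rng.integers(0, 100, size=n).astype(float)
X_demo = np.column_stack([x1, x2, x3])

# pairwise correlation between the first two predictors
corr_12 = np.corrcoef(x1, x2)[0, 1]

# condition number of the standardised design matrix:
# the further above 1, the closer the columns are to linear dependence
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cond_number = np.linalg.cond(X_std)

print("corr(x1, x2):", corr_12)
print("condition number:", cond_number)
```

A correlation close to 1 together with a condition number well above 1 would flag the same problem that the coefficient comparison in this section reveals indirectly.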

In [245]:
# set parameters as you like to play around 
beta1=1
beta2=2
beta3=3
correlation=3 # X_2 = correlation*X_1 + noise

# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))
random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

for i in range(num_rows):
    df.iloc[i,1]=correlation*df.iloc[i,0]+ np.random.randint(-50,50)
    df.iloc[i,num_cols-1] = beta1*df.iloc[i,0] + beta2*df.iloc[i,1] + beta3*df.iloc[i,2] + np.random.randint(-100,100)

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0       3     50     73    373
1      54    133     97    553
2      76    241      0    463
3      47    132     75    590
4      35    149     24    381
..    ...    ...    ...    ...
95     97    306     99    959
96     54    189     81    672
97     83    217     24    546
98     72    218     47    688
99      6     32     48    282

[100 rows x 4 columns]


In [246]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

# build model
model=LinearRegression()
model_scaled=model.fit(X_train_scaled,y_train)

# make predictions
y_pred1=model.predict(X_test_scaled)

print(model_scaled.coef_, model_scaled.intercept_)


[ 10.45822229 201.47076933  80.58962345] 504.3125


Real relationship:

$$Y=X_1+2X_2+3X_3$$

In [247]:
# equally distributed weightage?

b1=(model_scaled.coef_[1]+model_scaled.coef_[0])

print(b1)

211.92899162592866


In [248]:
# pseudoinverse for scaled data

# append a column of ones for the constant term
pinv_arr=np.column_stack([X_train_scaled, np.ones(X_train_scaled.shape[0])])

pinv=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv,y_train))

[ 10.45822229 201.47076933  80.58962345 504.3125    ]


### What if we remove the second (collinear) column?

Now there is no collinearity in the model, and the coefficient of the first predictor (200.96) is close to the sum of the coefficients of the two correlated variables in the previous model (211.93).
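
A rough derivation of why the coefficients should add up, under the simplifying assumption that the second predictor is an exact multiple $X_2 = cX_1$ with $c>0$ (here $c$ stands for the `correlation` parameter, and noise is ignored). The symbols $c$, $s_1$, $s_2$ (column standard deviations) and $Z_1$, $Z_2$ (standardised columns) are notation introduced for this sketch only.

$$
\begin{aligned}
Y &= \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \\
  &= (\beta_1 + c\,\beta_2)\,X_1 + \beta_3 X_3 .
\end{aligned}
$$

Standardising multiplies each coefficient by the standard deviation of its column, and $s_2 = c\,s_1$. With exact collinearity the standardised columns coincide ($Z_1 = Z_2$), so only the sum of their coefficients is identified:

$$
\beta_1 s_1 + \beta_2 s_2 = (\beta_1 + c\,\beta_2)\,s_1 ,
$$

which is the standardised coefficient of $X_1$ in the model without $X_2$. With noise in $X_2$ and $Y$ the equality holds only approximately, which is consistent with the small gap between the two values in this notebook.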

In [249]:
X_new=X.drop(columns=['col_2'])

X_new_train, X_new_test, y_train, y_test = train_test_split(X_new,y,train_size=0.8,random_state=42)

# standardise
X_new_train_scaled=scaler.fit_transform(X_new_train)
X_new_test_scaled=scaler.transform(X_new_test)

# build model
model_new=LinearRegression().fit(X_new_train_scaled,y_train)

# test
y_pred2=model_new.predict(X_new_test_scaled)

print(model_new.coef_, model_new.intercept_)

[200.95609715  85.47910139] 504.3125


In [250]:
# comparing R2 values
print("with second column: ", r2_score(y_test,y_pred1))
print(" w/o second column: ", r2_score(y_test,y_pred2))

with second column:  0.9212136936055169
 w/o second column:  0.8057671040075728


## Testing StandardScaler

### 1. What if we don't scale the regressors?

In [251]:
# model fit to unscaled data

model_unscaled=LinearRegression()
model_unscaled.fit(X_train,y_train)
print(model_unscaled.coef_, model_unscaled.intercept_)

# real relationship: Y=X_1+2X_2+3X_3

[0.3528274  2.08528539 2.867593  ] 30.025333148315553


In [252]:
# using pseudoinverse for unscaled data matrix

# append a column of ones for the constant term
pinv_arr=np.column_stack([X_train.to_numpy(), np.ones(len(X_train))])

pinv2=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv2,y_train))

[ 0.3528274   2.08528539  2.867593   30.02533315]


In [253]:
print(X_train)

    col_1  col_2  col_3
55     10    -15     40
88     41     82     88
26     95    247      0
42     46    120      9
69     67    236     84
..    ...    ...    ...
60     26     36     23
71     29     47     38
14     16      6     60
92     96    327     65
51     10     -2     68

[80 rows x 3 columns]

In [254]:
X_train_scaled

array([[-1.38700595e+00, -1.73613552e+00, -2.85995613e-01],
       [-3.41163824e-01, -7.32155220e-01,  1.42197197e+00],
       [ 1.48062569e+00,  9.75646317e-01, -1.70930193e+00],
       [-1.72479609e-01, -3.38843350e-01, -1.38905801e+00],
       [ 5.35994092e-01,  8.61792882e-01,  1.27964133e+00],
       [-1.35326911e+00, -8.04607406e-01, -1.49580598e+00],
       [ 4.68520406e-01,  7.68640071e-01,  2.47744256e-01],
       [ 9.74151340e-02,  3.75328201e-01,  1.17289336e+00],
       [ 2.32362506e-01, -2.83339800e-02, -9.62066113e-01],
       [-1.55569017e+00, -1.75683614e+00, -1.03323143e+00],
       [-3.07426981e-01, -6.18301784e-01, -4.99491560e-01],
       [ 5.69730935e-01,  4.68481013e-01,  2.83326914e-01],
       [-1.25205858e+00, -7.01104283e-01, -1.21114472e+00],
       [ 1.04204674e+00,  1.44141037e+00,  1.10172805e+00],
       [-1.65690070e+00, -1.89139020e+00,  8.52649440e-01],
       [ 4.01046720e-01,  8.72143194e-01,  3.42483082e-02],
       [ 1.27820464e+00,  1.14125131e+00

Let's compare the first column of $\mathbf{X}$ before and after scaling by standardising it manually:

In [255]:
# trying to standardise a column manually

X_new=X_train.astype(float) # float copy: X_train stays untouched and no dtype warning is raised
mean1=np.mean(X_train.iloc[:,0])
var1=np.var(X_train.iloc[:,0]) # population variance (ddof=0)
sd1=var1**0.5
X_new.iloc[:,0]=(X_train.iloc[:,0] - mean1 )/sd1
print(mean1,var1)
X_new.iloc[:,0]

51.1125 878.5998437499996

55   -1.387006
88   -0.341164
26    1.480626
42   -0.172480
69    0.535994
        ...   
60   -0.847216
71   -0.746006
14   -1.184585
92    1.514363
51   -1.387006
Name: col_1, Length: 80, dtype: float64
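
The manual values agree with the first column of `X_train_scaled` because `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also what `np.var` uses by default. A minimal sketch of that check on simulated data; the seed and the names `col`, `manual_ddof0` and `manual_ddof1` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)  # seed chosen for this sketch only
col = rng.integers(0, 100, size=(80, 1)).astype(float)

scaled_by_sklearn = StandardScaler().fit_transform(col)

# population standard deviation (ddof=0), the NumPy default
manual_ddof0 = (col - col.mean()) / col.std(ddof=0)
# sample standard deviation (ddof=1), the pandas default for Series.std()
manual_ddof1 = (col - col.mean()) / col.std(ddof=1)

print(np.allclose(scaled_by_sklearn, manual_ddof0))
print(np.allclose(scaled_by_sklearn, manual_ddof1))
```

Only the `ddof=0` version should match, so standardising by hand with `Series.std()` from pandas would give slightly different numbers.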

### 2. Testing the scaler with non-collinear data

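Results: everything behaves as expected. The unscaled model recovers coefficients close to the true values (1, 2, 3), and the scaled model gives the same fit with each coefficient multiplied by the standard deviation of its column.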

In [256]:
# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))

random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

for i in range(num_rows):
    df.iloc[i,num_cols-1]=df.iloc[i,0]+2*df.iloc[i,1]+3*df.iloc[i,2]+np.random.randint(-10,10)

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0      56     44     43    273
1      40     41     30    205
2      26     34     24    166
3      85      6     72    311
4      82     10     35    216
..    ...    ...    ...    ...
95      4     89     86    442
96     96     72     10    272
97     33     49     30    221
98      0     34      5     84
99     24     18     25    137

[100 rows x 4 columns]


In [257]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

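# correct: one estimator object per fit
model1=LinearRegression()
model2=LinearRegression()
model_scaled=model1.fit(X_train_scaled,y_train)
model_unscaled=model2.fit(X_train,y_train)

# wrong: fit() returns the estimator itself, so both names would point to the
# same object and the second fit would overwrite the first
# model=LinearRegression()
# model_scaled=model.fit(X_train_scaled,y_train)
# model_unscaled=model.fit(X_train,y_train)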
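
The reason the second variant is wrong can be seen in a small sketch: `fit` returns the estimator it was called on, so refitting replaces the earlier coefficients. The data and the names `shared`, `fit_a` and `fit_b` are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)  # illustrative data, not the notebook's dataframe
X_a = rng.normal(size=(50, 2))
X_b = 10 * X_a                        # same data on a different scale
y_demo = X_a @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

shared = LinearRegression()
fit_a = shared.fit(X_a, y_demo)
coef_a_snapshot = fit_a.coef_.copy()  # copy before the estimator is refit
fit_b = shared.fit(X_b, y_demo)       # refits the same object in place

print(fit_a is fit_b)                 # both names point to one estimator
print(coef_a_snapshot, fit_a.coef_)   # fit_a now reports the second fit
```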


In [258]:
df

    col_1  col_2  col_3  col_4
0      56     44     43    273
1      40     41     30    205
2      26     34     24    166
3      85      6     72    311
4      82     10     35    216
..    ...    ...    ...    ...
95      4     89     86    442
96     96     72     10    272
97     33     49     30    221
98      0     34      5     84


In [259]:
X_train_scaled

array([[ 1.60582582,  1.81051356, -0.14283049],
       [-0.05651975,  0.43863888,  0.30191116],
       [ 1.37309744, -1.40256135, -0.38230677],
       [ 1.14036906,  0.7274546 , -0.31388498],
       [-1.28665547, -0.78882794, -0.62178305],
       [ 0.40893701,  0.43863888,  0.8834964 ],
       [-1.41964311,  0.18592512,  1.32823806],
       [ 1.63907273,  0.8357605 , -1.20336829],
       [ 0.17620863, -1.47476528,  1.36244896],
       [-0.52197651,  0.6191487 ,  1.08876178],
       [ 1.63907273,  0.40253691,  0.71244192],
       [-0.62171724,  0.04151726,  1.63613613],
       [-0.85444562, -0.60831811, -0.55336125],
       [-0.18950739,  1.5577998 ,  1.56771434],
       [-0.65496415, -1.61917314,  1.46508165],
       [ 1.60582582, -0.68052205,  1.56771434],
       [-1.12042091, -1.47476528, -1.0665247 ],
       [-0.38898886, -0.1028906 , -0.82704842],
       [-1.18691473,  0.94406639, -0.45072856],
       [-1.25340856, -1.22205152, -0.89547022],
       [ 1.70556655, -0.86103187,  0.575

In [260]:
print("Scaled: ",model_scaled.coef_, model_scaled.intercept_)
print("Unscaled: ",model_unscaled.coef_, model_unscaled.intercept_)

Scaled:  [30.71013975 54.56729394 87.56053925] 278.6375
Unscaled:  [1.02101729 1.96998655 2.99552455] -0.6006722168861529
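
The two sets of coefficients describe the same fitted line. Dividing each scaled coefficient by the standard deviation of its column gives the unscaled one (for example 30.71 / 1.021 ≈ 30.1, a plausible standard deviation for integers drawn uniformly from 0 to 99). A minimal sketch of the conversion on simulated data with the same structure; the names `scaler_demo`, `coef_back` and `intercept_back` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)  # illustrative data, same structure as the notebook's
X_demo = rng.integers(0, 100, size=(80, 3)).astype(float)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.integers(-10, 10, size=80)

scaler_demo = StandardScaler().fit(X_demo)
fit_scaled = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)
fit_raw = LinearRegression().fit(X_demo, y_demo)

# undo the scaling: divide each slope by its column's standard deviation
coef_back = fit_scaled.coef_ / scaler_demo.scale_
# and shift the intercept by the contribution of the column means
intercept_back = fit_scaled.intercept_ - np.sum(coef_back * scaler_demo.mean_)

print(coef_back, fit_raw.coef_)
print(intercept_back, fit_raw.intercept_)
```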


## Testing ridge regression

What I want to know: does ridge regression only give a model that is useful for prediction, or is it actually able to capture the true underlying relationship?

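We reuse the dataframe generated in the previous section, with relationship
$$Y=X_1+2X_2+3X_3$$
Note that in this dataframe $X_1$, $X_2$ and $X_3$ are generated independently, so unlike in the first section there is no built-in collinearity between $X_1$ and $X_2$.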

In [261]:
# reuse the data
print(df)


    col_1  col_2  col_3  col_4
0      56     44     43    273
1      40     41     30    205
2      26     34     24    166
3      85      6     72    311
4      82     10     35    216
..    ...    ...    ...    ...
95      4     89     86    442
96     96     72     10    272
97     33     49     30    221
98      0     34      5     84
99     24     18     25    137

[100 rows x 4 columns]


In [262]:
X=df.iloc[:,0:3]
y=df.iloc[:,3]

X_train, X_test, y_train, y_test=train_test_split(X,y,train_size=0.8, random_state=42)

scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

ols=LinearRegression().fit(X_train_scaled,y_train)
ridge_cv=RidgeCV(alphas=[0.0001,0.001,0.01,0.1,1.0,5.0,10.0],cv=10) #alpha = regularisation parameter
ridge_cv.fit(X_train_scaled, y_train)

# make predictions
y_pred_ols=ols.predict(X_test_scaled)
y_pred_ridge=ridge_cv.predict(X_test_scaled)
print("Model score (R^2) for OLS: ", r2_score(y_test,y_pred_ols))
print("Model score (R^2) for ridge: ", r2_score(y_test,y_pred_ridge))



Model score (R^2) for OLS:  0.9973538754317404
Model score (R^2) for ridge:  0.9972865528157732


In [263]:
from sklearn.model_selection import cross_val_score, KFold

# define cross-validation strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# compute cross-validated R^2 scores
cv_scores = cross_val_score(ols, X_train_scaled, y_train, cv=cv, scoring='r2')

print("Cross-validated R^2 scores (OLS):", cv_scores)
print("Mean CV R^2 (OLS):", np.mean(cv_scores))

Cross-validated R^2 scores (OLS): [0.99745812 0.99794622 0.99800506 0.99851217 0.99543926]
Mean CV R^2 (OLS): 0.9974721663330166
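
In the cell above the scaler was fitted on the whole training set before cross-validation, so the validation rows of each fold contributed to the scaling statistics. For plain OLS this does not change $R^2$, because the predictions are unaffected by rescaling the features, but for penalised models it can matter. A minimal sketch of one way to keep the scaling inside each fold with `make_pipeline`; the data is simulated and the names `pipe`, `cv_demo`, `X_demo` and `y_demo` are not from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)  # illustrative data only
X_demo = rng.integers(0, 100, size=(80, 3)).astype(float)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.integers(-10, 10, size=80)

# the scaler is refitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="r2")

print(scores, scores.mean())
```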


In [264]:
ols.coef_

array([30.71013975, 54.56729394, 87.56053925])

In [265]:
ridge_cv.coef_

array([30.70064674, 54.50848182, 87.4572582 ])

In [266]:
ridge_cv.alpha_

0.1
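
With a penalty as small as the selected one, the ridge coefficients are expected to sit very close to the OLS ones, which is what the two arrays above show. The sketch below compares the closed-form ridge solution on standardised predictors with `Ridge` from `sklearn`; it uses simulated data, and the names `Z`, `coef_closed`, `coef_sklearn` and `coef_ols` are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)  # illustrative data only
X_demo = rng.integers(0, 100, size=(80, 3)).astype(float)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.integers(-10, 10, size=80)

# standardised predictors have column means of 0, so after centring y
# the (unpenalised) intercept drops out of the problem
Z = StandardScaler().fit_transform(X_demo)
y_centered = y_demo - y_demo.mean()

alpha = 0.1
# closed-form ridge solution: (Z'Z + alpha*I)^{-1} Z'y
coef_closed = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ y_centered)
coef_sklearn = Ridge(alpha=alpha).fit(Z, y_demo).coef_
# OLS on the same standardised data, for comparison
coef_ols = np.linalg.lstsq(Z, y_centered, rcond=None)[0]

print(coef_closed)
print(coef_sklearn)
print(coef_ols)
```

The penalty `alpha` is added to eigenvalues of `Z'Z` that are on the order of the number of rows, so for 80 rows and `alpha = 0.1` the shrinkage is tiny.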

## Testing LASSO 

The question is the same as for ridge: does LASSO only give a model that is useful for prediction, or is it actually able to capture the true underlying relationship?

We reuse the same dataframe and train/test split as in the ridge section, with relationship
$$Y=X_1+2X_2+3X_3$$
As noted there, the predictors in this dataframe are generated independently, so there is no built-in collinearity between $X_1$ and $X_2$.

In [269]:
#Lasso Cross validation
lasso_cv = LassoCV(alphas = [0.0001, 0.001,0.01, 0.1, 1, 10], random_state=42).fit(X_train_scaled, y_train)

y_pred_lasso=lasso_cv.predict(X_test_scaled)
#score
# print(lasso_cv.score(X_train, y_train))
# print(lasso_cv.score(X_test, y_test))
r2_lasso=r2_score(y_test,y_pred_lasso)

# compare
print("Model score (R^2) for OLS:   ", r2_score(y_test,y_pred_ols))
print("Model score (R^2) for ridge: ", r2_score(y_test,y_pred_ridge))
print("Model score (R^2) for lasso: ", r2_lasso)
print(f'Alpha selected out of {lasso_cv.alphas}: {lasso_cv.alpha_}')

Model score (R^2) for OLS:    0.9973538754317404
Model score (R^2) for ridge:  0.9972865528157732
Model score (R^2) for lasso:  0.9973538195472932
Alpha selected out of [0.0001, 0.001, 0.01, 0.1, 1, 10]: 0.0001


In [270]:
print("Test MSE for OLS:   ", mean_squared_error(y_test,y_pred_ols))
print("Test MSE for ridge: ", mean_squared_error(y_test,y_pred_ridge))
print("Test MSE for lasso: ", mean_squared_error(y_test,y_pred_lasso))

Test MSE for OLS:    36.74511112812397
Test MSE for ridge:  37.67997906095754
Test MSE for lasso:  36.745887161211655
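
To judge whether LASSO recovers the underlying relationship rather than just predicting well, the coefficients can be converted back to the original units and compared with the true values. A minimal sketch on simulated data with the same structure; the grid of penalties and the names `results`, `true_beta` and `scaler_demo` are chosen for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)  # illustrative data only
X_demo = rng.integers(0, 100, size=(80, 3)).astype(float)
true_beta = np.array([1.0, 2.0, 3.0])
y_demo = X_demo @ true_beta + rng.integers(-10, 10, size=80)

scaler_demo = StandardScaler().fit(X_demo)
Z = scaler_demo.transform(X_demo)

results = {}
for alpha in (0.0001, 1.0, 10.0):
    lasso = Lasso(alpha=alpha).fit(Z, y_demo)
    # divide by the column standard deviations to return to original units
    results[alpha] = lasso.coef_ / scaler_demo.scale_
    print(alpha, results[alpha])
```

With a very small penalty the converted coefficients should be close to (1, 2, 3), and larger penalties pull them towards zero. A penalty chosen purely for prediction error is therefore not guaranteed to give the most faithful coefficients.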
