# Testing `sklearn` and `statsmodels` functions using artificial data

1. [Testing how `sklearn` performs with high collinearity](#testing-how-sklearn-performs-with-high-collinearity)
1. [Testing `StandardScaler`](#testing-standardscaler)
1. [Testing ridge regression](#testing-ridge-regression)
1. [Testing LASSO](#testing-lasso)


In [162]:
# import modules

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import sklearn.linear_model as skl_lm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# ridge
from sklearn.linear_model import RidgeCV

---

## Testing how `sklearn` performs with high collinearity

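We generate a dataframe with 3 predictors and a response built from
$$Y=X_1 +2X_2 + 3X_3,$$

where $X_1$ and $X_2$ are strongly correlated through the relationship $X_2=2X_1$. In the code below both relations hold only up to added uniform integer noise: $[-50, 50)$ on $X_2$ and $[-100, 100)$ on $Y$.

Results: 
- `LinearRegression` returns the same coefficients as the pseudoinverse (least-squares) solution
- no warning is raised about the collinearity
- the fit on unscaled predictors recovers the true coefficients well; on scaled predictors the coefficients show the expected confounding between $X_1$ and $X_2$
- the apparent "equal weightage distribution" is due to the scaler, not to the OLS function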
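Since no warning is raised, it can help to measure the collinearity directly. The following is a minimal sketch, not part of the original notebook: it regenerates data with the same construction using a seeded generator (the seed and the names `X_demo`, `corr`, `cond` are assumptions for illustration) and reports the predictor correlations and the condition number of the standardised design matrix. With these noise ranges the population correlation between $X_1$ and $X_2$ is $2/\sqrt{5}\approx 0.89$, so the collinearity is strong but not exact.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the notebook itself is unseeded).
rng = np.random.default_rng(0)
n = 100
x1 = rng.integers(0, 100, size=n)
x2 = 2 * x1 + rng.integers(-50, 50, size=n)  # X_2 = 2*X_1 + noise
x3 = rng.integers(0, 100, size=n)
X_demo = np.column_stack([x1, x2, x3]).astype(float)

# Pairwise correlations between the three predictors (columns are variables).
corr = np.corrcoef(X_demo, rowvar=False)
print(np.round(corr, 2))

# Condition number of the standardised design matrix:
# values close to 1 mean well-conditioned, large values mean near-singular.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cond = np.linalg.cond(X_std)
print("condition number:", round(cond, 2))
```

A correlation near 0.9 with a modest condition number suggests the problem is still numerically well-posed, which would be consistent with OLS recovering the coefficients on unscaled data.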

In [163]:
# set parameters as you like to play around 
beta1=1
beta2=2
beta3=3
correlation=2 # X_2 = 2*X_1

# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))
random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

for i in range(num_rows):
    df.iloc[i,1]=correlation*df.iloc[i,0]+ np.random.randint(-50,50)
    df.iloc[i,num_cols-1] = beta1*df.iloc[i,0] + beta2*df.iloc[i,1] + beta3*df.iloc[i,2] + np.random.randint(-100,100)

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0      31    104     42    403
1       6     57     78    392
2      16     -7     40    133
3      69    130     42    536
4      44    112     58    475
..    ...    ...    ...    ...
95     15     77     72    367
96     23     53     90    477
97     38    101     83    496
98     42     39      2    126
99     27     77     77    446

[100 rows x 4 columns]


In [164]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

# build model
model=LinearRegression()
model_scaled=model.fit(X_train_scaled,y_train)

print(model_scaled.coef_, model_scaled.intercept_)


[ 31.13727192 131.24845276  90.21227197] 350.1625


Real relationship:

$$Y=X_1+2X_2+3X_3$$

In [165]:
# equally distributed weightage?

b1=(model_scaled.coef_[1]+model_scaled.coef_[0])

print(b1)

162.38572467833592


In [166]:
# pseudoinverse for scaled data

# design matrix: scaled predictors plus a column of ones for the constant term
pinv_arr=np.column_stack([X_train_scaled, np.ones(X_train_scaled.shape[0])])

pinv=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv,y_train))

[ 31.13727192 131.24845276  90.21227197 350.1625    ]


### What if we remove the second (collinear) column?

Now there is no collinearity in the model, and the coefficient of the first predictor (150.66 in the output below) is close to, though not exactly equal to, the sum of the coefficients of the two correlated predictors in the previous model (31.14 + 131.25 = 162.39).

In [167]:
X_new=X.drop(columns=['col_2'])

X_new_train, X_new_test, y_train, y_test = train_test_split(X_new,y,train_size=0.8,random_state=42)

# standardise
X_new_train_scaled=scaler.fit_transform(X_new_train)
X_new_test_scaled=scaler.transform(X_new_test)

# build model
model=LinearRegression()
model_new=model.fit(X_new_train_scaled,y_train)

print(model_new.coef_, model_new.intercept_)

[150.65610416 105.25172218] 350.1625
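The absorption effect is easier to see in unscaled units. Substituting $X_2 = 2X_1 + e$ into $Y = X_1 + 2X_2 + 3X_3$ gives $Y = 5X_1 + 3X_3 + 2e$, so the reduced model's coefficient on $X_1$ should be near $\beta_1 + 2\beta_2 = 5$. This is consistent with the scaled value above: 150.66 divided by the standard deviation of `col_1` in the training split ($\sqrt{873.15}\approx 29.55$, from the manual standardisation later in this notebook) is about 5.1. The sketch below checks this on freshly simulated data; the seed, the larger sample size, and the names `full` and `reduced` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000  # larger than the notebook's 100 rows, to reduce sampling noise
x1 = rng.integers(0, 100, size=n)
x2 = 2 * x1 + rng.integers(-50, 50, size=n)  # X_2 = 2*X_1 + e
x3 = rng.integers(0, 100, size=n)
y = 1 * x1 + 2 * x2 + 3 * x3 + rng.integers(-100, 100, size=n)

# Full model: all three predictors, unscaled.
full = LinearRegression().fit(np.column_stack([x1, x2, x3]), y)

# Reduced model: col_2 dropped, so X_1 has to absorb its effect.
reduced = LinearRegression().fit(np.column_stack([x1, x3]), y)

print("full coefficients:   ", np.round(full.coef_, 3))     # expected near [1, 2, 3]
print("reduced coefficients:", np.round(reduced.coef_, 3))  # expected near [5, 3]
```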


## Testing StandardScaler

### 1. What if we don't scale the regressors?

In [168]:
# model fit to unscaled data

model_unscaled=LinearRegression()
model_unscaled.fit(X_train,y_train)
print(model_unscaled.coef_, model_unscaled.intercept_)

# real relationship: Y=X_1+2X_2+3X_3

[1.05374646 1.95255523 2.89790861] 10.848553568252782


In [169]:
# using pseudoinverse for unscaled data matrix

# add one column of ones for constant term
pinv_arr=np.column_stack([X_train.to_numpy(), np.ones(X_train.shape[0])])

pinv2=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv2,y_train))

[ 1.05374646  1.95255523  2.89790861 10.84855357]


In [170]:
X_train.head()

    col_1  col_2  col_3
55     50     74      0
88     34     54      9
26     76    119     52
42     62    143     43
69     44    121     24

In [171]:
X_train_scaled

array([[ 0.23689375, -0.16104122, -1.42305862],
       [-0.30457768, -0.45857695, -1.13394965],
       [ 1.11678484,  0.50841418,  0.24734879],
       [ 0.64299733,  0.86545706, -0.04176019],
       [ 0.03384196,  0.53816776, -0.65210135],
       [-0.81220716, -1.41069129, -1.3909354 ],
       [ 1.82746611,  2.24899822,  0.82556674],
       [-0.6768393 , -0.47345373,  1.46803113],
       [-1.45520449, -1.21729307, -0.71634779],
       [ 0.43994554,  0.70181241,  1.27529181],
       [ 0.40610358,  0.43403025, -0.74847101],
       [ 0.98141698,  0.56792133, -0.33086916],
       [-1.01525895, -1.61896631,  0.11885591],
       [ 1.2521527 ,  1.57954282,  1.72501688],
       [-1.28599467, -1.291677  , -0.71634779],
       [ 1.75978218,  0.88033385,  1.56440079],
       [-0.10152589, -0.66685196,  0.40796489],
       [ 0.03384196, -0.6222216 ,  0.31159523],
       [-0.91373305, -1.03877163, -1.42305862],
       [ 0.91373305,  0.95471778,  0.05460947],
       [-0.40610358, -0.60734481, -0.106

Let's compare $\mathbf{X}$ before and after scaling by standardising the first column by hand. The result should match the first column of `X_train_scaled` above.

In [172]:
# trying to standardise a column manually

X_new=X_train.astype(float)  # astype returns a copy, so X_train itself is left unchanged
mean1=np.mean(X_train.iloc[:,0])
var1=np.var(X_train.iloc[:,0])  # population variance (ddof=0), as used by StandardScaler
sd1=var1**0.5
X_new.iloc[:,0]=(X_train.iloc[:,0] - mean1 )/sd1
print(mean1,var1)
X_new.iloc[:,0]

43.0 873.15


55    0.236894
88   -0.304578
26    1.116785
42    0.642997
69    0.033842
        ...   
60    1.455204
71   -1.319837
14    1.590572
92    1.861308
51   -1.015259
Name: col_1, Length: 80, dtype: float64
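The manual result agrees with `StandardScaler` because both divide by the population standard deviation (`ddof=0`). A short sketch of the check follows, using a made-up seeded dataframe (`df_demo` and the other names are illustrative assumptions, not the notebook's data). It also shows that the pandas default `std()` uses `ddof=1` and therefore does not reproduce the scaler exactly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.integers(0, 100, size=(80, 3)),
    columns=["col_1", "col_2", "col_3"],
)

# Scaler output: (x - mean) / population standard deviation, column by column.
scaled = StandardScaler().fit_transform(df_demo)

# Manual standardisation with the population standard deviation (ddof=0).
manual_pop = ((df_demo - df_demo.mean()) / df_demo.std(ddof=0)).to_numpy()

# Manual standardisation with the pandas default, the sample standard deviation (ddof=1).
manual_sample = ((df_demo - df_demo.mean()) / df_demo.std()).to_numpy()

matches_pop = np.allclose(scaled, manual_pop)
matches_sample = np.allclose(scaled, manual_sample)
print("matches ddof=0:", matches_pop)
print("matches ddof=1:", matches_sample)
```

With 80 rows the two versions differ by a factor of $\sqrt{79/80}$, which is small but enough to make an exact comparison fail.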

### 2. Testing the scaler with non-collinear data

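Results: everything is normal and consistent. The scaled and unscaled models compare as expected: each scaled coefficient equals the unscaled coefficient multiplied by that column's standard deviation, and the scaled intercept equals the mean of `y_train`.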

In [173]:
# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))

random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

for i in range(num_rows):
    df.iloc[i,num_cols-1]=df.iloc[i,0]+2*df.iloc[i,1]+3*df.iloc[i,2]+np.random.randint(-10,10)

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0      78     51     64    364
1      90     61     36    317
2      18     28     30    159
3      94     60     25    294
4      82     99     78    509
..    ...    ...    ...    ...
95     70     57     81    431
96     10     83     25    245
97     98      9     37    225
98     59     73     99    501
99     24     96     57    391

[100 rows x 4 columns]


In [174]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

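# correct: one estimator object per fit
model1=LinearRegression()
model2=LinearRegression()
model_scaled=model1.fit(X_train_scaled,y_train)
model_unscaled=model2.fit(X_train,y_train)

# wrong: fit() returns the estimator itself, so both names would refer to
# the same object, refitted on the unscaled data
# model=LinearRegression()
# model_scaled=model.fit(X_train_scaled,y_train)
# model_unscaled=model.fit(X_train,y_train)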
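The pitfall in the commented-out version can be demonstrated in isolation. This is a small sketch on synthetic data (all names here are illustrative assumptions): because `fit` returns the estimator itself, fitting one object twice leaves both variables pointing at the second fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_a = rng.normal(size=(50, 2))
X_b = 10 * X_a  # same data on a different scale
y_demo = X_a @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression()
first = model.fit(X_a, y_demo)   # fit() returns the estimator itself
second = model.fit(X_b, y_demo)  # refits the same object in place

same_object = first is second
print("same object:", same_object)
print("first.coef_: ", first.coef_)   # coefficients from the most recent fit (on X_b)
print("second.coef_:", second.coef_)
```

Creating a separate `LinearRegression()` per fit, as in the cell above, avoids this.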


In [175]:
df

    col_1  col_2  col_3  col_4
0      78     51     64    364
1      90     61     36    317
2      18     28     30    159
3      94     60     25    294
4      82     99     78    509
..    ...    ...    ...    ...
95     70     57     81    431
96     10     83     25    245
97     98      9     37    225
98     59     73     99    501
99     24     96     57    391

[100 rows x 4 columns]

In [176]:
X_train_scaled

array([[ 1.24070198, -0.66375986, -1.58582921],
       [-0.22144766, -1.42708371, -1.62034127],
       [-1.37756599,  0.79651184,  0.31233415],
       [-1.03753119,  0.26550395,  0.83001507],
       [ 0.79865674, -0.73013585,  0.13977385],
       [ 0.45862194,  0.4978199 , -1.10266035],
       [-0.90151727, -0.9624518 ,  0.58843064],
       [-1.44557295,  1.09520377, -0.75753974],
       [ 0.22059758, -1.32751973, -0.27437089],
       [ 1.41071938, -1.42708371, -0.82656387],
       [ 1.10468806, -0.26550395, -1.58582921],
       [-0.79950683, -0.63057187,  0.72647889],
       [-0.35746158,  0.66375986,  0.96806332],
       [ 1.30870894,  0.36506792,  1.79635278],
       [ 0.45862194, -1.09520377,  0.76099095],
       [ 0.0165767 , -1.42708371,  1.31318393],
       [ 0.76465326,  0.39825592, -0.55046738],
       [-0.52747899,  1.55983568,  0.34684621],
       [ 0.1865941 ,  0.79651184,  0.83001507],
       [-1.00352771, -1.29433173,  0.86452713],
       [-1.17354511,  0.39825592,  1.244

In [177]:
print("Scaled: ",model_scaled.coef_, model_scaled.intercept_)
print("Unscaled: ",model_unscaled.coef_, model_unscaled.intercept_)

Scaled:  [29.80786061 60.00086265 88.17268413] 293.1625
Unscaled:  [1.013571   1.99130822 3.04302107] -2.4978970232043594
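The two sets of coefficients describe the same fitted model in different units. As a sketch (on seeded synthetic data, with names chosen here for illustration), the scaled coefficients can be mapped back to the original units with the fitted scaler's `scale_` and `mean_` attributes: $b_j = \beta_j/\sigma_j$ and $b_0 = \beta_0 - \sum_j b_j\mu_j$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(80, 3)).astype(float)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.integers(-10, 10, size=80)

scaler = StandardScaler().fit(X_demo)
m_scaled = LinearRegression().fit(scaler.transform(X_demo), y_demo)
m_raw = LinearRegression().fit(X_demo, y_demo)

# Undo the scaling: divide each coefficient by its column's standard deviation,
# then shift the intercept by the contribution of the column means.
coef_back = m_scaled.coef_ / scaler.scale_
intercept_back = m_scaled.intercept_ - np.sum(coef_back * scaler.mean_)

print("recovered coefficients:", np.round(coef_back, 4))
print("unscaled coefficients: ", np.round(m_raw.coef_, 4))
print("recovered intercept:", round(intercept_back, 4), "unscaled intercept:", round(m_raw.intercept_, 4))
```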


## Testing ridge regression

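What I want to know: whether ridge regression only gives a model that is useful for predictions, or whether it can also capture the true underlying relationship.

We reuse the dataframe generated in the previous section, with relationship
$$Y=X_1+2X_2+3X_3.$$
In this dataframe $X_1$ and $X_2$ were generated independently; the $X_2=2X_1$ correlation applies only to the dataframe from the first section, so the results below are for non-collinear predictors.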

In [178]:
# reuse the data
print(df)


    col_1  col_2  col_3  col_4
0      78     51     64    364
1      90     61     36    317
2      18     28     30    159
3      94     60     25    294
4      82     99     78    509
..    ...    ...    ...    ...
95     70     57     81    431
96     10     83     25    245
97     98      9     37    225
98     59     73     99    501
99     24     96     57    391

[100 rows x 4 columns]


In [179]:
X=df.iloc[:,0:3]
y=df.iloc[:,3]

X_train, X_test, y_train, y_test=train_test_split(X,y,train_size=0.8, random_state=42)

scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

ols=LinearRegression().fit(X_train_scaled,y_train)
ridge_cv=RidgeCV(alphas=[0.0001,0.001,0.01,0.1,1.0,5.0,10.0],cv=10) #alpha = regularisation parameter
ridge_cv.fit(X_train_scaled, y_train)

# make predictions
y_pred_ols=ols.predict(X_test_scaled)
y_pred_ridge=ridge_cv.predict(X_test_scaled)
print("Model score (R^2) for OLS: ", r2_score(y_test,y_pred_ols))
print("Model score (R^2) for ridge: ", r2_score(y_test,y_pred_ridge))



Model score (R^2) for OLS:  0.9964168459578655
Model score (R^2) for ridge:  0.9964168867142018


In [180]:
from sklearn.model_selection import cross_val_score, KFold

# define cross-validation strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# compute cross-validated R^2 scores
cv_scores = cross_val_score(ols, X_train_scaled, y_train, cv=cv, scoring='r2')

print("Cross-validated R^2 scores (OLS):", cv_scores)
print("Mean CV R^2 (OLS):", np.mean(cv_scores))

Cross-validated R^2 scores (OLS): [0.99709457 0.99751358 0.99845694 0.99296522 0.99764655]
Mean CV R^2 (OLS): 0.9967353730549586


In [181]:
ols.coef_

array([29.80786061, 60.00086265, 88.17268413])

In [182]:
ridge_cv.coef_

array([29.80781294, 60.00079484, 88.17257945])

In [183]:
ridge_cv.alpha_

0.0001
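Cross-validation picked the smallest value on the grid, and the ridge coefficients agree with OLS to about four decimal places, so on this data ridge behaves essentially like OLS. To see what the penalty does when the predictors are correlated, the sketch below regenerates data with the first section's collinear construction (seeded; the alpha values and variable names are assumptions for illustration) and compares `Ridge` with the closed form $(X^\top X + \alpha I)^{-1} X^\top y$ on standardised predictors and centred response.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
x1 = rng.integers(0, 100, size=n)
x2 = 2 * x1 + rng.integers(-50, 50, size=n)  # collinear construction from the first section
x3 = rng.integers(0, 100, size=n)
X_demo = np.column_stack([x1, x2, x3]).astype(float)
y_demo = (x1 + 2 * x2 + 3 * x3 + rng.integers(-100, 100, size=n)).astype(float)

Xs = StandardScaler().fit_transform(X_demo)  # centred, unit-variance columns
yc = y_demo - y_demo.mean()                  # centring y plays the role of the intercept

results = {}
for alpha in [0.0001, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(Xs, y_demo)
    # Closed-form ridge solution on the centred data.
    closed = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(3), Xs.T @ yc)
    results[alpha] = (ridge.coef_, closed)
    print(alpha, np.round(ridge.coef_, 3))
```

As `alpha` grows, the coefficient vector shrinks towards zero, so large penalties trade fidelity to the generating coefficients for stability.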

## Testing LASSO 

(Same question as for ridge.)
What I want to know: whether LASSO only gives a model that is useful for predictions, or whether it can also capture the true underlying relationship.

We reuse the same generated dataframe from previous sections, with relationship
$$Y=X_1+2X_2+3X_3$$
where $X_1$ and $X_2$ are correlated with $X_2=2X_1$.
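One possible way to run this test is sketched below. It is not the notebook's own code: it assumes the collinear construction from the first section, regenerated with a fixed seed, and uses `LassoCV` with its default alpha grid. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
x1 = rng.integers(0, 100, size=n)
x2 = 2 * x1 + rng.integers(-50, 50, size=n)  # X_2 = 2*X_1 + noise
x3 = rng.integers(0, 100, size=n)
X_demo = np.column_stack([x1, x2, x3]).astype(float)
y_demo = (x1 + 2 * x2 + 3 * x3 + rng.integers(-100, 100, size=n)).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_tr)
lasso_cv = LassoCV(cv=10, random_state=0).fit(scaler.transform(X_tr), y_tr)

# Coefficients in original units, for comparison with the true values (1, 2, 3).
coef_original_units = lasso_cv.coef_ / scaler.scale_
test_r2 = lasso_cv.score(scaler.transform(X_te), y_te)

print("selected alpha:", lasso_cv.alpha_)
print("scaled coefficients:", lasso_cv.coef_)
print("coefficients in original units:", coef_original_units)
print("test R^2:", test_r2)
```

Comparing `coef_original_units` with $(1, 2, 3)$ addresses the "true relationship" question, while `test_r2` addresses predictive usefulness.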