# Testing `sklearn` and `statsmodels` functions using artificial data

1. [Testing how `sklearn` performs with high collinearity](#testing-how-sklearn-performs-with-high-collinearity)
1. [Testing `StandardScaler`](#testing-standardscaler)
1. [Testing ridge regression](#testing-ridge-regression)


In [25]:
# import modules

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import sklearn.linear_model as skl_lm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# ridge
from sklearn.linear_model import RidgeCV

---

## Testing how `sklearn` performs with high collinearity

We generate a dataframe with 3 predictors:
$$Y=X_1 +2X_2 + 3X_3,$$

where $X_1$ and $X_2$ are perfectly collinear, with $X_2=2X_1$. The generated response also includes a small integer noise term.

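Results:
- `LinearRegression` returns the same coefficients as the pseudoinverse (minimum-norm) solution
- it doesn't give a warning about the collinearity
- in this example the unscaled fit reproduces the true coefficients, while the scaled fit shows the expected confounding between the collinear predictors
- the "equal weightage distribution" is due to the scaler, which makes the two collinear columns identical, not to the OLS function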
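Because the fit is silent about rank deficiency, it can help to check the rank of the predictor matrix yourself before fitting. The sketch below is one possible check, not something `sklearn` does for you; the seed and the names `rng`, `x1`, `x3` and `X_demo` are assumptions made for illustration.

```python
import numpy as np

# Illustrative data (assumed seed): the second column is exactly 2 * the first.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 100, size=100)
x3 = rng.integers(0, 100, size=100)
X_demo = np.column_stack([x1, 2 * x1, x3]).astype(float)

# A rank below the number of columns signals perfect collinearity.
rank = np.linalg.matrix_rank(X_demo)
print(f"rank = {rank}, columns = {X_demo.shape[1]}")

# The correlation matrix tells the same story for the first two columns.
corr = np.corrcoef(X_demo, rowvar=False)
print(np.round(corr, 3))
```

A rank smaller than the number of columns means at least one predictor is an exact linear combination of the others.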

In [23]:
# set parameters as you like to play around 
beta1=1
beta2=2
beta3=3
correlation=2 # X_2 = 2*X_1

# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))
random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

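# Impose the collinearity and build the response column-wise (no Python loop needed)
df.iloc[:, 1] = correlation * df.iloc[:, 0]
noise = np.random.randint(-10, 10, size=num_rows)  # integers from -10 to 9
df.iloc[:, num_cols-1] = beta1*df.iloc[:, 0] + beta2*df.iloc[:, 1] + beta3*df.iloc[:, 2] + noise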

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0      47     94     20    304
1       9     18     94    328
2      17     34     26    154
3      57    114     34    380
4      59    118     68    489
..    ...    ...    ...    ...
95     50    100     48    385
96     54    108     52    431
97     71    142     89    624
98     57    114     88    549
99     78    156     62    577

[100 rows x 4 columns]


In [8]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

# build model
model=LinearRegression()
model_scaled=model.fit(X_train_scaled,y_train)

print(model_scaled.coef_, model_scaled.intercept_)


[75.36637295 75.36637295 96.09858325] 364.05


Real relationship:

$$Y=X_1+2X_2+3X_3$$

In [9]:
# equally distributed weightage?

b1=(model_scaled.coef_[1]+model_scaled.coef_[0])

print(b1)

150.73274590345403


In [10]:
# pseudoinverse for scaled data

# append a column of ones for the constant term
pinv_arr=np.column_stack([X_train_scaled, np.ones(X_train_scaled.shape[0])])

pinv=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv,y_train))

[ 75.36637295  75.36637295  96.09858325 364.05      ]


### What if we remove the second (collinear) column?

Now, there is no collinearity in our model, and we see that the coefficient for the first predictor is the sum of the coefficients of the two correlated variables in the previous model.
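The reason can be worked out in one line. Writing $Z_1$ and $Z_2$ for the standardised columns, $\mu_1$ and $\sigma_1$ for the mean and standard deviation of $X_1$, and $b_1$, $b_2$ for the fitted coefficients on the scaled data (notation introduced here for illustration):

$$Z_2=\frac{2X_1-2\mu_1}{2\sigma_1}=Z_1 \quad\Longrightarrow\quad b_1Z_1+b_2Z_2=(b_1+b_2)Z_1 .$$

Only the sum $b_1+b_2$ is determined by the data. The minimum-norm solution splits it equally between the two columns, and a model without `col_2` assigns the whole sum to `col_1`. In terms of the true coefficients the sum is $(\beta_1+2\beta_2)\sigma_1=5\sigma_1$; taking $\sigma_1\approx 30.04$ (the square root of the training variance 902.54 printed in the manual standardisation cell) gives roughly 150.2, close to the fitted sum.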

In [11]:
X_new=X.drop(columns=['col_2'])

X_new_train, X_new_test, y_train, y_test = train_test_split(X_new,y,train_size=0.8,random_state=42)

# standardise
X_new_train_scaled=scaler.fit_transform(X_new_train)
X_new_test_scaled=scaler.transform(X_new_test)

# build model
model=LinearRegression()
model_new=model.fit(X_new_train_scaled,y_train)

print(model_new.coef_, model_new.intercept_)

[150.7327459   96.09858325] 364.05


## Testing StandardScaler

### 1. What if we don't scale the regressors?

In [12]:
# model fit to unscaled data

model_unscaled=LinearRegression()
model_unscaled.fit(X_train,y_train)
print(model_unscaled.coef_, model_unscaled.intercept_)

# real relationship: Y=X_1+2X_2+3X_3

[1.00347213 2.00694427 2.99946173] -1.6314176005353715


In [13]:
# using pseudoinverse for unscaled data matrix

# add one column of ones for constant term
pinv_arr=np.column_stack([X_train.to_numpy(), np.ones(len(X_train))])

pinv2=np.linalg.pinv(pinv_arr)
print(np.matmul(pinv2,y_train))

[ 1.00347213  2.00694427  2.99946173 -1.6314176 ]
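One caveat may be worth checking before reading too much into the unscaled fit. With $X_2=2X_1$, the minimum-norm solution always splits the combined effect of the two columns in the ratio $1:2$, and the true coefficients $\beta_1=1$, $\beta_2=2$ happen to have that same ratio. The sketch below repeats the least-squares fit with swapped coefficients. The choice $\beta_1=2$, $\beta_2=1$, the seed, and the noise-free response are all assumptions made for illustration.

```python
import numpy as np

# Assumed illustrative setup: noise-free data with swapped true coefficients.
rng = np.random.default_rng(0)
n = 100
x1 = rng.integers(0, 100, size=n).astype(float)
x2 = 2 * x1                      # perfectly collinear with x1
x3 = rng.integers(0, 100, size=n).astype(float)
y = 2 * x1 + 1 * x2 + 3 * x3     # beta1=2, beta2=1, beta3=3

# Design matrix with a trailing column of ones for the intercept.
A = np.column_stack([x1, x2, x3, np.ones(n)])

# lstsq returns the minimum-norm least-squares solution for rank-deficient A;
# an explicit rcond makes the rank cut-off unambiguous.
coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=1e-10)
print("rank:", rank)
print("coefficients:", np.round(coef, 4))
```

Here the combined effect on `col_1` is $2+2\cdot 1=4$, and the minimum-norm split in the ratio $1:2$ is $(0.8, 1.6)$ rather than $(2, 1)$. This suggests the unscaled fit above reproduces the true coefficients because they are proportional to $(1, 2)$, not because the individual effects are identifiable.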


In [14]:
print(X_train)

    col_1  col_2  col_3
55     36     72     76
88     10     20     90
26     54    108     12
42      7     14     19
69     63    126     50
..    ...    ...    ...
60     89    178     73
71     98    196     82
14     16     32     48
92     42     84     95
51     79    158     49

[80 rows x 3 columns]

In [15]:
X_train_scaled

array([[-2.31757031e-01, -2.31757031e-01,  8.09960244e-01],
       [-1.09720519e+00, -1.09720519e+00,  1.24693302e+00],
       [ 3.67399387e-01,  3.67399387e-01, -1.18762957e+00],
       [-1.19706459e+00, -1.19706459e+00, -9.69143183e-01],
       [ 6.66977596e-01,  6.66977596e-01, -1.56061704e-03],
       [-1.19706459e+00, -1.19706459e+00, -1.56217766e+00],
       [-6.64481111e-01, -6.64481111e-01,  2.16925769e-01],
       [-4.64762305e-01, -4.64762305e-01,  1.34057004e+00],
       [-6.31194643e-01, -6.31194643e-01, -1.03156786e+00],
       [-4.98048772e-01, -4.98048772e-01,  9.34809608e-01],
       [-1.06391872e+00, -1.06391872e+00,  6.85110881e-01],
       [-2.98329966e-01, -2.98329966e-01, -1.28126659e+00],
       [ 1.29942048e+00,  1.29942048e+00, -5.32170411e-01],
       [ 1.09970168e+00,  1.09970168e+00, -7.19444456e-01],
       [-1.39678340e+00, -1.39678340e+00,  1.52784408e+00],
       [ 1.36599342e+00,  1.36599342e+00,  6.53898540e-01],
       [ 1.49913929e+00,  1.49913929e+00

Let's compare $\mathbf{X}$ before and after scaling by standardising the first column manually.

In [16]:
# trying to standardise a column manually

# work on a float copy so that X_train itself is not modified
# and the column can hold the standardised (non-integer) values
X_new=X_train.astype(float)
mean1=np.mean(X_train.iloc[:,0])
var1=np.var(X_train.iloc[:,0])
sd1=var1**0.5
X_new.iloc[:,0]=(X_train.iloc[:,0] - mean1 )/sd1
print(mean1,var1)
X_new.iloc[:,0]

42.9625 902.5360937500003


55   -0.231757
88   -1.097205
26    0.367399
42   -1.197065
69    0.666978
        ...   
60    1.532426
71    1.832004
14   -0.897486
92   -0.032038
51    1.199561
Name: col_1, Length: 80, dtype: float64
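The manual result can also be checked against `StandardScaler` directly. The sketch below uses a small made-up column (an assumption for illustration) and compares the manual computation with the scaler's output and its fitted `mean_` and `var_` attributes. It also shows that the match relies on the population variance (`ddof=0`), whereas pandas' `Series.var()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up column for illustration.
col = pd.Series([36, 10, 54, 7, 63], name="col_1", dtype=float)

# Manual standardisation with the population standard deviation (ddof=0).
mean1 = col.mean()
sd1 = col.std(ddof=0)
manual = (col - mean1) / sd1

# StandardScaler expects a 2-D input, hence to_frame().
scaler = StandardScaler()
scaled = scaler.fit_transform(col.to_frame())[:, 0]

print(np.allclose(manual, scaled))       # manual and scaler results agree
print(scaler.mean_[0], scaler.var_[0])   # fitted mean and population variance
print(col.var(ddof=0), col.var())        # ddof=0 versus the pandas default ddof=1
```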

### 2. Testing the scaler with non-collinear data

Results: everything is normal and consistent. The scaled and unscaled models differ only by the expected rescaling of the coefficients and the intercept.
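The relationship between the two fits can be checked directly. For full-rank data, each coefficient of the scaled model should equal the corresponding unscaled coefficient multiplied by that column's standard deviation, and the scaled intercept should equal the mean of the training response. A minimal check, using freshly generated data with an assumed seed rather than the notebook's own dataframe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative non-collinear data (assumed seed and noise range).
rng = np.random.default_rng(1)
X = rng.integers(0, 100, size=(80, 3)).astype(float)
y = X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + rng.integers(-10, 10, size=80)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Separate estimator objects, one per fit.
fit_unscaled = LinearRegression().fit(X, y)
fit_scaled = LinearRegression().fit(X_scaled, y)

# Scaled coefficient = unscaled coefficient * column standard deviation.
print(fit_scaled.coef_)
print(fit_unscaled.coef_ * scaler.scale_)

# Scaled intercept = mean of the response used for fitting.
print(fit_scaled.intercept_, y.mean())
```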

In [17]:
# Define the number of rows and columns
num_rows = 100
num_cols = 4

# Define column names (optional, but good practice)
column_names = [f'col_{i+1}' for i in range(num_cols)]

# Generate random data using NumPy
# For random integers: np.random.randint(low, high, size=(rows, cols))
# For random floats: np.random.rand(rows, cols) or np.random.uniform(low, high, size=(rows, cols))

random_data = np.random.randint(0, 100, size=(num_rows, num_cols)) # Example: random integers between 0 and 99

# Create the DataFrame
df = pd.DataFrame(random_data, columns=column_names)

# Build the response column-wise (no Python loop needed)
noise = np.random.randint(-10, 10, size=num_rows)  # integers from -10 to 9
df.iloc[:, num_cols-1] = df.iloc[:, 0] + 2*df.iloc[:, 1] + 3*df.iloc[:, 2] + noise

# Print the generated DataFrame
print(df)

    col_1  col_2  col_3  col_4
0      57      9     89    342
1      56     43     17    192
2      30     19     65    262
3      18     99     13    261
4      71     84     41    364
..    ...    ...    ...    ...
95     63     22     28    184
96     72      6     22    141
97      7     19     35    155
98     37      1     30    128
99     88     84     25    335

[100 rows x 4 columns]


## Testing ridge using artificial data

In [18]:
X=df.drop(columns=['col_4'])
y=df['col_4']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)

# standardise
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

# correct: one estimator object per fit
model1=LinearRegression()
model2=LinearRegression()
model_scaled=model1.fit(X_train_scaled,y_train)
model_unscaled=model2.fit(X_train,y_train)

# wrong: reusing a single estimator means the second fit overwrites the first
# model=LinearRegression()
# model_scaled=model.fit(X_train_scaled,y_train)
# model_unscaled=model.fit(X_train,y_train)
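The pitfall comes from `fit` returning the estimator itself, so both names end up pointing at one object. A small sketch with made-up data (an assumption for illustration) makes this visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: y = 1 + 2 * x on X_a.
X_a = np.array([[0.0], [1.0], [2.0]])
X_b = 10 * X_a
y = np.array([1.0, 3.0, 5.0])

model = LinearRegression()
first = model.fit(X_a, y)    # slope 2 on X_a
second = model.fit(X_b, y)   # refits the same object; slope 0.2 on X_b

print(first is second)             # both names refer to one estimator
print(first.coef_, second.coef_)   # both show the coefficients of the last fit
```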


In [19]:
df

    col_1  col_2  col_3  col_4
0      57      9     89    342
1      56     43     17    192
2      30     19     65    262
3      18     99     13    261
4      71     84     41    364
..    ...    ...    ...    ...
95     63     22     28    184
96     72      6     22    141
97      7     19     35    155
98     37      1     30    128
99     88     84     25    335

[100 rows x 4 columns]


In [20]:
X_train_scaled

array([[ 1.36657538,  1.5713347 , -1.46901314],
       [ 0.42872953,  1.28969144, -0.11615039],
       [ 0.22776256, -0.77569249, -1.29100488],
       [ 1.13211392, -0.55663661,  1.23671236],
       [ 0.6296965 ,  1.10192927,  0.73828924],
       [ 1.19910291,  1.13322296, -0.65017516],
       [ 1.40006988, -0.83827988,  1.84194043],
       [ 0.76367448, -1.33897901, -0.82818341],
       [ 1.03163043,  1.41486622,  0.20426447],
       [-0.24116036, -0.08723118,  1.34351732],
       [-0.24116036, -0.90086727,  0.20426447],
       [-0.6430943 , -0.18111226, -0.04494708],
       [-1.58094015,  1.32098514, -1.32660653],
       [-0.47562182,  0.88287339, -0.43656525],
       [ 0.73017998, -0.33758074, -1.54021644],
       [ 0.46222403,  0.56993644, -0.75698011],
       [-0.04019339,  0.28829317,  0.48907768],
       [-1.51395116,  0.97675448,  1.55712722],
       [-0.71008329, -0.14981857,  0.95189915],
       [-1.07852273, -0.6818114 ,  0.31106943],
       [ 1.63453134,  1.35227883,  1.058

In [21]:
print("Scaled: ",model_scaled.coef_, model_scaled.intercept_)
print("Unscaled: ",model_unscaled.coef_, model_unscaled.intercept_)

Scaled:  [30.51953743 65.01235948 83.42499829] 281.675
Unscaled:  [1.02223648 2.034477   2.9700677 ] -2.3087709684191395


## Testing ridge regression

What I want to know: does ridge regression just give a model that is useful for predictions, or is it actually able to capture the true underlying relationship?

We reuse the collinear dataframe generated in the first section, with relationship
$$Y=X_1+2X_2+3X_3$$
where $X_1$ and $X_2$ are perfectly collinear with $X_2=2X_1$.

In [24]:
# reuse the data
print(df)

    col_1  col_2  col_3  col_4
0      47     94     20    304
1       9     18     94    328
2      17     34     26    154
3      57    114     34    380
4      59    118     68    489
..    ...    ...    ...    ...
95     50    100     48    385
96     54    108     52    431
97     71    142     89    624
98     57    114     88    549
99     78    156     62    577

[100 rows x 4 columns]
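One way to start on this question is sketched below. It regenerates collinear data with the same structure, fits `RidgeCV` on the standardised predictors, and prints the coefficients next to the OLS ones for comparison. The seed, the `alphas` grid, and the variable names are assumptions, not the settings used for the dataframe above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Regenerate collinear data: X_2 = 2 * X_1, Y = X_1 + 2 X_2 + 3 X_3 + noise.
rng = np.random.default_rng(42)
x1 = rng.integers(0, 100, size=100)
x3 = rng.integers(0, 100, size=100)
X = np.column_stack([x1, 2 * x1, x3]).astype(float)
y = X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + rng.integers(-10, 10, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Assumed penalty grid; RidgeCV picks one value by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train_scaled, y_train)
ols = LinearRegression().fit(X_train_scaled, y_train)

print("chosen alpha:", ridge.alpha_)
print("ridge:", ridge.coef_, ridge.intercept_)
print("OLS:  ", ols.coef_, ols.intercept_)
print("ridge test R^2:", ridge.score(X_test_scaled, y_test))
```

Because the two standardised collinear columns are identical, ridge can only identify their sum and, for any positive penalty, gives them equal coefficients. On that basis one would expect ridge to be useful for prediction here without separating the individual effects of $X_1$ and $X_2$, but that is something to confirm by running the sketch and varying the coefficients.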
