In [2]:
import pandas as pd
import numpy as np
import os

Read the simulated data.

In [6]:
data = pd.read_csv(os.path.join(os.getcwd(), "test_sample.csv"))

In [7]:
data.head()

,Y,X0,X1,X2,X3,X4,X5,X6,X7,X8,...,X480,X481,X482,X483,X484,X485,X486,X487,X488,X489
0,39.856298,2.891926,0.20461,0.653506,2.272925,1.648488,1.157913,-0.758341,-2.037747,0.752338,...,-0.218255,-1.206675,-0.279249,0.682557,-3.541084,-0.462835,-1.937582,-0.189706,1.226054,-2.79613
1,35.347972,-0.770712,1.844547,2.098988,-1.050588,1.791796,0.596607,0.116533,-0.998199,0.550185,...,0.107843,3.294647,-2.129394,-0.110989,-1.070187,-0.53082,0.394228,1.83319,0.300815,0.522388
2,-16.961448,1.418651,0.409358,-0.725631,0.112889,0.401866,-2.988968,1.683945,0.51911,1.584711,...,-0.757589,3.566662,-1.13676,-0.928827,0.771188,-1.937726,-0.891583,-3.015275,-0.28463,-2.003886
3,-9.462801,-4.236335,0.62558,1.319739,-0.591914,3.546638,-3.542041,-3.23498,0.015065,1.730905,...,2.009435,0.693618,-1.140715,1.600189,-4.114455,-3.776094,3.876101,-2.706423,0.370431,-2.293249
4,-56.214348,-1.025841,-2.1521,1.929969,-4.135384,-0.142428,-2.158415,-2.248478,1.486194,-1.97014,...,-2.461554,0.802512,-2.323218,1.536111,1.818574,-0.2845,3.147942,-2.518383,3.232242,0.421749


Variable Y was simulated using all predictors $X_0, \ldots, X_{489}$ with randomly selected coefficients from $[-0.1,1.9]$.

All predictors were simulated independently. Because the slopes were sampled from an interval that contains zero, some of them are close to zero, so some predictors are significant and others are not.
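To make the setup concrete, here is a minimal sketch of how a data set of this kind could be generated. The notebook does not show the actual simulation, so the Gaussian predictors, the uniform sampling of the slopes, the noise scale, the seed and the names `X_sim`, `Y_sim`, `beta` are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed (assumption)

n_samples, n_features = 500, 490  # sizes reported later in the OLS summary

# Assumption: predictors are independent zero-mean Gaussians.
X_sim = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=[f"X{i}" for i in range(n_features)],
)

# Slopes taken from [-0.1, 1.9]; uniform sampling is an assumption.
beta = rng.uniform(-0.1, 1.9, size=n_features)

# Assumption: small Gaussian noise and no intercept.
Y_sim = X_sim.to_numpy() @ beta + rng.normal(scale=0.1, size=n_samples)

# Slopes that land close to zero are the ones a selection method may drop.
near_zero = np.flatnonzero(np.abs(beta) < 0.05)
print(len(near_zero))
```

Since the interval contains zero, a small share of the slopes ends up close to zero, and those are the predictors we would expect to be eliminated.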

The goal is to eliminate insignificant predictors of the model $Y = \beta_0 + \beta_1 X_0 + \ldots + \beta_{490} X_{489} + \epsilon$ using:
1) Lasso regression from the sklearn module.
2) Ordinary regression from statsmodels.api.OLS.

Let us extract the feature data and the dependent variable.

In [44]:
X = data.iloc[:,1:]
Y = data.iloc[:,0]

Let us first import LassoCV and Lasso from the sklearn module.

In [45]:
from sklearn.linear_model import LassoCV, Lasso

Define a LassoCV object with random_state=1 and cv=5.

In [47]:
lasso_cv = LassoCV(cv=5, random_state=1, verbose = True)

Find the optimal $\alpha_{min}$ parameter for Lasso regression by cross-validation.
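For reference, the objective that scikit-learn's Lasso minimizes can be written in the notation of the model above as follows. Here $n$ is the number of observations, $X_{ij}$ is the value of predictor $X_j$ for observation $i$, and the intercept $\beta_0$ is not penalized. The indexing is our own convention, chosen to match the model equation.

$$\min_{\beta_0,\ldots,\beta_{490}} \; \frac{1}{2n}\sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{j=0}^{489}\beta_{j+1}X_{ij}\Big)^2 + \alpha\sum_{j=1}^{490}|\beta_j|$$

The larger $\alpha$ is, the more slopes are shrunk to exactly zero. LassoCV tries a grid of $\alpha$ values and keeps the one with the smallest cross-validated error.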

In [48]:
lasso_cv.fit(X,Y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.1s finished


In [49]:
alpha_min = lasso_cv.alpha_
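To see where the selected value comes from, the following sketch inspects the cross-validation results on a small synthetic problem. The data and the names `X_demo`, `y_demo`, `cv_model` are hypothetical. The attributes `alphas_`, `mse_path_` and `alpha_` belong to a fitted LassoCV.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Hypothetical toy data: 20 predictors, only the first 5 matter.
X_demo = rng.normal(size=(100, 20))
beta_demo = np.concatenate([np.ones(5), np.zeros(15)])
y_demo = X_demo @ beta_demo + rng.normal(scale=0.1, size=100)

cv_model = LassoCV(cv=5, random_state=1).fit(X_demo, y_demo)

# mse_path_ has one row per alpha on the grid and one column per fold.
mean_mse = cv_model.mse_path_.mean(axis=1)

# The grid value with the smallest mean cross-validated error.
best_alpha = cv_model.alphas_[np.argmin(mean_mse)]
print(best_alpha, cv_model.alpha_)
```

The two printed values should coincide, which is a quick sanity check that `alpha_` is the minimizer of the mean cross-validated error over the grid.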

Now let us fit Lasso from sklearn with the obtained $\alpha_{min}$ to our data and look for the slopes it sets to exactly zero.

In [65]:
lasso = Lasso(alpha=alpha_min, random_state=1)
lasso.fit(X,Y)
# Indices of the predictors whose Lasso coefficient is exactly zero
eliminated_by_Lasso = np.arange(0,490)[lasso.coef_==0]
eliminated_by_Lasso

array([ 18,  19,  23,  25,  32,  42,  46,  48,  49,  53,  59,  67,  74,
        77,  78,  82,  86,  87,  94,  96, 111, 123, 125, 130, 134, 135,
       142, 143, 154, 155, 157, 166, 188, 194, 215, 220, 232, 236, 253,
       268, 270, 272, 277, 281, 285, 288, 296, 301, 312, 315, 323, 328,
       332, 333, 336, 345, 351, 354, 364, 375, 377, 380, 384, 390, 394,
       398, 400, 420, 429, 430, 431, 441, 445, 448, 449, 450, 459, 461,
       465, 473, 476, 477, 478, 488])
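The indices above refer to column positions in X. As a small illustration, the sketch below shows how such positions can be turned back into column names. It uses hypothetical toy data and an arbitrary penalty `alpha=0.1` rather than the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Hypothetical toy data: only X0 and X1 influence the response.
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"X{i}" for i in range(6)],
)
y_demo = 2.0 * X_demo["X0"] - 1.5 * X_demo["X1"] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Positions of the coefficients that Lasso set to exactly zero ...
dropped_idx = np.flatnonzero(model.coef_ == 0)
# ... and the matching column names.
dropped_names = X_demo.columns[dropped_idx].tolist()
print(dropped_names)
```

`np.flatnonzero(mask)` returns the same positions as `np.arange(n)[mask]` without hard-coding the number of predictors.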

Now let us apply ordinary linear regression to X and check the p-values of the estimated coefficients.

In [72]:
import statsmodels.api as sm

In [73]:
lr = sm.OLS(Y, sm.add_constant(X)).fit()
lr.summary()

Dep. Variable:,Y,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,182600.0
Date:,"Thu, 02 Mar 2023",Prob (F-statistic):,3.61e-23
Time:,23:48:47,Log-Likelihood:,1406.4
No. Observations:,500,AIC:,-1831.0
Df Residuals:,9,BIC:,238.5
Df Model:,490,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.4867,0.040,12.284,0.000,0.397,0.576
X0,1.6091,0.015,105.596,0.000,1.575,1.644
X1,1.1403,0.014,80.671,0.000,1.108,1.172
X2,0.1648,0.017,9.730,0.000,0.126,0.203
X3,1.6036,0.014,110.869,0.000,1.571,1.636
X4,0.0422,0.014,2.943,0.016,0.010,0.075
X5,1.5196,0.017,91.032,0.000,1.482,1.557
X6,1.4547,0.015,97.298,0.000,1.421,1.489
X7,1.7836,0.020,89.678,0.000,1.739,1.829

Omnibus:,1.968,Durbin-Watson:,2.051
Prob(Omnibus):,0.374,Jarque-Bera (JB):,1.792
Skew:,0.14,Prob(JB):,0.408
Kurtosis:,3.088,Cond. No.,151.0


In [75]:
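# Skip the intercept (first p-value) and treat predictors with p-value > 0.1 as insignificant
eliminated_by_lm = np.arange(0, 490)[lr.pvalues.iloc[1:].to_numpy() > 0.1]
eliminated_by_lm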

array([ 48,  49,  96, 155, 189, 232, 272, 331, 333, 347, 348, 350, 360,
       454])
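Let us also compare the two selections. The sketch below copies the index arrays from the two outputs in this notebook and uses NumPy set operations to see where the methods agree. The names `both`, `only_lm` and `only_lasso` are introduced here for illustration.

```python
import numpy as np

# Indices copied from the Lasso output shown in this notebook.
eliminated_by_Lasso = np.array([
    18, 19, 23, 25, 32, 42, 46, 48, 49, 53, 59, 67, 74,
    77, 78, 82, 86, 87, 94, 96, 111, 123, 125, 130, 134, 135,
    142, 143, 154, 155, 157, 166, 188, 194, 215, 220, 232, 236, 253,
    268, 270, 272, 277, 281, 285, 288, 296, 301, 312, 315, 323, 328,
    332, 333, 336, 345, 351, 354, 364, 375, 377, 380, 384, 390, 394,
    398, 400, 420, 429, 430, 431, 441, 445, 448, 449, 450, 459, 461,
    465, 473, 476, 477, 478, 488,
])

# Indices copied from the OLS p-value output shown in this notebook.
eliminated_by_lm = np.array([
    48, 49, 96, 155, 189, 232, 272, 331, 333, 347, 348, 350, 360, 454,
])

both = np.intersect1d(eliminated_by_Lasso, eliminated_by_lm)
only_lm = np.setdiff1d(eliminated_by_lm, eliminated_by_Lasso)
only_lasso = np.setdiff1d(eliminated_by_Lasso, eliminated_by_lm)

print("both:", both)
print("only OLS:", only_lm)
print("only Lasso:", len(only_lasso), "predictors")
```

Lasso removes 84 predictors and the p-value rule removes 14, so the two criteria are clearly not equivalent on this data. The OLS fit has only 9 residual degrees of freedom, which is worth keeping in mind when reading its p-values.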

In [70]:
lasso_zeros = ' '.join([str(idx) for idx in eliminated_by_Lasso])
lm_zeros = ' '.join([str(idx) for idx in eliminated_by_lm])
pd.DataFrame([lasso_zeros,lm_zeros], index = ['eliminated_by_Lasso','eliminated_by_lm']).to_csv('answer.csv')