In [1]:
import sys
sys.path.append('..')  # make the parent directory importable
from _linear import LinearRegression
import _dgp as dg
from _metrics import mse_score, CrossValidation, accuracy_score, f1_score
from _logistic import LogisticRegression
import numpy as np
import pandas as pd


## **Linear Regression** <a id='linear-regression'></a>


### **Generate data and fit first model**

Generate data with uncorrelated features, where only 5 of the 50 variables are true predictors of the response.

In [2]:
np.random.seed(0)
x,y=dg.gen(type=1)
# check the array shapes to see what the data looks like
print(x.shape)
print(y.shape)

(500, 50)
(500,)


Verify that the Hessian matrix is invertible, i.e. that the determinant of X'X is non-zero

In [3]:
np.linalg.det(x.T@x)

2.7458538617650583e+133
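
A large determinant confirms that the matrix is non-singular here, but the determinant grows with the number of columns and the scale of the data, so it can be a fragile diagnostic. A minimal sketch of two scale-robust checks, the column rank and the condition number, is given below. It uses a stand-in Gaussian design matrix, which is an assumption, since the internals of `dg.gen` are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in design matrix with the same shape as x (assumption: i.i.d. Gaussian columns).
x_demo = rng.normal(size=(500, 50))
gram = x_demo.T @ x_demo

# Full column rank means the Gram matrix x'x is invertible.
rank = np.linalg.matrix_rank(x_demo)
# The condition number measures how close the Gram matrix is to singular:
# moderate values are fine, very large values signal near-collinearity.
cond = np.linalg.cond(gram)

print(rank == x_demo.shape[1], cond)
```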

Try all available solvers

In [3]:

regression=LinearRegression(solver="ols")
regression.fit(x,y)[0:5]

array([-2.48725770e-03,  2.50087703e+00,  1.30321026e+00,  5.03895156e-01,
       -4.94005962e-01])
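
For intuition, the closed-form estimate that an OLS solver typically computes can be sketched in plain NumPy. This is not the `_linear` implementation; the synthetic data, the coefficient values and the prepended intercept column are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
features = rng.normal(size=(n, p))
true_beta = np.array([2.5, 1.3, 0.5, -0.5, -3.4])  # illustrative coefficients
target = features @ true_beta + 0.1 * rng.normal(size=n)

# Prepend a column of ones so that the first coefficient is the intercept.
design = np.column_stack([np.ones(n), features])

# Solve the normal equations (X'X) b = X'y; solving is more stable than inverting.
beta_hat = np.linalg.solve(design.T @ design, design.T @ target)
print(beta_hat.round(3))
```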

In [5]:

regression=LinearRegression(solver="gd")
regression.fit(x,y)[0:5]

algorithm did  converge under 100 iterations (at 18 iterations)


array([-2.43116199e-03,  2.50082799e+00,  1.30318689e+00,  5.03870594e-01,
       -4.94019206e-01])
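
The gradient-descent estimates agree with OLS to about four decimal places. A minimal sketch of batch gradient descent on the mean squared error is shown below; the function name, the default step size and the stopping rule are hypothetical and may differ from what the `gd` solver actually does.

```python
import numpy as np

def gradient_descent_ols(design, target, learning_rate=0.1,
                         max_iteration=1000, tol=1e-8):
    """Minimise the mean squared error by batch gradient descent."""
    n = design.shape[0]
    beta = np.zeros(design.shape[1])
    for iteration in range(1, max_iteration + 1):
        residual = design @ beta - target
        gradient = 2.0 / n * design.T @ residual  # gradient of the MSE
        beta_new = beta - learning_rate * gradient
        # Stop when the largest coefficient update is negligible.
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, iteration
        beta = beta_new
    return beta, max_iteration

rng = np.random.default_rng(1)
design = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
target = design @ np.array([0.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)
beta_gd, n_iter = gradient_descent_ols(design, target)
print(beta_gd.round(3), n_iter)
```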

In [6]:

regression=LinearRegression(solver="sgd",learning_rate=0.0001,max_iteration=500,mini_batch_size=32)
regression.fit(x,y)[0:5]

algorithm did  converge under 8000 iterations (at 700 iterations)


array([ 0.00280405,  2.45847891,  1.22620089,  0.48947171, -0.47747947])

Use Newton-Raphson optimisation, which is particularly well suited to least squares because the objective is quadratic and its Hessian is available in closed form

In [3]:

regression=LinearRegression(solver="nr")
regression.fit(x,y)

algorithm did  converge under 100 iterations (at 2 iterations)


array([-2.48725770e-03,  2.50087703e+00,  1.30321026e+00,  5.03895156e-01,
       -4.94005962e-01, -3.40120953e+00, -6.62744670e-04,  2.89575105e-03,
       -6.57143046e-03,  1.24907738e-03,  1.08991512e-03, -2.83466625e-03,
        6.59026584e-03,  2.14909709e-03,  3.61908512e-03, -5.34439065e-03,
        6.98139110e-03, -2.13662901e-03,  3.24877969e-03,  6.96731447e-03,
        2.51870922e-03,  6.25699564e-03, -3.72606873e-03,  1.05575843e-04,
       -1.71106290e-04,  7.62291638e-03, -5.01660496e-03,  8.31522042e-04,
       -5.26329795e-03, -9.64264907e-04, -4.87166712e-03,  5.78810130e-04,
        2.16471429e-03, -2.10917583e-04, -5.98761628e-03,  5.27963824e-04,
       -4.17699985e-03, -9.56052086e-04, -2.60930868e-03,  2.21308335e-03,
        8.18463106e-06,  6.03644848e-03,  3.63678027e-03, -3.22257985e-03,
        2.40369851e-03, -7.03695114e-03, -1.50076773e-03, -5.06092594e-03,
        9.05010978e-04,  5.08752535e-03, -2.88922587e-03])

Because we used Newton-Raphson optimisation, the fit converged very quickly, in only 2 iterations.
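
The reason can be sketched in a few lines: for a quadratic objective, a single Newton-Raphson step from any starting point lands on the minimiser, so a further iteration presumably only confirms convergence. The example below uses synthetic data and is an illustration under that assumption, not the `nr` solver itself.

```python
import numpy as np

rng = np.random.default_rng(2)
design = np.column_stack([np.ones(300), rng.normal(size=(300, 4))])
target = design @ np.array([0.0, 2.5, 1.3, 0.5, -0.5]) + 0.1 * rng.normal(size=300)

beta = np.zeros(design.shape[1])                 # arbitrary starting point
gradient = design.T @ (design @ beta - target)   # gradient of 0.5 * ||y - Xb||^2
hessian = design.T @ design                      # constant Hessian of that objective

# One Newton-Raphson step: b <- b - H^{-1} g
beta = beta - np.linalg.solve(hessian, gradient)

# The gradient at the new point is numerically zero, so another step changes nothing.
gradient_after = design.T @ (design @ beta - target)
print(np.max(np.abs(gradient_after)))
```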

To interpret the linear regression results, call the `get_inference` method:

In [4]:
regression.get_inference().head()

| | params | std | t value | p value |
|---|---|---|---|---|
| 0 | -0.002487 | 0.004515 | -0.550937 | 0.582 |
| 1 | 2.500877 | 0.004338 | 576.474595 | 0.0 |
| 2 | 1.30321 | 0.004679 | 278.526942 | 0.0 |
| 3 | 0.503895 | 0.004693 | 107.376757 | 0.0 |
| 4 | -0.494006 | 0.004532 | -108.998452 | 0.0 |
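
The columns of such a table can be reproduced with the standard OLS formulas. The sketch below is an illustration on synthetic data, not the `get_inference` implementation; in particular, the p-values use a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 500
design = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
target = design @ np.array([0.0, 2.5, 1.3, 0.0]) + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(design.T @ design, design.T @ target)
residual = target - design @ beta_hat
dof = n - design.shape[1]                         # residual degrees of freedom
sigma2 = residual @ residual / dof                # unbiased noise variance estimate
cov = sigma2 * np.linalg.inv(design.T @ design)   # covariance of the estimates
std = np.sqrt(np.diag(cov))
t_value = beta_hat / std

# Two-sided p-values from the normal approximation
# (an exact version would use the Student t CDF with `dof` degrees of freedom).
p_value = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_value])
print(np.column_stack([beta_hat, std, t_value, p_value]).round(4))
```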


### **Evaluate model performance**

We can evaluate model performance by looking at the mean squared error (MSE), computed here on the training data

In [9]:
mse_score(y,regression.predict(x))

0.009078764794174696

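However, in-sample error is optimistic because the model is scored on the same data it was fitted to. Cross-validation gives a more reliable estimate, since each fold is scored on observations held out from fitting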

In [10]:
list_of_mse=CrossValidation(Class_algorithm=regression,x=x,y=y,metrics_function=mse_score,nb_k_fold=6)
print(list_of_mse)

algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
[0.012115490734680417, 0.010588167476004147, 0.013069279948627981, 0.009508892181278136, 0.015148046111754635, 0.008595595428807798, 0.01615547482276282]


We can average these results to get the average model performance

In [11]:
np.mean(list_of_mse)

0.01216870667198799
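
The idea behind k-fold cross-validation can be sketched as follows. The helper `k_fold_mse` is hypothetical and fits plain OLS with NumPy; the fold assignment used by `CrossValidation` may differ.

```python
import numpy as np

def k_fold_mse(design, target, n_folds=6, seed=0):
    """Return the held-out MSE of an OLS fit for each of n_folds folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(target))
    folds = np.array_split(indices, n_folds)
    scores = []
    for test_idx in folds:
        # Train on every observation that is not in the current test fold.
        train_idx = np.setdiff1d(indices, test_idx)
        beta, *_ = np.linalg.lstsq(design[train_idx], target[train_idx], rcond=None)
        prediction = design[test_idx] @ beta
        scores.append(float(np.mean((target[test_idx] - prediction) ** 2)))
    return scores

rng = np.random.default_rng(4)
design = np.column_stack([np.ones(500), rng.normal(size=(500, 10))])
target = design[:, 1] * 2.5 + 0.1 * rng.normal(size=500)
fold_scores = k_fold_mse(design, target)
print(np.mean(fold_scores))
```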

### **Automatic selection of variables**

**However, it is rarely good practice to fit a linear regression on all available variables,** because irrelevant variables add estimation noise without improving the fit. To obtain the best subset, several strategies can be adopted:

- **VIF (variance inflation factor) screening:** This identifies and drops collinear variables to avoid multicollinearity issues.
- **Lasso regression:** This technique can be used for feature selection by penalizing the absolute size of the regression coefficients.
- **Forward/Backward/Stepwise selection:** These are iterative methods for feature selection, where variables are added or removed based on their impact on model performance.

In this section we use the last approach (forward, stepwise and backward selection).


In [12]:
col_index=regression.autoselection("forward","BIC_ll",print_message=False)
col_index

array([4, 0, 1, 2, 3])
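
A minimal sketch of forward selection driven by BIC is shown below. The functions `bic_ols` and `forward_selection` are hypothetical, and the criterion is the Gaussian BIC up to an additive constant, which is assumed to be close in spirit to the `BIC_ll` option; the library's exact definition is not shown here.

```python
import numpy as np

def bic_ols(design, target):
    """BIC of a Gaussian linear model, up to an additive constant."""
    n = len(target)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    rss = float(np.sum((target - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

def forward_selection(features, target):
    """Greedily add the column that lowers BIC the most; stop when none does."""
    n, p = features.shape
    intercept = np.ones((n, 1))
    selected = []
    best_bic = bic_ols(intercept, target)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = [bic_ols(np.hstack([intercept, features[:, selected + [j]]]), target)
                  for j in candidates]
        best = int(np.argmin(scores))
        if scores[best] >= best_bic:  # no candidate improves the criterion
            break
        best_bic = scores[best]
        selected.append(candidates[best])
    return selected

rng = np.random.default_rng(5)
features = rng.normal(size=(500, 20))
target = features[:, :3] @ np.array([2.5, 1.3, -0.5]) + 0.1 * rng.normal(size=500)
selected = forward_selection(features, target)
print(selected)
```

Variables enter in order of how much they improve the criterion, which is why the returned indices need not be sorted.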

This returns the following columns of the X matrix: the first 5 variables, which are exactly the 5 true variables defined in the data-generating process (DGP).

In [13]:
x[:,col_index]

array([[ 1.86755799,  1.76405235,  0.40015721,  0.97873798,  2.2408932 ],
       [-0.02818223, -0.89546656,  0.3869025 , -0.51080514, -1.18063218],
       [-1.17312341,  1.8831507 , -1.34775906, -1.270485  ,  0.96939671],
       ...,
       [-2.11510138, -1.24502561, -0.19650552, -0.52718478,  0.43719199],
       [-0.15391544,  0.53024927, -0.04052914,  1.41200019,  0.40162904],
       [ 0.81650862,  0.07611915,  0.33393636, -2.19190155, -0.31165281]])

In [14]:
col_index=regression.autoselection("stepwise","BIC_ll",print_message=False)
col_index

array([0, 2, 1, 4, 3])

In [15]:
col_index=regression.autoselection("backward","BIC_ll",print_message=False)
col_index

array([0, 1, 2, 3, 4])

## **Regularisation (Elastic net)**

We can also fit an elastic net or lasso regression using coordinate descent, controlled by two parameters:

- Alpha is the mixing parameter between the ridge and lasso penalties: if alpha=1, the model performs pure lasso regression.
- Lambda is the overall penalty strength: under the usual convention, lambda=0 removes the penalty and gives ordinary linear regression with unbiased estimates.

Combining both penalties gives the elastic net.

It can be used to select variables and/or to deal with multicollinearity.
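
A minimal sketch of coordinate descent for the elastic net follows. It assumes the glmnet-style objective written in the docstring, with `lam` as penalty strength and `alpha` as mixing parameter; these names, the unpenalised intercept and the stopping rule are assumptions and may not match `fit_elnet`.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net_cd(features, target, lam, alpha, max_iteration=100, tol=1e-8):
    """Coordinate descent for
    (1/2n)||y - b0 - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2 * ||b||_2^2)."""
    n, p = features.shape
    beta = np.zeros(p)
    intercept = target.mean()
    residual = target - intercept
    col_norm = (features ** 2).sum(axis=0) / n
    for _ in range(max_iteration):
        max_change = 0.0
        for j in range(p):
            # Correlation of column j with the residual that excludes j's own effect.
            rho = features[:, j] @ residual / n + col_norm[j] * beta[j]
            new_beta = soft_threshold(rho, lam * alpha) / (col_norm[j] + lam * (1 - alpha))
            residual -= features[:, j] * (new_beta - beta[j])
            max_change = max(max_change, abs(new_beta - beta[j]))
            beta[j] = new_beta
        # The intercept is not penalised, so it absorbs the mean residual.
        shift = residual.mean()
        intercept += shift
        residual -= shift
        if max_change < tol:
            break
    return intercept, beta

rng = np.random.default_rng(6)
features = rng.normal(size=(500, 20))
target = features[:, :3] @ np.array([2.5, 1.3, -0.5]) + 0.1 * rng.normal(size=500)
intercept, beta = elastic_net_cd(features, target, lam=0.2, alpha=1.0)
print(np.flatnonzero(beta))
```

With `alpha=1.0` the ridge term vanishes and the update reduces to the lasso; the soft-thresholding step is what sets small coefficients exactly to zero.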

In [4]:
non_biased_params,biased_params=regression.fit_elnet(x,y,0.2,1,for_inference_lasso_params=False)

algorithm did  converge under 100 iterations (at 7 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)


As you can see, the call fits two models. The first vector holds ordinary linear regression estimates refitted on the variables selected by the lasso; the second holds the biased lasso estimates.
Why keep both? The information criteria computed afterwards are unbiased only if they are based on OLS estimates for the lasso-selected variables.
In this workflow the lasso is simply an intermediate step for selecting variables.

Of course, you can still use the lasso parameters in further steps such as significance testing; in that case, pass the argument for_inference_lasso_params=True.
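
The refitting step can be sketched as follows. The helper `refit_on_support` is hypothetical, and the shrunken coefficient vector is a hand-written stand-in for the output of a lasso fit, used only to define which columns are kept.

```python
import numpy as np

def refit_on_support(features, target, penalised_beta):
    """OLS refit restricted to columns whose penalised coefficient is non-zero."""
    support = np.flatnonzero(penalised_beta)
    design = np.column_stack([np.ones(len(target)), features[:, support]])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    refit = np.zeros(features.shape[1] + 1)  # slot 0 holds the intercept
    refit[0] = coef[0]
    refit[support + 1] = coef[1:]            # unselected variables stay at zero
    return refit

rng = np.random.default_rng(7)
features = rng.normal(size=(500, 10))
target = features[:, :3] @ np.array([2.5, 1.3, -0.5]) + 0.1 * rng.normal(size=500)

# Stand-in for lasso estimates: shrunken on the first three columns, zero elsewhere.
penalised_beta = np.array([2.3, 1.1, -0.3] + [0.0] * 7)
refit = refit_on_support(features, target, penalised_beta)
print(refit.round(3))
```

The refitted coefficients recover the unshrunken values on the selected columns, while the zeros mark the variables the penalty removed.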

In [5]:
print(non_biased_params.shape)
non_biased_params

(51,)


array([-2.48725770e-03,  2.50087703e+00,  1.30321026e+00,  5.03895156e-01,
       -4.94005962e-01, -3.40120953e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00])

In [6]:
print(biased_params.shape)
biased_params

(51,)


array([ 0.03981746,  0.84893142,  0.24170598,  0.04086831, -0.02837556,
       -1.07368168, -0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        ,  0.        , -0.        , -0.        ,  0.        ,
       -0.        ,  0.        ,  0.        , -0.        , -0.        ,
       -0.        , -0.        , -0.        ,  0.        ,  0.        ,
       -0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.        ,  0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        ,  0.        ,  0.        ,  0.        ,
       -0.        , -0.        , -0.        , -0.        ,  0.        ,
       -0.        ])

In [7]:
regression.params

array([-2.48725770e-03,  2.50087703e+00,  1.30321026e+00,  5.03895156e-01,
       -4.94005962e-01, -3.40120953e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00])

### **Drawbacks and to do list**

**In this section, we discussed the machine learning approach to estimating the parameters of a linear regression model.**

<span style="color:red"> However, we did not cover the validity checks on the residuals (normality, homoscedasticity), endogeneity of the regressors, etc. This section will be developed soon.</span>







