In [1]:
from _linear import LinearRegression
import _dgp as dg
from _metrics import mse_score, CrossValidation, accuracy_score, f1_score
from _logistic import LogisticRegression
import numpy as np
import pandas as pd

## **Table of Contents**

- [Linear Regression](#linear-regression)
  - [Generate data and fit first model](#generate-data-and-fit-first-model)
  - [Evaluate model performance](#evaluate-model-performance)
  - [Automatic selection of variables](#automatic-selection-of-variables) 
  - [Drawbacks and to do list](#drawbacks-and-to-do-list)

- [Logistic Regression](#logistic-regression)
  - [Work with excel data (5 variables in total)](#work-with-excel-data-5-variables-in-total)
  - [Evaluate model performance](#evaluate-model-performance-logistic-regression)
  - [Automatic selection of variables](#automatic-selection-of-variables-logistic-regression)
  - [Generate 50 variables and autoselection](#Generate-50-variables-and-autoselection)



- [Logistic Regression OVR](#Logistic-Regression-OVR)
  - [Fit and predict model](#Fit-and-predict-model)
  - [Evaluate model performance](#Evaluate-model-performance-Log-OVR)
  - Automatic selection of variables (coming soon)



- [Logistic Regression Softmax](#Logistic-Regression-Softmax)
  - [Fit and predict model](#Fit-and-predict-model-Log-SFTMX)
  - [Evaluate model performance](#Evaluate-model-performance-Log-SFTMX)
  - Automatic selection of variables (coming soon)



## **Linear Regression**


### **Generate data and fit first model**

Generate uncorrelated data in which only 5 of the 50 variables are true variables:

In [2]:
x,y=dg.gen(type=1)
#get the matrix sizes to understand the shape of the data
print(x.shape)
print(y.shape)

(500, 50)
(500,)


Run the linear regression and get the estimated parameter vector B (51 values, presumably the intercept followed by the 50 coefficients):

In [3]:
#use Newton-Raphson optimisation, which needs very few iterations for this model
regression=LinearRegression(solver="nr")
regression.fit(x,y)

algorithm did  converge under 100 iterations (at 2 iterations)


array([ 2.42381844e-03,  2.49952750e+00,  1.29656399e+00,  4.99478480e-01,
       -5.01533886e-01, -3.40439139e+00, -9.39497780e-03, -2.82955967e-03,
       -1.17953902e-03,  4.75613998e-03, -1.44040245e-03, -1.75564123e-03,
        5.00688745e-03,  1.97949432e-03, -5.43767255e-03, -1.59508750e-03,
       -6.39848243e-03,  2.77749583e-03, -5.09815896e-03,  5.05433398e-03,
       -9.10483347e-03,  4.09791000e-03,  6.14520698e-03, -4.60728163e-03,
        7.23154734e-03,  1.92603687e-03,  3.75567253e-03, -1.22499498e-02,
        1.47861446e-02, -4.62803482e-03, -3.71757779e-03, -5.87936236e-03,
       -3.45568902e-03, -7.53586929e-05, -3.44135793e-03,  5.44639193e-03,
        5.65291422e-03,  5.57109092e-04,  6.52229239e-03, -1.79268958e-03,
       -8.76816495e-03, -1.44621986e-03, -4.21944358e-03,  1.96297203e-03,
       -3.55328984e-04, -4.83641895e-03, -1.41648789e-03, -4.42780255e-04,
        6.04918154e-03, -1.17387669e-03,  2.33546252e-03])

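Because the least-squares objective is quadratic, a single Newton-Raphson (NR) step reaches the minimum. The solver reports convergence at iteration 2, presumably because the second iteration is where it detects that the update has become negligible.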
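
The fast convergence can be illustrated with a minimal sketch on synthetic data. It does not use the `_linear` module; the data, the starting point and all variable names are assumptions made for the example. One Newton step from an arbitrary start lands on the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p features
beta_true = np.array([0.5, 2.0, -1.0, 0.3])                 # illustrative values
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least-squares objective: f(b) = 0.5 * ||y - X b||^2
beta = np.zeros(p + 1)                  # arbitrary starting point
gradient = -X.T @ (y - X @ beta)        # gradient of f at beta
hessian = X.T @ X                       # Hessian of f, constant in beta
beta_newton = beta - np.linalg.solve(hessian, gradient)  # one Newton step

# Reference solution from NumPy's least-squares solver
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_newton, beta_ols))
```

Since the Hessian does not depend on the parameters, any further step would leave the estimate unchanged up to rounding.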

If we want to interpret the linear regression results, we can do so with this command:

In [4]:
regression.get_inference().head()

Unnamed: 0,params,std,t value,p value
0,0.002424,0.00437,0.554635,0.5794
1,2.499528,0.004478,558.196071,0.0
2,1.296564,0.004311,300.753119,0.0
3,0.499478,0.004455,112.125108,0.0
4,-0.501534,0.004699,-106.727374,0.0
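
In the table above, row 0 has a large p value while rows 1 to 4 have very large t values. How `get_inference` computes these columns is not shown, so the following is only a sketch of the textbook OLS formulas on synthetic data; all names and coefficient values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
y = X @ np.array([0.0, 2.5, 1.3]) + 0.1 * rng.normal(size=n)  # true intercept is 0

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
sigma2 = residuals @ residuals / (n - X.shape[1])         # unbiased error variance
std = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))   # standard errors
t_values = beta / std                                     # t statistic per parameter

print(np.column_stack([beta, std, t_values]))
```

A parameter whose true value is zero gets a t value of ordinary size, whereas the true effects produce t values in the hundreds at this noise level.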


### **Evaluate model performance**

We can evaluate model performance by looking at the mean squared error (MSE):

In [5]:
mse_score(y,regression.predict(x))

(500, 51) (51,)


0.008600718625232527
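
For reference, the mean squared error is the average of the squared differences between observed and predicted values. The helper below is a hypothetical stand-in written for illustration, not the `mse_score` implementation from `_metrics`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Squared errors are 0, 0 and 4, so the result is 4 / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```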

However, this in-sample score is optimistic because the model is evaluated on the same data it was fitted to. Cross-validation gives a better estimate of out-of-sample performance:

In [6]:
list_of_mse=CrossValidation(Class_algorithm=regression,x=x,y=y,metrics_function=mse_score,nb_k_fold=6)
print(list_of_mse)

algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(83, 51) (51,)
algorithm did  converge under 100 iterations (at 2 iterations)
(2, 51) (51,)
[0.013851551356972835, 0.010802905558895383, 0.009129841868098986, 0.009536296347368326, 0.009688375022766017, 0.014807781024683307, 0.007414858340186825]


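We can average these results to get the average model performance. Two things to note: although `nb_k_fold=6` was requested, the output shows seven folds (six of 83 observations plus a leftover fold of 2), and the mean of the seven values printed above is about 0.0107, so the value below (cell `In [23]`) appears to come from a different run.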

In [23]:
np.mean(list_of_mse)

0.008594518986570114
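
A minimal k-fold sketch, assuming an ordinary least-squares fit and synthetic data, is shown below. It is not the `CrossValidation` function from `_metrics`; the function name `k_fold_mse` and the coefficient values are made up for the example. `np.array_split` is used so that the number of folds is exactly `k` even when the sample size is not divisible by `k`.

```python
import numpy as np

def k_fold_mse(X, y, k=6, seed=0):
    """Return one out-of-fold MSE per fold for an OLS fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)   # exactly k folds, sizes differ by at most 1
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X[test_idx] @ beta
        scores.append(float(np.mean(residuals ** 2)))
    return scores

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 5))])
y = X @ np.array([0.0, 2.5, 1.3, 0.5, -0.5, -3.4]) + 0.1 * rng.normal(size=500)

scores = k_fold_mse(X, y, k=6)
print(len(scores), np.mean(scores))
```

With 500 observations and `k=6`, this split produces two folds of 84 and four folds of 83 rather than a tiny leftover fold.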

### **Automatic selection of variables**

**However, it is not good practice to perform linear regression using all variables.** To obtain the best subset, several strategies can be adopted:

- **VIF models:** These identify and drop collinear variables to avoid multicollinearity issues.
- **Lasso regression:** This technique can be used for feature selection by penalizing the absolute size of the regression coefficients.
- **Forward/Backward/Stepwise selection:** These are iterative methods for feature selection, where variables are added or removed based on their impact on model performance.

In this section we use the last approach, with BIC as the selection criterion.


In [7]:
col_index=regression.autoselection("forward","BIC_ll",print_message=False)
col_index

array([ 4,  0,  1,  2,  3, 27, 26])

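Forward selection returns the column indices above. The first five selected columns (0 to 4) are the five true variables defined in the DGP; columns 26 and 27 are noise variables that were selected as well. We can extract the corresponding columns of the X matrix: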

In [8]:
x[:,col_index]

array([[-0.35495852, -0.10742798,  1.06814236, ...,  0.65932685,
         0.79538825, -0.64744142],
       [-1.18003908, -1.03615151,  0.63520362, ...,  0.56610327,
         0.92546533, -0.98586324],
       [-0.01840873, -1.17697943,  0.54630373, ...,  1.18877165,
        -0.28457316,  0.28217394],
       ...,
       [-0.16563499, -1.76412014,  0.79010986, ...,  0.47383872,
        -1.60075924,  0.81340654],
       [-0.01219476, -0.88565528, -2.44117181, ...,  1.00075324,
         0.33056653,  0.83771006],
       [ 0.65889582,  0.70246245, -2.50925878, ..., -0.1538533 ,
         0.00882275,  0.48007461]])

In [8]:
col_index=regression.autoselection("stepwise","BIC_ll",print_message=False)
col_index

array([2, 0, 4, 1, 3])

In [9]:
col_index=regression.autoselection("backward","BIC_ll",print_message=False)
col_index

array([ 0,  1,  2,  3,  4, 26, 27])
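
To make the forward strategy concrete, here is a small sketch of greedy forward selection with a Gaussian BIC on synthetic data. It is an illustration under stated assumptions, not the `autoselection` method itself: the helper names `gaussian_bic` and `forward_select`, the stopping rule and the data are all hypothetical.

```python
import numpy as np

def gaussian_bic(X, y):
    """BIC of an OLS fit with Gaussian errors: k * ln(n) - 2 * LL."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n                                  # ML estimate of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                                # coefficients + variance
    return k * np.log(n) - 2 * log_lik

def forward_select(X, y):
    """Greedily add the column that lowers BIC most; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best_bic = gaussian_bic(intercept, y)
    while remaining:
        candidates = [
            (gaussian_bic(np.hstack([intercept, X[:, selected + [j]]]), y), j)
            for j in remaining
        ]
        bic, j = min(candidates)
        if bic >= best_bic:                           # no candidate improves BIC
            break
        best_bic = bic
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                        # 3 true variables, 17 noise variables
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.5 * rng.normal(size=500)
selected = forward_select(X, y)
print(selected)
```

Backward selection works the other way round (start with all columns, drop the one whose removal lowers BIC most), and stepwise selection alternates both moves. Because each search path is greedy, the three strategies can end with slightly different subsets, as in the outputs above.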

### **Drawbacks and to do list**

**In this section, we discussed the machine learning approach of estimating the model parameters of Linear Regression.**

<span style="color:red"> However, we did not run diagnostic checks such as tests for normality and homoscedasticity of the residuals, endogeneity of the variables, etc. This section will be developed soon.</span>









## **Logistic Regression**

### **Work with excel data (5 variables in total)**

Load data and explore 

In [2]:
table=pd.read_csv("data/exampleLR.csv")

In [3]:
table.head()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,y
0,0,-0.560476,-0.601893,-0.995799,-0.820987,-0.511604,0
1,1,-0.230177,-0.993699,-1.039955,-0.307257,0.236938,0
2,2,1.558708,1.026785,-0.01798,-0.902098,-0.541589,1
3,3,0.070508,0.751061,-0.132175,0.627069,1.219228,0
4,4,0.129288,-1.509167,-2.549343,1.120355,0.174136,0


Convert the data to NumPy arrays before fitting the model:

In [4]:
x,y=np.array(table.drop(["y","Unnamed: 0"],axis=1)),np.array(table["y"])

**Are classes balanced?**

- If yes, the use of accuracy is enough (our case).
- If not:
  - Use other metrics such as precision, recall, F1 score, etc.
  - Do over/undersampling.



In [9]:
table["y"].value_counts()

y
0    253
1    247
Name: count, dtype: int64
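
To see why accuracy alone can mislead when classes are imbalanced, consider the sketch below. The helper `binary_scores` is hypothetical and only illustrates the standard definitions of precision, recall and F1; it is not the `f1_score` function from `_metrics`.

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Imbalanced toy example: 9 negatives, 1 positive, classifier always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
print(accuracy, binary_scores(y_true, y_pred))
```

The always-negative classifier reaches 90% accuracy while its precision, recall and F1 are all zero. With 253 versus 247 observations, as here, this problem does not arise.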

Try the gradient descent solver with its default hyperparameters:

In [5]:
model=LogisticRegression(solver="gd")
model.fit(x,y)

algorithm did not converge under 100 iterations,so the calculated parameter is biased


array([-0.1248317 , -0.98031107,  1.93586459,  0.89773156,  0.7761705 ,
       -1.07133784])

**Algorithm could not converge:**
If you see this message, **NEVER** proceed to further steps, because the estimated parameters are unreliable (the optimiser stopped before reaching the optimum).

Instead, try other hyperparameters until the algorithm converges.


In [6]:
model1=LogisticRegression(solver="gd",learning_rate=0.004,max_iteration=300)
model1.fit(x,y)

algorithm did  converge under 300 iterations (at 79 iterations)


array([-0.14211795, -1.10847242,  2.18459166,  1.02655283,  0.87139786,
       -1.20943036])
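
The role of the learning rate and the iteration limit can be sketched as follows. The convergence rule used by `_logistic` is not shown in this notebook, so the criterion here (largest absolute parameter update below `tol`) is an assumption, as are the function name `fit_logistic_gd` and the synthetic data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.004, max_iteration=300, tol=1e-4):
    """Gradient ascent on the log-likelihood. Returns (beta, converged, n_iter)."""
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iteration + 1):
        gradient = X.T @ (y - sigmoid(X @ beta))   # gradient of the log-likelihood
        step = learning_rate * gradient
        beta = beta + step
        if np.max(np.abs(step)) < tol:             # update is negligible: stop
            return beta, True, iteration
    return beta, False, max_iteration              # iteration budget exhausted

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 3))])
beta_true = np.array([-0.1, -1.0, 1.5, 0.8])       # illustrative values
y = (rng.random(500) < sigmoid(X @ beta_true)).astype(float)

beta, converged, n_iter = fit_logistic_gd(X, y)
_, converged_short, _ = fit_logistic_gd(X, y, max_iteration=5)
print(converged, n_iter, converged_short)
```

With too small an iteration budget the loop returns parameters that are still moving, which is why estimates from a non-converged run should not be interpreted.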

Try stochastic gradient descent with a mini-batch size of 68 (and without an intercept) to compare its behaviour:

In [12]:
model2=LogisticRegression(solver="sgd",learning_rate=0.001,max_iteration=300,mini_batch_size=68,add_intercept=False)
model2.fit(x,y)

algorithm did  converge under 2400 iterations (at 2105 iterations)


array([-1.09991109,  2.1612286 ,  1.01235242,  0.85978906, -1.19105931])
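
The message mentions 2400 iterations although `max_iteration=300` was passed. One plausible reading, which is an assumption about how `_logistic` counts iterations rather than something stated in the source, is that `max_iteration` counts passes over the data and each mini-batch update counts as one iteration:

```python
import math

n_obs, mini_batch_size, max_iteration = 500, 68, 300

# 500 / 68 is about 7.35, so the last of the 8 batches is smaller than the others
batches_per_epoch = math.ceil(n_obs / mini_batch_size)
total_updates = batches_per_epoch * max_iteration
print(batches_per_epoch, total_updates)
```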

With the new learning rate the gradient descent algorithm converged, so we use the results of `model1` to make predictions:

In [7]:
model1.predict(x)[0:10]

array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])

We can also get the significance of the parameters and the information criteria:

In [8]:
print(model1.get_inference())
print(model1.criteria)

     params       std   t value  p value
0 -0.142118  0.133005 -1.068517   0.2858
1 -1.108472  0.159209 -6.962360   0.0000
2  2.184592  0.222661  9.811274   0.0000
3  1.026553  0.163791  6.267456   0.0000
4  0.871398  0.144580  6.027095   0.0000
5 -1.209430  0.173253 -6.980720   0.0000
{'LL': -177.40450034974413, 'AIC_ll': 366.80900069948825, 'BIC_ll': 392.0966492900214}
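
The reported criteria are consistent with the usual definitions AIC = 2k - 2·LL and BIC = k·ln(n) - 2·LL, taking k = 6 parameters (intercept plus 5 slopes) and n = 500 observations. The check below simply plugs the printed log-likelihood into these formulas; it does not call the library.

```python
import math

log_lik = -177.40450034974413   # 'LL' from the output above
k = 6                           # intercept + 5 slopes
n = 500                         # number of observations (253 + 247)

aic = 2 * k - 2 * log_lik
bic = k * math.log(n) - 2 * log_lik
print(aic, bic)
```

Lower values are better for both criteria. BIC penalises each extra parameter by ln(n), about 6.2 here, instead of 2, so it favours smaller models than AIC.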


### **Evaluate model performance** (Logistic Regression)

Evaluate the model with accuracy and F1 score, using cross-validation:

In [9]:
CrossValidation(Class_algorithm=model1,x=x,y=y,metrics_function=accuracy_score,nb_k_fold=4)

algorithm did  converge under 300 iterations (at 110 iterations)
algorithm did  converge under 300 iterations (at 114 iterations)
algorithm did  converge under 300 iterations (at 86 iterations)
algorithm did  converge under 300 iterations (at 97 iterations)


[0.816, 0.816, 0.864, 0.84]

In [16]:
CrossValidation(Class_algorithm=model1,x=x,y=y,metrics_function=f1_score,nb_k_fold=4)

algorithm did  converge under 300 iterations (at 91 iterations)
algorithm did  converge under 300 iterations (at 115 iterations)
algorithm did  converge under 300 iterations (at 95 iterations)
algorithm did  converge under 300 iterations (at 105 iterations)


[0.8333333333333334,
 0.8428571428571429,
 0.8527131782945736,
 0.8113207547169812]

In [10]:
model1.params

array([-0.14211795, -1.10847242,  2.18459166,  1.02655283,  0.87139786,
       -1.20943036])

### **Automatic selection of variables** (Logistic Regression)

In [11]:
index_cols=model1.autoselection("forward","BIC_ll",print_message=False)
index_cols

array([1, 4, 0, 2, 3])

In [12]:
index_cols=model1.autoselection("backward","BIC_ll",print_message=False)
index_cols

array([0, 1, 2, 3, 4])

In [13]:
index_cols=model1.autoselection("stepwise","BIC_ll",print_message=False)
index_cols

array([2, 4, 1, 0, 3])

In this case all three methods keep the 5 variables, which are the 5 true variables. This is expected, because the data in the file were generated with only these 5 variables.

We can now see how the selection performs on newly generated data containing 5 true variables and 45 noise variables.

### **Generate 50 variables and autoselection**

In [23]:
x,y=dg.gen(1,"logistic")

In [24]:
model_new=LogisticRegression(solver="nr")
model_new.fit(x,y)

algorithm did  converge under 100 iterations (at 8 iterations)


array([-0.26252741,  2.91396988,  1.69675333,  0.66507672, -0.97284196,
       -3.98451213,  0.0873171 ,  0.30109892, -0.21457586, -0.02991893,
       -0.21174401,  0.29569178, -0.19071773, -0.10514113,  0.17350999,
        0.41672691, -0.05077318, -0.10087222, -0.19423697, -0.24548784,
        0.03522619, -0.27216917, -0.0375273 ,  0.12741222,  0.01210865,
        0.18402691, -0.02282985, -0.37788023,  0.18956194,  0.15211339,
        0.11459876,  0.18702506,  0.01996251,  0.27512159,  0.25615405,
        0.17383501, -0.02152307, -0.04251427,  0.099655  , -0.02654946,
        0.05423186, -0.18081417, -0.04177777, -0.13633977,  0.07637829,
        0.31260222, -0.09220879, -0.43657246, -0.06785007,  0.02477924,
        0.2520826 ])

In [25]:
model_new.autoselection("forward","BIC_ll",print_message=False)

array([4, 0, 1, 3, 2])

In [13]:
model_new.autoselection("stepwise","BIC_ll",print_message=False)

array([3, 2, 1, 4, 0])

In [14]:
model_new.autoselection("backward","BIC_ll",print_message=False)

array([0, 1, 2, 3, 4])

Although the dataset contains 50 variables, each method selected exactly the 5 true variables. All three methods return the same set of variables (in a different order), which is the expected behaviour here.

## **Logistic Regression OVR**

### **Fit and predict model**

Load data

In [3]:
table=pd.read_csv("data/exampleLRMulti.csv")

Get rid of unnamed columns

In [4]:
table.drop(table.columns[table.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)

Once again the classes are almost equally distributed, so we can assess model performance with the accuracy score:

In [28]:
table["column_Y"].value_counts()

column_Y
1    336
0    333
2    331
Name: count, dtype: int64

In [5]:
x,y=np.array(table.drop(["column_Y"],axis=1)),np.array(table["column_Y"])

Estimate a multiclass logistic regression with the one-vs-rest (OVR) algorithm and Newton-Raphson optimisation.

Because OVR fits one binary model per class, the model is fitted 3 times and the estimated parameters form a matrix with 3 columns (one per class).

In [6]:
modelOVR=LogisticRegression(solver="nr",multiclass="ovr")
modelOVR.fit(x,y)

algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)


array([[-1.01498187, -1.45620846, -0.78718645],
       [ 0.34446018,  0.56751748, -0.79115877],
       [ 0.22977187, -0.65556071,  0.56300868],
       [ 0.05302252, -0.03242775, -0.03517929],
       [ 0.06803143,  0.15324563, -0.1736295 ],
       [-0.44905806,  0.79972167, -0.42289584],
       [ 0.02949896, -0.09727438,  0.07434896],
       [-0.64703364,  0.23914131,  0.39520493],
       [-0.03647586,  0.15863365, -0.09606587],
       [ 0.6808661 , -0.45149573, -0.23001672],
       [-0.01179362, -0.05067621,  0.03836354]])
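
The one-vs-rest idea can be sketched in a few lines: for each class, recode the labels as "this class" versus "all others" and fit an ordinary binary logistic regression. The code below is an illustrative sketch on synthetic clusters, not the `_logistic` implementation; `fit_binary_newton`, `fit_ovr` and the data are assumptions, and the Newton loop uses a fixed number of iterations for brevity.

```python
import numpy as np

def fit_binary_newton(X, t, n_iter=25):
    """Newton-Raphson for binary logistic regression (t holds 0/1 targets)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (t - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def fit_ovr(X, y):
    """Fit one binary model per class; return one parameter column per class."""
    classes = np.unique(y)                           # ascending class labels
    columns = [fit_binary_newton(X, (y == c).astype(float)) for c in classes]
    return np.column_stack(columns)

rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-1.0, 1.7], [-1.0, -1.7]])
y = rng.integers(0, 3, size=300)
features = centers[y] + rng.normal(size=(300, 2))    # overlapping clusters
X = np.column_stack([np.ones(300), features])        # add an intercept column

params = fit_ovr(X, y)
scores = 1.0 / (1.0 + np.exp(-X @ params))           # per-class probabilities, not normalised
accuracy = float(np.mean(scores.argmax(axis=1) == y))
print(params.shape, accuracy)
```

The parameter matrix has one row per input column (intercept included) and one column per class, which mirrors the 11 x 3 array returned above.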

We can also predict. The columns of the prediction matrix always follow the ascending order of the class labels:

- the first column is class 0
- the second column is class 1
- the third column is class 2


In [7]:
predictionsOVR=modelOVR.predict(x)
predictionsOVR 

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])

We can compare these predictions with the true one-hot encoded labels:

In [12]:
modelOVR.true_labels_matrix

array([[0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       ...,
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1]])

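We can also retrieve the class probabilities. For OVR they are not normalised: each column comes from a separate binary model, so a row does not have to sum to 1. The predictions, in contrast, are one-hot, so each predicted row does sum to 1.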

In [9]:
modelOVR.proba

array([[0.14660444, 0.94592762, 0.01074276],
       [0.26867955, 0.36961135, 0.19349834],
       [0.0519826 , 0.32131164, 0.59426132],
       ...,
       [0.02941952, 0.15974979, 0.77393711],
       [0.26880729, 0.33355668, 0.22603021],
       [0.03407864, 0.33402281, 0.64490065]])
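
Using the first three probability rows printed above, the sketch below shows that the rows do not sum to 1 and that taking the column with the largest probability reproduces the first three one-hot predictions shown earlier. Whether `predict` is implemented exactly this way is an assumption; the rescaling step is optional and only included for illustration.

```python
import numpy as np

# First three rows of modelOVR.proba, copied from the output above
proba = np.array([
    [0.14660444, 0.94592762, 0.01074276],
    [0.26867955, 0.36961135, 0.19349834],
    [0.0519826,  0.32131164, 0.59426132],
])

row_sums = proba.sum(axis=1)                       # differ from 1 for OVR
predicted_class = proba.argmax(axis=1)             # index of the largest probability
one_hot = np.eye(proba.shape[1])[predicted_class]  # one-hot prediction matrix
normalised = proba / row_sums[:, None]             # optional: rescale rows to sum to 1

print(row_sums)
print(one_hot)
```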

In [10]:
modelOVR.get_inference(only_IC=True)

(1000, 3)
(3, 1000)


{'LL': 738.4500205744612,
 'AIC_ll': -1454.9000411489224,
 'BIC_ll': -1400.9147330801188}

In [11]:
modelOVR.autoselection("forward","BIC_ll")

algorithm did  converge under 100 iterations (at 4 iterations)
algorithm did  converge under 100 iterations (at 4 iterations)
(1000,)
(2, 1000)


ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1000)

In [None]:
# Problem: autoselection uses the original label vector y, not the one-hot label matrix.
# Passing the matrix is the simplest fix, but it may not be the best one.

In [15]:
pd.get_dummies(pd.DataFrame(y))

Unnamed: 0,0
0,1
1,0
2,2
3,2
4,0
...,...
995,1
996,1
997,2
998,0
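
The output above still has a single integer column: by default `pd.get_dummies` only encodes DataFrame columns of object, string or category dtype, so a numeric column passes through unchanged. A NumPy-only way to build the one-hot label matrix, assuming the labels are integers 0, 1, ..., K-1, is sketched below with toy labels.

```python
import numpy as np

labels = np.array([1, 0, 2, 2, 0])          # toy integer labels, assumed to be 0..K-1
n_classes = labels.max() + 1

# Row i of the identity matrix is the indicator vector of class i,
# so indexing with the labels yields one indicator row per observation
one_hot = np.eye(n_classes, dtype=int)[labels]
print(one_hot)
```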


In [13]:
y=np.array(([1,0,0],[0,1,0],[0,0,1],[0,0,1]))
loga=np.log(np.array(([0.3792,0.3072,0.4263],[0,1,0],[0,0,1],[0,0,1])))

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1]])

### **Evaluate model performance** (Log OVR)

We can evaluate accuracy

In [104]:
accuracy_score(predictionsOVR, modelOVR.true_labels_matrix)

0.695

However, in-sample accuracy tends to be optimistic, so we use cross-validation for a more reliable evaluation:

In [107]:
acc_list=CrossValidation(Class_algorithm=modelOVR,x=x,y=modelOVR.true_labels_matrix,metrics_function=accuracy_score,nb_k_fold=8)

algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iter

Averaging over the folds gives a more robust accuracy estimate:

In [108]:
print(acc_list)
np.mean(acc_list)

[0.68, 0.632, 0.68, 0.704, 0.696, 0.696, 0.624, 0.672]


0.6729999999999999

## **Logistic Regression Softmax**

**Now let's look at another way of solving the multiclass problem without fitting P independent classifiers: softmax logistic regression.**
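
In the softmax formulation, a single model produces one linear score per class and the scores are turned into probabilities that sum to 1 across classes. A minimal sketch of that transformation follows; the function and the random parameter matrix are illustrative assumptions, not the `_logistic` code.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(4), rng.normal(size=(4, 2))])  # 4 observations, intercept + 2 features
params = rng.normal(size=(3, 3))                            # one column of coefficients per class

proba = softmax(X @ params)
print(proba.sum(axis=1))
```

Unlike the OVR probabilities, each row here is normalised by construction.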

### **Fit and predict model** (Log SFTMX)

In [115]:
modelSFT=LogisticRegression(solver="gd",multiclass="softmax")  # Newton-Raphson is not implemented for softmax yet
modelSFT.fit(x,y)

algorithm did  converge under 100 iterations (at 50 iterations)


array([[ 0.04194506, -0.27091767,  0.22897262],
       [ 0.23740402,  0.34859258, -0.58599659],
       [ 0.08908508, -0.4494452 ,  0.36036012],
       [ 0.0417117 , -0.02254157, -0.01917013],
       [ 0.03704339,  0.08616513, -0.12320852],
       [-0.27315999,  0.56539992, -0.29223993],
       [ 0.01838198, -0.07586092,  0.05747895],
       [-0.47306334,  0.21607495,  0.2569884 ],
       [-0.03294754,  0.10874983, -0.07580229],
       [ 0.48652451, -0.33256412, -0.15396038],
       [-0.00460555, -0.02954451,  0.03415006]])

We can compare them to modelOVR parameters

In [112]:
modelOVR.params

array([[-1.01498187, -1.45620846, -0.78718645],
       [ 0.34446018,  0.56751748, -0.79115877],
       [ 0.22977187, -0.65556071,  0.56300868],
       [ 0.05302252, -0.03242775, -0.03517929],
       [ 0.06803143,  0.15324563, -0.1736295 ],
       [-0.44905806,  0.79972167, -0.42289584],
       [ 0.02949896, -0.09727438,  0.07434896],
       [-0.64703364,  0.23914131,  0.39520493],
       [-0.03647586,  0.15863365, -0.09606587],
       [ 0.6808661 , -0.45149573, -0.23001672],
       [-0.01179362, -0.05067621,  0.03836354]])

We can also predict:

In [113]:
modelSFT.predict(x)

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])

### **Evaluate model performance** (Log SFTMX)

In [116]:
meansftmx=CrossValidation(Class_algorithm=modelSFT,x=x,y=modelSFT.true_labels_matrix,metrics_function=accuracy_score,nb_k_fold=4)
np.mean(meansftmx)

algorithm did  converge under 100 iterations (at 65 iterations)
algorithm did  converge under 100 iterations (at 66 iterations)
algorithm did  converge under 100 iterations (at 56 iterations)
algorithm did  converge under 100 iterations (at 62 iterations)


0.671

Further steps:

- implement Newton-Raphson for softmax regression
- explore further inference statistics (log odds, inference for multinomial logistic regression)
- add automatic variable selection (stepwise, backward, lasso, ...)
- carry out a real project using all the algorithms developed here



In [118]:
modelSFT=LogisticRegression(solver="nr",multiclass="softmax")  # Newton-Raphson is not implemented for softmax yet, so this raises an error
modelSFT.fit(x,y)

ValueError: not done yet,incoming