In [11]:
from _linear import LinearRegression
import _dgp as dg
from _metrics import mse_score, CrossValidation, accuracy_score, f1_score
from _logistic import LogisticRegression
import numpy as np
import pandas as pd

## **Table of Contents**

- [Linear Regression](#linear-regression)
  - [Generate data and fit first model](#generate-data-and-fit-first-model)
  - [Evaluate model performance](#evaluate-model-performance)
  - [Automatic selection of variables](#automatic-selection-of-variables) 
  - [Drawbacks and to do list](#drawbacks-and-to-do-list)

- [Logistic Regression](#logistic-regression)
  - [Work with excel data (5 variables in total)](#work-with-excel-data-5-variables-in-total)
  - [Evaluate model performance](#evaluate-model-performance-logistic-regression)
  - [Automatic selection of variables](#automatic-selection-of-variables-logistic-regression)
  - [Generate 50 variables and autoselection](#Generate-50-variables-and-autoselection)



- [Logistic Regression OVR](#Logistic-Regression-OVR)
  - [Fit and predict model](#Fit-and-predict-model)
  - [Automatic selection of variables](#automatic-selection-of-variables-Log-OVR)
  - [Evaluate model performance](#Evaluate-model-performance-Log-OVR)



- [Logistic Regression Softmax](#Logistic-Regression-Softmax)
  - [Fit and predict model](#Fit-and-predict-model-Log-SFTMX)
  - [Evaluate model performance](#Evaluate-model-performance-Log-SFTMX)
  - [Automatic selection of variables](#automatic-selection-of-variables-Log-SFTMX)



## **Linear Regression**


### **Generate data and fit first model**

Generate uncorrelated data in which only 5 of the 50 variables truly affect the response

In [19]:
x,y=dg.gen(type=1)
# check the array shapes to see what the data look like
print(x.shape)
print(y.shape)

(500, 50)
(500,)
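
For readers without access to `_dgp`, the following is a minimal sketch of what such a data-generating process could look like. The function name `gen_sketch`, the noise level and the coefficient values are assumptions chosen to resemble the fitted coefficients reported further down; they are not taken from `dg.gen`.

```python
import numpy as np

def gen_sketch(n=500, p=50, noise_sd=0.1, seed=0):
    """Hypothetical stand-in for dg.gen(type=1): independent features, 5 true ones."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))            # columns are drawn independently
    beta = np.zeros(p)
    beta[:5] = [2.5, 1.3, 0.5, -0.5, -3.4]     # assumed true effects; the other 45 are zero
    y = x @ beta + noise_sd * rng.standard_normal(n)
    return x, y

x_demo, y_demo = gen_sketch()
print(x_demo.shape, y_demo.shape)
```
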


Run the linear regression and get the estimated parameter vector B

In [20]:
# use Newton-Raphson optimisation, which needs very few iterations on this quadratic least-squares problem
regression=LinearRegression(solver="nr")
regression.fit(x,y)

algorithm did  converge under 100 iterations (at 2 iterations)


array([ 5.00655327e-03,  2.50082381e+00,  1.30074205e+00,  5.04948335e-01,
       -4.97982842e-01, -3.40690769e+00,  7.02894127e-03,  1.33090720e-03,
        1.56569681e-03, -1.47444277e-03,  2.47601399e-03,  1.20897024e-03,
       -4.88315140e-03,  3.25176864e-03,  3.64169932e-03,  9.85333779e-03,
       -9.13441909e-03, -5.45129963e-03,  1.93953831e-03,  4.12843286e-03,
        1.82160930e-03,  4.24143645e-03, -3.80682447e-03, -5.74690924e-03,
        8.47523959e-03,  4.97722735e-03, -3.73118438e-03,  1.47095850e-03,
       -7.20803307e-03,  1.07752324e-03, -5.76673274e-03, -1.73347510e-03,
       -5.03862378e-03, -4.76832106e-03, -3.81665323e-03,  5.61282066e-04,
       -2.48094489e-03, -2.61997771e-03,  3.52033472e-04, -7.28302011e-03,
       -3.99277905e-03,  2.65484122e-03, -3.34223776e-04,  1.23836303e-03,
        1.83873220e-03,  1.26646513e-03,  3.83905807e-03,  7.28205594e-04,
        6.07535846e-03,  5.12442499e-03,  6.35777391e-03])

Because we used Newton-Raphson (NR) optimisation, the fit converged very quickly, in only 2 iterations
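
The fast convergence has a simple explanation: the least-squares objective is quadratic, so a single Newton step from any starting point lands on the minimiser. The sketch below checks this with plain NumPy on simulated data; it does not reproduce the internals of `LinearRegression(solver="nr")`, and the second iteration reported by the library is presumably the one that detects a negligible update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # intercept + features
y = X @ np.array([0.0, 2.5, 1.3, 0.5, -0.5]) + 0.1 * rng.standard_normal(n)

beta = np.zeros(X.shape[1])                        # starting point
grad = X.T @ (X @ beta - y)                        # gradient of 0.5 * ||y - X beta||^2
hess = X.T @ X                                     # Hessian, constant in beta
beta_newton = beta - np.linalg.solve(hess, grad)   # a single Newton step

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference least-squares solution
print(np.allclose(beta_newton, beta_ols))
```
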

To interpret the linear regression results, we can call:

In [21]:
regression.get_inference().head()

Unnamed: 0,params,std,t value,p value
0,0.005007,0.004222,1.185729,0.2364
1,2.500824,0.004233,590.833936,0.0
2,1.300742,0.00403,322.782025,0.0
3,0.504948,0.004301,117.405398,0.0
4,-0.497983,0.004286,-116.180961,0.0
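
The columns of this table follow the usual OLS recipe: standard errors come from the diagonal of the estimated covariance matrix, and the t value is the coefficient divided by its standard error. Below is a minimal sketch of that computation. The function name `ols_inference` is hypothetical, and the p value uses a normal approximation to the t distribution (reasonable for large samples), which may differ from what `get_inference` does.

```python
import math
import numpy as np

def ols_inference(X, y):
    """Return coefficients, standard errors, t values and approximate p values."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # unbiased residual variance
    std = np.sqrt(sigma2 * np.diag(xtx_inv))     # standard errors
    t = beta / std
    # two-sided p value under a normal approximation to the t distribution
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return beta, std, t, p

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([0.0, 2.5, 0.0]) + 0.1 * rng.standard_normal(n)
beta_hat, std_err, t_val, p_val = ols_inference(X, y)
print(np.round(beta_hat, 3), np.round(std_err, 4))
```
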


### **Evaluate model performance**

We can evaluate model performance by looking at the mean squared error (MSE)

In [22]:
mse_score(y,regression.predict(x))

0.008061146019668532

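However, this MSE is computed on the same data that was used for fitting, so it tends to be optimistic. Cross-validation gives a more reliable estimate, because each fold is scored on observations the model did not see during fitting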

In [29]:
list_of_mse=CrossValidation(Class_algorithm=regression,x=x,y=y,metrics_function=mse_score,nb_k_fold=6)
print(list_of_mse)

algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
algorithm did  converge under 100 iterations (at 2 iterations)
[0.011849304996587129, 0.01404096616414492, 0.0092667697494742, 0.008011126327296225, 0.010549941320124203, 0.009327920843915722, 0.005760201777866242]


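We can average these results to get the average model performance. Note that the value shown below comes from an earlier execution of the cells (`In [24]` versus `In [29]`), so it does not equal the mean of the list printed above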

In [24]:
np.mean(list_of_mse)

0.012661938624977375
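
For intuition, k-fold cross-validation can be sketched in a few lines of NumPy. The helper `k_fold_mse` below is a hypothetical illustration that refits ordinary least squares on each training split; it is not the `CrossValidation` function from `_metrics`, whose splitting rules may differ.

```python
import numpy as np

def k_fold_mse(x, y, k=6, seed=0):
    """Plain k-fold cross-validation for least squares; returns one MSE per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint index sets
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr = np.column_stack([np.ones(len(train_idx)), x[train_idx]])
        X_te = np.column_stack([np.ones(len(test_idx)), x[test_idx]])
        beta, *_ = np.linalg.lstsq(X_tr, y[train_idx], rcond=None)
        scores.append(float(np.mean((y[test_idx] - X_te @ beta) ** 2)))
    return scores

rng = np.random.default_rng(1)
x_demo = rng.standard_normal((300, 5))
y_demo = x_demo @ np.array([2.5, 1.3, 0.5, -0.5, -3.4]) + 0.1 * rng.standard_normal(300)
fold_mse = k_fold_mse(x_demo, y_demo, k=6)
print(fold_mse, np.mean(fold_mse))
```
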

### **Automatic selection of variables**

**However, it is not a good practice to perform linear regression using all variables.** In order to obtain the best subset, several strategies can be adopted:

- **VIF (variance inflation factor) screening:** This identifies and drops collinear variables to avoid multicollinearity issues.
- **Lasso regression:** This technique can be used for feature selection by penalizing the absolute size of the regression coefficients.
- **Forward/Backward/Stepwise selection:** These are iterative methods for feature selection, where variables are added or removed based on their impact on model performance.

In this section we use the last approach
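
As a rough illustration of how a forward search driven by BIC works, here is a minimal sketch for a Gaussian linear model. The helpers `bic_ols` and `forward_select` are hypothetical and written for this example only; the stopping rule and the exact likelihood used by `autoselection` may differ.

```python
import numpy as np

def bic_ols(X, y):
    """BIC of a Gaussian linear model: k*ln(n) - 2*LL, with LL at the MLE of sigma^2."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    ll = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return k * np.log(n) - 2 * ll

def forward_select(x, y):
    """Greedily add the column that lowers BIC the most; stop when none does."""
    n, p = x.shape
    selected, remaining = [], list(range(p))
    design = np.ones((n, 1))                     # start from the intercept only
    best = bic_ols(design, y)
    while remaining:
        scores = [bic_ols(np.column_stack([design, x[:, j]]), y) for j in remaining]
        j_best = int(np.argmin(scores))
        if scores[j_best] >= best:               # no candidate improves BIC
            break
        best = scores[j_best]
        design = np.column_stack([design, x[:, remaining[j_best]]])
        selected.append(remaining.pop(j_best))
    return selected

rng = np.random.default_rng(2)
x_demo = rng.standard_normal((300, 10))
y_demo = (2.5 * x_demo[:, 0] - 3.4 * x_demo[:, 1] + 1.3 * x_demo[:, 2]
          + 0.1 * rng.standard_normal(300))
selected = forward_select(x_demo, y_demo)
print(selected)
```

Backward selection runs the same comparison in the other direction (start from all columns and drop one at a time), and stepwise selection alternates between adding and dropping.
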


In [25]:
col_index=regression.autoselection("forward","BIC_ll",print_message=False)
col_index

array([4, 0, 1, 3, 2])

This returns the following columns of the X matrix: the first 5 variables, which are exactly the 5 true variables defined in the data-generating process (`_dgp`)

In [26]:
x[:,col_index]

array([[ 0.65152199,  2.06536844,  0.93297886, -1.31013114, -1.09217254],
       [ 0.13988077,  1.2367918 ,  0.27315847, -0.75786061,  0.36138793],
       [-0.80737236,  0.83622389,  1.41390816, -0.77262775, -0.20233063],
       ...,
       [-0.5243483 , -0.57644198,  0.05976386,  0.20421151, -1.15144094],
       [ 0.55245561,  0.48245167,  0.17420111, -1.4384277 ,  0.49908468],
       [-0.79357785, -3.16333649, -1.06427108, -0.67489721, -0.59786218]])

In [27]:
col_index=regression.autoselection("stepwise","BIC_ll",print_message=False)
col_index

array([0, 4, 2, 1, 3])

In [28]:
col_index=regression.autoselection("backward","BIC_ll",print_message=False)
col_index

array([0, 1, 2, 3, 4])

### **Drawbacks and to do list**

**In this section, we discussed the machine learning approach of estimating the model parameters of Linear Regression.**

<span style="color:red"> However, we did not analyze the validity tests of normality and homoscedasticity of residuals, endogenous variables, etc. This section will be developed soon.</span>









## **Logistic Regression**

### **Work with excel data (5 variables in total)**

Load data and explore 

In [30]:
table=pd.read_csv("data/exampleLR.csv")

In [31]:
table.head()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,y
0,0,-0.560476,-0.601893,-0.995799,-0.820987,-0.511604,0
1,1,-0.230177,-0.993699,-1.039955,-0.307257,0.236938,0
2,2,1.558708,1.026785,-0.01798,-0.902098,-0.541589,1
3,3,0.070508,0.751061,-0.132175,0.627069,1.219228,0
4,4,0.129288,-1.509167,-2.549343,1.120355,0.174136,0


Always convert the data to NumPy arrays before starting to work with it

In [32]:
x,y=np.array(table.drop(["y","Unnamed: 0"],axis=1)),np.array(table["y"])

**Are classes balanced?**

- If yes, the use of accuracy is enough (our case).
- If not:
  - Use other metrics such as precision, recall, F1 score, etc.
  - Do over/undersampling.



In [33]:
table["y"].value_counts()

y
0    253
1    247
Name: count, dtype: int64

Try Gradient Descent algorithm

In [34]:
model=LogisticRegression(solver="gd")
model.fit(x,y)

algorithm did not converge under 100 iterations,so the calculated parameter is biased


array([-0.1248317 , -0.98031107,  1.93586459,  0.89773156,  0.7761705 ,
       -1.07133784])

**Algorithm could not converge:**
If you see this message, **NEVER** proceed to further steps: the optimiser stopped before reaching the optimum, so the estimated parameters are unreliable.

Instead, try other hyperparameters until the algorithm converges.


In [35]:
model1=LogisticRegression(solver="gd",learning_rate=0.004,max_iteration=300)
model1.fit(x,y)

algorithm did  converge under 300 iterations (at 79 iterations)


array([-0.14211795, -1.10847242,  2.18459166,  1.02655283,  0.87139786,
       -1.20943036])
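
To see why the iteration budget matters, the sketch below runs gradient descent on the mean log-loss and reports whether the update became negligible before the budget ran out. The function `fit_logistic_gd`, its tolerance-based stopping rule and its default values are assumptions made for this illustration; the convergence test inside `LogisticRegression(solver="gd")` is not shown in this notebook and may be different.

```python
import numpy as np

def fit_logistic_gd(X, y, learning_rate=0.5, max_iteration=100, tol=1e-6):
    """Gradient descent on the mean log-loss; returns (beta, converged, n_iter)."""
    beta = np.zeros(X.shape[1])
    for it in range(1, max_iteration + 1):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
        step = learning_rate * X.T @ (p - y) / len(y)
        beta -= step
        if np.max(np.abs(step)) < tol:               # update is negligible: stop
            return beta, True, it
    return beta, False, max_iteration

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
p_true = 1.0 / (1.0 + np.exp(-X @ np.array([-0.1, -1.1, 2.2, 1.0])))
y = (rng.random(n) < p_true).astype(float)

_, ok_short, _ = fit_logistic_gd(X, y, max_iteration=10)
beta_gd, ok_long, n_iter = fit_logistic_gd(X, y, max_iteration=5000)
print(ok_short, ok_long, n_iter)
```
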

Try stochastic gradient descent with a mini-batch size of 68 to see how it performs

In [37]:
model2=LogisticRegression(solver="sgd",learning_rate=0.001,max_iteration=300,mini_batch_size=68,add_intercept=False)
model2.fit(x,y)

algorithm did  converge under 2400 iterations (at 1192 iterations)


array([-1.04660621,  2.05785965,  0.9594837 ,  0.82034862, -1.1345406 ])
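
The figure of 2400 in the message is consistent with counting one iteration per mini-batch: 500 rows split into batches of 68 give 8 batches per pass, and 300 passes give 2400 updates. This reading is an inference from the printed message, not something the notebook states. The sketch below shows the arithmetic and one way of slicing a shuffled index into batches.

```python
import math
import numpy as np

n_obs, mini_batch_size, max_iteration = 500, 68, 300
batches_per_epoch = math.ceil(n_obs / mini_batch_size)        # the last batch is smaller
print(batches_per_epoch, batches_per_epoch * max_iteration)   # 8 2400

rng = np.random.default_rng(0)
order = rng.permutation(n_obs)                                # one random ordering of the rows
batches = [order[i:i + mini_batch_size] for i in range(0, n_obs, mini_batch_size)]
print([len(b) for b in batches])                              # seven batches of 68, one of 24
```
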

With the new learning rate the algorithm converged, so we will use the results of model1 to make predictions

In [38]:
model1.predict(x)[0:10]

array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])

We can also get the significance of the parameters

In [39]:
print(model1.get_inference())
print(model1.criteria)

     params       std   t value  p value
0 -0.142118  0.133005 -1.068517   0.2858
1 -1.108472  0.159209 -6.962360   0.0000
2  2.184592  0.222661  9.811274   0.0000
3  1.026553  0.163791  6.267456   0.0000
4  0.871398  0.144580  6.027095   0.0000
5 -1.209430  0.173253 -6.980720   0.0000
{'LL': -177.40450034974413, 'AIC_ll': 366.80900069948825, 'BIC_ll': 392.0966492900214}
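
The printed criteria match the standard definitions AIC = 2k - 2·LL and BIC = k·ln(n) - 2·LL, with k = 6 parameters (intercept plus 5 slopes) and n = 500 observations (253 + 247). The short check below reproduces both numbers from the log-likelihood, up to floating-point rounding.

```python
import math

ll = -177.40450034974413      # 'LL' as printed above
k = 6                         # intercept + 5 slopes
n = 500                       # rows in the dataset (253 + 247)

aic = 2 * k - 2 * ll
bic = k * math.log(n) - 2 * ll
print(aic, bic)
```
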


### **Evaluate model performance** (Logistic Regression)

Evaluate the model with accuracy and F1 score

In [40]:
CrossValidation(Class_algorithm=model1,x=x,y=y,metrics_function=accuracy_score,nb_k_fold=4)

algorithm did  converge under 300 iterations (at 92 iterations)
algorithm did  converge under 300 iterations (at 93 iterations)
algorithm did  converge under 300 iterations (at 88 iterations)
algorithm did  converge under 300 iterations (at 136 iterations)


[0.848, 0.824, 0.88, 0.792]

In [41]:
CrossValidation(Class_algorithm=model1,x=x,y=y,metrics_function=f1_score,nb_k_fold=4)

algorithm did  converge under 300 iterations (at 93 iterations)
algorithm did  converge under 300 iterations (at 92 iterations)
algorithm did  converge under 300 iterations (at 107 iterations)
algorithm did  converge under 300 iterations (at 117 iterations)


[0.8489208633093526, 0.8214285714285714, 0.84375, 0.7857142857142857]
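
As a reminder of what the two metrics measure, here is a minimal sketch of accuracy and F1 for binary labels. The functions `accuracy` and `f1` are written for this example and are not the `accuracy_score` and `f1_score` from `_metrics`; the label vectors are made up.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of observations whose predicted label equals the true label."""
    return float(np.mean(y_true == y_pred))

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return float(2 * tp / (2 * tp + fp + fn)) if tp else 0.0

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
print(accuracy(y_true, y_pred), f1(y_true, y_pred))
```

With balanced classes the two numbers stay close, as in the cross-validation results above; they diverge when one class dominates.
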

### **Automatic selection of variables** (Logistic Regression)

In [43]:
index_cols=model1.autoselection("forward","BIC_ll",print_message=False)
index_cols

array([1, 4, 0, 2, 3])

In [44]:
index_cols=model1.autoselection("backward","BIC_ll",print_message=False)
index_cols

array([0, 1, 2, 3, 4])

In [45]:
index_cols=model1.autoselection("stepwise","BIC_ll",print_message=False)
index_cols

array([0, 1, 4, 2, 3])

As you can see, in this case we get 5 variables, which are the 5 true variables; however, this dataset was itself generated with only 5 variables and loaded from a file

Let's see how the selection performs on another generated dataset that contains 5 true variables and 45 false ones

### **Generate 50 variables and autoselection**

In [52]:
x,y=dg.gen(1,"logistic")

In [53]:
model_new=LogisticRegression(solver="nr")
model_new.fit(x,y)

algorithm did  converge under 100 iterations (at 8 iterations)


array([-1.56837360e-01,  3.67883363e+00,  1.62341681e+00,  5.93719636e-01,
       -3.95606267e-01, -5.11437535e+00,  1.90318974e-01,  4.69442650e-01,
       -4.45125240e-01, -5.27976669e-01,  9.18777110e-03, -1.80486442e-01,
        2.37919462e-01,  3.55073682e-01,  3.66236488e-02, -6.38740704e-01,
       -6.15930693e-02, -4.76787299e-01,  3.38033983e-01, -2.19116579e-02,
       -2.84375692e-01,  2.88808199e-01,  9.87652589e-02,  2.06832868e-01,
       -1.16128640e-01, -1.52364107e-01, -1.16953540e-01, -2.93071059e-01,
        9.38366223e-02,  3.52271059e-01, -2.63197499e-02,  2.72861314e-01,
        1.89098584e-01,  2.35039450e-03,  3.11720949e-01,  2.02489918e-01,
        1.04633637e-01,  1.44181368e-01, -2.84392912e-01, -1.59237114e-01,
        1.21269223e-01,  2.84104618e-01, -4.97606259e-01, -5.07867942e-01,
        2.01110001e-01,  1.62674253e-01, -7.04591994e-02, -2.49935763e-01,
       -2.75760466e-01,  6.79887743e-02, -3.13260493e-01])

In [54]:
model_new.autoselection("forward","BIC_ll",print_message=False)

array([4, 0, 1, 3, 2])

In [55]:
model_new.autoselection("stepwise","BIC_ll",print_message=False)

array([3, 4, 0, 1, 2])

In [14]:
model_new.autoselection("backward","BIC_ll",print_message=False)

array([0, 1, 2, 3, 4])

Despite the 50 variables in the dataset, the selection kept only the 5 true variables. All 3 methods selected the same set of variables (in a different order), which is the expected behaviour.

**Let's now generalise this approach to multiclass classification, using the OVR algorithm and then softmax regression**

## **Logistic Regression OVR**

### **Fit and predict model**

Load data

In [56]:
table=pd.read_csv("data/exampleLRMulti.csv")

Get rid of unnamed columns

In [57]:
table.drop(table.columns[table.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)

Once again the classes are evenly distributed, so we can assess model performance with the accuracy score

In [58]:
table["column_Y"].value_counts()

column_Y
1    336
0    333
2    331
Name: count, dtype: int64

In [59]:
x,y=np.array(table.drop(["column_Y"],axis=1)),np.array(table["column_Y"])

Estimate a multiclass logistic regression with the One-vs-Rest (OVR) algorithm and Newton-Raphson optimisation

Since this is OVR, the binary model is fitted 3 times (once per class), which gives 3 columns of estimated parameters

In [91]:
modelOVR=LogisticRegression(solver="nr",multiclass="ovr")
modelOVR.fit(x,y)

algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)


array([[-1.01498187, -1.45620846, -0.78718645],
       [ 0.34446018,  0.56751748, -0.79115877],
       [ 0.22977187, -0.65556071,  0.56300868],
       [ 0.05302252, -0.03242775, -0.03517929],
       [ 0.06803143,  0.15324563, -0.1736295 ],
       [-0.44905806,  0.79972167, -0.42289584],
       [ 0.02949896, -0.09727438,  0.07434896],
       [-0.64703364,  0.23914131,  0.39520493],
       [-0.03647586,  0.15863365, -0.09606587],
       [ 0.6808661 , -0.45149573, -0.23001672],
       [-0.01179362, -0.05067621,  0.03836354]])
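
The structure of this output (one column of parameters per class) follows directly from how One-vs-Rest works. The sketch below is an assumption-laden illustration: `fit_binary_newton` and `fit_ovr` are hypothetical helpers, the data are simulated, and a fixed number of Newton steps replaces whatever convergence check the library uses.

```python
import numpy as np

def fit_binary_newton(X, y, n_iter=25):
    """Newton-Raphson for binary logistic regression (fixed number of steps)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (p - y)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta -= np.linalg.solve(hess, grad)
    return beta

def fit_ovr(x, labels):
    """One column of parameters per class, each from a class-vs-rest fit."""
    X = np.column_stack([np.ones(len(labels)), x])
    classes = np.unique(labels)                    # ascending class order
    params = np.column_stack([fit_binary_newton(X, (labels == c).astype(float))
                              for c in classes])
    proba = 1.0 / (1.0 + np.exp(-X @ params))      # not normalised across classes
    one_hot_pred = np.eye(len(classes))[proba.argmax(axis=1)]
    return params, proba, one_hot_pred

rng = np.random.default_rng(3)
x_demo = rng.standard_normal((600, 4))
W = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [0.5, 0.5, -1.0]])
scores = np.exp(x_demo @ W)
cum = np.cumsum(scores / scores.sum(axis=1, keepdims=True), axis=1)
labels = (rng.random(600)[:, None] > cum[:, :-1]).sum(axis=1)   # sample one class per row
params, proba, one_hot_pred = fit_ovr(x_demo, labels)
print(params.shape, one_hot_pred[:3])
```
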

We can also predict. The prediction columns are always in ascending class order:

- the first column is class 0
- the second column is class 1
- the third column is class 2


In [92]:
predictionsOVR=modelOVR.predict(x)
predictionsOVR 

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])

We will compare these predictions to true one hot encoded labels

In [93]:
modelOVR.y

array([[0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       ...,
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1]])

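Of course, we can always get the class probabilities. For OVR they are not normalised: each column comes from a separate binary classifier, so a row need not sum to 1. The predicted labels, however, are one-hot, so each row of the prediction has exactly one class.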

In [97]:
modelOVR.proba

array([[0.14660444, 0.94592762, 0.01074276],
       [0.26867955, 0.36961135, 0.19349834],
       [0.0519826 , 0.32131164, 0.59426132],
       ...,
       [0.02941952, 0.15974979, 0.77393711],
       [0.26880729, 0.33355668, 0.22603021],
       [0.03407864, 0.33402281, 0.64490065]])
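
The point about normalisation can be checked on the first three rows printed above: the rows do not sum to 1, and dividing each row by its sum rescales the probabilities without changing which class has the largest value.

```python
import numpy as np

# first three rows of modelOVR.proba as printed above
proba = np.array([[0.14660444, 0.94592762, 0.01074276],
                  [0.26867955, 0.36961135, 0.19349834],
                  [0.0519826 , 0.32131164, 0.59426132]])

print(proba.sum(axis=1))                                  # row sums differ from 1
normalised = proba / proba.sum(axis=1, keepdims=True)
print(normalised.sum(axis=1))                             # each row now sums to 1
print(proba.argmax(axis=1), normalised.argmax(axis=1))    # same predicted classes
```
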

Get the information criteria

In [95]:
modelOVR.get_inference(only_IC=True)

{'LL': -738.4500205744612,
 'AIC_ll': 1498.9000411489224,
 'BIC_ll': 1552.885349217726}

Get further inference, such as significance tests

In [67]:
modelOVR.get_inference(only_IC=False)

ValueError: not done yet, incoming

This error is expected: this part is not developed yet.

### **Evaluate model performance** (Log OVR)

We can evaluate accuracy

In [76]:
accuracy_score(predictionsOVR, modelOVR.y)

0.695

For a more reliable evaluation, we use cross-validation

In [80]:
acc_list=CrossValidation(Class_algorithm=modelOVR,x=x,y=modelOVR.y,metrics_function=accuracy_score,nb_k_fold=4)

algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)
algorithm did  converge under 100 iterations (at 5 iterations)


By averaging we get more "robust" accuracy 

In [81]:
print(acc_list)
np.mean(acc_list)

[0.72, 0.66, 0.684, 0.648]


0.678

### **Automatic selection of variables** (Log OVR)

We can also do variable selection as before; in this case the log-likelihood is the generalised (multiclass) cross-entropy taken with a positive sign

In [98]:
modelOVR.autoselection("forward","BIC_ll",print_message=False)


array([4, 8, 0, 1, 6])

In [99]:
modelOVR.autoselection("backward","BIC_ll",print_message=False)


array([0, 1, 4, 6, 8])

In [100]:
col_index=modelOVR.autoselection("stepwise","BIC_ll",print_message=False)
col_index

array([6, 1, 8, 0, 4])

## **Logistic Regression Softmax**

**Now let's look at another way of solving the multiclass problem without running P independent classifiers: softmax (multinomial logistic) regression**

### **Fit and predict model** (Log SFTMX)

In [82]:
modelSFT=LogisticRegression(solver="gd",multiclass="softmax")  # don't use "nr" for the moment
modelSFT.fit(x,y)

algorithm did  converge under 100 iterations (at 50 iterations)


array([[ 0.04194506, -0.27091767,  0.22897262],
       [ 0.23740402,  0.34859258, -0.58599659],
       [ 0.08908508, -0.4494452 ,  0.36036012],
       [ 0.0417117 , -0.02254157, -0.01917013],
       [ 0.03704339,  0.08616513, -0.12320852],
       [-0.27315999,  0.56539992, -0.29223993],
       [ 0.01838198, -0.07586092,  0.05747895],
       [-0.47306334,  0.21607495,  0.2569884 ],
       [-0.03294754,  0.10874983, -0.07580229],
       [ 0.48652451, -0.33256412, -0.15396038],
       [-0.00460555, -0.02954451,  0.03415006]])
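
Unlike OVR, softmax regression fits all classes jointly, and the predicted probabilities sum to 1 by construction. The following is a minimal gradient-descent sketch on simulated data. The names `softmax` and `fit_softmax_gd`, the learning rate and the fixed iteration count are assumptions for this example and do not describe the internals of `LogisticRegression(multiclass="softmax")`.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_gd(x, labels, learning_rate=0.5, n_iter=500):
    """Gradient descent on the mean multiclass cross-entropy."""
    X = np.column_stack([np.ones(len(labels)), x])
    Y = np.eye(labels.max() + 1)[labels]           # one-hot targets
    B = np.zeros((X.shape[1], Y.shape[1]))         # one column per class
    losses = []
    for _ in range(n_iter):
        P = softmax(X @ B)
        losses.append(float(-np.mean(np.sum(Y * np.log(P), axis=1))))
        B -= learning_rate * X.T @ (P - Y) / len(labels)
    return B, softmax(X @ B), losses

rng = np.random.default_rng(4)
x_demo = rng.standard_normal((600, 4))
W = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [0.5, 0.5, -1.0]])
cum = np.cumsum(softmax(x_demo @ W), axis=1)
labels = (rng.random(600)[:, None] > cum[:, :-1]).sum(axis=1)   # sample one class per row
B, P, losses = fit_softmax_gd(x_demo, labels)
print(B.shape, losses[0], losses[-1])
```
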

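We can compare them to the modelOVR parameters. (The output below has 6 rows rather than 11, presumably because it comes from a run in which the model had been fitted on 5 selected variables.)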

In [83]:
modelOVR.params

array([[-1.00534408, -1.43558182, -0.78699103],
       [ 0.33985494,  0.55888912, -0.78313487],
       [-0.45253769,  0.79424587, -0.42085745],
       [-0.64951794,  0.23943091,  0.39746635],
       [ 0.22901814, -0.65214302,  0.55932286],
       [ 0.68185011, -0.44564961, -0.23015263]])

We can also predict

In [84]:
modelSFT.predict(x)

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])

### **Automatic selection of variables** (Log SFTMX)

In [87]:
modelSFT.autoselection("forward","BIC_ll",print_message=False)

array([4, 0, 1, 8, 6])

In [89]:
modelSFT.autoselection("stepwise","BIC_ll",print_message=False)

array([8, 0, 1, 4, 6])

In [90]:
modelSFT.autoselection("backward","BIC_ll",print_message=False)

array([0, 1, 4, 6, 8])

### **Evaluate model performance** (Log SFTMX)

In [86]:
meansftmx=CrossValidation(Class_algorithm=modelSFT,x=x,y=modelSFT.y,metrics_function=accuracy_score,nb_k_fold=4)
np.mean(meansftmx)

algorithm did  converge under 100 iterations (at 65 iterations)
algorithm did  converge under 100 iterations (at 59 iterations)
algorithm did  converge under 100 iterations (at 62 iterations)
algorithm did  converge under 100 iterations (at 62 iterations)


0.679

Further steps:

- introduce Newton-Raphson for softmax regression
- add a softmax deep dive on K versus K-1 (with a reference class) columns of the B estimate
- explore further inference statistics (odds ratios, inference for multinomial logistic regression, average marginal effects)
- add regularisation (lasso, ridge, elastic net)
- do a real project using all the algorithms developed here



In [118]:
modelSFT=LogisticRegression(solver="nr",multiclass="softmax")  # "nr" is not implemented for softmax yet, so fit raises an error
modelSFT.fit(x,y)

ValueError: not done yet,incoming

This error is expected, as Newton-Raphson for softmax regression is not developed yet