In [1]:
import sys
sys.path.append('..')  # add the parent directory to the import path
from _metrics import CrossValidation, accuracy_score
from _logistic import LogisticRegression
import numpy as np
import pandas as pd
import _dgp as dg


## **Data loading and preparation**

I generate a dataset with 50 explanatory variables, of which only the first 5 are true variables (they actually drive the target), and a binary target variable (2 classes).

In [2]:
np.random.seed(0)
x, y = dg.gen(type=1, regression="logistic")
columns = [f"column_{i}" for i in range(1, 51)]  # names for the 50 variables
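
`dg.gen` comes from the local `_dgp` module, whose internals are not shown here. Purely as an illustration, a data-generating process of the kind described above could look like the sketch below. The function name `gen_logistic_sketch`, the coefficient values and the standard-normal features are all assumptions, not the module's actual code.

```python
import numpy as np

def gen_logistic_sketch(n=500, p=50, seed=0):
    """Hypothetical DGP: only the first 5 columns influence y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = [2.0, 1.0, 0.5, -0.5, -3.0]      # assumed values, illustration only
    prob = 1.0 / (1.0 + np.exp(-(x @ beta)))    # P(y = 1 | x)
    y = rng.binomial(1, prob)                   # binary target
    return x, y, beta

x_demo, y_demo, beta_demo = gen_logistic_sketch()
print(x_demo.shape, y_demo.shape, np.flatnonzero(beta_demo))
```

The 45 remaining columns are pure noise, which is what makes the later variable-selection step meaningful.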

## **Estimation**

We can now apply a simple gradient descent (GD) optimisation algorithm to estimate the parameter vector B of the logistic regression.

In [3]:
model1=LogisticRegression(solver="gd",add_intercept=True,)
model1.fit(x,y)[0:5]

algorithm did not converge under 100 iterations,so the calculated parameter is biased


array([[-0.26201153],
       [ 1.85141827],
       [ 0.8778802 ],
       [ 0.3764929 ],
       [-0.27101373]])
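
To make the convergence message concrete, here is a minimal sketch of batch gradient descent for logistic regression with an explicit stopping rule. It is not the code of `_logistic.LogisticRegression`: the function `fit_gd`, the tolerance, the stopping rule and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(x, y, learning_rate=0.1, max_iter=100, tol=1e-6):
    """Batch gradient descent on the mean negative log-likelihood.

    Returns (beta, converged, n_iter)."""
    X = np.column_stack([np.ones(len(x)), x])      # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for it in range(1, max_iter + 1):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)
        beta_new = beta - learning_rate * grad
        if np.max(np.abs(beta_new - beta)) < tol:  # assumed stopping rule
            return beta_new, True, it
        beta = beta_new
    return beta, False, max_iter

# Toy data (made up): 3 features, the last one irrelevant
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 3))
y_toy = rng.binomial(1, sigmoid(x_toy @ np.array([1.5, -1.0, 0.0])))

results = {}
for budget in (100, 20000):
    beta, converged, n_iter = fit_gd(x_toy, y_toy, max_iter=budget)
    results[budget] = (converged, n_iter)
    print(budget, converged, n_iter)
```

With a small iteration budget the loop stops before the updates have become negligible, so the returned vector is only an intermediate point on the way to the optimum.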

**Algorithm could not converge:**
If you get this message, **NEVER** proceed to further steps, because the algorithm stopped before reaching the optimum and the estimated parameter is therefore biased.

Instead, try other hyperparameters or optimisation algorithms until the algorithm converges.


For example, stochastic gradient descent (SGD) with a different learning rate:

In [145]:
model2=LogisticRegression(solver="sgd",learning_rate=0.002,add_intercept=True,)
model2.fit(x,y)[0:5]

algorithm did not converge under 1600 iterations,so the calculated parameter is biased


array([[-0.35374579],
       [ 2.21888993],
       [ 1.08833334],
       [ 0.44701799],
       [-0.32078771]])

Or Newton-Raphson (NR). Before using it, please verify that the inverse of `x.T@x` exists:

In [146]:
np.linalg.det(x.T@x)

2.7458538617650583e+133
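
A non-zero determinant is a weak invertibility test in floating point, because its magnitude depends heavily on the scale and dimension of the matrix. As a complementary check, the sketch below uses `np.linalg.matrix_rank` and `np.linalg.cond`. The two toy matrices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_ok = rng.normal(size=(500, 5))

# Make the last column an exact copy of the first: x_bad.T @ x_bad is singular
x_bad = x_ok.copy()
x_bad[:, 4] = x_bad[:, 0]

for name, x_mat in [("full rank", x_ok), ("collinear", x_bad)]:
    gram = x_mat.T @ x_mat
    rank = np.linalg.matrix_rank(x_mat)   # should equal the number of columns
    cond = np.linalg.cond(gram)           # huge values signal near-singularity
    print(f"{name}: rank={rank}/{x_mat.shape[1]}, cond={cond:.3g}")
```

A rank equal to the number of columns together with a moderate condition number is a more reliable sign that the matrix can be inverted safely.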

In [147]:
model3=LogisticRegression(solver="nr",learning_rate=0.001,add_intercept=True,)
model3.fit(x,y)[0:5]

algorithm did  converge under 100 iterations (at 7 iterations)


array([[-0.50411425],
       [ 2.81552631],
       [ 1.419166  ],
       [ 0.55959778],
       [-0.40066804]])

As we can see, NR optimisation converges in far fewer iterations than simple GD (7 iterations, whereas GD had not converged after 100), so we will continue using `model3` in the next steps.
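
For intuition about why so few iterations are needed, here is a minimal Newton-Raphson sketch for logistic regression. Each step uses curvature information (the matrix X'WX) instead of a fixed learning rate. The function `fit_newton`, its tolerance and the toy data are assumptions, not the notebook's `solver="nr"` implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(x, y, max_iter=100, tol=1e-8):
    """Newton-Raphson for logistic regression. Returns (beta, n_iter)."""
    X = np.column_stack([np.ones(len(x)), x])       # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for it in range(1, max_iter + 1):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # X' W X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:              # assumed stopping rule
            return beta, it
    raise RuntimeError("Newton-Raphson did not converge")

# Toy data (made up)
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(500, 3))
y_toy = rng.binomial(1, sigmoid(0.5 + x_toy @ np.array([1.5, -1.0, 0.0])))

beta_hat, n_iter = fit_newton(x_toy, y_toy)
print(n_iter, beta_hat.round(3))
```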

## **Prediction**

The prediction is obtained by applying the softmax activation function to the linear predictor `x@B`:

In [148]:
predictions=model3.predict(x)
model3.proba[0:5]

array([[2.06658740e-03, 9.97933413e-01],
       [2.80964489e-01, 7.19035511e-01],
       [9.99366221e-01, 6.33778571e-04],
       [8.66956196e-01, 1.33043804e-01],
       [4.42909596e-02, 9.55709040e-01]])

The first column is the probability of observing class 1, the second column the probability of observing class 0.

We then compare the two probabilities and assign the class with the higher one.
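
The class assignment can be reproduced by hand from the five probability rows printed above. The second half of the sketch checks a general fact: with two classes, a softmax over the scores (z, 0) is the same as the sigmoid of z. The `z` values there are made up.

```python
import numpy as np

# First five rows of model3.proba: column 0 = P(class 1), column 1 = P(class 0)
proba = np.array([
    [2.06658740e-03, 9.97933413e-01],
    [2.80964489e-01, 7.19035511e-01],
    [9.99366221e-01, 6.33778571e-04],
    [8.66956196e-01, 1.33043804e-01],
    [4.42909596e-02, 9.55709040e-01],
])

# Assign class 1 when its probability is the larger of the two
pred = (proba[:, 0] > proba[:, 1]).astype(float)
print(pred)   # [0. 0. 1. 1. 0.]

# Two-class softmax over (z, 0) versus sigmoid(z)
z = np.array([-2.0, 0.0, 3.0])
scores = np.column_stack([z, np.zeros_like(z)])
softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(softmax[:, 0], sigmoid))   # True
```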

**Once we have predictions, we have to evaluate the model "quality". This can be done using several approaches**:
 - Evaluating metrics such as accuracy, F1 score, etc.
 - Evaluating information criteria such as AIC and BIC, or goodness-of-fit measures such as adjusted R2.
 - Evaluating the significance of the B estimator, which is the traditional econometrics approach.

In this section we will cover the first method; the other two are presented in the next section, **"Model Interpretability"**.

So, here we want to see whether our predictions are correct, which is why we compare the real values with the predicted values.

In [149]:
#predicted values
predictions[0:10]

array([0., 0., 1., 1., 0., 1., 0., 0., 1., 0.])

In [150]:
#real values
y[0:10]

array([0, 1, 1, 0, 0, 0, 0, 1, 1, 0])

Before doing the comparison, we have to check whether the classes {0,1} are balanced.

If they are, `accuracy_score` is enough; otherwise we should use other metrics such as `f1_score`, `recall_score` or `precision_score`.
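
As a small illustration of the balance check and of how these metrics are defined, the sketch below computes them by hand with NumPy on the ten real and predicted labels printed above. It does not call the `_metrics` module, and ten observations are far too few to judge the model; the point is only to show the formulas.

```python
import numpy as np

# First ten real and predicted labels shown above
y_true = np.array([0, 1, 1, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])

# Class balance: share of each label among the real values
labels, counts = np.unique(y_true, return_counts=True)
balance = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives

accuracy = float(np.mean(y_true == y_pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(balance, accuracy, precision, recall, f1)
```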

In [151]:
accuracy_score(model3.predict(x),y)

0.912

This one-shot evaluation is good, but it is computed on the same data that was used for fitting and is not robust against heterogeneity in the data, so we will couple it with the `CrossValidation` technique and then average the CV scores.

In [152]:
# the number of folds is 4, so 4 models are fitted
CV_scores=CrossValidation(Class_algorithm=model3,x=x,y=y,metrics_function=accuracy_score,nb_k_fold=4)
print(CV_scores)
np.mean(CV_scores)

algorithm did  converge under 100 iterations (at 8 iterations)
algorithm did  converge under 100 iterations (at 8 iterations)
algorithm did  converge under 100 iterations (at 8 iterations)
algorithm did  converge under 100 iterations (at 8 iterations)
[0.792, 0.808, 0.856, 0.824]


0.82
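
The idea behind k-fold cross-validation can be sketched in a few lines. This is not the `CrossValidation` function from `_metrics`: the helper `k_fold_scores`, the stand-in classifier `nearest_mean` and the toy data are all assumptions made for illustration.

```python
import numpy as np

def k_fold_scores(fit_predict, x, y, metric, k=4, seed=0):
    """Hypothetical k-fold CV: train on k-1 folds, score on the held-out fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_hat = fit_predict(x[train], y[train], x[test])
        scores.append(metric(y_hat, y[test]))
    return scores

def nearest_mean(x_train, y_train, x_test):
    """Stand-in classifier: predict the class whose training mean is closest."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(x_test - m0, axis=1)
    d1 = np.linalg.norm(x_test - m1, axis=1)
    return (d1 < d0).astype(int)

def accuracy(y_hat, y_true):
    return float(np.mean(y_hat == y_true))

# Toy data (made up)
rng = np.random.default_rng(1)
x_toy = rng.normal(size=(200, 2))
y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

cv = k_fold_scores(nearest_mean, x_toy, y_toy, accuracy, k=4)
print(cv, np.mean(cv))
```

Because every observation is used for testing exactly once and never by the model that was trained on it, the averaged score is a fairer estimate of out-of-sample accuracy than the in-sample figure.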

## **Model Interpretability**

Logistic Regression is very popular because its coefficients can be interpreted much as in Linear Regression. So we can do that too:

In [159]:
model3.get_inference(only_IC=False,ordered_columns_names=columns)

| | coef | std | z value | p value | p value star | odds ratio |
|---|---|---|---|---|---|---|
| intercept_1 | -0.5041 | 0.2 | -2.56 | 0.011 | ** | 0.6 |
| column_1_1 | 2.8155 | 0.32 | 8.72 | 0.0 | *** | 16.7 |
| column_2_1 | 1.4192 | 0.24 | 6.0 | 0.0 | *** | 4.13 |
| column_3_1 | 0.5596 | 0.21 | 2.61 | 0.009 | *** | 1.75 |
| column_4_1 | -0.4007 | 0.19 | -2.1 | 0.037 | ** | 0.67 |
| column_5_1 | -4.0397 | 0.45 | -8.92 | 0.0 | *** | 0.02 |
| column_6_1 | 0.3089 | 0.2 | 1.52 | 0.129 | | 1.36 |
| column_7_1 | -0.7026 | 0.23 | -3.11 | 0.002 | *** | 0.5 |
| column_8_1 | -0.1147 | 0.2 | -0.58 | 0.565 | | 0.89 |
| column_9_1 | 0.0897 | 0.19 | 0.48 | 0.629 | | 1.09 |
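
The derived columns of this table can be recomputed from `coef` and `std`, assuming the usual Wald construction: z = coef / std, a two-sided p-value under the standard normal distribution, and odds ratio = exp(coef). The helper `wald_summary` is hypothetical. Because `std` is rounded to two decimals in the table, the recomputed z and p values differ slightly from the printed ones.

```python
import math

# (coef, std) pairs read from the table above
rows = {
    "intercept_1": (-0.5041, 0.20),
    "column_1_1": (2.8155, 0.32),
    "column_5_1": (-4.0397, 0.45),
    "column_8_1": (-0.1147, 0.20),
}

def wald_summary(coef, std):
    z = coef / std
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under N(0, 1)
    odds_ratio = math.exp(coef)
    return z, p_value, odds_ratio

summary = {name: wald_summary(c, s) for name, (c, s) in rows.items()}
for name, (z, p, o) in summary.items():
    print(f"{name:12s} z={z:6.2f}  p={p:.3f}  odds ratio={o:.2f}")
```

An odds ratio above 1 means that a one-unit increase in the variable multiplies the odds of class 1 by that factor; a value below 1 shrinks them.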


In [154]:
model3.get_inference(only_IC=True)

{'LL': array([-113.97960799]),
 'AIC_ll': array([329.95921597]),
 'BIC_ll': array([544.90422899])}
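
These values are consistent with the standard definitions AIC = -2·LL + 2k and BIC = -2·LL + k·ln(n). In the check below, k = 51 (50 variables plus the intercept) and n = 500 observations are inferred from the rest of the notebook rather than reported by `get_inference`.

```python
import math

log_lik = -113.97960799   # 'LL' reported above
k = 51                    # assumed: 50 variables + intercept
n = 500                   # assumed: number of observations

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * math.log(n)
print(round(aic, 4), round(bic, 4))
```

BIC penalises each parameter by ln(500) ≈ 6.2 instead of 2, which is why it is so much larger than AIC here and why it favours smaller models.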

The coefficient table above shows that we modelled the probability of belonging to class 1.

All of the first 5 variables, which are the true variables (conditional on our DGP), are significant, but some false variables are significant too (for example `column_7`), so we would like to select the variables automatically.



We can also compare these results with the implementation in `statsmodels`:

In [155]:
import statsmodels.api as sm

logit_model = sm.Logit(pd.DataFrame(y), sm.add_constant(pd.DataFrame(x)))
result = logit_model.fit()
# Print summary of the model
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.227959
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                      0   No. Observations:                  500
Model:                          Logit   Df Residuals:                      449
Method:                           MLE   Df Model:                           50
Date:                Sat, 06 Apr 2024   Pseudo R-squ.:                  0.6709
Time:                        19:38:19   Log-Likelihood:                -113.98
converged:                       True   LL-Null:                       -346.38
Covariance Type:            nonrobust   LLR p-value:                 1.303e-68
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.5041      0.197     -2.560      0.010      -0.890      -0.118
0              2.8155      0.
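
Two figures in the summary header can be tied back to the log-likelihood reported earlier: the "Current function value" is the negative log-likelihood divided by the number of observations, and the pseudo R-squared is McFadden's 1 - LL / LL-Null. A quick check with the numbers printed above:

```python
log_lik = -113.97960799   # 'LL' from get_inference(only_IC=True)
log_lik_null = -346.38    # LL-Null from the summary (rounded there)
n_obs = 500               # No. Observations

mean_neg_log_lik = -log_lik / n_obs       # "Current function value"
pseudo_r2 = 1 - log_lik / log_lik_null    # McFadden's pseudo R-squared
print(round(mean_neg_log_lik, 6), round(pseudo_r2, 4))
```

Both match the summary, and the intercept and first coefficient agree with `model3` as well, so the two implementations reach the same maximum-likelihood solution.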

## **Model Selection**

In this section, we will try to find the best model (set of variables), that is, the one that minimises AIC or BIC.

To obtain this set of variables, we will test different algorithms:
- Backward regression
- Forward regression 
- Stepwise regression

This is done to reduce the risk of overfitting, shrink the model and find the true variables, that is, those that really affect our target variable.
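
The internals of `autoselection` are not shown in this notebook. As a rough illustration of the idea, the sketch below implements a greedy forward selection that keeps adding the variable with the largest BIC reduction and stops when no candidate improves it. The functions `log_lik`, `bic` and `forward_selection` and the toy data are assumptions, not the class's actual code.

```python
import numpy as np

def log_lik(x, y, n_iter=25):
    """Maximised logistic log-likelihood (intercept always included), via Newton steps."""
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def bic(x, y, cols):
    k = len(cols) + 1                      # selected variables + intercept
    return -2 * log_lik(x[:, cols], y) + k * np.log(len(y))

def forward_selection(x, y):
    """Greedy search: add the variable that lowers BIC most, stop when none does."""
    selected, best = [], bic(x, y, [])
    remaining = list(range(x.shape[1]))
    while remaining:
        scores = {j: bic(x, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

# Toy data (made up): 8 variables, only the first 3 matter
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(500, 8))
true_beta = np.array([1.5, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_toy = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x_toy @ true_beta))))

selected = forward_selection(x_toy, y_toy)
print(selected)
```

Backward elimination follows the same pattern in reverse (start from all variables and drop the one whose removal lowers the criterion most), and a stepwise search alternates between the two moves.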

In [156]:
index_cols=model3.autoselection("backward","BIC_ll",print_message=False)
index_cols

array([0, 1, 2, 4, 6])

In [157]:
index_cols=model3.autoselection("forward","BIC_ll",print_message=False)
index_cols

array([4, 0, 1, 2, 6])

In [158]:
index_cols=model3.autoselection("stepwise","BIC_ll",print_message=False)
index_cols

array([6, 4, 0, 1, 2])

As you can see, all three methods select the same set of variables, which includes 4 of the 5 true variables. Unfortunately, the selection kept one false variable and dropped one true variable (indices 6 and 3 respectively, reading the indices as 0-based column positions), which can be seen as a mix of overfitting and underfitting.