# 3. Linear Model

In this notebook, I start the inferential analysis by fitting a Logistic Regression as the baseline model.

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split, StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import roc_auc_score

In [2]:
df_train = pd.read_csv("Working_datasets/processed_train.csv")
df_train.head()

Unnamed: 0,TARGET_FLAG,AGE,HOME_VAL,OLDCLAIM,CAR_AGE,Doctor,Home Maker,Lawyer,Manager,Professional,Student,Panel Truck,Sports Car,Van,PCA_FACTOR
0,0.0,60.0,1.0,8.403128,18.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.972189
1,0.0,43.0,12.457811,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.125707
2,0.0,35.0,11.729576,10.563336,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.612868
3,0.0,51.0,12.63216,1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.027922
4,0.0,50.0,12.404616,9.863551,17.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.80756


## Premodelling process

At this stage, I standardize the numeric features, which many machine learning estimators require or benefit from. For Logistic Regression in particular, putting the features on a common scale helps the solver converge faster and lets the regularization penalty act evenly across coefficients.

In [3]:
y_variable = 'TARGET_FLAG'

y = df_train[y_variable].values
X = df_train.drop(columns=y_variable)  # note: the saved index column 'Unnamed: 0' stays in X as a feature

In [4]:
numeric_features = ['AGE',
                    'HOME_VAL',
                    'OLDCLAIM',
                    'CAR_AGE',
                    'PCA_FACTOR']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),],
    remainder='passthrough')
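To make the effect of the preprocessor concrete, here is a minimal sketch on a toy frame. The values and the names `toy`, `toy_preprocessor` and `toy_scaled` are made up for illustration and are not part of the notebook's data. It shows that the listed numeric columns end up with mean 0 and unit variance, while the remaining dummy column is passed through unchanged and appended after the scaled ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): two numeric columns and one dummy column.
toy = pd.DataFrame({
    "AGE": [60.0, 43.0, 35.0, 51.0],
    "CAR_AGE": [18.0, 1.0, 10.0, 6.0],
    "Doctor": [0.0, 0.0, 1.0, 0.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["AGE", "CAR_AGE"])],
    remainder="passthrough",
)
toy_scaled = toy_preprocessor.fit_transform(toy)

# Scaled columns come first; passthrough columns are appended after them.
scaled_means = toy_scaled[:, :2].mean(axis=0)
scaled_stds = toy_scaled[:, :2].std(axis=0)
dummy_unchanged = np.array_equal(toy_scaled[:, 2], toy["Doctor"].to_numpy())

print(scaled_means, scaled_stds, dummy_unchanged)
```

Because the scaler sits inside the pipeline, its mean and standard deviation are re-estimated on the training part of each cross-validation fold, so no information from the validation part leaks into the scaling.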

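I split the data set into training (90%) and testing (10%) sets, stratified on the target. I also use stratified k-fold cross-validation (4 folds) on the training set to search over hyperparameters.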

In [5]:
random_state = 123
n_splits = 4

X_training, X_testing, y_training, y_testing \
= train_test_split(X, y, test_size=0.1, random_state=random_state, stratify=y)

kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
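As a rough illustration of what stratification does, the sketch below repeats the same split on synthetic data with a 25% positive rate. The data and the names `X_toy`, `y_toy` and `fold_rates` are assumptions made for this example only. The positive rate should stay close to 25% in the training set, the test set and every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic data (assumption): 1000 rows, 25% positives.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 250 + [0] * 750)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=123, stratify=y_toy)

train_rate = y_tr.mean()
test_rate = y_te.mean()

# Positive rate inside each validation fold of the training set.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=123)
fold_rates = [y_tr[val_idx].mean() for _, val_idx in skf.split(X_tr, y_tr)]

print(train_rate, test_rate, fold_rates)
```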

## Model definition
I use Logistic Regression as the baseline model because it is one of the simplest models for binary classification.

In [6]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),                # Scale the numeric features
                      ('clf', LogisticRegression(random_state=1,      # Baseline classifier
                                                solver='liblinear',   # liblinear supports both l1 and l2 penalties
                                                max_iter=300))])

## Model Training
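I use GridSearchCV to find the best hyperparameters, with AUC as the performance metric. The grid covers both penalties (l1 and l2) and 100 values of C spaced logarithmically between 0.01 and 1000.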

In [7]:
param_grid = dict(clf__penalty = ['l1', 'l2'],
                  clf__C       = np.logspace(-2, 3, 100))

grid = GridSearchCV(pipe, 
                    param_grid=param_grid,
                    cv=kf, 
                    n_jobs=1, 
                    verbose=1,
                    scoring='roc_auc',
                    return_train_score=True)
grid.fit(X_training, y_training)
print(grid.best_score_)

Fitting 4 folds for each of 200 candidates, totalling 800 fits
0.711842245357979
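The best score alone does not say which hyperparameters were selected or how far the training-fold scores sit from the validation-fold scores. The sketch below shows one way to read that from a fitted search. It runs on synthetic data with a much smaller grid, and `demo_pipe`, `demo_grid` and `summary` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), standing in for the notebook's training set.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=300)),
])

demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"clf__penalty": ["l1", "l2"],
                "clf__C": np.logspace(-2, 3, 6)},
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=123),
    scoring="roc_auc",
    return_train_score=True,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate: mean AUC on training folds vs. validation folds.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[["param_clf__penalty", "param_clf__C",
                   "mean_train_score", "mean_test_score"]]
summary = summary.sort_values("mean_test_score", ascending=False)

print(demo_grid.best_params_)
print(summary.head())
```

Note that `best_score_` is the mean validation-fold score of the best candidate, not a training-set score. The `mean_train_score` column, available because `return_train_score=True`, is the one to compare against it when checking for overfitting.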


## Model evaluation

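I evaluate the model on the held-out test set. Note that the score below is computed from the hard class labels returned by grid.predict rather than from predicted probabilities, so it is not directly comparable to the cross-validated AUC above. On this measure the baseline model is weak: it needs a better model, more feature engineering, or further optimization of the parameters.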

In [8]:
y_pred = grid.predict(X_testing)    # hard 0/1 labels from the best estimator
roc_auc_score(y_testing, y_pred)    # AUC computed on labels, not probabilities

0.558663801072287
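The score above is computed from hard labels, whereas the `roc_auc` scorer used in the grid search ranks observations by a continuous score. The sketch below, on synthetic data with hypothetical names (`clf_demo`, `auc_from_labels`, `auc_from_proba`), contrasts the two. With 0/1 predictions the ROC curve has a single interior point, and the resulting area equals the balanced accuracy at the 0.5 threshold. Passing the positive-class probability from `predict_proba` gives the quantity that matches the cross-validated metric, and it is typically higher.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (assumption), not the notebook's data set.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=123, stratify=y_demo)

clf_demo = LogisticRegression(solver="liblinear", max_iter=300).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: equivalent to balanced accuracy.
labels = clf_demo.predict(X_te)
auc_from_labels = roc_auc_score(y_te, labels)
bal_acc = balanced_accuracy_score(y_te, labels)

# AUC from the positive-class probability: comparable to scoring='roc_auc'.
proba = clf_demo.predict_proba(X_te)[:, 1]
auc_from_proba = roc_auc_score(y_te, proba)

print(auc_from_labels, bal_acc, auc_from_proba)
```

Applied to this notebook, the analogous call would be `roc_auc_score(y_testing, grid.predict_proba(X_testing)[:, 1])`. Its value is not reported here because it was not part of the original run.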

## Summary
I use Logistic Regression as the baseline model and a grid search to find the best hyperparameters. The main points are:
* Numeric covariates are standardized to avoid problems in the optimization algorithm.
* Stratified k-fold cross-validation (4 folds) is used to find the best hyperparameters.
* The cross-validated AUC is 0.71 and the test-set score is 0.56. Because the test figure is computed from hard class labels rather than predicted probabilities, the gap may overstate the loss in performance and is not by itself evidence that stronger regularization is required.