# 4. Selected Model

An XGBoost classifier is used, as it is a model that tends to perform well in practical cases.

In [1]:
import os

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split, StratifiedKFold

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import roc_auc_score

In [2]:
df_train = pd.read_csv("Working_datasets/processed_train.csv")
df_train.head()

Unnamed: 0,TARGET_FLAG,AGE,HOME_VAL,OLDCLAIM,CAR_AGE,Doctor,Home Maker,Lawyer,Manager,Professional,Student,Panel Truck,Sports Car,Van,PCA_FACTOR
0,0.0,60.0,1.0,8.403128,18.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.972189
1,0.0,43.0,12.457811,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.125707
2,0.0,35.0,11.729576,10.563336,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.612868
3,0.0,51.0,12.63216,1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.027922
4,0.0,50.0,12.404616,9.863551,17.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.80756


## Premodelling process


At this stage, the numeric features are standardized, which many machine learning estimators require. Tree-based models such as XGBoost are largely insensitive to feature scaling, so here the step mainly keeps the pipeline reusable with other estimators.

In [3]:
y_variable = 'TARGET_FLAG'

y = df_train[y_variable].values
X = df_train.drop(columns=y_variable)

In [4]:
numeric_features = ['AGE',
                    'HOME_VAL',
                    'OLDCLAIM',
                    'CAR_AGE',
                    'PCA_FACTOR']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),],
    remainder='passthrough')
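
As a rough illustration of what the `StandardScaler` step computes, the sketch below standardizes the five `AGE` values from the preview rows using only the standard library. It assumes the population standard deviation (the convention `StandardScaler` follows); the helper name `standardize` is hypothetical and not part of the notebook.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0)
    return [(v - mu) / sigma for v in values]


ages = [60.0, 43.0, 35.0, 51.0, 50.0]  # AGE values from the preview rows
z_ages = standardize(ages)

# After standardization the column has mean ~0 and unit population std
print([round(z, 3) for z in z_ages])
```

In the actual pipeline the mean and standard deviation are learned from the training folds only, which is why the scaler sits inside the `Pipeline` rather than being applied to the whole data set up front.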

I split the data set into training and testing sets, holding out 10% for testing. I also use stratified k-fold cross-validation (4 folds) to search over hyperparameters.

In [5]:
random_state = 1234
n_splits = 4

X_training, X_testing, y_training, y_testing = train_test_split(
    X, y, test_size=0.1, random_state=random_state, stratify=y)

kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
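
The effect of `stratify=y` and `StratifiedKFold` can be checked on synthetic data. This is a minimal sketch, assuming a made-up data set of 1000 rows with 25% positives (not the real class balance); the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data: 1000 rows, 25% positives (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# Same split settings as above: 10% hold-out, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1234, stratify=y_demo)

# Positive rate inside each of the 4 validation folds
kf_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=1234)
fold_rates = [y_tr[val_idx].mean() for _, val_idx in kf_demo.split(X_tr, y_tr)]

print("hold-out positive rate:", y_te.mean())
print("validation-fold positive rates:", np.round(fold_rates, 3))
```

Each subset keeps roughly the same share of positives as the full data, which matters when the target is imbalanced.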

## Model definition
I use an XGBoost classifier to improve the classification performance over the baseline.

In [6]:
model = XGBClassifier(objective='binary:logistic',
                      use_label_encoder=False,
                      eval_metric='auc',
                      random_state=random_state)  # integer seed defined above, not a string

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('clf', model)])

## Model training

I use RandomizedSearchCV to find the best hyperparameters, with ROC AUC as the scoring metric.

In [7]:
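# Search space: number of trees plus L1 (reg_alpha) and L2 (reg_lambda) regularization.
# reg_alpha / reg_lambda are the scikit-learn wrapper names for xgboost's alpha / lambda.
param_grid = dict(clf__n_estimators=np.linspace(100, 300, 6).astype(int),
                  clf__reg_alpha=[0.01, 0.05, 0.1, 0.3, 0.5, 1, 10],
                  clf__reg_lambda=[0.01, 0.05, 0.1, 0.3, 0.5, 1])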

grid = RandomizedSearchCV(pipe,
                          param_distributions=param_grid,
                          cv=kf,
                          verbose=1,
                          scoring='roc_auc',
                          random_state=42,
                          n_iter=100,
                         )
grid.fit(X_training, y_training)
print(grid.best_score_)

Fitting 4 folds for each of 100 candidates, totalling 400 fits
0.7027768641802272
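
To see where the "100 candidates, totalling 400 fits" in the log comes from, the sketch below counts the search space with the standard library. It only mimics the idea of drawing distinct candidates from a finite grid; it does not reproduce the exact candidates that `RandomizedSearchCV` drew, and the variable names are hypothetical.

```python
import itertools
import random

# Candidate values mirroring the search space above
n_estimators = [100 + 40 * i for i in range(6)]  # same values as np.linspace(100, 300, 6)
alphas = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 10]
lambdas = [0.01, 0.05, 0.1, 0.3, 0.5, 1]

# Every combination an exhaustive grid search would have to evaluate
all_candidates = list(itertools.product(n_estimators, alphas, lambdas))
n_iter, n_folds = 100, 4

# Draw n_iter distinct candidates (sampling without replacement)
sampled = random.Random(42).sample(all_candidates, n_iter)

print("grid size:", len(all_candidates))
print("fits for a full grid search:", len(all_candidates) * n_folds)
print("fits for the randomized search:", len(sampled) * n_folds)
```

With 6 x 7 x 6 = 252 combinations, the randomized search evaluates about 40% of the grid, which is the main source of its lower cost here.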


## Model evaluation

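We evaluate the model on the held-out test set, which was never seen during training or the hyperparameter search. We got an improvement compared to the baseline. Note that the score below is computed from hard class predictions rather than predicted probabilities, so it is not directly comparable with the cross-validated AUC above. Further feature engineering could help improve the score.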

In [8]:
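# Note: grid.predict returns hard class labels, so this AUC is label-based;
# grid.predict_proba(X_testing)[:, 1] would give scores comparable to the CV metric.
y_pred = grid.predict(X_testing)
roc_auc_score(y_testing, y_pred)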

0.5788462131016208
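
The gap between this number and the cross-validated AUC is partly a matter of what is passed to the metric. The sketch below uses a from-scratch pairwise definition of AUC (the hypothetical helper `pairwise_auc`, standing in for `roc_auc_score`) on made-up toy data to show how thresholding scores into labels can lower the result.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data: both positives score higher than every negative
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60]

# Hard labels at the usual 0.5 threshold, as a predict() call would return
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank every positive above every negative (AUC 1.0), but after thresholding one positive is tied with the negatives and the AUC drops to 0.75.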

## Generation of predictions
Finally, a CSV file with the predictions is created.

In [9]:
df_test = pd.read_csv("Working_datasets/processed_test.csv")
X_test = df_test.drop(columns=y_variable)
y_pred = grid.predict(X_test)  # predict on the test file, not on the hold-out split
Submission = pd.DataFrame({'y_pred': y_pred})

In [10]:
new_dir = 'Submission'

# Create the output directory if it does not exist yet
os.makedirs(new_dir, exist_ok=True)

Submission.to_csv(os.path.join(new_dir, 'test_w_answers.csv'), index=False)

## Summary
XGBoost is used to improve on the performance of the baseline model, with stratified k-fold cross-validation to select the hyperparameters. Instead of an exhaustive grid search, I used a randomized search, which evaluates a fixed number of sampled candidates and is therefore cheaper to run; it is also recommended in the literature. Final comments:
* XGBoost improves the performance of the model, but its optimization is more expensive in computing power.
* Overfitting was reduced but is still very large; more sophisticated feature engineering is needed to find the right variables or the right transformations for this problem.
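
One way to quantify the overfitting mentioned above is to ask the search to record training scores and compare them with the validation scores. This is a minimal sketch on synthetic data, with scikit-learn's `GradientBoostingClassifier` swapped in for `XGBClassifier` and a tiny made-up search space; it is not the notebook's actual model or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, noisy binary problem (an assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_distributions={"n_estimators": [20, 50], "max_depth": [2, 4]},
    n_iter=3,
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    scoring="roc_auc",
    return_train_score=True,  # also store the AUC on the training folds
    random_state=0,
)
search.fit(X_demo, y_demo)

# Compare mean training and validation AUC for the selected candidate
best = search.best_index_
train_auc = search.cv_results_["mean_train_score"][best]
valid_auc = search.cv_results_["mean_test_score"][best]
gap = train_auc - valid_auc
print(f"train AUC={train_auc:.3f}  validation AUC={valid_auc:.3f}  gap={gap:.3f}")
```

A large gap between the two suggests the model is fitting noise in the training folds; stronger regularization or shallower trees are the usual first things to try.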