# Descartes Underwriting Test : Auto Insurance
Anyssa Diouf 
_____

* The goal of this project is to predict whether an auto insurer will have to pay claims (the target TARGET_FLAG), and predict their amounts (denoted by the target TARGET_AMT). 

* This notebook reports the cross-validated performance of the models tested and generates a csv file with the predictions.

* I did not have time to look at the feature importances of the model I used. I would have done so with the SHAP package, as described [here](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d).

In [34]:
import os

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.metrics import log_loss, make_scorer, matthews_corrcoef, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBClassifier, XGBRegressor

In [35]:
path_data = "data/"

In [36]:
# Loading data
df_test = pd.read_csv(os.path.join(path_data, "test_auto.csv"), index_col='INDEX')
df_train = pd.read_csv(os.path.join(path_data, "train_auto.csv"), index_col='INDEX')

mean_auto = pd.read_csv(os.path.join(path_data, "MEAN_AUTO.csv"))
shell_auto = pd.read_csv(os.path.join(path_data, "SHELL_AUTO.csv"))

The indexes (INDEX column) in mean_auto and shell_auto only match those of test_auto, the test dataset. Since they do not match the INDEX column of the training dataset, we will not use those features.
In the following, only the variables in df_train are used.
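The index check described above could look like the following minimal sketch. The three small frames are made-up stand-ins for the real CSV files, and `index_overlap` and `SOME_FEATURE` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-ins for the real files (made-up INDEX values, for illustration only)
train = pd.DataFrame({"AGE": [40, 35, 52]}, index=pd.Index([1, 2, 4], name="INDEX"))
test = pd.DataFrame({"AGE": [29, 61]}, index=pd.Index([3, 5], name="INDEX"))
extra = pd.DataFrame({"INDEX": [3, 5], "SOME_FEATURE": [0.2, 0.3]})


def index_overlap(extra_ids, df):
    """Share of the ids in `extra_ids` that also appear in the index of `df`."""
    return extra_ids.isin(df.index).mean()


overlap_train = index_overlap(extra["INDEX"], train)
overlap_test = index_overlap(extra["INDEX"], test)
print(overlap_train, overlap_test)
```

An overlap of zero with the training index and one with the test index is the pattern that motivates dropping the extra files here.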

## 1. Preprocessing raw features

In [37]:
df_train[['INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM']].sample(3).T

INDEX,9471,7153,6566
INCOME,$0,"$10,359",
HOME_VAL,$0,"$75,321","$367,206"
BLUEBOOK,"$7,270","$16,570","$17,580"
OLDCLAIM,"$2,280",$0,"$43,663"


In [38]:
# Turning money amounts strings to floats
def floatmoney(s):
    # Non-string entries (NaN) are returned unchanged
    if isinstance(s, str):
        return float(s.replace('$', '').replace(',', ''))
    return s

def preprocess(df):
    dollar_vars = ['INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM']
    for var in dollar_vars:
        df[var] = [floatmoney(s) for s in df[var].tolist()]

    # Turning NaN's to a character equivalent for the Ordinal Encoder that we'll later use
    for var in df.select_dtypes(exclude='number').columns: 
        df[var] = df[var].fillna('unknown')
    return df

df_train = preprocess(df_train)

In [40]:
df_train[['INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM']].sample(3).T

INDEX,6721,6124,9636
INCOME,25535.0,35120.0,44949.0
HOME_VAL,99326.0,0.0,185547.0
BLUEBOOK,18270.0,17700.0,15520.0
OLDCLAIM,30146.0,0.0,0.0
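As an aside, the same conversion can probably be written without a Python-level loop. The sketch below is one possible vectorised variant, assuming the column holds strings and NaN only; `money_to_float` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def money_to_float(col):
    """Convert strings such as '$10,359' to floats; missing values stay NaN."""
    cleaned = col.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").astype(float)


# Made-up sample in the same format as the raw money columns
raw = pd.Series(["$0", "$10,359", np.nan, "$367,206"])
converted = money_to_float(raw)
print(converted.tolist())
```

Note that `errors="coerce"` silently turns any unparseable string into NaN, which may or may not be desirable depending on how clean the input is.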


## 2. Model definition

We use the XGBoost model: it is known for strong performance on tabular data and handles missing values (NaN) natively. <br>
We start by defining the estimators and creating a pipeline for each of them.

In [41]:
# Encoding strings 
ordinal_encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan),
     make_column_selector(dtype_exclude='number')),
    remainder='passthrough')

# Use of XGBoost to classify TARGET_FLAG and regress TARGET_AMT
xgb_model_clf = XGBClassifier(objective='binary:logistic', use_label_encoder=False, eval_metric='logloss')
xgb_model_reg = XGBRegressor()  # use_label_encoder only concerns the classifier

# Pipeline
pipe_clf = make_pipeline(ordinal_encoder, xgb_model_clf)
pipe_reg = make_pipeline(ordinal_encoder, xgb_model_reg)
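To make the encoder's behaviour concrete, here is a small sketch on made-up data (the column names and values are illustrative assumptions). It shows that a category unseen during fitting is encoded as NaN, and that numeric columns, including their NaN's, pass through untouched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
     make_column_selector(dtype_exclude="number")),
    remainder="passthrough",
)

# Made-up toy frames: 'Sports Car' never appears in the fitting data
fit_df = pd.DataFrame({"CAR_TYPE": ["Minivan", "SUV", "Minivan"], "AGE": [40.0, 35.0, 52.0]})
new_df = pd.DataFrame({"CAR_TYPE": ["SUV", "Sports Car"], "AGE": [29.0, np.nan]})

encoder.fit(fit_df)
encoded = encoder.transform(new_df)  # encoded string columns first, then the passthrough ones
print(encoded)
```

Both kinds of NaN then reach the booster, which treats them as missing values.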

We then train each pipeline. We first use k-fold cross-validation (k=5) on the training dataset to get a general idea of the model performance. We use the Matthews correlation coefficient (MCC) as the metric for the classification, and the root mean squared error (RMSE) for the regression.
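For reference, the two metrics named above can be computed by hand. The sketch below uses plain NumPy on made-up labels and amounts; the helper names `mcc` and `rmse` are mine, and the notebook itself relies on the scikit-learn implementations.

```python
import numpy as np


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # By convention the score is 0 when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up examples: TP=2, TN=4, FP=1, FN=1 for the classification part
mcc_value = mcc([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
rmse_value = rmse([0, 1000, 0, 3000], [0, 1500, 500, 2000])
print(mcc_value, rmse_value)
```

The MCC ranges from -1 to 1, with 0 corresponding to a prediction no better than chance, which makes it a reasonable choice when the two classes are imbalanced.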

In [42]:
# Defining explanatory variables and targets
X = df_train.drop(columns=['TARGET_FLAG', 'TARGET_AMT'])
y_clf = df_train['TARGET_FLAG']
y_reg = df_train['TARGET_AMT']

In [43]:
# Cross validation for the classification
cross_val_score(estimator=pipe_clf, 
                X=X, 
                y=y_clf, 
                cv=5, 
                scoring=make_scorer(matthews_corrcoef, greater_is_better=True))

array([0.37015116, 0.43846707, 0.37436551, 0.39569546, 0.37427354])

In [46]:
# Cross validation for the regression
scores_reg = cross_val_score(estimator=pipe_reg, 
                X=X, 
                y=y_reg, 
                cv=5, 
                scoring=make_scorer(mean_squared_error, greater_is_better=False))
np.sqrt(-scores_reg)

array([4467.5149588 , 5233.98837826, 4111.70002891, 5014.40489307,
       5730.6166709 ])
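The sign flip in `np.sqrt(-scores_reg)` comes from `greater_is_better=False`, which makes the scorer return the negated MSE so that "higher is better" still holds. A minimal sketch on synthetic data, with a linear regression standing in for the XGBoost pipeline (an assumption made purely to keep the example light):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (made-up data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5,
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
)
rmse_per_fold = np.sqrt(-neg_mse)  # undo the negation before taking the root
print(neg_mse, rmse_per_fold)
```

scikit-learn also accepts the scoring string `"neg_root_mean_squared_error"`, which returns the negated RMSE per fold directly.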

These scores leave room for improvement. With more time, I would have tried to improve them, for example by optimising the hyperparameters.
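A hyperparameter search of the kind mentioned above might be sketched as follows. This is not what the notebook ran: the data is synthetic, the grid values are arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so that the example has no extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic, imbalanced binary problem (made-up data, not the insurance set)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

pipe = make_pipeline(GradientBoostingClassifier(n_estimators=50, random_state=0))

# make_pipeline names each step after its lowercased class name
param_grid = {
    "gradientboostingclassifier__max_depth": [2, 4],
    "gradientboostingclassifier__learning_rate": [0.05, 0.2],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring=make_scorer(matthews_corrcoef))
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's `pipe_clf`, the parameter prefix would be `xgbclassifier__` instead, since the step is named after the estimator class.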

In [49]:
# Training
pipe_clf.fit(X, y_clf)
pipe_reg.fit(X, y_reg);

## 3. Predictions on the test set

In [50]:
# Preprocessing 
df_test = preprocess(df_test)
df_test[['INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM']].sample(3).T

INDEX,5908,560,3013
INCOME,,3865.0,52411.0
HOME_VAL,0.0,0.0,158113.0
BLUEBOOK,11310.0,16600.0,11340.0
OLDCLAIM,0.0,6089.0,0.0


Defining explanatory variables and predicting the targets:

In [55]:
X_test = df_test.drop(columns=['TARGET_FLAG', 'TARGET_AMT'])

y_flag = pipe_clf.predict(X_test)
# y_flag is used as an indicator function (puts the amount to zero when no claim is predicted)
y_amount = pipe_reg.predict(X_test) * y_flag
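The indicator trick can be checked on a few made-up numbers. The arrays below are illustrative stand-ins for the classifier and regressor outputs, not actual predictions.

```python
import numpy as np

flag_pred = np.array([1, 0, 0, 1])                     # made-up classifier output
amount_pred = np.array([3040.5, 812.0, 15.3, 5120.0])  # made-up regressor output

# Multiplying by the 0/1 flag keeps the amount only where a claim is predicted
combined = amount_pred * flag_pred

# Equivalent, more explicit formulation
combined_where = np.where(flag_pred == 1, amount_pred, 0.0)
print(combined)
```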

Collecting the predictions in a DataFrame:

In [58]:
df_results = pd.DataFrame(zip(y_flag, y_amount), columns=['TARGET_FLAG', 'TARGET_AMT'], index=X_test.index)
df_results.sample(3)

INDEX,TARGET_FLAG,TARGET_AMT
3246,1,3040.458496
731,0,0.0
4863,0,0.0


Exporting as csv:

In [60]:
df_results.to_csv("data/predictions.csv")