## Porto Seguro’s Safe Driver Prediction

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies’ claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In this competition, you’re challenged to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. While Porto Seguro has used machine learning for the past 20 years, they’re looking to Kaggle’s machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.

### Data Description

In this competition, you will predict the probability that an auto insurance policy holder files a claim.
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features. Features without these designations are either continuous or ordinal. Values of `-1` indicate that the feature was missing from the observation. The `target` column signifies whether or not a claim was filed for that policy holder.
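
The naming scheme above can be turned into a small helper that sorts columns by kind and by grouping tag. This is only a sketch: the column list is a hand-picked sample that follows the described pattern, and the helper names `feature_kind` and `feature_group` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample of column names that follow the naming scheme described above.
columns = ["ps_ind_01", "ps_ind_02_cat", "ps_ind_06_bin",
           "ps_reg_01", "ps_car_01_cat", "ps_car_11", "ps_calc_15_bin"]


def feature_kind(name):
    """Classify a feature by its postfix: binary, categorical, or continuous/ordinal."""
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous_or_ordinal"


def feature_group(name):
    """Return the grouping tag (ind, reg, car, calc): the second underscore-separated token."""
    return name.split("_")[1]


# Collect the column names under each kind and each grouping tag.
by_kind = defaultdict(list)
by_group = defaultdict(list)
for col in columns:
    by_kind[feature_kind(col)].append(col)
    by_group[feature_group(col)].append(col)

print(dict(by_kind))
print(dict(by_group))
```

Matching on the full `_cat` or `_bin` postfix avoids depending on the position of a character inside the name.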

## Starter code: AdaBoost
In this notebook, we use an ensemble machine learning model, AdaBoost, to predict whether a driver makes an insurance claim in the next year.

In [22]:
import numpy as np
import subprocess
import pandas as pd

from IPython.display import Image

from collections import Counter

from sklearn.datasets import make_classification
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score

# classifiers
from sklearn.ensemble import AdaBoostClassifier


In [15]:
#### LOADING DATA ####

### TRAIN DATA
train_data = pd.read_csv('train.csv', na_values='-1')

## Filling the missing data NAN with median of the column
train_data_nato_median = pd.DataFrame()
for column in train_data.columns:
    train_data_nato_median[column] = train_data[column].fillna(train_data[column].median())

train_data = train_data_nato_median.copy()

### TEST DATA
test_data = pd.read_csv('test.csv', na_values='-1')
## Filling the missing data NAN with median of the column
test_data_nato_median = pd.DataFrame()
for column in test_data.columns:
    test_data_nato_median[column] = test_data[column].fillna(test_data[column].median())
    
test_data = test_data_nato_median.copy()
test_data_id = test_data.pop('id')
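
The cell above fills the test set with medians computed on the test set itself. A common alternative is to compute the medians once on the training data and reuse them for both frames, so that the model sees the same fill values at training and prediction time. Here is a minimal sketch of that idea, using tiny made-up frames in place of `train.csv` and `test.csv`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the real train and test data.
train = pd.DataFrame({"ps_reg_03": [0.5, np.nan, 0.9],
                      "ps_car_14": [0.3, 0.4, np.nan]})
test = pd.DataFrame({"ps_reg_03": [np.nan, 0.7],
                     "ps_car_14": [np.nan, np.nan]})

# Medians are computed once, on the training data only.
train_medians = train.median()

# fillna with a Series fills each column with the value stored under its name.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_filled)
print(test_filled)
```

This also handles a test column that is entirely missing, which a per-frame median cannot fill.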

In [16]:
## Identifying Categorical data
column_names = train_data.columns
## Categorical features carry the "_cat" postfix in their names
categorical_column = column_names[column_names.str.endswith('_cat')]

## Changing categorical columns to category data type
def int_to_categorical(data):
    """
    Change the categorical columns to the pandas category data type, in place.
    """
    for column in categorical_column:
        data[column] =  data[column].astype('category')

In [17]:
## Creating list of train and test data and converting columns of interest to categorical type
datas = [train_data,test_data]

for data in datas:
    int_to_categorical(data)

test_data.dtypes

ps_ind_01            int64
ps_ind_02_cat     category
ps_ind_03            int64
ps_ind_04_cat     category
ps_ind_05_cat     category
ps_ind_06_bin        int64
ps_ind_07_bin        int64
ps_ind_08_bin        int64
ps_ind_09_bin        int64
ps_ind_10_bin        int64
ps_ind_11_bin        int64
ps_ind_12_bin        int64
ps_ind_13_bin        int64
ps_ind_14            int64
ps_ind_15            int64
ps_ind_16_bin        int64
ps_ind_17_bin        int64
ps_ind_18_bin        int64
ps_reg_01          float64
ps_reg_02          float64
ps_reg_03          float64
ps_car_01_cat     category
ps_car_02_cat     category
ps_car_03_cat     category
ps_car_04_cat     category
ps_car_05_cat     category
ps_car_06_cat     category
ps_car_07_cat     category
ps_car_08_cat     category
ps_car_09_cat     category
ps_car_10_cat     category
ps_car_11_cat     category
ps_car_11          float64
ps_car_12          float64
ps_car_13          float64
ps_car_14          float64
ps_car_15          float64
p

In [5]:
## Describing categorical variables
# def describe_categorical(x):
#     """
#     Function to describe categorical data
#     """
#     from IPython.display import display, HTML
#     display(HTML(x[x.columns[x.dtypes == "category"]].describe().to_html()))

# describe_categorical(train_data)

In [8]:
### FUNCTION TO CREATE DUMMIES COLUMNS FOR CATEGORICAL VARIABLES
def creating_dummies(data):
    """Create dummy columns for the categorical variables and return the new frame.
    """
    for column in categorical_column:
        dummies = pd.get_dummies(data[column], prefix=column)
        data = pd.concat([data, dummies], axis=1)
        ## dropping the original column ##
        data = data.drop([column], axis=1)
    return data

In [18]:
### CREATING DUMMIES FOR CATEGORICAL VARIABLES  
for column in categorical_column:
        dummies = pd.get_dummies(train_data[column],prefix=column)
        train_data = pd.concat([train_data,dummies],axis =1)
        train_data.drop([column],axis=1,inplace= True)


for column in categorical_column:
        dummies = pd.get_dummies(test_data[column],prefix=column)
        test_data = pd.concat([test_data,dummies],axis =1)
        test_data.drop([column],axis=1,inplace= True)

print(train_data.shape)
print(test_data.shape)


(595212, 220)
(892816, 218)
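
The two shapes are consistent (the training frame still holds `id` and `target`), but equal column counts do not guarantee that both frames contain the same dummy columns in the same order. If a category level appears in only one file, the frames diverge. One way to guard against that, sketched here on a made-up column, is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Made-up data: level 4 appears only in test, levels 2 and 3 only in train.
train = pd.DataFrame({"ps_car_01_cat": [1, 2, 3]})
test = pd.DataFrame({"ps_car_01_cat": [1, 1, 4]})

train_dummies = pd.get_dummies(train["ps_car_01_cat"], prefix="ps_car_01_cat")
test_dummies = pd.get_dummies(test["ps_car_01_cat"], prefix="ps_car_01_cat")

# Reindex the test dummies to the training columns:
# missing columns are added and filled with 0, unseen ones are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

After this step both frames can be passed to the same fitted model without a column mismatch.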


In [19]:
#Define covariates in X and dependent variable in y
X = train_data.iloc[:,2:] ## FEATURE DATA
y= train_data.target ### LABEL DATA

### CHECKING DIMENSIONS
print(X.shape)
print(y.shape)


(595212, 218)
(595212,)


In [24]:
#### SPLITTING DATA INTO TRAIN AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2,
                                                    random_state=13)


In [25]:
## Label's Distribution
print("Train label distribution:")
print(Counter(y_train))

print("\nTest label distribution:")
print(Counter(y_test))

Train label distribution:
Counter({0: 458844, 1: 17325})

Test label distribution:
Counter({0: 114674, 1: 4369})
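
Claims make up roughly 3.6% of the training split and 3.7% of the held-out split, so the random split happened to preserve the class balance fairly well. With a class this rare, the balance can be enforced explicitly through the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (`X_demo`, `y_demo`) with a 4% positive rate rather than the competition files.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4% positives.
rng = np.random.RandomState(13)
X_demo = rng.rand(1000, 3)
y_demo = np.array([1] * 40 + [0] * 960)

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo)

print("Train positive rate: {0:.4f}".format(y_tr.mean()))
print("Test positive rate: {0:.4f}".format(y_te.mean()))
```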


In [26]:
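## AdaBoost machine learning model
from sklearn.tree import DecisionTreeClassifier

# Assumed seed (not set elsewhere in this notebook); 13 matches the split's random_state.
seed = 13

# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`.
adaboost = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),
    algorithm='SAMME',
    n_estimators=1000,
    random_state=seed)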

# train classifier
adaboost.fit(X_train, y_train)

# calculate predictions
adaboost_y_pred = adaboost.predict(X_test)
adaboost_y_pred_prob = adaboost.predict_proba(X_test)

# evaluate
adaboost_accuracy = accuracy_score(y_test, adaboost_y_pred)
adaboost_logloss = log_loss(y_test, adaboost_y_pred_prob)

print("== AdaBoost ==")
print("Accuracy: {0:.2f}".format(adaboost_accuracy))
print("Log loss: {0:.2f}".format(adaboost_logloss))

== AdaBoost ==
Accuracy: 0.96
Log loss: 0.55
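
These two numbers are easier to judge next to a trivial baseline. The sketch below uses only the label counts printed earlier in this notebook: it computes the accuracy of always predicting "no claim" and the log loss of predicting the training claim rate for every held-out row. No model is involved, and the variable names are illustrative.

```python
import math

# Label counts copied from the "Label's Distribution" output in this notebook.
train_neg, train_pos = 458844, 17325
test_neg, test_pos = 114674, 4369
n_test = test_neg + test_pos

# Claim rate observed in the training split.
train_rate = train_pos / (train_neg + train_pos)

# Accuracy of always predicting class 0 on the held-out split.
majority_accuracy = test_neg / n_test

# Log loss of predicting train_rate as the claim probability for every held-out row.
constant_logloss = -(test_pos * math.log(train_rate)
                     + test_neg * math.log(1 - train_rate)) / n_test

print("Training claim rate: {0:.4f}".format(train_rate))
print("Majority-class accuracy: {0:.2f}".format(majority_accuracy))
print("Constant-rate log loss: {0:.2f}".format(constant_logloss))
```

The constant baseline reaches the same 0.96 accuracy and a log loss of roughly 0.16, well below the 0.55 reported above. This suggests that accuracy is not informative on data this imbalanced and that the AdaBoost probabilities are poorly calibrated.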


In [27]:
print('True labels:')
print(y_test.head())
print('\nPredicted labels:')
print(adaboost_y_pred[:5])
print('\nPredicted probabilities:')
print(adaboost_y_pred_prob[:5])

True labels:
240034    0
122822    0
242279    0
594193    0
146126    0
Name: target, dtype: int64

Predicted labels:
[0 0 0 0 0]

Predicted probabilities:
[[ 0.58466896  0.41533104]
 [ 0.583991    0.416009  ]
 [ 0.58478572  0.41521428]
 [ 0.58519744  0.41480256]
 [ 0.58364119  0.41635881]]
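
All five rows get nearly the same probability, so what matters in practice is whether the small differences rank drivers correctly. To our knowledge the competition leaderboard used the normalized Gini coefficient, which depends only on that ranking; for binary labels it is commonly computed as `2 * AUC - 1`. The sketch below defines a hypothetical helper `normalized_gini` and runs it on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def normalized_gini(y_true, y_score):
    """Normalized Gini coefficient for binary labels, computed as 2 * AUC - 1."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


# Made-up labels and predicted claim probabilities, for illustration only.
y_true_demo = np.array([0, 0, 1, 0, 1, 0])
y_score_demo = np.array([0.10, 0.20, 0.80, 0.30, 0.25, 0.05])

gini_demo = normalized_gini(y_true_demo, y_score_demo)
print("Normalized Gini: {0:.3f}".format(gini_demo))
```

In this notebook the call would look like `normalized_gini(y_test, adaboost_y_pred_prob[:, 1])`, since the second column holds the probability of a claim.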


In [28]:
#### Prediction on test data ####
y_pred_prob = adaboost.predict_proba(test_data)

submission_simple_adaboost = pd.DataFrame()
submission_simple_adaboost['id'] = test_data_id

## Probability of the positive class (claim filed)
submission_simple_adaboost['target'] = y_pred_prob[:, 1]
submission_simple_adaboost = submission_simple_adaboost.set_index('id')

submission_simple_adaboost.head()
## Write to CSV
submission_simple_adaboost.to_csv("Simple Adaboost.csv")
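
Before uploading, it can help to check the layout of the submission: one `id` column and one `target` column holding probabilities between 0 and 1. The sketch below builds a tiny made-up submission and writes it to an in-memory buffer instead of a file. The ids and probabilities are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up ids and class probabilities shaped like the predict_proba output.
ids_demo = pd.Series([0, 1, 2], name="id")
proba_demo = np.array([[0.58, 0.42], [0.60, 0.40], [0.55, 0.45]])

# Keep only the probability of the positive class (second column).
submission_demo = pd.DataFrame({"id": ids_demo.values,
                                "target": proba_demo[:, 1]})

# Sanity check: every target must be a valid probability.
assert submission_demo["target"].between(0, 1).all()

# Write to an in-memory buffer so the CSV layout can be inspected.
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```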