# Kaggle Insurance Competition

Link: [Insurance Competition](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction)

Goal: predict the **probability** that a given driver will file an insurance claim, based on a set of features.

In [1]:
PATH = '/Users/alexhoward/data/insurance_kaggle/'

Initialisation:

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

from fastai.imports import * # fastai.imports pulls in a range of common libraries, e.g. pandas
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier # Random forest classes we'll use.
from IPython.display import display

from sklearn import metrics

## Data Cleaning:

In [4]:
df_raw = pd.read_csv(f'{PATH}train.csv')

In [10]:
df_raw.head()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


From the competition information we find:
- Features that belong to similar groupings are tagged as such in the feature name (`calc`, `ind`, `reg`, etc.).
- The `_bin` suffix marks a binary feature.
- The `_cat` suffix marks a categorical feature.
- Features without a `_bin` or `_cat` suffix are either ordinal or numeric.
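
The naming convention above can be turned into a quick lookup of feature types. This is a minimal sketch: the helper names `feature_kind` and `feature_group` are made up here for illustration, and the sample column names are copied from the `df_raw.head()` output.

```python
def feature_kind(name):
    """Classify a column by its suffix (hypothetical helper, for illustration only)."""
    if name.endswith('_bin'):
        return 'binary'
    if name.endswith('_cat'):
        return 'categorical'
    return 'ordinal_or_numeric'

def feature_group(name):
    """Return the grouping tag (e.g. ind, reg, calc) embedded in a column name."""
    parts = name.split('_')
    return parts[1] if len(parts) > 1 and parts[0] == 'ps' else None

# A few column names taken from the df_raw.head() output
sample_cols = ['ps_ind_01', 'ps_ind_02_cat', 'ps_ind_06_bin', 'ps_calc_11']
kinds = {c: feature_kind(c) for c in sample_cols}
groups = {c: feature_group(c) for c in sample_cols}
print(kinds)
print(groups)
```

Columns such as `id` and `target` do not follow the `ps_` pattern, so `feature_group` returns `None` for them.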

Helpfully, the features are already numerically coded, so the data is in a good format to feed into the random forest model.

In [11]:
df, y, nas = proc_df(df_raw, 'target')

## Setting up Model Testing:

Create validation set:

In [12]:
set_rf_samples(1000)

In [13]:
reset_rf_samples()

In [14]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note use .copy() to create a new object in memory

n_valid = 1000  # Specify validation set size
n_trn = len(df)-n_valid # Resulting train set size
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

# Training data: X_train
# Training response: y_train

# Validation set: X_valid

Defining a loss function:

In [15]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean()) # RMSE, used here as a proxy loss (not the competition metric)

def print_score(m):
    # Prints [train RMSE, valid RMSE, train R^2, valid R^2] plus the OOB score if available
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

Note that this isn't how the competition is judged: submissions are scored on the normalized Gini coefficient. We hope RMSE will approximate it.
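
For comparison with RMSE, here is a minimal pure-Python sketch of a normalized Gini coefficient, the metric this competition uses. The function names and the toy inputs are assumptions made for this example, and this is not the official scoring code; tie handling in particular may differ.

```python
def gini(actual, pred):
    """Gini coefficient of `actual` when rows are ranked by `pred` (highest first)."""
    n = len(actual)
    # Sort row indices by prediction, descending; ties keep their original order
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    total = float(sum(actual))
    running, cumulative_sum = 0.0, 0.0
    for i in order:
        running += actual[i]       # cumulative count of claims captured so far
        cumulative_sum += running
    return (cumulative_sum / total - (n + 1) / 2.0) / n

def gini_normalized(actual, pred):
    """Scale so that a perfect ranking scores 1.0."""
    return gini(actual, pred) / gini(actual, actual)

# Toy data, invented for illustration
y_true = [0, 0, 1, 1]
perfect = gini_normalized(y_true, [0.1, 0.2, 0.8, 0.9])
reversed_rank = gini_normalized(y_true, [0.9, 0.8, 0.2, 0.1])
print(perfect, reversed_rank)
```

Only the ordering of the predictions matters for this metric, not their absolute values, which is one reason a low RMSE does not guarantee a good leaderboard score.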

## Building Model:

In [16]:
RF = RandomForestRegressor(n_estimators = 50, n_jobs = -1, min_samples_leaf = 500)
RF.fit(X_train,y_train)
print_score(RF)

KeyboardInterrupt: 

In [None]:
RF.predict(X_train)

Okay, this looks really bad. Let's submit it to Kaggle anyway and see what we get...

In [22]:
test_set = pd.read_csv(f'{PATH}test.csv')

In [None]:
test_set.head()

In [57]:
test_preds = RF.predict(test_set)

ValueError: Number of features of the model must match the input. Model n_features is 58 and input n_features is 59 
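
The error says the model was fit on 58 features while the test frame has 59 columns. One way to locate the mismatch is to compare the column names directly. The sketch below is only an illustration: `column_diff` is a hypothetical helper, and the short toy lists stand in for `X_train.columns` and `test_set.columns`.

```python
def column_diff(train_cols, test_cols):
    """Return (columns only in train, columns only in test), preserving order."""
    train_names, test_names = set(train_cols), set(test_cols)
    only_train = [c for c in train_cols if c not in test_names]
    only_test = [c for c in test_cols if c not in train_names]
    return only_train, only_test

# Toy stand-ins for list(X_train.columns) and list(test_set.columns)
train_cols = ['ps_ind_01', 'ps_ind_02_cat']
test_cols = ['id', 'ps_ind_01', 'ps_ind_02_cat']
only_train, only_test = column_diff(train_cols, test_cols)
print(only_train, only_test)
```

In the notebook this would be called as `column_diff(list(X_train.columns), list(test_set.columns))`. Once the extra column is known, selecting `test_set[X_train.columns]` is one way to line the frames up, provided every training column also exists in the test set.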

In [None]:
test_preds

In [24]:
test_set['target'] = test_preds

In [25]:
submission = test_set[['id','target']]

In [26]:
submission.to_csv('submission2.csv', index = False)

Problems I've encountered:
- The R² metric is below zero...
- The result is better when we submit it to Kaggle.
- My score actually got worse when I initially added much more data, so I set `min_samples_leaf = 500`; averaging over more samples per leaf should hopefully give a more accurate estimate.
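
To make the first point concrete: R² compares the model's squared error with that of always predicting the mean, so a negative value means the model does worse than that constant baseline. A minimal sketch follows, with a hand-written `r2` function and toy values invented for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets: one claim in four drivers
y_toy = [0, 0, 0, 1]
baseline = r2(y_toy, [0.25, 0.25, 0.25, 0.25])  # always predict the mean
bad = r2(y_toy, [1, 1, 1, 0])                   # predictions worse than the mean
print(baseline, bad)
```

With a rare positive class, the mean is a strong baseline in squared-error terms, so R² values close to zero are to be expected even from a model that ranks drivers reasonably well.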

Note that using a random forest makes sense here: each leaf averages the 0/1 outcomes of similar drivers, which gives an empirical claim probability for a given input.
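
The averaging idea can be sketched without any library. Below, each inner list holds the 0/1 targets of the training rows that share a leaf with the driver we want to score, one list per tree. The function names and the toy leaves are assumptions made for this example, not how scikit-learn stores its trees.

```python
def leaf_probability(leaf_targets):
    """Fraction of claims among the training rows in one leaf."""
    return sum(leaf_targets) / len(leaf_targets)

def forest_probability(leaves_per_tree):
    """Average the per-tree leaf fractions, as a regression forest does."""
    per_tree = [leaf_probability(leaf) for leaf in leaves_per_tree]
    return sum(per_tree) / len(per_tree)

# Toy leaves from three trees, invented for illustration
leaves = [[0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 0, 0]]
p_claim = forest_probability(leaves)
print(p_claim)
```

A larger `min_samples_leaf` puts more rows in each leaf, so each per-leaf fraction is estimated from more data. `RandomForestClassifier.predict_proba` would be an alternative way to get class probabilities, though it is not what this notebook uses.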

Now what we want to do is turn categories into sets of indicator variables (one-hot encoding), and see if this helps our RF predictor:

In [38]:
new_df.ps_ind_01.nunique()

8

In [46]:
for column in df_raw.columns: df_raw[column] = df_raw[column].astype('category')

In [47]:
new_df, y, nas = proc_df(df_raw, 'target', max_n_cat = 10)

In [8]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note use .copy() to create a new object in memory

n_valid = 1000  # Specify validation set size
n_trn = len(new_df)-n_valid # Resulting train set size
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(new_df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

# Training data: X_train
# Training response: y_train

# Validation set: X_valid

In [49]:
reset_rf_samples()

In [50]:
RF = RandomForestRegressor(n_estimators = 200, n_jobs = -1, min_samples_leaf = 500)
RF.fit(X_train,y_train)
print_score(RF)

[0.1854402222012665, 0.20862542849126414, 0.020399728958585106, 0.008190470030909669]


In [20]:
new_df.columns

Index(['id', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat',
       'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin',
       'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin',
       'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin',
       'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03',
       'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
       'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
       'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11',
       'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01',
       'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06',
       'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
       'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin',
       'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin',
       'ps_calc_20_bin'],
      dtyp

In [54]:
for column in test_set.columns: test_set[column] = test_set[column].astype('category')

In [56]:
for n,c in test_set.items(): numericalize(test_set, c, n, 10)

Argh, the new test set has a different number of columns. To fix this, we'll append it to the bottom of the training data frame before we start processing, so that both are encoded together. Eugh.
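
A minimal sketch of that idea, assuming the `target` column has already been removed from the training frame. Note that it uses `pd.get_dummies` in place of fastai's `proc_df`, and that `encode_together` and the toy frames are hypothetical names and data made up for illustration.

```python
import pandas as pd

def encode_together(train, test, cat_cols):
    """Stack train and test, one-hot encode once, then split them apart again."""
    n_train = len(train)
    combined = pd.concat([train, test], axis=0, ignore_index=True)
    # Encoding the stacked frame guarantees both halves get identical columns
    encoded = pd.get_dummies(combined, columns=cat_cols)
    train_enc = encoded.iloc[:n_train].copy()
    test_enc = encoded.iloc[n_train:].reset_index(drop=True)
    return train_enc, test_enc

# Toy frames: the test set contains a category level (2) that train never saw
toy_train = pd.DataFrame({'a_cat': [0, 1], 'x': [1.0, 2.0]})
toy_test = pd.DataFrame({'a_cat': [2, 0], 'x': [3.0, 4.0]})
train_enc, test_enc = encode_together(toy_train, toy_test, ['a_cat'])
print(list(train_enc.columns))
```

Because the encoding sees every level from both frames, a level that appears only in the test set still gets its own column in the training frame, so the model and the test data agree on the number of features.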