# Yacht Insurance Claims Data 
##### NOTEBOOK 3

**Problem Statement:** What is the likelihood that a yacht insurance policy has at least 1 claim within five years?

**Contents:**

___
## Import libraries and read in data

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score, f1_score 
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# Use imblearn's Pipeline (not sklearn's): it accepts resampling steps such as SMOTE
from imblearn.pipeline import Pipeline
from collections import Counter
from sklearn.linear_model import LogisticRegression, ElasticNetCV, LogisticRegressionCV
from sklearn.inspection import permutation_importance
from matplotlib import pyplot

In [2]:
np.random.seed(42)

In [3]:
combined = pd.read_csv('../datasets/combined2.csv')

combined.head()

Unnamed: 0,Years Exp.,Year Built,Length,Hull Limit,# Engines,num_claims,Age,policy_length,New/Renl/Endt/Canc/Flat_endt,New/Renl/Endt/Canc/Flat_endt-canc,...,Mooring County_sarasota,Mooring County_sinaloa,Mooring County_skagit,Mooring County_sonora,Mooring County_south pacific,Mooring County_st. johns,Mooring County_st. lucie,Mooring County_ventura,Mooring County_volusia,Mooring County_whatcom
0,2.0,1997.0,63.0,500000.0,2.0,0.0,73,1759.0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,22.0,2006.0,61.0,1275000.0,2.0,0.0,69,1772.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,30.0,2001.0,48.0,400000.0,2.0,0.0,78,1760.0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,20.0,1973.0,32.0,35000.0,0.0,0.0,44,1760.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,30.0,1989.0,43.0,200000.0,1.0,0.0,70,1757.0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

# MODELING: Multiclass Classification

**BELOW:** We first tried a multiclass classification model to predict whether a boat would have 0, 1, 2, or 3 claims. With only 68 policies at 2 claims and 15 at 3 claims, the models had too few examples to train on, so we moved forward with a binary target (0 claims vs. at least 1 claim).

#### Establish a baseline

0 claims = 92%<br>
1 claim = 6.6%<br>
2 claims = 1%<br>
3 claims = 0.2%<br>

In [4]:
combined['num_claims'].value_counts()

0.0    5836
1.0     421
2.0      68
3.0      15
Name: num_claims, dtype: int64

In [5]:
combined['num_claims'].value_counts(normalize=True)

0.0    0.920505
1.0    0.066404
2.0    0.010726
3.0    0.002366
Name: num_claims, dtype: float64

In [6]:
combined['num_claims'].astype(int)

0       0
1       0
2       0
3       0
4       0
       ..
6335    0
6336    0
6337    0
6338    0
6339    0
Name: num_claims, Length: 6340, dtype: int64

### Train/Test split

In [7]:
X = combined.drop(columns=['num_claims'])
y = combined['num_claims']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

## 4 MODELS: KNN, Logistic Regression, Random Forest, Extra Trees
---

### StandardScaler

In [8]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

### Instantiate and fit 4 models

In [9]:
knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)
knn_pred = knn.predict(X_test_sc)

lr = LogisticRegression(max_iter=500,random_state=42)
lr.fit(X_train_sc, y_train)
lr_pred = lr.predict(X_test_sc)

rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(X_train_sc, y_train)
rf_pred = rf.predict(X_test_sc)

et = ExtraTreesClassifier(n_estimators=100,random_state=42)
et.fit(X_train_sc, y_train)
et_pred = et.predict(X_test_sc)

### Get model results

In [10]:
print(classification_report(y_test, knn_pred, digits=3))

              precision    recall  f1-score   support

         0.0      0.927     0.988     0.956      1167
         1.0      0.278     0.060     0.098        84
         2.0      0.500     0.143     0.222        14
         3.0      1.000     0.667     0.800         3

    accuracy                          0.916      1268
   macro avg      0.676     0.464     0.519      1268
weighted avg      0.879     0.916     0.891      1268



In [11]:
print(classification_report(y_test, lr_pred, digits=3))

              precision    recall  f1-score   support

         0.0      0.923     1.000     0.960      1167
         1.0      0.000     0.000     0.000        84
         2.0      0.000     0.000     0.000        14
         3.0      1.000     1.000     1.000         3

    accuracy                          0.923      1268
   macro avg      0.481     0.500     0.490      1268
weighted avg      0.852     0.923     0.886      1268





In [12]:
print(classification_report(y_test, rf_pred, digits=3))

              precision    recall  f1-score   support

         0.0      0.939     0.996     0.967      1167
         1.0      0.643     0.107     0.184        84
         2.0      1.000     1.000     1.000        14
         3.0      1.000     1.000     1.000         3

    accuracy                          0.937      1268
   macro avg      0.896     0.776     0.788      1268
weighted avg      0.921     0.937     0.915      1268



In [13]:
print(classification_report(y_test, et_pred, digits=3))

              precision    recall  f1-score   support

         0.0      0.940     0.987     0.963      1167
         1.0      0.440     0.131     0.202        84
         2.0      0.933     1.000     0.966        14
         3.0      1.000     1.000     1.000         3

    accuracy                          0.931      1268
   macro avg      0.828     0.780     0.783      1268
weighted avg      0.907     0.931     0.913      1268



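**INTERPRETATION:** These scores suggest the tree-based models are overfitting to the 2- and 3-claim classes. Random Forest and Extra Trees look very good at identifying these minority classes, but the test set holds only 14 and 3 examples of them, so those scores rest on too few observations to trust.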

---
# MODELING: Binary Classification

**Set up for all models**<br>
1. Create binary class column<br>
2. Define X, y<br>
3. Train/test split<br>
4. Scale X (fit the scaler on the training set only)<br>

**Normal Modeling**<br>
1. Test models (KNN, Random Forest, ExtraTrees, Logistic Regression)<br>
2. Get accuracy, recall, and weighted F1 scores for each model and add to a table to compare<br>

**With OverSampling**<br>
1. Instantiate RandomOverSampler<br>
2. Fit training data on oversampler<br>
3. Test same models w/ same parameters<br>
4. Get f1 scores and add to a table to compare<br>

**With OverSampling and Undersampling**<br>
1. Instantiate RandomOverSampler, fit.<br>
2. Instantiate RandomUnderSampler, fit.<br>
3. Test same models w/ same parameters<br>
4. Get f1 scores and add to a table to compare<br>

**With SMOTE**<br>
1. Instantiate SMOTE, fit.<br>
2. Test same models w/ same parameters<br>
3. Get f1 scores and add to a table to compare<br>

#### Establish a baseline

0 claims = 92%<br>
At least 1 claim = 7.9%<br>

In [14]:
# Create a column for binary classification
combined['binary'] = [1 if x > 0 else 0 for x in combined['num_claims']]
combined['binary'].value_counts(normalize=True)

0    0.920505
1    0.079495
Name: binary, dtype: float64

**BELOW:** We chose recall as the most important metric to track because we are interested in predicting claims, the minority class. We would rather raise a false alarm (predict a claim for a boat that never files one) than miss a claim (predict no claim for a boat that does file one). In addition, we track accuracy to keep an eye on the model's bias/variance, and the weighted F1 score (F1 combines precision and recall; the weighted version averages the per-class F1 scores by class support).
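To make the three tracked metrics concrete, here is a minimal sketch that computes them from the four cells of a binary confusion matrix. The helper names are ours, and the counts are illustrative assumptions: a model that predicts "no claim" for every policy in a 1,268-row test set that contains 101 claims.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_metrics(tn, fp, fn, tp):
    """Accuracy, minority-class recall, and support-weighted F1 from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_pos = f1_from_counts(tp, fp, fn)  # class 1 (claim)
    f1_neg = f1_from_counts(tn, fn, fp)  # class 0 (no claim): the roles of fp and fn swap
    support_pos = tp + fn
    support_neg = tn + fp
    weighted_f1 = (support_pos * f1_pos + support_neg * f1_neg) / total
    return accuracy, recall, weighted_f1


# Illustrative (assumed) counts: every policy is predicted as "no claim".
accuracy, recall, weighted_f1 = binary_metrics(tn=1167, fp=0, fn=101, tp=0)
print(f"accuracy={accuracy:.3f} recall={recall:.3f} weighted_f1={weighted_f1:.3f}")
```

A model that never predicts a claim still reaches about 92% accuracy and a weighted F1 near 0.88, while its recall is 0. That is why recall is the metric we watch most closely.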

In [15]:
# Create an empty df to input scores from models
binary_scores = pd.DataFrame(columns=['Accuracy', 'Recall','Weighted F1 Score'])
binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score


### Prep the models for testing
---

### Train/test split

In [16]:
X = combined.drop(columns=['num_claims','binary'])
y = combined['binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

### Scale the data

In [17]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

## 4 MODELS: KNN, Logistic Regression, Random Forest, Extra Trees
---

In [18]:
# Using the scaled data, instantiate, fit, and generate predictions for the 4 models

knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)
knn_pred = knn.predict(X_test_sc)

lr = LogisticRegression(random_state=42)
lr.fit(X_train_sc, y_train)
lr_pred = lr.predict(X_test_sc)

rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(X_train_sc, y_train)
rf_pred = rf.predict(X_test_sc)

et = ExtraTreesClassifier(n_estimators=100,random_state=42)
et.fit(X_train_sc, y_train)
et_pred = et.predict(X_test_sc)

In [19]:
# Add the three scores we're tracking to the results data table

regular_knn = pd.Series(data=[accuracy_score(y_test, knn_pred), recall_score(y_test, knn_pred),
                              f1_score(y_test, knn_pred ,average='weighted')], index=binary_scores.columns, name = 'KNN(plain)')

regular_lr = pd.Series(data=[accuracy_score(y_test, lr_pred), recall_score(y_test, lr_pred),
                              f1_score(y_test, lr_pred ,average='weighted')], index=binary_scores.columns, name = 'LR(plain)')

regular_rf = pd.Series(data=[accuracy_score(y_test, rf_pred), recall_score(y_test, rf_pred),
                              f1_score(y_test, rf_pred ,average='weighted')], index=binary_scores.columns, name = 'RF(plain)')

regular_et = pd.Series(data=[accuracy_score(y_test, et_pred), recall_score(y_test, et_pred),
                              f1_score(y_test, et_pred ,average='weighted')], index=binary_scores.columns, name = 'ET(plain)')

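# DataFrame.append was removed in pandas 2.0; build a frame from the Series and concatenate instead
binary_scores = pd.concat([binary_scores, pd.DataFrame([regular_knn, regular_lr, regular_rf, regular_et])])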

binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(plain),0.913249,0.128713,0.893392
LR(plain),0.919558,0.0,0.881779
RF(plain),0.936909,0.227723,0.918876
ET(plain),0.925868,0.257426,0.912519


**INTERPRETATION:** The accuracy scores look high, but this is misleading. Our majority-class baseline is 92%, so these models are essentially predicting the majority class and are poor at predicting claims (recall of 0.26 at best).

## WITH OVERSAMPLING

**BELOW:** In order to combat the small proportion of minority class compared to majority class, we are testing out a few methods of over and undersampling. First, we will begin by oversampling the minority class only.

### Check the imbalance of the two classes

In [20]:
counter = Counter(y_train)
counter

Counter({0: 4669, 1: 403})

### Instantiate RandomOverSampler

**BELOW:** `RandomOverSampler` randomly duplicates minority-class rows from the training set until the chosen ratio is reached. We set the ratio so that the minority class grows to 20% of the size of the majority class.
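The idea behind random oversampling can be sketched in a few lines of plain Python. This is an illustration of the technique, not imbalanced-learn's implementation: the label list is rebuilt from the class counts in our training split, and the rounding (truncating with `int`) is an assumption on our part.

```python
import random
from collections import Counter

rng = random.Random(42)

# Labels rebuilt from the training-split class counts (4669 no-claim, 403 claim).
y_train_labels = [0] * 4669 + [1] * 403
sampling_ratio = 0.2  # desired minority / majority ratio after oversampling

counts = Counter(y_train_labels)
n_majority, n_minority = counts[0], counts[1]

# Extra minority rows needed to reach the ratio (rounded down; an assumption).
n_extra = int(n_majority * sampling_ratio) - n_minority

minority_idx = [i for i, label in enumerate(y_train_labels) if label == 1]
# Sample WITH replacement: the same row may be duplicated several times.
extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]

resampled_idx = list(range(len(y_train_labels))) + extra_idx
over_counts = Counter(y_train_labels[i] for i in resampled_idx)
print(over_counts)
```

No new information is created here. The added rows are exact copies of existing claims, so a model can memorize them.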

In [21]:
# ref: https://beckernick.github.io/oversampling-modeling/

over = RandomOverSampler(sampling_strategy=0.2, random_state=42)
X_over, y_over = over.fit_resample(X_train_sc,y_train)

In [22]:
over_counter = Counter(y_over)
over_counter

Counter({0: 4669, 1: 933})

### Fit the models

In [23]:
knn = KNeighborsClassifier()
knn.fit(X_over, y_over)
knn_pred = knn.predict(X_test_sc)

lr = LogisticRegression(random_state=42)
lr.fit(X_over, y_over)
lr_pred = lr.predict(X_test_sc)

rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(X_over, y_over)
rf_pred = rf.predict(X_test_sc)

et = ExtraTreesClassifier(n_estimators=100,random_state=42)
et.fit(X_over, y_over)
et_pred = et.predict(X_test_sc)

### Add the results

In [24]:
over_knn = pd.Series(data=[accuracy_score(y_test, knn_pred), recall_score(y_test, knn_pred),
                              f1_score(y_test, knn_pred ,average='weighted')], index=binary_scores.columns, name = 'KNN(oversample)')

over_lr = pd.Series(data=[accuracy_score(y_test, lr_pred), recall_score(y_test, lr_pred),
                              f1_score(y_test, lr_pred ,average='weighted')], index=binary_scores.columns, name = 'LR(oversample)')

over_rf = pd.Series(data=[accuracy_score(y_test, rf_pred), recall_score(y_test, rf_pred),
                              f1_score(y_test, rf_pred ,average='weighted')], index=binary_scores.columns, name = 'RF(oversample)')

over_et = pd.Series(data=[accuracy_score(y_test, et_pred), recall_score(y_test, et_pred),
                              f1_score(y_test, et_pred ,average='weighted')], index=binary_scores.columns, name = 'ET(oversample)')

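# DataFrame.append was removed in pandas 2.0; build a frame from the Series and concatenate instead
binary_scores = pd.concat([binary_scores, pd.DataFrame([over_knn, over_lr, over_rf, over_et])])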

binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(plain),0.913249,0.128713,0.893392
LR(plain),0.919558,0.0,0.881779
RF(plain),0.936909,0.227723,0.918876
ET(plain),0.925868,0.257426,0.912519
KNN(oversample),0.867508,0.336634,0.876073
LR(oversample),0.896688,0.059406,0.876649
RF(oversample),0.932965,0.257426,0.917971
ET(oversample),0.926656,0.237624,0.911725


**INTERPRETATION:** Oversampling improved recall for our KNN (0.129 to 0.337), LR (0 to 0.059), and RF (0.228 to 0.257) models, while ET dipped slightly (0.257 to 0.238). We decided to test another method to increase recall further.

### WITH SMOTE OVERSAMPLING AND RANDOM UNDERSAMPLING

**BELOW:** For this method we both oversample and undersample. First, we oversample the minority class again, this time with SMOTE. Instead of duplicating minority rows at random as above, SMOTE creates synthetic minority observations: it picks a minority row, finds its k nearest minority-class neighbors, and generates a new point by interpolating between the row and one of those neighbors. We use SMOTE to raise the minority class to 20% of the majority class (`sampling_strategy=0.2`). Then we use `RandomUnderSampler` to undersample the majority class and bring the ratio between majority and minority classes to 2:1 (`sampling_strategy=0.5`).
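The interpolation step at the heart of SMOTE can be sketched in plain Python. This is a toy illustration under our own assumptions (five hypothetical 2-D minority points, `k=2`, and a helper named `smote_like_sample`), not the imbalanced-learn implementation.

```python
import math
import random

rng = random.Random(42)

# Hypothetical minority-class points in a 2-D scaled feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]


def smote_like_sample(points, k=2):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    base = rng.choice(points)
    # Rank the other minority points by Euclidean distance to the base point.
    neighbours = sorted((p for p in points if p != base),
                        key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to neighbour, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


synthetic_points = [smote_like_sample(minority_points) for _ in range(10)]
for point in synthetic_points[:3]:
    print(point)
```

Every synthetic point lies on a line segment between two real minority points, so the new rows are plausible variations rather than exact copies.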

### Set up a pipeline with SMOTE and undersampling

In [25]:
# Set up the pipeline
# ref: https://pypi.org/project/imbalanced-learn/
# ref: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

over = SMOTE(sampling_strategy=0.2,random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5,random_state=42)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

In [26]:
# fit the pipeline
X_sm_und, y_sm_und = pipeline.fit_resample(X_train_sc, y_train)

In [27]:
# Find the ratio after over/undersampling
counter_5 = Counter(y_sm_und)
counter_5

Counter({0: 1866, 1: 933})
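As a sanity check, the class counts above can be reproduced with simple arithmetic from the two `sampling_strategy` values. The rounding (truncating with `int`) is our assumption about how the ratios are applied; it matches the counts shown.

```python
# Class counts from the original training split.
n_majority, n_minority = 4669, 403

smote_ratio = 0.2  # SMOTE: minority / majority after oversampling
under_ratio = 0.5  # RandomUnderSampler: minority / majority after undersampling

# Step 1: oversampling grows the minority class; the majority class is untouched.
minority_after = int(n_majority * smote_ratio)

# Step 2: undersampling shrinks the majority class; the minority class is untouched.
majority_after = int(minority_after / under_ratio)

minority_share = minority_after / (minority_after + majority_after)
print({0: majority_after, 1: minority_after}, round(minority_share, 3))
```

After both steps the minority class makes up one third of the training data, up from about 8% in the original split.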

### Fit the models

In [28]:
knn = KNeighborsClassifier()
knn.fit(X_sm_und, y_sm_und)
knn_pred = knn.predict(X_test_sc)

lr = LogisticRegression(random_state=42)
lr.fit(X_sm_und, y_sm_und)
lr_pred = lr.predict(X_test_sc)

rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(X_sm_und, y_sm_und)
rf_pred = rf.predict(X_test_sc)

et = ExtraTreesClassifier(n_estimators=100,random_state=42)
et.fit(X_sm_und, y_sm_und)
et_pred = et.predict(X_test_sc)

### Add the results

In [29]:
smote_knn = pd.Series(data=[accuracy_score(y_test, knn_pred), recall_score(y_test, knn_pred),
                              f1_score(y_test, knn_pred ,average='weighted')], index=binary_scores.columns, name = 'KNN(SMOTE/Under)')

smote_lr = pd.Series(data=[accuracy_score(y_test, lr_pred), recall_score(y_test, lr_pred),
                              f1_score(y_test, lr_pred ,average='weighted')], index=binary_scores.columns, name = 'LR(SMOTE/Under)')

smote_rf = pd.Series(data=[accuracy_score(y_test, rf_pred), recall_score(y_test, rf_pred),
                              f1_score(y_test, rf_pred ,average='weighted')], index=binary_scores.columns, name = 'RF(LR(SMOTE/Under))')

smote_et = pd.Series(data=[accuracy_score(y_test, et_pred), recall_score(y_test, et_pred),
                              f1_score(y_test, et_pred ,average='weighted')], index=binary_scores.columns, name = 'ET(LR(SMOTE/Under))')

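# DataFrame.append was removed in pandas 2.0; build a frame from the Series and concatenate instead
binary_scores = pd.concat([binary_scores, pd.DataFrame([smote_knn, smote_lr, smote_rf, smote_et])])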

binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(plain),0.913249,0.128713,0.893392
LR(plain),0.919558,0.0,0.881779
RF(plain),0.936909,0.227723,0.918876
ET(plain),0.925868,0.257426,0.912519
KNN(oversample),0.867508,0.336634,0.876073
LR(oversample),0.896688,0.059406,0.876649
RF(oversample),0.932965,0.257426,0.917971
ET(oversample),0.926656,0.237624,0.911725
KNN(SMOTE/Under),0.742902,0.455446,0.79622
LR(SMOTE/Under),0.809148,0.287129,0.836138
RF(LR(SMOTE/Under)),0.926656,0.277228,0.914426
ET(LR(SMOTE/Under)),0.908517,0.356436,0.905384


**INTERPRETATION:** The best recall scores come from combining SMOTE oversampling with random undersampling.<br>
KNN has the highest recall (0.455) but the lowest accuracy (0.743). Among the tree-based models, Extra Trees (recall 0.356) and Random Forest (recall 0.277) keep accuracy above 0.90.<br>
*Reminder of the baseline:*<br>
0 claims = 92%<br>
At least 1 claim = 7.9%<br>

---

## Look at training/testing scores of models

In [30]:
print('KNN(SMOTE/UNDER) Training Results:')
print(knn.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, knn.predict(X_sm_und)))

print('\nKNN(SMOTE/UNDER) Testing Results:')
print(knn.score(X_test_sc, y_test))
print(recall_score(y_test, knn.predict(X_test_sc)))

KNN(SMOTE/UNDER) Training Results:
0.8742408002858164
0.9003215434083601

KNN(SMOTE/UNDER) Testing Results:
0.7429022082018928
0.45544554455445546


In [31]:
print('LR(SMOTE/UNDER) Training Results:')
print(lr.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, lr.predict(X_sm_und)))

print('\nLR(SMOTE/UNDER) Testing Results:')
print(lr.score(X_test_sc, y_test))
print(recall_score(y_test, lr.predict(X_test_sc)))

LR(SMOTE/UNDER) Training Results:
0.7402643801357628
0.4844587352625938

LR(SMOTE/UNDER) Testing Results:
0.8091482649842271
0.2871287128712871


In [32]:
print('RF(SMOTE/UNDER) Training Results:')
print(rf.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, rf.predict(X_sm_und)))

print('\nRF(SMOTE/UNDER) Testing Results:')
print(rf.score(X_test_sc, y_test))
print(recall_score(y_test, rf.predict(X_test_sc)))

RF(SMOTE/UNDER) Training Results:
1.0
1.0

RF(SMOTE/UNDER) Testing Results:
0.9266561514195584
0.27722772277227725


In [33]:
et1_training_score = et.score(X_sm_und, y_sm_und)
et1_recall_training_score = recall_score(y_sm_und, et.predict(X_sm_und))

et1_testing_score = et.score(X_test_sc, y_test)
et1_recall_testing_score = recall_score(y_test, et.predict(X_test_sc)) 

print('ET(SMOTE/UNDER) FIRST Training Results:')
print(et1_training_score)
print(et1_recall_training_score)

print('\nET(SMOTE/UNDER) FIRST Testing Results:')
print(et1_testing_score)
print(et1_recall_testing_score)

ET(SMOTE/UNDER) FIRST Training Results:
1.0
1.0

ET(SMOTE/UNDER) FIRST Testing Results:
0.9085173501577287
0.3564356435643564


In [34]:
et.predict(X_test_sc)[:50]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

In [36]:
et.predict_proba(X_test_sc)[:50]

array([[0.86, 0.14],
       [0.63, 0.37],
       [0.99, 0.01],
       [1.  , 0.  ],
       [0.99, 0.01],
       [0.76, 0.24],
       [1.  , 0.  ],
       [0.95, 0.05],
       [0.8 , 0.2 ],
       [0.98, 0.02],
       [0.96, 0.04],
       [0.96, 0.04],
       [0.77, 0.23],
       [0.89, 0.11],
       [0.84, 0.16],
       [0.96, 0.04],
       [0.52, 0.48],
       [0.86, 0.14],
       [0.94, 0.06],
       [0.93, 0.07],
       [0.79, 0.21],
       [0.88, 0.12],
       [0.73, 0.27],
       [0.96, 0.04],
       [0.09, 0.91],
       [0.98, 0.02],
       [0.98, 0.02],
       [0.82, 0.18],
       [0.82, 0.18],
       [0.9 , 0.1 ],
       [0.93, 0.07],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.82, 0.18],
       [0.61, 0.39],
       [0.97, 0.03],
       [0.62, 0.38],
       [0.68, 0.32],
       [0.59, 0.41],
       [0.93, 0.07],
       [0.92, 0.08],
       [0.86, 0.14],
       [0.93, 0.07],
       [0.98, 0.02],
       [0.98, 0.02],
       [0.97, 0.03],
       [0.98, 0.02],
       [0.8 ,
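Because the problem statement asks for a likelihood, the class-1 column of `predict_proba` is arguably more useful than the hard 0/1 predictions. As a hedged sketch (threshold tuning is not something we ran in this notebook), the snippet below copies the class-1 probabilities from the first 25 rows printed above and shows how many policies get flagged at the default 0.5 cutoff versus a hypothetical, more lenient 0.3 cutoff.

```python
# Class-1 ("at least one claim") probabilities copied from the first 25 rows printed above.
claim_proba = [
    0.14, 0.37, 0.01, 0.00, 0.01, 0.24, 0.00, 0.05, 0.20, 0.02,
    0.04, 0.04, 0.23, 0.11, 0.16, 0.04, 0.48, 0.14, 0.06, 0.07,
    0.21, 0.12, 0.27, 0.04, 0.91,
]


def flag_policies(probabilities, threshold=0.5):
    """Return 1 for every policy whose claim probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


default_flags = flag_policies(claim_proba)                 # default-style cutoff
lenient_flags = flag_policies(claim_proba, threshold=0.3)  # hypothetical lower cutoff

print(sum(default_flags), sum(lenient_flags))
```

Lowering the cutoff flags more policies, which can only raise recall and can only lower or hold precision. Measuring the actual trade-off would require comparing the flags against `y_test`.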

In [37]:
binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(plain),0.913249,0.128713,0.893392
LR(plain),0.919558,0.0,0.881779
RF(plain),0.936909,0.227723,0.918876
ET(plain),0.925868,0.257426,0.912519
KNN(oversample),0.867508,0.336634,0.876073
LR(oversample),0.896688,0.059406,0.876649
RF(oversample),0.932965,0.257426,0.917971
ET(oversample),0.926656,0.237624,0.911725
KNN(SMOTE/Under),0.742902,0.455446,0.79622
LR(SMOTE/Under),0.809148,0.287129,0.836138
RF(LR(SMOTE/Under)),0.926656,0.277228,0.914426
ET(LR(SMOTE/Under)),0.908517,0.356436,0.905384


## Logistic Regression with Regularization and SMOTE/Undersampling

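*The earlier logistic regression models used scikit-learn's default settings (an L2 penalty with `C=1.0`). Here I wanted to see how an L1 penalty, with its strength chosen by cross-validation, would do.*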

In [38]:
logreg_cv = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=42)
logreg_cv.fit(X_sm_und, y_sm_und)
logreg_cv_pred = logreg_cv.predict(X_test_sc)

print('LRCV(SMOTE/UNDER) Training Results:')
print(logreg_cv.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, logreg_cv.predict(X_sm_und)))

print('\nLRCV(SMOTE/UNDER) Testing Results:')
print(logreg_cv.score(X_test_sc, y_test))
print(recall_score(y_test, logreg_cv.predict(X_test_sc)))

LRCV(SMOTE/UNDER) Training Results:
0.7424080028581637
0.48017148981779206

LRCV(SMOTE/UNDER) Testing Results:
0.8201892744479495
0.297029702970297


In [39]:
# Adding results to the results table
logregcv_et = pd.Series(data=[accuracy_score(y_test, logreg_cv_pred), recall_score(y_test, logreg_cv_pred),
                              f1_score(y_test, logreg_cv_pred ,average='weighted')], index=binary_scores.columns, name = 'LRCV(SMOTE/Under)')

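# DataFrame.append was removed in pandas 2.0; build a frame from the Series and concatenate instead
binary_scores = pd.concat([binary_scores, pd.DataFrame([logregcv_et])])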

binary_scores

Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(plain),0.913249,0.128713,0.893392
LR(plain),0.919558,0.0,0.881779
RF(plain),0.936909,0.227723,0.918876
ET(plain),0.925868,0.257426,0.912519
KNN(oversample),0.867508,0.336634,0.876073
LR(oversample),0.896688,0.059406,0.876649
RF(oversample),0.932965,0.257426,0.917971
ET(oversample),0.926656,0.237624,0.911725
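KNN(SMOTE/Under),0.742902,0.455446,0.79622
LR(SMOTE/Under),0.809148,0.287129,0.836138
RF(LR(SMOTE/Under)),0.926656,0.277228,0.914426
ET(LR(SMOTE/Under)),0.908517,0.356436,0.905384
LRCV(SMOTE/Under),0.820189,0.29703,0.843597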


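**INTERPRETATION:** Cross-validated L1 regularization barely changed the results (test recall of 0.297 vs. 0.287 for the default logistic regression), so we won't be using it moving forward.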

## Look at feature importance from best models

In [40]:
# # Ref: https://towardsdatascience.com/interpreting-random-forest-and-other-black-box-models-like-xgboost-80f9cc4a3c38

# knn_smote_under_feature_imp = pd.DataFrame({'Variable':X.columns,
#               'Importance':et.feature_importances_}).sort_values('Importance', ascending=False)

# knn_smote_under_feature_imp[:30]

In [41]:
# Ref: https://towardsdatascience.com/interpreting-random-forest-and-other-black-box-models-like-xgboost-80f9cc4a3c38

et_smote_under_feature_imp = pd.DataFrame({'Variable':X.columns,
              'Importance':et.feature_importances_}).sort_values('Importance', ascending=False)

et_smote_under_feature_imp[:30]



Unnamed: 0,Variable,Importance
6,policy_length,0.077404
9,New/Renl/Endt/Canc/Flat_new,0.065364
10,New/Renl/Endt/Canc/Flat_renl,0.041816
1,Year Built,0.041571
5,Age,0.041542
2,Length,0.041432
3,Hull Limit,0.040715
0,Years Exp.,0.036133
24,Occupation_other,0.03104
29,Occupation_retired,0.023524


In [42]:
# Compare to feature importance of second best model

rf_smote_under_feature_imp = pd.DataFrame({'Variable':X.columns,
              'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False)

top_30 = rf_smote_under_feature_imp[:30]

In [43]:
# Get only top 30 important features to use in an updated model 
post_rf_model_features = [x for x in top_30['Variable']]

## Redo models again using only top 30 features

In [44]:
binary_scores2 = pd.DataFrame(columns=['Accuracy', 'Recall','Weighted F1 Score'])
binary_scores2

Unnamed: 0,Accuracy,Recall,Weighted F1 Score


In [45]:
X = combined[post_rf_model_features]
y = combined['binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

In [46]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [47]:
over = SMOTE(sampling_strategy=0.2, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

X_sm_und, y_sm_und = pipeline.fit_resample(X_train_sc,y_train)

In [48]:
knn = KNeighborsClassifier()
knn.fit(X_sm_und, y_sm_und)
knn_pred = knn.predict(X_test_sc)

lr = LogisticRegression(random_state=42)
lr.fit(X_sm_und, y_sm_und)
lr_pred = lr.predict(X_test_sc)

rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(X_sm_und, y_sm_und)
rf_pred = rf.predict(X_test_sc)

et = ExtraTreesClassifier(n_estimators=100,random_state=42)
et.fit(X_sm_und, y_sm_und)
et_pred = et.predict(X_test_sc)

logreg_cv = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=42)
logreg_cv.fit(X_sm_und, y_sm_und)
logreg_cv_pred = logreg_cv.predict(X_test_sc)

In [49]:
smote_knn = pd.Series(data=[accuracy_score(y_test, knn_pred), recall_score(y_test, knn_pred),
                              f1_score(y_test, knn_pred ,average='weighted')], index=binary_scores2.columns, name = 'KNN(SMOTE/Under)')

smote_lr = pd.Series(data=[accuracy_score(y_test, lr_pred), recall_score(y_test, lr_pred),
                              f1_score(y_test, lr_pred ,average='weighted')], index=binary_scores2.columns, name = 'LR(SMOTE/Under)')

smote_rf = pd.Series(data=[accuracy_score(y_test, rf_pred), recall_score(y_test, rf_pred),
                              f1_score(y_test, rf_pred ,average='weighted')], index=binary_scores2.columns, name = 'RF(SMOTE/Under)')

smote_et = pd.Series(data=[accuracy_score(y_test, et_pred), recall_score(y_test, et_pred),
                              f1_score(y_test, et_pred ,average='weighted')], index=binary_scores2.columns, name = 'ET(SMOTE/Under)')

smote_logreg_cv = pd.Series(data=[accuracy_score(y_test, logreg_cv_pred), recall_score(y_test, logreg_cv_pred),
                              f1_score(y_test, logreg_cv_pred ,average='weighted')], index=binary_scores2.columns, name = 'LOGREG(SMOTE/Under)')

# DataFrame.append was removed in pandas 2.0; build a frame from the Series and concatenate instead
binary_scores2 = pd.concat([binary_scores2, pd.DataFrame([smote_knn, smote_lr, smote_rf, smote_et, smote_logreg_cv])])

print('New_Scores')
binary_scores2

New_Scores


Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(SMOTE/Under),0.731861,0.524752,0.789556
LR(SMOTE/Under),0.876183,0.287129,0.87958
RF(SMOTE/Under),0.929022,0.29703,0.917493
ET(SMOTE/Under),0.908517,0.316832,0.903354
LOGREG(SMOTE/Under),0.894322,0.217822,0.887735


In [50]:
print('Old_Scores')
binary_scores[-5:]

Old_Scores


Unnamed: 0,Accuracy,Recall,Weighted F1 Score
KNN(SMOTE/Under),0.742902,0.455446,0.79622
LR(SMOTE/Under),0.809148,0.287129,0.836138
RF(LR(SMOTE/Under)),0.926656,0.277228,0.914426
ET(LR(SMOTE/Under)),0.908517,0.356436,0.905384
LRCV(SMOTE/Under),0.820189,0.29703,0.843597


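**INTERPRETATION:** Using only the top 30 most important features from the Random Forest model gave mixed results. Recall improved for KNN (0.455 to 0.525) and Random Forest (0.277 to 0.297), was unchanged for LR (0.287), and dropped for Extra Trees (0.356 to 0.317) and the cross-validated logistic regression (0.297 to 0.218). Accuracy improved for both logistic regression models.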

In [51]:
print('KNN(SMOTE/UNDER) SECOND Training Results:')
print(knn.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, knn.predict(X_sm_und)))

print('\nKNN(SMOTE/UNDER) SECOND Testing Results:')
print(knn.score(X_test_sc, y_test))
print(recall_score(y_test, knn.predict(X_test_sc)))

KNN(SMOTE/UNDER) SECOND Training Results:
0.8567345480528761
0.9035369774919614

KNN(SMOTE/UNDER) SECOND Testing Results:
0.7318611987381703
0.5247524752475248


In [52]:
print('LR(SMOTE/UNDER) SECOND Training Results:')
print(lr.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, lr.predict(X_sm_und)))

print('\nLR(SMOTE/UNDER) SECOND Testing Results:')
print(lr.score(X_test_sc, y_test))
print(recall_score(y_test, lr.predict(X_test_sc)))

LR(SMOTE/UNDER) SECOND Training Results:
0.684172918899607
0.2347266881028939

LR(SMOTE/UNDER) SECOND Testing Results:
0.8761829652996845
0.2871287128712871


In [53]:
print('RF(SMOTE/UNDER) SECOND Training Results:')
print(rf.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, rf.predict(X_sm_und)))

print('\nRF(SMOTE/UNDER) SECOND Testing Results:')
print(rf.score(X_test_sc, y_test))
print(recall_score(y_test, rf.predict(X_test_sc)))

RF(SMOTE/UNDER) SECOND Training Results:
1.0
1.0

RF(SMOTE/UNDER) SECOND Testing Results:
0.9290220820189274
0.297029702970297


In [54]:
print('ET(SMOTE/UNDER) SECOND Training Results:')
print(et.score(X_sm_und, y_sm_und))
print(recall_score(y_sm_und, et.predict(X_sm_und)))

print('\nET(SMOTE/UNDER) SECOND Testing Results:')
print(et.score(X_test_sc, y_test))
print(recall_score(y_test, et.predict(X_test_sc)))

ET(SMOTE/UNDER) SECOND Training Results:
1.0
1.0

ET(SMOTE/UNDER) SECOND Testing Results:
0.9085173501577287
0.31683168316831684


---
## Gridsearch over KNN to fine tune model

In [55]:
# Ref for gridsearching for recall score: https://stackoverflow.com/questions/49035011/get-precison-model-through-gridsearchcv-for-recall-optimization
knn_params = {
    'n_neighbors':range(2, 5),
    'metric': ['euclidean', 'manhattan']
}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, 
                              verbose=1, scoring='recall')

In [56]:
knn_gridsearch.fit(X_sm_und, y_sm_und);

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.7s finished


In [57]:
print(knn_gridsearch.best_score_)
print(knn_gridsearch.best_params_)

0.8199470990742338
{'metric': 'euclidean', 'n_neighbors': 3}


In [58]:
print('KNN Gridsearch Training Results:')
print(accuracy_score(y_sm_und, knn_gridsearch.predict(X_sm_und)))
print(knn_gridsearch.score(X_sm_und, y_sm_und))

print('\nKNN Gridsearch Testing Results:')
print(accuracy_score(y_test, knn_gridsearch.predict(X_test_sc)))
print(knn_gridsearch.score(X_test_sc, y_test))


KNN Gridsearch Training Results:
0.9046087888531619
0.9506966773847803

KNN Gridsearch Testing Results:
0.7413249211356467
0.48514851485148514
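The best cross-validated recall (0.82) is far above the test recall (0.485). One possible contributor, offered as an assumption rather than a diagnosis, is that the grid search ran on data that had already been resampled, so copies or near-copies of the same minority row can land in both the training and validation folds. The toy sketch below uses hypothetical row ids and plain duplication (simpler than SMOTE) to count how often that happens.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: ids 0..99 are majority rows, ids 100..109 are minority rows.
majority_ids = list(range(100))
minority_ids = list(range(100, 110))

# Oversample BEFORE splitting: add 40 random duplicates of minority rows, then shuffle.
oversampled = majority_ids + minority_ids + [rng.choice(minority_ids) for _ in range(40)]
rng.shuffle(oversampled)


def k_folds(items, k):
    """Yield (train, validation) lists for a plain k-fold split by position."""
    for i in range(k):
        validation = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, validation


leaky = 0
total_minority_val = 0
for train, validation in k_folds(oversampled, 5):
    train_ids = set(train)
    minority_val = [x for x in validation if x >= 100]
    total_minority_val += len(minority_val)
    # A validation row "leaks" if an identical copy also sits in the training fold.
    leaky += sum(1 for x in minority_val if x in train_ids)

leak_fraction = leaky / total_minority_val
print(f"{leaky} of {total_minority_val} minority validation rows also appear in training")
```

A common remedy is to place the sampler inside an `imblearn.pipeline.Pipeline` and pass that pipeline to `GridSearchCV`, so resampling is applied only to the training portion of each fold.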


In [59]:
# Save the best model
knn_model = knn_gridsearch

## Gridsearch over RF to fine tune model

In [60]:
rf_params = {
    'n_estimators': [50,75,100],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
rf_gs = GridSearchCV(RandomForestClassifier(), param_grid=rf_params, cv=5, 
                     verbose= 1, scoring= 'recall')
rf_gs.fit(X_sm_und, y_sm_und)
print(rf_gs.best_score_)
rf_gs.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   18.2s finished


0.6633201081018918


{'max_depth': None, 'n_estimators': 50}

In [61]:
print('RF Gridsearch Training Results:')
print(accuracy_score(y_sm_und, rf_gs.predict(X_sm_und)))
print(rf_gs.score(X_sm_und, y_sm_und))


print('\nRF Gridsearch Testing Results:')
print(accuracy_score(y_test, rf_gs.predict(X_test_sc)))
print(rf_gs.score(X_test_sc, y_test))


RF Gridsearch Training Results:
0.9996427295462665
0.9989281886387996

RF Gridsearch Testing Results:
0.9203470031545742
0.2871287128712871


## Gridsearch over ET to fine tune model

In [62]:
et_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
et_gs = GridSearchCV(ExtraTreesClassifier(), param_grid=et_params, cv=5, 
                     verbose= 1, scoring= 'recall')
et_gs.fit(X_sm_und, y_sm_und)
print(et_gs.best_score_)
et_gs.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   33.4s finished


0.7652003910068427


{'max_depth': None, 'n_estimators': 150}

In [63]:
print('ET Gridsearch Training Results:')
print(accuracy_score(y_sm_und, et_gs.predict(X_sm_und)))
print(et_gs.score(X_sm_und, y_sm_und))


print('\nET Gridsearch Testing Results:')
print(accuracy_score(y_test, et_gs.predict(X_test_sc)))
print(et_gs.score(X_test_sc, y_test))


ET Gridsearch Training Results:
1.0
1.0

ET Gridsearch Testing Results:
0.9006309148264984
0.3069306930693069


**INTERPRETATION:** KNN had the best test recall (0.49, versus 0.31 for Extra Trees and 0.29 for Random Forest), but it is a less interpretable model than Random Forest, which had the highest test accuracy (0.92). It was important for the client to be able to see the feature importances. In other words, he wanted to know which attributes of a boat or owner are most influential when predicting the likelihood of a claim.
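Permutation importance is one model-agnostic way to rank features for a model such as KNN, which exposes no `feature_importances_` attribute. The sketch below is illustrative only: it uses synthetic data from `make_classification` and hypothetical feature names (`feature_0`, ...), not the policy dataset, and scores on a held-out split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the policy data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scale with statistics from the training split only, then fit KNN
scaler = StandardScaler().fit(X_tr)
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)

# Shuffle each feature in the held-out set and record the mean drop in recall
perm = permutation_importance(
    knn_demo, scaler.transform(X_te), y_te,
    scoring="recall", n_repeats=10, random_state=42,
)

perm_table = (
    pd.DataFrame({"Variable": feature_names, "Importance": perm.importances_mean})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(perm_table)
```

Scoring on held-out data rather than the resampled training set keeps the ranking tied to how the model generalizes, at the cost of noisier estimates when the positive class is small.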



---

In [64]:
# # https://machinelearningmastery.com/calculate-feature-importance-with-python/

# results = permutation_importance(knn_gridsearch, X_sm_und, y_sm_und, scoring='recall')
# # get importance
# importance = results.importances_mean
# # summarize feature importance
# for i,v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i,v))
# # # plot feature importance
# # pyplot.bar([x for x in range(len(importance))], importance)
# # pyplot.show()

In [65]:
# # Reminder of best params for RF
# rf_gs.best_params_

# # Instantiate model w/ best params in order to pull the feature importances
# new_rf = RandomForestClassifier(max_depth=None, n_estimators=75,random_state=42)
# new_rf.fit(X_sm_und, y_sm_und)
# new_rf_pred = new_rf.predict(X_test_sc)

# new_rf_feature_imp = pd.DataFrame({'Variable':X.columns,
#               'Importance':new_rf.feature_importances_}).sort_values('Importance', ascending=False)

# new_rf_feature_imp

---
## Fine tuning best model: Random Forest

In [66]:
# Create a table to hold the results of the best model as I iterate on it
final_results_table = pd.DataFrame(columns=['Accuracy', 'Recall'])

In [67]:
# Reminder of best params for RF
rf_gs.best_params_

{'max_depth': None, 'n_estimators': 50}

In [68]:
# Note: n_estimators=75 is used here, although the grid search above selected n_estimators=50
best_model_rf = RandomForestClassifier(max_depth=None, n_estimators=75,random_state=42)
best_model_rf.fit(X_sm_und, y_sm_und)
bm_rf_pred = best_model_rf.predict(X_test_sc)

In [69]:
rf_first_train = pd.Series(data=[best_model_rf.score(X_sm_und, y_sm_und), recall_score(y_sm_und, best_model_rf.predict(X_sm_und))],
                                 index=final_results_table.columns, name = 'RF(Training)')
                                 
rf_first_test = pd.Series(data=[best_model_rf.score(X_test_sc, y_test), recall_score(y_test, best_model_rf.predict(X_test_sc))],
                               index=final_results_table.columns, name = 'RF(Testing)')

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
final_results_table = pd.concat([final_results_table, pd.DataFrame([rf_first_train, rf_first_test])])

final_results_table

Unnamed: 0,Accuracy,Recall
RF(Training),1.0,1.0
RF(Testing),0.928233,0.306931


**INTERPRETATION:** The model is overfit: training accuracy and recall are both 1.0, but test recall is only 0.31. I will try adjusting the feature selection below.
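The train/test comparison used to diagnose overfitting can be wrapped in a small helper so each experiment adds its rows the same way. This is a minimal sketch on synthetic data; `score_row` is a hypothetical helper name, and the numbers it prints are not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split


def score_row(model, X, y, name):
    """Return accuracy and recall for one model/dataset pair as a named Series."""
    pred = model.predict(X)
    return pd.Series(
        {"Accuracy": accuracy_score(y, pred), "Recall": recall_score(y, pred)},
        name=name,
    )


# Synthetic, imbalanced stand-in for the policy data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, weights=[0.9, 0.1], flip_y=0.05, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=75, random_state=42).fit(X_tr, y_tr)

# One row per split; the Series names become the row labels
results = pd.DataFrame([
    score_row(rf_demo, X_tr, y_tr, "RF(Training)"),
    score_row(rf_demo, X_te, y_te, "RF(Testing)"),
])

# A large positive gap on either metric suggests overfitting
gap = results.loc["RF(Training)"] - results.loc["RF(Testing)"]
print(results)
print(gap)
```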

In [70]:
best_rf_feature_imp = pd.DataFrame({'Variable':X.columns,
              'Importance':best_model_rf.feature_importances_}).sort_values('Importance', ascending=False)

best_rf_feature_imp

Unnamed: 0,Variable,Importance
0,policy_length,0.142156
2,Age,0.095411
1,Hull Limit,0.093279
3,Length,0.09312
4,Year Built,0.089054
6,Years Exp.,0.074954
5,New/Renl/Endt/Canc/Flat_new,0.045659
7,New/Renl/Endt/Canc/Flat_renl,0.033095
8,# Engines,0.032026
17,Builder_other,0.028039


## Test best model with 'Policy Length' and 'New/Renl/Endt/Canc/Flat' removed

In [71]:
removals = [col for col in combined if col.startswith('New/Renl/Endt/Canc/Flat')]
removals.append('policy_length')
removals.append('num_claims')
removals.append('binary')
removals

['New/Renl/Endt/Canc/Flat_endt',
 'New/Renl/Endt/Canc/Flat_endt-canc',
 'New/Renl/Endt/Canc/Flat_new',
 'New/Renl/Endt/Canc/Flat_renl',
 'policy_length',
 'num_claims',
 'binary']

In [72]:
X = combined.drop(columns=removals)
y = combined['binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

over = SMOTE(sampling_strategy=0.2,random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5,random_state=42)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

X_sm_und, y_sm_und = pipeline.fit_resample(X_train_sc, y_train)

rf = RandomForestClassifier(max_depth=None, n_estimators=75,random_state=42)
rf.fit(X_sm_und, y_sm_und)
rf_pred = rf.predict(X_test_sc)

In [73]:
rf_second_train = pd.Series(data=[rf.score(X_sm_und, y_sm_und), recall_score(y_sm_und, rf.predict(X_sm_und))],
                                 index=final_results_table.columns, name = 'RF_Feature_Drop(Training)')
                                 
rf_second_test = pd.Series(data=[rf.score(X_test_sc, y_test), recall_score(y_test, rf.predict(X_test_sc))],
                               index=final_results_table.columns, name = 'RF_Feature_Drop(Testing)')

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
final_results_table = pd.concat([final_results_table, pd.DataFrame([rf_second_train, rf_second_test])])

final_results_table

Unnamed: 0,Accuracy,Recall
RF(Training),1.0,1.0
RF(Testing),0.928233,0.306931
RF_Feature_Drop(Training),0.968203,0.943194
RF_Feature_Drop(Testing),0.888801,0.376238


In [74]:
rf_feature_imp = pd.DataFrame({'Variable':X.columns,
              'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False)

rf_feature_imp

Unnamed: 0,Variable,Importance
3,Hull Limit,0.085680
2,Length,0.081108
5,Age,0.080706
1,Year Built,0.075903
0,Years Exp.,0.073682
...,...,...
84,Mooring County_marin,0.000363
17,Occupation_manager,0.000345
111,Mooring County_whatcom,0.000240
27,Occupation_software developer,0.000216


## Try it after keeping only the top 30 features from above

In [75]:
# Get the first 30 rows of the important features
new_top_30 = rf_feature_imp[:30]

# Make a list of the column names of the top 30
new_top_30_features = new_top_30['Variable'].tolist()

new_top_30_features

['Hull Limit',
 'Length',
 'Age',
 'Year Built',
 'Years Exp.',
 'Occupation_other',
 'Occupation_not reported',
 '# Engines',
 'Occupation_retired',
 'Hull Type_monohull sail',
 'Married yes/no_not reported',
 'Married yes/no_yes',
 'Hull Type_motoryacht',
 'Builder_other',
 'Mooring County_other',
 'Hull Type_multihull sail',
 'Mooring County_monroe',
 'Hull Type_trawler',
 'Mooring County_pinellas',
 'Mooring County_caribbean',
 'Mooring County_miami-dade',
 'Mooring County_bcs',
 'Mooring County_broward',
 'Mooring County_san diego',
 'Hull Type_sportfisher',
 'Power Type_inboard',
 'Mooring County_los angeles',
 'Mooring County_palm beach',
 'Occupation_business owner',
 'Occupation_executive']

In [76]:
X = combined[new_top_30_features]
y = combined['binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

over = SMOTE(sampling_strategy=0.2,random_state=42)
under = RandomUnderSampler(sampling_strategy=0.5,random_state=42)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

X_sm_und, y_sm_und = pipeline.fit_resample(X_train_sc, y_train)

rf = RandomForestClassifier(max_depth=None, n_estimators=75,random_state=42)
rf.fit(X_sm_und, y_sm_und)
rf_pred = rf.predict(X_test_sc)

In [77]:
rf_third_train = pd.Series(data=[rf.score(X_sm_und, y_sm_und), recall_score(y_sm_und, rf.predict(X_sm_und))],
                                 index=final_results_table.columns, name = 'RF_Top_30(Training)')
                                 
rf_third_test = pd.Series(data=[rf.score(X_test_sc, y_test), recall_score(y_test, rf.predict(X_test_sc))],
                               index=final_results_table.columns, name = 'RF_Top_30(Testing)')

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
final_results_table = pd.concat([final_results_table, pd.DataFrame([rf_third_train, rf_third_test])])

final_results_table

Unnamed: 0,Accuracy,Recall
RF(Training),1.0,1.0
RF(Testing),0.928233,0.306931
RF_Feature_Drop(Training),0.968203,0.943194
RF_Feature_Drop(Testing),0.888801,0.376238
RF_Top_30(Training),0.966059,0.944266
RF_Top_30(Testing),0.891956,0.415842


---
## Try gridsearch again with top 30 features

In [78]:
rf_params = {
    'n_estimators': [50,75,100],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
rf_gs = GridSearchCV(RandomForestClassifier(), param_grid=rf_params, cv=5, 
                     verbose= 1, scoring= 'recall')
rf_gs.fit(X_sm_und, y_sm_und)
print(rf_gs.best_score_)
rf_gs.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   21.8s finished


0.6559024782933702


{'max_depth': None, 'n_estimators': 50}

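**INTERPRETATION:** The grid search returned the same best parameters as the first search ({'max_depth': None, 'n_estimators': 50}), so it did not surface a new configuration and the model is not rerun here. Note that the Random Forest models above were fit with n_estimators=75, not the 50 selected by the search.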
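Copying grid-search parameters by hand makes it easy for the fitted model to drift from what the search selected. One way to avoid that, sketched here on synthetic data with a deliberately small grid (both are assumptions for illustration), is to reuse `best_estimator_` or unpack `best_params_` directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the resampled training data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

demo_gs = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=3,
    scoring="recall",
)
demo_gs.fit(X_demo, y_demo)

# Option 1: the model already refit on all the data (refit=True is the default)
best_from_search = demo_gs.best_estimator_

# Option 2: build a fresh model by unpacking the selected parameters
rebuilt = RandomForestClassifier(**demo_gs.best_params_, random_state=42)
rebuilt.fit(X_demo, y_demo)

print(demo_gs.best_params_)
print(rebuilt.n_estimators, rebuilt.max_depth)
```

Either option keeps the final model's hyperparameters in sync with the search output, so a later change to the grid cannot silently leave a stale value in the model cell.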

In [79]:
# Look at the final results again
final_results_table

Unnamed: 0,Accuracy,Recall
RF(Training),1.0,1.0
RF(Testing),0.928233,0.306931
RF_Feature_Drop(Training),0.968203,0.943194
RF_Feature_Drop(Testing),0.888801,0.376238
RF_Top_30(Training),0.966059,0.944266
RF_Top_30(Testing),0.891956,0.415842


In [80]:
final_rf_feature_imp = pd.DataFrame({'Variable':X.columns,
              'Importance':rf.feature_importances_}).sort_values('Importance', ascending=False)

final_rf_feature_imp

Unnamed: 0,Variable,Importance
2,Age,0.123628
0,Hull Limit,0.121111
1,Length,0.118789
3,Year Built,0.112728
4,Years Exp.,0.104244
13,Builder_other,0.038598
7,# Engines,0.035358
5,Occupation_other,0.032266
11,Married yes/no_yes,0.029602
14,Mooring County_other,0.027427


**INTERPRETATION:** 

---
## Conclusion:

In this notebook we...