# Assignment 11: Split Data
Split Data\
Ismail Abdo Elmaliki\
CS 502 - Predictive Analytics\
Capitol Technology University\
Professor Frank Neugebauer\
March 17, 2022

# Table of Contents
*Data Understanding*
- Info and Head
- Skew

*Feature Engineering*
- Rename columns
- Encoding Category Features
- Resolving Positive Skewness

*Prediction Model*
- Random Forest Classifier - Setting up function
- Results Evaluation
- **Confusion Matrix**
- **Undersampling Data**
- **Confusion Matrix (Post Undersampling)**
- **Hyperparameter Tuning**
- **Prediction and Confusion Matrix Wrap Up**

*Conclusion*

*References*

**NOTE**: Anything bolded within Table of Contents indicates new content explicitly for assignment 11

## Data Understanding

### Info and Head
Looking at the data at a high level, here are some observations:
- `Agency` is of object type -> will need to apply feature engineering and change to numerical values
- `Agency Type` is of object type -> will need to apply feature engineering and change to numerical values
- `Distribution Channel` is of object type -> will need to apply feature engineering and change to numerical values
- `Product Name` is of object type -> will need to apply feature engineering and change to numerical values
- `Claim` is of object type -> will need to apply feature engineering and change to numerical values
- `Destination` is of object type -> will need to apply feature engineering and change to numerical values
- `Gender` is of object type -> will need to apply feature engineering and change to numerical values; also there are missing values which will need to be filled

In [986]:
import pandas as pd
import numpy as np

df = pd.read_csv('travel_insurance.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Agency                63326 non-null  object 
 1   Agency Type           63326 non-null  object 
 2   Distribution Channel  63326 non-null  object 
 3   Product Name          63326 non-null  object 
 4   Claim                 63326 non-null  object 
 5   Duration              63326 non-null  int64  
 6   Destination           63326 non-null  object 
 7   Net Sales             63326 non-null  float64
 8   Commision (in value)  63326 non-null  float64
 9   Gender                18219 non-null  object 
 10  Age                   63326 non-null  int64  
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB


|   | Agency | Agency Type | Distribution Channel | Product Name | Claim | Duration | Destination | Net Sales | Commision (in value) | Gender | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 81 |
| 1 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 71 |
| 2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 65 | AUSTRALIA | -49.5 | 29.7 |  | 32 |
| 3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 60 | AUSTRALIA | -39.6 | 23.76 |  | 32 |
| 4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 79 | ITALY | -19.8 | 11.88 |  | 41 |


### Skew
We can also see below that all four numerical columns (`Duration`, `Net Sales`, `Commision (in value)`, and `Age`) are positively skewed, `Duration` most of all. This is something to keep in mind as feature engineering is applied.

In [987]:
df.skew()


Duration                23.179617
Net Sales                3.272373
Commision (in value)     4.032269
Age                      2.987710
dtype: float64
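As a quick illustration of what a positive skew value means, here is a minimal sketch on made-up data (the two series below are hypothetical and are not taken from `travel_insurance.csv`). A long right tail pulls the skew well above zero, while a symmetric series sits at zero.

```python
import pandas as pd

# Hypothetical toy data: one series with a long right tail, one symmetric.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

# Series.skew() is the same statistic df.skew() reports per column.
print("right-tailed skew:", right_tailed.skew())  # clearly positive
print("symmetric skew:", symmetric.skew())        # zero
```

A single large value (like the very long trips in `Duration`) is enough to produce a large positive skew.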

## Feature Engineering

### Rename columns
Let's start by renaming the columns to short, lowercase, snake_case names.

In [988]:
df.rename(
    columns={
        'Agency': 'agency', 
        'Agency Type': 'agency_type', 
        'Distribution Channel': 'distribution', 
        'Product Name': 'product_name',
        'Claim': 'claim',
        'Duration': 'duration',
        'Destination': 'destination',
        'Net Sales': 'net_sales',
        'Commision (in value)': 'commision',
        'Gender': 'gender',
        'Age': 'age'}, 
    inplace=True)
df.head()

|   | agency | agency_type | distribution | product_name | claim | duration | destination | net_sales | commision | gender | age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 81 |
| 1 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 71 |
| 2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 65 | AUSTRALIA | -49.5 | 29.7 |  | 32 |
| 3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 60 | AUSTRALIA | -39.6 | 23.76 |  | 32 |
| 4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 79 | ITALY | -19.8 | 11.88 |  | 41 |


### Encoding Category Features
Based on the unique values of each categorical column, we'll apply the following encoding for each column:
- `agency`: frequency encoding, since every agency has a distinct frequency count
- `agency_type`: 0 for airlines, 1 for travel agency
- `distribution`: 0 for offline, 1 for online
- `product_name`: binary encoding; a better option than one-hot encoding, which would create one column per distinct value
- `claim`: 0 for no, 1 for yes
- `destination`: binary encoding, for the same reason as `product_name`
- `gender`: one-hot encoding into two columns, `male` and `female`. This also addresses the missing values, since rows with no gender get 0 in both columns
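To make the two less familiar encodings in the list above concrete, here is a minimal sketch on a hypothetical six-row frame. The binary encoding is written out by hand purely for illustration; `category_encoders.BinaryEncoder` follows the same idea (ordinal code, then one column per bit), but its exact code assignment and column order may differ, and the `agency_freq` / `agency_bin_*` column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook uses the full travel insurance data.
toy = pd.DataFrame({"agency": ["EPX", "EPX", "CWT", "EPX", "C2B", "CWT"]})

# Frequency encoding: replace each category with how often it occurs.
counts = toy["agency"].value_counts()
toy["agency_freq"] = toy["agency"].map(counts)

# Binary encoding by hand: give each category a 1-based ordinal code,
# then spread that code's binary digits over a handful of 0/1 columns.
codes = toy["agency"].astype("category").cat.codes.astype(int) + 1
n_bits = int(codes.max()).bit_length()
for bit in range(n_bits):
    shift = n_bits - 1 - bit  # most significant bit first
    toy[f"agency_bin_{bit}"] = (codes // (2 ** shift)) % 2

print(toy)
```

The number of bit columns grows with the logarithm of the number of categories, which is why the encoded frame in this notebook ends up with only five `product_name_*` columns and eight `destination_*` columns instead of one column per distinct value.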

In [989]:
print(df['agency'].unique()) 
print(df['agency'].value_counts())
print(df['agency_type'].unique())
print(df['distribution'].unique())
print(df['product_name'].unique())
print(df['claim'].unique())
print(df['destination'].unique())
print(df['gender'].unique())

['CBH' 'CWT' 'JZI' 'KML' 'EPX' 'C2B' 'JWT' 'RAB' 'SSI' 'ART' 'CSR' 'CCR'
 'ADM' 'LWC' 'TTW' 'TST']
EPX    35119
CWT     8580
C2B     8267
JZI     6329
SSI     1056
JWT      749
RAB      725
LWC      689
TST      528
KML      392
ART      331
CCR      194
CBH      101
TTW       98
CSR       86
ADM       82
Name: agency, dtype: int64
['Travel Agency' 'Airlines']
['Offline' 'Online']
['Comprehensive Plan' 'Rental Vehicle Excess Insurance' 'Value Plan'
 'Basic Plan' 'Premier Plan' '2 way Comprehensive Plan' 'Bronze Plan'
 'Silver Plan' 'Annual Silver Plan' 'Cancellation Plan'
 '1 way Comprehensive Plan' 'Ticket Protector' '24 Protect' 'Gold Plan'
 'Annual Gold Plan' 'Single Trip Travel Protect Silver'
 'Individual Comprehensive Plan' 'Spouse or Parents Comprehensive Plan'
 'Annual Travel Protect Silver' 'Single Trip Travel Protect Platinum'
 'Annual Travel Protect Gold' 'Single Trip Travel Protect Gold'
 'Annual Travel Protect Platinum' 'Child Comprehensive Plan'
 'Travel Cruise Protect' '

### Encoding Categorical Features (Continued)
After applying encoding to our categorical features, we can now see that all of our columns have numerical values!

In [990]:
# installation instructions for category_encoders can be found here: https://github.com/scikit-learn-contrib/category_encoders
from category_encoders import BinaryEncoder

frequencies = df.groupby('agency').size()
df['agency'] = df['agency'].map(frequencies)

df['agency_type'] = (df['agency_type'] == 'Travel Agency').astype(int)
df['distribution'] = (df['distribution'] == 'Online').astype(int)
df['claim'] = (df['claim'] == 'Yes').astype(int)
df['male'] = (df['gender'] == 'M').astype(int)
df['female'] = (df['gender'] == 'F').astype(int)
df.drop(columns='gender', inplace=True)


encoder = BinaryEncoder(cols=['product_name', 'destination'])
data_encoded = encoder.fit_transform(df)
df = data_encoded
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   agency          63326 non-null  int64  
 1   agency_type     63326 non-null  int64  
 2   distribution    63326 non-null  int64  
 3   product_name_0  63326 non-null  int64  
 4   product_name_1  63326 non-null  int64  
 5   product_name_2  63326 non-null  int64  
 6   product_name_3  63326 non-null  int64  
 7   product_name_4  63326 non-null  int64  
 8   claim           63326 non-null  int64  
 9   duration        63326 non-null  int64  
 10  destination_0   63326 non-null  int64  
 11  destination_1   63326 non-null  int64  
 12  destination_2   63326 non-null  int64  
 13  destination_3   63326 non-null  int64  
 14  destination_4   63326 non-null  int64  
 15  destination_5   63326 non-null  int64  
 16  destination_6   63326 non-null  int64  
 17  destination_7   63326 non-null 

### Resolving Positive Skewness
As mentioned during our data understanding earlier, the columns `duration`, `net_sales`, `commision`, and `age` are highly positively skewed. We'll resolve that by applying Winsorization before moving on to our prediction model.
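For readers unfamiliar with Winsorization, the sketch below shows the idea with plain NumPy: values below a lower cut-off are raised to it, and values above an upper cut-off are lowered to it, so the tails are capped rather than dropped. The helper `winsorize_sketch` is a hypothetical name written for this illustration; the notebook itself uses `scipy.stats.mstats.winsorize`, where limits of `(0.1, 0.2)` cap the lowest 10% and the highest 20% of values.

```python
import numpy as np

def winsorize_sketch(values, lower=0.1, upper=0.2):
    """Cap the lowest `lower` and highest `upper` fraction of values."""
    values = np.asarray(values, dtype=float)
    ordered = np.sort(values)
    n = ordered.size
    low_cut = ordered[int(lower * n)]           # smallest value left untouched
    high_cut = ordered[n - int(upper * n) - 1]  # largest value left untouched
    return np.clip(values, low_cut, high_cut)

sample = np.arange(1, 11)  # 1, 2, ..., 10
capped = winsorize_sketch(sample, lower=0.1, upper=0.2)
print(capped)
```

On the ten-value sample, the single smallest value is raised to 2 and the two largest are lowered to 8; nothing is removed, so the row count is unchanged.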

In [991]:
from scipy.stats.mstats import winsorize

temp_df = df.copy()
temp_df['duration'] = winsorize(temp_df['duration'], (0.1, 0.2))
temp_df['net_sales'] = winsorize(temp_df['net_sales'], (0.1, 0.2))
temp_df['commision'] = winsorize(temp_df['commision'], (0.1, 0.26))
temp_df['age'] = winsorize(temp_df['age'], (0.1, 0.153))

print(temp_df['duration'].skew()) # skew value of 0.496
print(temp_df['net_sales'].skew()) # skew value of 0.444
print(temp_df['commision'].skew()) # skew value of 0.485
print(temp_df['age'].skew()) # skew value of 0.449

df['duration'] = winsorize(df['duration'], (0.1, 0.2))
df['net_sales'] = winsorize(df['net_sales'], (0.1, 0.2))
df['commision'] = winsorize(df['commision'], (0.1, 0.26))
df['age'] = winsorize(df['age'], (0.1, 0.153))

0.49614889854199307
0.44424846894896575
0.48540294373837195
0.44853294623002843


## Prediction Model
At last, we're done with the feature engineering portion. We can now move on to creating a prediction model for this dataset.

### Random Forest Classifier - Setting up function
We'll get started by setting up the Random Forest classifier model. To avoid repeating the fit-and-score logic for the training and testing splits, a single function handles all of it. An analysis of the results will be covered shortly.

In [992]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metric
from IPython.display import display

numerical_features = ['duration', 'net_sales', 'commision', 'age']
categorical_features = [
    'agency', 
    'agency_type', 
    'distribution', 
    'product_name_0', 
    'product_name_1', 
    'product_name_2', 
    'product_name_3', 
    'product_name_4', 
    'destination_0', 
    'destination_1', 
    'destination_2', 
    'destination_3', 
    'destination_4', 
    'destination_5', 
    'destination_6', 
    'destination_7', 
    'male', 
    'female'
] 

X = df[numerical_features + categorical_features]
Y = df['claim']

def rfPredictAndShowScores(test_size: float = 0.10, use_test_data: bool = False):
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=100)
    rf = RandomForestClassifier()
    y_predicted = []
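    if use_test_data:
        # NOTE: this branch fits on the test split and then scores on that same split,
        # so the "testing" scores it prints are in-sample, not a held-out estimate.
        rf.fit(x_test, y_test)
        y_predicted = rf.predict(x_test)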
        accuracy_rf = np.round(metric.accuracy_score(y_true=y_test, y_pred=y_predicted), decimals=3)
        precision_rf = np.round(metric.precision_score(y_true=y_test, y_pred=y_predicted), decimals=3)
        recall_rf = np.round(metric.recall_score(y_true=y_test, y_pred=y_predicted), decimals=3)
        display(pd.Series(data=rf.feature_importances_, index=x_test.columns).sort_values(ascending=False).round(3))
    else:
        rf.fit(x_train, y_train)
        y_predicted = rf.predict(x_train)
        accuracy_rf = np.round(metric.accuracy_score(y_true=y_train, y_pred=y_predicted), decimals=3)
        precision_rf = np.round(metric.precision_score(y_true=y_train, y_pred=y_predicted), decimals=3)
        recall_rf = np.round(metric.recall_score(y_true=y_train, y_pred=y_predicted), decimals=3)
        display(pd.Series(data=rf.feature_importances_, index=x_train.columns).sort_values(ascending=False).round(3))
    print('Accuracy:', accuracy_rf)
    print('Precision:', precision_rf)
    print('Recall', recall_rf)

In [993]:
def classificationMetrics(y_test, y_pred):
    accuracy = np.round(metric.accuracy_score(y_true=y_test, y_pred=y_pred), decimals=3)
    precision = np.round(metric.precision_score(y_true=y_test, y_pred=y_pred), decimals=3)
    recall = np.round(metric.recall_score(y_true=y_test, y_pred=y_pred), decimals=3)

    return { 'accuracy': accuracy, 'precision': precision, 'recall': recall }

### Results Evaluation
Our random forest classifier model has high `accuracy`, but that metric alone isn't enough to judge the model's performance, because this is a classification problem with heavily imbalanced classes.

Hence, we'll make sure to include both `precision` and `recall` metrics. 

**Precision** in this case will quantify the *share of positive predictions that are correct*, whereas **recall** will quantify the *number of correct positive predictions made out of all positive predictions that could have been made (taking into account true positives and false negatives)* (Brownlee, 2020).
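A small worked example may help here. The sketch below counts true positives, false positives, and false negatives on ten hypothetical labels (made up for illustration, not drawn from the dataset) and derives the three metrics from those counts.

```python
# Hypothetical labels: 1 = claim filed, 0 = claim not filed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # filed, predicted filed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not filed, predicted filed
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # filed, predicted not filed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not filed, predicted not filed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(accuracy, precision, recall)
```

Here the model is right on 6 of 10 rows, half of its positive predictions are correct, and it finds only one of the four actual positives, so accuracy, precision, and recall each tell a different story.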

Looking at the results, precision is high (0.967 on the training data), but recall is noticeably lower (0.529). The "testing" run reports higher recall (0.819) and perfect precision, but keep in mind that `rfPredictAndShowScores(use_test_data=True)` fits the model on the test split before scoring it, so those figures are not a held-out estimate. The most likely contributor to the low recall is imbalanced classification: most values of `claim` are `No` rather than `Yes`.

Another observation concerns feature importance, which is consistent across both runs: `duration` and `age` are the most relevant features for predicting `claim`, followed by `net_sales`.

In [994]:
print('Training data stats')
rfPredictAndShowScores()

print('\nTesting data stats')
rfPredictAndShowScores(use_test_data=True)

Training data stats


duration          0.403
age               0.244
net_sales         0.159
commision         0.050
agency            0.016
destination_5     0.016
destination_4     0.015
destination_7     0.013
destination_3     0.012
destination_6     0.011
female            0.008
product_name_4    0.008
product_name_2    0.007
male              0.007
product_name_3    0.007
product_name_1    0.007
destination_2     0.007
agency_type       0.005
distribution      0.004
product_name_0    0.002
destination_1     0.001
destination_0     0.000
dtype: float64

Accuracy: 0.993
Precision: 0.967
Recall 0.529

Testing data stats


duration          0.300
age               0.239
net_sales         0.159
commision         0.087
product_name_1    0.020
product_name_2    0.019
agency            0.018
female            0.016
male              0.016
product_name_4    0.015
destination_7     0.015
destination_6     0.015
destination_4     0.015
destination_5     0.013
product_name_3    0.012
destination_3     0.012
destination_2     0.011
agency_type       0.007
destination_1     0.005
product_name_0    0.004
distribution      0.001
destination_0     0.000
dtype: float64

Accuracy: 0.997
Precision: 1.0
Recall 0.819


### Confusion Matrix
To get an even more vivid picture of why accuracy alone shouldn't be relied upon, we've set up a confusion matrix. Here's the breakdown of the results based on our test data:
- `True Negatives`: 12,444 - the actual value is Claim Unfiled, and the predicted value is Claim Unfiled
- `False Positives`: 31 - the actual value is Claim Unfiled, and the predicted value is Claim Filed
- `False Negatives`: 187 - the actual value is Claim Filed, and the predicted value is Claim Unfiled
- `True Positives`: 4 - the actual value is Claim Filed, and the predicted value is Claim Filed

We can see from these results that imbalanced classification truly plays a role: `Claims Filed` (191) is tiny compared to `Claims Not Filed` (12,475), and the model catches only 4 of those 191 claims.
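One way to see how little accuracy says here is to compare the model against a trivial baseline that always predicts `Claim Unfiled`. The sketch below uses the four counts from the confusion matrix printed in this section; the baseline itself is an illustration added here and is not part of the original notebook.

```python
# Counts taken from the printed confusion matrix (test split).
tn, fp, fn, tp = 12444, 31, 187, 4
total = tn + fp + fn + tp

model_accuracy = (tn + tp) / total
# A baseline that always predicts "Claim Unfiled" is right on every unfiled row.
baseline_accuracy = (tn + fp) / total

model_precision = tp / (tp + fp)
model_recall = tp / (tp + fn)

print("model accuracy:   ", round(model_accuracy, 3))
print("baseline accuracy:", round(baseline_accuracy, 3))
print("model precision:  ", round(model_precision, 3))
print("model recall:     ", round(model_recall, 3))
```

The always-`Unfiled` baseline is slightly more accurate than the model, even though it never identifies a single filed claim, which is exactly why precision and recall are needed alongside accuracy.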

In [995]:
from sklearn.metrics import confusion_matrix

def displayConfusionMatrix(x_test, y_test, model):
    threshold = 0.5
    y_pred_prob = model.predict_proba(x_test)[:, 1]
    y_pred = (y_pred_prob > threshold).astype(int)
    matrix = confusion_matrix(y_test, y_pred)
    matrix_df = pd.DataFrame(matrix, index=["Obs Claim Unfiled", "Obs Claim Filed"], columns=["Pred Claim Unfiled", "Pred Claim Filed"])
    print(matrix_df)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=100)
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
displayConfusionMatrix(x_test, y_test, rf)

                   Pred Claim Unfiled  Pred Claim Filed
Obs Claim Unfiled               12444                31
Obs Claim Filed                   187                 4
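`displayConfusionMatrix` turns predicted probabilities into labels with a fixed `threshold` of 0.5. As a rough sketch of what that threshold controls, the example below applies two thresholds to hypothetical probabilities (the arrays are made up for illustration and do not come from the model above).

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of "Claim Filed".
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
p_filed = np.array([0.10, 0.20, 0.60, 0.05, 0.40, 0.70, 0.35, 0.30])

def recall_at(threshold):
    """Recall when rows with probability above `threshold` are labelled filed."""
    y_pred = (p_filed > threshold).astype(int)
    true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
    false_neg = int(((y_pred == 0) & (y_true == 1)).sum())
    return true_pos / (true_pos + false_neg)

recall_default = recall_at(0.5)
recall_lower = recall_at(0.3)
print(recall_default, recall_lower)
```

Lowering the threshold labels more rows as filed, which raises recall on this toy data; in general it also tends to admit more false positives, so the choice is a trade-off rather than a free improvement.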


### Undersampling Data
In order to address the shortcomings of our model, we'll proceed with applying undersampling.

After applying undersampling, `Claim Filed` stays at 927 while `Claim Not Filed` drops to 1,158, about 25% more than `Claim Filed`. With `sampling_strategy=0.80`, the minority class ends up at 80% of the size of the majority class. The reason for choosing 0.80 rather than a perfectly balanced 1.0 is to keep accuracy from ending up much lower than in our prior model setup.
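The mechanics of random undersampling can be sketched without `imbalanced-learn`: keep every minority row and draw a random subset of majority rows sized so that minority / majority is about 0.80. The helper `undersample_majority` and the fixed `seed` are assumptions made for this illustration; the notebook itself relies on `RandomUnderSampler`.

```python
import numpy as np

def undersample_majority(y, ratio=0.80, seed=0):
    """Return sorted row indices keeping all minority rows (label 1) and a
    random subset of majority rows (label 0) so minority/majority ~= ratio."""
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = int(len(minority_idx) / ratio)  # majority rows to retain
    rng = np.random.default_rng(seed)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([kept_majority, minority_idx]))

# Same class counts as the original target: 62,399 "No" and 927 "Yes".
y = np.array([0] * 62399 + [1] * 927)
kept = undersample_majority(y)
n_majority = int((y[kept] == 0).sum())
n_minority = int((y[kept] == 1).sum())
print(n_majority, n_minority)
```

With 927 minority rows and a ratio of 0.80, the sketch keeps 1,158 majority rows, matching the class counts reported after undersampling.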

In [996]:
# Install instructions here: https://imbalanced-learn.org/stable/install.html#getting-started
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# sampling_strategy=0.80 -> after resampling, minority count / majority count = 0.80
rus = RandomUnderSampler(sampling_strategy=0.80)
x_rus, y_rus = rus.fit_resample(X, Y)

print('Original Y:', Counter(Y))
print('Y after undersampling:', Counter(y_rus))

x_train, x_test, y_train, y_test = train_test_split(x_rus, y_rus, test_size=0.20, random_state=100)

Original Y: Counter({0: 62399, 1: 927})
Y after undersampling: Counter({0: 1158, 1: 927})


### Confusion Matrix (Post Undersampling)
Taking a look at both the classification metrics and the confusion matrix after undersampling, we can clearly see different results.

Accuracy (0.707), precision (0.671), and recall (0.641) are all lower than the testing figures reported earlier. That should be expected, and it's nothing to be too concerned about for now, given that our previous model was evaluated on heavily imbalanced classes.

Looking at the confusion matrix, here's what we have:
- `True Negatives`: 179 - the actual value is Claim Unfiled, and the predicted value is Claim Unfiled
- `False Positives`: 57 - the actual value is Claim Unfiled, and the predicted value is Claim Filed
- `False Negatives`: 65 - the actual value is Claim Filed, and the predicted value is Claim Unfiled
- `True Positives`: 116 - the actual value is Claim Filed, and the predicted value is Claim Filed

We can clearly see that undersampling reduced the number of false negatives (from 187 to 65) and increased the number of true positives (from 4 to 116)!

However, our work isn't done yet. Next, we'll try to improve the model with hyperparameter tuning.

In [997]:
print('Random Forest')
model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classificationMetrics(y_test, y_pred))
displayConfusionMatrix(x_test, y_test, model)

Random Forest
{'accuracy': 0.707, 'precision': 0.671, 'recall': 0.641}
                   Pred Claim Unfiled  Pred Claim Filed
Obs Claim Unfiled                 179                57
Obs Claim Filed                    65               116


### Hyperparameter Tuning
**NOTE**: Running this takes roughly 5-6 minutes.

In order to make the most of hyperparameter tuning with the `RandomForestClassifier` predictive model, we'll select a set of candidate hyperparameters, cross-validate with `RepeatedStratifiedKFold` (Brownlee, 2020), and use grid search to find the combinations with the best `recall` and the best `precision`.

In the output below, both searches prefer `class_weight: 'balanced'`, but they disagree on the rest: `recall` is best with `max_features: 'log2'` and `n_estimators: 1000`, whereas `precision` is best with `max_features: 'sqrt'` and `n_estimators: 100`.

Ideally, we'd strike a balance between `recall` and `precision`. We'll cover that next.
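To get a feel for why the search takes minutes, it helps to count the work involved. The sketch below enumerates the same grid used in this section and multiplies it by the number of folds produced by 10 splits repeated 3 times; it only counts model fits and does not train anything.

```python
from itertools import product

# Same candidate values as the grid in this section.
grid = {
    "n_estimators": [100, 500, 1000],
    "class_weight": ["balanced", None],
    "max_features": ["sqrt", "log2"],
}
n_splits, n_repeats = 10, 3  # RepeatedStratifiedKFold settings

# Every combination of one value per hyperparameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
folds_per_combination = n_splits * n_repeats
total_fits = len(combinations) * folds_per_combination

print(len(combinations), "combinations x", folds_per_combination, "folds =", total_fits, "fits")
```

Each `GridSearchCV` call therefore trains hundreds of forests, several of them with 1,000 trees, and the notebook runs one such search per scoring metric.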

In [998]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
n_estimators = [100, 500, 1000]
class_weight = ['balanced', None]
max_features = ['sqrt', 'log2']

grid = {
    'n_estimators': n_estimators,
    'class_weight': class_weight,
    'max_features': max_features
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='recall',error_score=0)
grid_result = grid_search.fit(x_train, y_train)
print("Best recall: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='precision',error_score=0)
grid_result = grid_search.fit(x_train, y_train)
print("Best precision: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best recall: 0.685471 using {'class_weight': 'balanced', 'max_features': 'log2', 'n_estimators': 1000}
Best precision: 0.699760 using {'class_weight': 'balanced', 'max_features': 'sqrt', 'n_estimators': 100}


### Hyperparameter Tuning (Continued)
We'll give hyperparameter tuning another shot, but this time with the `f1` score. F1 is a good classification metric to optimize here because it's the harmonic mean of recall and precision, so it's only high when both are. Since we'd prefer a balance of precision and recall, F1 is the way to go.
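As a brief illustration of how F1 combines the two metrics, the sketch below computes it for the precision and recall reported after undersampling (0.671 and 0.641) and for a deliberately lopsided, hypothetical pair. The helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported after undersampling in this notebook.
balanced = f1_from(0.671, 0.641)

# Hypothetical lopsided case: perfect precision, very poor recall.
lopsided = f1_from(1.0, 0.1)
lopsided_arithmetic_mean = (1.0 + 0.1) / 2

print(round(balanced, 3), round(lopsided, 3), lopsided_arithmetic_mean)
```

When precision and recall are close, F1 lands between them; when one of them collapses, F1 falls far below the plain average, which is the behaviour we want when optimizing for balance.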

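We settled on the following hyperparameters for the final model:
- `class_weight: None`
- `max_features: sqrt`
- `n_estimators: 1000`

Note that neither the random forest nor the undersampler is seeded, so the winning combination can change between runs; the search output captured below reports `class_weight: 'balanced'`, `max_features: 'log2'`, and `n_estimators: 1000` instead.

Now that we have hyperparameters to use for our prediction model, we'll wrap up with a final prediction and confusion matrix evaluation!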

In [999]:
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='f1', error_score=0)
grid_result = grid_search.fit(x_train, y_train)
# The search is scored on F1, so label the printed score accordingly.
print("Best F1: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best F1: 0.691261 using {'class_weight': 'balanced', 'max_features': 'log2', 'n_estimators': 1000}


### Prediction and Confusion Matrix Wrap Up
While the result is neither perfect nor much of an improvement over our previous confusion matrix, it was worth going through the effort of hyperparameter tuning.

From the confusion matrix, we can see our true negatives went from 179 to 176, while our true positives went from 116 to 117.

In [1001]:
model = RandomForestClassifier(n_estimators=1000, max_features='sqrt')
model.fit(x_train, y_train)

displayConfusionMatrix(x_test, y_test, model)

                   Pred Claim Unfiled  Pred Claim Filed
Obs Claim Unfiled                 176                60
Obs Claim Filed                    64               117
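To put the two post-undersampling confusion matrices side by side, the sketch below recomputes the metrics directly from their printed counts. The helper `summarize` is a hypothetical name introduced for this comparison.

```python
def summarize(tn, fp, fn, tp):
    """Derive rounded metrics from the four confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": round((tn + tp) / (tn + fp + fn + tp), 3),
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(2 * precision * recall / (precision + recall), 3),
    }

# Counts from the printed matrices: default forest vs. tuned forest.
before_tuning = summarize(tn=179, fp=57, fn=65, tp=116)
after_tuning = summarize(tn=176, fp=60, fn=64, tp=117)

print("before:", before_tuning)
print("after: ", after_tuning)
```

The first row reproduces the accuracy, precision, and recall reported for the default forest. After tuning, recall edges up while precision and accuracy edge down, so the two models are essentially on par for this particular split.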


# Conclusion
Bringing it all together, we've done the following to successfully create a classification model:
- Understood the data
- Applied feature engineering to all categorical columns
- Addressed skewness for continuous columns
- Used a Random Forest to create a classification model for our target `claim`
- Analyzed and noted observations of model performance via metrics such as accuracy, precision, and recall
- Used a confusion matrix to show why accuracy alone isn't enough
- Used hyperparameter tuning and cross-validation to get the best F1 score - a balance of both precision and recall

# References


Brownlee, J. (2019, June 20). *Classification Accuracy is Not Enough: More Performance Measures You*\
&emsp; *Can Use*. Machine Learning Mastery. Retrieved March 10, 2022, from\
&emsp; https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/ 

Brownlee, J. (2020, January 15). *Random Oversampling and Undersampling for Imbalanced Classification*.\
&emsp; Machine Learning Mastery. Retrieved March 18, 2022, from\
&emsp; https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/ 