# Assignment 10: Binary Classification Model
Binary Classification Model\
Ismail Abdo Elmaliki\
CS 502 - Predictive Analytics\
Capitol Technology University\
Professor Frank Neugebauer\
March 9, 2022

# Table of Contents
*Data Understanding*
- Info and Head
- Skew

*Feature Engineering*
- Rename columns
- Encoding Categorical Features
- Resolving Positive Skewness

*Prediction Model*
- Random Forest Classifier - Setting up function
- Results Evaluation

*Conclusion*

*References*



## Data Understanding

### Info and Head
A high-level look at the data shows seven columns of `object` type. Each will need feature engineering to convert it to numerical values:
- `Agency`
- `Agency Type`
- `Distribution Channel`
- `Product Name`
- `Claim`
- `Destination`
- `Gender`, which also has missing values (only 18,219 of 63,326 entries are non-null) that will need to be handled

In [235]:
import pandas as pd
import numpy as np

df = pd.read_csv('travel_insurance.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Agency                63326 non-null  object 
 1   Agency Type           63326 non-null  object 
 2   Distribution Channel  63326 non-null  object 
 3   Product Name          63326 non-null  object 
 4   Claim                 63326 non-null  object 
 5   Duration              63326 non-null  int64  
 6   Destination           63326 non-null  object 
 7   Net Sales             63326 non-null  float64
 8   Commision (in value)  63326 non-null  float64
 9   Gender                18219 non-null  object 
 10  Age                   63326 non-null  int64  
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB


| | Agency | Agency Type | Distribution Channel | Product Name | Claim | Duration | Destination | Net Sales | Commision (in value) | Gender | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 81 |
| 1 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 71 |
| 2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 65 | AUSTRALIA | -49.5 | 29.7 | NaN | 32 |
| 3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 60 | AUSTRALIA | -39.6 | 23.76 | NaN | 32 |
| 4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 79 | ITALY | -19.8 | 11.88 | NaN | 41 |
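
The dtype and missing-value observations above can also be derived programmatically rather than read off the `info()` listing. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame and its values are made up for illustration and only mirror a few columns of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring three columns of the dataset.
sample = pd.DataFrame({
    "Agency": ["CBH", "CWT", "CWT"],
    "Duration": [186, 65, 60],
    "Gender": ["F", np.nan, np.nan],
})

# Non-numeric columns are the ones that will need encoding.
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column and keep only columns that have any.
missing_counts = sample.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

print(non_numeric_columns)
print(columns_with_missing)
```

Run against the full `df`, the same two expressions should list the seven object columns and flag `Gender` as the only column with missing values.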


### Skew
The output below also shows that all numerical columns (`Duration`, `Net Sales`, `Commision (in value)`, and `Age`) are positively skewed, with `Duration` by far the most extreme. This is something to keep in mind as feature engineering is applied.

In [236]:
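# numeric_only=True restricts the calculation to numeric columns;
# newer pandas versions raise an error on object columns otherwise
df.skew(numeric_only=True)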


Duration                23.179617
Net Sales                3.272373
Commision (in value)     4.032269
Age                      2.987710
dtype: float64
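
As a rough illustration of what a positive skew value means, the sketch below compares a symmetric series with one that has a single large value in its right tail. Both series are invented for this example and are not taken from the dataset:

```python
import pandas as pd

# Symmetric around its mean: skew is zero.
symmetric = pd.Series([1, 2, 3, 4, 5])

# One large value stretches the right tail: skew becomes clearly positive.
right_tailed = pd.Series([1, 2, 3, 4, 50])

print(symmetric.skew())
print(right_tailed.skew())
```

A column such as `Duration`, with a skew above 23, behaves like the second series: a small number of very large values dominate the right tail.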

## Feature Engineering

### Rename columns
Let's start by renaming the columns to short, lowercase names with underscores in place of spaces.

In [237]:
df.rename(
    columns={
        'Agency': 'agency', 
        'Agency Type': 'agency_type', 
        'Distribution Channel': 'distribution', 
        'Product Name': 'product_name',
        'Claim': 'claim',
        'Duration': 'duration',
        'Destination': 'destination',
        'Net Sales': 'net_sales',
        'Commision (in value)': 'commision',
        'Gender': 'gender',
        'Age': 'age'}, 
    inplace=True)
df.head()

| | agency | agency_type | distribution | product_name | claim | duration | destination | net_sales | commision | gender | age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 81 |
| 1 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 71 |
| 2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 65 | AUSTRALIA | -49.5 | 29.7 | NaN | 32 |
| 3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 60 | AUSTRALIA | -39.6 | 23.76 | NaN | 32 |
| 4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 79 | ITALY | -19.8 | 11.88 | NaN | 41 |
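
The explicit mapping above gives full control over each name. As an alternative sketch, the headers could be normalized with string operations instead; note that this generic approach would produce slightly different names than the hand-picked ones (for example `commision_in_value` rather than `commision`). The empty `sample` frame is hypothetical and only carries three of the original headers:

```python
import pandas as pd

# Hypothetical frame with the same header style as the raw dataset.
sample = pd.DataFrame(columns=["Agency Type", "Net Sales", "Commision (in value)"])

# Lowercase, replace every run of non-alphanumeric characters with "_",
# then trim any leading or trailing underscore.
sample.columns = (
    sample.columns.str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

print(list(sample.columns))
```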


### Encoding Categorical Features
Based on the unique values of each categorical column, we'll apply the following encoding for each column:
- `agency`: frequency encoding, which works here because every agency has a distinct frequency count
- `agency_type`: 0 for airlines, 1 for travel agency
- `distribution`: 0 for offline, 1 for online
- `product_name`: binary encoding; preferable to one-hot encoding here, since one-hot would create one column per distinct value
- `claim`: 0 for no, 1 for yes
- `destination`: binary encoding, for the same reason as `product_name`
- `gender`: one-hot encoding into two columns, `male` and `female`; rows with a missing gender get 0 in both columns, which addresses the missing values at the same time
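
To make the binary-encoding choice concrete, here is a hand-rolled sketch of the idea: each category gets an integer code, and the bits of that code are spread across a few columns. This is an illustration only, not the `category_encoders` implementation; the helper `binary_encode_sketch` and the sample plan list are hypothetical.

```python
import math
import pandas as pd

def binary_encode_sketch(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Give each category a 1-based integer code and split its bits into columns."""
    categories = series.drop_duplicates().tolist()
    codes = {category: position + 1 for position, category in enumerate(categories)}
    # Number of bits needed to represent the largest code.
    width = max(1, math.ceil(math.log2(len(categories) + 1)))
    ordinal = series.map(codes)
    # Column 0 holds the most significant bit, the last column the least significant.
    bits = {
        f"{prefix}_{bit}": (ordinal // 2 ** (width - 1 - bit)) % 2
        for bit in range(width)
    }
    return pd.DataFrame(bits, index=series.index).astype(int)

plans = pd.Series(
    ["Basic Plan", "Value Plan", "Gold Plan", "Silver Plan", "Bronze Plan", "Basic Plan"]
)
encoded = binary_encode_sketch(plans, "product_name")
print(encoded)
```

Five distinct plans fit into three bit columns here, where one-hot encoding would need five. The gap widens as the number of categories grows, which is why binary encoding suits `product_name` and `destination`.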

In [238]:
print(df['agency'].unique()) 
print(df['agency'].value_counts())
print(df['agency_type'].unique())
print(df['distribution'].unique())
print(df['product_name'].unique())
print(df['claim'].unique())
print(df['destination'].unique())
print(df['gender'].unique())

['CBH' 'CWT' 'JZI' 'KML' 'EPX' 'C2B' 'JWT' 'RAB' 'SSI' 'ART' 'CSR' 'CCR'
 'ADM' 'LWC' 'TTW' 'TST']
EPX    35119
CWT     8580
C2B     8267
JZI     6329
SSI     1056
JWT      749
RAB      725
LWC      689
TST      528
KML      392
ART      331
CCR      194
CBH      101
TTW       98
CSR       86
ADM       82
Name: agency, dtype: int64
['Travel Agency' 'Airlines']
['Offline' 'Online']
['Comprehensive Plan' 'Rental Vehicle Excess Insurance' 'Value Plan'
 'Basic Plan' 'Premier Plan' '2 way Comprehensive Plan' 'Bronze Plan'
 'Silver Plan' 'Annual Silver Plan' 'Cancellation Plan'
 '1 way Comprehensive Plan' 'Ticket Protector' '24 Protect' 'Gold Plan'
 'Annual Gold Plan' 'Single Trip Travel Protect Silver'
 'Individual Comprehensive Plan' 'Spouse or Parents Comprehensive Plan'
 'Annual Travel Protect Silver' 'Single Trip Travel Protect Platinum'
 'Annual Travel Protect Gold' 'Single Trip Travel Protect Gold'
 'Annual Travel Protect Platinum' 'Child Comprehensive Plan'
 'Travel Cruise Protect' '

### Encoding Categorical Features (Continued)
After applying encoding to our categorical features, we can now see that all of our columns have numerical values!

In [239]:
# installation instructions for category_encoders can be found here: https://github.com/scikit-learn-contrib/category_encoders
from category_encoders import BinaryEncoder

frequencies = df.groupby('agency').size()
df['agency'] = df['agency'].map(frequencies)

df['agency_type'] = (df['agency_type'] == 'Travel Agency').astype(int)
df['distribution'] = (df['distribution'] == 'Online').astype(int)
df['claim'] = (df['claim'] == 'Yes').astype(int)
df['male'] = (df['gender'] == 'M').astype(int)
df['female'] = (df['gender'] == 'F').astype(int)
df.drop(columns='gender', inplace=True)


encoder = BinaryEncoder(cols=['product_name', 'destination'])
data_encoded = encoder.fit_transform(df)
df = data_encoded
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   agency          63326 non-null  int64  
 1   agency_type     63326 non-null  int64  
 2   distribution    63326 non-null  int64  
 3   product_name_0  63326 non-null  int64  
 4   product_name_1  63326 non-null  int64  
 5   product_name_2  63326 non-null  int64  
 6   product_name_3  63326 non-null  int64  
 7   product_name_4  63326 non-null  int64  
 8   claim           63326 non-null  int64  
 9   duration        63326 non-null  int64  
 10  destination_0   63326 non-null  int64  
 11  destination_1   63326 non-null  int64  
 12  destination_2   63326 non-null  int64  
 13  destination_3   63326 non-null  int64  
 14  destination_4   63326 non-null  int64  
 15  destination_5   63326 non-null  int64  
 16  destination_6   63326 non-null  int64  
 17  destination_7   63326 non-null 
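
Two details of the encoding cell are worth checking on a small example: frequency encoding only keeps agencies distinguishable when no two agencies share a count, and comparing a missing gender against `'M'` or `'F'` yields `False`, so those rows end up with 0 in both flag columns. The sketch below uses a hypothetical six-row frame to show both behaviours:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with repeated agencies and some missing genders.
sample = pd.DataFrame({
    "agency": ["EPX", "EPX", "EPX", "CWT", "CWT", "CBH"],
    "gender": ["M", np.nan, "F", np.nan, "M", "F"],
})

# Frequency encoding: replace each label with how often it occurs.
counts = sample["agency"].value_counts()
# If two labels shared a count, they would collapse into the same code.
counts_are_distinct = counts.is_unique
sample["agency"] = sample["agency"].map(counts)

# A missing value compared with a string is False, so it becomes (0, 0).
sample["male"] = (sample["gender"] == "M").astype(int)
sample["female"] = (sample["gender"] == "F").astype(int)
sample = sample.drop(columns="gender")

print(counts_are_distinct)
print(sample)
```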

### Resolving Positive Skewness
As noted during data understanding, the `duration`, `net_sales`, `commision`, and `age` columns are highly positively skewed. We'll reduce that skew by applying Winsorization, which caps extreme values at chosen percentiles, before moving on to our prediction model.

In [240]:
from scipy.stats.mstats import winsorize

temp_df = df.copy()
temp_df['duration'] = winsorize(temp_df['duration'], (0.1, 0.2))
temp_df['net_sales'] = winsorize(temp_df['net_sales'], (0.1, 0.2))
temp_df['commision'] = winsorize(temp_df['commision'], (0.1, 0.26))
temp_df['age'] = winsorize(temp_df['age'], (0.1, 0.153))

print(temp_df['duration'].skew()) # skew value of 0.496
print(temp_df['net_sales'].skew()) # skew value of 0.444
print(temp_df['commision'].skew()) # skew value of 0.485
print(temp_df['age'].skew()) # skew value of 0.449

df['duration'] = winsorize(df['duration'], (0.1, 0.2))
df['net_sales'] = winsorize(df['net_sales'], (0.1, 0.2))
df['commision'] = winsorize(df['commision'], (0.1, 0.26))
df['age'] = winsorize(df['age'], (0.1, 0.153))

0.49614889854199307
0.44424846894896575
0.48540294373837195
0.44853294623002843
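
To show what limits such as `(0.1, 0.2)` do, the sketch below caps the lowest 10% and highest 20% of a small array at the nearest retained values. It is a NumPy-only illustration of the idea behind `winsorize`, not SciPy's implementation; the helper `winsorize_sketch` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def winsorize_sketch(values, lower=0.1, upper=0.2):
    """Cap the lowest `lower` and highest `upper` fractions at the nearest kept values."""
    a = np.asarray(values, dtype=float)
    ordered = np.sort(a)
    n = len(ordered)
    low_idx = int(lower * n)        # how many values are capped from below
    high_idx = n - int(upper * n)   # index just past the last uncapped value
    return np.clip(a, ordered[low_idx], ordered[high_idx - 1])

sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
clipped = winsorize_sketch(sample)

print(clipped)
print(pd.Series(sample).skew(), pd.Series(clipped).skew())
```

The single outlier (100) is pulled down to 8, and the skew of the clipped array is much closer to zero than that of the original.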


## Prediction Model
At last, we're done with the feature engineering portion. We can now move on to creating a prediction model for this dataset.

### Random Forest Classifier - Setting up function
We'll get started by setting up the Random Forest classifier model. To avoid duplicating the fit-and-score logic for the training and testing data, a single function handles both cases. An analysis of the results will be covered shortly.

In [241]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metric
from IPython.display import display

numerical_features = ['duration', 'net_sales', 'commision', 'age']
categorical_features = [
    'agency', 
    'agency_type', 
    'distribution', 
    'product_name_0', 
    'product_name_1', 
    'product_name_2', 
    'product_name_3', 
    'product_name_4', 
    'destination_0', 
    'destination_1', 
    'destination_2', 
    'destination_3', 
    'destination_4', 
    'destination_5', 
    'destination_6', 
    'destination_7', 
    'male', 
    'female'
] 

X = df[numerical_features + categorical_features]
Y = df['claim']

def rfPredictAndShowScores(test_size: float = 0.10, use_test_data: bool = False):
    """Fit a random forest on one split and print its scores on that same split."""
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=100)
    # Note: the model is fit on, and scored against, the same split (the test split
    # when use_test_data=True, the training split otherwise), so both sets of
    # scores are in-sample rather than held-out estimates.
    x_eval, y_eval = (x_test, y_test) if use_test_data else (x_train, y_train)
    rf = RandomForestClassifier()
    rf.fit(x_eval, y_eval)
    y_predicted = rf.predict(x_eval)
    accuracy_rf = np.round(metric.accuracy_score(y_true=y_eval, y_pred=y_predicted), decimals=3)
    precision_rf = np.round(metric.precision_score(y_true=y_eval, y_pred=y_predicted), decimals=3)
    recall_rf = np.round(metric.recall_score(y_true=y_eval, y_pred=y_predicted), decimals=3)
    # Show each feature's importance, largest first.
    display(pd.Series(data=rf.feature_importances_, index=x_eval.columns).sort_values(ascending=False).round(3))
    print('Accuracy:', accuracy_rf)
    print('Precision:', precision_rf)
    print('Recall', recall_rf)
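
The function above fits the model on whichever split it then scores. A common alternative, sketched below, is to fit once on the training split and score both splits with that single model, so the test scores reflect unseen data. This sketch runs on synthetic, imbalanced data from `make_classification` as a stand-in for the insurance dataset; the `score` helper, the sample sizes, and the use of `stratify` and a fixed `random_state` are assumptions made for illustration, and the numbers it prints are not results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the insurance data (about 5% positives).
features, target = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=100
)
# stratify keeps the share of positives similar in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.10, stratify=target, random_state=100
)

# Fit once, on the training split only.
model = RandomForestClassifier(n_estimators=50, random_state=100)
model.fit(x_train, y_train)

def score(x, y):
    """Score the already-fitted model on the given split."""
    predicted = model.predict(x)
    return {
        "accuracy": round(accuracy_score(y, predicted), 3),
        "precision": round(precision_score(y, predicted, zero_division=0), 3),
        "recall": round(recall_score(y, predicted, zero_division=0), 3),
    }

train_scores = score(x_train, y_train)
test_scores = score(x_test, y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

With this setup, a large gap between training and test recall would point to overfitting, which the in-sample scores cannot reveal.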

### Results Evaluation
Our random forest classifier model has high `accuracy`, but that metric alone isn't enough to judge the model's performance on a classification problem, especially when one class is much rarer than the other.

Hence, we'll make sure to include both `precision` and `recall` metrics. 

**Precision** in this case will quantify the *proportion of positive predictions that are actually correct*, whereas **recall** will quantify the *proportion of actual positives that were correctly predicted (true positives out of true positives plus false negatives)* (Brownlee, 2020).
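
As a worked example of these definitions, the sketch below computes all three metrics by hand from true/false positive and negative counts. The labels are hypothetical (3 claims out of 20 policies, of which the model finds one) and are chosen to show how accuracy can stay high while recall is low:

```python
import numpy as np

# Hypothetical labels: 1 = claim, 0 = no claim. Only 3 of 20 are positive.
y_true = np.array([1, 1, 1] + [0] * 17)
y_pred = np.array([1, 0, 0] + [0] * 17)  # the model finds one of the three claims

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)
```

Here accuracy is 0.9 and precision is 1.0, yet recall is only one third, because two of the three actual claims were missed.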

Looking at the results, precision is high in both runs (0.978 and 1.0), but recall is noticeably lower (0.522 for the training split and 0.819 for the testing split). Keep in mind that `rfPredictAndShowScores` fits the model on whichever split it then scores, so both sets of numbers are in-sample, and the testing figures do not measure performance on unseen data. Most likely, the low recall is driven by class imbalance in our data, since most values of `claim` are `No` instead of `Yes`.

Another observation concerns feature importance. In both runs, the features most relevant for predicting `claim` are `duration` and `age`, followed by `net_sales`.

In [242]:
print('Training data stats')
rfPredictAndShowScores()

print('\nTesting data stats')
rfPredictAndShowScores(use_test_data=True)

Training data stats


duration          0.398
age               0.242
net_sales         0.162
commision         0.052
agency            0.017
destination_5     0.016
destination_4     0.015
destination_7     0.014
destination_3     0.012
destination_6     0.011
female            0.008
product_name_4    0.008
product_name_2    0.008
male              0.008
product_name_3    0.007
destination_2     0.007
product_name_1    0.006
agency_type       0.005
distribution      0.003
product_name_0    0.001
destination_1     0.001
destination_0     0.000
dtype: float64

Accuracy: 0.993
Precision: 0.978
Recall 0.522

Testing data stats


duration          0.306
age               0.228
net_sales         0.168
commision         0.084
product_name_2    0.020
agency            0.019
product_name_1    0.019
product_name_4    0.018
female            0.016
destination_7     0.016
destination_6     0.016
male              0.015
destination_5     0.015
destination_4     0.014
product_name_3    0.012
destination_3     0.011
destination_2     0.008
agency_type       0.006
destination_1     0.005
product_name_0    0.003
distribution      0.001
destination_0     0.000
dtype: float64

Accuracy: 0.997
Precision: 1.0
Recall 0.819


# Conclusion
Bringing it all together, we did the following to create a classification model:
- Understood the data
- Applied feature engineering to all categorical columns
- Addressed skewness in the continuous columns
- Used a random forest classifier to create a classification model for our target, `claim`
- Analyzed and noted observations about model performance via metrics such as accuracy, precision, and recall

# References


Brownlee, J. (2019, June 20). Classification Accuracy is Not Enough: More Performance Measures You\
&emsp; Can Use. Machine Learning Mastery. Retrieved March 10, 2022, from\
&emsp; https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/ 