# Machine Learning Project

##### The goal of this project is to create a model that can predict whether a customer will make a Travel Insurance claim or not.

#### Problem Statement
##### Insurance companies take risks over customers. Risk management is a very important aspect of the insurance industry. Insurers consider every quantifiable factor to develop profiles of high and low insurance risks. Insurers collect vast amounts of information about policyholders and analyse the data. As a Data scientist in an insurance company, you need to analyse the available data and predict whether to approve the insurance or not.


##### The Steps performed will be mentioned as we go through the project.

In [1]:
# Basic Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Cross-Validation
from sklearn.model_selection import train_test_split

# LabelEncoding
from sklearn.preprocessing import LabelEncoder

# Evaluation
from sklearn.metrics import classification_report

# Scaling
from sklearn.preprocessing import MinMaxScaler

# Ridge, Lasso
from sklearn.linear_model import Ridge, Lasso

# Logistic Regression
from sklearn.linear_model import LogisticRegression

# Decision Tree
from sklearn.tree import DecisionTreeClassifier

# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Boosting, RandomForest
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier

# Ensemble
from sklearn.ensemble import VotingClassifier, BaggingClassifier

# Feature Selection
from sklearn.feature_selection import chi2, SelectKBest

# SVM
from sklearn.svm import LinearSVC, SVC

# Skewness
from scipy.stats import skew

# Over and Under Sampling
from imblearn.over_sampling import RandomOverSampler 
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Pickle
import pickle

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Reading the data and viewing a small part of it to get some understanding of the data.

df = pd.read_csv("E:/data.csv")
print(df.shape)
df.head(8)

FileNotFoundError: [Errno 2] File data.csv does not exist: 'data.csv'

In [None]:
# We will get a list of the number of unique values for each column

df.nunique()

In [None]:
# We will check for null values and the Dtype of each feature.

df.info()

In [None]:
((df.isnull().sum())*100)/len(df)

<b>71% of the values in the Gender column are null.
    <br>We will drop the column as there does not seem to be any other feature that could help us with filling in the missing data.

In [None]:
df.drop("Gender", axis=1, inplace=True)

In [None]:
# Having a look at all the unique values of each feature.

for cols in df:
    print("\n{:20} - {}" .format(cols.title(), df[cols].unique()))

In [None]:
# Checking for correlation

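# Correlation is only defined for numeric columns, so select those first.
df.select_dtypes(include="number").corr()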

<b>We will also drop the ID column.
    <br>Each value is unique and does not seem to affect the data.

In [None]:
df.drop("ID", axis=1, inplace=True)

In [None]:
# Having a look at how many claims and non-claims are present in the dataset.
print(df["Claim"].value_counts(), "\n")
(df["Claim"].value_counts()*100)/len(df)

<b>We can see that there is a huge imbalance between the claims and non-claims.<br>
    We will build a baseline model before we perform Over Sampling and Under Sampling.

In [None]:
# Finding out how many customers have their age input as over 100yrs old

len(df[df["Age"] > 100])

<b>The information below, found online, states that a Travel Insurance customer is regarded as a Senior citizen from 71 years of age and above. And while some companies offer Travel Insurance only up to a certain age, others do not have any restriction.</b><br>
![image.png](attachment:68e9f85a-48ff-4f6c-86c2-9dfc4a07f3ad.png)

![image.png](attachment:04fcbe6b-46fa-49f0-bddd-0db8b1e35534.png)

<b>Values above 100 years would most likely be outliers. However, we need to decide how to treat them (or change them) without affecting the rest of the data. One possible method is to take the mean age of all customers above the age of 70, and replace the values over 100yrs with this mean value.</b><br>
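
The replacement described above can be sketched on a toy column. The ages below are made up for illustration and `toy` is a hypothetical name. Note that, under this approach, the mean is taken over every age above 70, so the implausible values themselves contribute to it.

```python
import pandas as pd

# Toy ages (illustrative only); 118 stands in for an implausible entry.
toy = pd.DataFrame({"Age": [34.0, 72.0, 80.0, 118.0]})

# Mean age of every customer above 70, which includes the implausible value.
mean_senior = toy.loc[toy["Age"] > 70, "Age"].mean()

# Replace only the entries above 100; .loc avoids chained assignment.
toy.loc[toy["Age"] > 100, "Age"] = mean_senior

print(toy["Age"].tolist())
```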

In [None]:
# Calculate the mean age of all Senior customers (those over 70).

mean_senior = df["Age"][df["Age"] > 70].mean()

<b>We will now separate the categorical and numerical data.

In [None]:
df.nunique()

<b>Apart from the target, "Claim", there are two more features that are binary (two unique values each) - "Agency Type" and "Distribution Channel".
    <br>We could look to perform One Hot Encoding on them.

<b>We will separate the Categorical and Numerical features, and explore them further.

In [None]:
cat = ["Agency", "Agency Type", "Distribution Channel", "Product Name", "Destination"]
num = ["Duration", "Net Sales", "Commision (in value)", "Age"]

In [None]:
for cols in cat:
    if (cols == "Product Name") or (cols == "Destination"):
        plt.figure(figsize=(20,30))
        sns.countplot(data=df, hue=df["Claim"], y=cols)
    else:
        plt.figure(figsize=(12,12))
        sns.countplot(data=df, hue=df["Claim"], x=cols)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
for cols in num:
    plt.figure(figsize=(8,8))
    sns.boxplot(data=df, x="Claim", y=cols)
    plt.show()

<b>We need to manage only some of the outliers, not all of them, as handling every outlier could lead to a lot of data loss. Apart from Age, another feature to handle is Duration. From the information below, we could replace all values in Duration that are greater than 360 with 360.</b><br>

![image.png](attachment:1cc86e1a-13a3-451c-9858-f52e3ec01858.png)
http://www.insurancepandit.com/travel/individual_travel_health_insurance.php
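
As a minimal sketch of the capping idea, `Series.clip` is one way to limit Duration to 360 while leaving smaller values untouched. The durations below are illustrative and `toy` is a hypothetical name.

```python
import pandas as pd

# Toy durations in days (illustrative only).
toy = pd.DataFrame({"Duration": [5, 120, 365, 4000]})

# Cap every value above 360 at 360; values at or below 360 are unchanged.
toy["Duration"] = toy["Duration"].clip(upper=360)

print(toy["Duration"].tolist())
```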

In [None]:
df.describe()

In [None]:
for cols in num:
    skew_cols = skew(df[cols])
    print("{:<25} : {}" .format(cols, skew_cols))
    plt.figure(figsize=(8,8))
    sns.histplot(df[cols], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.show()

In [None]:
for cols in num:
    print("\n", cols)
    print(df[cols].value_counts().sort_index())

<b>There is some skewness within the data. This will be handled later on.

In [None]:
# Drop the rows where the duration is negative.

duration = df[df["Duration"] < 0].index
df.drop(duration, inplace=True)

In [None]:
df[(df["Net Sales"] < 0) & (df["Claim"] == 0)]

In [None]:
df[(df["Net Sales"] < 0) & (df["Claim"] == 1)]

<hr>

### LabelEncoding, One Hot Encoding, Frequency Encoding

##### Label Encoding -> Each unique categorical value for a feature is replaced with a discrete number.
##### One Hot Encoding -> A separate column is created for each unique categorical value from a feature.
##### Frequency Encoding -> Finding out the frequency of each categorical unique value from a feature.
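
The three encodings can be compared side by side on a toy column. This is a sketch using pandas only. The category values are illustrative and the variable names are hypothetical.

```python
import pandas as pd

# Toy column; the category values are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {"Agency Type": ["Airlines", "Travel Agency", "Travel Agency", "Airlines", "Travel Agency"]}
)

# Label encoding: each category becomes an integer code (categories sorted alphabetically).
label_encoded = toy["Agency Type"].astype("category").cat.codes

# One hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["Agency Type"], prefix="Agency Type")

# Frequency encoding: each category is replaced by its share of the rows.
freq = toy["Agency Type"].value_counts(normalize=True)
freq_encoded = toy["Agency Type"].map(freq)

print(label_encoded.tolist())
print(list(one_hot.columns))
print(freq_encoded.tolist())
```

Label encoding keeps a single column but implies an ordering between categories, while one hot encoding avoids that ordering at the cost of extra columns, which matters for high-cardinality features such as "Destination".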

<b>To get a better model, we will run this file a few times, using either one type of encoding or a combination of them. The best models will be selected accordingly.<br>
Once selected, if a particular encoding type lowered the scores, we will change the type of that code block to 'Raw'.

In [None]:
# Label Encoding

for cols in cat:
    le = LabelEncoder()
    df[cols] = le.fit_transform(df[cols])

df.head(8)

<hr>

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

### Train, Test, Split
<b>Function to train, test, and split

In [None]:
def tts(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    return X_train, X_test, y_train, y_test

<hr>

### Fit and Predict
<b>Function to fit and predict the model, and to display the report

In [None]:
def model_sel(model, X_train, X_test, y_train, y_test):
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    return classification_report(y_test, y_pred)

<hr>

### All Models
<b>Function where all the models will be defined and then passed to 'model_sel' for the model to be created.

In [None]:
def models(X_train, y_train, X_test=None, y_test=None, sampled="No"):
    # Without resampling, X_train and y_train hold the full dataset and are split here.
    if sampled == "No":
        X_train, X_test, y_train, y_test = tts(X_train, y_train)
    
    lr = LogisticRegression()
    dtc = DecisionTreeClassifier()
    abc = AdaBoostClassifier(n_estimators=100)
    gbc = GradientBoostingClassifier(n_estimators=100)
    xbc = XGBClassifier(n_estimators=200, reg_alpha=1)
    rfc = RandomForestClassifier()
    lsvc = LinearSVC(random_state=1)
    svc = SVC(random_state=1)
    print("{} \n {}\n" .format("LOGISTIC REGRESSION", model_sel(lr, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("DECISION TREE", model_sel(dtc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("ADABOOST", model_sel(abc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("GRADIENT BOOST", model_sel(gbc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("XGBOOST", model_sel(xbc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("RANDOM FOREST", model_sel(rfc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("LINEAR SVM", model_sel(lsvc, X_train, X_test, y_train, y_test)))
    print("{} \n {}\n" .format("SVM", model_sel(svc, X_train, X_test, y_train, y_test)))
    
    return lr, abc, gbc, xbc, rfc, lsvc, svc

<hr>

### Manual Under Sampling
<b>We will match the number of non-claims to claims. Below are the steps<br>
    1. Get the count of minority (claim) rows and the indexes of both the minority and majority (non-claim) rows.<br>
    2. Create a new variable that randomly selects as many majority rows as there are minority rows.<br>
    3. Concatenate the two sets of indexes into a numpy array.<br>
    4. Create a new DataFrame taking the indexes from the concatenated array.<br>
    5. Use this DataFrame to run the models.<br>
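
The steps above can be sketched on a toy DataFrame before being wrapped in a function. The values, the fixed seed, and the names `toy`, `rng` and `balanced` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 claims (1) and 6 non-claims (0); values are illustrative.
toy = pd.DataFrame({
    "Net Sales": [10, 20, 30, 40, 50, 60, 70, 80],
    "Claim":     [1, 0, 0, 1, 0, 0, 0, 0],
})

rng = np.random.default_rng(1)  # fixed seed (an assumption) so the draw is repeatable

# Step 1: collect the indexes of the minority and majority rows.
minority_idx = toy[toy["Claim"] == 1].index
majority_idx = toy[toy["Claim"] == 0].index

# Step 2: draw as many majority rows as there are minority rows, without replacement.
random_major = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Steps 3 and 4: concatenate the indexes and build the balanced DataFrame.
sample_idx = np.concatenate([minority_idx, random_major])
balanced = toy.loc[sample_idx]

print(balanced["Claim"].value_counts().to_dict())
```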

In [None]:
def sampling(df):
    min_claim = len(df[df["Claim"] == 1])
    min_claim_ind = df[df["Claim"] == 1].index
    
    maj_claim_ind = df[df["Claim"] == 0].index
    
    random_major = np.random.choice(maj_claim_ind, min_claim, replace=False)
    
    sample_ind = np.concatenate([min_claim_ind, random_major])
    
    under_sample = df.loc[sample_ind]
    
    # print(sns.countplot(data=under_sample, x="Claim"))
    
    X = under_sample.loc[:, df.columns!="Claim"]
    y = under_sample.loc[:, df.columns=="Claim"]
    
    lr, abc, gbc, xbc, rfc, lsvc, svc = models(X, y)
    return lr, abc, gbc, xbc, rfc, lsvc, svc, X, y

<hr>

### Over Sampling

<b>The number of minority values will be made to equal the number of majority values.

In [None]:
def over_sample():
    X = df.drop("Claim", axis=1)
    y = df["Claim"]
    print(Counter(y))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    oversample = RandomOverSampler(sampling_strategy='minority')
    X_over, y_over = oversample.fit_resample(X_train, y_train)
    print(Counter(y_over))
    
    lr, abc, gbc, xbc, rfc, lsvc, svc = models(X_over, y_over, X_test, y_test, "Yes")
    return lr, abc, gbc, xbc, rfc, lsvc, svc, X_over, y_over

<hr>

### Under Sampling

<b>The number of majority values will be reduced down to equal the number of minority values.

In [None]:
def under_sample():
    X = df.drop("Claim", axis=1)
    y = df["Claim"]
    print(Counter(y))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    undersample = RandomUnderSampler(sampling_strategy='majority')
    X_under, y_under = undersample.fit_resample(X_train, y_train)
    print(Counter(y_under))
    
    lr, abc, gbc, xbc, rfc, lsvc, svc = models(X_under, y_under, X_test, y_test, "Yes")
    return lr, abc, gbc, xbc, rfc, lsvc, svc, X_under, y_under

<hr>

### GridSearchCV
<b>By passing the model along with parameters that it can carry, this function will iterate using the model parameters, and deliver the best model.

In [None]:
def gridsearch(model, parameter, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    gscv = GridSearchCV(estimator=model, param_grid=parameter)
    gscv.fit(X_train, y_train)
    y_pred = gscv.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(gscv.best_estimator_)
    return gscv
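
As a hedged illustration of what such a search does, the sketch below runs `GridSearchCV` on synthetic data. The dataset, the choice of `DecisionTreeClassifier`, and the values in `param_grid` are assumptions for the example, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the insurance features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# Hypothetical grid: every combination is fitted with 5-fold cross-validation (the default).
param_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

gscv = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1), param_grid=param_grid)
gscv.fit(X_train, y_train)

best_params = gscv.best_params_          # the winning combination
test_score = gscv.score(X_test, y_test)  # accuracy of the refitted best model on held-out data
print(best_params, test_score)
```

The grid here has 3 x 2 = 6 combinations, each fitted 5 times, so the cost grows quickly as parameters are added.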

<hr><hr>

### First Baseline Models
<b>We will build four models - No Sampling, Manual Under Sampling, Over Sampled, Under Sampled.<br>

In [None]:
# Without Sampling

lr, abc, gbc, xbc, rfc, lsvc, svc = models(X, y)

In [None]:
# With manual Under Sampling

lr_sample, abc_sample, gbc_sample, xbc_sample, rfc_sample, lsvc_sample, svc_sample, X, y = sampling(df)

In [None]:
# Over Sampled

lr_over, abc_over, gbc_over, xbc_over, rfc_over, lsvc_over, svc_over, X, y = over_sample()

In [None]:
# Under Sampled

lr_under, abc_under, gbc_under, xbc_under, rfc_under, lsvc_under, svc_under, X, y = under_sample()

#### Result
<b>The scores are all zero for the base model without Sampling.</b><br>
<b>For all the sampling models, the scores increased drastically. Over Sampled models produced the best results.</b><br>
<b>Going forward, we will not run the models where no sampling is done.</b><br>

<hr>

### Outliers

<b>As mentioned earlier, ages over 100yrs will be replaced by the mean age of Senior customers, and Durations of more than 360 will be replaced by 360.

In [None]:
# Replace implausible ages (over 100) with the mean age of Senior customers.
df.loc[df["Age"] > 100, "Age"] = mean_senior

In [None]:
df.loc[df["Duration"] > 360, "Duration"] = 360

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

In [None]:
# lr_out, abc_out, gbc_out, xbc_out, rfc_out, lsvc_out, svc_out = models(X, y)

In [None]:
# Manual Under Sampling and Outliers

lr_out_sample, abc_out_sample, gbc_out_sample, xbc_out_sample, rfc_out_sample, lsvc_out_sample, svc_out_sample, X, y = sampling(df)

In [None]:
# Over Sampling and Outliers

lr_out_over, abc_out_over, gbc_out_over, xbc_out_over, rfc_out_over, lsvc_out_over, svc_out_over, X, y = over_sample()

In [None]:
# Under Sampling and Outliers

lr_out_under, abc_out_under, gbc_out_under, xbc_out_under, rfc_out_under, lsvc_out_under, svc_out_under, X, y = under_sample()

<hr>

### Skewness

In [None]:
print("{:<15} : {}" .format("Duration", skew(df["Duration"])))
print("{:<15} : {}" .format("Commision (in value)", skew(df["Commision (in value)"])))
print("{:<15} : {}" .format("Age", skew(df["Age"]))) 

In [None]:
df["Duration"] = np.sqrt(df["Duration"])
df["Commision (in value)"] = np.sqrt(df["Commision (in value)"])
df["Age"] = np.sqrt(df["Age"])

In [None]:
print("{:<15} : {}" .format("Duration", skew(df["Duration"])))
print("{:<15} : {}" .format("Commision (in value)", skew(df["Commision (in value)"])))
print("{:<15} : {}" .format("Age", skew(df["Age"]))) 

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

In [None]:
# lr_skew, abc_skew, gbc_skew, xbc_skew, rfc_skew, lsvc_skew, svc_skew = models(X, y)

In [None]:
# Manual Under Sampling and Skewing

lr_skew_sample, abc_skew_sample, gbc_skew_sample, xbc_skew_sample, rfc_skew_sample, lsvc_skew_sample, svc_skew_sample, X, y = sampling(df)

In [None]:
# Over Sampling and Skewing

lr_skew_over, abc_skew_over, gbc_skew_over, xbc_skew_over, rfc_skew_over, lsvc_skew_over, svc_skew_over, X, y = over_sample()

In [None]:
# Under Sampling and Skewing

lr_skew_under, abc_skew_under, gbc_skew_under, xbc_skew_under, rfc_skew_under, lsvc_skew_under, svc_skew_under, X, y = under_sample()

<hr>

### Performing Chi-squared test

In [None]:
len(df.columns)

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

In [None]:
X_cols = []

for col in X:
    X_cols.append(col)

In [None]:
def model_chi(model, X):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    chi_test = SelectKBest(score_func=chi2, k=6)

    X_train_chi = chi_test.fit_transform(X_train, y_train)
    X_test_chi = chi_test.transform(X_test)
    
    model.fit(X_train_chi, y_train)
    y_pred = model.predict(X_test_chi)
    
    print(classification_report(y_test, y_pred))
    
    num = 0
    
    for each in chi_test.scores_:
        print("{:2} {:20} - {}" .format(num, X_cols[num], each))
        num += 1
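
One caveat worth sketching: scikit-learn's `chi2` expects non-negative feature values, so if "Net Sales" still contains negative entries when the selector is fitted, the input may be rejected. A possible workaround, shown here on toy data, is to rescale the features to [0, 1] first. The values and names below are assumptions for illustration, and `get_support()` is used to map the kept columns back to their names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy features (illustrative); "Net Sales" contains a negative value on purpose.
X_toy = pd.DataFrame({
    "Duration":  [10, 200, 15, 300, 20, 250],
    "Net Sales": [-5.0, 40.0, 12.0, 55.0, 8.0, 60.0],
    "Age":       [30, 31, 29, 30, 31, 30],
})
y_toy = np.array([0, 1, 0, 1, 0, 1])

# Rescale every column to [0, 1] so that no value is negative.
X_scaled = MinMaxScaler().fit_transform(X_toy)

# Keep the two features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_scaled, y_toy)

# get_support() returns a boolean mask over the original columns.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```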

In [None]:
def model_new(X):
    lr = LogisticRegression()
    dtc = DecisionTreeClassifier(criterion="entropy")
    abc = AdaBoostClassifier(n_estimators=100)
    gbc = GradientBoostingClassifier(n_estimators=100)
    xbc = XGBClassifier(n_estimators=200, reg_alpha=1)
    rfc = RandomForestClassifier()
    print("{} \n {}\n" .format("LOGISTIC REGRESSION", model_chi(lr,X)))
    print("{} \n {}\n" .format("DECISION TREE", model_chi(dtc,X)))
    print("{} \n {}\n" .format("ADABOOST", model_chi(abc,X)))
    print("{} \n {}\n" .format("GRADIENT BOOST", model_chi(gbc,X)))
    print("{} \n {}\n" .format("XGBOOST", model_chi(xbc,X)))
    print("{} \n {}\n" .format("RANDOM FOREST", model_chi(rfc,X)))
    
    return lr, abc, gbc, xbc, rfc

lr_chi, abc_chi, gbc_chi, xbc_chi, rfc_chi = model_new(X)

<b>RandomForest still has the best score.

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

In [None]:
X_cols = []

for col in X:
    X_cols.append(col)

In [None]:
# lr_chi, abc_chi, gbc_chi, xbc_chi, rfc_chi = model_new(X)

<hr>

### Scaling

In [None]:
df_old = df.copy(deep=True)

In [None]:
mm = MinMaxScaler()

X = df.drop("Claim", axis=1)
cols = X.columns.to_list()
df[cols] = mm.fit_transform(df[cols])

In [None]:
df.head()

In [None]:
X = df.drop("Claim", axis=1)
y = df["Claim"]

In [None]:
# lr_scale, abc_scale, gbc_scale, xbc_scale, rfc_scale, lsvc_scale, svc_scale = models(X, y)

In [None]:
# Manual Under Sampling and Scaling

lr_scale_sample, abc_scale_sample, gbc_scale_sample, xbc_scale_sample, rfc_scale_sample, lsvc_scale_sample, svc_scale_sample, X, y = sampling(df)

In [None]:
# Over Sampling and Scaling

lr_scale_over, abc_scale_over, gbc_scale_over, xbc_scale_over, rfc_scale_over, lsvc_scale_over, svc_scale_over, X, y = over_sample()

In [None]:
# Under Sampling and Scaling

lr_scale_under, abc_scale_under, gbc_scale_under, xbc_scale_under, rfc_scale_under, lsvc_scale_under, svc_scale_under, X, y = under_sample()

<hr>

### Saving best model in a file through Pickle

In [None]:
# The with-block closes the file even if pickling fails.
with open("TravelInsurance.ser", "wb") as file:
    pickle.dump(rfc_under, file)
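
For completeness, a minimal round trip shows how a pickled model might be read back and used. The tiny decision tree and the file name `toy_model.ser` are stand-ins chosen for this sketch, not the project's saved model.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model, fitted on four illustrative points.
model = DecisionTreeClassifier(random_state=1).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model.ser")

# Write the fitted model to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back; the restored object predicts like the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[0], [3]]))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.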

<hr>

### Below blocks of code are not in use anymore

<hr>

<hr><hr>

# Conclusion

<b>The overall project went through changes from start till the end.</b><br>

#### Version 1.0
<b>Most time was spent on this version. Here is all that was done - <br>
    1) Read and analyzed dataset.<br>
    2) Removed 'Gender' as it had 71% null values.<br>
    3) Performed Label Encoding.<br>
    4) Created definitions for fitting and predicting models.<br>
    5) Skewness, Outliers, Scaling, Chi-Squared Test, Boosting.</b><br><br>
<b><u>Result - </u>The scores achieved for each and every model in this version were zero (as you can see below). A different approach was required.</b><br><br>

![image.png](attachment:3a522b43-d816-4c61-a1c7-55692cbd9263.png)
![image.png](attachment:e58ec734-aee1-4f40-8ad7-1b04b6f34369.png)

    
#### Version 2.0
<b>From this version onwards, Sampling techniques were added. This helped increase the scores greatly. The definition added was 'sampling(df)', which applies undersampling manually. Some of the best scores achieved are shown below. Also, updates were made to Boosting. Along with other models, they were added to Bagging Classifier with parameters, and then passed to GridSearchCV.<br>
It is important to note that Sampling should only be done on the Training data, and not on the entire dataset.</b><br>
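
A sketch of that point: split first, then resample only the training part, so the test set keeps the original class balance. The toy data, the use of `stratify`, and `sklearn.utils.resample` in place of imblearn's samplers are assumptions made to keep the example small.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 4 claims among 20 rows (illustrative values).
toy = pd.DataFrame({"Duration": range(20), "Claim": [1] * 4 + [0] * 16})

# Split first; stratify (an assumption) keeps the class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Claim"], random_state=1
)

minority = train[train["Claim"] == 1]
majority = train[train["Claim"] == 0]

# Over sample the minority class inside the training split only.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Claim"].value_counts().to_dict())
print(test["Claim"].value_counts().to_dict())
```

If the whole dataset were over sampled before splitting, copies of the same minority row could land in both the training and test sets, which tends to inflate the test scores.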

<b>Adaboost Baseline Sampling</b><br>
![image.png](attachment:82a02fa2-fefc-4cf7-80a4-1ff0089495c2.png)

<b>RandomForest Skew Sampling</b><br>
![image.png](attachment:c6725cfd-50c0-48d4-b026-c9683b0dc488.png)

<b>XGBoost and RandomForest Scaling Sampling</b><br>
![image.png](attachment:a62bc6e8-9010-48fa-8ca9-c0c3e2180621.png)
    
<b>Gradient Boosting GridSearch Sampling</b><br>
![image.png](attachment:28a9777a-034b-4445-8a69-0013f8f92662.png)
    
<b>LinearSVC Baseline Sampling</b><br>
![image.png](attachment:3261fb5d-d2e8-4163-a65f-7a0d14a7c603.png)
    
<b>LinearSVC Outliers Sampling</b><br>
![image.png](attachment:b13aa67d-1cec-4235-a3ea-3b94ee8800a6.png)
   

#### Final Version
<b>Here, we added the functions 'under_sample()' and 'over_sample()'. All Boosting, Bagging, and GridSearch code blocks were changed to Raw in this version. The reason is that Over Sampling greatly increased the scores right from the Baseline models (screenshot below) onwards, especially for DecisionTree and RandomForest.</b><br>

![image.png](attachment:b4af8420-ef6a-4781-af2c-5f232141d3a1.png)

![image.png](attachment:cd91e3b6-bd15-4e6b-9d59-09488a3d2577.png)

<b>Overall, RandomForest produced the best results. Even after some EDA and Preprocessing, the scores achieved for DecisionTree and RandomForest after each EDA process were almost identical, although there was a bit of variance in scores between the models.</b><br>
<b>For this, we saved the model 'rfc_under' into a serial file through Pickle.</b><br>


<hr><hr><hr>