
# Case Study - Insurance Claims Prediction

Insurance companies sell promises: in return for a premium, they guarantee that if an insured event occurs, the policyholder will receive a benefit according to the policy's sum assured. After a policy is sold, many years can pass before a claim is made. To cover these future payments, insurers set up a reserve, known as the Claims Reserve, which is calculated by the finance department and actuaries (statisticians in insurance companies).

It is extremely important to predict the final amount of claims and to estimate the Claims Reserve as accurately as possible so that insurers can:

1.   use capital effectively when it is not needed to build reserves,
2.   make better management decisions about investments, new products and sales strategy,
3.   build trust and stability through accurate financial statements.

Traditionally, insurers predict claim amounts by grouping similar policies and analyzing the historical development of their losses. Estimates of claim amounts are produced from the patterns in this historical data. Expert adjustments are also needed, because the data is historical while the predictions apply to new policies. Finally, the estimates for all policy groups are aggregated into the Claims Reserve amount that is reported on the financial statements.

#### **Install and Import Libraries Needed**

Install Library

In [0]:
# Install Category Encoders
! pip install category_encoders

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd
import category_encoders as ce

# Import Library for Machine Learning
import sklearn.metrics as metrics

# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#### **Import Dataset**

Insurance Claim Data

In [0]:
# Import Dataset
df_claim = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/insurance_claim_data.csv', sep =';')
df_claim

In [0]:
# Prints the Dataset Information
df_claim.info()

In [0]:
# Prints Descriptive Statistics
df_claim.describe().transpose()

Customer Profile Data

In [0]:
# Import Dataset
df_customer = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/insurance_profile_data.csv', sep =';')
df_customer

In [0]:
# Prints the Dataset Information
df_customer.info()

In [0]:
# Prints Descriptive Statistics
df_customer.describe().transpose()

Join Insurance Claim Data and Customer Profile Data

In [0]:
# Merge Insurance Claim and Customer Profile Data based on id
df_insurance = pd.merge(df_customer, df_claim, on='id')
df_insurance
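
The merge above uses the default inner join, so any `id` that appears in only one of the two tables is silently dropped. The following is a minimal sketch of how that could be checked, using two small made-up frames (`toy_customer` and `toy_claim` are illustrative stand-ins, not the real data).

```python
import pandas as pd

# Toy stand-ins for df_customer and df_claim (all values are made up)
toy_customer = pd.DataFrame({'id': [1, 2, 3], 'age': [25, 40, 31]})
toy_claim = pd.DataFrame({'id': [1, 2, 4], 'insuranceclaim': [0, 1, 1]})

# validate='one_to_one' raises MergeError if an id is duplicated on either side;
# indicator=True adds a _merge column saying which table each row came from
toy_merged = pd.merge(toy_customer, toy_claim, on='id', how='outer',
                      validate='one_to_one', indicator=True)

# Rows present in only one table would be lost by the default inner join
unmatched = toy_merged[toy_merged['_merge'] != 'both']
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join loses nothing.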

In [0]:
# Prints the Dataset Information
df_insurance.info()

In [0]:
# Count of Each Class 
df_insurance.insuranceclaim.value_counts()

#### **Explore the Dataset**

Visualize Data using Pairplot

In [0]:
# Set Graph Size
plt.rcParams['figure.figsize'] = (16, 8)

# Visualize Pair Plot with Colors
sns.pairplot(df_insurance, hue='insuranceclaim')

Visualize Data using Scatterplot

In [0]:
# Draw Scatterplot
sns.relplot(x='bmi', y='charges', hue='insuranceclaim', style='sex', size='age',
            col='children', kind='scatter', data=df_insurance)

Visualize Correlation between Features

In [0]:
# Draw Correlation Map
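# numeric_only=True skips text columns such as region, which recent pandas versions no longer drop automatically
sns.clustermap(df_insurance.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)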

#### **Preprocess the Data**

Handling Missing Values

In [0]:
# Check for Missing Values
df_insurance.isnull().sum()

In [0]:
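# Compute the Median Value
median = df_insurance['bmi'].median()

# Use Median to Replace Missing Values
# (assign the result back; an in-place fillna on a selected column may not update the DataFrame in recent pandas)
df_insurance['bmi'] = df_insurance['bmi'].fillna(median)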

# Check for Missing Values
df_insurance.isnull().sum()
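
The median is used here rather than the mean, presumably because it is less sensitive to extreme values. A small sketch with made-up BMI values (the `toy_*` names are illustrative only) shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up BMI values with one missing entry and one extreme value
toy_bmi = pd.Series([22.0, 24.0, np.nan, 26.0, 60.0])

# Series.median() and Series.mean() both skip NaN by default
toy_median = toy_bmi.median()  # middle of 22, 24, 26, 60
toy_mean = toy_bmi.mean()      # pulled upward by the 60.0 outlier

# Fill the gap with the median and assign the result to a new Series
toy_filled = toy_bmi.fillna(toy_median)
print(toy_median, toy_mean)
print(toy_filled.tolist())
```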

Encode Categorical Data

In [0]:
# Import Module
from sklearn.preprocessing import OneHotEncoder

# Encoder (sparse_output replaces the older sparse argument in scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False)

# Encode Categorical Data, keeping the original row index so the concat below aligns
df_encoded = pd.DataFrame(encoder.fit_transform(df_insurance[['region']]), index=df_insurance.index)
df_encoded.columns = encoder.get_feature_names_out(['region'])

# Replace Categorical Data with Encoded Data
df_insurance.drop(['region'], axis=1, inplace=True)
df_insurance_encoded = pd.concat([df_insurance, df_encoded], axis=1)

# Show Encoded Dataframe
df_insurance_encoded

Select Feature and Target

In [0]:
# Select Features
feature = df_insurance_encoded.drop(['id', 'insuranceclaim'], axis=1)
feature

In [0]:
# Select Target
target = df_insurance_encoded['insuranceclaim']
target

Set Training and Testing Data

In [0]:
# Set Training and Testing Data (70:30)
from sklearn.model_selection import train_test_split, cross_val_score
feature_train, feature_test, target_train, target_test = train_test_split(feature, target, shuffle=True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)
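
A plain random split does not guarantee that both parts have the same share of each class. If the classes are imbalanced, one option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (`toy_X`, `toy_y`), not the insurance data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)

# stratify=toy_y keeps the 80/20 class ratio in both the training and testing parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=1)

# Share of class 1 in each part
print(y_tr.mean(), y_te.mean())
```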

#### **Modeling**

##### **Decision Tree Classifier**

Build Model

In [0]:
# Import library
from sklearn import tree

# Modeling Decision Tree
dtree = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtree.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_dtree = dtree.predict(feature_test)
target_predicted_dtree
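
The value `min_impurity_decrease=0.01` is fixed by hand above. One possible way to compare candidate values is cross-validation with `cross_val_score`. The sketch below runs on synthetic data from `make_classification`, and the candidate values are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the insurance data)
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=1)

# Mean 5-fold accuracy for each candidate value (candidates chosen arbitrarily)
cv_results = {}
for value in [0.0, 0.01, 0.05]:
    model = DecisionTreeClassifier(min_impurity_decrease=value, random_state=1)
    scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring='accuracy')
    cv_results[value] = scores.mean()

for value, mean_acc in cv_results.items():
    print(value, round(mean_acc, 3))
```

On the real data, `feature_train` and `target_train` would take the place of `toy_X` and `toy_y`, so the test split stays untouched during tuning.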

In [0]:
# Visualize Tree
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=['notclaim', 'claim'],
                # Take the names from the training frame so they match the fitted column order
                feature_names=list(feature_train.columns))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Model Evaluation

In [0]:
# Confusion Matrix
cm_dtree = metrics.confusion_matrix(target_test, target_predicted_dtree)
cm_dtree

In [0]:
# Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
acc_dtree = metrics.accuracy_score(target_test, target_predicted_dtree)
prec_dtree = metrics.precision_score(target_test, target_predicted_dtree)
rec_dtree = metrics.recall_score(target_test, target_predicted_dtree)
f1_dtree = metrics.f1_score(target_test, target_predicted_dtree)
kappa_dtree = metrics.cohen_kappa_score(target_test, target_predicted_dtree)

# Show Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
print('Accuracy:', acc_dtree )
print('Precision:', prec_dtree)
print('Recall:', rec_dtree)
print('F1 Score:', f1_dtree)
print('Cohens Kappa Score:', kappa_dtree)
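
All five scores can be derived from the four cells of the confusion matrix. The sketch below works through the arithmetic on made-up labels (`toy_true`, `toy_pred`) and can be checked against the `sklearn.metrics` functions used above.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 4 actual claims (1) and 6 non-claims (0)
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = metrics.confusion_matrix(toy_true, toy_pred).ravel()
n = tp + tn + fp + fn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)   # share of predicted claims that are real claims
recall = tp / (tp + fn)      # share of real claims that were predicted
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_observed = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(accuracy, precision, recall, f1, kappa)
```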

In [0]:
# Set Figure Size and Plot Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_dtree_prob = dtree.predict_proba(feature_test)[:, 1]
fp_rate_dtree, tp_rate_dtree, _ = metrics.roc_curve(target_test, target_predicted_dtree_prob)
auc_dtree = metrics.roc_auc_score(target_test, target_predicted_dtree_prob)
plt.plot(fp_rate_dtree, tp_rate_dtree, label='Decision Tree, auc='+str(auc_dtree))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()
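
The AUC reported in the legend can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch on four made-up scores (`toy_true` and `toy_score` are illustrative) compares `roc_auc_score` with that pairwise count:

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities for class 1
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = metrics.roc_curve(toy_true, toy_score)
toy_auc = metrics.roc_auc_score(toy_true, toy_score)

# Share of (positive, negative) pairs where the positive case scores higher
# (this toy data has no tied scores, so ties need no special handling here)
pos = toy_score[toy_true == 1]
neg = toy_score[toy_true == 0]
pairwise = np.mean([p > q for p in pos for q in neg])

print(toy_auc, pairwise)
```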

##### **K-Nearest Neighbor Classifier**

Build Model

In [0]:
# Import Module
from sklearn.neighbors import KNeighborsClassifier

# Modeling K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors= 71)
knn.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_knn = knn.predict(feature_test)
target_predicted_knn
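
k-NN classifies by distance, so a column with large values (such as `charges`) can dominate columns with small values. A common remedy is to standardize the features first. The sketch below is one way to do this with a pipeline, run on synthetic data. The inflated first column and `n_neighbors=15` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the first column is inflated to mimic a large-valued feature
toy_X, toy_y = make_classification(n_samples=400, n_features=6, random_state=1)
toy_X[:, 0] *= 10000
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3, random_state=1)

# The scaler is fitted on the training part only and then applied before k-NN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_knn.fit(X_tr, y_tr)

toy_accuracy = scaled_knn.score(X_te, y_te)
print(toy_accuracy)
```

Whether scaling helps on the insurance data would need to be checked by comparing the scores with and without it.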

Model Evaluation

In [0]:
# Confusion Matrix
cm_knn = metrics.confusion_matrix(target_test, target_predicted_knn)
cm_knn

In [0]:
# Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
acc_knn = metrics.accuracy_score(target_test, target_predicted_knn)
prec_knn = metrics.precision_score(target_test, target_predicted_knn)
rec_knn = metrics.recall_score(target_test, target_predicted_knn)
f1_knn = metrics.f1_score(target_test, target_predicted_knn)
kappa_knn = metrics.cohen_kappa_score(target_test, target_predicted_knn)

# Show Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
print('Accuracy:', acc_knn)
print('Precision:', prec_knn)
print('Recall:', rec_knn)
print('F1 Score:', f1_knn)
print('Cohens Kappa Score:', kappa_knn)

In [0]:
# Set Figure Size and Plot Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_knn_prob = knn.predict_proba(feature_test)[:, 1]
fp_rate_knn, tp_rate_knn, _ = metrics.roc_curve(target_test, target_predicted_knn_prob)
auc_knn = metrics.roc_auc_score(target_test, target_predicted_knn_prob)
plt.plot(fp_rate_knn, tp_rate_knn, label='KNN, auc='+str(auc_knn))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **Naive Bayes Classifier**

Build Model

In [0]:
# Import Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes
nb = GaussianNB()
nb.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_nb = nb.predict(feature_test)
target_predicted_nb

Model Evaluation

In [0]:
# Confusion Matrix
cm_nb = metrics.confusion_matrix(target_test, target_predicted_nb)
cm_nb

In [0]:
# Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
acc_nb = metrics.accuracy_score(target_test, target_predicted_nb)
prec_nb = metrics.precision_score(target_test, target_predicted_nb)
rec_nb = metrics.recall_score(target_test, target_predicted_nb)
f1_nb = metrics.f1_score(target_test, target_predicted_nb)
kappa_nb = metrics.cohen_kappa_score(target_test, target_predicted_nb)

# Show Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
print('Accuracy:', acc_nb)
print('Precision:', prec_nb)
print('Recall:', rec_nb)
print('F1 Score:', f1_nb)
print('Cohens Kappa Score:', kappa_nb)

In [0]:
# Set Figure Size and Plot Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve
target_predicted_nb_prob = nb.predict_proba(feature_test)[:, 1]
fp_rate_nb, tp_rate_nb, _ = metrics.roc_curve(target_test, target_predicted_nb_prob)
auc_nb = metrics.roc_auc_score(target_test, target_predicted_nb_prob)
plt.plot(fp_rate_nb, tp_rate_nb, label='Naive Bayes, auc='+str(auc_nb))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

#### **Evaluating Models**

Compare Model Performance

In [0]:
# Comparing Model Performance
print('Decision Tree Accuracy =',acc_dtree)
print('Decision Tree Precision =',prec_dtree)
print('Decision Tree Recall =',rec_dtree)
print('Decision Tree F1-Score =', f1_dtree)
print('_______________________')
print('k-NN Accuracy =', acc_knn)
print('k-NN Precision =', prec_knn)
print('k-NN Recall =', rec_knn)
print('k-NN F1-Score =', f1_knn)
print('_______________________')
print('Naive Bayes Accuracy =', acc_nb)
print('Naive Bayes Precision =', prec_nb)
print('Naive Bayes Recall =', rec_nb)
print('Naive Bayes F1-Score =', f1_nb)
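
The same comparison can be laid out as a table, which is easier to scan than a list of printed lines. The sketch below uses a hypothetical helper `summarize` and made-up predictions for two placeholder models:

```python
import pandas as pd
from sklearn import metrics

def summarize(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dict (hypothetical helper)."""
    return {
        'accuracy': metrics.accuracy_score(y_true, y_pred),
        'precision': metrics.precision_score(y_true, y_pred),
        'recall': metrics.recall_score(y_true, y_pred),
        'f1': metrics.f1_score(y_true, y_pred),
    }

# Made-up labels and predictions for two placeholder models
toy_true = [1, 0, 1, 1, 0, 0]
toy_predictions = {
    'model_a': [1, 0, 1, 0, 0, 0],
    'model_b': [1, 1, 1, 1, 0, 0],
}

# One row per model, one column per metric
comparison = pd.DataFrame(
    {name: summarize(toy_true, pred) for name, pred in toy_predictions.items()}
).T
print(comparison)
```

With the real models, the dictionary would hold `target_predicted_dtree`, `target_predicted_knn` and `target_predicted_nb`, with `target_test` as the true labels.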

Compare ROC Curve

In [0]:
# Comparing ROC Curve
plt.plot(fp_rate_dtree,tp_rate_dtree,label='Decision Tree, auc='+str(auc_dtree))
plt.plot(fp_rate_knn,tp_rate_knn,label='K-NN, auc='+str(auc_knn))
plt.plot(fp_rate_nb,tp_rate_nb,label='Naive Bayes, auc='+str(auc_nb))
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()

#### **Predict New Data**

Import New Customer Data

In [0]:
# Import New Dataset
df_new_customer = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/insurance_new_customer_profile.csv', sep =';')
df_new_customer

Preprocess the New Customer Data

In [0]:
# Reuse the encoder already fitted on the training data so the encoded columns
# match the features the models were trained on (refitting on new data could change them)
df_new_customer_encoded = pd.DataFrame(encoder.transform(df_new_customer[['region']]), index=df_new_customer.index)
df_new_customer_encoded.columns = encoder.get_feature_names_out(['region'])

# Replace Categorical Data with Encoded Data
df_new_customer.drop(['region'], axis=1, inplace=True)
df_new_customer_encoded = pd.concat([df_new_customer, df_new_customer_encoded], axis=1)

# Show Encoded Dataframe
df_new_customer_encoded

In [0]:
# Select Features
new_feature = df_new_customer_encoded.drop(['id'], axis=1)
new_feature

Predict New Customer Data

In [0]:
# Predict using Decision Tree Classifier
# The new DataFrame already has a default 0..n-1 index, so no reset_index() call is needed
new_predicted_dtree = pd.DataFrame(dtree.predict(new_feature), columns=['insuranceclaim_dtree'])
new_predicted_dtree

In [0]:
# Predict using K-Nearest Neighbor Classifier
# The new DataFrame already has a default 0..n-1 index, so no reset_index() call is needed
new_predicted_knn = pd.DataFrame(knn.predict(new_feature), columns=['insuranceclaim_knn'])
new_predicted_knn

In [0]:
# Predict using Naive Bayes Classifier
# The new DataFrame already has a default 0..n-1 index, so no reset_index() call is needed
new_predicted_nb = pd.DataFrame(nb.predict(new_feature), columns=['insuranceclaim_nb'])
new_predicted_nb

Show Prediction Comparison

In [0]:
# Show Prediction Result
pred_new_customer = pd.concat([df_new_customer, new_predicted_dtree, new_predicted_knn, new_predicted_nb], axis=1)
pred_new_customer
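
With three prediction columns side by side, it can be useful to flag the customers on whom the models disagree. The sketch below does this on a made-up frame that reuses the column names above. The `all_agree` and `majority_vote` columns are illustrative additions, not part of the original workflow.

```python
import pandas as pd

# Made-up predictions from the three models for three customers
toy_pred = pd.DataFrame({
    'insuranceclaim_dtree': [1, 0, 1],
    'insuranceclaim_knn':   [1, 0, 0],
    'insuranceclaim_nb':    [1, 1, 0],
})
pred_cols = ['insuranceclaim_dtree', 'insuranceclaim_knn', 'insuranceclaim_nb']

# True where all three models give the same label
toy_pred['all_agree'] = toy_pred[pred_cols].nunique(axis=1) == 1

# Label chosen by at least two of the three models
toy_pred['majority_vote'] = (toy_pred[pred_cols].sum(axis=1) >= 2).astype(int)
print(toy_pred)
```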

Save Prediction Result

In [0]:
# Save Prediction Result
pred_new_customer.to_csv('new_customer_prediction.csv', index=False)