<a href="https://colab.research.google.com/github/abhivyaktsr/insurance_risk_prediction/blob/main/Life_Insurance_risk_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insurance risk response prediction
____

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/4699/media/iStock_insurancehands300.png" align='center'><br/>

**Life insurance** is a means of ensuring financial support and stability. Millions of life insurance applications are processed every day. In India alone, there are 24 life insurance companies, each offering a multitude of insurance policies.

**In India, over 988 million people do not yet have insurance**, which indicates considerable growth potential for the industry. According to a report by the India Brand Equity Foundation (IBEF), the **Indian insurance industry is expected to grow to Rs 19,56,920 crore (US$ 280 billion) by FY2020**, owing to solid economic growth and higher personal disposable incomes in the country.

###### Table of Contents

1. [Problem Statement](#Problem_Statement)<br>
2. [Import Packages and Setup](#Import_Packages_and_Setup)<br>
3. [Loading Data](#Loading_Data)<br>
4. [Preliminary Analysis](#Preliminary_Analysis)<br>

<a id=Problem_Statement></a>
## 1. Problem Statement
___
Life insurance companies take a number of days to process an insurance application. Processing involves evaluating the risk of providing insurance to an applicant by considering the following:
  * Medical history 
  * Family history 
  * Employment status
  * Type of policy applied 
  
With a growing number of applicants, compounded by an increasing number of policies offered by life insurance companies, processing each application takes even longer. Based on historical data, it is possible to build a machine learning model that evaluates the risk of providing insurance, which can drastically reduce the processing time of applications.


### Objective of project:
**To generate a machine learning model to predict insurance risk response of an application.**

This involves developing an understanding of:
  * Which attributes contribute the most to the decision on insurance risk response?
  * Which machine learning algorithm provides the best evaluation metrics?
  * What are the limitations of the model? In what scenarios can the model be used?
  * How can the model be improved in the future?

<a id=Import_Packages_and_Setup></a>
## 2. Import Packages and Setup
___

In [14]:
# To hide pip installation logs
!pip install gwpy &> /dev/null

# Installing pandas-profiling version 2.5.0, which is supported by Google Colab
!pip install pandas-profiling==2.5.0



In [15]:
# Importing packages
import numpy as np
import pandas as pd
import pandas_profiling as profile

# Importing visualization packages
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Miscellaneous
import warnings
from tabulate import tabulate

In [16]:
pd.set_option('mode.chained_assignment', None)        # To suppress pandas warnings.
#pd.set_option('display.max_colwidth', None)           # To display all the data in each column
pd.set_option('display.max_columns', None)            # To display every column of the dataset in head()

plt.style.use('seaborn-whitegrid')                    # Using seaborn white grid style for charts

warnings.filterwarnings('ignore')                     # To suppress all the warnings in the notebook
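
The style name `seaborn-whitegrid` used above was, as far as we are aware, renamed to `seaborn-v0_8-whitegrid` in Matplotlib 3.6, and the old name is not available in later releases. A minimal sketch of a fallback is shown below; the helper `pick_style` is hypothetical and not part of the original notebook.

```python
def pick_style(available,
               preferred=("seaborn-v0_8-whitegrid", "seaborn-whitegrid")):
    """Return the first preferred style found in `available`, else 'default'."""
    for name in preferred:
        if name in available:
            return name
    return "default"


# In the notebook this would be used as (assuming plt is matplotlib.pyplot):
#   plt.style.use(pick_style(plt.style.available))
print(pick_style(["classic", "seaborn-v0_8-whitegrid"]))
```

Passing the list of available styles in as an argument keeps the helper independent of the installed Matplotlib version.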

<a id=Loading_Data></a>
## 3. Loading Data
___

The data can be downloaded from https://raw.githubusercontent.com/insaid2018/Term-2/master/Projects/insurance_data.csv

In [17]:
# Importing insurance raw data
insurance_raw_data = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-2/master/Projects/insurance_data.csv')

In [18]:
numerical_columns = ["Product_Info_4", "Ins_Age", "Ht", "Wt", "BMI", "Employment_Info_1", "Employment_Info_4", "Employment_Info_6", "Insurance_History_5", 
                     "Family_Hist_2", "Family_Hist_3", "Family_Hist_4", "Family_Hist_5"]
categorical_columns = ["Product_Info_1", "Product_Info_2", "Product_Info_3", "Product_Info_5", "Product_Info_6", "Product_Info_7", 
                       "Employment_Info_2", "Employment_Info_3", "Employment_Info_5", 
                       "InsuredInfo_1", "InsuredInfo_2", "InsuredInfo_3", "InsuredInfo_4", "InsuredInfo_5", "InsuredInfo_6", "InsuredInfo_7", 
                       "Insurance_History_1", "Insurance_History_2", "Insurance_History_3", "Insurance_History_4", "Insurance_History_7", "Insurance_History_8", "Insurance_History_9", 
                       "Family_Hist_1", "Medical_History_2", "Medical_History_3", "Medical_History_4", "Medical_History_5", "Medical_History_6", "Medical_History_7", "Medical_History_8",
                       "Medical_History_9", "Medical_History_11", "Medical_History_12", "Medical_History_13", "Medical_History_14", "Medical_History_16", "Medical_History_17", 
                       "Medical_History_18", "Medical_History_19", "Medical_History_20", "Medical_History_21", "Medical_History_22", "Medical_History_23", "Medical_History_25", 
                       "Medical_History_26", "Medical_History_27", "Medical_History_28", "Medical_History_29", "Medical_History_30", "Medical_History_31", "Medical_History_33", 
                       "Medical_History_34", "Medical_History_35", "Medical_History_36", "Medical_History_37", "Medical_History_38", "Medical_History_39", "Medical_History_40", 
                       "Medical_History_41"]
discrete_columns = ["Medical_History_1", "Medical_History_10", "Medical_History_15", "Medical_History_24", "Medical_History_32"]
dummy_variables = ["Medical_Keyword_1", "Medical_Keyword_2", "Medical_Keyword_3", "Medical_Keyword_4", "Medical_Keyword_5", "Medical_Keyword_6", "Medical_Keyword_7", "Medical_Keyword_8",                
                   "Medical_Keyword_9", "Medical_Keyword_10", "Medical_Keyword_11", "Medical_Keyword_12", "Medical_Keyword_13", "Medical_Keyword_14", "Medical_Keyword_15", 
                   "Medical_Keyword_16", "Medical_Keyword_17", "Medical_Keyword_18", "Medical_Keyword_19", "Medical_Keyword_20", "Medical_Keyword_21", "Medical_Keyword_22", 
                   "Medical_Keyword_23", "Medical_Keyword_24", "Medical_Keyword_25", "Medical_Keyword_26", "Medical_Keyword_27", "Medical_Keyword_28", "Medical_Keyword_29", 
                   "Medical_Keyword_30", "Medical_Keyword_31", "Medical_Keyword_32", "Medical_Keyword_33", "Medical_Keyword_34", "Medical_Keyword_35", "Medical_Keyword_36", 
                   "Medical_Keyword_37", "Medical_Keyword_38", "Medical_Keyword_39", "Medical_Keyword_40", "Medical_Keyword_41", "Medical_Keyword_42", "Medical_Keyword_43", 
                   "Medical_Keyword_44", "Medical_Keyword_45", "Medical_Keyword_46", "Medical_Keyword_47", "Medical_Keyword_48"]

<a id=Preliminary_Analysis></a>
## 4. Preliminary analysis
___

In [None]:
# Data analysis of numerical columns
insurance_raw_data[numerical_columns].describe()

| | Product_Info_4 | Ins_Age | Ht | Wt | BMI | Employment_Info_1 | Employment_Info_4 | Employment_Info_6 | Insurance_History_5 | Family_Hist_2 | Family_Hist_3 | Family_Hist_4 | Family_Hist_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 59381.0 | 59381.0 | 59381.0 | 59381.0 | 59381.0 | 59362.0 | 52602.0 | 48527.0 | 33985.0 | 30725.0 | 25140.0 | 40197.0 | 17570.0 |
| mean | 0.328952 | 0.405567 | 0.707283 | 0.292587 | 0.469462 | 0.077582 | 0.006283 | 0.361469 | 0.001733 | 0.47455 | 0.497737 | 0.44489 | 0.484635 |
| std | 0.282562 | 0.19719 | 0.074239 | 0.089037 | 0.122213 | 0.082347 | 0.032816 | 0.349551 | 0.007338 | 0.154959 | 0.140187 | 0.163012 | 0.1292 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.076923 | 0.238806 | 0.654545 | 0.225941 | 0.385517 | 0.035 | 0.0 | 0.06 | 0.0004 | 0.362319 | 0.401961 | 0.323944 | 0.401786 |
| 50% | 0.230769 | 0.402985 | 0.709091 | 0.288703 | 0.451349 | 0.06 | 0.0 | 0.25 | 0.000973 | 0.463768 | 0.519608 | 0.422535 | 0.508929 |
| 75% | 0.487179 | 0.567164 | 0.763636 | 0.345188 | 0.532858 | 0.1 | 0.0 | 0.55 | 0.002 | 0.57971 | 0.598039 | 0.56338 | 0.580357 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.943662 | 1.0 |


**Observations:**

1. The dataset contains 13 continuous numerical columns.
2. All columns are normalized to values between 0 and 1.
3. Almost all columns are skewed, either to the right or to the left.
4. The following columns have missing entries; their non-null counts range from 59,362 (`Employment_Info_1`) down to 17,570 (`Family_Hist_5`) out of 59,381 rows:

> `Employment_Info_1`, `Employment_Info_4`, `Employment_Info_6`, `Insurance_History_5`, `Family_Hist_2`, `Family_Hist_3`, `Family_Hist_4`, `Family_Hist_5`
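
The skewness and missing-value observations can be checked numerically rather than read off the `describe()` output. The sketch below is one possible way to do that; the helper `summarize_numeric` and the toy DataFrame are illustrative assumptions, and in the notebook the call would use `insurance_raw_data` and `numerical_columns` instead.

```python
import numpy as np
import pandas as pd


def summarize_numeric(df, columns):
    """Return missing count, missing share (%) and sample skewness per column."""
    subset = df[columns]
    return pd.DataFrame({
        "missing": subset.isna().sum(),
        "missing_pct": (subset.isna().mean() * 100).round(2),
        "skew": subset.skew(),  # > 0: right-skewed, < 0: left-skewed
    })


# Toy frame standing in for insurance_raw_data[numerical_columns]
toy = pd.DataFrame({
    "a": [0.0, 0.1, 0.1, 0.2, 1.0],     # long right tail, no missing values
    "b": [0.0, 0.8, 0.9, np.nan, 1.0],  # long left tail, one missing value
})
summary = summarize_numeric(toy, ["a", "b"])
print(summary)
```

Sorting the result by `missing_pct` gives a quick view of which columns may need imputation or removal before modelling.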







In [None]:
insurance_raw_data[categorical_columns].describe(include='all')

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_5,Product_Info_6,Product_Info_7,Employment_Info_2,Employment_Info_3,Employment_Info_5,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41
count,59381.0,59381,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0
unique,,19,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,D3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,14321,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,1.026355,,24.415655,2.006955,2.673599,1.043583,8.641821,1.300904,2.142958,1.209326,2.007427,5.83584,2.883666,1.02718,1.409188,1.038531,1.727606,1.055792,2.146983,1.958707,1.901989,2.048484,2.41936,2.68623,253.9871,2.102171,1.654873,1.007359,2.889897,2.012277,2.044088,1.769943,2.993836,2.056601,2.768141,2.968542,1.327529,2.978006,1.053536,1.034455,1.985079,1.108991,1.981644,2.528115,1.194961,2.808979,2.980213,1.06721,2.542699,2.040771,2.985265,2.804618,2.689076,1.002055,2.179468,1.938398,1.00485,2.83072,2.967599,1.641064
std,0.160191,,5.072885,0.083107,0.739103,0.291949,4.227082,0.715034,0.350033,0.417939,0.085858,2.674536,0.320627,0.231566,0.491688,0.274915,0.445195,0.329328,0.989139,0.945739,0.971223,0.755149,0.509577,0.483159,178.621154,0.303098,0.475414,0.085864,0.456128,0.17236,0.291353,0.421032,0.09534,0.231153,0.640259,0.197715,0.740118,0.146778,0.225848,0.182859,0.121375,0.311847,0.134236,0.84917,0.406082,0.393237,0.197652,0.250589,0.839904,0.1981,0.170989,0.593798,0.724661,0.063806,0.412633,0.240574,0.069474,0.556665,0.252427,0.933361
min,1.0,,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,1.0,,26.0,2.0,3.0,1.0,9.0,1.0,2.0,1.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,112.0,2.0,1.0,1.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,1.0
50%,1.0,,26.0,2.0,3.0,1.0,9.0,1.0,2.0,1.0,2.0,6.0,3.0,1.0,1.0,1.0,2.0,1.0,3.0,2.0,1.0,2.0,2.0,3.0,162.0,2.0,2.0,1.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,1.0
75%,1.0,,26.0,2.0,3.0,1.0,9.0,1.0,2.0,1.0,2.0,8.0,3.0,1.0,2.0,1.0,2.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,418.0,2.0,2.0,1.0,3.0,2.0,2.0,2.0,3.0,2.0,3.0,3.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,3.0


In [None]:
insurance_raw_data[discrete_columns].describe()

| | Medical_History_1 | Medical_History_10 | Medical_History_15 | Medical_History_24 | Medical_History_32 |
|---|---|---|---|---|---|
| count | 50492.0 | 557.0 | 14785.0 | 3801.0 | 1107.0 |
| mean | 7.962172 | 141.118492 | 123.760974 | 50.635622 | 11.965673 |
| std | 13.027697 | 107.759559 | 98.516206 | 78.149069 | 38.718774 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 8.0 | 17.0 | 1.0 | 0.0 |
| 50% | 4.0 | 229.0 | 117.0 | 8.0 | 0.0 |
| 75% | 9.0 | 240.0 | 240.0 | 64.0 | 2.0 |
| max | 240.0 | 240.0 | 240.0 | 240.0 | 240.0 |


In [None]:
insurance_raw_data.info(max_cols=128)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59381 entries, 0 to 59380
Data columns (total 128 columns):
Id                     59381 non-null int64
Product_Info_1         59381 non-null int64
Product_Info_2         59381 non-null object
Product_Info_3         59381 non-null int64
Product_Info_4         59381 non-null float64
Product_Info_5         59381 non-null int64
Product_Info_6         59381 non-null int64
Product_Info_7         59381 non-null int64
Ins_Age                59381 non-null float64
Ht                     59381 non-null float64
Wt                     59381 non-null float64
BMI                    59381 non-null float64
Employment_Info_1      59362 non-null float64
Employment_Info_2      59381 non-null int64
Employment_Info_3      59381 non-null int64
Employment_Info_4      52602 non-null float64
Employment_Info_5      59381 non-null int64
Employment_Info_6      48527 non-null float64
InsuredInfo_1          59381 non-null int64
InsuredInfo_2          59381 non-null
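
The `info()` output reports 128 columns, while the four lists defined in the Loading Data section (`numerical_columns`, `categorical_columns`, `discrete_columns`, `dummy_variables`) together name 13 + 60 + 5 + 48 = 126 of them. The two remaining columns are `Id` and, presumably, the target column. A minimal sketch of a consistency check for such hand-written groupings follows; the helper `check_column_groups` and the toy data are hypothetical and not part of the original notebook.

```python
from collections import Counter


def check_column_groups(all_columns, groups, ignore=()):
    """Report columns listed in several groups, in no group, or not in the data."""
    assigned = Counter(col for cols in groups.values() for col in cols)
    duplicated = sorted(col for col, n in assigned.items() if n > 1)
    unassigned = [col for col in all_columns
                  if col not in assigned and col not in ignore]
    unknown = sorted(col for col in assigned if col not in all_columns)
    return {"duplicated": duplicated, "unassigned": unassigned, "unknown": unknown}


# Toy example; in the notebook, pass list(insurance_raw_data.columns)
# and a dict holding the four column lists.
toy_columns = ["Id", "Ins_Age", "BMI", "Product_Info_1", "Medical_Keyword_1"]
toy_groups = {
    "numerical": ["Ins_Age", "BMI"],
    "categorical": ["Product_Info_1", "BMI"],  # "BMI" deliberately listed twice
    "dummy": ["Medical_Keyword_2"],            # deliberately absent from the data
}
report = check_column_groups(toy_columns, toy_groups, ignore=("Id",))
print(report)
```

An empty result for all three keys would indicate that every column belongs to exactly one group.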

In [21]:
insurance_data_profile_report = profile.ProfileReport(insurance_raw_data, title="Pandas Profiling Report")

In [22]:
insurance_data_profile_report.to_file("Insurance_data_profile_report.html")