### Prudential Life Insurance Assessment

- https://www.kaggle.com/c/prudential-life-insurance-assessment
- https://www.kaggle.com/c/prudential-life-insurance-assessment/data

In this dataset, you are provided over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable for each Id in the test set. "Response" is an ordinal measure of risk that has 8 levels.

- train.csv - the training set, contains the Response values
- test.csv - the test set, you must predict the Response variable for all rows in this file
- sample_submission.csv - a sample submission file in the correct format


Data Description

- `Id`: A unique identifier associated with an application.
- `Product_Info_1-7`: A set of normalized variables relating to the product applied for.
- `Ins_Age`: Normalized age of applicant.
- `Ht`: Normalized height of applicant.
- `Wt`: Normalized weight of applicant.
- `BMI`: Normalized BMI of applicant.
- `Employment_Info_1-6`: A set of normalized variables relating to the employment history of the applicant.
- `InsuredInfo_1-7`: A set of normalized variables providing information about the applicant.
- `Insurance_History_1-9`: A set of normalized variables relating to the insurance history of the applicant.
- `Family_Hist_1-5`: A set of normalized variables relating to the family history of the applicant.
- `Medical_History_1-41`: A set of normalized variables relating to the medical history of the applicant.
- `Medical_Keyword_1-48`: A set of dummy variables relating to the presence or absence of a medical keyword associated with the application.
- `Response`: The target variable, an ordinal variable relating to the final decision associated with an application.
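
The column families listed above can be checked against the actual header with a small helper. This is a minimal sketch: `count_column_groups` and `toy_df` are hypothetical names, and the toy frame only stands in for `train.csv` with a handful of columns.

```python
import re
from collections import Counter

import pandas as pd


def count_column_groups(columns):
    """Count columns per family by stripping a trailing '_<number>' suffix."""
    return Counter(re.sub(r'_\d+$', '', col) for col in columns)


# Toy frame standing in for train.csv (hypothetical, only a few columns).
toy_df = pd.DataFrame(columns=['Id', 'Product_Info_1', 'Product_Info_2',
                               'Ins_Age', 'Medical_Keyword_1', 'Response'])

group_counts = count_column_groups(toy_df.columns)
print(group_counts)
```

Running the same helper on the real `train_org_df.columns` should show how many columns each family actually contains, which is a quick way to spot a family that is larger or smaller than the description suggests.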

In [79]:
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train_org_df = pd.read_csv('./inputdata/train.csv')
test_org_df = pd.read_csv('./inputdata/test.csv')
submission_org_df = pd.read_csv('./inputdata/sample_submission.csv')
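
A quick sanity check after loading is to confirm that the train and test frames differ only in the `Response` column and do not share any `Id`. The sketch below uses two small hypothetical frames (`toy_train`, `toy_test`) in place of the real files.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv.
toy_train = pd.DataFrame({'Id': [2, 5], 'Ins_Age': [0.64, 0.06], 'Response': [8, 4]})
toy_test = pd.DataFrame({'Id': [1, 3], 'Ins_Age': [0.61, 0.63]})

# Columns present in one frame but not the other.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

# Ids that appear in both frames (expected to be empty).
id_overlap = set(toy_train['Id']) & set(toy_test['Id'])

print(only_in_train, only_in_test, id_overlap)
```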

In [8]:
pd.set_option('display.max_columns', 128)

In [9]:
train_org_df.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
0,2,1,D3,10,0.076923,2,1,1,0.641791,0.581818,0.148536,0.323008,0.028,12,1,0.0,3,,1,2,6,3,1,2,1,1,1,3,1,0.000667,1,1,2,2,,0.598039,,0.526786,4.0,112,2,1,1,3,2,2,1,,3,2,3,3,240.0,3,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,1,3,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
1,5,1,A1,26,0.076923,2,3,1,0.059701,0.6,0.131799,0.272288,0.0,1,3,0.0,2,0.0018,1,2,6,3,1,2,1,2,1,3,1,0.000133,1,3,2,2,0.188406,,0.084507,,5.0,412,2,1,1,3,2,2,1,,3,2,3,3,0.0,1,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,3,1,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
2,6,1,E1,26,0.076923,2,3,1,0.029851,0.745455,0.288703,0.42878,0.03,9,1,0.0,2,0.03,1,2,8,3,1,1,1,2,1,1,3,,3,2,3,3,0.304348,,0.225352,,10.0,3,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,2,2,3,1,3,2,3,,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
3,7,1,D4,10,0.487179,2,3,1,0.164179,0.672727,0.205021,0.352438,0.042,9,1,0.0,3,0.2,2,2,8,3,1,2,1,2,1,1,3,,3,2,3,3,0.42029,,0.352113,,0.0,350,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,2,2,3,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
4,8,1,D2,26,0.230769,2,3,1,0.41791,0.654545,0.23431,0.424046,0.027,9,1,0.0,2,0.05,1,2,6,3,1,2,1,2,1,1,3,,3,2,3,2,0.463768,,0.408451,,,162,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,2,2,3,1,3,2,3,,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8


In [10]:
test_org_df.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48
0,1,1,D3,26,0.487179,2,3,1,0.61194,0.781818,0.338912,0.472262,0.15,3,1,0.0,2,0.5,2,2,11,3,1,1,1,2,1,1,3,,3,2,3,3,,0.627451,0.760563,,2.0,16,2,2,1,3,1,2,2,,3,2,1,3,,1,2,1,1,2,1,2,1,,2,2,1,1,3,2,3,,3,3,1,3,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,3,1,A2,26,0.076923,2,3,1,0.626866,0.727273,0.311715,0.484984,0.0,1,3,0.07,2,0.2,1,2,8,3,1,1,1,1,1,3,1,0.001667,1,1,2,2,,0.529412,0.746479,,5.0,261,3,1,1,3,2,2,1,,3,2,3,3,110.0,3,3,1,1,2,1,2,3,,2,2,3,1,3,2,3,,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,4,1,D3,26,0.144667,2,3,1,0.58209,0.709091,0.320084,0.519103,0.143,9,1,0.0,2,0.45,1,2,3,3,1,1,1,2,1,1,3,,3,2,3,3,0.666667,,0.661972,,3.0,132,2,1,1,3,2,2,2,,3,2,3,3,240.0,1,3,1,1,2,1,2,3,,2,2,3,1,1,2,3,,1,3,1,3,2,1,3,3,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,1,A1,26,0.151709,2,1,1,0.522388,0.654545,0.267782,0.486962,0.21,9,1,0.0,2,1.0,2,2,3,3,1,1,1,1,1,3,1,0.000667,2,1,2,2,,0.686275,0.676056,,,162,3,2,1,1,2,3,2,,3,2,3,3,,1,3,1,1,2,2,2,3,,1,3,3,2,3,2,3,,3,1,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
4,12,1,A1,26,0.076923,2,3,1,0.298507,0.672727,0.246862,0.428718,0.085,9,1,0.0,2,0.2,1,2,8,3,1,2,1,2,1,1,3,,3,2,3,2,0.449275,,0.380282,,18.0,181,3,1,1,3,2,2,2,,3,2,3,3,188.0,1,3,1,1,2,1,2,1,,1,3,3,1,1,2,3,,3,3,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [11]:
submission_org_df.head()

Unnamed: 0,Id,Response
0,1,8
1,3,8
2,4,8
3,9,8
4,12,8


### EDA

- The target is the `Response` column.
- Ins_Age/Ht/Wt/BMI
  * These features are easy to interpret.
- Product_Info_(1-7)/Employment_Info_(1-6)/InsuredInfo_(1-7)/Insurance_History_(1-9)/Family_Hist_(1-5)/Medical_History_(1-41)/Medical_Keyword_(1-48)
  * A good starting point is to look at the distribution, dtype, and NaN count of each of these features.
  * Because the meaning of these features is not disclosed, combining them to create new features may not be a good idea(?)
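
One way to get the dtype, cardinality, and NaN overview mentioned above in a single table is sketched here. `summarize_columns` is a hypothetical helper, and `toy_df` is a small made-up frame that only borrows a few column names from the dataset.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return dtype, number of distinct values, and NaN ratio for each column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(dropna=True),
        'nan_ratio': df.isna().mean(),
    })


# Made-up values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    'Product_Info_2': ['D3', 'A1', 'E1', 'D3'],
    'Employment_Info_4': [0.0, 0.0, np.nan, 0.0],
    'Medical_History_10': [np.nan, np.nan, np.nan, 240.0],
})

summary = summarize_columns(toy_df)
# Columns with the most missing values first.
print(summary.sort_values('nan_ratio', ascending=False))
```

Applied to `train_df`, sorting by `nan_ratio` would surface the mostly empty columns first, and a low `n_unique` hints that a column is categorical even when its dtype is numeric.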

In [13]:
# Work on a copy so the original DataFrame stays untouched
train_df = train_org_df.copy()

In [72]:
# range() excludes its end value, so each upper bound is the last column number + 1
prod_info_cols = [f'Product_Info_{x}' for x in range(1, 7)]  # Product_Info_1..6; Product_Info_7 is not part of the overview below
emp_info_cols = [f'Employment_Info_{x}' for x in range(1, 7)]
# Insurance_History_6 does not exist in the data, so skip it
ins_hist_cols = [f'Insurance_History_{x}' for x in range(1, 10) if x != 6]
fam_hist_cols = [f'Family_Hist_{x}' for x in range(1, 6)]
med_hist_cols = [f'Medical_History_{x}' for x in range(1, 42)]
med_keyword_cols = [f'Medical_Keyword_{x}' for x in range(1, 49)]

# A neater alternative, from https://www.kaggle.com/soundwaveli00/xgb-test:
# med_keyword_columns = all_data.columns[all_data.columns.str.startswith('Medical_Keyword_')]

# print(prod_info_cols)
# print(emp_info_cols)
# print(ins_hist_cols)
# print(fam_hist_cols)
# print(med_hist_cols)
# print(med_keyword_cols)
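
The prefix-based selection from the commented-out alternative can be wrapped in a small helper, which avoids hard-coding index ranges and copes with gaps in the numbering. This is only a sketch: `columns_with_prefix` and `toy_df` are hypothetical names, and the toy frame carries just a few of the real column names.

```python
import pandas as pd


def columns_with_prefix(df, prefix):
    """Return the column names of df that start with prefix, in original order."""
    return df.columns[df.columns.str.startswith(prefix)].tolist()


# Empty toy frame with a handful of the dataset's column names.
toy_df = pd.DataFrame(columns=['Id', 'Insurance_History_5', 'Insurance_History_7',
                               'Medical_Keyword_1', 'Medical_Keyword_2', 'Response'])

toy_ins_hist_cols = columns_with_prefix(toy_df, 'Insurance_History_')
toy_med_keyword_cols = columns_with_prefix(toy_df, 'Medical_Keyword_')
print(toy_ins_hist_cols)
print(toy_med_keyword_cols)
```

Because the lists are derived from the columns that actually exist, a missing number such as `Insurance_History_6` cannot end up in the result.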

## Product Info Overview

In [73]:
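display(train_df[prod_info_cols].head())
display(train_df[prod_info_cols].describe())  # numeric columns only, so Product_Info_2 is omitted
display(train_df[prod_info_cols].info())  # info() prints its report and returns None, hence the trailing "None"
# number of rows per Product_Info_2 level
display(train_df[['Id', 'Product_Info_2']].groupby('Product_Info_2').count().T)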

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6
0,1,D3,10,0.076923,2,1
1,1,A1,26,0.076923,2,3
2,1,E1,26,0.076923,2,3
3,1,D4,10,0.487179,2,3
4,1,D2,26,0.230769,2,3


Unnamed: 0,Product_Info_1,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6
count,59381.0,59381.0,59381.0,59381.0,59381.0
mean,1.026355,24.415655,0.328952,2.006955,2.673599
std,0.160191,5.072885,0.282562,0.083107,0.739103
min,1.0,1.0,0.0,2.0,1.0
25%,1.0,26.0,0.076923,2.0,3.0
50%,1.0,26.0,0.230769,2.0,3.0
75%,1.0,26.0,0.487179,2.0,3.0
max,2.0,38.0,1.0,3.0,3.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59381 entries, 0 to 59380
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Product_Info_1  59381 non-null  int64  
 1   Product_Info_2  59381 non-null  object 
 2   Product_Info_3  59381 non-null  int64  
 3   Product_Info_4  59381 non-null  float64
 4   Product_Info_5  59381 non-null  int64  
 5   Product_Info_6  59381 non-null  int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 2.7+ MB


None

Product_Info_2,A1,A2,A3,A4,A5,A6,A7,A8,B1,B2,C1,C2,C3,C4,D1,D2,D3,D4,E1
Id,2363,1974,977,210,775,2098,1383,6835,54,1122,285,160,306,219,6554,6286,14321,10812,2647
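
The per-level counts above can also be obtained with `value_counts`, which additionally gives shares via `normalize=True`. The sketch below compares both approaches on a small made-up frame (`toy_df` is hypothetical).

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy_df = pd.DataFrame({
    'Id': [2, 5, 6, 7, 8, 9],
    'Product_Info_2': ['D3', 'A1', 'E1', 'D3', 'D2', 'D3'],
})

# Counts per level, most frequent first.
level_counts = toy_df['Product_Info_2'].value_counts()
# Same counts expressed as a fraction of all rows.
level_share = toy_df['Product_Info_2'].value_counts(normalize=True)
# The groupby/count approach used in the cell above, for comparison.
grouped_counts = toy_df.groupby('Product_Info_2')['Id'].count()

print(level_counts)
print(level_share)
```

In the real counts above, `D3` is the largest level with 14321 of 59381 rows (roughly 24%), while `B1` has only 54 rows.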


## Product Info Correlation

In [102]:
pd.crosstab(train_df['Response'], train_df['Product_Info_2'], normalize='columns', margins=True).style.background_gradient(cmap='coolwarm')

Response \ Product_Info_2,A1,A2,A3,A4,A5,A6,A7,A8,B1,B2,C1,C2,C3,C4,D1,D2,D3,D4,E1,All
1,0.055861,0.120567,0.071648,0.090476,0.068387,0.058151,0.203181,0.139429,0.12963,0.065954,0.157895,0.1375,0.107843,0.086758,0.162496,0.118676,0.100552,0.063541,0.075935,0.104528
2,0.09945,0.135258,0.106448,0.0,0.0,0.0,0.206074,0.106511,0.0,0.033868,0.140351,0.175,0.091503,0.077626,0.194538,0.14588,0.116961,0.06539,0.07858,0.110338
3,0.021583,0.010132,0.0174,0.019048,0.016774,0.012393,0.067968,0.016094,0.166667,0.009804,0.031579,0.03125,0.03268,0.009132,0.025786,0.017499,0.016549,0.007584,0.012845,0.017059
4,0.028777,0.019757,0.03173,0.047619,0.036129,0.020972,0.002892,0.013021,0.111111,0.018717,0.045614,0.03125,0.03268,0.031963,0.042264,0.024499,0.029328,0.011469,0.029467,0.024048
5,0.082522,0.075481,0.065507,0.080952,0.054194,0.058151,0.47867,0.135625,0.148148,0.067736,0.066667,0.06875,0.078431,0.063927,0.066372,0.067292,0.087703,0.074917,0.067246,0.091477
6,0.148963,0.210233,0.159672,0.161905,0.162581,0.155863,0.008677,0.158888,0.203704,0.16221,0.164912,0.15625,0.199346,0.159817,0.200793,0.245307,0.229104,0.166204,0.161692,0.189168
7,0.114262,0.139818,0.126919,0.142857,0.178065,0.141087,0.010846,0.141185,0.092593,0.154189,0.136842,0.15,0.156863,0.123288,0.112603,0.146516,0.145241,0.134758,0.151492,0.135178
8,0.448582,0.288754,0.420676,0.457143,0.483871,0.553384,0.021692,0.289247,0.148148,0.487522,0.25614,0.25,0.300654,0.447489,0.195148,0.23433,0.274562,0.476138,0.422743,0.328203
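
With `normalize='columns'`, each column of the crosstab is divided by its own total, so every column is the `Response` distribution within one `Product_Info_2` level and sums to 1. With `margins=True`, the `All` column holds the overall `Response` distribution. A minimal sketch on made-up data (`toy_df` is hypothetical) illustrates the column normalization:

```python
import numpy as np
import pandas as pd

# Made-up rows: three applications per product level.
toy_df = pd.DataFrame({
    'Response':       [8, 8, 1, 8, 2, 1],
    'Product_Info_2': ['A1', 'A1', 'A1', 'D1', 'D1', 'D1'],
})

# Share of each Response level within each Product_Info_2 level.
share = pd.crosstab(toy_df['Response'], toy_df['Product_Info_2'],
                    normalize='columns')

# Every column should sum to 1 after column-wise normalization.
col_sums = share.sum(axis=0)
print(share)
print(col_sums)
```

Reading the table column by column is therefore the intended direction: comparing a level's column against `All` shows where that level deviates from the overall distribution.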
