#
---
# Gist of this Notebook

<ol>
<li>
    <strong>Loaded and inspected data:</strong> Loaded the dataset from "EDAed_df.csv", converted the "Policy Start Date" column to datetime, and split the data by row position into a training set (the first 1,200,000 rows) and a test set (the remaining 800,000 rows). Displayed the shape of each dataframe.
</li>

<li>
    <strong>Feature Engineering:</strong> Created new features from the "Policy Start Date" column (Day, Month, Year, Quarter, and sin/cos transforms of dates). Grouped the data by selected columns to compute the min, mean, Q1, median, Q3, std and max of "Premium Amount", stored the results as new columns in the dataframe, and saved each aggregation as a CSV file.
</li>

<li>
    <strong>Statistical Analysis:</strong> Analyzed each feature's relationship with the 'Premium Amount' target variable using a statistical test chosen by data type (Kruskal-Wallis for categorical vs. numeric, Spearman correlation for numeric vs. numeric, Chi-Square for categorical vs. categorical). The null hypothesis was 'No Relationship' between the two columns and the alternative hypothesis was 'There is a relationship'. A verdict was given based on the p-value.
</li>

<li>
     <strong>Feature Selection:</strong> Selected features with significant relationships based on a p-value threshold (&lt; 0.05) by running the statistical analysis on each feature against the target column.
</li>

<li>
    <strong>Dimensionality Reduction (Planned):</strong> Left commented-out code for a planned, but not executed, step: dimensionality reduction of the less important columns using Principal Component Analysis (PCA).
</li>

<li>
    <strong>Data Encoding:</strong> Transformed categorical columns into numerical data using <code><strong>OneHotEncoder</strong></code> (for nominal features) and <code><strong>OrdinalEncoder</strong></code> (for ordinal features) by creating new columns and dropping the previous categorical one.
</li>

<li>
     <strong>Data Scaling:</strong> Scaled the numerical features in both the train and test data using <code><strong>RobustScaler</strong></code> after splitting the training set into x_train and x_validate.
</li>

<li>
    <strong>Split Dataset:</strong> Split the data into train and test datasets, then further split the train dataset into x_train, y_train, x_validate, and y_validate using <code>train_test_split</code>. Afterwards, recombined train and validate for all transformations.
</li>

<li>
     <strong>Combining Data:</strong> Combined the train and test data into a single dataframe named <code>df</code>.
</li>

<li>
    <strong>Downloadable CSV:</strong> Saved the resulting preprocessed dataframe as "trainable_df.csv".
</li>
</ol>

#####
---
#

In [444]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import *
import xgboost as xgb

from sklearn.preprocessing import PowerTransformer


import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 150)

In [445]:
df = pd.read_csv("EDAed_df.csv")

df["Policy Start Date"] = pd.to_datetime(df["Policy Start Date"])

In [446]:
df.shape

(2000000, 50)

In [447]:
df.isnull().sum()

Age                                 0
Gender                              0
Annual Income                       0
Marital Status                      0
Number of Dependents                0
Education Level                     0
Occupation                          0
Health Score                        0
Location                            0
Policy Type                         0
Previous Claims                     0
Vehicle Age                         0
Credit Score                        0
Insurance Duration                  0
Policy Start Date                   0
Customer Feedback                   0
Smoking Status                      0
Exercise Frequency                  0
Property Type                       0
Premium Amount                 800000
IsNull_Age                          0
IsNull_Annual Income                0
IsNull_Marital Status               0
IsNull_Number of Dependents         0
IsNull_Occupation                   0
IsNull_Health Score                 0
IsNull_Previ

In [448]:
train = df.iloc[:1200000, :]
train.shape

(1200000, 50)

In [449]:
test = df.iloc[1200000:, :]
test.shape

(800000, 50)
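
The positional split above depends on the labelled rows being exactly the first 1,200,000. A minimal sketch of an alternative, assuming (as the null counts above suggest) that the test rows are precisely the ones with a missing "Premium Amount". The frame `toy` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined frame: labelled rows first, unlabelled after.
toy = pd.DataFrame({
    "Age": [19.0, 39.0, 23.0, 28.0, 31.0],
    "Premium Amount": [2869.0, 1483.0, 567.0, np.nan, np.nan],
})

# Split on the presence of the target instead of a hard-coded row count.
is_labelled = toy["Premium Amount"].notna()
toy_train = toy[is_labelled]
toy_test = toy[~is_labelled]

print(toy_train.shape, toy_test.shape)
```

This keeps the split correct even if the row order or row counts change upstream.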

In [450]:
test.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls
1200000,28.0,Female,2310.0,Divorced,4.0,Bachelor's,Self-Employed,7.657981,Rural,Basic,2.0,19.0,551.0,1.0,2023-06-04 15:21:39.245086,Poor,Yes,Weekly,House,,0,0,1,0,0,0,1,0,1,0,0,2,3430.775431,577.5,1272810.0,4.192377,4620.0,1155.0,82.5,Sunday,275.5,551.0,4.617101,4219.54746,214.423464,4620.0,1102.0,4.0,15.315962,3
1200001,31.0,Female,126031.0,Married,2.0,Master's,Self-Employed,13.381379,Suburban,Premium,1.0,14.0,372.0,8.0,2024-04-22 15:21:39.224915,Good,Yes,Rarely,Apartment,,0,0,0,0,0,0,1,0,0,0,0,1,1659.291012,63015.5,46883532.0,338.793011,378093.0,42010.333333,4065.516129,Monday,372.0,2976.0,4.330931,4977.873036,414.822753,1008248.0,2976.0,8.0,107.051033,1
1200002,47.0,Female,17092.0,Divorced,0.0,PhD,Unemployed,24.354527,Urban,Comprehensive,1.0,16.0,819.0,9.0,2023-04-05 15:21:39.134960,Average,Yes,Monthly,Condo,,0,0,0,0,0,0,1,0,0,0,0,3,9157.302066,17092.0,13998348.0,20.869353,68368.0,4273.0,363.659574,Wednesday,819.0,7371.0,3.782274,19946.357425,1144.662758,68368.0,3276.0,4.0,97.418107,1


#
---
#

# Adding Dates columns

In [451]:
df["Policy Start Date - Day"] = df["Policy Start Date"].dt.day
df["Policy Start Date - Month"] = df["Policy Start Date"].dt.month
df["Policy Start Date - Year"] = df["Policy Start Date"].dt.year

In [452]:
df["Policy Start Date - Quarter"] = df["Policy Start Date"].dt.year.astype(str) + " Q" + df["Policy Start Date"].dt.quarter.astype(str)

In [453]:
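# NOTE: astype('int64') yields nanoseconds since the epoch, and the value is not divided by any period,
# so these two columns do not encode a calendar cycle such as the day of the year.
df["Sin_Date"] = np.sin(2 * np.pi * df["Policy Start Date"].astype('int64'))
df["Cos_Date"] = np.cos(2 * np.pi * df["Policy Start Date"].astype('int64'))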

In [454]:
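# NOTE: the year is an integer, so sin(2*pi*year) is ~0 and cos(2*pi*year) is ~1 for every row; both columns are effectively constant.
df["Sin_Year"] = np.sin(2 * np.pi * df["Policy Start Date - Year"].astype('int64'))
df["Cos_Year"] = np.cos(2 * np.pi * df["Policy Start Date - Year"].astype('int64'))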

In [455]:
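# NOTE: the month is an integer too; without dividing by the period (12), Sin_Month is ~0 and Cos_Month is ~1 for every row.
df["Sin_Month"] = np.sin(2 * np.pi * df["Policy Start Date - Month"].astype('int64'))
df["Cos_Month"] = np.cos(2 * np.pi * df["Policy Start Date - Month"].astype('int64'))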
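
A cyclical encoding usually divides the value by its period before applying sin/cos, so that one full cycle maps onto one turn of the circle. The sketch below is a hedged illustration and not the transformation used in this notebook. The column names `Sin_DayOfYear` and `Cos_DayOfYear`, the sample dates, and the choice of 365.25 as the yearly period are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of policy start dates.
dates = pd.Series(pd.to_datetime(["2023-12-23", "2023-06-12", "2023-09-30"]))

month = dates.dt.month            # 1..12
day_of_year = dates.dt.dayofyear  # 1..365 (366 in leap years)

# Dividing by the period maps one full cycle onto one turn of the circle.
enc = pd.DataFrame({
    "Sin_Month": np.sin(2 * np.pi * month / 12),
    "Cos_Month": np.cos(2 * np.pi * month / 12),
    "Sin_DayOfYear": np.sin(2 * np.pi * day_of_year / 365.25),
    "Cos_DayOfYear": np.cos(2 * np.pi * day_of_year / 365.25),
})
print(enc.round(3))
```

With this scaling, December and January end up close together on the circle. A calendar year has no natural period, so a sin/cos pair is generally not meaningful for it.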

In [456]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls,Policy Start Date - Day,Policy Start Date - Month,Policy Start Date - Year,Policy Start Date - Quarter,Sin_Date,Cos_Date,Sin_Year,Cos_Year,Sin_Month,Cos_Month
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737,Saturday,186.0,1860.0,3.870062,8406.73897,429.376453,20098.0,744.0,4.0,45.197521,0,23,12,2023,2023 Q4,-0.975344,-0.220691,-6.447061e-13,1.0,-2.939152e-15,1.0
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,1,12,6,2023,2023 Q2,-0.998725,0.050489,-6.447061e-13,1.0,-1.469576e-15,1.0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435,Saturday,632.0,1896.0,2.641123,29816.21115,1085.083634,204816.0,5056.0,8.0,377.420394,1,30,9,2023,2023 Q3,-0.994867,0.101192,-6.447061e-13,1.0,-2.204364e-15,1.0


#
---
#

In [457]:
data = df.copy()

#
---
#

In [458]:
df.drop(columns="Policy Start Date", inplace=True)

In [459]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls,Policy Start Date - Day,Policy Start Date - Month,Policy Start Date - Year,Policy Start Date - Quarter,Sin_Date,Cos_Date,Sin_Year,Cos_Year,Sin_Month,Cos_Month
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737,Saturday,186.0,1860.0,3.870062,8406.73897,429.376453,20098.0,744.0,4.0,45.197521,0,23,12,2023,2023 Q4,-0.975344,-0.220691,-6.447061e-13,1.0,-2.939152e-15,1.0
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,1,12,6,2023,2023 Q2,-0.998725,0.050489,-6.447061e-13,1.0,-1.469576e-15,1.0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435,Saturday,632.0,1896.0,2.641123,29816.21115,1085.083634,204816.0,5056.0,8.0,377.420394,1,30,9,2023,2023 Q3,-0.994867,0.101192,-6.447061e-13,1.0,-2.204364e-15,1.0


In [460]:
df[["Policy Start Date - Day", "Policy Start Date - Month", "Policy Start Date - Year"]] = df[["Policy Start Date - Day", "Policy Start Date - Month", "Policy Start Date - Year"]].astype("O")

In [461]:
def show_nulls(df):
    nulls = []
    nuniques = []
    uniques = []
    types = []
    
    for i in df.columns:
        nulls.append(df[i].isnull().sum())
        nuniques.append(df[i].nunique())
        uniques.append(df[i].unique())
        types.append(df[i].dtype)
    
    
    return pd.DataFrame(
        {
            "Column" : df.columns,
            "Data Type" : types,
            "Nulls" : nulls,
            "No. of Uniques" : nuniques,
            "Uniques" : uniques
        }
    ).sort_values(by="Nulls", ascending=False)

In [462]:
df["Health Conscious Level"] = df["Health Conscious Level"].astype("O")

In [463]:
show_nulls(df)

Unnamed: 0,Column,Data Type,Nulls,No. of Uniques,Uniques
18,Premium Amount,float64,800000,4794,"[2869.0, 1483.0, 567.0, 765.0, 2022.0, 3202.0,..."
1,Gender,object,0,2,"[Female, Male]"
0,Age,float64,0,47,"[19.0, 39.0, 23.0, 21.0, 29.0, 41.0, 48.0, 44...."
3,Marital Status,object,0,3,"[Married, Divorced, Single]"
4,Number of Dependents,float64,0,5,"[1.0, 3.0, 2.0, 0.0, 4.0]"
5,Education Level,object,0,4,"[Bachelor's, Master's, High School, PhD]"
6,Occupation,object,0,3,"[Self-Employed, Unemployed, Employed]"
7,Health Score,float64,0,933976,"[22.59876067181393, 15.569730989408043, 47.177..."
8,Location,object,0,3,"[Urban, Rural, Suburban]"
9,Policy Type,object,0,3,"[Premium, Comprehensive, Basic]"


#
---
#

In [464]:
import os

def do_magic(target_column, *columns: str):
    """Add per-group statistics of `target_column` to df and save each lookup table under do_magics/."""
    os.makedirs("do_magics", exist_ok=True)  # to_csv fails if the folder is missing

    # Suffix used in the new column name -> aggregation applied to each group
    aggregations = {
        "MIN": "min",
        "MEAN": "mean",
        "Q1": lambda x: x.quantile(0.25),
        "MEDIAN": "median",
        "Q3": lambda x: x.quantile(0.75),
        "STD": "std",
        "MAX": "max",
    }

    for i in columns:
        grouped = df.groupby(by=i)[target_column]
        for name, func in aggregations.items():
            # Broadcast the group statistic to every row of the group
            df[f"{i}_{name}_{target_column}"] = grouped.transform(func)
            # Save the per-group lookup table
            grouped.agg(func).to_csv(f"do_magics/{i}_{name}_{target_column}.csv")


In [465]:
do_magic("Premium Amount", "Number of Dependents", "Occupation", "Education Level", "Previous Claims", "Health Conscious Level", "Insurance Duration")
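
Because the target is missing for the test rows, the group statistics are computed from the labelled rows only and then broadcast to every row of the group, including the unlabelled ones. The toy example below sketches that behaviour with made-up values; it is an illustration, not output from this dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Occupation": ["Employed", "Employed", "Unemployed", "Employed", "Unemployed"],
    # The last two rows mimic test rows, where the target is unknown.
    "Premium Amount": [100.0, 300.0, 500.0, np.nan, np.nan],
})

grouped = toy.groupby("Occupation")["Premium Amount"]

# NaN targets are skipped when computing the statistic,
# and the result is assigned to every row of the group.
toy["Occupation_MEAN_Premium Amount"] = grouped.transform("mean")
toy["Occupation_Q1_Premium Amount"] = grouped.transform(lambda x: x.quantile(0.25))

print(toy)
```

Each labelled row's own target contributes to its group statistic, so these features carry some target information. That is worth keeping in mind when interpreting validation scores.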

In [466]:
df.isnull().sum()

Age                                                  0
Gender                                               0
Annual Income                                        0
Marital Status                                       0
Number of Dependents                                 0
Education Level                                      0
Occupation                                           0
Health Score                                         0
Location                                             0
Policy Type                                          0
Previous Claims                                      0
Vehicle Age                                          0
Credit Score                                         0
Insurance Duration                                   0
Customer Feedback                                    0
Smoking Status                                       0
Exercise Frequency                                   0
Property Type                                        0
Premium Am

#
---
#

In [467]:
def return_splits(ddf, feature_name, target_name):
    return [ddf[ddf[feature_name] == i][target_name] for i in ddf[feature_name].unique()]

def give_stats_analysis(df, target_column_name):
    ddf = df.copy()
    ddf = ddf.dropna()

    features = []
    tests = []
    stats = []
    pvals = []
    verdict = []
    count = 0

    target = ddf[target_column_name]
    for i in ddf.columns:
        features.append(i)
        feature = ddf[i]
        
        if (feature.dtype == "O" and (target.dtype == "float" or target.dtype == "int")) or (target.dtype == "O" and (feature.dtype == "float" or feature.dtype == "int")):
            stat, pval, *_ = kruskal(*return_splits(ddf, feature.name, target.name))
            tests.append("Kruskal-Wallis")
            stats.append(stat)
            pvals.append(pval)
            
        
        elif (feature.dtype == "float" or feature.dtype == "int") and (target.dtype == "float" or target.dtype == "int"):
            stat, pval, *_ = spearmanr(feature, target)
            tests.append("SpearmanR")
            stats.append(stat)
            pvals.append(pval)

        elif feature.dtype == "O" and target.dtype == "O":
            stat, pval, *_ = chi2_contingency(pd.crosstab(feature, target))
            tests.append("Chi-Square")
            stats.append(stat)
            pvals.append(pval)
        
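        else:
            # No test applies to this dtype combination; reset pval so the
            # verdict below does not reuse the previous feature's p-value.
            pval = np.nan
            tests.append(np.nan)
            stats.append(np.nan)
            pvals.append(np.nan)
        
        # The verdict uses 0.025, which is stricter than the 0.05 cut-off used later for feature selection.
        if pval <= 0.025:
            verdict.append("There is Relationship")
        else:
            verdict.append("There is NO Relationship")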

        print(f"{feature.name} ■■■ {target_column_name}".ljust(100, "-")+"✅")
    
    return pd.DataFrame({
        "Feature" : features,
        "Target" : [target_column_name]*ddf.shape[1],
        "Statistic Test" : tests,
        "Test Statistic" : stats,
        "P-Value" : pvals,
        "Verdict" : verdict
    }).sort_values(by="P-Value")

# H0: There is ***No Relationship*** between the two given columns
# H1: There is a ***Relationship*** between the two given columns
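
The three tests behind these hypotheses can be sketched on synthetic data. Everything below (the variables `x`, `y`, `group`, `label` and their distributions) is made up for illustration and says nothing about the insurance dataset itself:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal, spearmanr, chi2_contingency

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.5, size=n)        # numeric, strongly related to x
label = np.where(x > 0, "high", "low")           # category derived from x
group = rng.choice(["A", "B", "C"], size=n)      # category drawn independently

# Categorical vs. numeric: Kruskal-Wallis on the target split by category.
h_stat, p_kw = kruskal(*[y[label == v] for v in ("high", "low")])

# Numeric vs. numeric: Spearman rank correlation.
rho, p_sp = spearmanr(x, y)

# Categorical vs. categorical: Chi-Square test on the contingency table.
chi2, p_chi, dof, expected = chi2_contingency(pd.crosstab(group, label))

for name, p in [("Kruskal-Wallis", p_kw), ("SpearmanR", p_sp), ("Chi-Square", p_chi)]:
    print(f"{name}: p-value = {p:.4g}")
```

A small p-value leads to rejecting H0. With over a million rows, even very weak relationships can reach significance, so the size of the test statistic is worth reading alongside the p-value.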

### ***Health-related indicators***
- [x] Health Score
- [x] Smoking Status
- [x] Exercise Frequency
### ***Demographic information***
- [x] Age
- [x] Gender
- [x] Marital Status
- [x] Number of Dependents
- [x] Occupation
### ***Policy details***
- [x] Policy Type
- [x] Policy Start Date
- [x] Insurance Duration
### ***Financial factors***
- [x] Annual Income
- [x] Credit Score
### ***Premium calculation***
- [x] Premium Amount

In [468]:
stats_result = give_stats_analysis(df.iloc[:1200000, :], "Premium Amount")
stats_result

Age ■■■ Premium Amount------------------------------------------------------------------------------✅
Gender ■■■ Premium Amount---------------------------------------------------------------------------✅
Annual Income ■■■ Premium Amount--------------------------------------------------------------------✅
Marital Status ■■■ Premium Amount-------------------------------------------------------------------✅
Number of Dependents ■■■ Premium Amount-------------------------------------------------------------✅
Education Level ■■■ Premium Amount------------------------------------------------------------------✅
Occupation ■■■ Premium Amount-----------------------------------------------------------------------✅
Health Score ■■■ Premium Amount---------------------------------------------------------------------✅
Location ■■■ Premium Amount-------------------------------------------------------------------------✅
Policy Type ■■■ Premium Amount----------------------------------------------------

Unnamed: 0,Feature,Target,Statistic Test,Test Statistic,P-Value,Verdict
2,Annual Income,Premium Amount,SpearmanR,-0.061831,0.0,There is Relationship
12,Credit Score,Premium Amount,SpearmanR,-0.036687,0.0,There is Relationship
20,IsNull_Annual Income,Premium Amount,SpearmanR,-0.065399,0.0,There is Relationship
18,Premium Amount,Premium Amount,SpearmanR,1.0,0.0,There is Relationship
35,Growth,Premium Amount,SpearmanR,-0.055,0.0,There is Relationship
34,Money Handling Level1,Premium Amount,SpearmanR,-0.048668,0.0,There is Relationship
33,Money Handling Level,Premium Amount,SpearmanR,-0.072097,0.0,There is Relationship
32,Money Per Head,Premium Amount,SpearmanR,-0.053422,0.0,There is Relationship
44,Feedback1,Premium Amount,SpearmanR,-0.053714,0.0,There is Relationship
39,Credit by Score,Premium Amount,SpearmanR,-0.05485,0.0,There is Relationship


# <ins>Key Premium Factors as per Research Papers and as per the Dataset</ins>
### `Struck-through features are supported by both the research and the dataset. Features without a strike-through should influence the premium amount according to research, but the dataset shows no significant relationship for them. We need to find out why these features behave this way.`

- ### ~~Age~~
- ### Gender
- ### ~~Health Score~~
- ### Smoking Status
- ### Exercise Frequency
- ### ~~Occupation~~
- ### Policy Type
- ### ~~Previous Claims~~
- ### ~~Annual Income~~
- ### Insurance Duration
- ### ~~Credit Score~~

#
---
#

In [469]:
cols = ["Gender", "Smoking Status", "Exercise Frequency", "Policy Type", "Insurance Duration"]

In [470]:
# fig, axs = plt.subplots(2, 3, figsize=(20, 8))
# for col, ax in zip(cols, axs.flatten()):
#     sns.boxplot(y=df["Premium Amount"], x=df[col], color="mediumblue", ax=ax)

In [471]:
useless_columns = stats_result[stats_result["P-Value"] >= 0.05]["Feature"]
useless_columns

76        Education Level_MEDIAN_Premium Amount
74          Education Level_MEAN_Premium Amount
77            Education Level_Q3_Premium Amount
8                                      Location
65      Number of Dependents_MAX_Premium Amount
5                               Education Level
4                          Number of Dependents
54                                     Cos_Date
11                                  Vehicle Age
92    Health Conscious Level_STD_Premium Amount
19                                   IsNull_Age
9                                   Policy Type
79           Education Level_MAX_Premium Amount
17                                Property Type
3                                Marital Status
16                           Exercise Frequency
15                               Smoking Status
38                                     Day_Name
26                           IsNull_Vehicle Age
78           Education Level_STD_Premium Amount
28                    IsNull_Insurance D

In [472]:
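# .copy() makes an independent frame, so the in-place drops below do not act on a slice of df
meaningless_df = df[useless_columns].copy()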
meaningless_df.head(3)

Unnamed: 0,Education Level_MEDIAN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q3_Premium Amount,Location,Number of Dependents_MAX_Premium Amount,Education Level,Number of Dependents,Cos_Date,Vehicle Age,Health Conscious Level_STD_Premium Amount,IsNull_Age,Policy Type,Education Level_MAX_Premium Amount,Property Type,Marital Status,Exercise Frequency,Smoking Status,Day_Name,IsNull_Vehicle Age,Education Level_STD_Premium Amount,IsNull_Insurance Duration,Policy Start Date - Day,Gender,Sin_Date,Insurance Duration,Education Level_Q1_Premium Amount
0,873.0,1102.698438,1509.0,Urban,4994.0,Bachelor's,1.0,-0.220691,17.0,864.569091,0,Premium,4988.0,House,Married,Weekly,No,Saturday,0,864.866296,0,23,Female,-0.975344,5.0,514.0
1,871.0,1102.113989,1512.0,Rural,4997.0,Master's,3.0,0.050489,12.0,865.103831,0,Comprehensive,4997.0,House,Divorced,Monthly,Yes,Monday,0,866.235322,0,12,Female,-0.998725,2.0,513.0
2,876.0,1104.78749,1513.0,Suburban,4997.0,High School,3.0,0.101192,14.0,864.569091,0,Premium,4999.0,House,Divorced,Weekly,Yes,Saturday,0,865.951488,0,30,Male,-0.994867,3.0,514.0


In [473]:
# df = df[stats_result[stats_result["P-Value"] < 0.05]["Feature"]]
# df.head(3)
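
A minimal sketch of the p-value based selection, assuming a result frame shaped like `stats_result`. The p-values and the `Some Date Column` entry are hypothetical. The point is that a feature with a NaN p-value satisfies neither `< 0.05` nor `>= 0.05`, so it has to be handled explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of a stats_result-like frame.
toy_result = pd.DataFrame({
    "Feature": ["Annual Income", "Location", "Credit Score", "Some Date Column"],
    "P-Value": [0.0, 0.41, 0.003, np.nan],
})

alpha = 0.05
significant = toy_result.loc[toy_result["P-Value"] < alpha, "Feature"].tolist()
not_significant = toy_result.loc[toy_result["P-Value"] >= alpha, "Feature"].tolist()
untested = toy_result.loc[toy_result["P-Value"].isna(), "Feature"].tolist()

print("keep:", significant)
print("drop:", not_significant)
print("no test ran:", untested)
```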

# Compressing the information in `meaningless_df` into components using PCA

In [474]:
meaningless_df.head(3)

Unnamed: 0,Education Level_MEDIAN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q3_Premium Amount,Location,Number of Dependents_MAX_Premium Amount,Education Level,Number of Dependents,Cos_Date,Vehicle Age,Health Conscious Level_STD_Premium Amount,IsNull_Age,Policy Type,Education Level_MAX_Premium Amount,Property Type,Marital Status,Exercise Frequency,Smoking Status,Day_Name,IsNull_Vehicle Age,Education Level_STD_Premium Amount,IsNull_Insurance Duration,Policy Start Date - Day,Gender,Sin_Date,Insurance Duration,Education Level_Q1_Premium Amount
0,873.0,1102.698438,1509.0,Urban,4994.0,Bachelor's,1.0,-0.220691,17.0,864.569091,0,Premium,4988.0,House,Married,Weekly,No,Saturday,0,864.866296,0,23,Female,-0.975344,5.0,514.0
1,871.0,1102.113989,1512.0,Rural,4997.0,Master's,3.0,0.050489,12.0,865.103831,0,Comprehensive,4997.0,House,Divorced,Monthly,Yes,Monday,0,866.235322,0,12,Female,-0.998725,2.0,513.0
2,876.0,1104.78749,1513.0,Suburban,4997.0,High School,3.0,0.101192,14.0,864.569091,0,Premium,4999.0,House,Divorced,Weekly,Yes,Saturday,0,865.951488,0,30,Male,-0.994867,3.0,514.0


## Encoding Columns

In [475]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

###
---
###

In [476]:
meaningless_df["Location"].unique()

array(['Urban', 'Rural', 'Suburban'], dtype=object)

In [477]:
a = OrdinalEncoder(categories=[['Rural', 'Suburban', 'Urban']])

b = pd.DataFrame({"ENCODED_Location" : a.fit_transform(meaningless_df[["Location"]]).flatten()})

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Location", inplace=True)
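
The `pd.concat(..., axis=1)` pattern above aligns on the index, so it only works while the frame keeps its default 0..n-1 index. A hedged alternative sketch on a toy frame with a deliberately non-default index (the frame and its values are made up) assigns the encoded values directly as a new column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with a non-default index to show that alignment is not an issue here.
toy = pd.DataFrame(
    {"Location": ["Urban", "Rural", "Suburban", "Urban"]},
    index=[10, 11, 12, 13],
)

# Explicit category order: Rural -> 0, Suburban -> 1, Urban -> 2.
encoder = OrdinalEncoder(categories=[["Rural", "Suburban", "Urban"]])

# fit_transform returns an (n, 1) array; ravel() flattens it for column assignment.
toy["ENCODED_Location"] = encoder.fit_transform(toy[["Location"]]).ravel()
toy = toy.drop(columns="Location")

print(toy)
```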

###
---
###

In [478]:
meaningless_df["Education Level"].unique()

array(["Bachelor's", "Master's", 'High School', 'PhD'], dtype=object)

In [479]:
a = OrdinalEncoder(categories=[['High School', "Bachelor's", "Master's", 'PhD']])

b = pd.DataFrame({"ENCODED_Education Level" : a.fit_transform(meaningless_df[["Education Level"]]).flatten()})

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Education Level", inplace=True)

###
---
###

In [480]:
meaningless_df["Policy Type"].unique()

array(['Premium', 'Comprehensive', 'Basic'], dtype=object)

In [481]:
a = OrdinalEncoder(categories=[['Basic', 'Comprehensive', 'Premium']])

b = pd.DataFrame({"ENCODED_Policy Type" : a.fit_transform(meaningless_df[["Policy Type"]]).flatten()})

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Policy Type", inplace=True)

###
---
###

In [482]:
a = OneHotEncoder(drop="first", sparse_output=False)

b = pd.DataFrame(
        a.fit_transform(meaningless_df[["Property Type"]]),
        columns=a.get_feature_names_out()
    )

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Property Type", inplace=True)

###
---
###

In [483]:
meaningless_df["Exercise Frequency"].unique()

array(['Weekly', 'Monthly', 'Daily', 'Rarely'], dtype=object)

In [484]:
a = OrdinalEncoder(categories=[['Rarely', 'Monthly', 'Weekly', 'Daily']])

b = pd.DataFrame({"ENCODED_Exercise Frequency" : a.fit_transform(meaningless_df[["Exercise Frequency"]]).flatten()})

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Exercise Frequency", inplace=True)

###
---
###

In [485]:
a = OneHotEncoder(drop="first", sparse_output=False)

b = pd.DataFrame(
        a.fit_transform(meaningless_df[["Smoking Status"]]),
        columns=a.get_feature_names_out()
    )

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Smoking Status", inplace=True)

###
---
###

In [486]:
a = OneHotEncoder(drop="first", sparse_output=False)

b = pd.DataFrame(
        a.fit_transform(meaningless_df[["Gender"]]),
        columns=a.get_feature_names_out()
    )

meaningless_df = pd.concat([meaningless_df, b], axis=1)
meaningless_df.drop(columns="Gender", inplace=True)

###
---
###

In [487]:
meaningless_df["Policy Start Date - Day"] = meaningless_df["Policy Start Date - Day"].astype(int)

#
---
#

In [488]:
meaningless_df.head(3)

Unnamed: 0,Education Level_MEDIAN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q3_Premium Amount,Number of Dependents_MAX_Premium Amount,Number of Dependents,Cos_Date,Vehicle Age,Health Conscious Level_STD_Premium Amount,IsNull_Age,Education Level_MAX_Premium Amount,Marital Status,Day_Name,IsNull_Vehicle Age,Education Level_STD_Premium Amount,IsNull_Insurance Duration,Policy Start Date - Day,Sin_Date,Insurance Duration,Education Level_Q1_Premium Amount,ENCODED_Location,ENCODED_Education Level,ENCODED_Policy Type,Property Type_Condo,Property Type_House,ENCODED_Exercise Frequency,Smoking Status_Yes,Gender_Male
0,873.0,1102.698438,1509.0,4994.0,1.0,-0.220691,17.0,864.569091,0,4988.0,Married,Saturday,0,864.866296,0,23,-0.975344,5.0,514.0,2.0,1.0,2.0,0.0,1.0,2.0,0.0,0.0
1,871.0,1102.113989,1512.0,4997.0,3.0,0.050489,12.0,865.103831,0,4997.0,Divorced,Monday,0,866.235322,0,12,-0.998725,2.0,513.0,0.0,2.0,1.0,0.0,1.0,1.0,1.0,0.0
2,876.0,1104.78749,1513.0,4997.0,3.0,0.101192,14.0,864.569091,0,4999.0,Divorced,Saturday,0,865.951488,0,30,-0.994867,3.0,514.0,1.0,0.0,2.0,0.0,1.0,2.0,1.0,1.0


In [489]:
meaningless_df.dtypes

Education Level_MEDIAN_Premium Amount        float64
Education Level_MEAN_Premium Amount          float64
Education Level_Q3_Premium Amount            float64
Number of Dependents_MAX_Premium Amount      float64
Number of Dependents                         float64
Cos_Date                                     float64
Vehicle Age                                  float64
Health Conscious Level_STD_Premium Amount    float64
IsNull_Age                                     int64
Education Level_MAX_Premium Amount           float64
Marital Status                                object
Day_Name                                      object
IsNull_Vehicle Age                             int64
Education Level_STD_Premium Amount           float64
IsNull_Insurance Duration                      int64
Policy Start Date - Day                        int64
Sin_Date                                     float64
Insurance Duration                           float64
Education Level_Q1_Premium Amount            f

###
---
###

# Doing PCA on this `meaningless_df`

In [490]:
# from sklearn.decomposition import PCA

In [491]:
# pca = PCA(n_components=3)
# pca_df = pd.DataFrame(pca.fit_transform(meaningless_df), columns=['PC1_Meaningless_df', "PC2_Meaningless_df", "PC3_Meaningless_df"])
# pca_df

In [492]:
# pca.explained_variance_ratio_
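
If the planned PCA step is picked up later, note that `meaningless_df` still holds object columns at this point (for example Marital Status and Day_Name), which PCA cannot consume. The sketch below shows one possible way to run it. The toy data, the use of `StandardScaler`, and the choice of two components are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Vehicle Age": rng.integers(0, 20, size=200).astype(float),
    "Insurance Duration": rng.integers(1, 10, size=200).astype(float),
    "Cos_Date": rng.uniform(-1, 1, size=200),
    "Day_Name": rng.choice(["Monday", "Saturday"], size=200),  # non-numeric
})

# PCA needs numeric input, so keep numeric columns only.
numeric = toy.select_dtypes(include="number")

# Standardize first; otherwise large-scale columns dominate the components.
scaled = StandardScaler().fit_transform(numeric)

pca = PCA(n_components=2)
pca_df = pd.DataFrame(pca.fit_transform(scaled), columns=["PC1", "PC2"], index=toy.index)

print(pca.explained_variance_ratio_)
```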

###
---
###

# Combining 2 PCs of `meaningless_df` with `df`

In [493]:
# df = pd.concat([df, pca_df.iloc[:, :2]], axis=1)

In [494]:
df.head()

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls,Policy Start Date - Day,Policy Start Date - Month,Policy Start Date - Year,Policy Start Date - Quarter,Sin_Date,Cos_Date,Sin_Year,Cos_Year,Sin_Month,Cos_Month,Number of Dependents_MIN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Number of Dependents_MAX_Premium Amount,Occupation_MIN_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_Q1_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q3_Premium Amount,Occupation_STD_Premium Amount,Occupation_MAX_Premium Amount,Education Level_MIN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q1_Premium Amount,Education Level_MEDIAN_Premium Amount,Education Level_Q3_Premium Amount,Education Level_STD_Premium Amount,Education Level_MAX_Premium Amount,Previous Claims_MIN_Premium Amount,Previous Claims_MEAN_Premium Amount,Previous Claims_Q1_Premium Amount,Previous Claims_MEDIAN_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_STD_Premium Amount,Previous Claims_MAX_Premium Amount,Health Conscious Level_MIN_Premium Amount,Health Conscious 
Level_MEAN_Premium Amount,Health Conscious Level_Q1_Premium Amount,Health Conscious Level_MEDIAN_Premium Amount,Health Conscious Level_Q3_Premium Amount,Health Conscious Level_STD_Premium Amount,Health Conscious Level_MAX_Premium Amount,Insurance Duration_MIN_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q3_Premium Amount,Insurance Duration_STD_Premium Amount,Insurance Duration_MAX_Premium Amount
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737,Saturday,186.0,1860.0,3.870062,8406.73897,429.376453,20098.0,744.0,4.0,45.197521,0,23,12,2023,2023 Q4,-0.975344,-0.220691,-6.447061e-13,1.0,-2.939152e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1151.583106,526.0,907.0,1606.0,898.40295,4988.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1100.812035,515.0,872.0,1508.0,859.965806,4996.0
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,1,12,6,2023,2023 Q2,-0.998725,0.050489,-6.447061e-13,1.0,-1.469576e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1103.361209,514.0,872.0,1508.0,867.02349,4997.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1098.15965,511.0,862.0,1497.0,865.103831,4999.0,20.0,1106.883166,518.0,878.0,1514.0,863.675409,4997.0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435,Saturday,632.0,1896.0,2.641123,29816.21115,1085.083634,204816.0,5056.0,8.0,377.420394,1,30,9,2023,2023 Q3,-0.994867,0.101192,-6.447061e-13,1.0,-2.204364e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1104.78749,514.0,876.0,1513.0,865.951488,4999.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1101.733099,514.0,872.0,1502.0,865.787949,4997.0
3,21.0,Male,141855.0,Married,2.0,Bachelor's,Self-Employed,10.938144,Rural,Basic,1.0,0.0,367.0,1.0,Poor,Yes,Daily,Apartment,765.0,0,0,0,0,1,0,0,0,0,0,0,3,7350.432875,70927.5,52060785.0,386.525886,283710.0,70927.5,6755.0,Wednesday,367.0,367.0,4.453093,4014.298906,229.701027,283710.0,734.0,2.0,21.876288,1,12,6,2024,2024 Q2,0.111402,0.993775,1.585375e-14,1.0,-1.469576e-15,1.0,20.0,1108.443461,516.0,876.0,1520.0,866.852628,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1099.009424,510.0,867.0,1504.0,865.547081,4997.0,20.0,1097.042977,507.0,861.0,1504.0,865.431191,4988.0
4,21.0,Male,39651.0,Single,1.0,Bachelor's,Self-Employed,20.376094,Rural,Premium,0.0,8.0,598.0,4.0,Poor,Yes,Weekly,House,2022.0,0,0,0,0,0,0,0,0,0,0,0,3,6846.367459,39651.0,23711298.0,66.30602,79302.0,19825.5,1888.142857,Wednesday,598.0,2392.0,3.981195,12184.903989,427.897966,79302.0,1196.0,0.0,40.752187,0,1,12,2021,2021 Q4,-0.996246,0.086565,-1.468363e-13,1.0,-2.939152e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1082.452746,508.0,855.0,1472.0,851.671355,4999.0,20.0,1099.009424,510.0,867.0,1504.0,865.547081,4997.0,20.0,1104.723079,515.0,872.0,1509.0,866.377508,4991.0


In [495]:
df.shape

(2000000, 101)

###
---
###

# Encoding of columns in `df`

In [496]:
show_nulls(df)

Unnamed: 0,Column,Data Type,Nulls,No. of Uniques,Uniques
18,Premium Amount,float64,800000,4794,"[2869.0, 1483.0, 567.0, 765.0, 2022.0, 3202.0,..."
85,Previous Claims_STD_Premium Amount,float64,9,9,"[898.4029501785653, 853.1562175615868, 851.671..."
1,Gender,object,0,2,"[Female, Male]"
3,Marital Status,object,0,3,"[Married, Divorced, Single]"
2,Annual Income,float64,0,97970,"[10049.0, 31678.0, 25602.0, 141855.0, 39651.0,..."
4,Number of Dependents,float64,0,5,"[1.0, 3.0, 2.0, 0.0, 4.0]"
5,Education Level,object,0,4,"[Bachelor's, Master's, High School, PhD]"
7,Health Score,float64,0,933976,"[22.59876067181393, 15.569730989408043, 47.177..."
6,Occupation,object,0,3,"[Self-Employed, Unemployed, Employed]"
9,Policy Type,object,0,3,"[Premium, Comprehensive, Basic]"


# Save the pickle file

In [497]:
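import gzip
import os
import pickle

def save_pickle(foldername, filename, model):
    """Serialize `model` to <foldername>/<filename>.pkl.gz as a gzip-compressed pickle."""
    # Create the target folder if it is missing, so gzip.open does not fail.
    os.makedirs(foldername, exist_ok=True)
    with gzip.open(f"{foldername}/{filename}.pkl.gz", 'wb') as f:
        pickle.dump(model, f)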
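
The encoders saved with `save_pickle` are only useful if they can be read back later, for example when new data has to be encoded the same way. A minimal sketch of a matching loader follows. `load_pickle` is a hypothetical name that does not appear in the notebook, and the folder is a temporary directory used only for this demonstration.

```python
import gzip
import pickle
import tempfile

def save_pickle(foldername, filename, model):
    # Same behaviour as the notebook helper: gzip-compressed pickle.
    with gzip.open(f"{foldername}/{filename}.pkl.gz", "wb") as f:
        pickle.dump(model, f)

def load_pickle(foldername, filename):
    # Hypothetical counterpart: read back an object written by save_pickle.
    with gzip.open(f"{foldername}/{filename}.pkl.gz", "rb") as f:
        return pickle.load(f)

# Round trip through a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    original = {"categories": ["Poor", "Average", "Good"]}
    save_pickle(tmp, "demo_encoder", original)
    restored = load_pickle(tmp, "demo_encoder")

print(restored == original)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.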


### Policy Start Date - Year

In [498]:
df["Policy Start Date - Year"].unique()

array([2023, 2024, 2021, 2022, 2020, 2019], dtype=object)

In [499]:
# Ordinal-encode the year so that earlier years get smaller codes (2019 -> 0, ..., 2024 -> 5).
a = OrdinalEncoder(categories=[[2019, 2020, 2021, 2022, 2023, 2024]])

b = pd.DataFrame({"ENCODED_Policy Start Date - Year" : a.fit_transform(df[["Policy Start Date - Year"]]).flatten()})

# Persist the fitted encoder so the same mapping can be reused later.
save_pickle(foldername="do_encodings", filename="ENCODED_Policy Start Date - Year", model=a)

# Append the encoded column and drop the original one.
df = pd.concat([df, b], axis=1)
df.drop(columns="Policy Start Date - Year", inplace=True)

### Policy Start Date - Quarter

In [500]:
sorted(list(df["Policy Start Date - Quarter"].unique()))

['2019 Q3',
 '2019 Q4',
 '2020 Q1',
 '2020 Q2',
 '2020 Q3',
 '2020 Q4',
 '2021 Q1',
 '2021 Q2',
 '2021 Q3',
 '2021 Q4',
 '2022 Q1',
 '2022 Q2',
 '2022 Q3',
 '2022 Q4',
 '2023 Q1',
 '2023 Q2',
 '2023 Q3',
 '2023 Q4',
 '2024 Q1',
 '2024 Q2',
 '2024 Q3']

In [501]:
a = OrdinalEncoder(categories=[['2019 Q3', '2019 Q4', '2020 Q1', '2020 Q2', '2020 Q3', '2020 Q4', '2021 Q1', '2021 Q2', '2021 Q3',
                 '2021 Q4', '2022 Q1', '2022 Q2', '2022 Q3', '2022 Q4', '2023 Q1', '2023 Q2', '2023 Q3', '2023 Q4', '2024 Q1', '2024 Q2', '2024 Q3']])

b = pd.DataFrame({"ENCODED_Policy Start Date - Quarter" : a.fit_transform(df[["Policy Start Date - Quarter"]]).flatten()})

save_pickle(foldername="do_encodings", filename="ENCODED_Policy Start Date - Quarter", model=a)

df = pd.concat([df, b], axis=1)
df.drop(columns="Policy Start Date - Quarter", inplace=True)
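
The hand-typed quarter list can also be generated, which avoids typos if the date range changes. The sketch below is an illustration in plain Python. `quarter_labels` is a hypothetical helper, and the start and end quarters are taken from the sorted output shown earlier. The position of a label in the list is the code that an `OrdinalEncoder` with explicit `categories` assigns to it.

```python
def quarter_labels(start, end):
    """Return 'YYYY Qn' labels from start to end, both given as (year, quarter)."""
    year, quarter = start
    labels = []
    while (year, quarter) <= end:
        labels.append(f"{year} Q{quarter}")
        # Advance one quarter, rolling over into the next year after Q4.
        year, quarter = (year, quarter + 1) if quarter < 4 else (year + 1, 1)
    return labels

quarters = quarter_labels((2019, 3), (2024, 3))
print(len(quarters))                 # 21
print(quarters[0], quarters[-1])     # 2019 Q3 2024 Q3
print(quarters.index("2023 Q4"))     # 17
```

The generated list could then be passed as `OrdinalEncoder(categories=[quarters])`. The index 17 agrees with the encoded value displayed for a `2023 Q4` row in the dataframe output.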

### Policy Start Date - Month

In [502]:
df["Policy Start Date - Month"] = df["Policy Start Date - Month"].astype(int)

### Customer Feedback

In [503]:
df["Customer Feedback"].unique()

array(['Poor', 'Average', 'Good'], dtype=object)

In [504]:
a = OrdinalEncoder(categories=[['Poor', 'Average', 'Good']])

b = pd.DataFrame({"ENCODED_Customer Feedback" : a.fit_transform(df[["Customer Feedback"]]).flatten()})

save_pickle(foldername="do_encodings", filename="ENCODED_Customer Feedback", model=a)

df = pd.concat([df, b], axis=1)
df.drop(columns="Customer Feedback", inplace=True)

### Occupation

In [505]:
a = OneHotEncoder(drop="first", sparse_output=False)

b = pd.DataFrame(
        a.fit_transform(df[["Occupation"]]),
        columns="ENCODED_" + a.get_feature_names_out()
    )

save_pickle(foldername="do_encodings", filename="ENCODED_Occupation", model=a)

df = pd.concat([df, b], axis=1)
df.drop(columns="Occupation", inplace=True)

### Marital Status

In [506]:
a = OneHotEncoder(drop="first", sparse_output=False)

b = pd.DataFrame(
        a.fit_transform(df[["Marital Status"]]),
        columns="ENCODED_" + a.get_feature_names_out()
    )

save_pickle(foldername="do_encodings", filename="ENCODED_Marital Status", model=a)


df = pd.concat([df, b], axis=1)
df.drop(columns="Marital Status", inplace=True)
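
With `drop="first"`, the encoder omits the first category of each feature in sorted order (`Employed` for Occupation, `Divorced` for Marital Status), so a row of all zeros stands for that category. The sketch below reproduces the resulting column layout with `pandas.get_dummies` instead of `OneHotEncoder`. This is a swapped-in illustration rather than the notebook's method, and the three-row frame is made up for the example.

```python
import pandas as pd

# Toy data with the three Occupation levels seen in the notebook.
toy = pd.DataFrame({"Occupation": ["Self-Employed", "Unemployed", "Employed"]})

# drop_first=True removes the first (sorted) level, mirroring drop="first".
encoded = pd.get_dummies(
    toy["Occupation"], prefix="ENCODED_Occupation", drop_first=True, dtype=float
)

print(list(encoded.columns))
print(encoded.loc[2].tolist())  # the 'Employed' row is all zeros
```

Unlike a fitted `OneHotEncoder`, `get_dummies` keeps no state, so it cannot guarantee the same columns on new data. That is one reason to fit and save the encoder as the notebook does.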

In [507]:
df

Unnamed: 0,Age,Gender,Annual Income,Number of Dependents,Education Level,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls,Policy Start Date - Day,Policy Start Date - Month,Sin_Date,Cos_Date,Sin_Year,Cos_Year,Sin_Month,Cos_Month,Number of Dependents_MIN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Number of Dependents_MAX_Premium Amount,Occupation_MIN_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_Q1_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q3_Premium Amount,Occupation_STD_Premium Amount,Occupation_MAX_Premium Amount,Education Level_MIN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q1_Premium Amount,Education Level_MEDIAN_Premium Amount,Education Level_Q3_Premium Amount,Education Level_STD_Premium Amount,Education Level_MAX_Premium Amount,Previous Claims_MIN_Premium Amount,Previous Claims_MEAN_Premium Amount,Previous Claims_Q1_Premium Amount,Previous Claims_MEDIAN_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_STD_Premium Amount,Previous Claims_MAX_Premium Amount,Health Conscious Level_MIN_Premium Amount,Health Conscious Level_MEAN_Premium Amount,Health Conscious Level_Q1_Premium Amount,Health Conscious 
Level_MEDIAN_Premium Amount,Health Conscious Level_Q3_Premium Amount,Health Conscious Level_STD_Premium Amount,Health Conscious Level_MAX_Premium Amount,Insurance Duration_MIN_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q3_Premium Amount,Insurance Duration_STD_Premium Amount,Insurance Duration_MAX_Premium Amount,ENCODED_Policy Start Date - Year,ENCODED_Policy Start Date - Quarter,ENCODED_Customer Feedback,ENCODED_Occupation_Self-Employed,ENCODED_Occupation_Unemployed,ENCODED_Marital Status_Married,ENCODED_Marital Status_Single
0,19.0,Female,10049.0,1.0,Bachelor's,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.000000,3738228.0,27.013441,20098.0,5024.500000,528.894737,Saturday,186.0,1860.0,3.870062,8406.738970,429.376453,20098.0,744.0,4.0,45.197521,0,23,12,-0.975344,-0.220691,-6.447061e-13,1.0,-2.939152e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1151.583106,526.0,907.0,1606.0,898.402950,4988.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1100.812035,515.0,872.0,1508.0,859.965806,4996.0,4.0,17.0,0.0,1.0,0.0,1.0,0.0
1,39.0,Female,31678.0,3.0,Master's,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.256410,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,1,12,6,-0.998725,0.050489,-6.447061e-13,1.0,-1.469576e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1103.361209,514.0,872.0,1508.0,867.023490,4997.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1098.159650,511.0,862.0,1497.0,865.103831,4999.0,20.0,1106.883166,518.0,878.0,1514.0,863.675409,4997.0,4.0,15.0,1.0,0.0,1.0,0.0,0.0
2,23.0,Male,25602.0,3.0,High School,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.000000,16180464.0,40.509494,25602.0,25602.000000,1113.130435,Saturday,632.0,1896.0,2.641123,29816.211150,1085.083634,204816.0,5056.0,8.0,377.420394,1,30,9,-0.994867,0.101192,-6.447061e-13,1.0,-2.204364e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1104.787490,514.0,876.0,1513.0,865.951488,4999.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1101.733099,514.0,872.0,1502.0,865.787949,4997.0,4.0,16.0,2.0,1.0,0.0,0.0,0.0
3,21.0,Male,141855.0,2.0,Bachelor's,10.938144,Rural,Basic,1.0,0.0,367.0,1.0,Yes,Daily,Apartment,765.0,0,0,0,0,1,0,0,0,0,0,0,3,7350.432875,70927.500000,52060785.0,386.525886,283710.0,70927.500000,6755.000000,Wednesday,367.0,367.0,4.453093,4014.298906,229.701027,283710.0,734.0,2.0,21.876288,1,12,6,0.111402,0.993775,1.585375e-14,1.0,-1.469576e-15,1.0,20.0,1108.443461,516.0,876.0,1520.0,866.852628,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1099.009424,510.0,867.0,1504.0,865.547081,4997.0,20.0,1097.042977,507.0,861.0,1504.0,865.431191,4988.0,5.0,19.0,0.0,1.0,0.0,1.0,0.0
4,21.0,Male,39651.0,1.0,Bachelor's,20.376094,Rural,Premium,0.0,8.0,598.0,4.0,Yes,Weekly,House,2022.0,0,0,0,0,0,0,0,0,0,0,0,3,6846.367459,39651.000000,23711298.0,66.306020,79302.0,19825.500000,1888.142857,Wednesday,598.0,2392.0,3.981195,12184.903989,427.897966,79302.0,1196.0,0.0,40.752187,0,1,12,-0.996246,0.086565,-1.468363e-13,1.0,-2.939152e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1082.452746,508.0,855.0,1472.0,851.671355,4999.0,20.0,1099.009424,510.0,867.0,1504.0,865.547081,4997.0,20.0,1104.723079,515.0,872.0,1509.0,866.377508,4991.0,2.0,9.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1999995,50.0,Female,38782.0,1.0,Bachelor's,14.498639,Rural,Premium,0.0,8.0,309.0,2.0,Yes,Daily,Condo,,0,0,0,0,1,0,1,0,0,0,0,4,23197.822227,38782.000000,11983638.0,125.508091,77564.0,19391.000000,775.640000,Friday,309.0,618.0,4.275068,4480.079418,724.931945,155128.0,1236.0,0.0,57.994556,2,9,7,0.645845,-0.763468,-1.468363e-13,1.0,-1.714506e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1082.452746,508.0,855.0,1472.0,851.671355,4999.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1106.883166,518.0,878.0,1514.0,863.675409,4997.0,2.0,8.0,1.0,1.0,0.0,1.0,0.0
1999996,35.0,Female,73462.0,0.0,Master's,8.145748,Rural,Basic,2.0,0.0,462.0,2.0,No,Daily,Apartment,,1,0,0,0,1,0,0,0,1,0,0,5,18246.475706,73462.000000,33939444.0,159.008658,220386.0,24487.333333,2098.914286,Tuesday,231.0,924.0,4.592713,3763.335614,285.101183,587696.0,3696.0,16.0,65.165985,3,28,3,0.681828,0.731513,-6.447061e-13,1.0,-7.347881e-16,1.0,20.0,1097.649985,513.0,867.0,1498.0,862.779627,4999.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1151.583106,526.0,907.0,1606.0,898.402950,4988.0,20.0,1102.591279,514.0,872.0,1509.0,864.171931,4994.0,20.0,1106.883166,518.0,878.0,1514.0,863.675409,4997.0,4.0,14.0,2.0,1.0,0.0,0.0,1.0
1999997,26.0,Female,35178.0,0.0,Master's,6.636583,Urban,Comprehensive,1.0,10.0,698.0,6.0,No,Monthly,Apartment,,0,0,0,0,0,0,1,0,1,0,0,2,2760.818699,35178.000000,24554244.0,50.398281,105534.0,11726.000000,1353.000000,Monday,698.0,4188.0,4.668171,4632.335221,172.551169,70356.0,1396.0,2.0,13.273167,2,30,9,-0.709843,0.704360,3.510335e-13,1.0,-2.204364e-15,1.0,20.0,1097.649985,513.0,867.0,1498.0,862.779627,4999.0,20.0,1105.160880,515.0,876.0,1514.0,862.882837,4994.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1098.159650,511.0,862.0,1497.0,865.103831,4999.0,20.0,1104.558441,515.0,873.0,1513.0,866.550287,4999.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1999998,34.0,Female,45661.0,3.0,Master's,15.937248,Urban,Premium,2.0,17.0,467.0,7.0,No,Weekly,Condo,,0,0,0,0,1,0,0,0,0,0,0,4,17339.725601,15220.333333,21323687.0,97.775161,136983.0,15220.333333,1342.970588,Monday,233.5,3269.0,4.203138,7442.694720,541.866425,182644.0,1868.0,8.0,63.748991,1,9,5,0.561062,-0.827774,-1.305266e-12,1.0,-1.224647e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1151.583106,526.0,907.0,1606.0,898.402950,4988.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1105.964863,515.0,877.0,1513.0,866.225414,4988.0,3.0,11.0,1.0,1.0,0.0,0.0,1.0


#
---
#

In [508]:
# Imported here so the cell is self-contained.
from scipy.stats import chi2_contingency, kruskal, spearmanr

def return_splits(ddf, feature_name, target_name):
    # One Series of `target_name` values per distinct level of `feature_name`.
    return [ddf[ddf[feature_name] == i][target_name] for i in ddf[feature_name].unique()]

def give_stats_analysis(df, target_column_name):
    # Keep complete rows only; this also drops rows that have no target value.
    ddf = df.dropna()

    features = []
    tests = []
    stats = []
    pvals = []
    verdict = []

    target = ddf[target_column_name]
    target_is_numeric = pd.api.types.is_numeric_dtype(target)

    for i in ddf.columns:
        features.append(i)
        feature = ddf[i]
        feature_is_numeric = pd.api.types.is_numeric_dtype(feature)

        # Reset for every feature so an untested column cannot reuse the previous p-value.
        test, stat, pval = np.nan, np.nan, np.nan

        if feature.dtype == "O" and target_is_numeric:
            # Categorical feature, numeric target: compare the target across feature levels.
            stat, pval, *_ = kruskal(*return_splits(ddf, feature.name, target.name))
            test = "Kruskal-Wallis"

        elif feature_is_numeric and target.dtype == "O":
            # Numeric feature, categorical target: compare the feature across target levels.
            stat, pval, *_ = kruskal(*return_splits(ddf, target.name, feature.name))
            test = "Kruskal-Wallis"

        elif feature_is_numeric and target_is_numeric:
            stat, pval, *_ = spearmanr(feature, target)
            test = "SpearmanR"

        elif feature.dtype == "O" and target.dtype == "O":
            stat, pval, *_ = chi2_contingency(pd.crosstab(feature, target))
            test = "Chi-Square"

        tests.append(test)
        stats.append(stat)
        pvals.append(pval)

        # A NaN p-value (no applicable test, or a constant column) gets no verdict.
        if pd.isna(pval):
            verdict.append("No verdict (NaN p-value)")
        elif pval <= 0.05:
            verdict.append("There is Relationship")
        else:
            verdict.append("There is NO Relationship")

        print(f"{feature.name} ■■■ {target_column_name}".ljust(50, "-")+"✅")

    return pd.DataFrame({
        "Feature" : features,
        "Target" : [target_column_name]*ddf.shape[1],
        "Statistic Test" : tests,
        "Test Statistic" : stats,
        "P-Value" : pvals,
        "Verdict" : verdict
    }).sort_values(by="P-Value")

In [509]:
stats_df = give_stats_analysis(df, "Premium Amount")
stats_df

Age ■■■ Premium Amount----------------------------✅
Gender ■■■ Premium Amount-------------------------✅
Annual Income ■■■ Premium Amount------------------✅
Number of Dependents ■■■ Premium Amount-----------✅
Education Level ■■■ Premium Amount----------------✅
Health Score ■■■ Premium Amount-------------------✅
Location ■■■ Premium Amount-----------------------✅
Policy Type ■■■ Premium Amount--------------------✅
Previous Claims ■■■ Premium Amount----------------✅
Vehicle Age ■■■ Premium Amount--------------------✅
Credit Score ■■■ Premium Amount-------------------✅
Insurance Duration ■■■ Premium Amount-------------✅
Smoking Status ■■■ Premium Amount-----------------✅
Exercise Frequency ■■■ Premium Amount-------------✅
Property Type ■■■ Premium Amount------------------✅
Premium Amount ■■■ Premium Amount-----------------✅
IsNull_Age ■■■ Premium Amount---------------------✅
IsNull_Annual Income ■■■ Premium Amount-----------✅
IsNull_Marital Status ■■■ Premium Amount----------✅
IsNull_Numbe

Unnamed: 0,Feature,Target,Statistic Test,Test Statistic,P-Value,Verdict
2,Annual Income,Premium Amount,SpearmanR,-0.061831,0.0,There is Relationship
10,Credit Score,Premium Amount,SpearmanR,-0.036687,0.0,There is Relationship
15,Premium Amount,Premium Amount,SpearmanR,1.0,0.0,There is Relationship
17,IsNull_Annual Income,Premium Amount,SpearmanR,-0.065399,0.0,There is Relationship
30,Money Handling Level,Premium Amount,SpearmanR,-0.072097,0.0,There is Relationship
31,Money Handling Level1,Premium Amount,SpearmanR,-0.048668,0.0,There is Relationship
29,Money Per Head,Premium Amount,SpearmanR,-0.053422,0.0,There is Relationship
32,Growth,Premium Amount,SpearmanR,-0.055,0.0,There is Relationship
36,Credit by Score,Premium Amount,SpearmanR,-0.05485,0.0,There is Relationship
34,Determinstic,Premium Amount,SpearmanR,-0.056869,0.0,There is Relationship
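
A p-value of 0.0 next to a correlation as small as -0.04 follows from the sample size rather than from the strength of the relationship. The sketch below uses the usual t-statistic for a correlation coefficient with a normal approximation, not SciPy's exact computation. The figure of 1,200,000 rows is an assumption based on 2,000,000 rows minus the 800,000 rows with a missing target.

```python
import math

def approx_two_sided_p(rho, n):
    """Approximate two-sided p-value for a correlation rho from n samples."""
    # t-statistic for a correlation coefficient with n - 2 degrees of freedom.
    t = rho * math.sqrt((n - 2) / (1.0 - rho ** 2))
    # For large n the t distribution is close to a standard normal.
    return math.erfc(abs(t) / math.sqrt(2.0))

rho = -0.036687  # Spearman coefficient reported for Credit Score

p_small = approx_two_sided_p(rho, 1_000)       # same rho, small sample
p_large = approx_two_sided_p(rho, 1_200_000)   # same rho, notebook-sized sample

print(f"n=1,000:     p ~ {p_small:.3f}")
print(f"n=1,200,000: p ~ {p_large:.3e}")
```

The Verdict column therefore says whether a relationship is detectable, while the Test Statistic column says how strong it is.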


In [510]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Number of Dependents,Education Level,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,Total Nulls,Policy Start Date - Day,Policy Start Date - Month,Sin_Date,Cos_Date,Sin_Year,Cos_Year,Sin_Month,Cos_Month,Number of Dependents_MIN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Number of Dependents_MAX_Premium Amount,Occupation_MIN_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_Q1_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q3_Premium Amount,Occupation_STD_Premium Amount,Occupation_MAX_Premium Amount,Education Level_MIN_Premium Amount,Education Level_MEAN_Premium Amount,Education Level_Q1_Premium Amount,Education Level_MEDIAN_Premium Amount,Education Level_Q3_Premium Amount,Education Level_STD_Premium Amount,Education Level_MAX_Premium Amount,Previous Claims_MIN_Premium Amount,Previous Claims_MEAN_Premium Amount,Previous Claims_Q1_Premium Amount,Previous Claims_MEDIAN_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_STD_Premium Amount,Previous Claims_MAX_Premium Amount,Health Conscious Level_MIN_Premium Amount,Health Conscious Level_MEAN_Premium Amount,Health Conscious Level_Q1_Premium Amount,Health Conscious 
Level_MEDIAN_Premium Amount,Health Conscious Level_Q3_Premium Amount,Health Conscious Level_STD_Premium Amount,Health Conscious Level_MAX_Premium Amount,Insurance Duration_MIN_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q3_Premium Amount,Insurance Duration_STD_Premium Amount,Insurance Duration_MAX_Premium Amount,ENCODED_Policy Start Date - Year,ENCODED_Policy Start Date - Quarter,ENCODED_Customer Feedback,ENCODED_Occupation_Self-Employed,ENCODED_Occupation_Unemployed,ENCODED_Marital Status_Married,ENCODED_Marital Status_Single
0,19.0,Female,10049.0,1.0,Bachelor's,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737,Saturday,186.0,1860.0,3.870062,8406.73897,429.376453,20098.0,744.0,4.0,45.197521,0,23,12,-0.975344,-0.220691,-6.447061e-13,1.0,-2.939152e-15,1.0,20.0,1104.678891,516.0,874.0,1510.0,865.235996,4994.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1102.698438,514.0,873.0,1509.0,864.866296,4988.0,20.0,1151.583106,526.0,907.0,1606.0,898.40295,4988.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1100.812035,515.0,872.0,1508.0,859.965806,4996.0,4.0,17.0,0.0,1.0,0.0,1.0,0.0
1,39.0,Female,31678.0,3.0,Master's,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,1,12,6,-0.998725,0.050489,-6.447061e-13,1.0,-1.469576e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1103.361209,514.0,872.0,1508.0,867.02349,4997.0,20.0,1102.113989,513.0,871.0,1512.0,866.235322,4997.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1098.15965,511.0,862.0,1497.0,865.103831,4999.0,20.0,1106.883166,518.0,878.0,1514.0,863.675409,4997.0,4.0,15.0,1.0,0.0,1.0,0.0,0.0
2,23.0,Male,25602.0,3.0,High School,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435,Saturday,632.0,1896.0,2.641123,29816.21115,1085.083634,204816.0,5056.0,8.0,377.420394,1,30,9,-0.994867,0.101192,-6.447061e-13,1.0,-2.204364e-15,1.0,20.0,1104.006551,514.0,875.0,1513.0,864.955881,4997.0,20.0,1100.430574,514.0,870.0,1507.0,865.079864,4999.0,20.0,1104.78749,514.0,876.0,1513.0,865.951488,4999.0,20.0,1083.632645,506.0,855.0,1476.0,853.156218,4997.0,20.0,1102.677039,514.0,871.0,1511.0,864.569091,4991.0,20.0,1101.733099,514.0,872.0,1502.0,865.787949,4997.0,4.0,16.0,2.0,1.0,0.0,0.0,0.0


In [511]:
wanted_columns = stats_df[stats_df["P-Value"] <= 0.05]["Feature"]
list(wanted_columns)

['Annual Income',
 'Credit Score',
 'Premium Amount',
 'IsNull_Annual Income',
 'Money Handling Level',
 'Money Handling Level1',
 'Money Per Head',
 'Growth',
 'Credit by Score',
 'Determinstic',
 'Growth1',
 'Feedback1',
 'Previous Claims_MEDIAN_Premium Amount',
 'IsNull_Health Score',
 'Previous Claims_MEAN_Premium Amount',
 'Previous Claims',
 'Previous Claims_STD_Premium Amount',
 'Previous Claims_Q3_Premium Amount',
 'Previous Claims_Q1_Premium Amount',
 'IsNull_Customer Feedback',
 'Previous Claims_MAX_Premium Amount',
 'Feedback3',
 'IsNull_Previous Claims',
 'IsNull_Marital Status',
 'Health Score',
 'Health_Risk_Score',
 'Feedback2',
 'CreditInsurance',
 'Sin_Year',
 'IsNull_Credit Score',
 'Health_Age_Interaction',
 'Total Nulls',
 'ENCODED_Policy Start Date - Year',
 'ENCODED_Policy Start Date - Quarter',
 'Feedback4',
 'IsNull_Number of Dependents',
 'IsNull_Occupation',
 'Health Conscious Level1',
 'Sin_Month',
 'Policy Start Date - Month',
 'Health Conscious Level',
 'He

In [512]:
remove_columns = stats_df[stats_df["P-Value"] > 0.05]["Feature"]
list(remove_columns)

['Education Level_MEDIAN_Premium Amount',
 'Education Level_MEAN_Premium Amount',
 'Education Level_Q3_Premium Amount',
 'Location',
 'ENCODED_Customer Feedback',
 'Number of Dependents_MAX_Premium Amount',
 'ENCODED_Marital Status_Married',
 'Education Level',
 'Number of Dependents',
 'Cos_Date',
 'Vehicle Age',
 'Health Conscious Level_STD_Premium Amount',
 'IsNull_Age',
 'Policy Type',
 'Education Level_MAX_Premium Amount',
 'Property Type',
 'Exercise Frequency',
 'Smoking Status',
 'Day_Name',
 'IsNull_Vehicle Age',
 'Education Level_STD_Premium Amount',
 'IsNull_Insurance Duration',
 'Policy Start Date - Day',
 'Gender',
 'Sin_Date',
 'ENCODED_Marital Status_Single',
 'Insurance Duration',
 'Education Level_Q1_Premium Amount',
 'ENCODED_Occupation_Unemployed']
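
The two filters `P-Value <= 0.05` and `P-Value > 0.05` do not necessarily cover every feature. Comparisons with NaN are False in both directions, so a feature whose test produced a NaN p-value (for instance a constant column) lands in neither list. This may be why the kept and removed lists together hold fewer names than the dataframe had columns. The sketch below shows the effect on a made-up `stats_df`; the feature names and p-values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up results table; the NaN stands for a test that could not be computed.
stats_df = pd.DataFrame({
    "Feature": ["feature_a", "feature_b", "feature_c"],
    "P-Value": [0.001, 0.40, np.nan],
})

wanted = stats_df[stats_df["P-Value"] <= 0.05]["Feature"].tolist()
removed = stats_df[stats_df["P-Value"] > 0.05]["Feature"].tolist()
# Features with a NaN p-value are missed by both filters above.
untested = stats_df[stats_df["P-Value"].isna()]["Feature"].tolist()

print(wanted, removed, untested)
```

Listing the NaN rows explicitly makes the decision to drop or keep those columns visible instead of implicit.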

In [513]:
df = df[wanted_columns]

In [514]:
df.shape

(2000000, 67)

In [515]:
df.head(3)

Unnamed: 0,Annual Income,Credit Score,Premium Amount,IsNull_Annual Income,Money Handling Level,Money Handling Level1,Money Per Head,Growth,Credit by Score,Determinstic,Growth1,Feedback1,Previous Claims_MEDIAN_Premium Amount,IsNull_Health Score,Previous Claims_MEAN_Premium Amount,Previous Claims,Previous Claims_STD_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_Q1_Premium Amount,IsNull_Customer Feedback,Previous Claims_MAX_Premium Amount,Feedback3,IsNull_Previous Claims,IsNull_Marital Status,Health Score,Health_Risk_Score,Feedback2,CreditInsurance,Sin_Year,IsNull_Credit Score,Health_Age_Interaction,Total Nulls,ENCODED_Policy Start Date - Year,ENCODED_Policy Start Date - Quarter,Feedback4,IsNull_Number of Dependents,IsNull_Occupation,Health Conscious Level1,Sin_Month,Policy Start Date - Month,Health Conscious Level,Health Conscious Level_Q1_Premium Amount,Health Conscious Level_MEAN_Premium Amount,Health Conscious Level_MEDIAN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Health Conscious Level_Q3_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_Q3_Premium Amount,Health Conscious Level_MAX_Premium Amount,Credit_Health_Score,Occupation_Q3_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_MAX_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q1_Premium Amount,Previous Claims_MIN_Premium Amount,Insurance Duration_MAX_Premium Amount,ENCODED_Occupation_Self-Employed,Age,Insurance Duration_STD_Premium Amount,Occupation_STD_Premium Amount
0,10049.0,372.0,2869.0,0,3738228.0,27.013441,10049.0,20098.0,186.0,528.894737,5024.5,20098.0,907.0,0,1151.583106,2.0,898.40295,1606.0,526.0,0,4988.0,4.0,0,0,22.598761,3.870062,744.0,1860.0,-6.447061e-13,0,429.376453,0,4.0,17.0,45.197521,0,0,13740.046488,-2.939152e-15,12,4,514.0,1102.677039,871.0,1104.678891,874.0,516.0,1510.0,865.235996,1511.0,1100.812035,872.0,515.0,1508.0,4991.0,8406.73897,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4996.0,1.0,19.0,859.965806,865.079864
1,31678.0,694.0,1483.0,0,21984532.0,45.645533,10559.333333,95034.0,694.0,812.25641,10559.333333,126712.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,4.0,0,0,15.569731,4.221513,2776.0,1388.0,-6.447061e-13,0,607.219509,1,4.0,15.0,62.278924,0,1,4857.756069,-1.469576e-15,6,2,511.0,1098.15965,862.0,1104.006551,875.0,514.0,1513.0,864.955881,1497.0,1106.883166,878.0,518.0,1514.0,4999.0,10805.393307,1508.0,1103.361209,4997.0,872.0,514.0,20.0,4997.0,0.0,39.0,863.675409,867.02349
2,25602.0,632.0,567.0,0,16180464.0,40.509494,8534.0,25602.0,632.0,1113.130435,25602.0,204816.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,8.0,0,0,47.177549,2.641123,5056.0,1896.0,-6.447061e-13,1,1085.083634,1,4.0,16.0,377.420394,0,0,17361.338138,-2.204364e-15,9,4,514.0,1102.677039,871.0,1104.006551,875.0,514.0,1513.0,864.955881,1511.0,1101.733099,872.0,514.0,1502.0,4991.0,29816.21115,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4997.0,1.0,23.0,865.787949,865.079864


In [516]:
df.select_dtypes(include='O')

Unnamed: 0,Health Conscious Level
0,4
1,2
2,4
3,3
4,3
...,...
1999995,4
1999996,5
1999997,2
1999998,4


In [517]:
# df['Health Conscious Level'] = df['Health Conscious Level'].astype(int)

In [518]:
# df['Policy Start Date - Day'] = df['Policy Start Date - Day'].astype(int)

In [519]:
# df.drop(columns='Day_Name', inplace=True)

### Location 

In [520]:
# a = OneHotEncoder(drop="first", sparse_output=False)

# b = pd.DataFrame(
#         a.fit_transform(df[["Location"]]),
#         columns="ENCODED_" + a.get_feature_names_out()
#     )

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Location", inplace=True)

### Property Type 

In [521]:
# a = OneHotEncoder(drop="first", sparse_output=False)

# b = pd.DataFrame(
#         a.fit_transform(df[["Property Type"]]),
#         columns="ENCODED_" + a.get_feature_names_out()
#     )

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Property Type", inplace=True)

### Gender

In [522]:
# a = OneHotEncoder(drop="first", sparse_output=False)

# b = pd.DataFrame(
#         a.fit_transform(df[["Gender"]]),
#         columns="ENCODED_" + a.get_feature_names_out()
#     )

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Gender", inplace=True)

### Smoking Status

In [523]:
# a = OneHotEncoder(drop="first", sparse_output=False)

# b = pd.DataFrame(
#         a.fit_transform(df[["Smoking Status"]]),
#         columns="ENCODED_" + a.get_feature_names_out()
#     )

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Smoking Status", inplace=True)

### Education Level 

In [524]:
# df['Education Level'].unique()

In [525]:
# a = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's", "PhD"]])

# b = pd.DataFrame({"ENCODED_Education Level" : a.fit_transform(df[["Education Level"]]).flatten()})

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Education Level", inplace=True)

### Policy Type

In [526]:
# df['Policy Type'].unique()

In [527]:
# a = OrdinalEncoder(categories=[["Basic", "Comprehensive", "Premium"]])

# b = pd.DataFrame({"ENCODED_Policy Type" : a.fit_transform(df[["Policy Type"]]).flatten()})

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Policy Type", inplace=True)

### Exercise Frequency

In [528]:
# df['Exercise Frequency'].unique()

In [529]:
# a = OrdinalEncoder(categories=[["Rarely", "Monthly", "Weekly", "Daily"]])

# b = pd.DataFrame({"ENCODED_Exercise Frequency" : a.fit_transform(df[["Exercise Frequency"]]).flatten()})

# df = pd.concat([df, b], axis=1)
# df.drop(columns="Exercise Frequency", inplace=True)

#
---
#

# Splitting Data

In [530]:
df["Health Conscious Level"] = df["Health Conscious Level"].astype(int)

In [531]:
# Copy the slices so that later in-place edits do not act on views of `df`.
train = df.iloc[:1200000, :].copy()
test = df.iloc[1200000:, :].copy()

train.shape, test.shape

((1200000, 67), (800000, 67))

In [532]:
X = train.drop(columns="Premium Amount")
Y = train["Premium Amount"]

In [533]:
from sklearn.model_selection import train_test_split

In [534]:
x_train, x_validate, y_train, y_validate = train_test_split(X, Y, test_size=10000)

In [535]:
x_validate.shape

(10000, 66)
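
The split above draws a different validation set on every run because no seed is passed. A minimal sketch on toy data (hypothetical, not the notebook's frame) showing that an integer `test_size` is an absolute row count and that fixing `random_state` makes the split repeatable. The seed value 42 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (hypothetical).
toy_X = pd.DataFrame({"a": np.arange(100), "b": np.arange(100) * 2})
toy_y = pd.Series(np.arange(100), name="target")

# Same seed, same rows: both calls return identical partitions.
split_1 = train_test_split(toy_X, toy_y, test_size=10, random_state=42)
split_2 = train_test_split(toy_X, toy_y, test_size=10, random_state=42)

toy_x_train, toy_x_validate, toy_y_train, toy_y_validate = split_1
```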

In [536]:
test = test.drop(columns="Premium Amount")

In [537]:
test.shape

(800000, 66)

In [538]:
test

Unnamed: 0,Annual Income,Credit Score,IsNull_Annual Income,Money Handling Level,Money Handling Level1,Money Per Head,Growth,Credit by Score,Determinstic,Growth1,Feedback1,Previous Claims_MEDIAN_Premium Amount,IsNull_Health Score,Previous Claims_MEAN_Premium Amount,Previous Claims,Previous Claims_STD_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_Q1_Premium Amount,IsNull_Customer Feedback,Previous Claims_MAX_Premium Amount,Feedback3,IsNull_Previous Claims,IsNull_Marital Status,Health Score,Health_Risk_Score,Feedback2,CreditInsurance,Sin_Year,IsNull_Credit Score,Health_Age_Interaction,Total Nulls,ENCODED_Policy Start Date - Year,ENCODED_Policy Start Date - Quarter,Feedback4,IsNull_Number of Dependents,IsNull_Occupation,Health Conscious Level1,Sin_Month,Policy Start Date - Month,Health Conscious Level,Health Conscious Level_Q1_Premium Amount,Health Conscious Level_MEAN_Premium Amount,Health Conscious Level_MEDIAN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Health Conscious Level_Q3_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_Q3_Premium Amount,Health Conscious Level_MAX_Premium Amount,Credit_Health_Score,Occupation_Q3_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_MAX_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q1_Premium Amount,Previous Claims_MIN_Premium Amount,Insurance Duration_MAX_Premium Amount,ENCODED_Occupation_Self-Employed,Age,Insurance Duration_STD_Premium Amount,Occupation_STD_Premium Amount
1200000,2310.0,551.0,0,1272810.0,4.192377,577.500000,4620.0,275.5,82.500000,1155.000000,4620.0,907.0,0,1151.583106,2.0,898.402950,1606.0,526.0,0,4988.0,4.0,1,1,7.657981,4.617101,1102.0,551.0,-6.447061e-13,1,214.423464,3,4.0,15.0,15.315962,0,0,3430.775431,-1.469576e-15,6,2,511.0,1098.159650,862.0,1096.464223,864.0,509.0,1498.0,864.717484,1497.0,1097.042977,861.0,507.0,1504.0,4999.0,4219.547460,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4988.0,1.0,28.0,865.431191,865.079864
1200001,126031.0,372.0,0,46883532.0,338.793011,63015.500000,378093.0,372.0,4065.516129,42010.333333,1008248.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,8.0,1,0,13.381379,4.330931,2976.0,2976.0,1.585375e-14,0,414.822753,1,5.0,19.0,107.051033,0,0,1659.291012,-9.797174e-16,4,1,506.0,1098.453623,871.0,1108.443461,876.0,516.0,1520.0,866.852628,1509.0,1105.876809,876.0,514.0,1514.0,4981.0,4977.873036,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4994.0,1.0,31.0,868.009584,865.079864
1200002,17092.0,819.0,0,13998348.0,20.869353,17092.000000,68368.0,819.0,363.659574,4273.000000,68368.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,4.0,1,0,24.354527,3.782274,3276.0,7371.0,-6.447061e-13,0,1144.662758,1,4.0,15.0,97.418107,0,0,9157.302066,-9.797174e-16,4,3,510.0,1099.009424,867.0,1097.649985,867.0,513.0,1498.0,862.779627,1504.0,1095.676958,861.0,507.0,1505.0,4997.0,19946.357425,1508.0,1103.361209,4997.0,872.0,514.0,20.0,4988.0,0.0,47.0,862.910692,867.023490
1200003,30424.0,770.0,0,23426480.0,39.511688,10141.333333,121696.0,770.0,1086.571429,7606.000000,60848.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,2.0,0,0,5.136225,4.743189,1540.0,3850.0,-6.447061e-13,0,143.814305,0,4.0,17.0,10.272450,0,0,4602.057755,-2.449294e-15,10,3,510.0,1099.009424,867.0,1104.006551,875.0,514.0,1513.0,864.955881,1504.0,1100.812035,872.0,515.0,1508.0,4997.0,3954.893383,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4996.0,1.0,28.0,859.965806,865.079864
1200004,10863.0,755.0,0,8201565.0,14.388079,5431.500000,10863.0,755.0,452.625000,10863.000000,43452.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,4.0,1,0,11.844155,4.407792,3020.0,5285.0,-1.468363e-13,0,284.259727,1,2.0,9.0,47.376621,0,0,9096.311250,-9.799650e-15,11,3,510.0,1099.009424,867.0,1108.443461,876.0,516.0,1520.0,866.852628,1504.0,1105.964863,877.0,515.0,1513.0,4997.0,8942.337231,1508.0,1103.361209,4997.0,872.0,514.0,20.0,4988.0,0.0,24.0,866.225414,867.023490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1999995,38782.0,309.0,0,11983638.0,125.508091,38782.000000,77564.0,309.0,775.640000,19391.000000,155128.0,855.0,0,1082.452746,0.0,851.671355,1472.0,508.0,0,4999.0,0.0,1,0,14.498639,4.275068,1236.0,618.0,-1.468363e-13,0,724.931945,2,2.0,8.0,57.994556,0,1,23197.822227,-1.714506e-15,7,4,514.0,1102.677039,871.0,1104.678891,874.0,516.0,1510.0,865.235996,1511.0,1106.883166,878.0,518.0,1514.0,4991.0,4480.079418,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4997.0,1.0,50.0,863.675409,865.079864
1999996,73462.0,462.0,0,33939444.0,159.008658,73462.000000,220386.0,231.0,2098.914286,24487.333333,587696.0,907.0,0,1151.583106,2.0,898.402950,1606.0,526.0,0,4988.0,16.0,0,0,8.145748,4.592713,3696.0,924.0,-6.447061e-13,1,285.101183,3,4.0,14.0,65.165985,0,1,18246.475706,-7.347881e-16,3,5,514.0,1102.591279,872.0,1097.649985,867.0,513.0,1498.0,862.779627,1509.0,1106.883166,878.0,518.0,1514.0,4994.0,3763.335614,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4997.0,1.0,35.0,863.675409,865.079864
1999997,35178.0,698.0,0,24554244.0,50.398281,35178.000000,105534.0,698.0,1353.000000,11726.000000,70356.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,2.0,1,0,6.636583,4.668171,1396.0,4188.0,3.510335e-13,1,172.551169,2,0.0,0.0,13.273167,0,0,2760.818699,-2.204364e-15,9,2,511.0,1098.159650,862.0,1097.649985,867.0,513.0,1498.0,862.779627,1497.0,1104.558441,873.0,515.0,1513.0,4999.0,4632.335221,1514.0,1105.160880,4994.0,876.0,515.0,20.0,4999.0,0.0,26.0,866.550287,862.882837
1999998,45661.0,467.0,0,21323687.0,97.775161,15220.333333,136983.0,233.5,1342.970588,15220.333333,182644.0,907.0,0,1151.583106,2.0,898.402950,1606.0,526.0,0,4988.0,8.0,0,0,15.937248,4.203138,1868.0,3269.0,-1.305266e-12,0,541.866425,1,3.0,11.0,63.748991,0,1,17339.725601,-1.224647e-15,5,4,514.0,1102.677039,871.0,1104.006551,875.0,514.0,1513.0,864.955881,1511.0,1105.964863,877.0,515.0,1513.0,4991.0,7442.694720,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4988.0,1.0,34.0,866.225414,865.079864


##
---
##

# Scaling on `df` 

In [539]:
# fig, axs = plt.subplots(3, 6, figsize=(20, 9))

# for i, ax in zip(x_train.columns, axs.flatten()):
#     sns.kdeplot(x_train[i], ax=ax, color="darkgray", fill=True)

# plt.tight_layout()
# plt.show()

In [540]:
df.head()

Unnamed: 0,Annual Income,Credit Score,Premium Amount,IsNull_Annual Income,Money Handling Level,Money Handling Level1,Money Per Head,Growth,Credit by Score,Determinstic,Growth1,Feedback1,Previous Claims_MEDIAN_Premium Amount,IsNull_Health Score,Previous Claims_MEAN_Premium Amount,Previous Claims,Previous Claims_STD_Premium Amount,Previous Claims_Q3_Premium Amount,Previous Claims_Q1_Premium Amount,IsNull_Customer Feedback,Previous Claims_MAX_Premium Amount,Feedback3,IsNull_Previous Claims,IsNull_Marital Status,Health Score,Health_Risk_Score,Feedback2,CreditInsurance,Sin_Year,IsNull_Credit Score,Health_Age_Interaction,Total Nulls,ENCODED_Policy Start Date - Year,ENCODED_Policy Start Date - Quarter,Feedback4,IsNull_Number of Dependents,IsNull_Occupation,Health Conscious Level1,Sin_Month,Policy Start Date - Month,Health Conscious Level,Health Conscious Level_Q1_Premium Amount,Health Conscious Level_MEAN_Premium Amount,Health Conscious Level_MEDIAN_Premium Amount,Number of Dependents_MEAN_Premium Amount,Number of Dependents_MEDIAN_Premium Amount,Number of Dependents_Q1_Premium Amount,Number of Dependents_Q3_Premium Amount,Number of Dependents_STD_Premium Amount,Health Conscious Level_Q3_Premium Amount,Insurance Duration_MEAN_Premium Amount,Insurance Duration_MEDIAN_Premium Amount,Insurance Duration_Q1_Premium Amount,Insurance Duration_Q3_Premium Amount,Health Conscious Level_MAX_Premium Amount,Credit_Health_Score,Occupation_Q3_Premium Amount,Occupation_MEAN_Premium Amount,Occupation_MAX_Premium Amount,Occupation_MEDIAN_Premium Amount,Occupation_Q1_Premium Amount,Previous Claims_MIN_Premium Amount,Insurance Duration_MAX_Premium Amount,ENCODED_Occupation_Self-Employed,Age,Insurance Duration_STD_Premium Amount,Occupation_STD_Premium Amount
0,10049.0,372.0,2869.0,0,3738228.0,27.013441,10049.0,20098.0,186.0,528.894737,5024.5,20098.0,907.0,0,1151.583106,2.0,898.40295,1606.0,526.0,0,4988.0,4.0,0,0,22.598761,3.870062,744.0,1860.0,-6.447061e-13,0,429.376453,0,4.0,17.0,45.197521,0,0,13740.046488,-2.939152e-15,12,4,514.0,1102.677039,871.0,1104.678891,874.0,516.0,1510.0,865.235996,1511.0,1100.812035,872.0,515.0,1508.0,4991.0,8406.73897,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4996.0,1.0,19.0,859.965806,865.079864
1,31678.0,694.0,1483.0,0,21984532.0,45.645533,10559.333333,95034.0,694.0,812.25641,10559.333333,126712.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,4.0,0,0,15.569731,4.221513,2776.0,1388.0,-6.447061e-13,0,607.219509,1,4.0,15.0,62.278924,0,1,4857.756069,-1.469576e-15,6,2,511.0,1098.15965,862.0,1104.006551,875.0,514.0,1513.0,864.955881,1497.0,1106.883166,878.0,518.0,1514.0,4999.0,10805.393307,1508.0,1103.361209,4997.0,872.0,514.0,20.0,4997.0,0.0,39.0,863.675409,867.02349
2,25602.0,632.0,567.0,0,16180464.0,40.509494,8534.0,25602.0,632.0,1113.130435,25602.0,204816.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,8.0,0,0,47.177549,2.641123,5056.0,1896.0,-6.447061e-13,1,1085.083634,1,4.0,16.0,377.420394,0,0,17361.338138,-2.204364e-15,9,4,514.0,1102.677039,871.0,1104.006551,875.0,514.0,1513.0,864.955881,1511.0,1101.733099,872.0,514.0,1502.0,4991.0,29816.21115,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4997.0,1.0,23.0,865.787949,865.079864
3,141855.0,367.0,765.0,0,52060785.0,386.525886,70927.5,283710.0,367.0,6755.0,70927.5,283710.0,855.0,0,1083.632645,1.0,853.156218,1476.0,506.0,0,4997.0,2.0,0,0,10.938144,4.453093,734.0,367.0,1.585375e-14,0,229.701027,1,5.0,19.0,21.876288,0,1,7350.432875,-1.469576e-15,6,3,510.0,1099.009424,867.0,1108.443461,876.0,516.0,1520.0,866.852628,1504.0,1097.042977,861.0,507.0,1504.0,4997.0,4014.298906,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4988.0,1.0,21.0,865.431191,865.079864
4,39651.0,598.0,2022.0,0,23711298.0,66.30602,39651.0,79302.0,598.0,1888.142857,19825.5,79302.0,855.0,0,1082.452746,0.0,851.671355,1472.0,508.0,0,4999.0,0.0,0,0,20.376094,3.981195,1196.0,2392.0,-1.468363e-13,0,427.897966,0,2.0,9.0,40.752187,0,0,6846.367459,-2.939152e-15,12,3,510.0,1099.009424,867.0,1104.678891,874.0,516.0,1510.0,865.235996,1504.0,1104.723079,872.0,515.0,1509.0,4997.0,12184.903989,1507.0,1100.430574,4999.0,870.0,514.0,20.0,4991.0,1.0,21.0,866.377508,865.079864


In [541]:
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler

In [542]:
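import gzip
import os
import pickle

def save_pickle(foldername, filename, model):
    # Create the target folder if needed, then write a gzip-compressed pickle.
    os.makedirs(foldername, exist_ok=True)
    with gzip.open(f"{foldername}/{filename}.pkl.gz", 'wb') as f:
        pickle.dump(model, f)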
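
Reading a saved scaler back needs the mirror image of `save_pickle`. A minimal sketch: `load_pickle` is a hypothetical helper that does not appear in the notebook, and the round trip below uses a plain dictionary and a temporary folder rather than a real scaler.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle(foldername, filename, model):
    # Same file layout as the notebook's helper: <folder>/<name>.pkl.gz
    os.makedirs(foldername, exist_ok=True)
    with gzip.open(f"{foldername}/{filename}.pkl.gz", "wb") as f:
        pickle.dump(model, f)

def load_pickle(foldername, filename):
    # Hypothetical counterpart: decompress and unpickle the stored object.
    with gzip.open(f"{foldername}/{filename}.pkl.gz", "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    save_pickle(tmp, "SCALER_demo", {"center": 1.5, "scale": 2.0})
    restored = load_pickle(tmp, "SCALER_demo")
```

Unpickling executes code stored in the file, so only load pickles that you created yourself.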


In [543]:
from sklearn.base import clone

def do_scaling(scaler):
    # Fit one scaler per numeric column on x_train, then apply it to x_validate and test.
    scalers = {}
    cols = x_train.select_dtypes("number").columns

    for col in cols:
        name = f"SCALER_{col.replace(' ', '_')}"
        # clone() gives each column its own unfitted scaler; reusing one object
        # would leave every dict entry pointing at the scaler fitted on the last column.
        scalers[name] = clone(scaler)

        x_train[name] = scalers[name].fit_transform(x_train[[col]]).flatten()
        save_pickle(foldername="do_scalings", filename=f"SCALER_{col}", model=scalers[name])
        x_train.drop(columns=col, inplace=True)

        x_validate[name] = scalers[name].transform(x_validate[[col]]).flatten()
        x_validate.drop(columns=col, inplace=True)

        test[name] = scalers[name].transform(test[[col]]).flatten()
        test.drop(columns=col, inplace=True)

    return scalers
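
Because `RobustScaler` centers and scales each column independently, one scaler fitted on all numeric columns should give the same numbers as one scaler per column. A minimal sketch on random toy data (hypothetical column names), offered as a possible simplification rather than as the notebook's method:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
toy_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
toy_valid = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# One scaler per column, fitted on the training rows only.
per_column = pd.DataFrame(
    {
        c: RobustScaler().fit(toy_train[[c]]).transform(toy_valid[[c]]).ravel()
        for c in toy_train.columns
    },
    index=toy_valid.index,
)

# One scaler for the whole frame, again fitted on the training rows only.
joint_scaler = RobustScaler().fit(toy_train)
joint = pd.DataFrame(
    joint_scaler.transform(toy_valid),
    columns=toy_valid.columns,
    index=toy_valid.index,
)

same = np.allclose(per_column.to_numpy(), joint.to_numpy())
```

A single fitted object also means a single file to pickle, at the cost of losing the per-column files that `do_scaling` writes.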

In [544]:
scaler_objects = do_scaling(RobustScaler())
scaler_objects

{'SCALER_Annual_Income': RobustScaler(),
 'SCALER_Credit_Score': RobustScaler(),
 'SCALER_IsNull_Annual_Income': RobustScaler(),
 'SCALER_Money_Handling_Level': RobustScaler(),
 'SCALER_Money_Handling_Level1': RobustScaler(),
 'SCALER_Money_Per_Head': RobustScaler(),
 'SCALER_Growth': RobustScaler(),
 'SCALER_Credit_by_Score': RobustScaler(),
 'SCALER_Determinstic': RobustScaler(),
 'SCALER_Growth1': RobustScaler(),
 'SCALER_Feedback1': RobustScaler(),
 'SCALER_Previous_Claims_MEDIAN_Premium_Amount': RobustScaler(),
 'SCALER_IsNull_Health_Score': RobustScaler(),
 'SCALER_Previous_Claims_MEAN_Premium_Amount': RobustScaler(),
 'SCALER_Previous_Claims': RobustScaler(),
 'SCALER_Previous_Claims_STD_Premium_Amount': RobustScaler(),
 'SCALER_Previous_Claims_Q3_Premium_Amount': RobustScaler(),
 'SCALER_Previous_Claims_Q1_Premium_Amount': RobustScaler(),
 'SCALER_IsNull_Customer_Feedback': RobustScaler(),
 'SCALER_Previous_Claims_MAX_Premium_Amount': RobustScaler(),
 'SCALER_Feedback3': Robust

In [545]:
x_train.head(3)

Unnamed: 0,SCALER_Annual_Income,SCALER_Credit_Score,SCALER_IsNull_Annual_Income,SCALER_Money_Handling_Level,SCALER_Money_Handling_Level1,SCALER_Money_Per_Head,SCALER_Growth,SCALER_Credit_by_Score,SCALER_Determinstic,SCALER_Growth1,SCALER_Feedback1,SCALER_Previous_Claims_MEDIAN_Premium_Amount,SCALER_IsNull_Health_Score,SCALER_Previous_Claims_MEAN_Premium_Amount,SCALER_Previous_Claims,SCALER_Previous_Claims_STD_Premium_Amount,SCALER_Previous_Claims_Q3_Premium_Amount,SCALER_Previous_Claims_Q1_Premium_Amount,SCALER_IsNull_Customer_Feedback,SCALER_Previous_Claims_MAX_Premium_Amount,SCALER_Feedback3,SCALER_IsNull_Previous_Claims,SCALER_IsNull_Marital_Status,SCALER_Health_Score,SCALER_Health_Risk_Score,SCALER_Feedback2,SCALER_CreditInsurance,SCALER_Sin_Year,SCALER_IsNull_Credit_Score,SCALER_Health_Age_Interaction,SCALER_Total_Nulls,SCALER_ENCODED_Policy_Start_Date_-_Year,SCALER_ENCODED_Policy_Start_Date_-_Quarter,SCALER_Feedback4,SCALER_IsNull_Number_of_Dependents,SCALER_IsNull_Occupation,SCALER_Health_Conscious_Level1,SCALER_Sin_Month,SCALER_Policy_Start_Date_-_Month,SCALER_Health_Conscious_Level,SCALER_Health_Conscious_Level_Q1_Premium_Amount,SCALER_Health_Conscious_Level_MEAN_Premium_Amount,SCALER_Health_Conscious_Level_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_MEAN_Premium_Amount,SCALER_Number_of_Dependents_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_Q1_Premium_Amount,SCALER_Number_of_Dependents_Q3_Premium_Amount,SCALER_Number_of_Dependents_STD_Premium_Amount,SCALER_Health_Conscious_Level_Q3_Premium_Amount,SCALER_Insurance_Duration_MEAN_Premium_Amount,SCALER_Insurance_Duration_MEDIAN_Premium_Amount,SCALER_Insurance_Duration_Q1_Premium_Amount,SCALER_Insurance_Duration_Q3_Premium_Amount,SCALER_Health_Conscious_Level_MAX_Premium_Amount,SCALER_Credit_Health_Score,SCALER_Occupation_Q3_Premium_Amount,SCALER_Occupation_MEAN_Premium_Amount,SCALER_Occupation_MAX_Premium_Amount,SCALER_Occupation_MEDIAN_Premium_Amount,SCALER_Occupation_Q1_Premium_Amount,SCALER_Previous_Claims_MIN_Premium_Amount,SCALER_Insurance_Duration_MAX_Premium_Amount,SCALER_ENCODED_Occupation_Self-Employed,SCALER_Age,SCALER_Insurance_Duration_STD_Premium_Amount,SCALER_Occupation_STD_Premium_Amount
436357,0.050061,0.553648,0.0,0.238969,-0.037407,0.649504,-0.258624,0.669903,-0.048528,0.897795,0.727089,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.5,1.0,0.0,-0.215631,0.215631,1.524315,0.319626,-1.0,0.0,0.083345,0.5,0.0,0.1,0.696672,0.0,1.0,0.083645,2.449294e-16,-0.2,0.0,0.0,0.023383,0.0,0.095653,0.0,0.666667,0.0,0.540228,0.285714,-0.739699,0.0,0.0,-0.125,-0.5,0.10618,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.222222,1.0,0.304348,-2.154674,0.0
482033,0.830687,-0.575107,0.0,0.547306,1.215085,0.702121,0.043301,-0.18123,0.399705,2.455742,2.092396,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.5,0.0,0.0,-0.728743,0.728743,0.594164,-0.694579,0.753709,1.0,-0.334912,0.5,-0.333333,-0.1,0.032828,0.0,1.0,-0.368953,-9.797174e-16,0.8,-0.5,-1.0,-0.976617,-0.8,0.631238,0.25,0.666667,0.666667,3.658057,-0.714286,0.458999,1.5,3.0,0.625,0.5,-0.756308,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.333333,1.0,0.608696,-0.781814,0.0
429369,-0.025195,0.695279,0.0,0.179718,-0.116786,-0.168974,0.207935,-0.456311,0.161205,-0.104951,-0.245059,52.0,0.0,57.590036,1.0,30.472002,32.5,9.0,0.0,-4.5,0.0,0.0,0.0,0.728883,-0.728883,-0.380195,-0.188411,1.507418,0.0,0.27508,0.0,-1.0,-1.0,-0.147699,0.0,1.0,-0.197713,-7.347881e-16,0.6,0.0,0.0,0.023383,0.0,0.0,0.125,0.0,0.2,0.0,0.285714,-0.557842,0.0,-1.0,-0.875,-0.5,1.294466,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.333333,1.0,-0.434783,0.0,0.0


In [546]:
x_validate.head(3)

Unnamed: 0,SCALER_Annual_Income,SCALER_Credit_Score,SCALER_IsNull_Annual_Income,SCALER_Money_Handling_Level,SCALER_Money_Handling_Level1,SCALER_Money_Per_Head,SCALER_Growth,SCALER_Credit_by_Score,SCALER_Determinstic,SCALER_Growth1,SCALER_Feedback1,SCALER_Previous_Claims_MEDIAN_Premium_Amount,SCALER_IsNull_Health_Score,SCALER_Previous_Claims_MEAN_Premium_Amount,SCALER_Previous_Claims,SCALER_Previous_Claims_STD_Premium_Amount,SCALER_Previous_Claims_Q3_Premium_Amount,SCALER_Previous_Claims_Q1_Premium_Amount,SCALER_IsNull_Customer_Feedback,SCALER_Previous_Claims_MAX_Premium_Amount,SCALER_Feedback3,SCALER_IsNull_Previous_Claims,SCALER_IsNull_Marital_Status,SCALER_Health_Score,SCALER_Health_Risk_Score,SCALER_Feedback2,SCALER_CreditInsurance,SCALER_Sin_Year,SCALER_IsNull_Credit_Score,SCALER_Health_Age_Interaction,SCALER_Total_Nulls,SCALER_ENCODED_Policy_Start_Date_-_Year,SCALER_ENCODED_Policy_Start_Date_-_Quarter,SCALER_Feedback4,SCALER_IsNull_Number_of_Dependents,SCALER_IsNull_Occupation,SCALER_Health_Conscious_Level1,SCALER_Sin_Month,SCALER_Policy_Start_Date_-_Month,SCALER_Health_Conscious_Level,SCALER_Health_Conscious_Level_Q1_Premium_Amount,SCALER_Health_Conscious_Level_MEAN_Premium_Amount,SCALER_Health_Conscious_Level_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_MEAN_Premium_Amount,SCALER_Number_of_Dependents_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_Q1_Premium_Amount,SCALER_Number_of_Dependents_Q3_Premium_Amount,SCALER_Number_of_Dependents_STD_Premium_Amount,SCALER_Health_Conscious_Level_Q3_Premium_Amount,SCALER_Insurance_Duration_MEAN_Premium_Amount,SCALER_Insurance_Duration_MEDIAN_Premium_Amount,SCALER_Insurance_Duration_Q1_Premium_Amount,SCALER_Insurance_Duration_Q3_Premium_Amount,SCALER_Health_Conscious_Level_MAX_Premium_Amount,SCALER_Credit_Health_Score,SCALER_Occupation_Q3_Premium_Amount,SCALER_Occupation_MEAN_Premium_Amount,SCALER_Occupation_MAX_Premium_Amount,SCALER_Occupation_MEDIAN_Premium_Amount,SCALER_Occupation_Q1_Premium_Amount,SCALER_Previous_Claims_MIN_Premium_Amount,SCALER_Insurance_Duration_MAX_Premium_Amount,SCALER_ENCODED_Occupation_Self-Employed,SCALER_Age,SCALER_Insurance_Duration_STD_Premium_Amount,SCALER_Occupation_STD_Premium_Amount
967429,0.594289,-0.72103,0.0,0.278415,1.049671,0.167409,1.414169,-0.291262,0.528256,0.097566,0.025808,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,-0.25,1.0,0.0,0.681842,-0.681842,-0.671972,0.248972,0.0,0.0,0.681602,0.5,0.333333,0.5,-0.162914,0.0,1.0,1.425795,0.0,0.0,1.0,1.0,0.700173,1.6,0.0,0.125,0.0,0.2,0.0,0.142857,0.260301,1.0,-1.0,0.625,0.333333,0.162167,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.0,1.0,0.0,0.822189,0.0
512051,1.039568,-0.892704,0.0,0.49657,1.85504,0.409199,1.443403,-0.420712,2.233974,0.603388,0.220506,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,-0.25,1.0,0.0,-1.03223,1.03223,-0.707339,-0.163738,1.0,0.0,-0.940497,0.0,0.666667,1.0,-0.717314,0.0,0.0,-0.395667,-2.449294e-16,0.2,-0.5,-1.0,-0.976617,-0.8,0.0,0.125,0.0,0.2,0.0,-0.714286,0.0,0.25,0.0,0.5,0.5,-1.020467,0.0,0.0,0.0,0.0,0.0,0.0,0.555556,0.0,-0.826087,0.282128,0.469401
1056662,-0.55953,0.48927,0.0,-0.544784,-0.505448,-0.430224,-0.453232,-0.533981,-0.481507,-0.425017,-0.432158,52.0,0.0,57.590036,1.0,30.472002,32.5,9.0,0.0,-4.5,0.5,0.0,0.0,0.656682,-0.656682,0.208665,-0.509159,-0.246291,0.0,0.400839,-0.5,-0.666667,-0.7,0.494378,0.0,0.0,0.261902,4.898587e-16,-0.4,0.5,0.0,0.0,0.2,0.631238,0.25,0.666667,0.666667,3.658057,0.0,0.458999,1.5,3.0,0.625,0.0,1.05367,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,-0.26087,-0.781814,0.469401


In [547]:
test.head(3)

Unnamed: 0,SCALER_Annual_Income,SCALER_Credit_Score,SCALER_IsNull_Annual_Income,SCALER_Money_Handling_Level,SCALER_Money_Handling_Level1,SCALER_Money_Per_Head,SCALER_Growth,SCALER_Credit_by_Score,SCALER_Determinstic,SCALER_Growth1,SCALER_Feedback1,SCALER_Previous_Claims_MEDIAN_Premium_Amount,SCALER_IsNull_Health_Score,SCALER_Previous_Claims_MEAN_Premium_Amount,SCALER_Previous_Claims,SCALER_Previous_Claims_STD_Premium_Amount,SCALER_Previous_Claims_Q3_Premium_Amount,SCALER_Previous_Claims_Q1_Premium_Amount,SCALER_IsNull_Customer_Feedback,SCALER_Previous_Claims_MAX_Premium_Amount,SCALER_Feedback3,SCALER_IsNull_Previous_Claims,SCALER_IsNull_Marital_Status,SCALER_Health_Score,SCALER_Health_Risk_Score,SCALER_Feedback2,SCALER_CreditInsurance,SCALER_Sin_Year,SCALER_IsNull_Credit_Score,SCALER_Health_Age_Interaction,SCALER_Total_Nulls,SCALER_ENCODED_Policy_Start_Date_-_Year,SCALER_ENCODED_Policy_Start_Date_-_Quarter,SCALER_Feedback4,SCALER_IsNull_Number_of_Dependents,SCALER_IsNull_Occupation,SCALER_Health_Conscious_Level1,SCALER_Sin_Month,SCALER_Policy_Start_Date_-_Month,SCALER_Health_Conscious_Level,SCALER_Health_Conscious_Level_Q1_Premium_Amount,SCALER_Health_Conscious_Level_MEAN_Premium_Amount,SCALER_Health_Conscious_Level_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_MEAN_Premium_Amount,SCALER_Number_of_Dependents_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_Q1_Premium_Amount,SCALER_Number_of_Dependents_Q3_Premium_Amount,SCALER_Number_of_Dependents_STD_Premium_Amount,SCALER_Health_Conscious_Level_Q3_Premium_Amount,SCALER_Insurance_Duration_MEAN_Premium_Amount,SCALER_Insurance_Duration_MEDIAN_Premium_Amount,SCALER_Insurance_Duration_Q1_Premium_Amount,SCALER_Insurance_Duration_Q3_Premium_Amount,SCALER_Health_Conscious_Level_MAX_Premium_Amount,SCALER_Credit_Health_Score,SCALER_Occupation_Q3_Premium_Amount,SCALER_Occupation_MEAN_Premium_Amount,SCALER_Occupation_MAX_Premium_Amount,SCALER_Occupation_MEDIAN_Premium_Amount,SCALER_Occupation_Q1_Premium_Amount,SCALER_Previous_Claims_MIN_Premium_Amount,SCALER_Insurance_Duration_MAX_Premium_Amount,SCALER_ENCODED_Occupation_Self-Employed,SCALER_Age,SCALER_Insurance_Duration_STD_Premium_Amount,SCALER_Occupation_STD_Premium_Amount
1200000,-0.602842,-0.2103,0.0,-0.617211,-0.524833,-0.49121,-0.486736,-0.797735,-0.510185,-0.468238,-0.497634,52.0,0.0,57.590036,1.0,30.472002,32.5,9.0,0.0,-4.5,0.0,1.0,1.0,-0.967448,0.967448,-0.566755,-0.837009,0.0,1.0,-0.85602,1.0,0.333333,0.5,-0.696361,0.0,0.0,-0.443763,0.0,0.0,-1.0,-0.75,-1.208313,-1.8,-1.073045,-1.25,-1.666667,-0.8,-0.459772,-1.714286,-1.48387,-2.75,-8.0,-0.625,0.833333,-0.872834,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,-0.666667,1.0,-0.565217,-0.13203,0.0
1200001,2.777975,-0.978541,0.0,1.573076,4.611736,2.288206,3.460514,-0.485437,3.457878,1.759873,5.498184,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.5,1.0,0.0,-0.643669,0.643669,0.261715,0.069533,1.0,0.0,-0.618053,0.0,0.666667,0.9,0.142893,0.0,0.0,-0.517579,4.898587e-16,-0.4,-1.5,-2.0,-1.12816,0.0,0.631238,0.25,0.666667,0.666667,3.658057,0.0,0.260301,1.0,-1.0,0.625,-2.166667,-0.805656,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.0,1.0,-0.434783,0.822189,0.0
1200002,-0.198907,0.939914,0.0,-0.006114,-0.268819,0.24393,0.187019,0.961165,-0.230081,-0.298192,-0.116795,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,1.0,0.0,-0.022907,0.022907,0.394341,1.712523,0.0,0.0,0.248604,0.0,0.333333,0.5,0.054765,0.0,0.0,-0.205145,4.898587e-16,-0.4,-0.5,-1.0,-0.976617,-0.8,-0.904347,-0.875,-0.333333,-0.8,-4.197114,-0.714286,-1.753579,-2.75,-8.0,-0.5,0.5,0.520366,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.26087,-1.064823,0.469401


#
---
#

# Joining All Data

In [548]:
train = pd.concat([pd.concat([x_train, y_train], axis=1), pd.concat([x_validate, y_validate], axis=1)]).sort_index()
train.head(3)

Unnamed: 0,SCALER_Annual_Income,SCALER_Credit_Score,SCALER_IsNull_Annual_Income,SCALER_Money_Handling_Level,SCALER_Money_Handling_Level1,SCALER_Money_Per_Head,SCALER_Growth,SCALER_Credit_by_Score,SCALER_Determinstic,SCALER_Growth1,SCALER_Feedback1,SCALER_Previous_Claims_MEDIAN_Premium_Amount,SCALER_IsNull_Health_Score,SCALER_Previous_Claims_MEAN_Premium_Amount,SCALER_Previous_Claims,SCALER_Previous_Claims_STD_Premium_Amount,SCALER_Previous_Claims_Q3_Premium_Amount,SCALER_Previous_Claims_Q1_Premium_Amount,SCALER_IsNull_Customer_Feedback,SCALER_Previous_Claims_MAX_Premium_Amount,SCALER_Feedback3,SCALER_IsNull_Previous_Claims,SCALER_IsNull_Marital_Status,SCALER_Health_Score,SCALER_Health_Risk_Score,SCALER_Feedback2,SCALER_CreditInsurance,SCALER_Sin_Year,SCALER_IsNull_Credit_Score,SCALER_Health_Age_Interaction,SCALER_Total_Nulls,SCALER_ENCODED_Policy_Start_Date_-_Year,SCALER_ENCODED_Policy_Start_Date_-_Quarter,SCALER_Feedback4,SCALER_IsNull_Number_of_Dependents,SCALER_IsNull_Occupation,SCALER_Health_Conscious_Level1,SCALER_Sin_Month,SCALER_Policy_Start_Date_-_Month,SCALER_Health_Conscious_Level,SCALER_Health_Conscious_Level_Q1_Premium_Amount,SCALER_Health_Conscious_Level_MEAN_Premium_Amount,SCALER_Health_Conscious_Level_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_MEAN_Premium_Amount,SCALER_Number_of_Dependents_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_Q1_Premium_Amount,SCALER_Number_of_Dependents_Q3_Premium_Amount,SCALER_Number_of_Dependents_STD_Premium_Amount,SCALER_Health_Conscious_Level_Q3_Premium_Amount,SCALER_Insurance_Duration_MEAN_Premium_Amount,SCALER_Insurance_Duration_MEDIAN_Premium_Amount,SCALER_Insurance_Duration_Q1_Premium_Amount,SCALER_Insurance_Duration_Q3_Premium_Amount,SCALER_Health_Conscious_Level_MAX_Premium_Amount,SCALER_Credit_Health_Score,SCALER_Occupation_Q3_Premium_Amount,SCALER_Occupation_MEAN_Premium_Amount,SCALER_Occupation_MAX_Premium_Amount,SCALER_Occupation_MEDIAN_Premium_Amount,SCALER_Occupation_Q1_Premium_Amount,SCALER_Previous_Claims_MIN_Premium_Amount,SCALER_Insurance_Duration_MAX_Premium_Amount,SCALER_ENCODED_Occupation_Self-Employed,SCALER_Age,SCALER_Insurance_Duration_STD_Premium_Amount,SCALER_Occupation_STD_Premium_Amount,Premium Amount
0,-0.391365,-0.978541,0.0,-0.498819,-0.174499,-0.069588,-0.323148,-1.087379,-0.065466,-0.257208,-0.405166,52.0,0.0,57.590036,1.0,30.472002,32.5,9.0,0.0,-4.5,0.0,0.0,0.0,-0.122232,0.122232,-0.725022,-0.347664,0.0,0.0,-0.600771,-0.5,0.333333,0.7,-0.422984,0.0,0.0,-0.014187,-1.469576e-15,1.2,0.0,0.0,0.023383,0.0,0.095653,0.0,0.666667,0.0,0.540228,0.285714,-0.739699,0.0,0.0,-0.125,-0.5,-0.501901,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.222222,1.0,-0.956522,-2.154674,0.0,2869.0
1,0.199672,0.403433,0.0,0.377393,0.111528,-0.04687,0.468853,0.556634,0.216832,0.044643,0.231761,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-0.519872,0.519872,0.173298,-0.524112,0.0,0.0,-0.389589,0.0,0.333333,0.5,-0.266712,0.0,1.0,-0.384303,0.0,0.0,-1.0,-0.75,-1.208313,-1.8,0.0,0.125,0.0,0.2,0.0,-1.714286,0.458999,1.5,3.0,0.625,0.833333,-0.28941,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,-0.086957,-0.781814,0.469401,1483.0
2,0.033638,0.137339,0.0,0.098674,0.032683,-0.137028,-0.264976,0.355987,0.516577,0.865019,0.698365,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.5,0.0,0.0,1.268215,-1.268215,1.181256,-0.334206,0.0,1.0,0.177856,0.0,0.333333,0.6,2.616414,0.0,0.0,0.136708,-7.347881e-16,0.6,0.0,0.0,0.023383,0.0,0.0,0.125,0.0,0.2,0.0,0.285714,-0.557842,0.0,-1.0,-0.875,-0.5,1.394712,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.333333,1.0,-0.782609,0.0,0.0,567.0


In [549]:
test.head(3)

Unnamed: 0,SCALER_Annual_Income,SCALER_Credit_Score,SCALER_IsNull_Annual_Income,SCALER_Money_Handling_Level,SCALER_Money_Handling_Level1,SCALER_Money_Per_Head,SCALER_Growth,SCALER_Credit_by_Score,SCALER_Determinstic,SCALER_Growth1,SCALER_Feedback1,SCALER_Previous_Claims_MEDIAN_Premium_Amount,SCALER_IsNull_Health_Score,SCALER_Previous_Claims_MEAN_Premium_Amount,SCALER_Previous_Claims,SCALER_Previous_Claims_STD_Premium_Amount,SCALER_Previous_Claims_Q3_Premium_Amount,SCALER_Previous_Claims_Q1_Premium_Amount,SCALER_IsNull_Customer_Feedback,SCALER_Previous_Claims_MAX_Premium_Amount,SCALER_Feedback3,SCALER_IsNull_Previous_Claims,SCALER_IsNull_Marital_Status,SCALER_Health_Score,SCALER_Health_Risk_Score,SCALER_Feedback2,SCALER_CreditInsurance,SCALER_Sin_Year,SCALER_IsNull_Credit_Score,SCALER_Health_Age_Interaction,SCALER_Total_Nulls,SCALER_ENCODED_Policy_Start_Date_-_Year,SCALER_ENCODED_Policy_Start_Date_-_Quarter,SCALER_Feedback4,SCALER_IsNull_Number_of_Dependents,SCALER_IsNull_Occupation,SCALER_Health_Conscious_Level1,SCALER_Sin_Month,SCALER_Policy_Start_Date_-_Month,SCALER_Health_Conscious_Level,SCALER_Health_Conscious_Level_Q1_Premium_Amount,SCALER_Health_Conscious_Level_MEAN_Premium_Amount,SCALER_Health_Conscious_Level_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_MEAN_Premium_Amount,SCALER_Number_of_Dependents_MEDIAN_Premium_Amount,SCALER_Number_of_Dependents_Q1_Premium_Amount,SCALER_Number_of_Dependents_Q3_Premium_Amount,SCALER_Number_of_Dependents_STD_Premium_Amount,SCALER_Health_Conscious_Level_Q3_Premium_Amount,SCALER_Insurance_Duration_MEAN_Premium_Amount,SCALER_Insurance_Duration_MEDIAN_Premium_Amount,SCALER_Insurance_Duration_Q1_Premium_Amount,SCALER_Insurance_Duration_Q3_Premium_Amount,SCALER_Health_Conscious_Level_MAX_Premium_Amount,SCALER_Credit_Health_Score,SCALER_Occupation_Q3_Premium_Amount,SCALER_Occupation_MEAN_Premium_Amount,SCALER_Occupation_MAX_Premium_Amount,SCALER_Occupation_MEDIAN_Premium_Amount,SCALER_Occupation_Q1_Premium_Amount,SCALER_Previous_Claims_MIN_Premium_Amount,SCALER_Insurance_Duration_MAX_Premium_Amount,SCALER_ENCODED_Occupation_Self-Employed,SCALER_Age,SCALER_Insurance_Duration_STD_Premium_Amount,SCALER_Occupation_STD_Premium_Amount
1200000,-0.602842,-0.2103,0.0,-0.617211,-0.524833,-0.49121,-0.486736,-0.797735,-0.510185,-0.468238,-0.497634,52.0,0.0,57.590036,1.0,30.472002,32.5,9.0,0.0,-4.5,0.0,1.0,1.0,-0.967448,0.967448,-0.566755,-0.837009,0.0,1.0,-0.85602,1.0,0.333333,0.5,-0.696361,0.0,0.0,-0.443763,0.0,0.0,-1.0,-0.75,-1.208313,-1.8,-1.073045,-1.25,-1.666667,-0.8,-0.459772,-1.714286,-1.48387,-2.75,-8.0,-0.625,0.833333,-0.872834,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,-0.666667,1.0,-0.565217,-0.13203,0.0
1200001,2.777975,-0.978541,0.0,1.573076,4.611736,2.288206,3.460514,-0.485437,3.457878,1.759873,5.498184,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.5,1.0,0.0,-0.643669,0.643669,0.261715,0.069533,1.0,0.0,-0.618053,0.0,0.666667,0.9,0.142893,0.0,0.0,-0.517579,4.898587e-16,-0.4,-1.5,-2.0,-1.12816,0.0,0.631238,0.25,0.666667,0.666667,3.658057,0.0,0.260301,1.0,-1.0,0.625,-2.166667,-0.805656,-0.142857,-0.619544,0.4,-0.333333,0.0,0.0,0.0,1.0,-0.434783,0.822189,0.0
1200002,-0.198907,0.939914,0.0,-0.006114,-0.268819,0.24393,0.187019,0.961165,-0.230081,-0.298192,-0.116795,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,1.0,0.0,-0.022907,0.022907,0.394341,1.712523,0.0,0.0,0.248604,0.0,0.333333,0.5,0.054765,0.0,0.0,-0.205145,4.898587e-16,-0.4,-0.5,-1.0,-0.976617,-0.8,-0.904347,-0.875,-0.333333,-0.8,-4.197114,-0.714286,-1.753579,-2.75,-8.0,-0.5,0.5,0.520366,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.26087,-1.064823,0.469401


In [550]:
df = pd.concat([train, test])
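
After this concatenation the test rows carry no target, so `Premium Amount` is `NaN` for them. A minimal sketch on toy frames (hypothetical values) of how the combined frame can be split again later, assuming the training target itself contains no missing values:

```python
import pandas as pd

# Toy stand-ins for the scaled train and test frames (hypothetical values).
toy_train = pd.DataFrame(
    {"SCALER_Age": [0.1, -0.3, 0.7], "Premium Amount": [2869.0, 1483.0, 567.0]},
    index=[0, 1, 2],
)
toy_test = pd.DataFrame({"SCALER_Age": [0.2, -0.5]}, index=[3, 4])

combined = pd.concat([toy_train, toy_test])

# Rows without a target are the test rows (assumption: no NaN targets in train).
is_test = combined["Premium Amount"].isna()
train_again = combined[~is_test]
test_again = combined[is_test].drop(columns="Premium Amount")
```

Splitting by the known row count, as the notebook does earlier with `iloc`, is an equally valid alternative when the row order is preserved.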

#
---
#

# Download the `Model Ready df`

In [551]:
df.to_csv("trainable_df.csv", index=False)