#
---
# Gist of this Notebook
<ol>
<li><strong>Converted <code>Policy Start Date</code>:</strong> Converted the "Policy Start Date" column to datetime objects.</li>

<li><strong>Analyzed columns:</strong> Analyzed and displayed which columns had missing data, the number of unique values in each column, and their datatypes.</li>

<li><strong>Feature Engineering :</strong> Generated the following new features:
    <ul>
        <li><strong>Health-Related Features:</strong>
            <ul>
                <li><code>Health Conscious Level</code>: Created by combining smoking, exercise, age, and health score using binning and summing.</li>
                <li><code>Health Conscious Level1</code>: Created by multiplying smoking, exercise, age and health score (using mapping).</li>
                <li><code>Health_Risk_Score</code>: Intended to combine the effects of smoking, exercise, and health score (as coded, only the health-score term is non-zero).</li>
                <li><code>Health_Age_Interaction</code> : Interaction between Age and Health Score.</li>
            </ul>
        </li>
        <li><strong>Financial Features:</strong>
            <ul>
                <li><code>Money Per Head</code>: Annual income divided by the number of dependents (zero dependents treated as one).</li>
                <li><code>Money Handling Level</code>: Annual income multiplied by credit score.</li>
                <li><code>Money Handling Level1</code>: Annual income divided by credit score.</li>
            </ul>
        </li>
          <li><strong>Growth Features:</strong>
            <ul>
               <li><code>Growth</code>: Annual income multiplied by an ordinal education level (High School = 1, Bachelor's = 2, Master's = 3, PhD = 4).</li>
               <li><code>Growth1</code>: Annual income divided by the same ordinal education level.</li>
            </ul>
        </li>
          <li><strong>Deterministic Feature:</strong>
                <ul>
                   <li><code>Determinstic</code>: Annual income divided by age.</li>
               </ul>
        </li>
        <li><strong>Date and Time Features:</strong>
            <ul>
                <li><code>Day_Name</code> : Day of the week the insurance policy was created.</li>
            </ul>
        </li>
         <li><strong>Credit and Insurance Features:</strong>
            <ul>
                <li><code>Credit by Score</code>: Credit score divided by previous claims (zero claims treated as one).</li>
                <li><code>CreditInsurance</code>: Credit score multiplied by insurance duration.</li>
                <li><code>Credit_Health_Score</code>: Credit score multiplied by health score.</li>
             </ul>
        </li>
        <li><strong>Customer Feedback Features:</strong>
            <ul>
                <li><code>Feedback1</code>: Annual income multiplied by a feedback weight (Poor = 2, Average = 4, Good = 8).</li>
                <li><code>Feedback2</code>: Credit score multiplied by the same feedback weight.</li>
                <li><code>Feedback3</code>: Previous claims multiplied by the same feedback weight.</li>
                <li><code>Feedback4</code>: Health score multiplied by the same feedback weight.</li>
            </ul>
        </li>
        <li><strong>Null Count Feature:</strong>
            <ul>
                 <li><code>Total Nulls</code>: Sum of all <code>IsNull_*</code> columns, indicating the amount of missing data in a row.</li>
             </ul>
        </li>
    </ul>
</li>
  <li><strong>Saved the dataframe:</strong> Wrote the modified dataframe, including all newly created features, to <code>EDAed_df.csv</code>.</li>
</ol>


#####
---
#

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import *
import xgboost as xgb

from sklearn.preprocessing import PowerTransformer


import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

In [2]:
df = pd.read_csv("cleaned_df.csv")

df["Policy Start Date"] = pd.to_datetime(df["Policy Start Date"])

In [3]:
df.shape

(2000000, 31)

In [4]:
df.head()

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0
3,21.0,Male,141855.0,Married,2.0,Bachelor's,Self-Employed,10.938144,Rural,Basic,1.0,0.0,367.0,1.0,2024-06-12 15:21:39.226954,Poor,Yes,Daily,Apartment,765.0,0,0,0,0,1,0,0,0,0,0,0
4,21.0,Male,39651.0,Single,1.0,Bachelor's,Self-Employed,20.376094,Rural,Premium,0.0,8.0,598.0,4.0,2021-12-01 15:21:39.252145,Poor,Yes,Weekly,House,2022.0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
nulls = []
nuniques = []
uniques = []
types = []

for i in df.columns:
    nulls.append(df[i].isnull().sum())
    nuniques.append(df[i].nunique())
    uniques.append(df[i].unique())
    types.append(df[i].dtype)


pd.DataFrame(
    {
        "Column" : df.columns,
        "Data Type" : types,
        "Nulls" : nulls,
        "No. of Uniques" : nuniques,
        "Uniques" : uniques
    }
).sort_values(by="Nulls", ascending=False)

Unnamed: 0,Column,Data Type,Nulls,No. of Uniques,Uniques
19,Premium Amount,float64,800000,4794,"[2869.0, 1483.0, 567.0, 765.0, 2022.0, 3202.0,..."
1,Gender,object,0,2,"[Female, Male]"
0,Age,float64,0,47,"[19.0, 39.0, 23.0, 21.0, 29.0, 41.0, 48.0, 44...."
3,Marital Status,object,0,3,"[Married, Divorced, Single]"
4,Number of Dependents,float64,0,5,"[1.0, 3.0, 2.0, 0.0, 4.0]"
5,Education Level,object,0,4,"[Bachelor's, Master's, High School, PhD]"
2,Annual Income,float64,0,97970,"[10049.0, 31678.0, 25602.0, 141855.0, 39651.0,..."
7,Health Score,float64,0,933976,"[22.59876067181393, 15.569730989408043, 47.177..."
8,Location,object,0,3,"[Urban, Rural, Suburban]"
9,Policy Type,object,0,3,"[Premium, Comprehensive, Basic]"
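
The per-column loop in the cell above can also be written with vectorized pandas calls. The following is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are assumptions for illustration, not the real data); it omits the `Uniques` column for brevity.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [19.0, 39.0, 23.0],
    "Gender": ["Female", "Female", "Male"],
    "Premium Amount": [2869.0, None, 567.0],
})

# Each call returns a Series indexed by column name, so they align automatically.
summary = pd.DataFrame({
    "Data Type": toy.dtypes,
    "Nulls": toy.isnull().sum(),
    "No. of Uniques": toy.nunique(),
}).sort_values(by="Nulls", ascending=False)

print(summary)
```

Note that `nunique()` ignores missing values by default, so a column's null count and unique count are independent of each other.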


---
#
# **Feature Engineering**
---

# Health Conscious Level

**Exercise Frequency** --> `['Weekly', 'Monthly', 'Daily', 'Rarely']`

**Smoking Status** --> `['No', 'Yes']`

**Health Score** 
- Poor --> (< 16.285503904803008)
- Average --> (>= 16.285503904803008 & < 33.959695457149195)
- Good --> (>= 33.959695457149195)

**Age** 
- Poor --> (< 30)
- Average --> (>= 30 & < 53)
- Good --> (>= 53)

In [6]:
health_min_df = pd.DataFrame()

In [7]:
health_min_df["smoke"] = df["Smoking Status"].replace({"Yes" : 0, "No" : 1})

In [8]:
health_min_df["ex"] = df["Exercise Frequency"].replace({"Rarely" : 0, "Monthly" : 1, "Weekly" : 2, "Daily" : 3})

In [9]:
bins = [0, 30, 53, float('inf')]
labels = [0, 1, 2]

health_min_df["age"] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

In [10]:
bins = [0, 16.285503904803008, 33.959695457149195, float('inf')]
labels = [0, 1, 2]

health_min_df["health"] = pd.cut(df['Health Score'], bins=bins, labels=labels, right=False)

In [11]:
df["Health Conscious Level"] = health_min_df.sum(axis=1)
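
As a worked check of the binning and summing steps, here is a small sketch on three made-up rows (the `toy` frame is an assumption for illustration). It uses `.map` instead of `.replace` and casts the `pd.cut` output to `int`, which is a stylistic choice rather than the notebook's exact code. Because `right=False` makes each bin left-closed, an age of exactly 30 lands in bin 1 and an age of exactly 53 lands in bin 2.

```python
import pandas as pd

# Illustrative rows only; ages 30 and 53 sit exactly on the bin edges.
toy = pd.DataFrame({
    "Smoking Status": ["No", "Yes", "Yes"],
    "Exercise Frequency": ["Weekly", "Monthly", "Daily"],
    "Age": [19.0, 30.0, 53.0],
    "Health Score": [22.6, 15.6, 47.2],
})

age_bins = [0, 30, 53, float("inf")]
health_bins = [0, 16.285503904803008, 33.959695457149195, float("inf")]

parts = pd.DataFrame({
    "smoke": toy["Smoking Status"].map({"Yes": 0, "No": 1}),
    "ex": toy["Exercise Frequency"].map(
        {"Rarely": 0, "Monthly": 1, "Weekly": 2, "Daily": 3}
    ),
    # right=False -> intervals are [0, 30), [30, 53), [53, inf)
    "age": pd.cut(toy["Age"], bins=age_bins, labels=[0, 1, 2], right=False).astype(int),
    "health": pd.cut(
        toy["Health Score"], bins=health_bins, labels=[0, 1, 2], right=False
    ).astype(int),
})

level = parts.sum(axis=1)
print(level.tolist())
```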

In [12]:
df["Health Conscious Level1"] = df["Smoking Status"].replace({"Yes" : 2, "No" : 4}) * df["Exercise Frequency"].replace({"Rarely" : 2, "Monthly" : 4, "Weekly" : 8, "Daily" : 16}) * df['Age'] * df['Health Score']

In [13]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138


#
---
#

# Money Per Head

In [14]:
df["Money Per Head"] = df["Annual Income"] / df["Number of Dependents"].where(df["Number of Dependents"] != 0, 1)
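
The `.where(cond, 1)` call keeps the dependent count where it is non-zero and substitutes 1 elsewhere, which avoids division by zero. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Annual Income": [10049.0, 31678.0, 39651.0],
    "Number of Dependents": [1.0, 3.0, 0.0],
})

# Keep the original count where it is non-zero; use 1 where it is 0.
divisor = toy["Number of Dependents"].where(toy["Number of Dependents"] != 0, 1)
toy["Money Per Head"] = toy["Annual Income"] / divisor

print(toy)
```

With this guard, a policyholder with zero dependents gets the same value as one with a single dependent, so the feature cannot distinguish those two cases on its own.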

In [15]:
df.head(3) 

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0


#
---
#

# Money Handling Level

In [16]:
df["Money Handling Level"] = df["Annual Income"] * df["Credit Score"]

In [17]:
df["Money Handling Level1"] = df["Annual Income"] / df["Credit Score"]

In [18]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494


#
---
#

# Growth

In [19]:
df["Growth"] = df["Education Level"].replace({"High School" : 1, "Bachelor's" : 2, "Master's" : 3, "PhD" : 4}) * df["Annual Income"]

In [20]:
df["Growth1"] = df["Annual Income"] / df["Education Level"].replace({"High School" : 1, "Bachelor's" : 2, "Master's" : 3, "PhD" : 4})
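
Both growth features use the same ordinal encoding of education, so the mapping can be defined once and reused. The sketch below assumes a hypothetical constant name `EDU_RANK` and uses `.map` in place of `.replace`; the sample rows mirror the first three rows shown in the outputs.

```python
import pandas as pd

# Hypothetical name for the ordinal encoding used by both features.
EDU_RANK = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}

toy = pd.DataFrame({
    "Education Level": ["Bachelor's", "Master's", "High School"],
    "Annual Income": [10049.0, 31678.0, 25602.0],
})

rank = toy["Education Level"].map(EDU_RANK)
toy["Growth"] = rank * toy["Annual Income"]
toy["Growth1"] = toy["Annual Income"] / rank

print(toy)
```

Unlike `.replace`, `.map` turns any label missing from the dictionary into `NaN`, which makes unexpected categories easier to spot.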

In [21]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0


#
---
#

# Determinstic

In [22]:
df["Determinstic"] = df["Annual Income"] * (1 / df["Age"])

In [23]:
df.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435


#
---
#

# Some Miscellaneous Features

In [24]:
df["Day_Name"] = df["Policy Start Date"].dt.day_name()

In [25]:
df["Credit by Score"] = df["Credit Score"]/df["Previous Claims"].where(df["Previous Claims"] != 0, 1)

In [26]:
df['CreditInsurance'] = df['Credit Score'] * df['Insurance Duration']

In [27]:
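# NOTE: 'Smoker', 'Low' and 'Medium' never occur in these columns (values are Yes/No and Rarely/Monthly/Weekly/Daily), so this reduces to (100 - Health Score) / 20.
df['Health_Risk_Score'] = df['Smoking Status'].apply(lambda x: 1 if x == 'Smoker' else 0) + df['Exercise Frequency'].apply(lambda x: 1 if x == 'Low' else (0.5 if x == 'Medium' else 0)) + (100 - df['Health Score']) / 20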
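
The labels tested in the `Health_Risk_Score` cell (`'Smoker'`, `'Low'`, `'Medium'`) do not occur in the data, whose values are `Yes`/`No` and `Rarely`/`Monthly`/`Weekly`/`Daily`. A possible variant that uses the actual labels is sketched below. The correspondence chosen here (`Yes` as smoker, `Rarely` as low, `Monthly` as medium) is an assumption, not something the notebook states, and the resulting values differ from the `Health_Risk_Score` column shown in the outputs.

```python
import pandas as pd

# Assumed correspondence between the intended and the actual labels.
SMOKE_RISK = {"Yes": 1.0, "No": 0.0}
EXERCISE_RISK = {"Rarely": 1.0, "Monthly": 0.5, "Weekly": 0.0, "Daily": 0.0}

toy = pd.DataFrame({
    "Smoking Status": ["No", "Yes"],
    "Exercise Frequency": ["Weekly", "Monthly"],
    "Health Score": [22.598761, 15.569731],
})

toy["Health_Risk_Score"] = (
    toy["Smoking Status"].map(SMOKE_RISK)
    + toy["Exercise Frequency"].map(EXERCISE_RISK)
    + (100 - toy["Health Score"]) / 20
)

print(toy)
```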

In [28]:
df['Credit_Health_Score'] = df['Credit Score'] * df['Health Score']
df['Health_Age_Interaction'] = df['Health Score'] * df['Age']

#
---
#

# Customer Feedback

In [29]:
df["Feedback1"] = df["Annual Income"] * df["Customer Feedback"].replace({"Poor" : 2, "Average" : 4, "Good" : 8})

In [30]:
df["Feedback2"] = df["Credit Score"] * df["Customer Feedback"].replace({"Poor" : 2, "Average" : 4, "Good" : 8})

In [31]:
df["Feedback3"] = df["Previous Claims"] * df["Customer Feedback"].replace({"Poor" : 2, "Average" : 4, "Good" : 8})

In [32]:
df["Feedback4"] = df["Health Score"] * df["Customer Feedback"].replace({"Poor" : 2, "Average" : 4, "Good" : 8})
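
The four feedback features share one weight mapping, so they can be generated in a loop. This is a sketch under stated assumptions: the names `FEEDBACK_WEIGHT` and `SOURCES` are hypothetical, and the two sample rows mirror the first two rows of the outputs.

```python
import pandas as pd

# Hypothetical names; the weights are the ones used in the cells above.
FEEDBACK_WEIGHT = {"Poor": 2, "Average": 4, "Good": 8}
SOURCES = {
    "Feedback1": "Annual Income",
    "Feedback2": "Credit Score",
    "Feedback3": "Previous Claims",
    "Feedback4": "Health Score",
}

toy = pd.DataFrame({
    "Customer Feedback": ["Poor", "Average"],
    "Annual Income": [10049.0, 31678.0],
    "Credit Score": [372.0, 694.0],
    "Previous Claims": [2.0, 1.0],
    "Health Score": [22.598761, 15.569731],
})

weight = toy["Customer Feedback"].map(FEEDBACK_WEIGHT)
for new_col, source_col in SOURCES.items():
    toy[new_col] = toy[source_col] * weight

print(toy[list(SOURCES)])
```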

#
---
# Weekend

In [40]:
# df['is_weekend'] = df['Day_Name'].isin(['Sunday', 'Saturday']).astype('int')

In [41]:
# df["Day_Name"].value_counts()

This step was done in the "3. EDA" sheet.
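
For reference, a weekend flag can also be derived directly from the datetime column without going through day names. A minimal sketch on made-up rows (the `toy` frame is an assumption; the dates are taken from the sample rows):

```python
import pandas as pd

toy = pd.DataFrame({
    "Policy Start Date": pd.to_datetime([
        "2023-12-23 15:21:39.134960",
        "2023-06-12 15:21:39.111551",
        "2023-09-30 15:21:39.221386",
    ])
})

toy["Day_Name"] = toy["Policy Start Date"].dt.day_name()
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
toy["is_weekend"] = (toy["Policy Start Date"].dt.dayofweek >= 5).astype(int)

print(toy)
```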

#
---
# Downloading the Df

In [38]:
df.head()

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount,IsNull_Age,IsNull_Annual Income,IsNull_Marital Status,IsNull_Number of Dependents,IsNull_Occupation,IsNull_Health Score,IsNull_Previous Claims,IsNull_Vehicle Age,IsNull_Credit Score,IsNull_Insurance Duration,IsNull_Customer Feedback,Health Conscious Level,Health Conscious Level1,Money Per Head,Money Handling Level,Money Handling Level1,Growth,Growth1,Determinstic,Day_Name,Credit by Score,CreditInsurance,Health_Risk_Score,Credit_Health_Score,Health_Age_Interaction,Feedback1,Feedback2,Feedback3,Feedback4,is_weekend
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0,0,0,0,0,0,0,0,0,0,0,0,4,13740.046488,10049.0,3738228.0,27.013441,20098.0,5024.5,528.894737,Saturday,186.0,1860.0,3.870062,8406.73897,429.376453,20098.0,744.0,4.0,45.197521,1
1,39.0,Female,31678.0,Divorced,3.0,Master's,Unemployed,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0,0,0,0,0,1,0,0,0,0,0,0,2,4857.756069,10559.333333,21984532.0,45.645533,95034.0,10559.333333,812.25641,Monday,694.0,1388.0,4.221513,10805.393307,607.219509,126712.0,2776.0,4.0,62.278924,0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,632.0,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0,0,0,0,0,0,0,0,0,1,0,0,4,17361.338138,8534.0,16180464.0,40.509494,25602.0,25602.0,1113.130435,Saturday,632.0,1896.0,2.641123,29816.21115,1085.083634,204816.0,5056.0,8.0,377.420394,1
3,21.0,Male,141855.0,Married,2.0,Bachelor's,Self-Employed,10.938144,Rural,Basic,1.0,0.0,367.0,1.0,2024-06-12 15:21:39.226954,Poor,Yes,Daily,Apartment,765.0,0,0,0,0,1,0,0,0,0,0,0,3,7350.432875,70927.5,52060785.0,386.525886,283710.0,70927.5,6755.0,Wednesday,367.0,367.0,4.453093,4014.298906,229.701027,283710.0,734.0,2.0,21.876288,0
4,21.0,Male,39651.0,Single,1.0,Bachelor's,Self-Employed,20.376094,Rural,Premium,0.0,8.0,598.0,4.0,2021-12-01 15:21:39.252145,Poor,Yes,Weekly,House,2022.0,0,0,0,0,0,0,0,0,0,0,0,3,6846.367459,39651.0,23711298.0,66.30602,79302.0,19825.5,1888.142857,Wednesday,598.0,2392.0,3.981195,12184.903989,427.897966,79302.0,1196.0,0.0,40.752187,0


In [34]:
df.dtypes

Age                                   float64
Gender                                 object
Annual Income                         float64
Marital Status                         object
Number of Dependents                  float64
Education Level                        object
Occupation                             object
Health Score                          float64
Location                               object
Policy Type                            object
Previous Claims                       float64
Vehicle Age                           float64
Credit Score                          float64
Insurance Duration                    float64
Policy Start Date              datetime64[ns]
Customer Feedback                      object
Smoking Status                         object
Exercise Frequency                     object
Property Type                          object
Premium Amount                        float64
IsNull_Age                              int64
IsNull_Annual Income              

#
---
#

# Creating total nulls column

In [41]:
null_cols = ['IsNull_Age', 'IsNull_Annual Income', 'IsNull_Marital Status', 'IsNull_Number of Dependents', 'IsNull_Occupation', 'IsNull_Health Score', 'IsNull_Previous Claims', 'IsNull_Vehicle Age', 'IsNull_Credit Score', 'IsNull_Insurance Duration', 'IsNull_Customer Feedback']
df['Total Nulls'] = df[null_cols].sum(axis=1)
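
Since every indicator column shares the `IsNull_` prefix, the columns can also be selected by name pattern rather than listed by hand. A minimal sketch on a made-up frame (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [19.0, 39.0, 23.0],
    "IsNull_Age": [0, 1, 0],
    "IsNull_Occupation": [0, 1, 1],
})

# filter(like=...) keeps only the columns whose name contains the substring.
null_flags = toy.filter(like="IsNull_")
toy["Total Nulls"] = null_flags.sum(axis=1)

print(toy)
```

The trade-off is that any future column containing `IsNull_` would be included automatically, which may or may not be desired.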

#
---
#

# Changing datatypes for lower storage

| **Data Type**           | **Description**                   | **Range** / **Supported Values**                                   | **Memory (per value)**    |
|--------------------------|------------------------------------|-------------------------------------------------------------------|---------------------------|
| **int8**                | Signed Integer                    | -128 to 127                                                      | 1 byte (8 bits)          |
| **int16**               | Signed Integer                    | -32,768 to 32,767                                                | 2 bytes (16 bits)        |
| **int32**               | Signed Integer                    | -2,147,483,648 to 2,147,483,647                                  | 4 bytes (32 bits)        |
| **int64**               | Signed Integer                    | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807          | 8 bytes (64 bits)        |
| **uint8**               | Unsigned Integer                  | 0 to 255                                                         | 1 byte (8 bits)          |
| **uint16**              | Unsigned Integer                  | 0 to 65,535                                                      | 2 bytes (16 bits)        |
| **uint32**              | Unsigned Integer                  | 0 to 4,294,967,295                                               | 4 bytes (32 bits)        |
| **uint64**              | Unsigned Integer                  | 0 to 18,446,744,073,709,551,615                                  | 8 bytes (64 bits)        |
| **float32**             | Floating-point Number             | Approx. ±3.4e38 (7 digits precision)                             | 4 bytes (32 bits)        |
| **float64**             | Floating-point Number             | Approx. ±1.8e308 (15 digits precision)                           | 8 bytes (64 bits)        |
| **complex64**           | Complex Number                    | 32-bit float for real and imaginary parts                        | 8 bytes (64 bits)        |
| **complex128**          | Complex Number                    | 64-bit float for real and imaginary parts                        | 16 bytes (128 bits)      |
| **bool**                | Boolean                           | `True`, `False`                                                  | 1 byte (8 bits)          |
| **object**              | General Python Objects            | Any Python object (e.g., strings, lists)                         | Varies (depends on data) |
| **string**              | Optimized String Data             | Text (Unicode)                                                   | Varies (depends on data) |
| **category**            | Categorical Data                  | Finite set of discrete values (stored as integers)               | 1–8 bytes (optimized)    |
| **datetime64[ns]**      | Datetime                          | `1677-09-21` to `2262-04-11`                                     | 8 bytes (64 bits)        |
| **timedelta64[ns]**     | Time Difference                   | Approx. ±292 years at nanosecond resolution                      | 8 bytes (64 bits)        |
| **Sparse[int]**         | Sparse Integer Array              | As per integer dtype, with reduced storage for the fill value (0 by default) | Depends on sparsity      |
| **Sparse[float]**       | Sparse Floating Array             | As per float dtype, with reduced storage for the fill value (`NaN` by default) | Depends on sparsity      |
| **Int64Dtype**          | Nullable Integer                  | Similar to `int64` with `pd.NA` for missing values               | 9 bytes (8-byte value + 1-byte mask) |
| **Float64Dtype**        | Nullable Float                    | Similar to `float64` with `pd.NA` for missing values             | 9 bytes (8-byte value + 1-byte mask) |
| **BooleanDtype**        | Nullable Boolean                  | Similar to `bool` with `pd.NA` for missing values                | 2 bytes (1-byte value + 1-byte mask) |

### Key Notes:
1. **Object dtype**:
   - Memory varies significantly as it stores pointers to Python objects.
   - Strings in `object` dtype take about as much memory as the Python-backed `string` dtype; the PyArrow-backed `string[pyarrow]` variant is usually smaller.

2. **Sparse dtypes**:
   - Save memory for arrays with many zeroes or missing values.
   - Actual memory depends on sparsity.

3. **Category dtype**:
   - Efficient storage for repetitive data (e.g., text categories).
   - Internally stored as integers with a mapping to category labels.
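
To make the notes above concrete, the sketch below compares memory before and after converting a few columns to smaller dtypes. The `toy` frame and the specific target dtypes are assumptions for illustration; which dtypes are safe on the real data depends on the value ranges and precision needed.

```python
import pandas as pd

# Made-up data: 1000 rows with a 0/1 flag, a low-cardinality text column, and a float.
toy = pd.DataFrame({
    "IsNull_Age": [0, 1, 0, 0] * 250,
    "Gender": ["Female", "Male", "Male", "Female"] * 250,
    "Age": [19.0, 39.0, 23.0, 21.0] * 250,
})

before = toy.memory_usage(deep=True).sum()

slim = toy.astype({
    "IsNull_Age": "int8",    # 0/1 flags fit comfortably in one byte
    "Gender": "category",    # few distinct labels, stored as integer codes
    "Age": "float32",        # whole-number ages lose nothing at this precision
})

after = slim.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

`memory_usage(deep=True)` is needed here because the default shallow measurement counts only the pointers for text columns, not the strings themselves.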


In [35]:
# df[['IsNull_Age', 'IsNull_Annual Income', 'IsNull_Marital Status', 'IsNull_Number of Dependents', 'IsNull_Occupation', 'IsNull_Health Score', 'IsNull_Previous Claims', 'IsNull_Vehicle Age', 'IsNull_Credit Score', 'IsNull_Insurance Duration', 'IsNull_Customer Feedback']] = df[['IsNull_Age', 'IsNull_Annual Income', 'IsNull_Marital Status', 'IsNull_Number of Dependents', 'IsNull_Occupation', 'IsNull_Health Score', 'IsNull_Previous Claims', 'IsNull_Vehicle Age', 'IsNull_Credit Score', 'IsNull_Insurance Duration', 'IsNull_Customer Feedback']].astype(bool) 

In [42]:
nulls = []
nuniques = []
uniques = []
types = []

for i in df.columns:
    nulls.append(df[i].isnull().sum())
    nuniques.append(df[i].nunique())
    uniques.append(df[i].unique())
    types.append(df[i].dtype)


pd.DataFrame(
    {
        "Column" : df.columns,
        "Data Type" : types,
        "Nulls" : nulls,
        "No. of Uniques" : nuniques,
        "Uniques" : uniques
    }
).sort_values(by="Nulls", ascending=False)

Unnamed: 0,Column,Data Type,Nulls,No. of Uniques,Uniques
19,Premium Amount,float64,800000,4794,"[2869.0, 1483.0, 567.0, 765.0, 2022.0, 3202.0,..."
0,Age,float64,0,47,"[19.0, 39.0, 23.0, 21.0, 29.0, 41.0, 48.0, 44...."
2,Annual Income,float64,0,97970,"[10049.0, 31678.0, 25602.0, 141855.0, 39651.0,..."
3,Marital Status,object,0,3,"[Married, Divorced, Single]"
4,Number of Dependents,float64,0,5,"[1.0, 3.0, 2.0, 0.0, 4.0]"
5,Education Level,object,0,4,"[Bachelor's, Master's, High School, PhD]"
6,Occupation,object,0,3,"[Self-Employed, Unemployed, Employed]"
7,Health Score,float64,0,933976,"[22.59876067181393, 15.569730989408043, 47.177..."
8,Location,object,0,3,"[Urban, Rural, Suburban]"
9,Policy Type,object,0,3,"[Premium, Comprehensive, Basic]"


# **EDAed_df.csv ---> 754MB**

#
---
#

In [43]:
df.to_csv("EDAed_df.csv", index=False)
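
One thing to keep in mind when reloading this file: CSV stores no dtype information, so `Policy Start Date` comes back as text unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer in place of the real file (the `toy` frame is an assumption for illustration):

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Policy Start Date": pd.to_datetime([
        "2023-12-23 15:21:39.134960",
        "2023-06-12 15:21:39.111551",
    ]),
    "Age": [19.0, 39.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # date column is read back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Policy Start Date"])  # date column restored

print(plain.dtypes)
print(parsed.dtypes)
```

The same applies to any `category` or reduced-width numeric dtypes: they would need to be reapplied after `read_csv`, or the data saved in a format that preserves dtypes.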