# **Project Name**    - **Classification - Health Insurance Cross Sell Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 - Gautam Verma**


# **Project Summary -**

#### Project Summary: Insurance Cross-Sell Prediction
The goal of this project is to predict the likelihood of existing insurance customers purchasing a vehicle insurance policy using machine learning techniques. The raw dataset has 12 columns: an id column plus Gender, Age, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage, and Response. Response is the target variable and indicates whether a customer purchased the vehicle insurance.

Data Exploration and Preprocessing:
The initial exploration of the data involved understanding the distribution and relationships within the dataset. Key preprocessing steps included checking for missing values (the dataset has none) and encoding categorical variables. Categorical variables such as Gender, Vehicle_Age, and Vehicle_Damage were label-encoded, while variables like Region_Code and Policy_Sales_Channel were one-hot encoded to ensure compatibility with machine learning algorithms. Numerical features were standardized to bring all variables to a comparable scale.
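
The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the `sample` frame holds made-up values, and the choice of `ColumnTransformer`, `OneHotEncoder`, and `StandardScaler` from scikit-learn is an assumption about how the one-hot encoding and standardization might be wired together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample using the dataset's column names (values are illustrative only)
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes", "No"],
    "Region_Code": [28.0, 3.0, 28.0, 11.0],
    "Policy_Sales_Channel": [26.0, 152.0, 26.0, 152.0],
    "Age": [44, 23, 61, 35],
    "Annual_Premium": [40454.0, 2630.0, 38294.0, 28619.0],
    "Vintage": [217, 183, 27, 203],
})

# Label-encode the low-cardinality categoricals with explicit mappings
sample["Gender"] = sample["Gender"].map({"Male": 0, "Female": 1})
sample["Vehicle_Age"] = sample["Vehicle_Age"].map(
    {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
)
sample["Vehicle_Damage"] = sample["Vehicle_Damage"].map({"No": 0, "Yes": 1})

one_hot_cols = ["Region_Code", "Policy_Sales_Channel"]
numeric_cols = ["Age", "Annual_Premium", "Vintage"]

preprocessor = ColumnTransformer(
    transformers=[
        # One indicator column per distinct code; unseen codes become all zeros
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), one_hot_cols),
        # Rescale numeric features to zero mean and unit variance
        ("scale", StandardScaler(), numeric_cols),
    ],
    remainder="passthrough",  # keep the label-encoded columns unchanged
)

features = preprocessor.fit_transform(sample)
print(features.shape)
```

On the real data, the transformer would normally be fitted on the training split only and then applied to the test split, so that scaling statistics do not leak from test to train.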

Exploratory Data Analysis (EDA):
EDA was conducted using visualizations to identify trends and patterns within the data. Histograms, bar charts, and heatmaps were used to visualize the distributions and correlations. Notable observations included:

*   Customers who were previously insured were less likely to purchase vehicle insurance, which was confirmed through the Chi-square test of independence.
*   Annual_Premium varied significantly across Region_Code values, justifying the need for feature engineering and hypothesis testing.

Hypothesis Testing:
A Chi-square test of independence was employed to examine the relationship between categorical variables such as Previously_Insured and Response. This test confirmed a significant association between the insurance status of customers and their likelihood of purchasing vehicle insurance, guiding feature selection for the predictive model.

Feature Engineering:
To enhance model performance, categorical data was encoded into numerical formats. Label encoding and one-hot encoding were applied as appropriate. This transformation ensured that the machine learning algorithms could effectively process and interpret the data.

Model Development:
Several machine learning models were developed and evaluated, including Logistic Regression, KMeans Clustering, and Random Forest Classifier. The dataset was split into training and testing sets to evaluate the performance of each model. Performance metrics such as accuracy, precision, recall, and F1-score were utilized to assess the effectiveness of the models. Hyperparameter tuning techniques, including grid search and random search, were employed to optimize model parameters and improve performance.
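
The split, evaluate, and tune workflow described above might look like the sketch below. The data comes from `make_classification` as a synthetic, imbalanced stand-in for the encoded insurance features, the `evaluate` helper is hypothetical, and the parameter grid and `scoring="f1"` are illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the encoded insurance features (an assumption)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# Stratify so both splits keep the same share of positive responses
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def evaluate(model, X_eval, y_eval):
    """Return the four metrics named in the summary for a fitted model."""
    pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "precision": precision_score(y_eval, pred, zero_division=0),
        "recall": recall_score(y_eval, pred, zero_division=0),
        "f1": f1_score(y_eval, pred, zero_division=0),
    }


# Baseline: logistic regression
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_scores = evaluate(log_reg, X_test, y_test)

# Grid search over a deliberately small, illustrative random forest grid
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, scoring="f1", cv=3,
)
search.fit(X_train, y_train)
tuned_scores = evaluate(search.best_estimator_, X_test, y_test)

print("Logistic regression:", baseline_scores)
print("Tuned random forest:", search.best_params_, tuned_scores)
```

With a minority positive class, accuracy alone can look high for a model that rarely predicts a purchase, which is why the sketch reports precision, recall, and F1 alongside it.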

Results:
The Random Forest Classifier achieved the highest performance, with significant improvements observed after hyperparameter tuning. The Logistic Regression model also provided valuable insights due to its interpretability, while KMeans Clustering was explored for customer segmentation, offering a different perspective on the data.
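
Because KMeans is unsupervised, it does not predict Response directly; one way to use it for segmentation is to cluster on customer features and then compare response rates per cluster. The sketch below assumes made-up customers, two features, and `n_clusters=2`, none of which are taken from the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up customers: two loose groups in Age and Annual_Premium (illustrative only)
customers = pd.DataFrame({
    "Age": np.concatenate([rng.normal(25, 3, 100), rng.normal(50, 5, 100)]),
    "Annual_Premium": np.concatenate(
        [rng.normal(25000, 3000, 100), rng.normal(40000, 4000, 100)]
    ),
    "Response": np.concatenate(
        [rng.binomial(1, 0.05, 100), rng.binomial(1, 0.25, 100)]
    ),
})

# Scale first: KMeans uses Euclidean distance, so unscaled premiums would dominate
scaled = StandardScaler().fit_transform(customers[["Age", "Annual_Premium"]])

# n_clusters=2 is an assumption for this toy data; Response is not used for fitting
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["Segment"] = kmeans.fit_predict(scaled)

# Compare segment profiles and response rates after clustering
profile = customers.groupby("Segment").agg(
    customers=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_premium=("Annual_Premium", "mean"),
    response_rate=("Response", "mean"),
)
print(profile)
```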

Conclusion:
This project successfully developed a predictive model for insurance cross-sell, leveraging machine learning to identify key factors influencing customer decisions. The findings can help the insurance company strategically target potential customers, enhancing cross-selling opportunities and overall profitability. Future work could involve incorporating additional features, such as customer interaction data, and exploring more advanced modeling techniques to further refine the predictions and improve the model's accuracy.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to predict whether existing insurance customers will purchase a vehicle insurance policy. Utilizing a dataset containing customer demographics, policy details, and previous insurance status, the project involves data preprocessing, feature engineering, and developing predictive models. By applying Logistic Regression, Random Forest Classifier, and KMeans Clustering, and using hyperparameter tuning, the goal is to enhance prediction accuracy and help the insurance company effectively target potential buyers, thereby improving cross-selling opportunities.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


from sklearn.model_selection import train_test_split

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv('/content/drive/MyDrive/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
# Drop the identifier column; it is a row label, not a customer attribute
df = df.drop('id', axis=1)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.shape)
print(df.columns)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values as a bar chart of null counts per column
df.isnull().sum().plot(kind='bar', color='r')
plt.ylabel('Missing value count')
plt.show()

### What did you know about your dataset?

The dataset contains 381,109 entries with 11 columns, including customer demographics, policy details, and previous insurance status. It features a mix of categorical and numerical data types, with no missing values, and is ready for preprocessing and analysis for predicting vehicle insurance purchase responses.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Male', 'Female'],
        y=[
            len(df[df['Gender']=='Male']),
            len(df[df['Gender']=='Female'])
        ],
        name='Train Gender',
        text = [
            str(round(100 * len(df[df['Gender']=='Male']) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Gender']=='Female']) / len(df), 2)) + '%'
        ],
        textposition='auto'
    )

]

for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  +1
    )

fig.update_layout(
    title_text='Train Gender column',
    height=400,
    width=700
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'],
        y=[
            len(df[df['Driving_License']==1]),
            len(df[df['Driving_License']==0])
        ],
        name='Train Driving_License',
        text = [
            str(round(100 * len(df[df['Driving_License']==1]) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Driving_License']==0]) / len(df), 2)) + '%'
        ],
        textposition='auto'
    )
]
for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  +1
    )

fig.update_layout(
    title_text='Train Driving_License column',
    height=400,
    width=700
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'],
        y=[
            len(df[df['Previously_Insured']==1]),
            len(df[df['Previously_Insured']==0])
        ],
        name='Train Previously_Insured',
        text = [
            str(round(100 * len(df[df['Previously_Insured']==1]) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Previously_Insured']==0]) / len(df), 2)) + '%'
        ],
        textposition='auto'
    )
]
for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  +1
    )

fig.update_layout(
    title_text='Train Previously_Insured column',
    height=400,
    width=700
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'],
        y=[
            len(df[df['Vehicle_Damage']=='Yes']),
            len(df[df['Vehicle_Damage']=='No'])
        ],
        name='Train Vehicle_Damage',
        text = [
            str(round(100 * len(df[df['Vehicle_Damage']=='Yes']) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Vehicle_Damage']=='No']) / len(df), 2)) + '%'
        ],
        textposition='auto'
    )
]
for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  + 1
    )

fig.update_layout(
    title_text='Train Vehicle_Damage column',
    height=400,
    width=700
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['> 2 Years', '1-2 Year', '< 1 Year'],
        y=[
            len(df[df['Vehicle_Age']=='> 2 Years']),
            len(df[df['Vehicle_Age']=='1-2 Year']),
            len(df[df['Vehicle_Age']=='< 1 Year'])
        ],
        name='Train Vehicle_Age',
        text = [
            str(round(100 * len(df[df['Vehicle_Age']=='> 2 Years']) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Vehicle_Age']=='1-2 Year']) / len(df), 2)) + '%',
            str(round(100 * len(df[df['Vehicle_Age']=='< 1 Year']) / len(df), 2)) + '%'
        ],
        textposition='auto'
    )
]
for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  + 1
    )

fig.update_layout(
    title_text='Train Vehicle_Age column',
    height=400,
    width=700
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Histogram(
        x=df['Age'],
        name='Train Age'
    )
]

for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i // 2) + 1,
        (i % 2)  + 1
    )

fig.update_layout(
    title_text='Train Age column distribution',
    height=500,
    width=900
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code


fig = px.histogram(df, x='Annual_Premium', title='Annual Premium Distribution',
                   labels={'Annual_Premium': 'Annual Premium'}, nbins=50, height=500, width=800)

# Update layout for better fitting
fig.update_layout(
    xaxis_title='Annual Premium',
    yaxis_title='Count',
    bargap=0.1,  # Adjusts the gap between bars
    title_x=0.5  # Centers the title
)

fig.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
df.head(1)

In [None]:
# Chart - 8 visualization code
fig = go.Figure(data=[go.Histogram(x=df['Vintage'])])

# Update layout for better readability
fig.update_layout(
    title='Distribution of Vintage',
    xaxis_title='Vintage',
    yaxis_title='Count',
    bargap=0.2
)

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
region_counts = df['Region_Code'].value_counts().reset_index()
region_counts.columns = ['Region_Code', 'Count']

# Create a bar chart for the Region_Code column
fig = go.Figure(data=[
    go.Bar(x=region_counts['Region_Code'], y=region_counts['Count'])
])

# Update layout for better readability
fig.update_layout(
    title='Distribution of Region_Code',
    xaxis_title='Region Code',
    yaxis_title='Count',
    bargap=0.2
)

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
fig = px.histogram(
    df,
    "Age",
    color='Response',
    nbins=100,
    title='Age & Response distribution',
    width=700,
    height=500
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

fig = px.histogram(
    df[df['Response'] == 1],
    "Age",
    nbins=100,
    title='Age distribution for positive response',
    width=700,
    height=500
)

fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
gender_response_counts = df.groupby(['Gender', 'Response']).size().reset_index(name='Count')

# Create separate dataframes for each response type
response_0 = gender_response_counts[gender_response_counts['Response'] == 0]
response_1 = gender_response_counts[gender_response_counts['Response'] == 1]

# Create a grouped bar chart for the Gender/Response dependencies
fig = go.Figure(data=[
    go.Bar(name='Response: 0', x=response_0['Gender'], y=response_0['Count']),
    go.Bar(name='Response: 1', x=response_1['Gender'], y=response_1['Count'])
])

# Update layout for better readability
fig.update_layout(
    title='Gender vs. Response Distribution',
    xaxis_title='Gender',
    yaxis_title='Count',
    barmode='group'  # Group bars together
)

# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

fig = px.histogram(
    df,
    "Annual_Premium",
    color='Response',
    nbins=100,
    title='Annual_Premium & Response distribution',
    width=700,
    height=500
)
fig.show()

In [None]:
# Chart - 14 visualization code

fig = px.histogram(
    df,
    "Vintage",
    color='Response',
    nbins=100,
    title='Vintage & Response distribution',
    width=700,
    height=500
)

fig.show()

## ***Feature Engineering & Data Pre-processing***

### Categorical Encoding

In [None]:
# Encoding Categorical Data
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df['Vehicle_Age'] = df['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
df['Vehicle_Damage'] = df['Vehicle_Damage'].map({'No': 0, 'Yes': 1})

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code


# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create a heatmap using Plotly Express
fig = px.imshow(correlation_matrix,
                labels=dict(x="Features", y="Features", color="Correlation"),
                x=correlation_matrix.columns,
                y=correlation_matrix.columns,
                color_continuous_scale='Viridis')

# Update layout for better readability
fig.update_layout(
    title='Correlation Heatmap',
    xaxis_title='Features',
    yaxis_title='Features'
)

# Show the plot
fig.show()

In [None]:
for col in df.columns:
    if col == 'Response':
        continue
    print(col, df[col].corr(df['Response']))

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 16 - Scatter Plot

In [None]:


fig = px.scatter(
    df,
    x="Annual_Premium",
    y="Age",
    color="Response",
    width=600,
    height=600,
    title='Annual_premium vs Age scatter'
)
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

# ***Hypothesis Testing***

Previously Insured and Response Relationship:

Hypothesis:


 **H0**: There is no association between Previously Insured status and Response.

 **Ha**: There is an association between Previously Insured status and Response.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['Previously_Insured'], df['Response'])
print("Contingency Table:")
print(contingency_table)

In [None]:
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f'Chi-square statistic: {chi2}')
print(f'p-value: {p}')
print(f'Degrees of freedom: {dof}')
print("Expected frequencies:")
print(expected)


In [None]:
alpha = 0.05

if p < alpha:
    print("Reject the null hypothesis (H0). There is a significant association between Previously Insured status and Response.")
else:
    print("Fail to reject the null hypothesis (H0). There is no significant association between Previously Insured status and Response.")

##### Which statistical test have you done to obtain P-Value?

We used the Chi-square test of independence for this hypothesis because it is a statistical test specifically designed to determine whether there is a significant association between two categorical variables. In our case, the two categorical variables are "Previously Insured" and "Response".
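
To make the test statistic concrete, the sketch below computes the Pearson chi-square by hand: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The `chi_square_independence` helper and the `toy_table` counts are hypothetical and only illustrate the arithmetic. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default on 2×2 tables, so its statistic will be somewhat smaller than this uncorrected value unless `correction=False` is passed.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up counts: rows = Previously_Insured (0, 1), columns = Response (0, 1)
toy_table = [[30, 20],
             [45, 5]]

statistic, dof = chi_square_independence(toy_table)
print(f"chi-square = {statistic:.2f}, dof = {dof}")  # chi-square = 12.00, dof = 1
```

For the toy table, every expected count is either 37.5 or 12.5, giving cell contributions of 1.5, 4.5, 1.5, and 4.5, which sum to 12.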

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
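
The save and reload steps in this section could be sketched with `joblib` as shown below. The model here is trained on synthetic data from `make_classification`, and the file name `best_model.joblib` is hypothetical; in the notebook, the fitted best model and a path on the mounted drive would take their place.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (an assumption, not the insurance dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X[:150], y[:150])

# Hypothetical file name; a temporary directory keeps the example self-cleaning
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.joblib")
    joblib.dump(model, model_path)

    # Reload and check the restored model on rows that were held out of training
    restored = joblib.load(model_path)
    unseen = X[150:]
    same_predictions = (restored.predict(unseen) == model.predict(unseen)).all()

print("Predictions match after reload:", same_predictions)
```

A saved model should generally be loaded with the same scikit-learn version that produced it, and any preprocessing applied before training has to be saved and reapplied as well.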

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***