<a href="https://colab.research.google.com/github/anirudhawagh/HEALTH-INSURANCE-CROSS-SELL-PREDICTION-by-Aniruddha-wagh/blob/main/HEALTH_INSURANCE_CROSS_SELL_PREDICTION_by_Aniruddha.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - HEALTH INSURANCE CROSS SELL PREDICTION



##### **Project Type** - Classification
##### **Contribution** - Individual
##### **Team Member 1 -** Aniruddha Narayan Wagh


# **Project Summary -**

In this project, our goal was to help an insurance company predict whether its past policyholders would be interested in purchasing vehicle insurance. By understanding customer behavior, the company could refine its strategies and optimize revenue generation. The dataset contains 381,109 rows and 12 columns, including the binary target "Response", which indicates customer interest in vehicle insurance.

We began by checking for null and duplicate values, finding none and thus avoiding the need for data cleaning. Next, we normalized the numerical columns to ensure consistency.

In the exploratory data analysis, we categorized age into three groups (YoungAge, MiddleAge, and OldAge) to gain insights into age-related preferences. We also categorized Region_Code and Policy_Sales_Channel to extract valuable information. Through plots, we explored independent features and their relationship with the target variable.

For feature selection, we employed Kendall's rank correlation coefficient for numerical features and the Mutual Information technique for categorical features. These methods helped identify the most relevant features associated with the target variable.

To predict customer interest, we implemented various supervised machine learning algorithms, including Decision Tree Classifier, AdaBoost, LightGBM, Bagging Classifier, Naive Bayes, and Logistic Regression. Hyperparameter tuning was applied to optimize model performance and prevent overfitting.



# **GitHub Link -**

https://github.com/anirudhawagh/HEALTH-INSURANCE-CROSS-SELL-PREDICTION-by-Aniruddha-wagh

# **Problem Statement**


Our client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
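The risk-pooling arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the customer count, premium, and cover are the figures quoted in the paragraph, while `pooled_premiums` and `break_even_claim` are hypothetical names introduced here.

```python
# Figures taken from the example in the text
n_customers = 100
premium = 5_000        # Rs. per customer per year
max_cover = 200_000    # Rs. upper limit per hospitalisation

# Total premium collected from the pool in one year
pooled_premiums = n_customers * premium

# Average claim the pool can afford if 2 or 3 customers are hospitalised
for n_claims in (2, 3):
    break_even_claim = pooled_premiums / n_claims
    print(f"{n_claims} claims: a pool of Rs. {pooled_premiums:,} covers "
          f"an average claim of up to Rs. {break_even_claim:,.0f}")
```

With two claims the pool covers the full Rs. 200,000 for each; with three, it covers them only if the average claim stays below the cover limit.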

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider will pay compensation (called the ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
#@title

# Basic
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# ML Models
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb

# Evaluation Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import log_loss

# Hyper Parameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

# Miscellaneous
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

dataset link = https://drive.google.com/file/d/1AW5Gz6IqktDOoIjaBeWvy-HMaF5Y84sX/view

### Dataset Loading

In [None]:
# Load Dataset
database ="/content/drive/MyDrive/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv"
data_df =pd.read_csv(database)

### Dataset First View

In [None]:
# Dataset First Look
data_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Index:", data_df.index)
print('\n')
print("Columns:", data_df.columns)
print('\n')
print("Number of rows:", data_df.shape[0])
print("Number of columns:", data_df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
data_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = data_df.duplicated().sum()

print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data_df.isnull().sum()

In [None]:
# Missing values per column, sorted from most to least
miss_values = data_df.isnull().sum().sort_values(ascending=False)
miss_values  # count of null values in each column

### What did you know about your dataset?

Dataset Size: The dataset consists of 381,109 rows and 12 columns.

Column Information: The dataset contains columns with the following names: 'id', 'Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', and 'Response'. Each column represents a different feature or attribute of the data.

Data Characteristics: The dataset contains various types of information such as demographic details (gender, age), driving-related information (driving license, region code), insurance-related details (previously insured, vehicle age, vehicle damage), policy information (annual premium, policy sales channel), and historical information (vintage). The 'Response' column likely indicates the target variable or the outcome of interest.

Data Quality: The dataset does not have any duplicate values or missing/null values, as indicated by the counts of duplicates and missing values being zero.
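The size and quality checks summarized here could be bundled into one helper. This is a minimal sketch: `summarize_quality` and `toy_df` are hypothetical names introduced for illustration, and the toy values are made up rather than taken from the real dataset.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return basic size and data-quality counts for a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }

# Toy frame standing in for data_df (hypothetical values):
# the last two rows are identical, and one 'Age' entry is missing
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "Age": [25, None, 40, 40],
    "Response": [0, 1, 0, 0],
})

summary = summarize_quality(toy_df)
print(summary)
```

Calling `summarize_quality(data_df)` on the real data would be expected to report zero duplicates and zero missing values if the checks above hold.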

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns = data_df.columns.tolist()

print("Dataset Columns:")
for column in columns:
    print(column)

In [None]:
# Dataset Describe
data_df.describe()

### Variables Description

Health Insurance Dataset
Columns:
ID: Unique identifier for the Customer.

Age: Age of the Customer.

Gender: Gender of the Customer.

Driving_License: 0 for customer not having DL, 1 for customer having DL.

Region_Code: Unique code for the region of the customer.

Previously_Insured: 0 for customer not having vehicle insurance, 1 for customer having vehicle insurance.

Vehicle_Age: Age of the vehicle.

Vehicle_Damage: 1 for customer whose vehicle was damaged in the past, 0 for customer whose vehicle was not damaged in the past.

Annual_Premium: The amount customer needs to pay as premium in the year.

Policy_Sales_Channel: Anonymized code for the channel used to reach out to the customer, i.e. different agents, over mail, over phone, in person, etc.

Vintage: Number of Days, Customer has been associated with the company.

Response (Dependent Feature): 1 for Customer is interested, 0 for Customer is not interested.

### Check Unique Values for each variable.
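No code accompanies this step in the notebook. Below is a minimal sketch of one way to do it, using a small hypothetical frame (`toy_df`) in place of `data_df`; the cutoff of 3 for printing the actual values is an arbitrary choice.

```python
import pandas as pd

# Toy frame standing in for data_df (hypothetical values)
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Driving_License": [1, 1, 0, 1],
    "Vintage": [217, 183, 27, 203],
})

# Number of distinct values per column
unique_counts = toy_df.nunique()
print(unique_counts)

# Show the actual values for low-cardinality columns only
for column in toy_df.columns:
    if unique_counts[column] <= 3:
        print(column, "->", sorted(toy_df[column].unique().tolist()))
```

On the real data, the same two steps would be applied to `data_df` instead of `toy_df`.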

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
data_df.head()

In [None]:
# Create a new column 'Renewed_Insurance' holding 'Response' as a Yes/No label
data_df['Renewed_Insurance'] = data_df['Response'].map({1: 'Yes', 0: 'No'})

# Create custom 5-year age bins; right=False makes each bin left-closed, e.g. [20, 25) -> '20-24'
age_bins = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
age_labels = [f'{age}-{age+4}' for age in age_bins[:-1]]
data_df['Age_Group'] = pd.cut(data_df['Age'], bins=age_bins, right=False, labels=age_labels)



# Display the updated DataFrame with the new columns
print(data_df)

In [None]:
print(data_df['Age_Group'].unique())
print(data_df['Renewed_Insurance'].unique())

In [None]:
# Define the age categories
def categorize_age(age):
    """Map an age to 'Young' (under 35), 'Middle Age' (35 to 64), or 'Senior' (above 64)."""
    if age < 35:
        return 'Young'
    elif age <= 64:
        return 'Middle Age'
    else:
        return 'Senior'

# Create the new column 'Age_Category'
data_df['Age_Category'] = data_df['Age'].apply(categorize_age)

# Display the updated DataFrame
print(data_df)


In [None]:
# List of region codes
region_codes = [28, 3, 11, 41, 33, 6, 35, 50, 15, 45, 8, 36, 30, 26, 16, 47, 48, 19, 39, 23, 37, 5, 17, 2, 7, 29, 46, 27, 25, 13, 18, 20, 49, 22, 44, 0, 9, 31, 12, 34, 21, 10, 14, 38, 24, 40, 43, 32, 4, 51, 42, 1, 52]

# Divide the region codes into 4 sets of near-equal size (split by list order, not geography)
num_sets = 4
set_size = -(-len(region_codes) // num_sets)  # ceiling division so that no code is left over
sets_of_region_codes = [region_codes[i:i+set_size] for i in range(0, len(region_codes), set_size)]

# Create a mapping of region names to region codes
region_mapping = {
    'North': sets_of_region_codes[0],
    'South': sets_of_region_codes[1],
    'East': sets_of_region_codes[2],
    'West': sets_of_region_codes[3]
}

# Invert the mapping to look up the region name for each code
code_to_region = {code: name for name, codes in region_mapping.items() for code in codes}

# Create a new column 'Region' in the dataset based on the region codes
data_df['Region'] = data_df['Region_Code'].apply(lambda code: code_to_region.get(code, 'Unknown'))

# Display the updated DataFrame with the new 'Region' column
print(data_df)


In [None]:
# Dataset Columns
columns = data_df.columns.tolist()

print("Dataset Columns:")
for column in columns:
    print(column)

In [None]:
data_df.head()

### What all manipulations have you done and insights you found?

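We added four derived columns: `Renewed_Insurance` (the `Response` flag as a Yes/No label), `Age_Group` (5-year age bins), `Age_Category` (Young: under 35, Middle Age: 35 to 64, Senior: above 64), and `Region` (region codes grouped into four named sets: North, South, East, and West). No rows were dropped in this step.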
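The age categories defined in the wrangling code can also be built without a Python-level function call per row. The sketch below uses `pd.cut` on a small hypothetical Series (`ages`), not the real `data_df`; the bin edges are chosen to reproduce the `categorize_age` thresholds for whole-number ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering each boundary of the three categories
ages = pd.Series([20, 34, 35, 64, 65, 85], name="Age")

# Left-closed bins: [-inf, 35) -> Young, [35, 65) -> Middle Age, [65, inf) -> Senior
age_category = pd.cut(
    ages,
    bins=[-np.inf, 35, 65, np.inf],
    labels=["Young", "Middle Age", "Senior"],
    right=False,
)
print(age_category.tolist())
```

The result is an ordered categorical, which keeps the categories in a sensible order on plot axes.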

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Calculate the counts of customers without (0) and with (1) insurance, ordered to match the pie labels
insurance_counts = data_df['Previously_Insured'].value_counts().sort_index()

# Plot the pie chart
plt.figure(figsize=(6, 6))
plt.pie(insurance_counts, labels=['Not Insured', 'Insured'], autopct='%1.1f%%', startangle=90, colors=['lightcoral', 'lightblue'])
plt.title('Proportion of Customers with Insurance')
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.

plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Create custom 5-year age bins that extend past the oldest customer
age_bins = list(range(20, int(data_df['Age'].max()) + 6, 5))

# Use pd.cut to create age groups; right=False makes each bin left-closed, e.g. [20, 25) -> '20-24'
age_labels = [f'{age}-{age+4}' for age in age_bins[:-1]]
data_df['Age_Group'] = pd.cut(data_df['Age'], bins=age_bins, right=False, labels=age_labels)

# Create a figure with two subplots
plt.figure(figsize=(18, 8))

# Subplot 1: Distribution of Age Groups
plt.subplot(1, 2, 1)
sns.countplot(x='Age_Group', data=data_df)
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Age Groups', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)

# Subplot 2: Distribution of Age Categories
plt.subplot(1, 2, 2)
sns.countplot(x='Age_Category', data=data_df, palette='tab20')
plt.xlabel('Age Category', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Distribution of Age Categories', fontsize=12, fontweight='bold')
plt.xticks(rotation=45)

# Adjust the layout of the subplots
plt.tight_layout()

# Show the plots
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Create a figure with a single row and two columns
fig, axes = plt.subplots(1, 2, figsize=(18, 8))  # Set the same figure size for both subplots

# Filter data for renewals (Response == 1)
renewals_data = data_df[data_df['Response'] == 1]

# Subplot 1: Count of Renewals by Age Category
sns.countplot(x='Age_Category', data=renewals_data, palette='tab20', ax=axes[0])
axes[0].set_xlabel('Age Category', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Renewals Count by Age Category', fontsize=10, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)

# Subplot 2: Count of Renewals by Age Group (rows without an age group are excluded)
filtered_data_df = data_df.dropna(subset=['Age_Group'])
age_group_renewals = pd.crosstab(filtered_data_df['Age_Group'], filtered_data_df['Renewed_Insurance'])
age_group_renewals[['No', 'Yes']].plot(kind='bar', stacked=True, ax=axes[1], width=0.9)  # Adjust width of bars
axes[1].set_xlabel('Age Group', fontsize=14)
axes[1].set_ylabel('Count of Renewals', fontsize=14)
axes[1].tick_params(axis='x', rotation=45)
axes[1].legend(title='Renewed Insurance', loc='upper right', labels=['No', 'Yes'])

# Set the titles separately and adjust the layout
axes[1].set_title('Renewals Count by Age Group', fontsize=10, fontweight='bold')

# Adjust the layout of the figure to ensure everything fits properly
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Plot the data
plt.figure(figsize=(18, 8))  # Adjust the figure size as needed

# Subplot 1: Distribution of Vehicle Damage in Each Age Group
plt.subplot(1, 2, 1)
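# Sort the hue levels so their order matches the legend labels set below
sns.countplot(x='Age_Group', hue='Vehicle_Damage', hue_order=sorted(data_df['Vehicle_Damage'].unique()), data=data_df)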
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Vehicle Damage in Each Age Group', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.legend(title='Vehicle Damage', labels=['Not Damaged', 'Damaged'], loc='upper right')

# Subplot 2: Distribution of Vehicle Damage in Each Age Category
plt.subplot(1, 2, 2)
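# Sort the hue levels so their order matches the legend labels set below
sns.countplot(x='Age_Category', hue='Vehicle_Damage', hue_order=sorted(data_df['Vehicle_Damage'].unique()), data=data_df)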
plt.xlabel('Age Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Vehicle Damage in Each Age Category', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.legend(title='Vehicle Damage', labels=['Not Damaged', 'Damaged'], loc='upper right')

# Adjust the layout of the subplots
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Plot the data
plt.figure(figsize=(18, 8))  # Adjust the figure size as needed

# Subplot 1: Distribution of Insurance Renewals in Each Age Group
plt.subplot(1, 2, 1)
# Fix the hue order so it matches the legend labels set below
sns.countplot(x='Age_Group', hue='Renewed_Insurance', hue_order=['No', 'Yes'], data=data_df)
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Insurance Renewals in Each Age Group', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.legend(title='Renewed Insurance', labels=['Not Renewed', 'Renewed'], loc='upper right')

# Subplot 2: Distribution of Insurance Renewals in Each Age Category
plt.subplot(1, 2, 2)
# Fix the hue order so it matches the legend labels set below
sns.countplot(x='Age_Category', hue='Renewed_Insurance', hue_order=['No', 'Yes'], data=data_df)
plt.xlabel('Age Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Insurance Renewals in Each Age Category', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.legend(title='Renewed Insurance', labels=['Not Renewed', 'Renewed'], loc='upper right')

# Adjust the layout of the subplots
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(x='Age_Group', hue='Response', data=data_df)
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Response Distribution by Age Group', fontsize=15, fontweight='bold')
plt.legend(title='Response', labels=['Not Interested', 'Interested'], loc='upper right')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create subplots for the response distribution by previous insurance status
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Response Distribution by Previous Insurance Status and Region', fontsize=15, fontweight='bold')

# Iterate through each region and plot the count plot
for i, region in enumerate(['North', 'South', 'East', 'West']):
    ax = axes[i//2, i%2]

    region_data = data_df[data_df['Region'] == region]
    sns.countplot(x='Previously_Insured', hue='Response', data=region_data, ax=ax)

    ax.set_title(f'Response Distribution by Previous Insurance Status in {region} Region', fontsize=12, fontweight='bold')
    ax.set_xlabel('Previously Insured', fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
    ax.legend(title='Response', labels=['Not Interested', 'Interested'], loc='upper right')
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['No', 'Yes'])

    ax.tick_params(axis='both', which='major', labelsize=8)
    ax.tick_params(axis='both', which='minor', labelsize=8)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
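# Sort the hue levels so their order matches the legend labels set below
sns.countplot(x='Vehicle_Age', hue='Vehicle_Damage', hue_order=sorted(data_df['Vehicle_Damage'].unique()), data=data_df)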
plt.xlabel('Vehicle Age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Vehicle Damage Distribution by Vehicle Age', fontsize=15, fontweight='bold')
plt.legend(title='Vehicle Damage', labels=['No', 'Yes'], loc='upper right')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 9 - Distribution of Insurance Response
plt.figure(figsize=(8, 5))
sns.countplot(x='Response', data=data_df)
plt.xlabel('Response', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Insurance Response', fontsize=15, fontweight='bold')
plt.xticks(ticks=[0, 1], labels=['Not Interested', 'Interested'])
plt.show()

# Filter out rows with unknown region values
filtered_data = data_df[data_df['Region'] != 'Unknown']

# Function to create a pie chart for response distribution in each region
def plot_region_response_pie(region_response_counts):
    # Create a figure and axes
    fig, axes = plt.subplots(1, len(region_response_counts), figsize=(20, 5))

    # Iterate through each region and plot the pie chart
    for i, (region, response_counts) in enumerate(region_response_counts.iterrows()):
        ax = axes[i]
        ax.pie(response_counts, labels=response_counts.index, autopct='%.1f%%', shadow=True, startangle=90)
        ax.set_title(f'Response Distribution in {region} Region', fontsize=12, fontweight='bold')

    plt.tight_layout()
    plt.show()

# Calculate the response distribution for each region
region_response_counts = filtered_data.groupby(['Region', 'Response']).size().unstack()

# Call the function to create the pie charts
plot_region_response_pie(region_response_counts)


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter out rows with unknown region values
filtered_data = data_df[data_df['Region'] != 'Unknown']

# Calculate each region's share of customers (counting non-null 'Vintage' entries), in percent
region_vintage_percentage = filtered_data.groupby('Region')['Vintage'].count() / len(filtered_data) * 100

# Create a bar plot to visualize the percentage distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=region_vintage_percentage.index, y=region_vintage_percentage.values, palette='muted')
plt.title('Percentage of Customers by Region', fontsize=15, fontweight='bold')
plt.xlabel('Region', fontsize=12)
plt.ylabel('Percentage', fontsize=12)
plt.xticks(rotation=45)
plt.ylim(0, max(region_vintage_percentage.values) + 2)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Filter the data to exclude rows with missing values in the 'Region' column
filtered_data_df = data_df.dropna(subset=['Region'])

# Group the filtered data by 'Region' and calculate the average response rate for each region
region_response_rates = filtered_data_df.groupby('Region')['Response'].mean()

# Remove the 'Unknown' region from the Series if it exists
region_response_rates = region_response_rates.drop(index='Unknown', errors='ignore')

# Create a bar plot to visualize the average response rate for each region
plt.figure(figsize=(10, 6))
sns.barplot(x=region_response_rates.index, y=region_response_rates.values, palette='muted')
plt.xlabel('Region', fontsize=14)
plt.ylabel('Average Response Rate', fontsize=14)
plt.title('Average Response Rate by Region', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
import matplotlib.pyplot as plt

# Data for male and female response counts, ordered 0 (Not Interested), 1 (Interested) to match the pie labels
male_response_ct = data_df[data_df['Gender'] == 'Male']['Response'].value_counts().sort_index()
female_response_ct = data_df[data_df['Gender'] == 'Female']['Response'].value_counts().sort_index()

# Create subplots for the pie charts
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Plot the pie chart for males
axes[0].pie(male_response_ct, autopct='%.1f%%', shadow=True, startangle=70, labels=['Not Interested', 'Interested'])
axes[0].set_title('Response for Males', fontsize=10, fontweight='bold')
axes[0].legend(loc='upper right', title='Response', fontsize=8)

# Plot the pie chart for females
axes[1].pie(female_response_ct, autopct='%.1f%%', shadow=True, startangle=70, labels=['Not Interested', 'Interested'])
axes[1].set_title('Response for Females', fontsize=10, fontweight='bold')
axes[1].legend(loc='upper right', title='Response', fontsize=8)

# Adjust spacing between subplots
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming you have loaded your dataset into the "data_df" DataFrame

# Select columns for the correlation matrix
columns_for_correlation = ['Age', 'Annual_Premium', 'Vintage', 'Response']

# Calculate the correlation matrix
correlation_matrix = data_df[columns_for_correlation].corr()

# Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap', fontsize=15, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming you have loaded your dataset into the "data_df" DataFrame

# Select columns for the pair plot
columns_for_pairplot = ['Age', 'Annual_Premium', 'Vintage', 'Response']

# Create a pair plot
sns.pairplot(data_df[columns_for_pairplot], hue='Response', diag_kind='hist', palette='tab10')
plt.suptitle('Pair Plot of Selected Variables', y=1.02, fontsize=15, fontweight='bold')  # y > 1 keeps the title clear of the top row
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

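The three statements tested below are:

1. The response rate is different between male and female customers.
2. The average annual premium for customers who have previously insured their vehicles is different from those who haven't.
3. There is a correlation between the age of the customer and the annual premium they pay.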

### Hypothetical Statement - 1
Statement: The response rate is different between male and female customers.


#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): Response rate is the same between male and female customers.

Alternative Hypothesis (H1): Response rate is different between male and female customers.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load your dataset into the "data_df" DataFrame
# Assuming you have columns "Gender" and "Response"

# Create a contingency table
contingency_table = pd.crosstab(data_df['Gender'], data_df['Response'])

# Perform the chi-squared test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

A chi-squared test of independence (`scipy.stats.chi2_contingency`) on the Gender × Response contingency table.

##### Why did you choose the specific statistical test?

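Both Gender and Response are categorical variables, and the chi-squared test of independence checks whether the distribution of one categorical variable differs across the levels of another.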
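With hundreds of thousands of rows, even a tiny difference between groups can yield a very small p-value, so it may help to report an effect size alongside the test. The sketch below computes Cramér's V on a hypothetical 2×2 table (the counts are made up, not taken from the dataset); the significance level `alpha = 0.05` is a conventional choice assumed here, not one stated in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = Gender, columns = Response (0, 1)
table = pd.DataFrame(
    [[900, 100], [850, 150]],
    index=["Female", "Male"],
    columns=[0, 1],
)

# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V: effect size for a chi-squared test, between 0 and 1
n = table.to_numpy().sum()
min_dim = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

alpha = 0.05  # assumed significance level
print(f"p-value = {p_value:.4g}, reject H0: {p_value < alpha}")
print(f"Cramér's V = {cramers_v:.3f}")
```

A V close to 0 indicates a weak association even when the p-value is far below `alpha`.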

### Hypothetical Statement - 2
Statement: The average annual premium for customers who have previously insured their vehicles is different from those who haven't.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): Average annual premium is the same for customers with and without previous insurance.

Alternative Hypothesis (H1): Average annual premium is different for customers with and without previous insurance.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Separate data into two groups based on Previously_Insured
insured_premium = data_df[data_df['Previously_Insured'] == 1]['Annual_Premium']
not_insured_premium = data_df[data_df['Previously_Insured'] == 0]['Annual_Premium']

# Perform independent t-test
t_statistic, p_value = ttest_ind(insured_premium, not_insured_premium)

print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

An independent two-sample t-test (`scipy.stats.ttest_ind`).

##### Why did you choose the specific statistical test?

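Annual_Premium is a continuous variable and Previously_Insured splits customers into two independent groups, so a two-sample t-test is a natural way to compare the group means.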
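By default `ttest_ind` assumes that both groups have the same variance. If that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic data; `group_a` and `group_b` are hypothetical samples, not the notebook's premium columns.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic premiums for two groups with different sizes and spreads (hypothetical data)
group_a = rng.normal(loc=30_000, scale=5_000, size=500)
group_b = rng.normal(loc=31_000, scale=15_000, size=200)

# Student's t-test: assumes equal variances (the scipy default)
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

On the real data, the same call would take `insured_premium` and `not_insured_premium` as its two arguments.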

### Hypothetical Statement - 3
Statement: There is a correlation between the age of the customer and the annual premium they pay.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no correlation between age and annual premium.

Alternative Hypothesis (H1): There is a correlation between age and annual premium.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# Calculate Pearson correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(data_df['Age'], data_df['Annual_Premium'])

print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

A Pearson correlation test (`scipy.stats.pearsonr`).

##### Why did you choose the specific statistical test?

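Age and Annual_Premium are both numerical, and Pearson's correlation coefficient measures the strength of the linear relationship between two numerical variables; its p-value tests the null hypothesis of zero correlation.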
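Pearson's coefficient captures linear association only and can be sensitive to skew and outliers. A rank-based coefficient such as Spearman's is one way to cross-check the result. The sketch below runs both on synthetic data; `age` and `premium` are hypothetical arrays generated for illustration, not columns of `data_df`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)

# Synthetic ages and a right-skewed premium that tends to rise with age (hypothetical data)
age = rng.integers(20, 86, size=1_000)
premium = np.exp(age / 20 + rng.normal(0, 0.3, size=1_000)) * 1_000

# Linear (Pearson) and rank-based (Spearman) correlation, each with its p-value
r_pearson, p_pearson = pearsonr(age, premium)
rho_spearman, p_spearman = spearmanr(age, premium)

print(f"Pearson r    = {r_pearson:.3f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho_spearman:.3f} (p = {p_spearman:.3g})")
```

If the two coefficients differ markedly on the real data, the relationship is probably monotonic but not linear.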

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
missing_values = data_df.isna().sum()
print(missing_values)

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in the dataset, so no missing value imputation technique was needed.

### 2. Handling Outliers

Outlier Treatment Technique 1: Capping (IQR-based)

In [None]:
# Calculate the IQR for 'Age' column
Q1_age = data_df['Age'].quantile(0.25)
Q3_age = data_df['Age'].quantile(0.75)
IQR_age = Q3_age - Q1_age

# Define the upper and lower bounds for capping
upper_bound_age = Q3_age + 1.5 * IQR_age
lower_bound_age = Q1_age - 1.5 * IQR_age

# Apply capping to 'Age' column: values outside the IQR bounds are set to the nearest bound
data_df['Age'] = data_df['Age'].clip(lower=lower_bound_age, upper=upper_bound_age)


Outlier Treatment Technique 2: Flooring

In [None]:
import numpy as np

# Lower bounds used for flooring (values below a bound are raised to that bound)
lower_bound_age = 20  # lower bound for the 'Age' column
lower_bound_premium = 2630  # lower bound for the 'Annual_Premium' column

# Apply flooring to 'Age' column
data_df['Age'] = np.where(data_df['Age'] < lower_bound_age, lower_bound_age, data_df['Age'])

# Apply flooring to 'Annual_Premium' column
data_df['Annual_Premium'] = np.where(data_df['Annual_Premium'] < lower_bound_premium, lower_bound_premium, data_df['Annual_Premium'])


Outlier Treatment Technique 3: Log Transformation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Original distribution, plotted before any transformation is applied
plt.figure(figsize=(10, 6))
sns.histplot(data=data_df, x='Annual_Premium', bins=30, kde=True, color='blue', label='Original')
plt.title('Distribution of Annual Premium (Before Transformation)')
plt.xlabel('Annual Premium')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Apply the logarithmic transformation once and store it in a new column,
# so the raw 'Annual_Premium' values stay available for comparison
data_df['Annual_Premium_Log'] = np.log1p(data_df['Annual_Premium'])

# Transformed distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=data_df, x='Annual_Premium_Log', bins=30, kde=True, color='orange', label='Transformed')
plt.title('Distribution of Annual Premium (After Logarithmic Transformation)')
plt.xlabel('Log(Annual Premium + 1)')
plt.ylabel('Frequency')
plt.legend()
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer: Three outlier treatment techniques were applied: capping, flooring, and log transformation. None of them changed the dataset substantially, which suggests that the numerical columns 'Age' and 'Annual_Premium' were already reasonably well distributed and did not contain extreme outliers. The techniques were applied as a safeguard to keep the analysis consistent, even though their effect on the data was small.
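
One way to check a statement like "no substantial changes" is to count how many values fall outside the IQR bounds before and after capping. The sketch below does this on a small toy series. The helper names count_iqr_outliers and cap_iqr and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying strictly outside the IQR bounds."""
    lower, upper = iqr_bounds(series, k)
    return int(((series < lower) | (series > upper)).sum())

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Set values outside the IQR bounds to the nearest bound."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)

# Toy 'Age' values with one obvious outlier (illustrative only)
toy_age = pd.Series([20, 22, 25, 27, 30, 31, 33, 35, 40, 120], name="Age")

outliers_before = count_iqr_outliers(toy_age)
toy_age_capped = cap_iqr(toy_age)
outliers_after = count_iqr_outliers(toy_age_capped)

print("Outliers before capping:", outliers_before)
print("Outliers after capping:", outliers_after)
```

Running the same count on data_df['Age'] and data_df['Annual_Premium'] before treatment would show directly whether the columns contain outliers under the 1.5 * IQR rule.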

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
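
A minimal sketch of one possible encoding approach is shown below. The column names Gender, Vehicle_Age and Vehicle_Damage and their category labels are assumptions about the dataset, since they do not appear in this section. Binary columns are mapped to 0/1, and the ordered vehicle age groups are mapped to increasing integers.

```python
import pandas as pd

# Toy frame; column names and category labels are assumed for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Binary columns: map the two categories to 0 and 1
toy["Gender"] = toy["Gender"].map({"Female": 0, "Male": 1})
toy["Vehicle_Damage"] = toy["Vehicle_Damage"].map({"No": 0, "Yes": 1})

# Ordinal column: preserve the natural order of the age groups
vehicle_age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
toy["Vehicle_Age"] = toy["Vehicle_Age"].map(vehicle_age_order)

print(toy)
```

For nominal columns with many unordered categories, one-hot encoding with pd.get_dummies is a common alternative to integer mapping.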

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.
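
The project summary names Kendall's rank correlation for numerical features and Mutual Information for categorical features. The sketch below shows how both scores can be computed against a binary target. The synthetic features and all variable names are assumptions for illustration and do not reflect the actual scores on this dataset.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
target = rng.integers(0, 2, size=n)  # synthetic binary target

# Numerical features: one related to the target, one pure noise
informative_num = target + rng.normal(0, 0.5, size=n)
noise_num = rng.normal(0, 1, size=n)

# Kendall's tau between each numerical feature and the target
tau_informative, p_informative = kendalltau(informative_num, target)
tau_noise, p_noise = kendalltau(noise_num, target)

# Categorical features: one that agrees with the target 90% of the time, one random
informative_cat = np.where(rng.random(n) < 0.9, target, 1 - target)
noise_cat = rng.integers(0, 3, size=n)
X_cat = np.column_stack([informative_cat, noise_cat])

# Mutual information for discrete features (higher means more informative)
mi_scores = mutual_info_classif(X_cat, target, discrete_features=True, random_state=0)

print(f"Kendall tau: informative = {tau_informative:.3f}, noise = {tau_noise:.3f}")
print(f"Mutual information: informative = {mi_scores[0]:.3f}, noise = {mi_scores[1]:.3f}")
```

Features with scores close to zero on both measures are candidates for removal, although a cutoff has to be chosen for the dataset at hand.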

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
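
As a hedged illustration of the splitting step, the sketch below uses an 80/20 stratified split and fits the scaler on the training part only, so that no information from the test set leaks into preprocessing. The 80/20 ratio, the synthetic data and the roughly 12% positive rate are assumptions, not choices confirmed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and an imbalanced binary target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.12).astype(int)

# stratify=y keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape, "positive rate:", y_train.mean())
print("Test shape:", X_test_scaled.shape, "positive rate:", y_test.mean())
```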

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.
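
If the Response classes turn out to be imbalanced, one simple option is random oversampling of the minority class in the training data. The sketch below uses sklearn.utils.resample on a toy frame. The 12/88 class split and the feature column are assumptions for illustration, and other techniques (class weights, SMOTE) may suit the data better.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training frame with an imbalanced 'Response' column (illustrative only)
train = pd.DataFrame({
    "feature": np.arange(100),
    "Response": [1] * 12 + [0] * 88,
})

majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine and shuffle so the classes are mixed
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Response"].value_counts())
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution.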

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
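
Logistic Regression is one of the algorithms listed in the project summary. The sketch below shows a fit, predict and evaluate cycle on synthetic data generated with make_classification. The data, the class_weight setting and the choice of ROC AUC as the metric are assumptions for illustration and are not this project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced classification data (illustrative only)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.88, 0.12], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the algorithm; class_weight="balanced" reweights the rarer class
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Predict on the model: hard labels and positive-class probabilities
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {auc:.3f}")
```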

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
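
A minimal sketch of grid search with stratified cross-validation is given below, using a Decision Tree Classifier (one of the algorithms named in the project summary). The parameter grid, the five folds and the ROC AUC scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced classification data (illustrative only)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.88, 0.12], random_state=0,
)

# Small illustrative grid; limiting depth and leaf size helps against overfitting
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 20]}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", round(search.best_score_, 3))
```

For larger grids, RandomizedSearchCV evaluates only a sample of the combinations and is often cheaper on a dataset of this size.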

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.
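
One model-agnostic way to inspect feature importance is permutation importance from sklearn.inspection: each feature is shuffled in turn and the drop in test score is recorded. The sketch below applies it to a small decision tree on synthetic data. The two features "signal" and "noise" are hypothetical, and this is only one of several possible explainability tools.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 'signal' drives the target, 'noise' does not (illustrative only)
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal + rng.normal(scale=0.3, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and average the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

for name, mean_drop in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: mean score drop = {mean_drop:.3f}")
```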

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
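
The save and reload steps can be sketched with the standard pickle module as shown below. The stand-in model, the synthetic data and the file name best_model.pkl are assumptions for illustration; in the notebook, the fitted best-performing model would take the place of the stand-in.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the best-performing model, trained on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save the fitted model (hypothetical file name, written to a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the file again
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# Sanity check: both models must give identical predictions on unseen rows
X_unseen = rng.normal(size=(5, 3))
predictions_match = np.array_equal(
    model.predict(X_unseen), loaded_model.predict(X_unseen)
)
print("Predictions match after reload:", predictions_match)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.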

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***