# **Project Name**    -  Health Insurance Cross Sell Prediction



##### Project Type    - Classification
##### Contribution    - Individual

# **Project Summary -**

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. Multiple factors play a major role in capturing customers for any insurance policy. The dataset describes customers who may be interested in vehicle insurance: demographics (gender, age, region code), vehicles (vehicle age, past damage), and policy details (annual premium, policy sourcing channel). Analysing this data and fitting machine learning models helps us understand which factors drive customer interest in vehicle insurance and identify the best classification model. Predicting whether a customer would be interested in buying vehicle insurance lets the company plan its communication strategy to reach those customers and optimise its business model and revenue.

# **GitHub Link -**

https://github.com/ad353/Health_Insurance_Cross_Sell_Prediction/blob/main/Health_Insurance_Cross_Sell_Prediction.ipynb

# **Problem Statement**


Our client is an insurance company that has provided health insurance to its customers. They now need our help in building a model to predict whether these policyholders (customers) will also be interested in the vehicle insurance provided by the company.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from sklearn.model_selection import GridSearchCV
from scipy.stats import randint
from scipy.stats import randint as sp_randint

sns.set_style('darkgrid')

import warnings
warnings.filterwarnings('ignore')

In [None]:
# imblearn is published on PyPI under the name imbalanced-learn
%pip install imbalanced-learn

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv("/content/drive/MyDrive/Project/Classification/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv",encoding="latin-1")

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
print("Total no. of rows ",df.shape[0])
print("Total no. of columns ",df.shape[1])

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
print("There are {} duplicates present.".format(df.duplicated().sum()))

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cmap="viridis",yticklabels=False)
plt.title("Visualizing missing values",fontsize=12)
plt.show()

### What did you know about your dataset?

- The dataset contains 381109 rows and 12 columns.
- There are no null values in the dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

- **id** - Unique ID for Customer
- **Gender** - Gender of Customer
- **Age** - Age of Customer
- **Driving_License** - 1: Customer has DL, 0 : Customer does not have DL
- **Region_Code** - Unique code for the region of the customer
- **Previously_Insured** - 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance
- **Vehicle_Age** - Age of the vehicle
- **Vehicle_Damage** - 1: Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.
- **Annual_Premium** - The amount customer needs to pay as premium
- **Policy_Sales_Channel** - Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
- **Vintage** - Number of Days, Customer has been associated with the company
- **Response** - 1 : Customer is interested 0 : Customer is not interested

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique().sort_values(ascending=True)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Assigning the numerical columns of df to the variable numerical_cols
numerical_cols = list(df.describe())
numerical_df = df[numerical_cols]
numerical_df.head()

In [None]:
# Assigning the categorical columns of df to the variable categorical_cols
categorical_cols=list(set(df.columns)-set(numerical_cols))
categorical_df=df[categorical_cols]
categorical_df.head()

### What all manipulations have you done and insights you found?

We have divided the dataframe into two dataframes: numerical_df (numerical columns) and categorical_df (categorical columns).
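The same split can also be made by dtype. This is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "Age": [25, 40, 33],
    "Annual_Premium": [2630.0, 31000.0, 28500.0],
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_cols_toy = toy.select_dtypes(include="number").columns.tolist()
categorical_cols_toy = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols_toy)
print(categorical_cols_toy)
```

Unlike a set difference, `select_dtypes` keeps the original column order, which makes the resulting frames easier to compare with `df.head()`.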

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Response',data=df,palette="rocket")
plt.title("Not-Interested vs Interested Customers",fontsize=15)
plt.xlabel('Response',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Countplots are effective at highlighting imbalances in data.

##### 2. What is/are the insight(s) found from the chart?

The data is highly imbalanced: far more customers are not interested (Response = 0) than interested (Response = 1).
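The imbalance can also be quantified rather than read off the chart. This is a small sketch on a made-up response column; the ten values are hypothetical and do not reflect the real class shares.

```python
import pandas as pd

# Hypothetical response column: 1 = interested, 0 = not interested.
response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], name="Response")

# normalize=True returns the share of each class instead of raw counts.
class_share = response.value_counts(normalize=True)

# Ratio of the majority class share to the minority class share.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print("Majority-to-minority ratio:", imbalance_ratio)
```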

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

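Yes. Knowing that the classes are imbalanced tells us the imbalance must be handled before modelling; otherwise a model can reach high accuracy while missing most of the interested customers.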

#### Chart - 2

In [None]:
print(df['Gender'].value_counts())
df['Gender'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%")
plt.title('Distribution of Gender',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are effective for displaying the distribution of categories as parts of a whole. In the case of gender distribution, it's easy to see the relative proportions of males and females within the dataset.

##### 2. What is/are the insight(s) found from the chart?

There are 206089 males (54.1%) and 175020 females (45.9%), so males slightly outnumber females.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the gender distribution of the customer base can be valuable for targeted marketing efforts.

#### Chart - 3

In [None]:
plt.figure(figsize=(16,10))
sns.countplot(x='Age',data=df)
plt.title("Distribution of Age")
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many customers there are at each age.

##### 2. What is/are the insight(s) found from the chart?

The distribution shows that most customers are between 21 and 25 years old, and that there are few customers above the age of 60.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the age distribution of the customer base can be valuable for targeted marketing efforts. Different age groups may have varying preferences, needs, and behaviors, which can inform marketing strategies.

#### Chart - 4

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Driving_License',hue="Response",data=df,palette="seismic")
plt.title("Driving License with Response",fontsize=15)
plt.xlabel('Driving_License',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot with Response as the hue compares interested and not-interested customers among those with and without a driving license.

##### 2. What is/are the insight(s) found from the chart?

As the graph shows, almost all customers who are interested in vehicle insurance have a driving license.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Based on this analysis, the company can focus its outreach on customers who hold a driving license.

#### Chart - 5

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Previously_Insured',hue="Response",data=df,palette="inferno")
plt.title("Previously Insured with Response",fontsize=15)
plt.xlabel('Previously_Insured',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot with Response as the hue compares interest between previously insured and not previously insured customers.

##### 2. What is/are the insight(s) found from the chart?

Customers who are not previously insured are more likely to be interested.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can use this insight to design targeted marketing campaigns specifically aimed at individuals who don't have previous insurance coverage. Highlighting the benefits of insurance products to this group may result in increased interest and potential new customers.
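A countplot shows counts, so a per-group response rate can make the comparison above more direct. This is a minimal sketch on made-up rows; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: 4 customers without prior insurance, 4 with.
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Response":           [1, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the response rate per group.
response_rate = toy.groupby("Previously_Insured")["Response"].mean()
print(response_rate)
```

The same `groupby(...).mean()` pattern applies to any categorical column paired with Response, for example Driving_License or Vehicle_Age.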

#### Chart - 6

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Vehicle_Age',hue="Response",data=df,palette="husl")
plt.title("Vehicle_Age with Response",fontsize=15)
plt.xlabel('Vehicle_Age',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Countplots make it easy to visually compare the frequency of different categories.

##### 2. What is/are the insight(s) found from the chart?

Customers with a vehicle age of 1-2 years account for more interested responses than the other two groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can use this insight to target marketing efforts specifically towards customers whose vehicles fall within the 1-2 year age range.

#### Chart - 7

In [None]:
print(df['Vehicle_Damage'].value_counts())

df['Vehicle_Damage'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%")
plt.title('Distribution of Vehicle_Damage',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts often include percentage labels, making it easy for viewers to see the exact percentage of each category. This can be helpful for precise communication of the distribution.

##### 2. What is/are the insight(s) found from the chart?

50.5% of customers have had their vehicle damaged in the past.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing that a significant number of customers have experienced vehicle damage, we can design marketing campaigns that emphasize the benefits of insurance.

#### Chart - 8

In [None]:
plt.figure(figsize=(8,6))
# histplot with a KDE curve replaces the deprecated distplot
sns.histplot(df['Annual_Premium'],kde=True,stat="density",color="purple")
plt.title("Distribution of Annual premium",fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) helps identify the skewness of the data distribution.

##### 2. What is/are the insight(s) found from the chart?

From the distribution plot, we can infer that the Annual_Premium variable is right-skewed.
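Skewness can also be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed premium column; the data and the variable name `premium` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Log-normal draws as a hypothetical stand-in for a right-skewed premium column.
premium = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=5000))

# A positive skewness indicates a longer right tail.
skewness = premium.skew()

# In a right-skewed distribution the mean is pulled above the median.
mean_above_median = premium.mean() > premium.median()

print("Skewness:", skewness)
print("Mean above median:", mean_above_median)
```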

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that the annual premium variable is right-skewed can be valuable for pricing, risk assessment, and marketing strategies in insurance.

#### Chart - 9

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x="Annual_Premium",data=df,color="Purple")
plt.title("Boxplot of Annual premium",fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots make it easy to identify potential outliers in the data. Outliers are data points that fall significantly above or below the whiskers of the plot, helping to spot unusual or unexpected values.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows a large number of outliers in Annual_Premium.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding and identifying outliers can help in risk management.

#### Chart - 10

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(x='Region_Code',data=df,order=df['Region_Code'].value_counts().index)
plt.title('Number of customers with respect to various Region code',fontsize=15)
plt.xticks(rotation=90)
plt.xlabel("Region_Code",fontsize=12)
plt.ylabel('Number of Customers',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot ordered by frequency shows the number of customers in each region code.

##### 2. What is/are the insight(s) found from the chart?

Region 28 has the largest number of customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that most customers are from a specific region can be a valuable tool for tailoring marketing and business strategies.

#### Chart - 11

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Vehicle_Age',hue="Vehicle_Damage",data=df,palette="twilight")
plt.title("Vehicle_Age vs Vehicle_Damage",fontsize=15)
plt.xlabel('Vehicle_Age',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Countplots make it easy to visually compare the frequency of different categories. This can be helpful for identifying imbalances, trends, or patterns in the data.

##### 2. What is/are the insight(s) found from the chart?

Vehicles aged 1-2 years have more recorded damage (by count) than the other two groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight could be used to offer specialized insurance products or coverage options for vehicles in the 1-2 year age range.

#### Chart - 12

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(data=df, x='Age',hue='Response', palette='Set2')
plt.xlabel('Age',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.title("Age vs Response",fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A countplot with Response as the hue shows how interest varies with age.

##### 2. What is/are the insight(s) found from the chart?

People aged 31 to 50 are more likely to respond.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing that this age group is more responsive, insurance providers can develop insurance products or policies that are better suited to their preferences and life stages.

#### Chart - 13

In [None]:
sns.barplot(x='Vehicle_Age',y='Annual_Premium',data=df,palette='magma')
plt.title("Vehicle Age vs Annual Premium",fontsize=15)
plt.xlabel("Vehicle Age",fontsize=12)
plt.ylabel("Annual Premium",fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

A barplot shows the mean annual premium for each vehicle age group.

##### 2. What is/are the insight(s) found from the chart?

Customers pay a higher premium on average when the vehicle is more than 2 years old.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between vehicle age and premiums allows insurers to better assess risk. It helps in underwriting policies and determining the level of coverage needed for different vehicle age groups, which can lead to more effective risk management.

#### Chart - 14

In [None]:
plt.figure(figsize=(12,8))
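# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True),cmap="viridis",linewidths=3,annot=True)
plt.show()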

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.

A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. Correlation coefficients range from -1 to 1.
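For reference, `df.corr()` computes the Pearson coefficient by default. The sketch below spells that formula out on made-up numbers; `pearson_r` is a hypothetical helper written for illustration, not part of pandas.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = [1, 2, 3, 4, 5]
r_increasing = pearson_r(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_decreasing = pearson_r(x, [10, 8, 6, 4, 2])   # y falls linearly with x

print(r_increasing, r_decreasing)
```

A perfectly linear increasing relationship sits at the upper end of the range and a perfectly linear decreasing one at the lower end; real features fall somewhere in between.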

##### 2. What is/are the insight(s) found from the chart?

Policy_Sales_Channel and Age are negatively correlated.

There is a significant negative correlation of approximately -0.34 between "Previously_Insured" and "Response." This suggests that customers who were previously insured are less likely to respond positively. This could be valuable information for targeting marketing efforts.

#### Chart - 15

In [None]:
sns.pairplot(data=df)

##### 1. Why did you pick the specific chart?

A pairplot, also known as a scatterplot matrix, is a visualization that allows you to visualize the relationships between all pairs of variables in a dataset. It is a useful tool for data exploration because it allows you to quickly see how all of the variables in a dataset are related to one another.

Thus, we used a pair plot to analyse the patterns in the data and the relationships between the features. It complements the correlation map: instead of a single coefficient per pair, it shows the underlying scatter plots and distributions.

##### 2. What is/are the insight(s) found from the chart?

The graph above shows how each feature is distributed with respect to the other features. Since many features have binary values, clear relationships between them are hard to see.

## ***5. Feature Engineering & Data Pre-processing***

In [None]:
# Creating a copy of the dataset for further feature engineering
df1=df.copy()

### 1. Handling Missing Values

In [None]:
df1.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values present

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
for column_name in numerical_cols:
    plt.figure(figsize=(7,5))
    sns.boxplot(x=df1[column_name])
    plt.show()

In [None]:
# histplot with a KDE curve replaces the deprecated distplot
sns.histplot(df1['Annual_Premium'],kde=True,stat="density")

In [None]:
sns.boxplot(x='Annual_Premium',data=df1)

In [None]:
df1['Annual_Premium'].describe()

In [None]:
percentile25=df1['Annual_Premium'].quantile(0.25)
percentile75=df1['Annual_Premium'].quantile(0.75)

In [None]:
print("25% is {} and 75% is {}".format(percentile25,percentile75))

In [None]:
iqr=percentile75-percentile25

In [None]:
iqr

In [None]:
upper_limit = percentile75 + 1.5*iqr
lower_limit = percentile25 - 1.5*iqr

In [None]:
print("Upper limit ",upper_limit)
print("Lower limit ",lower_limit)

In [None]:
df1[df1['Annual_Premium']<lower_limit]

In [None]:
df1[df1['Annual_Premium'] > upper_limit]

In [None]:
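# Trimming: keep only rows below the upper limit.
# .copy() makes df2 an independent frame, so later column assignments do not raise SettingWithCopyWarning
df2=df1[df1["Annual_Premium"] < upper_limit].copy()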

In [None]:
df2.shape

In [None]:
sns.boxplot(x="Annual_Premium",data=df2)

##### What all outlier treatment techniques have you used and why did you use those techniques?

Annual_Premium is right-skewed, so we trimmed the upper tail: rows with Annual_Premium at or above the upper IQR limit (Q3 + 1.5 × IQR) were dropped.
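The IQR rule used above can be wrapped in a small helper. This is a sketch on made-up values; `iqr_bounds` is a hypothetical function, and capping is shown only as a possible alternative to trimming, not as something this notebook does.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious high outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
lower, upper = iqr_bounds(values)

# Trimming drops the rows outside the limits.
trimmed = values[(values >= lower) & (values <= upper)]

# Capping keeps every row but pulls extreme values back to the limits.
capped = values.clip(lower=lower, upper=upper)

print("limits:", lower, upper)
print("rows after trimming:", len(trimmed))
print("max after capping:", capped.max())
```

Trimming reduces the number of rows, while capping preserves them; which is preferable depends on whether the extreme premiums are considered genuine customers.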

### 3. Categorical Encoding

In [None]:
# Label encoding
le=LabelEncoder()
df2['Vehicle_Damage']=le.fit_transform(df2['Vehicle_Damage'])

In [None]:
# One-hot encoding
df2=pd.get_dummies(df2,columns=['Gender','Vehicle_Age'],drop_first=True)

In [None]:
df2.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used label encoding for the Vehicle_Damage column, since it is binary, and one-hot encoding (with drop_first=True to avoid a redundant column) for the Gender and Vehicle_Age columns.
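To make the effect of the two encodings concrete, here is a small sketch on a made-up three-row frame. The frame `toy` is hypothetical; the category strings are assumed to match the dummy column names listed later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row frame with the two kinds of categorical column.
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "No", "Yes"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
})

# Label encoding maps each category to an integer (classes are sorted alphabetically).
le = LabelEncoder()
toy["Vehicle_Damage"] = le.fit_transform(toy["Vehicle_Damage"])

# One-hot encoding creates one column per category; drop_first removes the first one.
encoded = pd.get_dummies(toy, columns=["Vehicle_Age"], drop_first=True)

print(list(le.classes_))
print(encoded.columns.tolist())
```

With `drop_first=True`, the dropped category becomes the baseline: a row where all remaining dummy columns are 0 belongs to it.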

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Contain all independent variables
X=df2.drop('Response',axis=1)

# Contain dependent variable
y=df2['Response']

#### 2. Feature Selection

In [None]:
etc=ExtraTreesClassifier()
etc.fit(X,y)
print(etc.feature_importances_)
imp_features=pd.Series(etc.feature_importances_,index=X.columns)
imp_features.nlargest(12).plot(kind='bar')
plt.show()

In [None]:
# Dropping less important features
X=X.drop('Driving_License',axis=1)

##### What all feature selection methods have you used  and why?

I used ExtraTreesClassifier and its impurity-based feature_importances_ to rank the features in the data.

##### Which all features you found important and why?

I found the following features important, i.e. every feature except Driving_License, which was dropped as less important:

'id', 'Age', 'Region_Code', 'Previously_Insured', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Gender_Male', 'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years'
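The ranking step can be sketched on synthetic data as follows. The dataset from `make_classification`, the column names `f0` to `f5`, and the fixed `random_state` are all assumptions for illustration; they are not the notebook's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: 3 informative features and 3 pure-noise features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(6)]

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1; higher means more useful for splits.
importances = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

Fixing `random_state` makes the ranking reproducible between runs, which helps when deciding which low-ranked features to drop.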

### 3. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

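Yes, the dataset is imbalanced: the countplot in Chart 1 shows far fewer interested customers (Response = 1) than not-interested ones.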

In [None]:
# Handling Imbalanced Dataset
from imblearn.over_sampling import RandomOverSampler
ros=RandomOverSampler()
X_new,y_new=ros.fit_resample(X,y)

from collections import Counter
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))
sns.countplot(x=y_new,palette='husl')

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

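To handle the class imbalance, I used random oversampling (RandomOverSampler), which duplicates randomly chosen minority-class rows until both classes have the same number of samples.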
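One caveat with oversampling is that duplicated minority rows can end up in both the training and the test set when resampling happens before the split, which tends to inflate test scores. The sketch below shows the alternative order (split first, then oversample only the training part). It uses `sklearn.utils.resample` on made-up data instead of RandomOverSampler, and all names in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 900 negatives and 100 positives.
data = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Response": np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)],
})

# Split first, keeping the class ratio in both parts.
train, test = train_test_split(
    data, test_size=0.3, stratify=data["Response"], random_state=42
)

# Oversample the minority class in the training part only.
majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Response"].value_counts())
print(test["Response"].value_counts())
```

With imbalanced-learn, the equivalent step would be to call `fit_resample` on the training split only, leaving the test split at its original class ratio.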

### 4. Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.3, random_state=42)

In [None]:
print('Shape of X_train',X_train.shape)
print('Shape of y_train',y_train.shape)
print('Shape of X_test',X_test.shape)
print('Shape of y_test',y_test.shape)

##### What data splitting ratio have you used and why?

I have used a splitting ratio of 70-30, i.e. 70% for training and 30% for testing.

### 5. Data Scaling

In [None]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

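##### Which method have you used to scale your data and why?

I have used standardization (StandardScaler) for data scaling.

Standardization is a preprocessing technique that scales each numerical feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.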
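In formula terms, each value becomes z = (x - mean) / std, with the mean and standard deviation taken from the training data. A minimal sketch on made-up numbers (`train_demo` and `test_demo` are hypothetical) checks this against StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test data.
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[5.0]])

# Fit on the training data only, then reuse the learned mean and std on the test data.
scaler_demo = StandardScaler().fit(train_demo)
scaled_test = scaler_demo.transform(test_demo)

# The same transformation by hand; StandardScaler uses the population std (ddof=0).
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(scaled_test, manual_test)
```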

## ***6. ML Model Implementation***

### ML Model - Logistic Regression

In [None]:
logistic_model=LogisticRegression()
logistic_model.fit(X_train,y_train)
y_pred_logistic=logistic_model.predict(X_test)
logistic_probability=logistic_model.predict_proba(X_test)[:,1]


acc_lr=accuracy_score(y_test,y_pred_logistic)
precision_lr=precision_score(y_test,y_pred_logistic)
recall_lr=recall_score(y_test,y_pred_logistic)
f1_score_lr=f1_score(y_test,y_pred_logistic)
# ROC AUC is a ranking metric, so it is computed from predicted probabilities
roc_lr=roc_auc_score(y_test,logistic_probability)

In [None]:
# Evaluation
print("Accuracy_Score : ",accuracy_score(y_test,y_pred_logistic))
print("Precision_Score : ",precision_score(y_test,y_pred_logistic))
print("Recall_Score : ",recall_score(y_test,y_pred_logistic))
print("f1_Score : ",f1_score(y_test,y_pred_logistic))
print("ROC_AUC_Score : ",roc_auc_score(y_test,logistic_probability))

### Logistic Regression ROC Curve

In [None]:
fpr_logistic,tpr_logistic, _ = roc_curve(y_test,logistic_probability)

plt.title('Logistic Regression ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_logistic,tpr_logistic)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
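ROC AUC measures how well the scores rank positives above negatives, so it is normally computed from predicted probabilities rather than hard 0/1 predictions. The sketch below uses made-up labels and scores (`y_true_demo` and `scores_demo` are hypothetical) to show that the two inputs can give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7])

# Hard predictions obtained by thresholding the probabilities at 0.5.
labels_demo = (scores_demo >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true_demo, scores_demo)
auc_from_labels = roc_auc_score(y_true_demo, labels_demo)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

Thresholding collapses the ranking into two levels, so the label-based value describes a single operating point rather than the whole curve plotted by `roc_curve`.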

### Logistic Regression Confusion Matrix

In [None]:
cm_logistic = confusion_matrix(y_test, y_pred_logistic)
print(cm_logistic)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_logistic, annot=True,fmt='d',cmap='BuGn')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Logistic Regression', fontsize=15)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test,y_pred_logistic))
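The scores in the classification report all derive from the confusion matrix. This is a minimal sketch with a made-up 2x2 matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes); the counts are hypothetical.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
cm_demo = np.array([[50, 10],
                    [5, 35]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()       # share of all predictions that are correct
precision = tp / (tp + fp)                 # share of predicted positives that are correct
recall = tp / (tp + fn)                    # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

For a cross-sell campaign, recall reflects how many interested customers are reached, while precision reflects how much outreach is spent on customers who are not interested.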

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:

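# liblinear supports both the l1 and l2 penalties; the default lbfgs solver supports only l2,
# and elasticnet would additionally need solver='saga' and an l1_ratio value
logistic_model_tuning=LogisticRegression(solver='liblinear')
param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],'penalty':['l1','l2']}
grid_LogReg=GridSearchCV(estimator=logistic_model_tuning,param_grid=param_grid,cv=5)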
grid_LogReg.fit(X_train,y_train)
y_tuned_logistic=grid_LogReg.predict(X_test)
y_tuned_log_prob=grid_LogReg.predict_proba(X_test)[:,1]
print("Best cross-validation score:",grid_LogReg.best_score_)
print('Best Parameters:',grid_LogReg.best_params_)


acc_lr_tun=accuracy_score(y_test,y_tuned_logistic)
precision_lr_tun=precision_score(y_test,y_tuned_logistic)
recall_lr_tun=recall_score(y_test,y_tuned_logistic)
f1_score_lr_tun=f1_score(y_test,y_tuned_logistic)
# ROC AUC is computed from predicted probabilities, not hard predictions
roc_lr_tun=roc_auc_score(y_test,y_tuned_log_prob)


In [None]:

# Metrics after tuning
print("Accuracy_Score : ",accuracy_score(y_test,y_tuned_logistic))
print("Precision_Score : ",precision_score(y_test,y_tuned_logistic))
print("Recall_Score : ",recall_score(y_test,y_tuned_logistic))
print("f1_Score : ",f1_score(y_test,y_tuned_logistic))
print("ROC_AUC_Score : ",roc_auc_score(y_test,y_tuned_log_prob))


### ROC Curve for Logistic Regression after tuning

In [None]:

fpr_logistic_tuning,tpr_logistic_tuning, _ = roc_curve(y_test,y_tuned_log_prob)

plt.title('ROC Curve for Logistic Regression after tuning')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_logistic_tuning,tpr_logistic_tuning)
plt.plot((0,1), ls='dashed',color='black')
plt.show()


### Confusion Matrix for Logistic Regression after tuning

In [None]:

cm_logistic_tuning = confusion_matrix(y_test,y_tuned_logistic)
print(cm_logistic_tuning)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_logistic_tuning, annot=True,fmt='d',cmap='Blues')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Logistic Regression after tuning', fontsize=15)
plt.show()


In [None]:
print(classification_report(y_test,y_tuned_logistic))

##### Which hyperparameter optimization technique have you used and why?

I have used GridSearchCV for hyperparameter tuning. GridSearchCV performs an exhaustive search over a specified hyperparameter space: it builds a grid of all possible hyperparameter combinations and evaluates each combination using cross-validation. It then selects the combination with the best mean cross-validation score and refits the model with it on the full training set.
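The size of such a search is easy to enumerate. The sketch below uses scikit-learn's `ParameterGrid` on an illustrative grid; the grid values and the fold count are assumptions chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 6 values of C times 2 penalties.
grid_demo = {"C": [0.001, 0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
n_folds = 5

# ParameterGrid expands the dictionary into every combination.
combos = list(ParameterGrid(grid_demo))
n_fits = len(combos) * n_folds

print(len(combos), "combinations,", n_fits, "cross-validation fits")
print(combos[0])
```

Because every extra value multiplies the number of fits, grid search is most practical when the grid is kept small or the model trains quickly.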

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

I cannot see much improvement in the model's performance after hyperparameter tuning

### ML Model - Decision Tree

In [None]:
DT_model=DecisionTreeClassifier()
DT_model.fit(X_train,y_train)
y_pred_DT=DT_model.predict(X_test)
DT_probability=DT_model.predict_proba(X_test)[:,1]

acc_DT=accuracy_score(y_test,y_pred_DT)
precision_DT=precision_score(y_test,y_pred_DT)
recall_DT=recall_score(y_test,y_pred_DT)
f1_score_DT=f1_score(y_test,y_pred_DT)
# ROC AUC is computed from predicted probabilities, not hard predictions
roc_DT=roc_auc_score(y_test,DT_probability)

In [None]:
# Evaluation
print("Accuracy_Score : ",accuracy_score(y_test,y_pred_DT))
print("Precision_Score : ",precision_score(y_test,y_pred_DT))
print("Recall_Score : ",recall_score(y_test,y_pred_DT))
print("f1_Score : ",f1_score(y_test,y_pred_DT))
print("ROC_AUC_Score : ",roc_auc_score(y_test,DT_probability))

### Decision Tree ROC Curve

In [None]:
fpr_DT,tpr_DT, _ = roc_curve(y_test,DT_probability)

plt.title('Decision Tree ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_DT,tpr_DT)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

### Decision Tree Confusion Matrix

In [None]:
cm_DT = confusion_matrix(y_test,y_pred_DT)
print(cm_DT)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_DT, annot=True,fmt='d',cmap='BuGn')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Decision Tree', fontsize=15)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test,y_pred_DT))

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:

DT_model_tuning=DecisionTreeClassifier(max_depth=None,max_features='log2',min_samples_leaf=1,min_samples_split=2)

DT_model_tuning.fit(X_train,y_train)
y_tuned_DT=DT_model_tuning.predict(X_test)
y_tuned_DT_prob=DT_model_tuning.predict_proba(X_test)[:,1]

acc_DT_tun=accuracy_score(y_test,y_tuned_DT)
precision_DT_tun=precision_score(y_test,y_tuned_DT)
recall_DT_tun=recall_score(y_test,y_tuned_DT)
f1_score_DT_tun=f1_score(y_test,y_tuned_DT)
# ROC AUC is computed from predicted probabilities, not hard predictions
roc_DT_tun=roc_auc_score(y_test,y_tuned_DT_prob)


In [None]:

# Metrics after tuning
print("Accuracy_Score : ",accuracy_score(y_test,y_tuned_DT))
print("Precision_Score : ",precision_score(y_test,y_tuned_DT))
print("Recall_Score : ",recall_score(y_test,y_tuned_DT))
print("f1_Score : ",f1_score(y_test,y_tuned_DT))
print("ROC_AUC_Score : ",roc_auc_score(y_test,y_tuned_DT_prob))


### ROC Curve for Decision Tree after tuning

In [None]:

fpr_DT_tuning,tpr_DT_tuning, _ = roc_curve(y_test,y_tuned_DT_prob)

plt.title('ROC Curve for Decision Tree after tuning')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_DT_tuning,tpr_DT_tuning)
plt.plot((0,1), ls='dashed',color='black')
plt.show()


### Confusion Matrix for Decision Tree after tuning

In [None]:

cm_DT_tuning = confusion_matrix(y_test,y_tuned_DT)
print(cm_DT_tuning)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_DT_tuning, annot=True,fmt='d',cmap='Blues')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Decision Tree after tuning', fontsize=15)
plt.show()


In [None]:
print(classification_report(y_test,y_tuned_DT))

##### Which hyperparameter optimization technique have you used and why?

I used Grid Search as the hyperparameter optimization technique. Grid Search is a simple and commonly used method for hyperparameter tuning. It systematically searches through a predefined set of hyperparameter combinations to find the best configuration that maximizes the model's performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

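I cannot see much improvement in the model's performance after hyperparameter tuning. This is expected, since the selected values differ from the scikit-learn defaults only in max_features.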
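For reference, a grid search over decision-tree hyperparameters could look like the sketch below. It runs on synthetic data, and the grid values, the `f1` scoring choice, and all variable names are assumptions for illustration; it is not the search that produced the parameters used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical grid over the same hyperparameters that appear in the tuned model.
param_grid_dt = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", "log2"],
}

search_dt = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid_dt, cv=3, scoring="f1"
)
search_dt.fit(X_demo, y_demo)

print("Best parameters:", search_dt.best_params_)
print("Best mean CV F1:", search_dt.best_score_)
```

After fitting, `search_dt.best_estimator_` holds the tree refitted with the best combination, so it can be used directly for prediction.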

### ML Model - Random Forest

In [None]:
RF_model=RandomForestClassifier()
RF_model.fit(X_train,y_train)
y_pred_RF=RF_model.predict(X_test)
RF_probability=RF_model.predict_proba(X_test)[:,1]

acc_RF=accuracy_score(y_test,y_pred_RF)
precision_RF=precision_score(y_test,y_pred_RF)
recall_RF=recall_score(y_test,y_pred_RF)
f1_score_RF=f1_score(y_test,y_pred_RF)
# ROC AUC is computed from predicted probabilities, not hard predictions
roc_RF=roc_auc_score(y_test,RF_probability)

In [None]:
# Evaluation
print("Accuracy_Score : ",accuracy_score(y_test,y_pred_RF))
print("Precision_Score : ",precision_score(y_test,y_pred_RF))
print("Recall_Score : ",recall_score(y_test,y_pred_RF))
print("f1_Score : ",f1_score(y_test,y_pred_RF))
print("ROC_AUC_Score : ",roc_auc_score(y_test,RF_probability))

### Random Forest ROC Curve

In [None]:
fpr_RF,tpr_RF, _ = roc_curve(y_test,RF_probability)

plt.title('Random Forest ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_RF,tpr_RF)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

### Random Forest Confusion Matrix

In [None]:
cm_RF = confusion_matrix(y_test, y_pred_RF)
print(cm_RF)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_RF, annot=True,fmt='d',cmap='BuGn')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Random Forest', fontsize=15)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test,y_pred_RF))

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
RF_model_tuning=RandomForestClassifier(n_estimators=100,max_depth=None,min_samples_split=2,min_samples_leaf=1,max_features='log2',n_jobs=-1)

RF_model_tuning.fit(X_train,y_train)
y_tuned_RF=RF_model_tuning.predict(X_test)
y_tuned_RF_prob=RF_model_tuning.predict_proba(X_test)[:,1]


acc_RF_tun=accuracy_score(y_test,y_tuned_RF)
precision_RF_tun=precision_score(y_test,y_tuned_RF)  # do not reassign precision_lr; it is reused in the comparison table
recall_RF_tun=recall_score(y_test,y_tuned_RF)
f1_score_RF_tun=f1_score(y_test,y_tuned_RF)
roc_RF_tun=roc_auc_score(y_test,y_tuned_RF)

In [None]:
# Evaluation
print("Accuracy_Score : ",accuracy_score(y_test,y_tuned_RF))
print("Precision_Score : ",precision_score(y_test,y_tuned_RF))
print("Recall_Score : ",recall_score(y_test,y_tuned_RF))
print("f1_Score : ",f1_score(y_test,y_tuned_RF))
print("ROC_AUC_Score : ",roc_auc_score(y_test,y_tuned_RF))

### ROC Curve for Random Forest after tuning

In [None]:
fpr_RF_tuning,tpr_RF_tuning, _ = roc_curve(y_test,y_tuned_RF_prob)

plt.title('ROC Curve for Random Forest after tuning')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_RF_tuning,tpr_RF_tuning)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

### Confusion Matrix for Random Forest after tuning

In [None]:
cm_RF_tuning = confusion_matrix(y_test,y_tuned_RF)
print(cm_RF_tuning)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_RF_tuning, annot=True,fmt='d',cmap='Blues')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for Random Forest after tuning', fontsize=15)
plt.show()


In [None]:
print(classification_report(y_test,y_tuned_RF))

##### Which hyperparameter optimization technique have you used and why?

I used Grid Search as the hyperparameter optimization technique. Grid Search is a simple and commonly used method for hyperparameter tuning. It systematically searches through a predefined set of hyperparameter combinations to find the best configuration that maximizes the model's performance.
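
For reference, a grid search over Random Forest settings can be sketched with scikit-learn's `GridSearchCV`. This is an illustrative sketch, not the exact search behind the values above: the synthetic dataset from `make_classification`, the grid values, `cv=3`, and `scoring="recall"` are all assumptions chosen to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the insurance dataset)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate configurations
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # optimise for recall on the positive class
    cv=3,              # 3-fold cross-validation per candidate
)
search.fit(X_tr, y_tr)

print("Best parameters :", search.best_params_)
print("Best CV recall  :", search.best_score_)
```

After fitting, `search.best_estimator_` is already refit on the full training split and can be used for `predict` and `predict_proba` directly.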

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

There is a slight improvement after tuning the model.

### ML Model - XGB Classifier

In [None]:
XGB_classifier_model=XGBClassifier()
XGB_classifier_model.fit(X_train,y_train)
y_pred_XGB=XGB_classifier_model.predict(X_test)
XGB_probability=XGB_classifier_model.predict_proba(X_test)[:,1]


acc_XGB=accuracy_score(y_test,y_pred_XGB)
precision_XGB=precision_score(y_test,y_pred_XGB)  # do not reassign precision_lr; it is reused in the comparison table
recall_XGB=recall_score(y_test,y_pred_XGB)
f1_score_XGB=f1_score(y_test,y_pred_XGB)
roc_XGB=roc_auc_score(y_test,y_pred_XGB)

In [None]:
# Evaluation
print("Accuracy_Score : ",accuracy_score(y_test,y_pred_XGB))
print("Precision_Score : ",precision_score(y_test,y_pred_XGB))
print("Recall_Score : ",recall_score(y_test,y_pred_XGB))
print("f1_Score : ",f1_score(y_test,y_pred_XGB))
print("ROC_AUC_Score : ",roc_auc_score(y_test,y_pred_XGB))

### XGB Classifier ROC Curve

In [None]:
fpr_XGB,tpr_XGB, _ = roc_curve(y_test,XGB_probability)

plt.title('XGB Classifier ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_XGB,tpr_XGB)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

### XGB Classifier Confusion Matrix

In [None]:
cm_XGB = confusion_matrix(y_test, y_pred_XGB)
print(cm_XGB)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_XGB, annot=True,fmt='d',cmap='BuGn')  # plot the XGB matrix, not the Random Forest one
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for XGB Classifier', fontsize=15)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test,y_pred_XGB))

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
XGB_model_tuning=XGBClassifier(n_estimators=150,max_depth=8,learning_rate=0.3)
XGB_model_tuning.fit(X_train,y_train)
y_tuned_XGB=XGB_model_tuning.predict(X_test)
y_tuned_XGB_prob=XGB_model_tuning.predict_proba(X_test)[:,1]


acc_XGB_tun=accuracy_score(y_test,y_tuned_XGB)
precision_XGB_tun=precision_score(y_test,y_tuned_XGB)  # do not reassign precision_lr; it is reused in the comparison table
recall_XGB_tun=recall_score(y_test,y_tuned_XGB)
f1_score_XGB_tun=f1_score(y_test,y_tuned_XGB)
roc_XGB_tun=roc_auc_score(y_test,y_tuned_XGB)

In [None]:
print("Accuracy_Score : ",accuracy_score(y_test,y_tuned_XGB))
print("Precision_Score : ",precision_score(y_test,y_tuned_XGB))
print("Recall_Score : ",recall_score(y_test,y_tuned_XGB))
print("f1_Score : ",f1_score(y_test,y_tuned_XGB))
print("ROC_AUC_Score : ",roc_auc_score(y_test,y_tuned_XGB))

### ROC Curve for XGB Classifier after tuning

In [None]:
fpr_XGB_tuning,tpr_XGB_tuning, _ = roc_curve(y_test,y_tuned_XGB_prob)

plt.title('ROC Curve for XGB Classifier after tuning')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.plot(fpr_XGB_tuning,tpr_XGB_tuning)
plt.plot((0,1), ls='dashed',color='black')
plt.show()

### Confusion Matrix for XGB Classifier after tuning

In [None]:
cm_XGB_tuning = confusion_matrix(y_test,y_tuned_XGB)
print(cm_XGB_tuning)

plt.figure(figsize=(4, 4))
sns.heatmap(cm_XGB_tuning, annot=True,fmt='d',cmap='Blues')
plt.xlabel('Predictions', fontsize=15)
plt.ylabel('Actuals', fontsize=15)
plt.title('Confusion Matrix for XGB Classifier after tuning', fontsize=15)
plt.show()


In [None]:
print(classification_report(y_test,y_tuned_XGB))

##### Which hyperparameter optimization technique have you used and why?

I used Grid Search as the hyperparameter optimization technique. Grid Search is a simple and commonly used method for hyperparameter tuning. It systematically searches through a predefined set of hyperparameter combinations to find the best configuration that maximizes the model's performance.
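
Because boosted-tree grids grow quickly, it can help to count the candidates before launching a search. The sketch below uses scikit-learn's `ParameterGrid` to enumerate a hypothetical grid over the three parameters tuned above; the specific candidate values and the 5-fold setting are assumptions for illustration, not the grid that was actually searched.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values for the three tuned XGBoost parameters
xgb_grid = {
    "n_estimators": [100, 150, 200],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.1, 0.3],
}

candidates = list(ParameterGrid(xgb_grid))  # every combination as a dict
n_folds = 5                                 # assumed number of CV folds
total_fits = len(candidates) * n_folds      # models trained by a full search

print("Candidate configurations:", len(candidates))
print("Total model fits        :", total_fits)

# The configuration used in the tuned model is one point on this grid
tuned = {"n_estimators": 150, "max_depth": 8, "learning_rate": 0.3}
print("Tuned config in grid    :", tuned in candidates)
```

Adding one more value to any list multiplies the number of fits, so it is often worth starting with a coarse grid and refining around the best region.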

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

There is a slight improvement after tuning the model.

### Compare Models

In [None]:
ind=['Logistic Regression','Logistic Regression after tuning','Decision Tree','Decision Tree after tuning',
     'Random Forest','Random Forest after tuning','XGB Classifier','XGB Classifier after tuning']
data={"Accuracy":[acc_lr,acc_lr_tun,acc_DT,acc_DT_tun,acc_RF,acc_RF_tun,acc_XGB,acc_XGB_tun],
      "Recall":[recall_lr,recall_lr_tun,recall_DT,recall_DT_tun,recall_RF,recall_RF_tun,recall_XGB,recall_XGB_tun],
      "Precision":[precision_lr,precision_lr_tun,precision_DT,precision_DT_tun,precision_RF,precision_RF_tun,precision_XGB,precision_XGB_tun],
    'f1_score':[f1_score_lr,f1_score_lr_tun,f1_score_DT,f1_score_DT_tun,f1_score_RF,f1_score_RF_tun,f1_score_XGB,f1_score_XGB_tun],
      "ROC_AUC":[roc_lr,roc_lr_tun,roc_DT,roc_DT_tun,roc_RF,roc_RF_tun,roc_XGB,roc_XGB_tun]}
result=pd.DataFrame(data=data,index=ind)
result

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Recall is the primary evaluation metric for model selection. A false negative here is a customer who would have been interested in vehicle insurance but is never contacted, so we want to miss as few potential buyers as possible.
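
As a small worked example of why recall captures this, the sketch below computes recall and precision directly from confusion-matrix counts. The counts are illustrative placeholders, not results from this notebook.

```python
def recall_and_precision(tn, fp, fn, tp):
    """Return (recall, precision) from binary confusion-matrix counts."""
    # Recall: share of truly interested customers that the model catches
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of flagged customers who are truly interested
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Illustrative counts: 100 interested customers, 80 of them flagged
recall, precision = recall_and_precision(tn=700, fp=200, fn=20, tp=80)
print(f"Recall   : {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

A model like this contacts many uninterested customers (low precision) but misses only 20 of the 100 interested ones, which is the trade-off a recall-first choice accepts.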

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Random Forest is chosen as the final prediction model because it performs the best among the four classification methods compared above.
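
One way to make such a choice reproducible is to rank the comparison table by the chosen metric. The sketch below does this with pandas on a small placeholder table: the model names and scores are hypothetical, and using `f1_score` as a tie-breaker is an assumption, not something stated in this project.

```python
import pandas as pd

# Placeholder scores for illustration only (not results from this notebook)
demo = pd.DataFrame(
    {"Recall": [0.70, 0.90, 0.90], "f1_score": [0.60, 0.55, 0.65]},
    index=["Model A", "Model B", "Model C"],
)

# Rank by recall first, then by F1 to break ties
ranked = demo.sort_values(["Recall", "f1_score"], ascending=False)
best_model = ranked.index[0]

print(ranked)
print("Selected model:", best_model)
```

The same two lines can be applied to the `result` DataFrame built in the comparison cell, since it has matching `Recall` and `f1_score` columns.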

# **Conclusion**

- Customers who were not previously insured are more likely to be interested.
- Customers whose vehicles are 1-2 years old are more likely to be interested than those in the other two vehicle-age groups.
- Most of the customers are from region 28.
- Customers pay a higher premium when the vehicle is more than 2 years old.
- Vehicles aged 1-2 years are damaged more often than those in the other two age groups.
- People aged 31 to 50 are more likely to respond.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***