### <CENTER><h1><u>Classification via KNN & SVM</u></h1></CENTER>

<br>

<CENTER>(TEAM CONTRIBUTORS: CHAITANYA DEVARSHI, SHASHANK SHEKHAR, BITTERLEIN KONNOTH BIJU)</CENTER>

----

===================================================================================================================

<h2><u>Content</u></h2>

1. [Introduction](#1.-Introduction)

  1.1 [Problem Statement](#1.1-Problem-Statement)
  
  1.2 [Methodology](#1.2-Methodology)
  

2. [Data Loading & Preparation](#2.-Data-Loading-&-Preparation)

  2.1 [Read the Data](#2.1-Read-the-Data)
  

3. [EDA](#3.-Exploratory-Data-Analysis)

  3.1 [Missing Values](#3.1-Missing-Values)
  
  3.2 [Univariate](#3.2-Univariate-Analysis)
  
   - 3.2.1 [For Numeric Features](#3.2.1-For-Numeric-Features)
     
   - 3.2.2 [For Categorical Features](#3.2.2-For-Categorical-Features)
     
  3.3 [Bivariate](#3.3-Bivariate-Analysis)
  
  3.4 [Multivariate](#3.4-Multivariate-Analysis)


4. [Data Cleaning](#4.-Data-Cleaning)

  4.1 [Handling Outliers](#4.1-Handling-Outliers)

  4.2 [Handling Skewness](#4.2-Handling-Skewness)


5. [Prepped Data Review](#5.-Prepped-Data-Review)


6. [Dimensionality Reduction](#6.-Dimensionality-Reduction)

  6.1 [Variance Threshold](#6.1-Variance-Threshold)
  
  6.2 [Forward Elimination](#6.2-Forward-Elimination)


7. [Binary Logistic Regression Models](#7.-Binary-Logistic-Regression-Models)

  7.1 [1<sup>st</sup> Model](#7.1-1st-Model)
  
  7.2 [2<sup>nd</sup> Model](#7.2-2nd-Model)
  
  7.3 [3<sup>rd</sup> Model](#7.3-3rd-Model)
  
  
8. [Model Selection](#8.-Model-Selection)


9. [Conclusion](#Conclusion)


===================================================================================================================

## 1. Introduction

Like many industries, the insurance industry is always interested in broadening its relationships with existing customers. To that end, insurance companies will often attempt to sell additional products to their existing customers. For example, if you have a homeowner’s policy with a particular insurance company, the company will likely also try to sell you an auto insurance policy, or perhaps a supplemental water damage policy for your homeowner’s policy.


__Dataset Description__

The data set we will be using is sourced from a Kaggle contribution. It comprises more than 14,000 observations of 1 response/dependent variable (which indicates whether or not the new insurance product was purchased) and 14 explanatory/independent variables. The insurance company gathered the data about customers to whom it offered the new product.
We are given whether each customer did or did not sign up for the new product, together with some customer information and information about their buying behavior for two other products.

A data dictionary for the dataset is provided below.

|Attribute  |Description          |
|-----------|---------------------|
|ID         |Unique customer identifier|
|TARGET     |Indicator of customer buying the new product (N = no, Y = yes)|
|loyalty    |Customer loyalty level, from low to high (0 to 3), 99 = unclassified|
|age        |Customer age in years|
|city       |Unique code per city (where the customer resides)|
|age_P      |Age of customer’s partner in years|
|LOR        |Length of customer’s relationship with company (in years)|
|lor_M      |Length of customer’s relationship with company (in months)|
|prod_A     |Customer previously bought Product A (0=no, 1=yes)|
|type_A     |Type of product A|
|turnover_A |Amount of money customer spent on Product A|
|prod_B     |Customer previously bought Product B (0=no, 1=yes)|
|type_B     |Type of product B|
|turnover_B |Amount of money customer spent on Product B|
|contract   |Type of contract|

----

<b> [Back to Content](#Content) </b>

## 1.1 Problem Statement

A large insurance company has tasked us with developing a model that can predict whether or not a given existing customer is likely to purchase an additional insurance product from the company. The insurance company plans to use the output of such a model to improve its customer retention and sales practices. By delving into a dataset containing various customer-related variables, we seek to uncover patterns and relationships that shed light on the factors influencing customer decisions.

----

<b> [Back to Content](#Content) </b>

## 1.2 Methodology

<h3><u>To address this assignment, we will follow the steps below:</u></h3>

1. **Load the dataset**: Upload the `M7_Data.csv` file from the DAV 6150 Github Repository.

2. **Read the dataset**: Using a Jupyter Notebook, read the dataset from the respective Github repository and load it into a Pandas DataFrame.

3. **Perform EDA**: Carry out Exploratory Data Analysis to examine the dataset's structure and understand the variables.

4. **Identify and rectify issues**: Detect data quality and integrity issues such as missing values or outliers during EDA, and take appropriate actions to address them.

5. **Prepped Data Review**: Cross-check everything and make sure our data is ready for further analysis.

6. **Feature Scaling, Selection & Dimensionality Reduction**: Apply feature selection techniques and perform dimensionality reduction to prepare the data for modeling.

7. **Binary Logistic Regression Modelling**: Build 3 different Binary Logistic Regression models.

8. **Model Selection**: Compare the 3 models and select one of them.

9. **Conclusion**: We will conclude our work.

----

<b> [Back to Content](#Content) </b>

## 2. Data Loading & Preparation

In [1]:
# Importing basic Libraries.

import pandas as pd 
import numpy as np

# Importing Libraries for statistical analysis.

import statsmodels.api as sm
from scipy import stats

# Importing Libraries for machine learning models.

import sklearn
from sklearn import metrics
import imblearn
from imblearn.metrics import specificity_score
from sklearn.metrics import roc_auc_score, roc_curve, auc, classification_report
from sklearn.feature_selection import VarianceThreshold

# Importing Libraries for plotting the graphs.

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# Importing Libraries for Standardising and Normalising.

from sklearn.preprocessing import StandardScaler

# Import Library for PCA

from sklearn.decomposition import PCA

# Import missingno library for checking on missing values.

import missingno as msno

# Importing train_test_split and cross-validation helpers.

from sklearn.model_selection import train_test_split, cross_val_score, KFold

# Importing Libraries for Forward elimination.

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

# Importing filterwarnings from warnings to ignore warnings.

import warnings
warnings.filterwarnings("ignore")

----

<b> [Back to Content](#Content) </b>

### 2.1 Read the Data

In [2]:
# Loading the data from the GitHub repository DAV-6150.

# Data types chosen for the columns based on domain knowledge.
# Note: this mapping is not passed to pd.read_csv below, so pandas infers the dtypes on load.
column_types = {        
        'TARGET': object,    
        'loyalty': object,
        'ID': 'int64',
        'age': 'int64',
        'city': object,
        'LOR': 'int64',    
        'prod_A': object,
        'type_A': object,
        'type_B': object,
        'prod_B': object,    
        'turnover_A': float,
        'turnover_B': float,
        'contract': object,
        'age_P': 'int64',
        'lor_M': 'int64'  
    }

insurance_data = pd.read_csv("https://raw.githubusercontent.com/Shashank4075/DAV-6150/refs/heads/main/M7_Data.csv")

# Reorder the DataFrame columns. 
insurance_data = insurance_data[['TARGET', 'loyalty', 'ID', 'city', 'prod_A', 'type_A', 'type_B', 'prod_B', 'contract', 
                   'age', 'age_P', 'lor_M', 'LOR', 'turnover_A', 'turnover_B']]

# Making a copy of the dataset.
df = insurance_data.copy()

df.head()

HTTPError: HTTP Error 404: Not Found

In [None]:
# Identifying how many rows and columns the dataframe consists of.

df.shape

In [None]:
# Getting a concise summary of the DataFrame .

df.info()

**Dataset observation:**



- The index ranges from 0 to 14015.

- There are 15 attributes in total.

- Of these, 12 are 'int', 2 are 'float' and 1 is 'object'.

- At this point there are no missing values in any column.
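
The loading cell defines a `column_types` mapping, but the `read_csv` call shown does not receive it, which is consistent with the mostly integer dtypes reported above. As a minimal sketch, assuming a tiny made-up CSV in place of `M7_Data.csv` and a hypothetical subset of the mapping, this is how such a mapping could be applied through the `dtype` argument:

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for M7_Data.csv (values are illustrative only).
csv_text = """TARGET,loyalty,ID,age,turnover_A
Y,99,1,35,333.5
N,1,2,52,394.7
"""

# Hypothetical subset of the column_types mapping from the loading cell.
column_types_subset = {
    "TARGET": object,
    "loyalty": object,
    "ID": "int64",
    "age": "int64",
    "turnover_A": float,
}

# With dtype=..., pandas uses the mapping instead of inferring the types.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=column_types_subset)

# Without it, 'loyalty' is inferred as an integer column.
inferred_df = pd.read_csv(io.StringIO(csv_text))

print(typed_df.dtypes)
print(inferred_df.dtypes)
```

Reading categorical codes such as `loyalty` as `object` keeps them from being treated as quantities by numeric summaries, at the cost of having to encode them explicitly before modeling.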

----

<b> [Back to Content](#Content) </b>

## 3. Exploratory Data Analysis

- EDA analyzes a data set in order to summarize its characteristics, identify relationships between its attributes, and discover patterns, trends, outliers, missing values and invalid values within the data.

In [None]:
# Checking columns names.

df.columns

----

<b> [Back to Content](#Content) </b>

### 3.1 Missing Values

In [None]:
# Checking for null values.

df.isnull().sum()

- As of now there are no nulls present.

In [None]:
# Checking for duplicate rows.

count_duplicate = df.duplicated().sum()

print("Number of duplicate rows:", count_duplicate)

- There are 3008 duplicate rows in the dataset; they should be removed.

In [None]:
# Removing the duplicate rows and resetting the index so that the row labels stay contiguous.

df = df.drop_duplicates().reset_index(drop=True)

# Checking the shape after removing the duplicate rows.
df.shape
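
As a small illustration of what the duplicate check counts, here is a sketch on a made-up four-row frame (the values are not from the dataset). `duplicated()` flags only the second and later occurrences of a row, so the count equals the number of rows that `drop_duplicates()` removes:

```python
import pandas as pd

# Made-up data: the row (2, 41) appears twice.
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "age": [30, 41, 41, 55]})

# Rows that are identical to an earlier row (the first occurrence is not flagged).
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

Resetting the index is optional, but it keeps later label-based lookups such as `series[i]` from failing on the gaps left by the removed rows.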

In [None]:
# Check if any negative value exists in the entire DataFrame.

# Select only numeric columns.
numeric_df = df.select_dtypes(include=['number'])

# Find columns that have negative values.
columns_with_negatives = numeric_df.columns[(numeric_df < 0).any()]

if (numeric_df < 0).any().any():
    print("There are negative values in the DataFrame.")
else:
    print("There are no negative values in the DataFrame.")
    
# Print the column names that contain negative values.
print("Columns with negative values:", columns_with_negatives.tolist())

----

<b> [Back to Content](#Content) </b>

### 3.2 Univariate Analysis

In [None]:
df.head()

In [None]:
df.info()

----

<b> [Back to Content](#Content) </b>

### 3.2.1 For Numeric Features

In [None]:
# Create function to plot dist and box plot for all the numeric features. 

def box_dist_plot(df , column):
    
    
    """
    
    This function is to plot box-plot and distribution-plot for a given column, 
    column's median value, with count and percentage of null values. 
    
    Parameters :-
        df : Dataframe           # df contains Dataframe.
        column : str             # Column name which is to be ploted.
    
    """
    
    plt.style.use('ggplot')  
    
    plt.figure(figsize=(18, 7))

    # Box plot.
    plt.subplot(121)
    sns.boxplot(y = df[column])  # Create box plot
    plt.title(f'Box Plot of : {column}')

    # Distribution plot.
    plt.subplot(122)
    sns.histplot(df[column], bins=30, kde=True)  # Create histogram with KDE
    plt.title(f'Distribution Plot of : {column}')

    # Adjusting the layout.
    plt.tight_layout() 

    plt.show()  

    # To print statistics.
    print(df[column].describe())
    print('Median :', df[column].median())
    print()
    print('Total Number of null values :', df[column].isnull().sum(), 'count,', 
          round(df[column].isnull().mean() * 100, 2), '%')

In [None]:
box_dist_plot(df,'age')

- Here 'age' is the customer's age in years. The data is right-skewed and has outliers, but they are valid data points.

In [None]:
box_dist_plot(df,'age_P')

- Here 'age_P' is the age of the customer's partner in years. The data is right-skewed and has outliers, but all the data points are valid.

__Note__ - The two univariate analyses above show exactly the same summary statistics for 'age' and 'age_P'. We will check whether every data point is the same in both attributes; if so, we can drop one of them.

In [None]:
box_dist_plot(df,'LOR')

- Here 'LOR' is the length of the customer's relationship with the insurance company in years. Most relationships are 0 to 2 years long, so there are outliers, but they are valid data points.

In [None]:
box_dist_plot(df,'lor_M')

- Here 'lor_M' is the length of the customer's relationship with the insurance company in months. Most relationships are 0 to 30 months long, so there are outliers, but they are valid data points. This attribute mirrors 'LOR' in what it measures, so we may be able to exclude one of the two.

In [None]:
box_dist_plot(df,'turnover_A')

- Here 'turnover_A' is the amount of money the customer spent on Product A; the highest value is around 5500 and the lowest is around 300.

In [None]:
box_dist_plot(df,'turnover_B')

- Here 'turnover_B' is the amount of money the customer spent on Product B; the highest value is around 12250 and the lowest is around 190.

----

<b> [Back to Content](#Content) </b>

### 3.2.2 For Categorical Features

In [None]:
# Generates a count plot and displays the count of each category for a specified column in the dataframe.

def plot_category_counts(df, column):
    
    """
    Plot a count plot and display the count of each category 
    for a specified column in the dataframe.
    
        df : DataFrame
        The dataframe that contains the column.
        column : str
        The name of the categorical column to plot and count.
    """

    # Count plot for the specified column.
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=column, palette="viridis")
    

    # Set plot labels and title
    plt.xlabel(column)  
    plt.ylabel('Count')     
    plt.xticks(rotation=90, ha='right')
    plt.title(f'Count of {column}')

    # Display the plot
    plt.show()

    # Display count of each category
    counts = df[column].value_counts()
    print(f"\nCounts for {column}:\n{counts}")

    # Number of unique categories in the column
    unique_count = df[column].nunique()
    print(f"\nUnique for {column}:\n{unique_count}")

In [None]:
# calling the function
plot_category_counts(df, 'TARGET')

- The target variable has around 6016 'Y' values and around 8000 'N' values.

In [None]:
# calling the function
plot_category_counts(df,'loyalty')

- 'loyalty' is the customer's loyalty level from low to high (0 to 3), and 99 indicates an unclassified value. There are many unclassified values, which we do not need to worry about.
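
A minimal sketch of how the share of the `99` sentinel could be quantified, and how it could be labeled so that it is read as a category and not as a loyalty level far above 3. The series below is made up, and the `"unclassified"` label is an illustrative choice, not something the notebook does:

```python
import pandas as pd

# Made-up loyalty codes; 99 marks an unclassified customer.
loyalty = pd.Series([0, 1, 99, 3, 99, 2], name="loyalty")

# Percentage of rows carrying the sentinel value.
unclassified_share = (loyalty == 99).mean() * 100

# Treat the codes as labels and give the sentinel an explicit name.
loyalty_labeled = loyalty.astype(str).replace({"99": "unclassified"})

print(round(unclassified_share, 2))
print(loyalty_labeled.tolist())
```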

In [None]:
# calling the function
plot_category_counts(df,'city')

- The count plot shows that most of the data belongs to one particular class, so we will check its percentage.

In [None]:
# Calculating the percentage share of each unique value of the 'city' attribute.

(df.city.value_counts()/len(df.city)) * 100

- Most of the data lies in city 2, which accounts for 97.88% of the rows. The city attribute is therefore of little use in training the models, and we can drop it.
- This also shows that our dataset is biased toward one particular city (city 2).
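
The same percentage check can be written with `value_counts(normalize=True)`. The sketch below uses made-up city codes, and the 90% cut-off is a hypothetical threshold chosen only for illustration:

```python
import pandas as pd

# Made-up city codes: nine rows in city 2, one row in city 7.
city = pd.Series([2, 2, 2, 2, 2, 2, 2, 2, 2, 7], name="city")

# Share of each category in percent, sorted from most to least frequent.
share = city.value_counts(normalize=True) * 100

# The first entry is the dominant category.
dominant_city = share.index[0]
dominant_share = share.iloc[0]

# Hypothetical rule of thumb: flag the column when one value covers at least 90% of rows.
is_near_constant = bool(dominant_share >= 90 - 1e-9)

print(dominant_city, dominant_share, is_near_constant)
```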

In [None]:
# calling the function
plot_category_counts(df,'prod_A')

- This shows how many customers did and did not buy Product A.

In [None]:
# calling the function
plot_category_counts(df,'type_A')

- This shows the count of each type of Product A; there are 3 types of Product A in total.

In [None]:
# calling the function
plot_category_counts(df,'type_B')

- This shows the count of each type of Product B; there are 4 types of Product B in total.

In [None]:
# calling the function
plot_category_counts(df,'prod_B')

- This shows how many customers did and did not buy Product B.

In [None]:
# calling the function
plot_category_counts(df,'contract')

- This shows the count of contract types, but there is only one type of contract. It is therefore of no use to us, and we can drop it later.
- In other words, our dataset contains only contract type 2.

__Note__ - The 'ID' attribute is a unique identifier with no role in model training, so we will exclude it during data cleaning; that is why we do not plot anything for it.

----

<b> [Back to Content](#Content) </b>

### 3.3 Bivariate Analysis

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'prod_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='prod_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and  Product of A')
plt.legend(title='prod_A')

# Display the plot
plt.show()

- More customers bought Product A when the response value is 'N' than when the response value is 'Y'.
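
Raw counts like these depend on how many customers fall in each TARGET class, so a row-normalized cross-tabulation can be a useful complement. This is a sketch on made-up data, not a result from the dataset:

```python
import pandas as pd

# Made-up rows: four 'N' customers and two 'Y' customers.
toy = pd.DataFrame({
    "TARGET": ["N", "N", "N", "N", "Y", "Y"],
    "prod_A": [1, 1, 1, 0, 1, 0],
})

# Absolute counts per (TARGET, prod_A) combination.
counts = pd.crosstab(toy["TARGET"], toy["prod_A"])

# Share of each prod_A value within each TARGET class (every row sums to 1).
rates = pd.crosstab(toy["TARGET"], toy["prod_A"], normalize="index")

print(counts)
print(rates)
```

In this toy example the 'N' class has more Product A buyers in absolute terms (3 versus 1), and the within-class rate (0.75 versus 0.5) shows how much of that gap remains once class size is accounted for.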

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'type_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='type_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and Type of Product of A')
plt.legend(title='type_A')

# Display the plot
plt.show()

- The count of Product A type '3' is higher when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'prod_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='prod_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and  Product of B')
plt.legend(title='prod_B')

# Display the plot
plt.show()

- More customers bought Product B when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'type_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='type_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and Type of Product of B')
plt.legend(title='type_B')

# Display the plot
plt.show()

- The count of Product B type '3' is higher when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Creating the function to plot a bar graph between categorical and numerical columns, or categorical to categorical.

def plot_bar(df, x_col, y_col, title="Bar Plot", x_label=None, y_label=None, color='c', 
             size=(10, 6), rotate_xticks=True, xticks_rotation=45): 
    
    # Create the bar plot
    plt.figure(figsize=size)
    sns.barplot(x=x_col, y=y_col, data=df, color=color)

    # Set the title and axis labels
    plt.title(title, fontsize=16)
    plt.xlabel(x_label if x_label else x_col, fontsize=14)
    plt.ylabel(y_label if y_label else y_col, fontsize=14)
    
    # Rotate x-axis labels by the requested angle (previously hard-coded to 90)

    if rotate_xticks:
        plt.xticks(rotation=xticks_rotation, ha='right')
    
    # Display the plot
    plt.tight_layout()
    plt.show()

In [None]:
#Calling the function to plot bargraph

plot_bar(df, 'TARGET', 'turnover_A', title="TARGET  vs Turnover of A  ", x_label="TARGET", y_label="turnover_A")

- The graph shows that customers who purchased the new product (TARGET = 'Y') spent more on Product A than customers who chose not to purchase it.

In [None]:
#Calling the function to plot bargraph

plot_bar(df, 'TARGET', 'turnover_B', title="TARGET  vs Turnover of B  ", x_label="TARGET", y_label="turnover_B")

- The graph shows that customers who did not purchase the new product (TARGET = 'N') spent more on Product B than customers who chose to purchase it.

In [None]:
#Calling the function to plot bargraph

plot_bar(df, 'loyalty', 'turnover_A', title="loyalty  vs Turnover of A  ", x_label="loyalty", y_label="turnover_A")

- Turnover of Product A is highest when loyalty is 1.

In [None]:
#Calling the function to plot bargraph.

plot_bar(df, 'loyalty', 'LOR', title="loyalty  vs Length of relationship  ", x_label="loyalty", y_label="LOR")

- Length of relationship is highest when loyalty is 1.

In [None]:
# Calling the function to plot bargraph

plot_bar(df, 'loyalty', 'turnover_B', title="loyalty  vs Turnover of B ", x_label="loyalty", y_label="turnover_B")

- Turnover of Product B is highest when loyalty is 2.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'loyalty']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='loyalty', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and loyalty')
plt.legend(title='loyalty')

# Display the plot
plt.show()

- In both cases, the unclassified loyalty level ('99') has the highest count.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['prod_A', 'type_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='prod_A', y='counts', hue='type_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('prod_A')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of Product A and Type of Product A')
plt.legend(title='type_A')

# Display the plot
plt.show()

- For Product A, the most common type is '3' among customers who bought it and '0' among customers who did not.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['prod_B', 'type_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='prod_B', y='counts', hue='type_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('prod_B')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of Product B and Type of Product B')
plt.legend(title='type_B')

# Display the plot
plt.show()

- For Product B, the most common type is '3' among customers who bought it and '0' among customers who did not.

In [None]:
# Plotting a scatter plot to look for insight.

sns.scatterplot(x='age', y='LOR', data=df, color='c')  
plt.title('Scatter Plot: Age vs Length of Relationship')
plt.show()

- Both columns are strongly correlated with each other.

----

<b> [Back to Content](#Content) </b>

### 3.4 Multivariate Analysis

In [None]:
# Calculate the average of selected columns grouped by the binary column
averages = df.groupby('TARGET')[['age', 'LOR']].mean()

# Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by TARGET')
plt.xlabel('TARGET')
plt.ylabel('Average Value')
plt.xticks(rotation=0) 
plt.show()

- The average age is lower and the average length of relationship is higher when the response variable is "N".
- The average age is higher and the average length of relationship is lower when the response variable is "Y".

In [None]:
# Calculate the average of selected columns grouped by the binary column
averages = df.groupby('TARGET')[['turnover_A','turnover_B','age','LOR']].mean()

# Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Target')
plt.xlabel('Target')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  
plt.show()

- When TARGET is 'N' (the product was not purchased), the average length of relationship is higher.

In [None]:
# Calculate the average of selected columns grouped by the binary column
averages = df.groupby('prod_A')[['turnover_A','age','LOR']].mean()

# Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Product A')
plt.xlabel('Product A')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  
plt.show()

- When Product A is purchased, the average length of relationship is higher.
- When Product A is not purchased, the average age is higher.

In [None]:
# Calculate the average of selected columns grouped by the binary column
averages = df.groupby('prod_B')[['turnover_B','age','LOR']].mean()

# Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Product B')
plt.xlabel('Product B')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  
plt.show()

- When Product B is purchased, the average length of relationship is higher.

----

<b> [Back to Content](#Content) </b>

## 4. Data Cleaning

In [None]:
# Check for missing values in each columns.

df.isnull().sum()

In [None]:
# The lambda function converts 'Y' to 1 and any other value to 0 in the 'TARGET' attribute.

df['TARGET'] = df['TARGET'].apply(lambda x: 1 if x == 'Y' else 0)

- Converting 'Y' and 'N' in the response variable ('TARGET') to 1 and 0 for easier analysis.
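
Because the lambda sends every value other than 'Y' to 0, an unexpected label would be silently counted as a non-purchase. A minimal sketch of an alternative using `Series.map`, shown on a made-up series that contains one unexpected label, makes such values visible as missing:

```python
import pandas as pd

# Made-up labels; "n/a" stands in for an unexpected value.
target = pd.Series(["Y", "N", "Y", "n/a"])

# The notebook's approach: anything that is not 'Y' becomes 0.
encoded_apply = target.apply(lambda x: 1 if x == "Y" else 0)

# Alternative: an explicit mapping leaves unknown labels as NaN.
encoded_map = target.map({"Y": 1, "N": 0})

# Labels that the mapping did not recognise.
unexpected = target[encoded_map.isna()]

print(encoded_apply.tolist())
print(unexpected.tolist())
```

If the column is known to contain only 'Y' and 'N', both approaches give the same result.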

In [None]:
# Checking the 'Target'.

df.TARGET.value_counts()

In [None]:
# Removing the unwanted attributes identified in the univariate analysis.

df = df.drop(columns=['ID', 'contract', 'city'])

In [None]:
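# Comparing each data point of 'age' and 'age_P', as mentioned in the note from the univariate analysis.
# A vectorized comparison is used so the check does not rely on index labels
# (label lookups such as df.age[i] can fail once duplicate rows have been dropped).

matches = df['age'] == df['age_P']

a = int(matches.sum())        # rows where the two values are equal
p = int((~matches).sum())     # rows where the two values differ

print(a)
print(p)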

- Both attributes hold exactly the same value in every row, so we will remove 'age_P'.

In [None]:
# Removing 'age_P' attribute.

df = df.drop(columns=['age_P'])

In [None]:
# Removing 'lor_M': as detected from the bivariate analysis, the correlation of LOR and lor_M is 1,
# which means that they are very strongly correlated and provide similar information; the only difference is that
# one is in years and the other is in months. So, we can remove either of them.

df = df.drop(columns=['lor_M'])
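
The comment above cites a correlation of 1 between `LOR` and `lor_M`, but the cells shown here do not compute it. As a minimal sketch on made-up values, this is one way such a claim could be checked before dropping a column:

```python
import numpy as np
import pandas as pd

# Made-up values in which lor_M is an exact linear function of LOR (12 * LOR + 3).
toy = pd.DataFrame({
    "LOR": [0, 1, 2, 3, 6],
    "lor_M": [3, 15, 27, 39, 75],
})

# Pearson correlation between the two columns.
pearson_r = toy["LOR"].corr(toy["lor_M"])

# Treat the pair as redundant when the correlation is numerically 1.
redundant = bool(np.isclose(pearson_r, 1.0))

print(pearson_r, redundant)
```

On the real data the check would have to run before `lor_M` is dropped, for example as `df['LOR'].corr(df['lor_M'])`.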

----

<b> [Back to Content](#Content) </b>

### 4.1 Handling Outliers

In [None]:
# Numeric columns to check (this list includes 'LOR').
numeric_cols = ['age', 'LOR', 'turnover_A', 'turnover_B']

# Looping through numeric columns to get the lower and upper bound values.  
for col in numeric_cols:
    q1 = np.quantile(df[col], 0.25)
    q3 = np.quantile(df[col], 0.75)
    iqr = q3 - q1
    upper_bound = q3 + (1.2 * iqr)                  # Multiplying by 1.2 to keep the bounds from going negative.
    lower_bound = q1 - (1.2 * iqr)
    bounds = [lower_bound, upper_bound]             # Named 'bounds' so the built-in range() is not shadowed.
    print(f"range in {col}:", bounds)
    
    # checking the maximum value 
    max_value = df[col].max()
    print(f"The maximum value in {col} is: {max_value}")

- From this, we can observe that the maximum values of age, LOR, turnover_A, and turnover_B are higher than their upper bounds, so they are potential outliers. However, we believe the numbers are acceptable and valid. For example, 12,249 is a valid value for turnover_B, and a maximum age of 102 is possible. So, we retain the values for the analysis.

- Reference: https://medium.com/@akashmishra77/box-plots-detect-and-remove-outliers-from-distribution-a124ee88cf3e
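
The bound calculation can be sketched as a small reusable helper. The function name `iqr_bounds`, its parameter `k`, and the sample values are all hypothetical; the conventional multiplier for box-plot whiskers is 1.5, while the cell above uses 1.2:

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the first and third quartiles."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample with one extreme value.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Same multiplier as the notebook cell.
lower, upper = iqr_bounds(sample, k=1.2)

# Values outside the bounds are candidate outliers, not automatically errors.
outliers = sample[(sample < lower) | (sample > upper)]

print(lower, upper, outliers)
```

Flagging values this way only identifies candidates; whether to keep them remains a judgment about validity, as in the note above.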

----

<b> [Back to Content](#Content) </b>

### 4.2 Handling Skewness

In [None]:
# Checking for negative value.

numeric_cols = ['age', 'LOR', 'turnover_A', 'turnover_B']

negative_check = df[numeric_cols].apply(lambda x: (x < 0).any())

print(negative_check)

- All numeric attributes are non-negative.

In [None]:
# Checking for zero values.

zero_check = df[numeric_cols].apply(lambda x: (x == 0).any())

print(zero_check)

- Apart from LOR, none of the features above contains a zero value.

In [None]:
# Applying the Box-Cox transformation to each numeric column to handle the skewness.

for col in numeric_cols:
    
    # Increasing values by 1 to handle zeros (Box-Cox requires strictly positive input)
    df[col] += 1
    
    # Performing the Box-Cox transformation and saving the lambda value
    fitted_data, fitted_lambda = stats.boxcox(df[col])
    
    # Replacing the original column with the transformed data
    df[col] = fitted_data
    
    # Print the lambda value for each column
    print(f"Lambda value for {col}: {fitted_lambda}")
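
Since the transformed values overwrite the original columns, it can help to know that the transformation is reversible when the fitted lambda is kept. The sketch below uses made-up values and `scipy.special.inv_boxcox` to undo both the Box-Cox step and the +1 shift:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Made-up, right-skewed values that include a zero.
raw = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 40.0])

# Shift by 1 so every value is strictly positive, as in the cell above.
shifted = raw + 1

# With no lambda supplied, boxcox estimates it and returns (transformed data, lambda).
transformed, lam = stats.boxcox(shifted)

# Invert the transform with the same lambda, then undo the +1 shift.
recovered = inv_boxcox(transformed, lam) - 1

print(lam)
print(np.allclose(recovered, raw))
```

Keeping each column's lambda (the cell above only prints it) is what makes it possible to map transformed values back to the original scale later.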

----

<b> [Back to Content](#Content) </b>

## 5. Prepped Data Review

In [None]:
# Checking the cleaned dataframe.

df.head()

In [None]:
# Checking the shape of the df.

df.shape

In [None]:
# Checking that every column has the correct data type.

df.dtypes

In [None]:
# Checking the descriptive statistics.

df.describe()

In [None]:
# Ensure that there are no duplicates.

df.duplicated().sum()

In [None]:
# Ensuring that there is no null value present.

df.isnull().sum()

----

<b> [Back to Content](#Content) </b>

#### Ensuring Univariate

In [None]:
# calling the function
box_dist_plot(df, 'age')

- The age column still has outliers, but they are valid data points.

In [None]:
# calling the function
box_dist_plot(df, 'LOR')

- The LOR column still has outliers, but they are valid data points.

In [None]:
# calling the function
box_dist_plot(df, 'turnover_A')

- The turnover_A column still has outliers, but they are valid data points.

In [None]:
# calling the function
box_dist_plot(df, 'turnover_B')

- The turnover_B column still has outliers, but they are valid data points.

In [None]:
# calling the function
plot_category_counts(df,'TARGET')

- Customers who did not buy the new product have the highest count.

In [None]:
# calling the function
plot_category_counts(df,'loyalty')

- Count of unspecified loyalty is maximum.

In [None]:
# calling the function
plot_category_counts(df,'prod_A')

- Customers who bought Product A have the highest count.

In [None]:
# calling the function
plot_category_counts(df,'type_A')

- Product A type 3 has the highest count.

In [None]:
# calling the function
plot_category_counts(df,'prod_B')

- Customers who bought Product B have the highest count.

In [None]:
# calling the function
plot_category_counts(df,'type_B')

- Product B type 3 has the highest count.

----

<b> [Back to Content](#Content) </b>

#### Ensuring Bivariate

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'prod_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='prod_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and  Product of A')
plt.legend(title='prod_A')

# Display the plot
plt.show()

- More customers bought Product A when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'type_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='type_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and Type of Product of A')
plt.legend(title='type_A')

# Display the plot
plt.show()

- The count of Product A type '3' is higher when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'prod_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='prod_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and  Product of B')
plt.legend(title='prod_B')

# Display the plot
plt.show()

- More customers bought Product B when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['TARGET', 'type_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='TARGET', y='counts', hue='type_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('TARGET')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of TARGET and Type of Product of B')
plt.legend(title='type_B')

# Display the plot
plt.show()

- Type '3' of Product B occurs more often when the response value is 'N' than when the response value is 'Y'.

In [None]:
# Creating the function to plot a bar graph between a categorical and a numerical column, or between two categorical columns.

def plot_bar(df, x_col, y_col, title="Bar Plot", x_label=None, y_label=None, color='b', size=(10, 6), rotate_xticks=True, xticks_rotation=45):
    # Create the bar plot (for a numeric y_col, seaborn draws the mean per category)
    plt.figure(figsize=size)
    sns.barplot(x=x_col, y=y_col, data=df, color=color)

    # Set the title and axis labels
    plt.title(title, fontsize=16)
    plt.xlabel(x_label if x_label else x_col, fontsize=14)
    plt.ylabel(y_label if y_label else y_col, fontsize=14)

    # Rotate the x-axis labels by the requested angle
    if rotate_xticks:
        plt.xticks(rotation=xticks_rotation, ha='right')

    # Display the plot
    plt.tight_layout()
    plt.show()

In [None]:
# Calling the function to plot the bar graph

plot_bar(df, 'TARGET', 'turnover_A', title="TARGET vs Turnover of A", x_label="TARGET", y_label="turnover_A", color='g')


- The graph shows that customers who purchased the new product (TARGET = 1) have a higher average turnover on Product A than customers who chose not to purchase it.

In [None]:
# Calling the function to plot the bar graph

plot_bar(df, 'TARGET', 'turnover_B', title="TARGET vs Turnover of B", x_label="TARGET", y_label="turnover_B", color='g')


- The graph shows that customers who purchased the new product (TARGET = 1) have a higher average turnover on Product B than customers who chose not to purchase it.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['prod_A', 'type_A']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='prod_A', y='counts', hue='type_A', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('prod_A')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of Product A and Type of Product A')
plt.legend(title='type_A')

# Display the plot
plt.show()

- Among customers who bought Product A, type '3' is the most common; among customers who did not buy it, type '0' is the most common.

In [None]:
# Count the values for each combination of categorical column.
count_data = df.groupby(['prod_B', 'type_B']).size().reset_index(name='counts')

# Grouped bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='prod_B', y='counts', hue='type_B', data=count_data, palette='viridis')

# Set plot labels and title
plt.xlabel('prod_B')
plt.ylabel('Counts')
plt.title('Grouped Bar Plot of Product B and Type of Product B')
plt.legend(title='type_B')

# Display the plot
plt.show()

- Among customers who bought Product B, type '3' is the most common; among customers who did not buy it, type '0' is the most common.

In [None]:
# Plotting a scatter plot of age against length of relationship (LOR).

sns.scatterplot(x='age', y='LOR', data=df, color='c')
plt.title('Scatter Plot: Age vs Length of Relationship')
plt.show()

- The scatter plot suggests that the two columns are strongly correlated with each other.
- This may indicate a relationship in which changes in one variable can be used to predict changes in the other.
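
The visual impression from a scatter plot can be backed by a number. The following is a minimal sketch of Pearson's correlation coefficient; the `age` and `lor` arrays are made-up toy values, not rows from the dataset, and `pearson_r` is a hypothetical helper. On the notebook's data, `df['age'].corr(df['LOR'])` would give the corresponding figure.

```python
import numpy as np

# Hypothetical toy values, not taken from the dataset.
age = np.array([25, 32, 41, 47, 53, 60, 68], dtype=float)
lor = np.array([1, 2, 3, 3, 4, 5, 6], dtype=float)

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r = pearson_r(age, lor)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 or -1 supports calling the relationship strong; a value near 0 would not.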

----

<b> [Back to Content](#Content) </b>

#### Ensuring Multivariate 

In [None]:
# Average of the selected numeric columns for each value of the categorical column.
# Step 1: Calculate the average of selected columns grouped by the binary column
averages = df.groupby('TARGET')[['age', 'LOR']].mean()

# Step 2: Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by TARGET')
plt.xlabel('TARGET')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  # Keep the x-axis labels horizontal
plt.show()

- The average age is lower and the average length of relationship is higher when the response variable is "N".
- The average age is higher and the average length of relationship is lower when the response variable is "Y".

In [None]:
# Average of the selected numeric columns for each value of the categorical column.

# Step 1: Calculate the average of selected columns grouped by the binary column
averages = df.groupby('TARGET')[['turnover_A','turnover_B','age','LOR']].mean()

# Step 2: Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Target')
plt.xlabel('Target')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  # Keep the x-axis labels horizontal
plt.show()

- When TARGET is 0 (the product is not purchased), the average length of relationship is higher.

In [None]:
# Average of the selected numeric columns for each value of the categorical column.

# Step 1: Calculate the average of selected columns grouped by the binary column
averages = df.groupby('prod_A')[['turnover_A','age','LOR']].mean()

# Step 2: Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Product A')
plt.xlabel('Product A')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  # Keep the x-axis labels horizontal
plt.show()

- When Product A is purchased, the average length of relationship is higher.
- When Product A is not purchased, the average age is higher.

In [None]:
# Average of the selected numeric columns for each value of the categorical column.

# Step 1: Calculate the average of selected columns grouped by the binary column
averages = df.groupby('prod_B')[['turnover_B','age','LOR']].mean()

# Step 2: Plot the averages using a barplot
averages.plot(kind='bar', figsize=(10, 6))
plt.title('Average of Columns Grouped by Product B')
plt.xlabel('Product B')
plt.ylabel('Average Value')
plt.xticks(rotation=0)  # Keep the x-axis labels horizontal
plt.show()

- When Product B is purchased, the average length of relationship is higher.

----

<b> [Back to Content](#Content) </b>

## 6. Dimensionality Reduction

In [None]:
# Re-analysing the 'loyalty' column.

print(df['loyalty'].value_counts(normalize=True))

- We will remove the loyalty column: 45% of its values are unclassified, and it is ordinal data.



In [None]:
# Removing 'loyalty' from df.

df = df.drop(columns=['loyalty'])

In [None]:
# Separating the 'TARGET' attribute from the rest of the attributes, as it is the response variable.

y = df['TARGET']

y.head()

In [None]:
# Dropping the 'TARGET' column.

X = df.drop('TARGET', axis=1)

X.head(2)

In [None]:
# Recoding the 'type_A' & 'type_B' unique values so that the two columns are clearly distinguishable.

# Mapping the 'type_A' values to 1, 3 & 5 (the first three odd numbers).
X['type_A'] = X['type_A'].replace({0: 1, 3: 3, 6: 5})
print(X.type_A.unique())


# Mapping the 'type_B' values to 2, 4, 6 & 8 (the first four even numbers).
X['type_B'] = X['type_B'].replace({0: 2, 3: 4, 6: 6, 9: 8})
print(X.type_B.unique())

In [None]:
# Get the dummy data from the categorical columns.
cat_cols = ['prod_A', 'type_A', 'prod_B', 'type_B']

# Converting the categorical columns to object type.
X[cat_cols] = X[cat_cols].astype('object')


X_cat_dummy = pd.get_dummies(X[cat_cols], drop_first=True).astype(int)

X_cat_dummy.head(2)
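
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up `type_A` column (the four values are hypothetical, chosen only to match the recoded labels 1, 3 and 5). The first category becomes the reference level and gets no column of its own, which is why names such as `type_A_3` appear later among the model features.

```python
import pandas as pd

# Hypothetical toy column using the recoded labels 1, 3 and 5.
toy = pd.DataFrame({"type_A": [1, 3, 5, 3]}).astype("object")

# drop_first=True removes the dummy of the first category (here: 1),
# so a row with all zeros means "type_A is 1".
dummies = pd.get_dummies(toy, drop_first=True).astype(int)
print(dummies)
```

Dropping one level per column avoids perfectly collinear dummies in the regression models that follow.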

In [None]:
# Drop the original columns from the DataFrame.

X.drop(columns=cat_cols, inplace=True)

In [None]:
# Standardise the numeric attributes.

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()

X_std = std_scaler.fit_transform(X)

In [None]:
# Convert the ndarray of X_std to a DataFrame, keeping X's column names and row index
# so that the later concat with X_cat_dummy aligns row by row.
X_std = pd.DataFrame(X_std, columns=X.columns, index=X.index)

# Optional: View the first few rows to ensure the data is correctly formatted
print(X_std.head())

In [None]:
# Concatenate the standardised numeric data with the one-hot encoded columns.

X = pd.concat([X_std, X_cat_dummy], axis=1)

X.head()

In [None]:
# Checking the shape of dataframe having independent attributes.

X.shape

In [None]:
# Create the train & test split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Print the shapes of the resulting datasets.

print("Training dataset shapes ->  X: {}, y: {}".format(X_train.shape, y_train.shape))
print("Testing dataset shapes  ->  X: {}, y: {}".format(X_test.shape, y_test.shape))
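
The cells above fit `StandardScaler` on all rows before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative ordering, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. All names here (`X_demo`, `X_tr`, `scaler`, and so on) are illustrative assumptions and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features; not the notebook's data.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# ... then learn the scaling statistics from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
```

With this ordering the test set stays unseen until evaluation; whether the difference matters for this dataset would have to be checked by rerunning the models.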

----

<b> [Back to Content](#Content) </b>

### 6.1 Variance Threshold

In [None]:
# Creating the VarianceThreshold object (removes features with variance below the threshold)
selector = VarianceThreshold(threshold=0.01)

# Fit the selector to the training data
selector.fit(X_train)

# Get the list of features with low variance.
low_var_cols = X_train.columns[~selector.get_support()].tolist()

print(f"Total number of attributes with Low Variance : {len(low_var_cols)}")

In [None]:
# Dropping Low Variance attributes from X_train and X_test.

X_train = X_train.drop(low_var_cols, axis=1)

X_test = X_test.drop(low_var_cols, axis=1)

print(f"Shape of X_train after removing the low variance columns : {X_train.shape}")
print()
print(f"Shape of X_test after removing the low variance columns : {X_test.shape}")
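
For a 0/1 dummy column in which a share p of the rows equals 1, the variance is p(1 - p), so a threshold of 0.01 removes dummies that are 1 in fewer than roughly 1% of rows (or in more than roughly 99%). The sketch below illustrates this on a made-up two-column frame; the column names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: 'almost_constant' is 1 in only one of 200 rows.
toy = pd.DataFrame({
    "balanced_flag": [0, 1] * 100,          # p = 0.5   -> variance 0.25
    "almost_constant": [1] + [0] * 199,     # p = 0.005 -> variance ~0.005
})

vt = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[vt.get_support()].tolist()
dropped = toy.columns[~vt.get_support()].tolist()

print("variances:", vt.variances_.round(4))
print("kept:", kept, "| dropped:", dropped)
```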

----

<b> [Back to Content](#Content) </b>

### 6.2 Forward Elimination

In [None]:
# Setting up forward sequential feature selection with a linear regression estimator (feature set for model1).

lreg = LinearRegression()
sfs1 = sfs(lreg, k_features='best', forward=True, verbose=2, scoring='r2')

In [None]:
# Fit the Sequential Feature Selector.

sfs1 = sfs1.fit(X_train, y_train)

In [None]:
# Extracting the names of the selected features after the fitting process.

feat_names1 = list(sfs1.k_feature_names_)
feat_names1

In [None]:
# Define the selected feature names for separation
selected_features = feat_names1

# Create a new DataFrame with the selected features
X_train_all_best = X_train[selected_features]
X_test_all_best = X_test[selected_features]

# Display the shapes of the new DataFrames to confirm the selection
display(X_train_all_best.shape)
display(X_test_all_best.shape)

In [None]:
# Setting up forward sequential feature selection with a linear regression estimator (6 features, for model2).

lreg = LinearRegression()
sfs2 = sfs(lreg, k_features=6, forward=True, verbose=2, scoring='r2')

In [None]:
# Fit the Sequential Feature Selector.

sfs2 = sfs2.fit(X_train, y_train)

In [None]:
# Extracting the names of the selected features after the fitting process.

feat_names2 = list(sfs2.k_feature_names_)
feat_names2

In [None]:
# Define the selected feature names for separation
selected_features = feat_names2

# Create a new DataFrame with the selected features
X_train_6_best = X_train[selected_features]
X_test_6_best = X_test[selected_features]

# Display the shapes of the new DataFrames to confirm the selection
display(X_train_6_best.shape)
display(X_test_6_best.shape)

In [None]:
# Setting up forward sequential feature selection with a linear regression estimator (4 features, for model3).

lreg = LinearRegression()
sfs3 = sfs(lreg, k_features=4, forward=True, verbose=2, scoring='r2')

In [None]:
# Fit the Sequential Feature Selector.

sfs3 = sfs3.fit(X_train, y_train)

In [None]:
# Extracting the names of the selected features after the fitting process.

feat_names3 = list(sfs3.k_feature_names_)
feat_names3

In [None]:
# Define the selected feature names for separation
selected_features = feat_names3

# Create a new DataFrame with the selected features
X_train_4_best = X_train[selected_features]
X_test_4_best = X_test[selected_features]

# Display the shapes of the new DataFrames to confirm the selection
display(X_train_4_best.shape)
display(X_test_4_best.shape)

In [None]:
# Checking the shapes of the final resulting datasets before training the models.

print("All best features Training dataset shapes ->  X: {}, y: {}".format(X_train_all_best.shape, y_train.shape))
print("All best features Testing dataset shapes  ->  X: {}, y: {}".format(X_test_all_best.shape, y_test.shape))

print()

print("Top 6 best features Training dataset shapes ->  X: {}, y: {}".format(X_train_6_best.shape, y_train.shape))
print("Top 6 best features Testing dataset shapes  ->  X: {}, y: {}".format(X_test_6_best.shape, y_test.shape))

print()

print("Top 4 best features Training dataset shapes ->  X: {}, y: {}".format(X_train_4_best.shape, y_train.shape))
print("Top 4 best features Testing dataset shapes  ->  X: {}, y: {}".format(X_test_4_best.shape, y_test.shape))
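
The selection above scores candidate feature sets with a linear regression and R², although the response is binary. As a point of comparison only, the sketch below shows forward selection driven by a classifier and ROC-AUC. It deliberately swaps in scikit-learn's `SequentialFeatureSelector` rather than the `sfs` object used in this notebook, and it runs on synthetic data, so the selected columns say nothing about the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data; not the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)

# Forward selection: start from no features and repeatedly add the feature
# that most improves the cross-validated ROC-AUC, until 4 are selected.
forward_selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
forward_selector.fit(X_demo, y_demo)

selected_idx = forward_selector.get_support(indices=True)
print("Selected column indices:", selected_idx)
```

Using the same model family for selection and for the final fit keeps the selection criterion aligned with the metric reported later.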

----

<b> [Back to Content](#Content) </b>

## 7. Binary Logistic Regression Models

In [None]:
# Null error rate

# Number of customers with TARGET == 1.
count = (df['TARGET'] == 1).sum()

# Share of customers whose TARGET is not 1.
null_error_rate = 1 - (count / df.shape[0])

print('Null Error Rate:', null_error_rate)


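- The value above is the share of customers who did not purchase, which is the accuracy obtained by always predicting the majority class. We use it as a baseline: if the accuracy we attain does not exceed it, our model is unlikely to be of any value.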
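
The baseline idea can be spelled out with a few lines of arithmetic. The sketch below uses a hypothetical list of 100 labels with the roughly 72/28 split reported in this notebook; it distinguishes the accuracy of a majority-class guess from its complement, the error rate in the strict sense.

```python
from collections import Counter

# Hypothetical labels with a 72/28 split; not rows from the dataset.
labels = [0] * 72 + [1] * 28

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
# Share of rows that this guess gets wrong.
null_error = 1 - baseline_accuracy

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}, null error: {null_error:.2f}")
```

A classifier is only informative if its accuracy is clearly above the baseline accuracy, or equivalently if its error rate is clearly below the null error.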

In [None]:
# Checking the balance of Actual Data class of y_train.
display(np.unique(y_train, return_counts=True))

# Share of the majority class in y_train.
print(y_train.value_counts(normalize=True).max())

- The actual class distribution of y_train is imbalanced (72-28).

----

<b> [Back to Content](#Content) </b>

### 7.1 1<sup>st</sup> Model

In [None]:
# Loading Logistic Regression from the sklearn library into a variable and train.

model1 = LogisticRegression()

model1.fit(X_train_all_best, y_train)

In [None]:
# Calculating the different metrics of model1 on the training data set using k-fold.

# 10 k-fold splits for training dataset
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# k-fold cross validation.

accuracy = cross_val_score(model1, X_train_all_best, y_train, cv=kf, scoring='accuracy')

print(f"Accuracy of training dataset : {accuracy}")
print()
print(f"Accuracy of training dataset : {np.mean(accuracy)}")
print()
print()


precision = cross_val_score(model1, X_train_all_best, y_train, cv=kf, scoring='precision')

print(f"Precision of training dataset : {precision}")
print()
print(f"Precision of training dataset : {np.mean(precision)}")
print()
print()


recall = cross_val_score(model1, X_train_all_best, y_train, cv=kf, scoring='recall')

print(f"Recall of training dataset : {recall}")
print()
print(f"Recall of training dataset : {np.mean(recall)}")
print()
print()


f1 = cross_val_score(model1, X_train_all_best, y_train, cv=kf, scoring='f1')

print(f"F1 scores of training dataset : {f1}")
print()
print(f"F1 score of training dataset : {np.mean(f1)}")
print()
print()


roc_auc = cross_val_score(model1, X_train_all_best, y_train, cv=kf, scoring='roc_auc')

print(f"Roc-Auc scores of training dataset : {roc_auc}")
print()
print(f"Roc-Auc score of training dataset : {np.mean(roc_auc)}")
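
The five `cross_val_score` calls above refit the model once per metric. A more compact option, sketched here on synthetic data, is `cross_validate` with a list of scorers, which evaluates all metrics on the same fits. `X_demo` and `y_demo` are stand-ins, not the notebook's training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic, imbalanced stand-in data (about 72/28); not the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.72, 0.28], random_state=42
)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# One call: each fold is fitted once and scored with every metric.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kf, scoring=scoring
)

for name in scoring:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name:>9}: mean = {np.mean(fold_scores):.3f}, std = {np.std(fold_scores):.3f}")
```

Reporting the standard deviation next to the mean also shows how much the scores vary between folds.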


In [None]:
# Examining the model1 coefficients.

print(feat_names1)
model1.coef_

- Features such as age, LOR, turnover_A, and turnover_B have positive coefficients: as their values increase, the predicted probability of TARGET = 1 also increases, indicating a higher likelihood of the customer purchasing a new product.

- For example, older customers may be more inclined to buy a new product, such as health or vehicle insurance. The same logic applies to turnover: a customer with a high turnover is more likely to purchase a new product, which may suggest that the customer trusts the insurance company.

- Features with negative coefficients, such as type_A_3 and prod_B_1, are dummy indicators. Customers for whom type_A_3 = 1 are less likely to purchase a new product than customers in the reference category, and the same holds for customers with prod_B_1 = 1.
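
In logistic regression a coefficient is a change in log-odds, so exponentiating it gives the factor by which the odds of TARGET = 1 are multiplied for a one-unit increase in the feature (one standard deviation for the standardised numeric columns, a switch from 0 to 1 for a dummy). The coefficients in the sketch below are hypothetical placeholders, not the values printed by `model1.coef_`.

```python
import math

# Hypothetical coefficients for illustration only; real values come from model1.coef_.
coefficients = {"age": 0.40, "turnover_A": 0.85, "prod_B_1": -1.20}

# exp(beta) is the multiplicative change in the odds of TARGET = 1
# for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:>11}: odds ratio = {ratio:.2f} ({direction} the odds)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, which matches the sign-based reading given in the bullets.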

In [None]:
# Calculate ROC curve for training set of model1.

X_train_prob_all_best = model1.predict_proba(X_train_all_best)[:, 1] 

fpr, tpr, thresholds = roc_curve(y_train, X_train_prob_all_best) 
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()  
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--', label='No Skill')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for MODEL 1 (Training set)')
plt.legend()
plt.show()

----

<b> [Back to Content](#Content) </b>

### 7.2 2<sup>nd</sup> Model

In [None]:
# Loading Logistic Regression from the sklearn library into a variable and train.

model2 = LogisticRegression()

model2.fit(X_train_6_best, y_train)

In [None]:
# Calculating the different metrics of model2 on the training data set using k-fold.

# 10 k-fold splits for training dataset
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# k-fold cross validation.

accuracy = cross_val_score(model2, X_train_6_best, y_train, cv=kf, scoring='accuracy')

print(f"Accuracy of training dataset : {accuracy}")
print()
print(f"Accuracy of training dataset : {np.mean(accuracy)}")
print()
print()


precision = cross_val_score(model2, X_train_6_best, y_train, cv=kf, scoring='precision')

print(f"Precision of training dataset : {precision}")
print()
print(f"Precision of training dataset : {np.mean(precision)}")
print()
print()


recall = cross_val_score(model2, X_train_6_best, y_train, cv=kf, scoring='recall')

print(f"Recall of training dataset : {recall}")
print()
print(f"Recall of training dataset : {np.mean(recall)}")
print()
print()


f1 = cross_val_score(model2, X_train_6_best, y_train, cv=kf, scoring='f1')

print(f"F1 scores of training dataset : {f1}")
print()
print(f"F1 score of training dataset : {np.mean(f1)}")
print()
print()


roc_auc = cross_val_score(model2, X_train_6_best, y_train, cv=kf, scoring='roc_auc')

print(f"Roc-Auc scores of training dataset : {roc_auc}")
print()
print(f"Roc-Auc score of training dataset : {np.mean(roc_auc)}")


In [None]:
# Examining the model2 coefficients.

print(feat_names2)
model2.coef_

- All of the coefficients are positive except for prod_B_1, which means that as the values of those features increase, the predicted probability of TARGET = 1 also increases.

- turnover_A and turnover_B have positive coefficients, which implies that as they increase, the chances of a customer buying a product also increase.

- Older customers may be more inclined to buy a new product, such as health or vehicle insurance.

- prod_B_1 has a negative coefficient, so customers with prod_B_1 = 1 are less likely to purchase a new product.


In [None]:
# Calculate ROC curve for training set of model2.

X_train_prob_6_best = model2.predict_proba(X_train_6_best)[:, 1] 

fpr, tpr, thresholds = roc_curve(y_train, X_train_prob_6_best) 
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()  
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--', label='No Skill')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for MODEL 2 (Training set)')
plt.legend()
plt.show()

----

<b> [Back to Content](#Content) </b>

### 7.3 3<sup>rd</sup> Model

In [None]:
# Loading Logistic Regression from the sklearn library into a variable and train.

model3 = LogisticRegression()

model3.fit(X_train_4_best, y_train)

In [None]:
# Calculating the different metrics of model3 on the training data set using k-fold.

# 10 k-fold splits for training dataset
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# k-fold cross validation.

accuracy = cross_val_score(model3, X_train_4_best, y_train, cv=kf, scoring='accuracy')

print(f"Accuracy of training dataset : {accuracy}")
print()
print(f"Accuracy of training dataset : {np.mean(accuracy)}")
print()
print()


precision = cross_val_score(model3, X_train_4_best, y_train, cv=kf, scoring='precision')

print(f"Precision of training dataset : {precision}")
print()
print(f"Precision of training dataset : {np.mean(precision)}")
print()
print()


recall = cross_val_score(model3, X_train_4_best, y_train, cv=kf, scoring='recall')

print(f"Recall of training dataset : {recall}")
print()
print(f"Recall of training dataset : {np.mean(recall)}")
print()
print()


f1 = cross_val_score(model3, X_train_4_best, y_train, cv=kf, scoring='f1')

print(f"F1 scores of training dataset : {f1}")
print()
print(f"F1 score of training dataset : {np.mean(f1)}")
print()
print()


roc_auc = cross_val_score(model3, X_train_4_best, y_train, cv=kf, scoring='roc_auc')

print(f"Roc-Auc scores of training dataset : {roc_auc}")
print()
print(f"Roc-Auc score of training dataset : {np.mean(roc_auc)}")


In [None]:
# Examining the model3 coefficients.

print(feat_names3)
model3.coef_

- Except for prod_B_1, all of the features have positive coefficients, which means that as the values of these explanatory variables increase, the predicted probability of TARGET = 1 also increases.

- Older customers may be more inclined to buy a new product, such as health or vehicle insurance.

In [None]:
# Calculate ROC curve for training set of model3.

X_train_prob_4_best = model3.predict_proba(X_train_4_best)[:, 1] 

fpr, tpr, thresholds = roc_curve(y_train, X_train_prob_4_best) 
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()  
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--', label='No Skill')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for MODEL 3 (Training set)')
plt.legend()
plt.show()

----

<b> [Back to Content](#Content) </b>

## 8. Model Selection

- Based on the metrics above, we have selected model-2 as our best model.

- Across the training-set metrics (accuracy, precision, recall, F1 score, and ROC-AUC), model1 and model2 are approximately the same, with only minute differences, whereas model3 scores lowest. Thus model2, with fewer attributes, gives almost the same results as model1 (which uses more attributes).

- Now we will use the trained model-2 to predict on the test set.

In [None]:
# Checking the balance of Actual Data class of y_test.
display(np.unique(y_test, return_counts=True))

# Share of the majority class in y_test.
print(y_test.value_counts(normalize=True).max())

- The actual class distribution of y_test is imbalanced (72-28).

In [None]:
# Getting predictions for the test data for 6 best.

y_pred_6_best = model2.predict(X_test_6_best)
y_pred_6_best

In [None]:
# Create Confusion Matrix using crosstab() function and display it.

confusion_matrix = pd.crosstab(y_test, y_pred_6_best, rownames=['actual'], colnames=['predicted'])
print('\033[1m-: Confusion Matrix :-\033[0m')
display(confusion_matrix.head())
print()

In [None]:
# To print the different metrics on the basis of the confusion matrix.

def model_parameters():  
    
    print('\033[1mFinal Model Parameters:- \033[0m')
    print()

    # To print Accuracy.
    print('\033[1mAccuracy : \033[0m', round(metrics.accuracy_score(list(y_test), list(y_pred_6_best)), 3))
    print()

    # To print Precision.
    print('\033[1mPrecision : \033[0m', round(metrics.precision_score(list(y_test), list(y_pred_6_best)), 3))
    print()

    # To print Sensitivity.
    print('\033[1mSensitivity : \033[0m', round(metrics.recall_score(list(y_test), list(y_pred_6_best)), 3))
    print()

    # To print Specificity.
    print('\033[1mSpecificity : \033[0m', round(imblearn.metrics.specificity_score(list(y_test), 
                                                                                   list(y_pred_6_best)), 3))
    print()

    # To print F1 Score.
    print('\033[1mF1 Score : \033[0m', round(metrics.f1_score(list(y_test), list(y_pred_6_best)), 3))
    print()
    
model_parameters()

- `Accuracy(81.2%)` - Our model predicts correctly 81.2% of the time. However, the actual classes are split approximately 72-28 between 0's and 1's, so the dataset is imbalanced and accuracy alone is not a good metric for judging the model.

- `Precision(70.7%)` - When the model predicts the positive class, it is correct about 70.7% of the time, so it is decent at avoiding false positives (wrongly predicting the positive class when the case is actually negative).

- `Sensitivity(52.9%)` - Our model catches only about 52.9% of the actual positive cases, which means it misses many of them.

- `Specificity(91.8%)` - Our model is very good at identifying negatives, correctly classifying about 92% of the actual negative cases.

- `F1 Score(60.5%)` - The F1 score, the harmonic mean of precision and recall, is about 60.5%, suggesting that our model struggles to identify positive cases consistently.
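
All five figures follow from the four cells of the confusion matrix. The sketch below shows the arithmetic with a hypothetical helper, `classification_metrics`, applied to made-up counts (not the notebook's confusion matrix), and then checks that the reported F1 score is consistent with the reported precision and sensitivity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the reported metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)        # correct among predicted positives
    sensitivity = tp / (tp + fn)      # recall: found among actual positives
    specificity = tn / (tn + fp)      # found among actual negatives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for illustration; not the notebook's confusion matrix.
demo = classification_metrics(tp=30, fp=10, tn=90, fn=20)
print({name: round(value, 3) for name, value in demo.items()})

# Consistency check using the precision and sensitivity reported above.
reported_precision, reported_recall = 0.707, 0.529
f1_from_reported = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(round(f1_from_reported, 3))
```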

In [None]:
# Calculate ROC curve for the test set of the selected model (model2).

y_pred_prob = model2.predict_proba(X_test_6_best)[:, 1] 
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob) 
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()  
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--', label='No Skill')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for selected MODEL')
plt.legend()
plt.show()

- AUC of 0.84 indicates that our model is performing well in distinguishing between classes.

__RESULT__

- Our model is good at avoiding false positives, but it misses many actual positives, so it needs to capture more positive cases without compromising its strong negative predictions.
- In other words, the model has high specificity and reasonable precision, so it performs well at minimizing false positives and classifying negative cases.
- We can still improve the model's ability to detect positive cases by re-evaluating it.
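
One possible way to catch more positive cases, offered here as a suggestion rather than as something done in this notebook, is to lower the probability threshold used to turn `predict_proba` output into a class label (the default in `predict` is 0.5). The sketch below uses synthetic data and illustrative names; it shows that recall can only stay the same or rise as the threshold drops, usually at the cost of precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 72/28); not the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.72, 0.28], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

recalls = {}
for threshold in (0.5, 0.4, 0.3):
    # Label a case positive when its probability reaches the threshold.
    y_hat = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recalls[threshold]:.3f}")
```

The threshold would normally be chosen on validation data, weighing the cost of contacting an uninterested customer against the cost of missing an interested one.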

----

<b> [Back to Content](#Content) </b>

## Conclusion

1. The dataset has 3008 duplicate values.

2. These columns were non-predictive for model training: ID, age_p, loyalty, city, lor_m, contract.

3. Loyalty column has 45% of unclassified values('99') in it.

4. 'Length of relationship' has a strong positive correlation with the 'Age' column.

5. We have selected the model on the basis of the following metrics:
   a. Accuracy - 81.2%
   b. Precision - 70.7%
   c. Sensitivity - 52.9%
   d. AUC - 0.84
   e. Specificity - 91.8%
   f. F1 Score - 60.5%
   
6. To increase customer retention and improve insurance sales, focusing on improving recall would help capture more potential buyers.

7. Improving recall will help the company reach more customers who might be interested in buying extra insurance plans, even if it means also reaching some customers who might not buy anything. Finding the right balance between precision and recall, shown by an F1 score of 60.5%, is important for increasing insurance sales.

<b> [TOP⬆️](#Classification-via-KNN-&-SVM) </b>

---
<h3><center>THE END</center></h3>

===================================================================================================================