## <span style="color:#90BE6D"><b>Dataset Description:</b></span>

This dataset provides comprehensive information about customers of a bank (Universal Bank), designed to analyze their financial behavior and response to a personal loan offer. The data includes not only demographic characteristics of the customers but also their banking behavior and level of engagement with the bank's financial products.

ID: Customer ID

Age: Customer's age in completed years

Experience: Years of work experience

Income: Annual income (in thousands of dollars)

ZIP Code: ZIP code of the customer’s residence

Family: Number of family members

CCAvg: Average monthly spending using the credit card (in thousands of dollars)

Education: Education level (1: Bachelor's, 2: Master's, 3: Advanced/Professional degree)

Mortgage: Value of home mortgage (if any), in thousands of dollars

Securities Account: Does the customer have a securities account with the bank?

CD Account: Does the customer have a certificate of deposit (CD) account with the bank?

Online: Does the customer use internet banking services?

CreditCard: Does the customer use a credit card issued by the bank?

Personal Loan: Did the customer accept the personal loan offered in the previous campaign? (Target variable)

## <span style="color:#90BE6D"><b>Analysis Objective:</b></span>

The objective of this project is not merely to predict personal loan acceptance. Rather, the aim is to uncover hidden patterns in customer financial behavior and use them to build a model that can estimate the likelihood of loan acceptance. Beyond prediction, the goal is to assist the marketing team in crafting more accurate, targeted, and personalized financial offers. This analysis can contribute to improving the conversion rates of future campaigns and identifying high-value customers.

In this notebook, we use logistic regression, KNN, and Naive Bayes classification models.

## <b> Importing Libraries and Packages

In [None]:
# --- Data manipulation ---
import numpy as np
import pandas as pd

# --- Visualization libraries ---
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px
import seaborn as sns

# --- Model selection & evaluation tools ---
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, RandomizedSearchCV

# --- Models ---
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import VotingClassifier

# --- Preprocessing & transformation ---
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, OneHotEncoder, FunctionTransformer

# --- Evaluation metrics ---
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score,
    precision_recall_curve, average_precision_score
)  

# --- Model interpretation ---
from sklearn.inspection import permutation_importance

# --- Geospatial visualization ---
import folium
from branca.element import Element

# --- IPython display tools ---
from IPython.display import display, HTML

# --- Misc ---
from scipy.stats import randint

# --- Warning configuration ---
import warnings
warnings.filterwarnings('ignore')

## <b> Load Dataset

In [None]:
# Load the dataset from a CSV file
data = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
data

## <b> Preprocessing Data

In [None]:
# pd.read_csv already returns a DataFrame; keep a working reference to it under the name df
df = pd.DataFrame(data)

In [None]:
# Display information about the DataFrame
df.info()

In [None]:
# Generate a statistical summary of numerical columns in the DataFrame
df.describe(include='number')

In [None]:
# Generate a statistical summary of categorical (non-numeric) columns in the DataFrame
df.describe(include='object')

### <b> Data Cleaning

In [None]:
#Count NaN values
missing_values = df.isnull().sum()
missing_values

<b> Conclusion

🔶 The dataset contains 5,000 rows and 14 columns, with "Personal Loan" as the target variable.

🔶 Out of the 14 columns, 13 are numerical and 1 column ("CCAvg") is of object type.

🔶 The columns <b>"ID", "Age", "Experience", "Income", "CCAvg", and "Mortgage"</b> are numerical, while the columns <b>"Family", "Education", "ZIP Code", "Personal Loan", "Securities Account", "CD Account", "Online", and "CreditCard"</b> are categorical columns stored as numbers.

🔶 The column <b>"CCAvg"</b> holds decimal values written with "/" as the decimal separator, so we replace "/" with "." and convert the column to float.

🔶 The "Experience" column contains negative values, which are incorrect and need to be corrected or removed.

🔶 There are no missing values in the data.

🔶 The "ID" column is only a row identifier and carries no information about the target variable "Personal Loan", so we remove it. The "ZIP Code" column is kept for the geographic visualization and is not used directly as a model feature.

🔶 "CCAvg" is the average monthly credit card spending, whereas "Income" is annual. To put both attributes on the same time unit, we convert the monthly spending to an annual amount.
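The cleaning steps listed above can be tried on a few toy rows before touching the real data. This is a minimal sketch: the values in `toy` are invented for illustration, and taking the absolute value of negative experience is only one possible treatment.

```python
import pandas as pd

# Toy rows mimicking the raw file; the values are invented for illustration
toy = pd.DataFrame({
    "Experience": [1, -2, 19],
    "CCAvg": ["1/60", "0/40", "2/70"],
})

# Replace the "/" separator with "." and cast the column to float
toy["CCAvg"] = toy["CCAvg"].str.replace("/", ".", regex=False).astype(float)

# One possible treatment of negative experience: take the absolute value
toy["Experience"] = toy["Experience"].abs()

# Annualize the monthly card spending so it uses the same time unit as Income
toy["CCAvg_Annual"] = toy["CCAvg"] * 12

print(toy)
```

Passing `regex=False` keeps the replacement literal, so the behavior does not depend on the pandas default for regular expressions.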

### <b> Change in columns

In [None]:
# A deep copy of the DataFrame is created to apply modifications to it.
df1 = df.copy(deep = True)

In [None]:
df1['CCAvg'] = df1['CCAvg'].str.replace('/', '.').astype(float)
df1

In [None]:
# Generate a statistical summary of all columns, including both numeric and categorical
df1.describe(include='all')

In [None]:
#Count of negative values in the 'Experience' column by each distinct value
negative_counts = df1[df1['Experience'] < 0]['Experience'].value_counts()

print(negative_counts)

In [None]:
# Convert negative values in the 'Experience' column to positive (absolute value)
df1['Experience'] = df1['Experience'].abs()
df1

In [None]:
# Counting the number of duplicate rows
num_duplicates = df1.duplicated().sum()

print(f"Number of duplicate rows in the DataFrame: {num_duplicates}")

In [None]:
# Remove the 'ID' column from the DataFrame
df1 = df1.drop(['ID'], axis=1)

In [None]:
# Converting the "CCAvg" unit from monthly to yearly.
df1['CCAvg'] *= 12
# Rename the 'CCAvg' column to 'CCAvg_Annual'
df1.rename(columns={'CCAvg': 'CCAvg_Annual'}, inplace=True)
df1

### <b> Noise Data

In [None]:
# Set the visual style for the plots using seaborn
sns.set(style="whitegrid", palette="pastel")

plt.figure(figsize=(14, 6))

# Plot a histogram for ZIP Code distribution
hist_data = sns.histplot(data=df1, x='ZIP Code', bins=50, kde=False, color='#90BE6D', edgecolor='black')

# Add value labels on top of each bar
for patch in hist_data.patches:
    height = patch.get_height()
    if height > 0:
        plt.text(patch.get_x() + patch.get_width() / 2, height + 1,
                 int(height), ha='center', va='bottom', fontsize=9, fontweight='bold')

# Identify ZIP Codes that are potentially invalid
highlighted = df1[df1['ZIP Code'] < 20000]
if not highlighted.empty:
    zip_val = highlighted['ZIP Code'].values[0]

    # Highlight the unusual ZIP Code with a circle
    plt.scatter([zip_val], [5], s=600, facecolors='none', edgecolors='#EA9010', linewidths=2, label=f'Noisy data: {zip_val}')

    # Add the legend to explain the annotation
    plt.legend(loc='upper left', fontsize=11)

# Add title and labels to the plot
plt.title('Distribution of ZIP Codes with Annotation', fontsize=12, fontweight='bold')
plt.xlabel('ZIP Code', fontsize=10)
plt.ylabel('Number of Customers', fontsize=10)

plt.tight_layout()
plt.show()

🔶 The ZIP Code column has a minimum value that is a significant outlier compared to the mean. 

🔶 After examining the distribution of this column, we confirmed that it is indeed an outlier, so we will remove it.
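As a rough cross-check of this outlier, a suspicious ZIP code can also be flagged by its digit count. The sketch below uses an invented sample and reuses the threshold of 20000 from this notebook. The five-digit rule is an assumption that holds only when no valid code in the data starts with zero.

```python
import pandas as pd

# Invented sample of ZIP codes stored as integers
zips = pd.Series([91107, 90089, 9307, 94720], name="ZIP Code")

# Number of digits of each code when written out as text
digit_count = zips.astype(str).str.len()

# Flag codes below the notebook's threshold or without exactly five digits
suspicious = zips[(zips < 20000) | (digit_count != 5)]

print(suspicious.tolist())
```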

In [None]:
# Remove rows where the 'ZIP Code' is less than 20000
df1.drop(df1[df1['ZIP Code']<20000].index, inplace=True)
# Reset the index of the DataFrame after removing the rows, and drop the old index column
df1.reset_index(drop=True, inplace =True)

In [None]:
# Set higher resolution for the plots
mpl.rcParams['figure.dpi'] = 200  # Higher resolution

# List of selected features to plot
selected_features = ["Age", "Experience", "Income", "CCAvg_Annual", "Mortgage"]
n = len(selected_features)

# Create subplots with one column for boxplots and one column for histograms + KDE
fig, axes = plt.subplots(n, 2, figsize=(12, n * 3))
sns.set_style("whitegrid")

for i, col in enumerate(selected_features):
    # Box plot
    sns.boxplot(
        data=df1,
        x=col,
        ax=axes[i, 0],
        color='#90BE6D',
        flierprops=dict(marker='o', markerfacecolor='#EA9010', markeredgecolor='#EA9010', markersize=5)
    )
    axes[i, 0].tick_params(axis='x', labelrotation=90)
    axes[i, 0].set_xlabel(col, fontweight='bold')
    axes[i, 0].grid(True)  # Enable grid

    # Histogram + KDE
    sns.histplot(
        df1[col],
        ax=axes[i, 1],
        color='#90BE6D',
        stat='density', # Normalize the histogram
        bins=30,
        edgecolor='black'
    )
    sns.kdeplot(
        df1[col],
        ax=axes[i, 1],
        color='#EA9010',
        linewidth=2 # Set line width for KDE
    )
    axes[i, 1].tick_params(axis='x', labelrotation=90)
    axes[i, 1].set_xlabel(col, fontweight='bold')
    axes[i, 1].grid(True)

# Set a title for the entire figure
fig.suptitle('Distribution Plots for Numerical Features', fontsize=18, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.98])
plt.show()

🔶 Age

The average age of customers is around 45 years.

The age distribution spans between approximately 25 and 70 years and appears to follow a uniform distribution. The boxplot shows that the data is nearly symmetrical, with no significant outliers.


🔶 Experience

The average work experience of customers is about 20 years.

Customer work experience data is symmetrically distributed with a slight right skew.

The distribution is similar to that of age, but its range is limited between 0 and about 45 years. The absence of negative or outlier values indicates proper data cleaning.


🔶 Income

The average income of customers is approximately $60,000 per year.

Customer income ranges from $10,000 to $180,000 per year.

Most of the data is concentrated between 0 and 100k.

The income distribution is highly right-skewed, meaning that most people earn below $100K, and a few individuals with very high incomes create a long tail (indicating many outliers at the upper end).

This distribution is clearly not normal: its long right tail resembles the heavy-tailed shape often associated with the Pareto principle (80/20). Outliers of this kind can distort models that are sensitive to distance or scale, such as KNN or regression.


🔶 CCAvg

Annual credit card spending (CCAvg_Annual) is about $18,000 on average.

CCAvg_Annual has a strongly right-skewed distribution with many outliers.

As with income, most customers have low values, while many outliers above 30 are observed.


🔶 Mortgage

Mortgage data has a median of around 0.

It has a very strongly right-skewed distribution with many outliers.

The distribution is highly imbalanced; most values are close to zero, while a few values are very high (up to more than 600). The boxplot indicates a significant number of outliers.

🔶 Therefore, we use the IQR (Interquartile Range) method to identify and remove outliers in the "Mortgage" column.

In [None]:
# Step 1 & 2: Calculate Q1, Q3, and IQR
Q1 = df1['Mortgage'].quantile(0.25)
Q3 = df1['Mortgage'].quantile(0.75)
IQR = Q3 - Q1

# Step 3: Calculate lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Step 4: Filter out outliers to keep only non-outlier values
df1_cleaned = df1[(df1['Mortgage'] >= lower_bound) & (df1['Mortgage'] <= upper_bound)]

# Optional: Print the number of removed outliers
print(f"Number of removed outliers: {len(df1) - len(df1_cleaned)}")

🔶 Data outside the lower_bound and upper_bound are considered outliers.

🔶 The IQR (Interquartile Range) method is one of the most common techniques for detecting outliers and has several important advantages over many other methods:

🔶 1. Robustness to outliers

Unlike the mean and standard deviation, which are themselves influenced by outliers, quartiles (Q1 and Q3) are not sensitive to them. This means:

IQR maintains stable performance even in the presence of outliers.

🔶 2. No distributional assumptions

The IQR method does not assume that the data follows any specific distribution (e.g., normal distribution). Therefore:

It works well for data that is skewed, imbalanced, or multimodal.

🔶 3. Simple and interpretable

IQR is calculated using only the quartiles, and its logic is easy to understand even for non-experts. Simply put:

Anything that deviates too far from the “typical” range of the data is considered an outlier.

🔶 4. Suitable for non-negative data that contains zeros (e.g., income, prices, loans)

Some approaches need extra care with such data; a plain log transformation, for example, is undefined for zero or negative values. However:

IQR can be applied directly to such data without requiring prior transformations.
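The robustness argument above can be illustrated with a small worked example. This is a minimal sketch on invented numbers, comparing the IQR rule with a z-score rule that uses the common threshold of 3; the threshold is an assumption for the comparison and is not taken from this notebook.

```python
import numpy as np

# Invented sample with one extreme value
values = np.array([10, 12, 13, 15, 16, 18, 20, 22, 24, 200], dtype=float)

# IQR rule: bounds are built from the quartiles only
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: mean and standard deviation are both pulled by the extreme value
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

print("IQR bounds:", lower, upper)
print("Flagged by IQR:", iqr_outliers)
print("Flagged by z-score:", z_outliers)
```

In this sample the IQR bounds are 1.5 and 33.5, so the value 200 is flagged. The same value inflates the standard deviation so much that its z-score stays just below 3, and the z-score rule flags nothing.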

In [None]:
# Set higher resolution for the plots
mpl.rcParams['figure.dpi'] = 200  # Higher clarity.

# List of selected categorical features to plot
selected_features = ["Family", "Education", "Personal Loan", "Securities Account", "CD Account", "Online", "CreditCard"]

n = len(selected_features)

# create figure
fig, axes = plt.subplots(n, 2, figsize=(12, n * 3))
sns.set_style("whitegrid")

for i, col in enumerate(selected_features):
    # Count plot
    sns.countplot(
        data=df1_cleaned,
        x=col,
        ax=axes[i, 0],
        color='#90BE6D',  # single bar color; palette without hue is deprecated in recent seaborn
        edgecolor='black'
    )
    axes[i, 0].tick_params(axis='x', labelrotation=0)
    axes[i, 0].set_xlabel(col, fontweight='bold')
    axes[i, 0].set_ylabel("Count", fontweight='bold')

    # Histogram + KDE
    sns.histplot(
        df1_cleaned[col],
        ax=axes[i, 1],
        color='#90BE6D',
        stat='density', # Normalize the histogram to show the density
        bins=30,
        edgecolor='black'
    )
    sns.kdeplot(
        df1_cleaned[col],
        ax=axes[i, 1],
        color='#EA9010',
        linewidth=2 # Set line width for the KDE plot
    )
    axes[i, 1].tick_params(axis='x', labelrotation=0)
    axes[i, 1].set_xlabel(col, fontweight='bold')

# Set a main title for the entire figure
fig.suptitle('Distribution Plots for Categorical Features', fontsize=18, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.98])
plt.show()

🔶 Family

The distribution of "Family" is relatively uniform, meaning that customers are almost equally distributed in the 1 to 4 member groups. This indicates that this feature lacks severe bias towards a particular value and has a reasonable spread. It is likely informative on its own and does not require special processing.

🔶 Education

The values of "Education" are 1, 2, and 3, an ordinal encoding of "Bachelor's", "Master's", and "Advanced/Professional degree". Group 1 has the highest frequency, so the largest group of customers holds a bachelor's-level education.

🔶 Personal Loan

This is a binary variable (0 or 1), and its distribution is highly imbalanced. A very large number of people do not have a personal loan (majority class = 0), while only a small percentage (minority class) have a loan. This is an indication of class imbalance that should be addressed using techniques like oversampling or weight adjustment.

🔶 Securities Account

In this variable, the vast majority of users do not have a securities account. Similar to "Personal Loan", this represents a rare binary feature.

🔶 CD Account

This variable also has a highly imbalanced distribution, meaning only a small number of customers have a CD account.

🔶 Online

The distribution is relatively balanced, with almost half of the customers using online services.

🔶 CreditCard

In this variable, a large percentage of customers do not have a credit card, while a smaller portion have one.

No noisy values are observed in these categorical features.

## <b> Exploratory Data Analysis (EDA)

In [None]:
# create a new dataframe of selected columns
selected_columns = df1_cleaned[["Age", "Experience", "Income", "Family", "CCAvg_Annual", "Education", "Mortgage", "Securities Account", "CD Account", "Online", "CreditCard", "Personal Loan"]]

In [None]:
# Calculate correlation
Corrmat = selected_columns.corr()

In [None]:
# Plotting correlation with heatmap
plt.figure(figsize = (10, 5), dpi = 200)
plt.title('Heatmap of Correlation', size=18, weight='bold')
# Define a custom color palette for the heatmap
custom_cmap = sns.color_palette(["#90BE6D", "#C9E3AC", "#EA9010"], as_cmap=True)
sns.heatmap(Corrmat, annot=True, fmt=".2f", linewidth=0.5, cmap=custom_cmap, mask = np.triu(np.ones_like(Corrmat)))

🔶 The target variable shows a significant correlation with the variables "Income", "CCAvg_Annual", and "CD Account".

🔶 There is a very strong correlation between "Age" and "Experience" (0.99) and a moderately strong one between "Income" and "CCAvg_Annual" (0.65).
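Pairs such as "Age" and "Experience" can also be listed programmatically instead of being read off the heatmap. The sketch below runs on an invented toy frame, and the cutoff of 0.9 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Invented toy data: Experience tracks Age closely, Income does not
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Experience": [1, 5, 10, 16, 20, 25],
    "Income": [40, 90, 30, 120, 60, 20],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs whose absolute correlation exceeds the chosen cutoff
strong_pairs = pairs[pairs.abs() > 0.9]

print(strong_pairs)
```

Features in such a pair carry nearly the same information, which is worth keeping in mind when interpreting the coefficients of a linear model.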

In [None]:
# Set the style for seaborn plots
sns.set_style("whitegrid")   

# Set figure size and resolution
plt.figure(figsize=(7, 3), dpi=150)

# Create the count plot for 'Personal Loan' with hue
ax = sns.countplot(
    data=df1_cleaned,
    x='Personal Loan',
    hue='Personal Loan',
    palette={0: '#90BE6D', 1: '#EA9010'} # Custom colors for 'No' and 'Yes'
)

# Title and labels
plt.title('Count of Personal Loan Applicants', fontsize=12, fontweight='bold')
plt.xlabel('Personal Loan (0 = No, 1 = Yes)', fontsize=10)
plt.ylabel('Count', fontsize=9)

# Remove the legend if seaborn created one (hue duplicates the x axis)
if ax.get_legend() is not None:
    ax.get_legend().remove()

# Customize grid
ax.grid(True, axis='y', linewidth=1.2, color='lightgray')

# Set color and thickness of spines
for spine in ['top', 'right', 'left', 'bottom']:
    ax.spines[spine].set_color('lightgray')
    ax.spines[spine].set_linewidth(1.2)

# Add percentage annotations
total = df1_cleaned['Personal Loan'].value_counts().sum()
for p in ax.patches:
    count = p.get_height()
    percent = (count / total) * 100
    x = p.get_x() + p.get_width() / 2
    y = count
    if count > 0:
        ax.text(x, y + 5, f'{percent:.1f}%', ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

🔶 Only about 500 customers (9.6%) in this dataset accepted the loan, while around 4,500 (90.4%) did not.

🔶 Therefore, it can be said that the data is <b>imbalanced</b>, though only moderately so.
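One common way to compensate for this imbalance is to weight each class inversely to its frequency, which is what the `class_weight='balanced'` option in scikit-learn does with the formula `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that formula with NumPy. The label counts are invented to mirror the 90.4% / 9.6% split.

```python
import numpy as np

# Invented labels mirroring the observed split: 904 negatives, 96 positives
y = np.array([0] * 904 + [1] * 96)

# Number of samples per class
counts = np.bincount(y)

# Balanced weights: n_samples / (n_classes * count_per_class)
weights = len(y) / (len(counts) * counts)

print(dict(zip([0, 1], weights.round(4))))
```

With these counts, the minority class receives a weight roughly 9.4 times larger than the majority class.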

In [None]:
# Set the appearance for the plots
sns.set_theme(style="ticks")

# Create a pairplot to show relationships between selected numerical features and 'Personal Loan' status
pair_plot = sns.pairplot(
    df1_cleaned,  
    hue="Personal Loan",
    vars=["Age", "Experience", "Income", "CCAvg_Annual", "Mortgage"],  
    corner=True, 
    palette={0: "#90BE6D", 1: "#EA9010"} # Custom color palette for 'No' and 'Yes' categories of 'Personal Loan'
)

# Set the main title for the plot
plt.suptitle("Relationship between Numerical Features and Personal Loan Status", fontsize=18, fontweight='bold')

# Customize grid
for ax in plt.gcf().axes:
    ax.grid(True, which='both', axis='both', linestyle='--', linewidth=0.7, color='lightgray')

plt.tight_layout()
plt.show()

In [None]:
# Set the appearance for the plots
sns.set_theme(style="ticks")

# Create a pairplot to show relationships between selected categorical features and 'Personal Loan' status
pair_plot = sns.pairplot(
    df1_cleaned,  
    hue="Personal Loan",  
    vars=["Family", "Education", "Personal Loan", "Securities Account", "CD Account", "Online", "CreditCard"],  
    corner=True, 
    palette={0: "#90BE6D", 1: "#EA9010"}
)

# Set the main title for the plot
plt.suptitle("Relationship between Categorical Features and Personal Loan Status", fontsize=18, fontweight='bold')

# Customize grid
for ax in plt.gcf().axes:
    ax.grid(True, which='both', axis='both', linestyle='--', linewidth=0.7, color='lightgray')

plt.tight_layout()
plt.show()

🔶 In this project, we use the ZIP Code data for visualizing geographic locations and analyzing data points. After extracting the necessary information, we remove the ZIP Code column from the DataFrame.

In [None]:
# List of unique ZIP Codes present in the dataset
unique_zips = df1_cleaned['ZIP Code'].unique()

# Reference dataset
# can download this file from public sources like: https://simplemaps.com/data/us-zips
zip_df = pd.read_csv('uszips.csv')

# Ensuring ZIP Code is numeric
zip_df['zip'] = zip_df['zip'].astype(int)

# Filter only the ZIP Codes that are present in the main dataset
zip_map_df = zip_df[zip_df['zip'].isin(unique_zips)]

# Merge the latitude and longitude information into the original DataFrame
merged_df = df1_cleaned.merge(zip_map_df[['zip', 'lat', 'lng']], 
                              left_on='ZIP Code', right_on='zip', how='left')

# Drop the redundant 'zip' column after the merge
merged_df.drop(columns='zip', inplace=True)

merged_df

In [None]:
#Count NaN values
missing_values2 = merged_df.isnull().sum()
missing_values2

In [None]:
# Identify missing ZIP Codes in df1_cleaned that are not in zip_df
missing_zips = df1_cleaned[~df1_cleaned['ZIP Code'].isin(zip_df['zip'])]
print(missing_zips['ZIP Code'].value_counts())

In [None]:
# Creating a map from ZIP Code to complete geographic information
zip_info_map = merged_df.dropna(subset=['lat', 'lng']).drop_duplicates(subset=['ZIP Code'])[
    ['ZIP Code', 'lat', 'lng']
].set_index('ZIP Code')

# Function to fill missing geographic info row by row
def fill_geo_info(row):
    zip_code = row['ZIP Code']
    if pd.isna(row['lat']) or pd.isna(row['lng']):    # If lat or lng are missing, try to fill them based on the ZIP Code
        if zip_code in zip_info_map.index: 
            row['lat'] = zip_info_map.loc[zip_code, 'lat']
            row['lng'] = zip_info_map.loc[zip_code, 'lng']
    return row

# Step 1: Fill missing geographic info based on available data
merged_df = merged_df.apply(fill_geo_info, axis=1)

# Step 2: Fill remaining missing values with default values
mean_lat = merged_df['lat'].mean()
mean_lng = merged_df['lng'].mean()

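# Fill remaining missing values with the column means
# (assign back instead of using inplace=True on a column, which is unreliable in recent pandas)
merged_df['lat'] = merged_df['lat'].fillna(mean_lat)
merged_df['lng'] = merged_df['lng'].fillna(mean_lng)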

merged_df

🔶 We filled the missing values in the 'lat' and 'lng' columns with the mean.
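The lookup and the mean fill can also be expressed without a row-by-row function. This is a minimal sketch on invented ZIP codes and coordinates: `indicator=True` marks rows that found no match in the reference table, and the remaining gaps are filled with the column mean.

```python
import pandas as pd

# Invented customer ZIP codes and a tiny reference table with made-up coordinates
customers = pd.DataFrame({"ZIP Code": [91107, 94720, 96651]})
zip_ref = pd.DataFrame({
    "zip": [91107, 94720],
    "lat": [34.0, 38.0],
    "lng": [-118.0, -122.0],
})

# Left merge keeps every customer; _merge records whether a match was found
merged = customers.merge(
    zip_ref, left_on="ZIP Code", right_on="zip", how="left", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "ZIP Code"].tolist()

# Fill coordinates of unmatched ZIP codes with the mean of the known ones
merged["lat"] = merged["lat"].fillna(merged["lat"].mean())
merged["lng"] = merged["lng"].fillna(merged["lng"].mean())

print(unmatched)
print(merged[["ZIP Code", "lat", "lng"]])
```

A mean-imputed coordinate places the customer at the center of the known locations, which may be far from the real ZIP code, so these points should be read with caution on a map.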

In [None]:
#Count NaN values
missing_values3 = merged_df.isnull().sum()
missing_values3

In [None]:
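# Preview the DataFrame without 'ZIP Code'. The result is not assigned, so merged_df
# keeps the column, which the map popups and the city merge below still need.
merged_df.drop(columns='ZIP Code')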

In [None]:
# Determine the bounds of the map based on the latitude and longitude
bounds = [[merged_df['lat'].min(), merged_df['lng'].min()],
          [merged_df['lat'].max(), merged_df['lng'].max()]]

# Create the base map
m = folium.Map(zoom_start=6)
m.fit_bounds(bounds)

# Function to generate an icon based on the Personal Loan status
def get_icon(personal_loan):
    if personal_loan == 1:
        return folium.Icon(color='#EA9010', icon='star', prefix='fa') # Icon for customers with a loan (orange color)
    else:
        return folium.Icon(color='#90BE6D', icon='star', prefix='fa') # Icon for customers without a loan (green color)

# Add markers to the map
for _, row in merged_df.iterrows():
    popup_text = (
        f"<b>ZIP:</b> {row['ZIP Code']}<br>"
        f"<b>Income:</b> ${row['Income']}<br>"
        f"<b>Loan Status:</b> {'Yes' if row['Personal Loan'] == 1 else 'No'}"
    )
    
    folium.CircleMarker(
    location=[row['lat'], row['lng']],
    radius=10, 
    color='#EA9010' if row['Personal Loan'] == 1 else '#90BE6D',
    opacity=0.7 if row['Personal Loan'] == 0 else 1.0,          # stroke opacity
    fill=True,
    fill_color='#EA9010' if row['Personal Loan'] == 1 else '#90BE6D',
    fill_opacity=0.7 if row['Personal Loan'] == 0 else 0.8,
    popup=popup_text
    ).add_to(m)

# Legend for Personal Loan status with clear positioning and size adjustments
legend_html = """
<div style="
    position: fixed;
    bottom: 40px;
    left: 40px;
    width: 200px;
    height: 100px;
    background-color: white;
    border: 2px solid #ccc;
    border-radius: 8px;
    z-index: 9999;
    font-size: 14px;
    font-family: Arial, sans-serif;
    box-shadow: 2px 2px 6px rgba(0,0,0,0.3);
    padding: 10px;
">
<b style="font-size:16px;">Personal Loan</b><br>
<i class="fa fa-circle fa-lg" style="color:#EA9010"></i> Yes<br>
<i class="fa fa-circle fa-lg" style="color:#90BE6D"></i> No
</div>
"""

# Add the legend to the map
m.get_root().html.add_child(Element(legend_html))

display(m)

In [None]:
# List of unique ZIP Codes present in the dataset
unique_zips = df1_cleaned['ZIP Code'].unique()

# Reference dataset
# can download this file from public sources like: https://simplemaps.com/data/us-zips
zip_df = pd.read_csv('uszips.csv')

# Ensuring ZIP Code is numeric
zip_df['zip'] = zip_df['zip'].astype(int)

# Filter only the ZIP Codes that are present in the main dataset
zip_map_df = zip_df[zip_df['zip'].isin(unique_zips)]

merged_df2 = merged_df.merge(zip_map_df[['zip', 'city']], 
                              left_on='ZIP Code', right_on='zip', how='left')
merged_df2

In [None]:
# Define custom colors for the 'Personal Loan' values
custom_palette = {
    1: '#EA9010',      # Customers with a loan
    0: '#90BE6D'       # Customers without a loan
}

# Get the 20 most frequent cities
top_cities = merged_df2['city'].value_counts().nlargest(20).index

# Filter data to include only those cities
filtered_df = merged_df2[merged_df2['city'].isin(top_cities)]

# Create the bar chart
plt.figure(figsize=(10, 6), dpi=200)
ax = sns.countplot(data=filtered_df, x='city', hue='Personal Loan', palette=custom_palette)
plt.title('Top 20 Cities by Personal Loan Distribution', fontsize=12, fontweight='bold')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=90, ha='right')
plt.grid(True)

# Set color and thickness of spines
for spine in ['top', 'right', 'left', 'bottom']:
    ax.spines[spine].set_color('lightgray')
    ax.spines[spine].set_linewidth(1.2)
    
plt.tight_layout()
plt.show()

## <b> Normalize

In [None]:
# Split the dataset into features (X) and target (y)
x = merged_df.drop('Personal Loan', axis=1)
y = merged_df['Personal Loan']

In [None]:
# ---------------------------
# 1. Defining columns based on distribution type
# ---------------------------

# Continuous features with a normal distribution
normal_features = ['Age', 'Experience']

# Continuous features with outliers and skewed data
skewed_features = ['Income', 'CCAvg_Annual', 'Mortgage']

# Discrete or categorical features
categorical_features = ['Family', 'Education']

# Binary features that do not require normalization
binary_features = ['Securities Account', 'CD Account', 'Online', 'CreditCard']

# Geographic features to be normalized
geo_features = ['lat', 'lng']

# ---------------------------
# 2. Defining transformers 
# ---------------------------

# Log transform + robust scaler for skewed numerical data
log_and_robust = Pipeline(steps=[
    ('log', FunctionTransformer(func=np.log1p)),  # log(1 + x), defined for zero values
    ('robust', RobustScaler())
])

# Standard scaler for normally distributed and geo data
standard = StandardScaler()
# One-hot encoder for categorical features
categorical = OneHotEncoder(drop=None, sparse_output=False)

# ---------------------------
# 3. Combine all in ColumnTransformer
# ---------------------------

preprocessor = ColumnTransformer(transformers=[
    ('normals', standard, normal_features),
    ('skewed', log_and_robust, skewed_features),
    ('categorical', categorical, categorical_features),
    ('geo', standard, geo_features),
    ('passthrough', 'passthrough', binary_features)
])

# Fit and transform
X_preprocessed = preprocessor.fit_transform(x)

# Save the preprocessor as the scaler for Logistic Regression and knn
scaler_log_reg = preprocessor
scaler_knn = preprocessor

# ---------------------------
# 4. Create final DataFrame with column names
# ---------------------------

output_columns = (
    normal_features +
    skewed_features +
    list(preprocessor.named_transformers_['categorical'].get_feature_names_out(categorical_features)) +
    geo_features +
    binary_features
)

final_df = pd.DataFrame(X_preprocessed, columns=output_columns)
final_df['Personal Loan'] = y.values

print(final_df.head())

## <b> Logistic Regression Model

In [None]:
# Split the dataset into features (X) and target (y)
x = final_df.drop('Personal Loan', axis=1)
y = final_df['Personal Loan']

In [None]:
# Testing multiple test sizes to find the best value
test_size_list = [0.1, 0.2, 0.3, 0.4, 0.5]

test_sizes = []
test_aucs = []
train_aucs = []

for i in test_size_list:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=i, random_state=0, stratify=y)
    # Initialize and train logistic regression model
    log_reg_model = LogisticRegression(random_state=0, max_iter=1000, class_weight='balanced') # Handles class imbalance
    log_reg_model.fit(x_train, y_train)
    
    # Predicting probabilities for the training and test sets
    train_proba = log_reg_model.predict_proba(x_train)[:, 1]
    test_proba = log_reg_model.predict_proba(x_test)[:, 1]

    # Calculating the AUC score for the training and test sets
    train_auc = roc_auc_score(y_train, train_proba)
    test_auc = roc_auc_score(y_test, test_proba)

    # Store results
    test_sizes.append(i)
    train_aucs.append(train_auc)
    test_aucs.append(test_auc)
    # Predict binary labels for classification report
    y_pred = log_reg_model.predict(x_test)

    print(f"\n🔶 Test size: {i}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print("ROC-AUC Score:", roc_auc_score(y_test, log_reg_model.predict_proba(x_test)[:, 1]))

🔶 Best test size is 0.4.

In [None]:
# Find the best test size based on the highest test AUC
best_index = np.argmax(test_aucs)
best_test_size = test_sizes[best_index]
best_test_auc = test_aucs[best_index]

# Plotting the chart
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(test_sizes, train_aucs, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(test_sizes, test_aucs, marker='s', label='Test ROC-AUC', color='#EA9010')

# Annotate the best test size
plt.annotate(
    f'Best Test Size = {best_test_size}',
    xy=(best_test_size, best_test_auc),
    xytext=(best_test_size - 0.05, best_test_auc - 0.03),
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

# Setting font sizes and layout
plt.title('Train vs Test ROC-AUC Across Different Test Sizes for Logistic Regression', fontsize=12, fontweight='bold')
plt.xlabel('Test Size', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)

plt.ylim(0.90, 1.00)
plt.legend(fontsize=8)
plt.grid(True)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

# Customize spines
for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔶 Test Size = 0.1

Train ROC-AUC ≈ 0.972

Test ROC-AUC ≈ 0.979

The difference between Train and Test is small. The model performs slightly better on the test set, which is most likely sampling variation in the small test split rather than underfitting, and overall indicates good generalization.

🔶 Test Size = 0.2

Train ROC-AUC ≈ 0.972

Test ROC-AUC ≈ 0.972

Both values are nearly identical. The model shows a good balance between training and testing performance, indicating stable behavior.

🔶 Test Size = 0.3

Train ROC-AUC ≈ 0.971

Test ROC-AUC ≈ 0.978

The test ROC-AUC is higher again. Despite a slight drop in training performance, the model generalizes well, suggesting improved robustness.

⭐ Test Size = 0.4 (Best Point)

Train ROC-AUC ≈ 0.969

Test ROC-AUC ≈ 0.979

This test size yields the highest ROC-AUC on the test set. Although the gap between Train and Test increases, it’s still acceptable. No clear signs of overfitting.

This point shows the best generalization performance.

🔶 Test Size = 0.5

Train ROC-AUC ≈ 0.968

Test ROC-AUC ≈ 0.978

Test performance remains high, but training ROC-AUC drops slightly more. Still a good balance, though slightly less ideal than at 0.4.

🔶 Conclusion:

Test Size = 0.4 is the best choice by test ROC-AUC, although the margin over the other splits is small (at most about 0.007).

The model demonstrates excellent generalization and performance at this split.
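Because the test scores above differ only in the third decimal, it can help to check how much a single split moves the score. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the notebook's `x` and `y`; the names `X_demo`, `y_demo`, and `summary` are hypothetical. It repeats each split with several seeds and reports the mean and standard deviation of the test ROC-AUC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's x, y (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0
)

summary = {}
for test_size in [0.1, 0.2, 0.3, 0.4, 0.5]:
    aucs = []
    for seed in range(10):  # repeat the same split size with different seeds
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size,
            random_state=seed, stratify=y_demo
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    summary[test_size] = (np.mean(aucs), np.std(aucs))
    print(f"test_size={test_size}: "
          f"mean AUC={summary[test_size][0]:.4f}, std={summary[test_size][1]:.4f}")
```

If the standard deviation is larger than the differences between test sizes, the ranking of test sizes from a single seed should be read with caution.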

### <b> Hyperparameter Tuning

In [None]:
# Define hyperparameters
class_weights = [None, 'balanced']
C_range = [0.01, 0.1, 1, 10, 100]
l1_ratios = np.arange(0, 1.1, 0.2)

# Create parameter grid covering different solvers and penalties
param_grid = [
    {'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 
     'penalty': [None],  # no regularization; on scikit-learn < 1.2 use the string 'none' instead
     'class_weight': class_weights, 
     'max_iter': [500, 1000]
    },
    {'solver': ['lbfgs', 'newton-cg', 'sag'], 
     'penalty': ['l2'], 
     'C': C_range, 
     'class_weight': class_weights, 
     'max_iter': [500, 1000]
    },
    {'solver': ['liblinear'], 
     'penalty': ['l1', 'l2'], 
     'C': C_range, 
     'class_weight': class_weights, 
     'max_iter': [500, 1000]
    },
    {'solver': ['saga'], 
     'penalty': ['l1', 'l2'], 
     'C': C_range, 
     'class_weight': class_weights, 
     'max_iter': [500, 1000]
    },
    {'solver': ['saga'], 
     'penalty': ['elasticnet'], 
     'C': C_range, 
     'l1_ratio': l1_ratios, 
     'class_weight': class_weights, 
     'max_iter': [500, 1000]
    }
]

# Try different CV values and store results
cv_values = [3, 5, 7, 10]
cv_results = []

# Base model
log_reg_model = LogisticRegression(random_state=0)

for cv in cv_values:
    print(f"\n🔶 Running GridSearchCV with cv={cv}")
    grid = GridSearchCV(
        estimator=log_reg_model,
        param_grid=param_grid,
        cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=0),
        scoring='roc_auc',
        return_train_score=True,
        n_jobs=-1,
        verbose=0
    )
    
    grid.fit(x, y)

    # Extract best index and corresponding metrics
    best_idx = grid.best_index_
    train_auc = grid.cv_results_['mean_train_score'][best_idx]
    val_auc = grid.cv_results_['mean_test_score'][best_idx]

    # Save result
    cv_results.append({
        'cv': cv,
        'grid': grid,
        'train_auc': train_auc,
        'val_auc': val_auc
    })

    print(f" cv={cv} ➤ Train ROC-AUC: {train_auc:.4f}, Validation ROC-AUC: {val_auc:.4f}")

In [None]:
# Initialize variables to track the best results
best_cv = None
best_model = None
best_score = 0

# Loop through cv_results to find the best ROC-AUC score
for result in cv_results:
    if result['val_auc'] > best_score:
        best_score = result['val_auc']
        best_cv = result['cv']
        best_model = result['grid']
        
print("\n🔶 Best number of folds (cv):", best_cv)
print("🔶 The best combination of hyperparameters:", best_model.best_params_)
print("🔶 Highest ROC-AUC score:", best_score)

In [None]:
# Extracting data for plotting
cv_vals = [r['cv'] for r in cv_results]
train_auc_scores = [r['train_auc'] for r in cv_results]
val_auc_scores = [r['val_auc'] for r in cv_results]

# Find the best CV fold (based on validation AUC)
best_idx = np.argmax(val_auc_scores)
best_cv = cv_vals[best_idx]
best_val_auc = val_auc_scores[best_idx]

# Plotting
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(cv_vals, train_auc_scores, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(cv_vals, val_auc_scores, marker='s', label='Validation ROC-AUC', color='#EA9010')

# Labels and title
plt.xlabel('CV folds (k)', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)
plt.title('Train vs Validation ROC-AUC Across Different CV Values for Logistic Regression', fontsize=12, fontweight='bold')
plt.ylim(min(min(train_auc_scores), min(val_auc_scores)) - 0.01, 1.0)
plt.xticks(cv_vals)
plt.grid(True)
plt.legend()

# Annotate best CV
plt.annotate(
    f'Best CV = {best_cv}', 
    xy=(best_cv, best_val_auc), 
    xytext=(best_cv - 0.4, best_val_auc + 0.015), 
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔶 CV = 3

Train ROC-AUC ≈ 0.973

Validation ROC-AUC ≈ 0.970

A small gap is present, but both scores are high. This suggests decent performance but possibly higher variance due to fewer folds.

🔶 CV = 5

Train ROC-AUC ≈ 0.973

Validation ROC-AUC ≈ 0.970

Very similar to CV=3. No significant improvement, but the model remains stable.

🔶 CV = 7

Train ROC-AUC ≈ 0.973

Validation ROC-AUC ≈ 0.971

A slight increase in validation ROC-AUC is observed, consistent with each model being trained on a larger share of the data as the number of folds grows.

⭐ CV = 10 ←(Best point)

Train ROC-AUC ≈ 0.973

Validation ROC-AUC ≈ 0.9715

This value gives the highest validation ROC-AUC.

The train-validation gap is minimal, indicating excellent generalization.

Best choice for cross-validation split in this experiment.

🔶 Conclusion:

CV = 10 gives the highest validation score, though the spread across all fold counts is only about 0.0015 (0.970 to 0.9715), so the choice of k has little effect here.

The model generalizes consistently across all folds, showing strong robustness and stability.
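The comparison above uses only the mean validation score. A search object also stores the per-candidate standard deviation across folds in `cv_results_['std_test_score']`, which gives a direct view of fold-to-fold variability. The following is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and `fold_summary` are hypothetical names) with a deliberately small grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the notebook's x, y
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0
)

fold_summary = {}
for k in [3, 5, 7, 10]:
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={'C': [0.01, 0.1, 1, 10]},
        cv=StratifiedKFold(n_splits=k, shuffle=True, random_state=0),
        scoring='roc_auc',
    ).fit(X_demo, y_demo)
    i = search.best_index_
    # Mean and spread of the validation score for the best candidate
    fold_summary[k] = (
        search.cv_results_['mean_test_score'][i],
        search.cv_results_['std_test_score'][i],
    )
    print(f"cv={k}: mean={fold_summary[k][0]:.4f}, std={fold_summary[k][1]:.4f}")
```

Note that the standard deviation tends to grow with more folds simply because each validation fold is smaller, so it describes the noise of the estimate rather than the quality of the model.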

In [None]:
grid = cv_results[cv_vals.index(10)]['grid']
results = grid.cv_results_

# Filter only rows where model used:
# - L1 penalty
# - liblinear solver
# - balanced class weight
# - 500 iterations
filtered_idxs = [
    i for i, params in enumerate(results['params'])
    if params.get('penalty') == 'l1' and
       params.get('solver') == 'liblinear' and
       params.get('class_weight') == 'balanced' and
       params.get('max_iter') == 500
]

# Extract values of C and their corresponding scores
C_vals = [results['params'][i]['C'] for i in filtered_idxs]
val_scores = [results['mean_test_score'][i] for i in filtered_idxs]
train_scores = [results['mean_train_score'][i] for i in filtered_idxs]

# Identify best C based on validation ROC-AUC
best_idx = np.argmax(val_scores)
best_C = C_vals[best_idx]
best_val_score = val_scores[best_idx]

# Plotting
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(C_vals, train_scores, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(C_vals, val_scores, marker='s', label='Validation ROC-AUC', color='#EA9010')

plt.xscale('log')  # Use a logarithmic scale on the x-axis
plt.xlabel('C (log scale)', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)
plt.title('Train vs Validation ROC-AUC Across Different C values\n(CV=10, Best Params)', fontsize=12, fontweight='bold')
plt.grid(True)
plt.legend()

# Annotate best C value on the chart
plt.annotate(
    f'Best C = {best_C}',
    xy=(best_C, best_val_score),
    xytext=(best_C, best_val_score - 0.02),
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

# Customize Y-axis limits
plt.ylim(min(min(train_scores), min(val_scores)) - 0.01, 1.0)

for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔶 C is the inverse of regularization strength in Logistic Regression. Smaller values imply stronger regularization.

🔶 This chart examines how different values of C affect the ROC-AUC score on both training and validation sets.

🔶 C = 0.01

Train ROC-AUC ≈ 0.960

Validation ROC-AUC ≈ 0.958

Both scores are the lowest among the tested values.

Indicates underfitting — the model is too constrained and unable to learn enough from the data.

⭐ C = 0.1 ←(Best Point)

Train ROC-AUC ≈ 0.972

Validation ROC-AUC ≈ 0.971

This is where validation ROC-AUC is highest, and the train-validation gap is minimal.

Suggests a good balance between bias and variance.

Best value for C, as it provides strong generalization with optimal complexity.

🔶 C = 1, 10, 100

Train ROC-AUC stays slightly above 0.973

Validation ROC-AUC slightly decreases (~0.970)

The model starts overfitting slightly: it performs better on the training set but not better on validation.

Increasing C reduces regularization, allowing the model to fit training data more closely — but not benefiting test performance.


🔶 C = 0.1 is the most suitable choice.

🔶 It gives the highest validation ROC-AUC, with a minimal gap from training ROC-AUC.

🔶 This value ensures the model is neither too simple nor too complex — ideal for generalization.
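To make the role of C concrete, the sketch below fits an L1-penalized logistic regression with the `liblinear` solver at each value of C and reports how many coefficients remain non-zero. It runs on synthetic, standardized data (an assumption, not the notebook's dataset), and `X_demo`, `y_demo`, and `l1_norms` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 4 informative features out of 20
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=4,
    n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

l1_norms = []
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(
        penalty='l1', solver='liblinear', C=C, max_iter=500, random_state=0
    ).fit(X_demo, y_demo)
    l1_norms.append(np.abs(clf.coef_).sum())
    # Smaller C means stronger regularization, which pushes coefficients to zero
    print(f"C={C}: non-zero coefficients={np.count_nonzero(clf.coef_)}, "
          f"L1 norm={l1_norms[-1]:.3f}")
```

With a small C the penalty dominates and most coefficients are driven to zero; as C grows the model is free to use more features, which matches the move from underfitting towards mild overfitting described above.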

### <b> Final Training and Inference

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=best_test_size, random_state=0, stratify=y)

In [None]:
# Rebuild the model with the best hyperparameters from the cv=10 grid search and
# fit it on the training split only. grid.best_estimator_ was refit on all of x,
# so evaluating it directly on x_test would leak test data into training.
best_log_reg_model = LogisticRegression(random_state=0, **grid.best_params_)
best_log_reg_model.fit(x_train, y_train)
print("Best Parameters:", grid.best_params_)

# Predicting the probability of each test sample belonging to the positive class (1)
y_score = best_log_reg_model.predict_proba(x_test)[:, 1]

# Calculating the values needed to plot the ROC curve: False Positive Rate, True Positive Rate
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr) # Calculating the Area Under the ROC Curve (ROC-AUC)

# Calculating the values needed to plot the Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_score)
avg_prec = average_precision_score(y_test, y_score) # Average Precision (AP)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), dpi=150)

# ROC Curve
axes[0].plot(fpr, tpr, label=f'ROC AUC = {roc_auc:.2f}', color='#90BE6D')
axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray')
axes[0].set_xlabel('False Positive Rate', fontsize=10)
axes[0].set_ylabel('True Positive Rate', fontsize=10)
axes[0].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True)

# Precision-Recall Curve
axes[1].plot(recall, precision, label=f'AP = {avg_prec:.2f}', color='#EA9010')
axes[1].set_xlabel('Recall', fontsize=10)
axes[1].set_ylabel('Precision', fontsize=10)
axes[1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True)

for ax in axes:
    for spine in ax.spines.values():
        spine.set_edgecolor('#D3D3D3') 
        spine.set_linewidth(1.5)
 
plt.suptitle('Performance Analysis of Tuned Logistic Regression Model', fontsize=16, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

🔶 ROC Curve (Receiver Operating Characteristic)

AUC (Area Under Curve) = 0.98

The ROC curve rises steeply towards the top-left corner, meaning the model reaches a high True Positive Rate (TPR) while the False Positive Rate (FPR) is still low.

AUC = 0.98 is considered excellent — meaning the model is highly capable of distinguishing between the two classes (loan vs no-loan).

The curve being far from the diagonal line (random guess line) confirms that the model performs significantly better than random chance.

Interpretation:
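The model has excellent discrimination ability: it ranks a randomly chosen loan taker above a randomly chosen non-taker about 98% of the time. How many samples are actually misclassified depends on the decision threshold, which is examined in the Evaluation section.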

🔶 Precision-Recall Curve

AP (Average Precision) = 0.87

The curve maintains high precision at various levels of recall, especially in the 0.0–0.6 recall range.

A gradual decline in precision at higher recall levels (beyond 0.6) indicates some trade-off between catching all positives and avoiding false positives.

Interpretation:

Because loan takers are the minority class, AP is the more demanding metric here. A score of 0.87 is far above the baseline of a random ranker, whose AP equals the positive-class rate, so the model balances identifying true positives against raising false alarms well.

🔶 Overall Conclusion:

ROC AUC = 0.98 → Excellent ranking of loan takers above non-takers.

AP = 0.87 → Strong precision-recall performance on the minority (loan) class. Neither metric measures probability calibration.

Both metrics are threshold-independent, so they describe ranking quality rather than the quality of the final class labels.
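ROC-AUC and AP depend only on how the scores order the samples, so they cannot tell whether the predicted probabilities are well calibrated. The sketch below illustrates this with simulated scores (an assumption; no notebook data is used): cubing the scores keeps their order, so ROC-AUC and AP stay the same, while the Brier score, which does measure calibration, changes.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.1).astype(int)  # about 10% positives

# Simulated scores: positives tend to score higher than negatives
scores = np.where(y_true == 1,
                  rng.beta(5, 2, size=2000),
                  rng.beta(2, 5, size=2000))
distorted = scores ** 3  # monotonic transform: same ranking, different values

auc_a, auc_b = roc_auc_score(y_true, scores), roc_auc_score(y_true, distorted)
ap_a = average_precision_score(y_true, scores)
ap_b = average_precision_score(y_true, distorted)
brier_a = brier_score_loss(y_true, scores)
brier_b = brier_score_loss(y_true, distorted)

print(f"ROC-AUC: {auc_a:.4f} vs {auc_b:.4f}")
print(f"AP:      {ap_a:.4f} vs {ap_b:.4f}")
print(f"Brier:   {brier_a:.4f} vs {brier_b:.4f}")
```

If calibrated probabilities matter for the marketing use case, a separate check such as the Brier score or a reliability plot would be needed.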

### <b> Evaluation

In [None]:
y_pred = best_log_reg_model.predict(x_test)

In [None]:
# Calculation and display of evaluation metrics
print("🔶 Accuracy:", accuracy_score(y_test, y_pred))
print("🔶 Precision:", precision_score(y_test, y_pred, average='weighted'))
print("🔶 Recall:", recall_score(y_test, y_pred, average='weighted'))
print("🔶 F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("\n🔶 Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix Plot
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(4, 4), dpi=150)  

sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', cbar=False, square=True,
            annot_kws={"size": 10}, ax=ax)

ax.set_xlabel("Predicted Label", fontsize=10)
ax.set_ylabel("True Label", fontsize=10)
ax.set_title("Confusion Matrix", fontsize=12, fontweight='bold')

ax.tick_params(axis='both', labelsize=9)

plt.tight_layout()
plt.show()

🔶 Overall Metrics

🔶 Accuracy: 89.76%

→ The model correctly predicts ~90% of total instances.

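🔶 Precision: 95.08%

→ This is the support-weighted average of the per-class precisions, so it is dominated by class 0 (no loan), where precision is close to 1.00. It does not describe precision on the loan class.

🔶 Recall: 89.75%

→ Also a support-weighted average. Weighted recall is mathematically equal to accuracy (up to rounding in the figures above), so it adds no new information here.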

🔶 F1 Score: 91.31%

→ Balanced measure of precision and recall, showing strong performance.
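The metrics above are computed with `average='weighted'`, which can hide weak performance on the minority class. The toy example below uses made-up labels (not the notebook's predictions) to show that weighted recall equals accuracy and that the weighted precision sits far above the class 1 precision.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 80 negatives correct, 10 negatives flagged as
# positive, 9 positives caught, 1 positive missed
y_pred = np.array([0] * 80 + [1] * 10 + [1] * 9 + [0] * 1)

acc = accuracy_score(y_true, y_pred)
rec_weighted = recall_score(y_true, y_pred, average='weighted')
prec_weighted = precision_score(y_true, y_pred, average='weighted')
prec_per_class = precision_score(y_true, y_pred, average=None)

print(f"accuracy={acc:.4f}, weighted recall={rec_weighted:.4f}")
print(f"weighted precision={prec_weighted:.4f}")
print(f"precision per class: class 0={prec_per_class[0]:.4f}, "
      f"class 1={prec_per_class[1]:.4f}")
```

For an imbalanced target, the per-class rows of `classification_report` are therefore the more informative part of the output.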

🔶 Class 0 (No Loan):

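Very high precision (1.00 after rounding): almost every customer predicted as non-loan really is a non-loan customer.

Good recall (0.89): 89% of non-loan customers are correctly identified; the rest are flagged as loan takers.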

🔶 Class 1 (Loan):

Low precision (0.44): more than half of the customers predicted as loan takers do not actually take the loan.

High recall (0.95): Almost all actual loan takers are caught.

🔶 Implication: The model is recall-oriented. It catches almost all loan takers (high recall) at the cost of labeling many non-loan customers as loan takers (low precision for class 1), which is consistent with the balanced class weighting used in training.

🔶 Needs improvement in precision for class 1, possibly through threshold tuning, resampling, or cost-sensitive learning.
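Threshold tuning, the first option mentioned above, can be sketched as follows. This is an illustration on synthetic data, not the notebook's pipeline: `X_demo`, `y_demo`, and the 0.70 precision target are hypothetical, and the threshold is chosen on a validation split so that the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, proba)
target_precision = 0.70  # hypothetical business requirement

# precision/recall have one more entry than thresholds; drop the final point
ok = precision[:-1] >= target_precision
# Lowest threshold that meets the target keeps recall as high as possible
threshold = thresholds[ok][0] if ok.any() else 0.5

y_pred_default = (proba >= 0.5).astype(int)
y_pred_tuned = (proba >= threshold).astype(int)
for name, pred in [('default 0.5', y_pred_default), ('tuned', y_pred_tuned)]:
    print(f"{name}: precision={precision_score(y_val, pred, zero_division=0):.3f}, "
          f"recall={recall_score(y_val, pred):.3f}")
```

Raising the threshold trades recall for precision, so the target should come from the cost of a wasted offer relative to a missed loan taker.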

### <b> Feature Selection

In [None]:
# Creating a color palette between yellow and green
colors = sns.color_palette("YlGn", n_colors=10)  # Yellow to Green

# Extracting model coefficients. For binary classification, coef_ has shape (1, n_features).
# Coefficient magnitudes are comparable as importances only if the features share a common scale.
importance = best_log_reg_model.coef_[0]
features = (
    normal_features +
    skewed_features +
    list(preprocessor.named_transformers_['categorical'].get_feature_names_out(categorical_features)) +
    geo_features +
    binary_features
) # Feature names; their order must match the column order produced by the preprocessor

# Create a DataFrame of features and their corresponding importance
coeff_df = pd.DataFrame({'Feature': features, 'Importance': importance})
coeff_df['AbsImportance'] = np.abs(coeff_df['Importance']) # Add a new column with the absolute value of importance for sorting
coeff_df.sort_values('AbsImportance', ascending=False, inplace=True) # Sort in descending order based on the absolute importance

plt.figure(figsize=(10, 4), dpi=200)
sns.barplot(x='Importance', y='Feature', data=coeff_df.head(10), palette=colors)
plt.title('Top 10 Important Features', fontsize=12, fontweight='bold') 

for spine in plt.gca().spines.values():
    spine.set_edgecolor('#D3D3D3') 
    spine.set_linewidth(1.5)  

plt.xlabel('Importance', fontsize=10)  
plt.ylabel('Feature', fontsize=10)   

plt.tick_params(axis='x', labelsize=8)  
plt.tick_params(axis='y', labelsize=8)  

plt.tight_layout()
plt.show()

🔶 Strongest predictors:

Income and CD Account are dominant. People with higher incomes and more financial assets are much more likely to take loans (possibly for investments or larger purchases).

🔶 Education and Family:

Education level 1 and smaller family sizes have negative coefficients, which could suggest lower financial need or risk aversion.

🔶 Spending & financial tools:

Features like CCAvg_Annual, CreditCard, and Securities Account reflect user activity and financial engagement, mildly influencing loan prediction.

🔶 Takeaway
The model relies most heavily on income and financial asset indicators, followed by demographics (such as education and family size) and spending behavior. This is intuitive and consistent with real-world loan behavior.
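Reading coefficients as importances works best when features are standardized, and exponentiating a coefficient gives an odds ratio that is easier to explain to a marketing team. The sketch below uses a made-up dataset; the column names `income_k`, `family`, and `noise` and the data-generating formula are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    'income_k': rng.normal(70, 40, n),   # hypothetical income in thousands
    'family': rng.integers(1, 5, n),     # family size 1..4
    'noise': rng.normal(0, 1, n),        # unrelated feature
})
# Hypothetical ground truth: income matters most, family size a little
logit = 0.04 * (demo['income_k'] - 70) + 0.3 * (demo['family'] - 2.5) - 2.0
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize first so that coefficients are on a comparable scale
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, y_demo)

coefs = pd.Series(pipe[-1].coef_[0], index=demo.columns)
odds_ratios = np.exp(coefs)  # multiplicative change in odds per 1 std increase
table = pd.DataFrame({'coef': coefs, 'odds_ratio': odds_ratios})
print(table.sort_values('coef', key=np.abs, ascending=False))
```

An odds ratio above 1 means the odds of accepting the loan rise as the feature increases by one standard deviation; a value below 1 means they fall.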

## <b> KNN Model

In [None]:
# Initialize empty lists to store results
test_sizes_knn = []
test_aucs_knn = []
train_aucs_knn = []

for i in test_size_list:
    x_train_knn, x_test_knn, y_train_knn, y_test_knn = train_test_split(x, y, test_size=i, random_state=0)

    # Create a KNN model with a fixed n_neighbors=27 (scikit-learn's default is 5); it is tuned later
    knn_model = KNeighborsClassifier(n_neighbors=27)
    knn_model.fit(x_train_knn, y_train_knn) # Fit model on the training data

    # Predicting probabilities
    train_proba_knn = knn_model.predict_proba(x_train_knn)[:, 1] # Probability for the positive class in training set
    test_proba_knn = knn_model.predict_proba(x_test_knn)[:, 1]   # Probability for the positive class in test set
    
    # Calculating ROC-AUC
    train_auc_knn = roc_auc_score(y_train_knn, train_proba_knn)
    test_auc_knn = roc_auc_score(y_test_knn, test_proba_knn)

    # Append results to the lists
    test_sizes_knn.append(i)
    train_aucs_knn.append(train_auc_knn)
    test_aucs_knn.append(test_auc_knn)

    # Predict labels for the test set (for classification report and confusion matrix)
    y_pred_knn = knn_model.predict(x_test_knn)

    # Output the results for the current test size
    print(f"\n🔶 Test size: {i}")
    print(confusion_matrix(y_test_knn, y_pred_knn))
    print(classification_report(y_test_knn, y_pred_knn))
    print("ROC-AUC Score:", test_auc_knn)

In [None]:
# Find the best test size based on the highest test AUC
best_index_knn = np.argmax(test_aucs_knn)
best_test_size_knn = test_sizes_knn[best_index_knn]
best_test_auc_knn = test_aucs_knn[best_index_knn]

# Plotting the same chart
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(test_sizes_knn, train_aucs_knn, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(test_sizes_knn, test_aucs_knn, marker='s', label='Test ROC-AUC', color='#EA9010')

plt.annotate(
    f'Best Test Size = {best_test_size_knn}',
    xy=(best_test_size_knn, best_test_auc_knn),
    xytext=(best_test_size_knn + 0.05, best_test_auc_knn + 0.01),
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

plt.title('Train vs Test ROC-AUC Across Different Test Sizes for KNN', fontsize=12, fontweight='bold')
plt.xlabel('Test Size', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)

plt.ylim(0.85, 1.00)
plt.legend(fontsize=8)
plt.grid(True)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

# Customize spines
for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔸 Test Size = 0.10

Train ROC-AUC ≈ 0.976

Test ROC-AUC ≈ 0.935

Significant gap: The model performs much better on training than testing — signs of overfitting.

🔸 Test Size = 0.20

Train ROC-AUC ≈ 0.975

Test ROC-AUC ≈ 0.944

Slight improvement in test performance, but gap remains noticeable.

🔸 Test Size = 0.30

Train ROC-AUC ≈ 0.976

Test ROC-AUC ≈ 0.946

Test performance increases slightly; the model still likely overfits somewhat.

⭐ Test Size = 0.40 ←(Best Point)

Train ROC-AUC ≈ 0.976

Test ROC-AUC ≈ 0.955 ← Highest test performance

Best generalization point: smallest train-test gap and highest test score.

🔸 Test Size = 0.50

Train ROC-AUC ≈ 0.976

Test ROC-AUC ≈ 0.954

Slight drop compared to 0.4. Still close, but not optimal.

🔸 Conclusion

Best Test Size = 0.4, as it gives the highest Test ROC-AUC (≈ 0.955).

The gap between Train and Test scores narrows as test size increases — a positive sign for model generalization.

At smaller test sizes (e.g., 0.1–0.2), the KNN model shows overfitting.

With 40% test data, the model is better balanced and generalizes well to unseen data.

### <b> Hyperparameter Tuning

In [None]:
# Define the parameter distributions for RandomizedSearchCV
param_distributions_knn = {
    'n_neighbors': randint(3, 30), # Number of neighbors, sampled from 3 to 29 (upper bound is exclusive)
    'weights': ['uniform', 'distance'],  # 'uniform' or 'distance' weight function for neighbors
    'algorithm': ['auto', 'brute'], # Algorithm used to compute nearest neighbors: 'auto' or 'brute'
    # 'ball_tree' and 'kd_tree' are not listed for simplicity; 'auto' may still select one of them
    'leaf_size': randint(20, 41), # Leaf size for tree-based algorithms, sampled from 20 to 40; ignored by 'brute'
    'metric': ['minkowski', 'euclidean', 'manhattan', 'chebyshev'], # Distance metrics for KNN
    'p': randint(1, 3)  # Sampled from {1, 2}; only effective when metric='minkowski' (p=1 is Manhattan, p=2 is Euclidean)
}

# List to store results of different cross-validation settings
cv_results_knn = []

# Initialize the KNN classifier model
knn_model = KNeighborsClassifier()

for cv in cv_values:
    print(f"\n🔶 Running RandomizedSearchCV for KNN with cv={cv}")
    # Initialize RandomizedSearchCV with specified parameters
    search = RandomizedSearchCV(
        estimator=knn_model,
        param_distributions=param_distributions_knn,
        n_iter=30,
        cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=0),
        scoring='roc_auc',
        return_train_score=True,
        n_jobs=-1,
        random_state=0,
        verbose=0
    )

    # Fit the search on the full dataset (x, y); cross-validation creates the train/validation folds
    search.fit(x, y)

    # Get the best index and corresponding AUC scores
    best_idx_knn = search.best_index_ # Index of the best model based on validation AUC
    train_auc_knn = search.cv_results_['mean_train_score'][best_idx_knn] # Train ROC-AUC score
    val_auc_knn = search.cv_results_['mean_test_score'][best_idx_knn] # Validation ROC-AUC score

    # Store results for each cross-validation value
    cv_results_knn.append({
        'cv': cv,
        'grid': search,
        'train_auc': train_auc_knn,
        'val_auc': val_auc_knn
    })

    print(f" cv={cv} ➤ Train ROC-AUC: {train_auc_knn:.4f}, Validation ROC-AUC: {val_auc_knn:.4f}")

In [None]:
# Extract data for plotting the results
cv_vals_knn = [r['cv'] for r in cv_results_knn]
train_auc_scores_knn = [r['train_auc'] for r in cv_results_knn]
val_auc_scores_knn = [r['val_auc'] for r in cv_results_knn]

# Find the best cv based on the highest Validation AUC
best_idx_knn = np.argmax(val_auc_scores_knn)
best_cv_knn = cv_vals_knn[best_idx_knn]
best_val_auc_knn = val_auc_scores_knn[best_idx_knn]
best_model_knn = cv_results_knn[best_idx_knn]['grid']

print("\n🔶 Best number of folds (cv):", best_cv_knn)
print("🔶 best combination of hyperparameters: ", best_model_knn.best_params_)
print("🔶 Highest ROC-AUC score: ", best_val_auc_knn)      

In [None]:
# Plotting ROC-AUC scores for different CV fold values in KNN
plt.figure(figsize=(10, 4), dpi=200)
# Plot train ROC-AUC scores
plt.plot(cv_vals_knn, train_auc_scores_knn, marker='o', label='Train ROC-AUC', color='#90BE6D')
# Plot validation ROC-AUC scores
plt.plot(cv_vals_knn, val_auc_scores_knn, marker='s', label='Validation ROC-AUC', color='#EA9010')

plt.xlabel('CV folds (k)', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)
plt.title('Train vs Validation ROC-AUC Across Different CV Values for KNN', fontsize=12, fontweight='bold')
plt.ylim(min(min(train_auc_scores_knn), min(val_auc_scores_knn)) - 0.01, 1.0) # Y-axis range based on minimum value
plt.xticks(cv_vals_knn) # X-axis ticks based on CV values used

plt.grid(True)
plt.legend()

# Annotating the best CV fold on the plot
plt.annotate(
    f'Best CV = {best_cv_knn}', 
    xy=(best_cv_knn, best_val_auc_knn), 
    xytext=(best_cv_knn - 0.5, best_val_auc_knn - 0.015), 
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔸 CV = 3

Train ROC-AUC ≈ 0.978

Validation ROC-AUC ≈ 0.952

The gap is noticeable but not excessive — moderate generalization.

🔸 CV = 5

Train ROC-AUC spikes ≈ 1.00

Validation ROC-AUC ≈ 0.958

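A large gap appears: the selected configuration fits the training data almost perfectly. A train score of essentially 1.00 is typical of KNN with weights='distance', where every training point is its own zero-distance neighbor, so this value alone says little about how well the model generalizes.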

🔸 CV = 7

Train ROC-AUC drops ≈ 0.982

Validation ROC-AUC ≈ 0.957

The gap narrows again, showing improved generalization.

⭐ CV = 10 ←(Best Point)

Train ROC-AUC ≈ 0.982

Validation ROC-AUC ≈ 0.958 ← Highest validation performance

Consistent and balanced — best trade-off between bias and variance.

🔸 Conclusion

Best CV = 10 is chosen because it achieves the highest Validation ROC-AUC while maintaining stable training performance.

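The number of folds does not cause overfitting by itself. Each search selects its own best hyperparameters, so the jump at CV=5 (Train ROC-AUC ≈ 1.00) reflects a different winning configuration rather than the fold count.

Validation scores for CV = 5, 7, and 10 differ by only about 0.001, so the evaluation is stable once at least five folds are used.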
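One plausible explanation for a training ROC-AUC of about 1.00 is that the selected configuration uses `weights='distance'`. This is an assumption, since the parameters chosen for cv=5 are not shown above. The sketch below, on synthetic data with hypothetical names `X_demo` and `y_demo`, shows that distance weighting alone produces a near-perfect training score, because each training point is its own nearest neighbor at distance zero.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some label noise (flip_y) so the classes overlap
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0
)

train_auc = {}
for weights in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=25, weights=weights)
    knn.fit(X_demo, y_demo)
    # Score on the same data the model was fitted on
    train_auc[weights] = roc_auc_score(y_demo, knn.predict_proba(X_demo)[:, 1])
    print(f"weights={weights}: train ROC-AUC={train_auc[weights]:.4f}")
```

For distance-weighted KNN the training score is therefore not a useful overfitting signal; the validation score is the number to compare.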

In [None]:
# Extract the grid corresponding to cv=10
grid = cv_results_knn[cv_vals_knn.index(10)]['grid']
results_knn = grid.cv_results_

# Filter only rows with the same hyperparameters as the best (except n_neighbors)
filtered_idxs_knn = [
    i for i, params in enumerate(results_knn['params'])
    if params.get('algorithm') == 'brute' and
       params.get('metric') == 'manhattan' and
       params.get('p') == 2 and
       params.get('weights') == 'uniform'
]


# Check if any rows match the filtering criteria
if not filtered_idxs_knn:
    print("No rows found with the specified parameter combination.")
else:
    # Extract n_neighbors and corresponding scores
    n_vals_knn = [results_knn['params'][i]['n_neighbors'] for i in filtered_idxs_knn]
    val_scores_knn= [results_knn['mean_test_score'][i] for i in filtered_idxs_knn]
    train_scores_knn = [results_knn['mean_train_score'][i] for i in filtered_idxs_knn]

    # Sort data based on n_neighbors for cleaner plots
    sorted_data_knn = sorted(zip(n_vals_knn, train_scores_knn, val_scores_knn))
    n_vals_knn, train_scores_knn, val_scores_knn = zip(*sorted_data_knn)

    # Identify the best n_neighbors based on validation score
    best_idx_knn = np.argmax(val_scores_knn)
    best_n = n_vals_knn[best_idx_knn]
    best_val_score = val_scores_knn[best_idx_knn]
    
    # Plotting Train and Validation ROC-AUC against n_neighbors
    plt.figure(figsize=(10, 4), dpi=200)
    plt.plot(n_vals_knn, train_scores_knn, marker='o', label='Train ROC-AUC', color='#90BE6D')
    plt.plot(n_vals_knn, val_scores_knn, marker='s', label='Validation ROC-AUC', color='#EA9010')

    plt.xlabel('n_neighbors', fontsize=10)
    plt.ylabel('ROC-AUC Score', fontsize=10)
    plt.title('Train vs Validation ROC-AUC Across Different n_neighbors\n(CV=10, Similar to Best Params)', fontsize=12, fontweight='bold')
    plt.grid(True)
    plt.legend()

    # Annotate best n
    plt.annotate(
        f'Best n = {best_n}',
        xy=(best_n, best_val_score),
        xytext=(best_n - 1.2, best_val_score - 0.02),
        arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
        fontsize=10, color='blue', fontweight='bold'
    )

    plt.ylim(min(min(train_scores_knn), min(val_scores_knn)) - 0.01, 1.0)
    for spine in plt.gca().spines.values():
        spine.set_color('lightgray')
        spine.set_linewidth(1.5)

    plt.tight_layout()
    plt.show()

🔸 Training ROC-AUC decreases as the number of neighbors increases:

This is expected. Larger n_neighbors values lead to smoother decision boundaries, meaning the model becomes less flexible and slightly underfits the training data.

🔸 Validation ROC-AUC increases with larger n_neighbors:

The validation performance improves steadily from n=5 to n=25.

This indicates better generalization and reduced overfitting.

🔸 The gap between Train and Validation scores narrows:

At lower n values, the training score is much higher than validation, which is a sign of overfitting.

As n increases, the model sacrifices a bit of training accuracy in favor of better performance on unseen data (validation), which is desirable.

🔸 Conclusion:

The best value for n_neighbors in this setup is 25, as it yields the highest validation ROC-AUC.

The decreasing training performance alongside increasing validation performance suggests the model is moving away from overfitting and achieving better generalization.

This is a healthy trade-off between bias and variance in KNN models.
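The plot above only contains the `n_neighbors` values that the randomized search happened to sample together with the other best parameters. A more systematic view is possible with `validation_curve`, which evaluates every value in a range while holding the other settings fixed. The following is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and `k_values` are hypothetical names).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0
)

k_values = np.arange(3, 30, 2)  # odd values from 3 to 29
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(weights='uniform', metric='manhattan'),
    X_demo, y_demo,
    param_name='n_neighbors', param_range=k_values,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring='roc_auc',
)

# Average over folds: one row per n_neighbors value
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_k = k_values[np.argmax(mean_val)]

for k, tr, va in zip(k_values, mean_train, mean_val):
    print(f"n_neighbors={k}: train={tr:.4f}, validation={va:.4f}")
print("Best n_neighbors by validation ROC-AUC:", best_k)
```

Plotting `mean_train` and `mean_val` against `k_values` gives the same kind of chart as above, with evenly spaced points.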

### <b> Final Training and Inference

In [None]:
x_train_knn, x_test_knn, y_train_knn, y_test_knn = train_test_split(x, y, test_size=best_test_size_knn, random_state=0, stratify=y)

In [None]:
# Best hyperparameters reported by RandomizedSearchCV, hard-coded here.
# With algorithm='brute' and metric='manhattan', 'leaf_size' and 'p' have no effect.
best_params_knn = {'algorithm': 'brute', 'leaf_size': 24, 'metric': 'manhattan',
               'n_neighbors': 25, 'p': 2, 'weights': 'uniform'}

# Build the KNN model using the best parameters
best_knn_model = KNeighborsClassifier(**best_params_knn)
best_knn_model.fit(x_train_knn, y_train_knn)

print("Best Parameters:", grid.best_params_)

# Predict the probability of the positive class for test data
y_score_knn = best_knn_model.predict_proba(x_test_knn)[:, 1]

# Calculate ROC Curve and AUC score
fpr, tpr, _ = roc_curve(y_test_knn, y_score_knn)
roc_auc = auc(fpr, tpr)

# Calculate Precision-Recall curve and Average Precision score
precision, recall, _ = precision_recall_curve(y_test_knn, y_score_knn)
avg_prec = average_precision_score(y_test_knn, y_score_knn)

# Plot ROC and Precision-Recall curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5), dpi=200)

# ROC Curve
axes[0].plot(fpr, tpr, label=f'ROC AUC = {roc_auc:.2f}', color='#90BE6D')
axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray')
axes[0].set_xlabel('False Positive Rate', fontsize=10)
axes[0].set_ylabel('True Positive Rate', fontsize=10)
axes[0].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True)

# Precision-Recall Curve
axes[1].plot(recall, precision, label=f'AP = {avg_prec:.2f}', color='#EA9010')
axes[1].set_xlabel('Recall', fontsize=10)
axes[1].set_ylabel('Precision', fontsize=10)
axes[1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True)

for ax in axes:
    for spine in ax.spines.values():
        spine.set_edgecolor('#D3D3D3') 
        spine.set_linewidth(1.5)

plt.suptitle('Performance Analysis of Tuned KNN Model', fontsize=16, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

🔶 ROC Curve

ROC AUC = 0.96

This is an excellent score, indicating the classifier has a very strong ability to distinguish between the positive and negative classes.

The curve is well above the diagonal (random chance line), which is a good sign.

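A True Positive Rate (TPR) near 1 with a low False Positive Rate (FPR) shows that thresholds exist with few missed positives (low Type II error) and good sensitivity. Whether the default threshold is one of them is checked in the Evaluation section.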

🔶 Precision-Recall Curve

Average Precision (AP) = 0.80

This is also a solid result, especially for imbalanced datasets (which PR curves are particularly helpful for).

The curve starts high (indicating strong precision at low recall levels), but as recall increases, precision gradually decreases — a typical trade-off.

This suggests the model maintains good performance but precision drops when trying to capture more positives (i.e., higher recall).

🔶 Overall Interpretation

High ROC AUC (0.96) + Good AP (0.80) indicate that the model performs very well in both:

Ranking ability (ROC)

Precision-recall balance, especially under class imbalance scenarios.

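The model ranks customers well on held-out data, likely a result of the hyperparameter tuning and cross-validation. These curves do not show whether the predicted probabilities are well calibrated or whether the default 0.5 threshold is appropriate.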

### <b>Evaluation

In [None]:
# Predict final class labels on the test data
y_pred_knn = best_knn_model.predict(x_test_knn)

# Compute and display evaluation metrics
print("🔶 Accuracy:", accuracy_score(y_test_knn, y_pred_knn))
print("🔶 Precision:", precision_score(y_test_knn, y_pred_knn, average='weighted'))
print("🔶 Recall:", recall_score(y_test_knn, y_pred_knn, average='weighted'))
print("🔶 F1 Score:", f1_score(y_test_knn, y_pred_knn, average='weighted'))
print("\n🔶 Classification Report:\n", classification_report(y_test_knn, y_pred_knn))

# Confusion matrix
cm_knn = confusion_matrix(y_test_knn, y_pred_knn)
fig, ax = plt.subplots(figsize=(4, 4), dpi=150)

sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Greens', cbar=False, square=True,
            annot_kws={"size": 10}, ax=ax)

ax.set_xlabel("Predicted Label", fontsize=10)
ax.set_ylabel("True Label", fontsize=10)
ax.set_title("Confusion Matrix", fontsize=12, fontweight='bold')
ax.tick_params(axis='both', labelsize=9)

plt.tight_layout()
plt.show()

🔶 Overall Metrics 

🔶 Accuracy: 0.923 

→ Very high overall, but accuracy alone can be misleading in imbalanced datasets.

🔶 Precision: 0.928 

→ This is the support-weighted average of the per-class precisions, so it is dominated by class 0 and does not describe the positive class on its own.

🔶 Recall: 0.923 

→ Weighted recall is identical to accuracy, so it also mostly reflects class 0; let's break it down class-wise.

🔶 F1 Score: 0.908 

→ Strong harmonic mean of precision and recall.

🔶 The model perfectly identifies all class 0 instances (True Negatives), but fails to identify class 1 instances (positives).

🔶 Recall for class 1 is only 6%, meaning the model is missing 94% of true positive cases 

🔶 Precision for class 1 is technically perfect (1.00), but that’s because it only predicted 10 instances as class 1 — and they all happened to be correct. This is a side effect of extreme underprediction.

🔶 This confirms the model is biased toward the majority class (class 0). It barely detects class 1 despite its relatively high overall accuracy, a classic case of performance being inflated by class imbalance.
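The gap between the weighted averages and the class 1 figures can be reproduced on a tiny example. This is a minimal sketch with hypothetical labels (90 negatives, 10 positives, made up for illustration and not taken from the bank data), comparing `average='weighted'` with the positive-class scores:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A classifier that flags a single true positive and predicts 0 everywhere else
y_pred = np.array([0] * 90 + [1] * 1 + [0] * 9)

# Support-weighted average: class 0 carries 90% of the weight
weighted_recall = recall_score(y_true, y_pred, average='weighted')

# Scores for the positive class only
class1_recall = recall_score(y_true, y_pred, pos_label=1)
class1_precision = precision_score(y_true, y_pred, pos_label=1)

print("Weighted recall:", weighted_recall)
print("Class 1 recall:", class1_recall)
print("Class 1 precision:", class1_precision)
```

The weighted recall stays high because class 0 is classified perfectly, while the class 1 recall shows how few positives were actually found.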

🔶 Conclusion:

Accuracy is misleading here — the model performs poorly on the minority class.

Although the ROC AUC from the previous plot is 0.96, the classification threshold likely needs to be adjusted to improve recall for class 1.

Consider:

Adjusting the decision threshold

Using resampling techniques (e.g., SMOTE)

Applying class weighting or focal loss (for models that support them; KNN itself offers neither)

Evaluating cost-sensitive metrics
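The first option, adjusting the decision threshold, can be sketched as follows. The helper `threshold_for_recall` and the toy scores are hypothetical; in the notebook the scores would presumably come from `best_knn_model.predict_proba(x_test_knn)[:, 1]`, and the threshold should ideally be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_score, target_recall):
    """Return the highest score threshold whose recall is at least target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # recall has one more entry than thresholds; the final entry (recall = 0) has no threshold
    qualifies = recall[:-1] >= target_recall
    # thresholds are sorted ascending and recall is non-increasing,
    # so the last qualifying entry is the highest usable threshold
    return thresholds[qualifies][-1]

# Toy data (hypothetical): three positives, one of them scored below 0.5
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

thr = threshold_for_recall(y_true, y_score, target_recall=1.0)
y_pred = (y_score >= thr).astype(int)

print("Chosen threshold:", thr)
print("Predictions:", y_pred)
```

Lowering the threshold recovers the positive that the default 0.5 cut-off would miss, at the cost of one extra false positive.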

### <b> Feature selection

In [None]:
# Compute feature importance using Permutation Importance
result = permutation_importance(best_knn_model, x_test_knn, y_test_knn, n_repeats=10, random_state=0)

# Create a DataFrame from the results
perm_df = pd.DataFrame({
    'Feature': x.columns,
    'Importance': result.importances_mean,
    'Std': result.importances_std
})
# Add a column with absolute importance values and sort
perm_df['AbsImportance'] = np.abs(perm_df['Importance'])
perm_df.sort_values('AbsImportance', ascending=False, inplace=True)

# Plot the top 10 most important features
plt.figure(figsize=(10, 4), dpi=200)
sns.barplot(x='Importance', y='Feature', data=perm_df.head(10), palette="YlGn")
plt.title('Top 10 Important Features (KNN)', fontsize=12, fontweight='bold')

# Customize plot appearance
for spine in plt.gca().spines.values():
    spine.set_edgecolor('#D3D3D3') 
    spine.set_linewidth(1.5)

plt.xlabel('Importance', fontsize=10)
plt.ylabel('Feature', fontsize=10)
plt.tick_params(axis='x', labelsize=8)
plt.tick_params(axis='y', labelsize=8)

plt.tight_layout()
plt.show()

🔶 Income and CCAvg_Annual (average annual credit card usage) are by far the most influential features in the KNN model. This indicates that financial status plays a critical role in prediction.

🔶 Demographic features such as Age, Experience, and Geographic features (lat, lng) are moderately important but much less impactful than financial indicators.

🔶 Family composition (Family_4.0, Family_1.0) and education level (Education_2.0) have minimal importance, showing low sensitivity in the model.

🔶 Interestingly, the CD Account feature (indicating whether a customer has a Certificate of Deposit account) has slight but positive importance, suggesting a small contribution to the output.
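The `permutation_importance` call above does not pass `scoring`, so it measures the drop in the estimator's default score, which is accuracy for a classifier. Since accuracy is dominated by class 0 here, one option is to score by ROC AUC instead. The following is a minimal sketch on synthetic data (the dataset from `make_classification` and the `n_neighbors=5` setting are assumptions for illustration, not the notebook's data or tuned model):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: 2 informative features, 3 noise features
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Measure the drop in ROC AUC (instead of accuracy) when each feature is shuffled
result = permutation_importance(
    model, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=0
)

# Rank features from most to least important
ranked = sorted(
    enumerate(result.importances_mean), key=lambda item: item[1], reverse=True
)
for idx, score in ranked:
    print(f"feature_{idx}: {score:.4f}")
```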

## <b> Complement Naive Bayes Model

In [None]:
# Split the dataset into features (X) and target (y)
x = merged_df.drop('Personal Loan', axis=1)
y = merged_df['Personal Loan']
x

In [None]:
# 1. Defining the feature lists for different types of data
normal_features = ['Age', 'Experience']
skewed_features = ['Income', 'CCAvg_Annual', 'Mortgage']
categorical_features = ['Family', 'Education']
binary_features = ['Securities Account', 'CD Account', 'Online', 'CreditCard']
geo_features = ['lat', 'lng']

# 2. Defining transformers
# Transformer for features with normal distribution, applying MinMax scaling
minmax = MinMaxScaler()

# Transformer for skewed features, applying logarithmic transformation followed by MinMax scaling
log_and_minmax = Pipeline(steps=[
    ('log', FunctionTransformer(func=np.log1p)), # Applying log(1 + x) transformation
    ('minmax', MinMaxScaler()) # Scaling the transformed data
])

# OneHotEncoder for categorical features
categorical = OneHotEncoder(drop=None, sparse_output=False)

# 3. Defining the ColumnTransformer to apply different transformations to different feature sets
preprocessor_cnb = ColumnTransformer(transformers=[
    ('normals', minmax, normal_features), # Apply MinMax scaling to normal features
    ('skewed', log_and_minmax, skewed_features), # Apply log transformation + MinMax scaling to skewed features
    ('categorical', categorical, categorical_features), # Apply OneHot encoding to categorical features
    ('geo', minmax, geo_features), # Apply MinMax scaling to geographical features
    ('passthrough', 'passthrough', binary_features) # Leave binary features unchanged (no transformation)
])

# 4. Applying preprocessing to the dataset
# Make a copy of the input dataset to avoid modifying the original dataset
x_cnb = x.copy(deep=True)
# Apply the defined transformations to the data
X_cnb_preprocessed = preprocessor_cnb.fit_transform(x_cnb)

# Save the preprocessor for later use
scaler_cnb = preprocessor_cnb

# ---------------------------
# 5. Create final DataFrame with column names
# ---------------------------

# Gather all output feature names including those from OneHotEncoder
output_columns = (
    normal_features +
    skewed_features +
    list(preprocessor_cnb.named_transformers_['categorical'].get_feature_names_out(categorical_features)) +
    geo_features +
    binary_features
)

# Create a new DataFrame with transformed features and target column
final_df2 = pd.DataFrame(X_cnb_preprocessed, columns=output_columns)
final_df2['Personal Loan'] = y.values

print(final_df2.head())

In [None]:
# Separate the dataset into features (X) and target variable (y)
x = final_df2.drop('Personal Loan', axis=1)
y = final_df2['Personal Loan']

In [None]:
# Initialize lists to store evaluation results
test_sizes_cnb = []
test_aucs_cnb = []
train_aucs_cnb = []

for i in test_size_list:
    x_train_cnb, x_test_cnb, y_train_cnb, y_test_cnb = train_test_split(
        x, y, test_size=i, random_state=0, stratify=y
    )
    
    # Initialize and train the Complement Naive Bayes model
    cnb_model = ComplementNB()
    cnb_model.fit(x_train_cnb, y_train_cnb)

    # Predict class probabilities for training and test sets
    train_proba_cnb = cnb_model.predict_proba(x_train_cnb)[:, 1]
    test_proba_cnb = cnb_model.predict_proba(x_test_cnb)[:, 1]

    # Calculate ROC-AUC scores for training and testing
    train_auc_cnb = roc_auc_score(y_train_cnb, train_proba_cnb)
    test_auc_cnb = roc_auc_score(y_test_cnb, test_proba_cnb)

    # Store results
    test_sizes_cnb.append(i)
    train_aucs_cnb.append(train_auc_cnb)
    test_aucs_cnb.append(test_auc_cnb)

    # Predict class labels on the test set
    y_pred_cnb = cnb_model.predict(x_test_cnb)

    print(f"\n🔶 Test size: {i}")
    print(confusion_matrix(y_test_cnb, y_pred_cnb))
    print(classification_report(y_test_cnb, y_pred_cnb))
    print(f"ROC-AUC Score: {test_auc_cnb:.4f}")

In [None]:
# Find the best test size based on the highest ROC-AUC score on the test set
best_index_cnb = np.argmax(test_aucs_cnb)
best_test_size_cnb = test_sizes_cnb[best_index_cnb]
best_test_auc_cnb = test_aucs_cnb[best_index_cnb]

# Plot the ROC-AUC scores for train and test sets across different test sizes
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(test_sizes_cnb, train_aucs_cnb, marker='o', label='Train ROC-AUC', color='#90BE6D') # Plot training ROC-AUC
plt.plot(test_sizes_cnb, test_aucs_cnb, marker='s', label='Test ROC-AUC', color='#EA9010') # Plot test ROC-AUC

plt.annotate(
    f'Best Test Size = {best_test_size_cnb}',
    xy=(best_test_size_cnb, best_test_auc_cnb),
    xytext=(best_test_size_cnb - 0.1, best_test_auc_cnb + 0.01),
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

# Set plot titles and labels
plt.title('Train vs Test ROC-AUC Across Different Test Sizes for ComplementNB', fontsize=12, fontweight='bold')
plt.xlabel('Test Size', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)

plt.ylim(0.75, 0.85)
plt.legend(fontsize=8)
plt.grid(True)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔶 Test Size = 0.1

Train ROC-AUC ≈ 0.793

Test ROC-AUC ≈ 0.778

There is a small gap between the train and test scores: the test score is about 0.015 lower than the train score. A gap this small shows no sign of overfitting and indicates decent generalization.

🔶 Test Size = 0.2

Train ROC-AUC ≈ 0.794

Test ROC-AUC ≈ 0.785

The scores are very close, indicating a balanced performance. The model handles both train and test data consistently.

🔶 Test Size = 0.3

Train ROC-AUC ≈ 0.796

Test ROC-AUC ≈ 0.775

Slight improvement in training score, but test performance dips slightly. This could hint at mild overfitting, though still within an acceptable range.

🔶 Test Size = 0.4

Train ROC-AUC ≈ 0.798

Test ROC-AUC ≈ 0.781

The gap between train and test scores increases slightly, but the test performance improves. This suggests improved generalization, though with a bit more overfitting compared to previous sizes.

⭐ Test Size = 0.5 ←(Best Point)

Train ROC-AUC ≈ 0.801

Test ROC-AUC ≈ 0.798

This is the best point based on the ROC-AUC for the test set. The small gap between train and test indicates excellent generalization. The model achieves its peak test performance here.

🔶 Conclusion:

The best test size is 0.5, where the Test ROC-AUC reaches its highest value (~0.798).

The model generalizes best at this point, with only a slight increase in training performance.

Although smaller test sizes (like 0.1 or 0.2) offer stability, the 0.5 test split provides the most balanced and optimal model performance.

### <b> Hyperparameter Tuning

In [None]:
# Define the hyperparameter grid for Complement Naive Bayes
param_grid_cnb = {
    'alpha': [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    'fit_prior': [True, False], # In ComplementNB, only used in the edge case of a single class in the training set
    'norm': [True, False]       # Whether to perform a second normalization of the weights
}

# List to store the results for each CV value
cv_results_cnb = []

# Initialize the model
cnb_model = ComplementNB()

# Perform grid search over different cross-validation values
for cv in cv_values:
    print(f"\n🔶 Running GridSearchCV for ComplementNB with cv={cv}")

    # Setup GridSearchCV with Stratified K-Fold
    grid = GridSearchCV(
        estimator=cnb_model,
        param_grid=param_grid_cnb,
        cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=0),
        scoring='roc_auc',
        return_train_score=True,
        n_jobs=-1,
        verbose=0
    )
    
    # Fit the model to data
    grid.fit(x, y)
    
    # Extract the best training and validation scores
    best_idx_cnb = grid.best_index_
    train_auc_cnb = grid.cv_results_['mean_train_score'][best_idx_cnb]
    val_auc_cnb = grid.cv_results_['mean_test_score'][best_idx_cnb]

    # Store the results
    cv_results_cnb.append({
        'cv': cv,
        'grid': grid,
        'train_auc': train_auc_cnb,
        'val_auc': val_auc_cnb
    })

    print(f" cv={cv} ➤ Train ROC-AUC: {train_auc_cnb:.4f}, Validation ROC-AUC: {val_auc_cnb:.4f}")

In [None]:
# Find the best model based on the highest validation AUC
best_entry = max(cv_results_cnb, key=lambda x: x['val_auc'])
best_cv_cnb = best_entry['cv']
best_model_cnb = best_entry['grid']
best_score_cnb = best_entry['val_auc']

# Display the results
print("\n🔶 Best number of folds (cv):", best_cv_cnb)
print("🔶 best combination of hyperparameters:", best_model_cnb.best_params_)
print("🔶 Highest ROC-AUC score:", round(best_score_cnb, 4))

In [None]:
# Extract data from CV results
cv_vals_cnb = [r['cv'] for r in cv_results_cnb]                   # List of CV values
train_auc_scores_cnb = [r['train_auc'] for r in cv_results_cnb]   # Corresponding train AUCs
val_auc_scores_cnb = [r['val_auc'] for r in cv_results_cnb]       # Corresponding validation AUCs

# Identify the best CV value based on highest validation AUC
best_idx_cnb = np.argmax(val_auc_scores_cnb)
best_cv_cnb = cv_vals_cnb[best_idx_cnb]
best_val_auc_cnb = val_auc_scores_cnb[best_idx_cnb]

# Plotting the train and validation ROC-AUC scores across CV folds
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(cv_vals_cnb, train_auc_scores_cnb, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(cv_vals_cnb, val_auc_scores_cnb, marker='s', label='Validation ROC-AUC', color='#EA9010')

# Chart labels and title
plt.xlabel('CV folds (k)', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)
plt.title('Train vs Validation ROC-AUC Across Different CV Values for ComplementNB', fontsize=12, fontweight='bold')
plt.ylim(min(min(train_auc_scores_cnb), min(val_auc_scores_cnb)) - 0.01, 0.9)
plt.xticks(cv_vals_cnb)
plt.grid(True)
plt.legend()

# Annotate best CV
plt.annotate(
    f'Best CV = {best_cv_cnb}', 
    xy=(best_cv_cnb, best_val_auc_cnb), 
    xytext=(best_cv_cnb - 0.8, best_val_auc_cnb + 0.05), 
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

🔶 CV = 3

Train ROC-AUC ≈ 0.793

Validation ROC-AUC ≈ 0.782

A small gap exists between train and validation. The validation score is slightly lower, indicating the model generalizes reasonably well. With only 3 folds, each model is trained on two-thirds of the data, which may cost a little validation performance.

🔶 CV = 5

Train ROC-AUC ≈ 0.793

Validation ROC-AUC ≈ 0.784

Performance on the validation set improves slightly. Still a balanced outcome, and the results remain stable.

🔶 CV = 7

Train ROC-AUC ≈ 0.793

Validation ROC-AUC ≈ 0.785

The scores remain consistent. Increasing k improves validation performance very slightly. The model maintains stability without noticeable overfitting.

⭐ CV = 10 ←(Best Point)

Train ROC-AUC ≈ 0.793

Validation ROC-AUC ≈ 0.786

This is the highest Validation ROC-AUC among all CV folds. The gap between train and validation scores is minimal, indicating excellent generalization and reliable performance.

🔶 Conclusion:

The best CV fold is k = 10, where the Validation ROC-AUC peaks at ~0.786.

Across all values of k, both Train and Validation scores are remarkably stable, which shows that the model is robust and not highly sensitive to the number of CV folds.

Using more folds (like k=10) slightly improves validation performance while maintaining model stability.

In [None]:
# Retrieve the GridSearchCV object for CV=10
grid = cv_results_cnb[cv_vals_cnb.index(10)]['grid']
results_cnb = grid.cv_results_

# Filter entries where 'norm' and 'fit_prior' are both True
filtered_idxs_cnb = [
    i for i, params in enumerate(results_cnb['params'])
    if params.get('norm') == True and
       params.get('fit_prior') == True
]

# Extract alpha values and corresponding mean train/validation ROC-AUC scores
alpha_vals_cnb = [results_cnb['params'][i]['alpha'] for i in filtered_idxs_cnb]
val_scores_cnb = [results_cnb['mean_test_score'][i] for i in filtered_idxs_cnb]
train_scores_cnb = [results_cnb['mean_train_score'][i] for i in filtered_idxs_cnb]

# Identify the alpha value that yields the highest validation ROC-AUC
best_idx_cnb = np.argmax(val_scores_cnb)
best_alpha_cnb = alpha_vals_cnb[best_idx_cnb]
best_val_score_cnb = val_scores_cnb[best_idx_cnb]

# Plot train and validation ROC-AUC scores across different alpha values
plt.figure(figsize=(10, 4), dpi=200)
plt.plot(alpha_vals_cnb, train_scores_cnb, marker='o', label='Train ROC-AUC', color='#90BE6D')
plt.plot(alpha_vals_cnb, val_scores_cnb, marker='s', label='Validation ROC-AUC', color='#EA9010')

# Use logarithmic scale for alpha axis (since alpha spans wide values)
plt.xscale('log')
plt.xlabel('Alpha (log scale)', fontsize=10)
plt.ylabel('ROC-AUC Score', fontsize=10)
plt.title('Train vs Validation ROC-AUC Across Different Alpha Values\n(CV=10, Fixed Params)', fontsize=12, fontweight='bold')
plt.grid(True)
plt.legend()

# Annotate best alpha
plt.annotate(
    f'Best α = {best_alpha_cnb}',
    xy=(best_alpha_cnb, best_val_score_cnb),
    xytext=(best_alpha_cnb, best_val_score_cnb + 0.02),
    arrowprops=dict(facecolor='blue', edgecolor='blue', arrowstyle='->', lw=1.5),
    fontsize=10, color='blue', fontweight='bold'
)

plt.ylim(min(min(train_scores_cnb), min(val_scores_cnb)) - 0.01, 0.9)

for spine in plt.gca().spines.values():
    spine.set_color('lightgray')
    spine.set_linewidth(1.5)

plt.tight_layout()
plt.show()

⭐ α = 0.01 ←(Best Point)

Train ROC-AUC ≈ 0.792

Validation ROC-AUC ≈ 0.782

This is the best-performing model across the tested values.

Minimal gap between training and validation suggests low variance and good generalization.

🔸 α = 0.1 to 10 (rightward on log scale)

Gradual decline in both training and validation ROC-AUC scores.

As α increases, the additive (Laplace/Lidstone) smoothing gets stronger, which pulls the estimated feature probabilities toward a uniform distribution.

This leads to underfitting, especially at α = 10 (lowest scores).

The validation score drops more sharply than training as α increases → reduced model flexibility.

🔸 Trend Observed

Train score is consistently slightly higher than validation, which is normal.

The gap between curves remains small → no major signs of overfitting.

However, higher α values harm performance due to excessive smoothing.

🔸 Conclusion:

The model achieves its optimal balance at α = 0.01, where both training and validation ROC-AUC scores are highest.
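To see why large α values flatten the model, it helps to look at additive smoothing in isolation. The sketch below uses the generic Lidstone formula `(count + alpha) / (total + alpha * n_features)` on hypothetical counts. ComplementNB applies the same kind of smoothing to counts gathered from the complement of each class, so this illustrates the effect rather than reproducing its internals:

```python
import numpy as np

def smoothed_probs(counts, alpha):
    """Additive (Lidstone) smoothing of a vector of feature counts."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

# Hypothetical feature counts for one class
counts = [8.0, 2.0, 0.0]

for alpha in [0.01, 1.0, 10.0]:
    probs = smoothed_probs(counts, alpha)
    print(f"alpha={alpha}: {np.round(probs, 3)}")
```

With a small α the estimates stay close to the raw frequencies, while a large α moves all three values toward 1/3, which removes most of the signal that separates the classes.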

### <b> Final Training and Inference

In [None]:
x_train_cnb, x_test_cnb, y_train_cnb, y_test_cnb = train_test_split(x, y, test_size=best_test_size_cnb, random_state=0, stratify=y)

In [None]:
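# 'grid' is the GridSearchCV object for cv=10 retrieved above. With the default
# refit=True, best_estimator_ was refit on all of x and y, so the test rows used
# below were seen during fitting and the resulting curves are optimistic.
best_cnb_model = grid.best_estimator_  # Best model from GridSearchCV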
print("Best Parameters:", grid.best_params_)

# Predict probabilities for the positive class (class 1)
y_score_cnb = best_cnb_model.predict_proba(x_test_cnb)[:, 1]

# Compute ROC curve metrics
fpr, tpr, _ = roc_curve(y_test_cnb, y_score_cnb)
roc_auc = auc(fpr, tpr)

# Compute Precision-Recall curve metrics
precision, recall, _ = precision_recall_curve(y_test_cnb, y_score_cnb)
avg_prec = average_precision_score(y_test_cnb, y_score_cnb)

# Plotting ROC and Precision-Recall curves side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5), dpi=200)

# ROC Curve
axes[0].plot(fpr, tpr, label=f'ROC AUC = {roc_auc:.2f}', color='#90BE6D')
axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray')
axes[0].set_xlabel('False Positive Rate', fontsize=10)
axes[0].set_ylabel('True Positive Rate', fontsize=10)
axes[0].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True)

# Precision-Recall Curve
axes[1].plot(recall, precision, label=f'AP = {avg_prec:.2f}', color='#EA9010')
axes[1].set_xlabel('Recall', fontsize=10)
axes[1].set_ylabel('Precision', fontsize=10)
axes[1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True)

for ax in axes:
    for spine in ax.spines.values():
        spine.set_edgecolor('#D3D3D3')
        spine.set_linewidth(1.5)

plt.suptitle('Performance Analysis of Tuned Complement Naive Bayes Model', fontsize=16, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

🔸ROC Curve

ROC AUC = 0.80

This means the model has moderate discriminatory power. AUC of 0.80 indicates that the classifier is able to distinguish between the positive and negative classes reasonably well, but it's not highly strong.

The curve rises above the diagonal (random guess), but not steeply.

Still, there's room for improvement — especially compared to models with ROC AUC ≥ 0.90.

🔸 Precision-Recall Curve

Average Precision (AP) = 0.39

This is relatively low, particularly in imbalanced datasets where precision-recall metrics are more informative than ROC.

The precision drops significantly as recall increases, indicating the model struggles to maintain high precision when trying to retrieve more positive cases.

This suggests that false positives heavily outnumber true positives when recall is prioritized.

🔸 Conclusion

The Complement Naive Bayes model performs moderately well in terms of ROC AUC, but poorly in terms of precision-recall.

This could indicate the model is not well suited to tasks like this one, where detecting the positive class is critical and the dataset is imbalanced.
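For context, the Average Precision of a classifier with no ranking ability equals the share of positives in the data, so an AP value is best read against that baseline. A minimal sketch with hypothetical labels at the roughly 10% positive rate mentioned in this notebook:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels: 10% positives
y_true = np.array([0] * 90 + [1] * 10)

# A scorer with no ranking ability: every sample receives the same score
constant_scores = np.zeros(len(y_true))

baseline_ap = average_precision_score(y_true, constant_scores)
prevalence = y_true.mean()

print("Baseline AP:", baseline_ap)
print("Positive rate:", prevalence)
```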

### <b>Evaluation

In [None]:
# Predict class labels using the best ComplementNB model
y_pred_cnb = best_cnb_model.predict(x_test_cnb)

# Compute and display evaluation metrics
print("🔶 Accuracy:", accuracy_score(y_test_cnb, y_pred_cnb))
print("🔶 Precision:", precision_score(y_test_cnb, y_pred_cnb, average='weighted'))
print("🔶 Recall:", recall_score(y_test_cnb, y_pred_cnb, average='weighted'))
print("🔶 F1 Score:", f1_score(y_test_cnb, y_pred_cnb, average='weighted'))
print("\n🔶 Classification Report:\n", classification_report(y_test_cnb, y_pred_cnb))

# Confusion Matrix
cm_cnb = confusion_matrix(y_test_cnb, y_pred_cnb)
# Plot the confusion matrix as a heatmap
fig, ax = plt.subplots(figsize=(4, 4), dpi=150)

sns.heatmap(cm_cnb, annot=True, fmt='d', cmap='Greens', cbar=False, square=True,
            annot_kws={"size": 10}, ax=ax)

ax.set_xlabel("Predicted Label", fontsize=10)
ax.set_ylabel("True Label", fontsize=10)
ax.set_title("Confusion Matrix", fontsize=12, fontweight='bold')
ax.tick_params(axis='both', labelsize=9)

plt.tight_layout()
plt.show()

🔶 Overall Metrics:

🔶 Accuracy: 0.6666 

→ About 66.7% of the total predictions were correct.

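🔶 Precision: 0.9028 

→ This is a support-weighted average dominated by class 0 (precision 0.97); precision for class 1 is only 0.17, as shown below.

🔶 Recall: 0.6666

→ Weighted recall is identical to accuracy, so this is the share of all samples classified correctly, not the share of actual positives retrieved.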

🔶 F1 Score: 0.7411 

→ Harmonic mean of precision and recall.

🔶 For class 0, the model performs well overall, especially in precision (0.97).

🔶 For class 1, performance is very poor in precision (0.17), meaning a lot of false positives (samples predicted as 1 but are actually 0).

🔶 Conclusion:

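With roughly 90% of the samples in class 0, an accuracy of 0.67 combined with a class 1 precision of 0.17 means the model over-predicts class 1 rather than favoring the majority class.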

Performance for minority class (class 1) is poor, especially in precision.

### <b> Feature Selection

In [None]:
# Get the feature names from the dataset
features = x.columns

# Compute the difference between the class 1 and class 0 rows of feature_log_prob_
# (in ComplementNB these are empirical weights for class complements).
# This difference is used as a measure of the relative importance of each feature
log_prob_diff = best_cnb_model.feature_log_prob_[1] - best_cnb_model.feature_log_prob_[0]

# Create a DataFrame to store feature names and their importance scores
coeff_df = pd.DataFrame({'Feature': features, 'Importance': log_prob_diff})
# Calculate the absolute importance to help with sorting
coeff_df['AbsImportance'] = np.abs(coeff_df['Importance'])  
# Sort the DataFrame by absolute importance in descending order
coeff_df.sort_values('AbsImportance', ascending=False, inplace=True)

# Generate a color palette with 10 greenish shades for the plot
colors = sns.color_palette("YlGn", n_colors=10)

# Plot the top 10 most important features using a horizontal barplot
plt.figure(figsize=(10, 4), dpi=200)
sns.barplot(x='Importance', y='Feature', data=coeff_df.head(10), palette=colors)
# Set the title of the plot
plt.title('Top 10 Important Features (ComplementNB)', fontsize=12, fontweight='bold')

for spine in plt.gca().spines.values():
    spine.set_edgecolor('#D3D3D3')
    spine.set_linewidth(1.5)

plt.xlabel('Log Probability Difference', fontsize=10)
plt.ylabel('Feature', fontsize=10)
plt.tick_params(axis='x', labelsize=8)
plt.tick_params(axis='y', labelsize=8)
plt.tight_layout()
plt.show()

🔶 CD Account is by far the most influential feature in the ComplementNB model. Its strong positive log-probability difference shows that having a Certificate of Deposit account is a key indicator of class 1. This feature contributes heavily to positive predictions.

🔶 Education_1.0 and Mortgage both have strong negative influence, meaning they are more associated with class 0. This suggests that individuals with lower education levels or active mortgages are more likely to be predicted as class 0.

🔶 Financial features like Income and CCAvg_Annual (average annual credit card spending) show moderate importance. These variables still influence the model’s output, particularly in favoring class 1, but less strongly than CD Account.

🔶 Demographic and household structure features such as Family size (Family_1.0, Family_2.0, Family_3.0) and Education levels (Education_2.0, Education_3.0) show relatively minor contributions. Their log-probability differences are small, indicating low sensitivity in this model.

🔶 Overall, the ComplementNB model leans heavily on a small set of high-impact financial features, with demographic details playing a supporting but limited role.

## <b> Conclusion

In [None]:
# List to store the final evaluation results of the models
results = []

# Define a function to evaluate a model by calculating and storing various metrics
def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # If the model supports probability estimates, get the probability for class 1
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    # Store the evaluation metrics in the results list
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1 Score': f1_score(y_test, y_pred, average='weighted'),
        'ROC AUC': roc_auc_score(y_test, y_proba) if y_proba is not None else 'N/A'
    })

# Evaluate all three tuned models, each on its own train/test split
# (note that evaluate_model refits each model on its training split)
evaluate_model("Logistic Regression", best_log_reg_model, x_train, x_test, y_train, y_test)
evaluate_model("KNN", best_knn_model, x_train_knn, x_test_knn, y_train_knn, y_test_knn)
evaluate_model("Complement Naive Bayes", best_cnb_model, x_train_cnb, x_test_cnb, y_train_cnb, y_test_cnb)

# Convert the results list to a DataFrame for tabular display
results_df = pd.DataFrame(results)

print("🔶 Final Evaluation Metrics Table:")
display(results_df.round(4))

🔶 So, Logistic Regression is the best model among the three.
    
🔶 Although the KNN model has a higher accuracy than Logistic Regression, the ROC AUC score is a more appropriate evaluation metric.

🔶 We are dealing with an imbalanced classification problem (90% negative class and 10% positive class), and ROC AUC is a better choice because:

<b>.</b> It is independent of the threshold.

<b>.</b> It measures the model’s ability to correctly distinguish between classes.

<b>.</b> Even if a model simply predicts only the majority class (resulting in high accuracy), its ROC AUC will be about 0.5, no better than random ranking.
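The last point can be checked directly. This is a minimal sketch with hypothetical labels at the same 90/10 class ratio, where the "model" assigns the same score to every customer and therefore always predicts class 0:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 90% negative, 10% positive
y_true = np.array([0] * 90 + [1] * 10)

# A majority-class "model": the same score and the same label for everyone
constant_scores = np.zeros(len(y_true))
constant_pred = np.zeros(len(y_true), dtype=int)

acc_majority = accuracy_score(y_true, constant_pred)
auc_majority = roc_auc_score(y_true, constant_scores)

print("Accuracy:", acc_majority)
print("ROC AUC:", auc_majority)
```

Accuracy rewards this model for the class ratio alone, while ROC AUC shows that it cannot separate the two classes at all.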

In [None]:
# Set up a figure with two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot the first confusion matrix (e.g., Logistic Regression)
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', cbar=False, ax=axes[0])
axes[0].set_title('Confusion Matrix - Logistic Regression')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# Plot the second confusion matrix (e.g., KNN)
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Greens', cbar=False, ax=axes[1])
axes[1].set_title('Confusion Matrix - KNN')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()

🔶 However, comparing the confusion matrices suggests that a combination of Logistic Regression and KNN could yield the best outcome. Our target in this problem is class 1: the logistic regression model correctly identified 148 cases as class 1, while the KNN model identified only 10 and misclassified a large number of actual class 1 cases as class 0. On the other hand, in terms of cost, the KNN model is better because it never offers a loan to a customer who would not accept it (FP = 0), while the logistic model would make 186 such incorrect offers.

🔶 The Complement Naive Bayes model, by contrast, does not perform well on this dataset, even though it is designed to handle imbalanced data.

## <b>Ensemble Model (Logistic Regression + KNN)

🔶 One way to build an ensemble model is VotingClassifier.

🔶 VotingClassifier can be useful for imbalanced data when the combined models make different kinds of errors, as Logistic Regression and KNN do here.

🔶 In VotingClassifier, because multiple models judge the data:

The effect of noise or outliers is reduced.

The prediction error is statistically reduced.

🔶 In VotingClassifier, soft voting can be used (i.e., averaging the class probabilities). This makes:

The outputs smoother and more accurate.

Models that are more confident in their predictions have more weight.
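Soft voting itself is just an average of the class probabilities followed by an argmax. The sketch below works through it by hand on hypothetical probability outputs (the arrays are made up, not produced by the notebook's models), which makes the "more confident model has more weight" point concrete:

```python
import numpy as np

# Hypothetical predict_proba outputs for three customers,
# columns = [P(class 0), P(class 1)]
proba_log_reg = np.array([[0.30, 0.70],
                          [0.80, 0.20],
                          [0.45, 0.55]])
proba_knn = np.array([[0.60, 0.40],
                      [0.90, 0.10],
                      [0.80, 0.20]])

# Soft voting with equal weights: average the probabilities, then take the argmax
avg_proba = (proba_log_reg + proba_knn) / 2
soft_pred = avg_proba.argmax(axis=1)

# Each model's own vote, for comparison
log_reg_vote = proba_log_reg.argmax(axis=1)
knn_vote = proba_knn.argmax(axis=1)

print("Logistic Regression votes:", log_reg_vote)
print("KNN votes:", knn_vote)
print("Soft-voting prediction:", soft_pred)
```

The two models disagree on the first and third customers. In both cases the averaged probability follows whichever model is further from 0.5, which is how soft voting gives more weight to the more confident prediction.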

In [None]:
# Split the dataset into features (X) and target (y)
x = final_df.drop('Personal Loan', axis=1)
y = final_df['Personal Loan']

In [None]:
x_train_log_reg_knn, x_test_log_reg_knn, y_train_log_reg_knn, y_test_log_reg_knn = train_test_split(x, y, test_size=0.4, random_state=0, stratify=y)

voting_clf = VotingClassifier(
    estimators=[('log_reg', best_log_reg_model), ('knn', best_knn_model)],
    voting='soft'
)

voting_clf.fit(x_train_log_reg_knn, y_train_log_reg_knn)

# Predict probabilities for the positive class (class 1)
y_score_log_reg_knn = voting_clf.predict_proba(x_test_log_reg_knn)[:, 1]

# Compute ROC curve metrics
fpr, tpr, _ = roc_curve(y_test_log_reg_knn, y_score_log_reg_knn)
roc_auc = auc(fpr, tpr)

# Compute Precision-Recall curve metrics
precision, recall, _ = precision_recall_curve(y_test_log_reg_knn, y_score_log_reg_knn)
avg_prec = average_precision_score(y_test_log_reg_knn, y_score_log_reg_knn)

# Plotting ROC and Precision-Recall curves side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5), dpi=200)

# ROC Curve
axes[0].plot(fpr, tpr, label=f'ROC AUC = {roc_auc:.2f}', color='#90BE6D')
axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray')
axes[0].set_xlabel('False Positive Rate', fontsize=10)
axes[0].set_ylabel('True Positive Rate', fontsize=10)
axes[0].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True)

# Precision-Recall Curve
axes[1].plot(recall, precision, label=f'AP = {avg_prec:.2f}', color='#EA9010')
axes[1].set_xlabel('Recall', fontsize=10)
axes[1].set_ylabel('Precision', fontsize=10)
axes[1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True)

for ax in axes:
    for spine in ax.spines.values():
        spine.set_edgecolor('#D3D3D3')
        spine.set_linewidth(1.5)

plt.suptitle('Performance Analysis of Tuned Ensemble Model', fontsize=16, fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

🔸 ROC Curve

ROC AUC = 0.99

This indicates excellent discriminatory power. AUC close to 1 means the model is highly capable of distinguishing between the positive and negative classes.

The curve hugs the top-left corner, which is ideal, showing a very low false positive rate and high true positive rate.

Such a steep and early rise in the ROC curve suggests the model consistently ranks positive instances higher than negatives.
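The ranking interpretation of ROC AUC can be made concrete with a small sketch. The helper `pairwise_auc` and the toy labels and scores below are hypothetical and are not part of the notebook's pipeline; the function simply counts how often a positive sample scores higher than a negative one, which is what the area under the ROC curve measures.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Hypothetical helper for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label.")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed values, not taken from the bank dataset)
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Pairwise AUC on toy data: {toy_auc:.3f}")
```

An AUC of 0.99 therefore means that in roughly 99% of positive/negative pairs the ensemble assigns the higher probability to the customer who actually accepted the loan.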

🔸 Precision-Recall Curve

Average Precision (AP) = 0.92

This is a strong result, especially because the data is imbalanced: class 1 makes up only about 8% of this test set (155 of 1,884 samples). Precision remains high as recall increases, which suggests the model identifies positives without introducing many false positives.

The curve only drops toward the end, indicating the model maintains high precision across most recall levels.

This implies few false positives among the predicted positives, even when the model captures a large share of the true positives.
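To show what Average Precision summarizes, here is a minimal sketch that averages the precision observed at the rank of each true positive. The helper `average_precision` and the toy data are hypothetical, and the sketch assumes there are no tied scores. The baseline line uses the class counts quoted in this notebook's classification report discussion (155 positives out of 1,729 + 155 test samples).

```python
def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank k of every positive sample.

    Assumes no tied scores. Hypothetical helper for illustration only.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("Need at least one positive label.")

    # Sort samples from highest to lowest score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / n_pos


# Toy data (assumed values, not taken from the bank dataset)
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"Average precision on toy data: {toy_ap:.3f}")

# A random ranker's AP is roughly the positive-class prevalence
baseline_ap = 155 / (1729 + 155)
print(f"Approximate random-ranker baseline for this test set: {baseline_ap:.3f}")
```

Against a baseline of roughly 0.08, an AP of 0.92 is a large improvement, which is why the Precision-Recall curve is the more informative of the two plots for this imbalanced target.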

🔸 Conclusion

The Tuned Ensemble Model shows outstanding performance. Both ROC AUC (0.99) and Average Precision (0.92) are high, indicating it is effective at both ranking and identifying positive cases.

In [None]:
y_pred_log_reg_knn = voting_clf.predict(x_test_log_reg_knn)

# Compute and display evaluation metrics
print("🔶 Accuracy:", accuracy_score(y_test_log_reg_knn, y_pred_log_reg_knn))
print("🔶 Precision:", precision_score(y_test_log_reg_knn, y_pred_log_reg_knn, average='weighted'))
print("🔶 Recall:", recall_score(y_test_log_reg_knn, y_pred_log_reg_knn, average='weighted'))
print("🔶 F1 Score:", f1_score(y_test_log_reg_knn, y_pred_log_reg_knn, average='weighted'))
print("\n🔶 Classification Report:\n", classification_report(y_test_log_reg_knn, y_pred_log_reg_knn))

# Confusion Matrix
cm_log_reg_knn = confusion_matrix(y_test_log_reg_knn, y_pred_log_reg_knn)
# Plot the confusion matrix as a heatmap
fig, ax = plt.subplots(figsize=(4, 4), dpi=150)

sns.heatmap(cm_log_reg_knn, annot=True, fmt='d', cmap='Greens', cbar=False, square=True,
            annot_kws={"size": 10}, ax=ax)

ax.set_xlabel("Predicted Label", fontsize=10)
ax.set_ylabel("True Label", fontsize=10)
ax.set_title("Confusion Matrix", fontsize=12, fontweight='bold')
ax.tick_params(axis='both', labelsize=9)

plt.tight_layout()
plt.show()

🔶 Overall Metrics:

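🔶 Accuracy: 0.9761

→ About 97.6% of all predictions are correct. For reference, always predicting class 0 would already score about 91.8% (1,729 of 1,884 test samples), so accuracy alone overstates performance on the minority class.

🔶 Precision: 0.9755

→ This is the support-weighted average of the per-class precisions (`average='weighted'`), so it mostly reflects the majority class. Precision for class 1 alone is 0.94.

🔶 Recall: 0.9761

→ Weighted recall is mathematically equal to accuracy. It does not mean the model retrieves almost all actual positives; recall for class 1 alone is 0.75.

🔶 F1 Score: 0.9748

→ The support-weighted average of the per-class F1 scores is high, but it is again dominated by class 0 (F1 0.99) rather than class 1 (F1 0.84).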

🔶 Class 0 (Majority Class)

Precision: 0.98

Recall: 1.00

F1-score: 0.99

→ The model is nearly perfect in identifying the negative class, with just 7 false positives out of 1,729.

🔶 Class 1 (Minority Class)

Precision: 0.94

Recall: 0.75

F1-score: 0.84

→ Strong precision means most predicted positives are correct, but recall of 0.75 shows it misses about 25% of true positives (38 out of 155). This could still be improved depending on application sensitivity.
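The relationship between the per-class and weighted figures can be checked by hand. The sketch below rebuilds the confusion-matrix counts from the numbers quoted above (7 false positives among 1,729 class 0 samples, 38 misses among 155 class 1 samples); these reconstructed counts are an assumption, and the helper `prf` is hypothetical.

```python
# Counts reconstructed from the figures quoted in the text (assumption)
tn, fp = 1722, 7     # class 0: 1,729 samples, 7 wrongly flagged as class 1
fn, tp = 38, 117     # class 1: 155 samples, 38 missed


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p1, r1, f1_1 = prf(tp, fp, fn)   # class 1 treated as the positive class
p0, r0, f1_0 = prf(tn, fn, fp)   # class 0 treated as the positive class

n0, n1 = tn + fp, fn + tp        # class supports
total = n0 + n1

accuracy = (tp + tn) / total
weighted_precision = (n0 * p0 + n1 * p1) / total
weighted_recall = (n0 * r0 + n1 * r1) / total
weighted_f1 = (n0 * f1_0 + n1 * f1_1) / total

print(f"Class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"Accuracy={accuracy:.4f}")
print(f"Weighted precision={weighted_precision:.4f}")
print(f"Weighted recall={weighted_recall:.4f}")
print(f"Weighted F1={weighted_f1:.4f}")
```

Under these assumed counts the results agree with the reported metrics to about three decimal places, and the weighted recall comes out identical to accuracy, which is why the headline numbers say little about class 1 on their own.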

🔶 Conclusion:

The Tuned Ensemble Model demonstrates strong performance on the weighted metrics, which are driven mostly by the majority class.

Class 0 detection is nearly flawless.

Class 1 detection is precise (0.94) but has lower recall (0.75), which leaves room to improve detection of the minority class.

Overall, the model handles class imbalance reasonably well. Whether a recall of 0.75 is acceptable depends on how costly it is to miss a customer who would have accepted the loan.
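One common way to trade precision for recall on the minority class is to lower the decision threshold applied to the predicted probabilities instead of using the default of 0.5. The sketch below is illustrative only: `precision_recall_at` is a hypothetical helper and the labels and probabilities are toy values, not output from `voting_clf`.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall for class 1 when predicting 1 if proba >= threshold."""
    preds = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumed values, not taken from the bank dataset)
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_proba = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90]

for threshold in (0.5, 0.4, 0.3):
    precision, recall = precision_recall_at(toy_labels, toy_proba, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this notebook the same idea could be applied to `voting_clf.predict_proba(x_test_log_reg_knn)[:, 1]`. The threshold should be chosen on a validation split or with cross-validation rather than on the test set, otherwise the reported metrics become optimistic.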

In [None]:
# List to store the final evaluation results of the models
results = []

# Define a function to evaluate a model by calculating and storing various metrics
def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # If the model supports probability estimates, get the probability for class 1
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    # Store the evaluation metrics in the results list
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1 Score': f1_score(y_test, y_pred, average='weighted'),
        'ROC AUC': roc_auc_score(y_test, y_proba) if y_proba is not None else 'N/A'
    })

# Evaluate all four tuned models, each on its own train/test split (evaluate_model refits each model)
evaluate_model("Logistic Regression", best_log_reg_model, x_train, x_test, y_train, y_test)
evaluate_model("KNN", best_knn_model, x_train_knn, x_test_knn, y_train_knn, y_test_knn)
evaluate_model("Complement Naive Bayes", best_cnb_model, x_train_cnb, x_test_cnb, y_train_cnb, y_test_cnb)
evaluate_model("Ensemble Model(Logistic Regression + KNN)", voting_clf, x_train_log_reg_knn, x_test_log_reg_knn, y_train_log_reg_knn, y_test_log_reg_knn)

# Convert the results list to a DataFrame for tabular display
results_df = pd.DataFrame(results)

print("🔶 Final Evaluation Metrics Table:")
display(results_df.round(4))

## <b> Predict for User input

In [None]:
def predict_personal_loan(model_name='Logistic Regression'):
    """
    Gets input from user and predicts Personal Loan status using selected model.
    """
    # Get user input as a DataFrame (get_user_input is defined in the next cell)
    input_df = get_user_input()
    
    # Preprocessing and model selection based on user's chosen model
    if model_name == 'Logistic Regression':
        x_input = scaler_log_reg.transform(input_df) # Apply the logistic regression scaler
        model = best_log_reg_model                   # Use the best logistic regression model
    elif model_name == 'KNN':
        x_input = scaler_knn.transform(input_df)     # Apply the KNN scaler
        model = best_knn_model                            # Use the best KNN model
    elif model_name == 'Complement Naive Bayes':
        x_input = scaler_cnb.transform(input_df)      # Apply the Naive Bayes scaler
        model = best_cnb_model                       # Use the best ComplementNB model
    elif model_name == 'Ensemble Model(Logistic Regression + KNN)':
        x_input = scaler_log_reg.transform(input_df)
        model = voting_clf
    else:
        raise ValueError("Invalid model name. Choose from: 'Logistic Regression', 'KNN', 'Complement Naive Bayes', 'Ensemble Model(Logistic Regression + KNN)'")

    # Make prediction using the selected model
    prediction = model.predict(x_input)[0]
    # If the model supports probability output, get the probability of class 1 (taking loan)
    proba = model.predict_proba(x_input)[0][1] if hasattr(model, "predict_proba") else None

    print(f"\nModel Used: {model_name}")
    print(f"Prediction: {'Will take loan (1)' if prediction == 1 else 'Will NOT take loan (0)'}")
    if proba is not None:
        print(f"Probability of taking loan: {proba:.2f}\n\n")

    return prediction

In [None]:
def get_user_input():
    """
    Gets input from user for all required features and returns as DataFrame.
    """
    # List of all features expected by the model
    feature_names = [
        'ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg_Annual', 'Education',
        'Mortgage', 'Securities Account', 'CD Account', 'Online', 'CreditCard'
    ]

    user_data = {} # Dictionary to store user inputs
    for feature in feature_names:
        while True:
            try:
                if feature == 'CCAvg_Annual':
                    value = input("🔸 Enter monthly 'CCAvg' in thousands of dollars (fraction like 1/2 or integer only; converted to 'CCAvg_Annual'): ").strip()
                    if '/' in value: # Convert fraction like '1/2' to float
                        num, denom = value.split('/')
                        value = float(num) / float(denom)
                    elif value.isdigit():
                        value = float(value)
                    else:
                        raise ValueError("Only fractions or integers are allowed.")
                else:
                    value = float(input(f"🔸 Enter value for '{feature}': "))
                
                user_data[feature] = value
                break
            except (ValueError, ZeroDivisionError) as e:
                print(f"Invalid input: {e}") # Handle non-numeric inputs and zero denominators

    # Convert to DataFrame
    input_df = pd.DataFrame([user_data])

    # Add lat/lng based on Zip Code
    zip_code = int(user_data['ZIP Code'])
    if zip_code in zip_df['zip'].values:
        lat = zip_df.loc[zip_df['zip'] == zip_code, 'lat'].values[0]
        lng = zip_df.loc[zip_df['zip'] == zip_code, 'lng'].values[0]
    else:
        raise ValueError("Zip Code not found in reference data!")

    # Add lat/lng
    input_df['lat'] = lat
    input_df['lng'] = lng

    # Convert the monthly credit card spend entered above to the annual figure the models use
    input_df['CCAvg_Annual'] = input_df['CCAvg_Annual'] * 12

    # Drop ZIP Code and ID, which are not passed to the models
    input_df.drop(columns=['ZIP Code', 'ID'], inplace=True)

    return input_df

predict_personal_loan(model_name='Logistic Regression')
predict_personal_loan(model_name='KNN')
predict_personal_loan(model_name='Complement Naive Bayes')
predict_personal_loan(model_name='Ensemble Model(Logistic Regression + KNN)')
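The fraction handling in `get_user_input` accepts only integers and `a/b` strings. As a possible alternative, the standard-library `fractions.Fraction` class parses integers, decimals and fractions with one call. The helper name `parse_amount` is hypothetical and this is a sketch, not the notebook's own parsing logic.

```python
from fractions import Fraction


def parse_amount(text):
    """Parse strings such as '1/2', '3' or '1.5' into a float.

    Raises ValueError for malformed input or a zero denominator.
    Hypothetical helper for illustration only.
    """
    try:
        return float(Fraction(text.strip()))
    except ZeroDivisionError as exc:
        # Fraction('1/0') raises ZeroDivisionError; report it as invalid input
        raise ValueError("Denominator must not be zero.") from exc


for raw in ("1/2", "3", "1.5"):
    print(f"{raw!r} -> {parse_amount(raw)}")
```

Because every failure surfaces as a `ValueError`, a helper like this would fit the existing `try`/`except ValueError` retry loop without further changes.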