---
# **Bank Churn Prediction using Machine Learning**:
An efficient approach

<div >
  <img src="https://res.cloudinary.com/dhditogyd/image/upload/v1738308892/Bank_Churn_ML_notebook_cover_xfnax4.png" width="100%" height=" 100%;"/>
</div>

---





Author: [Muhammad Faizan](https://www.linkedin.com/in/mrfaizanyousaf/)

<div >
  <img src="https://res.cloudinary.com/dhditogyd/image/upload/v1735402856/Passport_photo_jsvsip.png" width="20%" height=" 20%;"/>
</div>

**Muhammad Faizan**

🎓 **3rd Year BS Computer Science** student at the **University of Agriculture, Faisalabad**  
💻 Enthusiast in **Machine Learning, Data Engineering, and Data Analytics**


🌐 **Connect with Me**

[Kaggle](https://www.kaggle.com/faizanyousafonly/) | [LinkedIn](https://www.linkedin.com/in/mrfaizanyousaf/) | [GitHub](https://github.com/faizan-yousaf/)  




💬 **Contact Me**
- **Email:** faizanyousaf815@gmail.com
- **WhatsApp:** [+92 306 537 5389](https://wa.me/923065375389)


🔗 **Let’s Collaborate:**  
I'm always open to queries, collaborations, and discussions. Let's build something amazing together!




## Meta-Data (About Dataset)

## Context: 
This dataset is designed to predict **customer churn** in the banking industry. It contains essential information about customers who either left the bank or continued to stay. Below is a breakdown of the dataset's attributes:

### Content:
🔍 Access the dataset:
[Dataset_link](https://www.kaggle.com/competitions/playground-series-s4e1/data)

#### Column Descriptions:
| Attribute | Description |
| --- | --- |
| **👤 Customer ID** | A unique identifier for each customer. |
| **🔠 Surname** | The customer's last name. |
| **📈 Credit Score** | A numerical value representing the customer's credit score. |
| **🌍 Geography** | The country where the customer resides (France, Spain, or Germany). |
| **🚻 Gender** | The customer's gender (Male or Female). |
| **🎂 Age** | The customer's age. |
| **📆 Tenure** | The number of years the customer has been with the bank. |
| **💰 Balance** | The customer's account balance. |
| **📦 NumOfProducts** | The number of bank products the customer uses (e.g., savings account, credit card). |
| **💳 HasCrCard** | Whether the customer has a credit card (1 = yes, 0 = no). |
| **✅ IsActiveMember** | Whether the customer is an active member (1 = yes, 0 = no). |
| **💵 EstimatedSalary** | The estimated salary of the customer. |
| **🚪 Exited** | Whether the customer has churned (1 = yes, 0 = no). |




### Acknowledgements
### Creators:

* Authors: Walter Reade and Ashley Chow



### Citation Request:

Walter Reade and Ashley Chow. Binary Classification with a Bank Churn Dataset. [Cited](https://kaggle.com/competitions/playground-series-s4e1), 2024. Kaggle.




## Aims and Objectives:

We will fill this after doing the EDA and Data Preprocessing.


# 🗺️ **Project Roadmap: What’s Happening in This Notebook?**  

Welcome to my Machine Learning project notebook! Here’s a quick overview of what we’ll be covering in this notebook. I’ve already completed some steps, and now we’ll focus on the remaining tasks to prepare for Kaggle submission and GitHub sharing. Let’s dive in!  


## ✅ **1. Setting Up the Environment**  
- **What’s done:** All necessary libraries (`pandas`, `numpy`, `scikit-learn`, etc.) are imported, and the dataset is loaded.  
- **What’s next:** We’ll ensure everything is ready for the next steps.  


## 📉 **2. Exploratory Data Analysis (EDA)**  
- **What’s done:** I’ve already performed a detailed EDA in a separate notebook. You can check it out here: [My EDA Notebook](https://www.kaggle.com/code/faizanyousafonly/customer-churn-secrets-an-eda-journey).  
- **What’s next:** We’ll briefly recap key insights from the EDA to set the stage for preprocessing.  


## 🛠️ **3. Data Preprocessing**  
- **What’s done:** Identified missing values, outliers, and feature relationships during EDA.  
- **What’s next:** We’ll handle missing values, encode categorical variables, and split the data into training and testing sets.  


## 🤖 **4. Model Building**  
- **What’s done:** Explored different algorithms during EDA to identify the best candidate.  
- **What’s next:** We’ll train the chosen model, tune hyperparameters, and evaluate its performance.  


## 📊 **5. Model Evaluation**  
- **What’s done:** Defined evaluation metrics for this binary classification problem.  
- **What’s next:** We’ll generate predictions, create visualizations, and analyze the model’s performance.  


## 🚀 **6. Preparing for Kaggle Submission**  
- **What’s done:** Understood Kaggle’s submission format and requirements.  
- **What’s next:** We’ll save the model, format predictions, and create a submission CSV file for Kaggle.  


## 📂 **7. Uploading to GitHub**  
- **What’s done:** Organized project files and prepared documentation.  
- **What’s next:** We’ll upload the notebook, dataset, and submission files to GitHub and write a clear `README.md`.  


## 🌟 **8. Sharing the Project**  
- **What’s done:** Prepared links for sharing on Kaggle and GitHub.  
- **What’s next:** We’ll share the project with the community and celebrate our hard work! 🎉  

---

This notebook is the final step in my Machine Learning project journey. Let’s get started! 🚀  

---

## ✅ **1. Setting Up the Environment** 

Let's start the project by importing all the libraries that we will use in this project.

In [2]:
# import libraries:

# 1. to handle the data:
import numpy as np
import pandas as pd

# 2. to visualize the data:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# 3. to preprocess the data:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# 4. to build the model:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV

# 5. for classification task:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

# 6. Metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, r2_score, f1_score , classification_report, root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

# 7. to ignore the warnings:
import warnings
warnings.filterwarnings("ignore")

print("Libraries have been loaded successfully")

# # 8. Display all rows and columns: (uncomment if you want the whole output in the cells, I don't prefer it while uploading my notebook on Kaggle or Github)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

Libraries have been loaded successfully


In [3]:
# Setting a style for the plots
sns.set_theme(style="whitegrid")

# 1. 📊 Loading and Peeking at the Data

In [11]:
# 🚀 Step 1: Loading the Data
print("Loading the dataset... 🕵️‍♂️")
df_train = pd.read_csv("../cleaned/train_preprocessed.csv")
df_test = pd.read_csv("../cleaned/test_preprocessed.csv")
submission = pd.read_csv("../dataset/sample_submission.csv")
print("Dataset loaded successfully!")

Loading the dataset... 🕵️‍♂️
Dataset loaded successfully!


> Let's have a look at each dataset: 👀


In [12]:
print("The preprocessed training dataset:")
display(df_train.head())

print("The preprocessed testing dataset:")
display(df_test.head())

print("The submission dataset:")
display(submission.head())


The preprocessed training dataset:


Unnamed: 0,id,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,AgeCategory,CreditScoreCategory,BalanceCategory,SalaryCategory,Geography_Germany,Geography_Spain,Gender_Male
0,0,15674932,Okwudilichukwu,668,33.0,3,0.0,2,1.0,0.0,181449.97,0,2,1.0,3.0,5.0,0.0,0.0,1.0
1,1,15749177,Okwudiliolisa,627,33.0,1,0.0,2,1.0,1.0,49503.5,0,2,1.0,3.0,1.0,0.0,0.0,1.0
2,2,15694510,Hsueh,678,40.0,10,0.0,2,1.0,0.0,184866.69,0,3,2.0,3.0,5.0,0.0,0.0,1.0
3,3,15741417,Kao,581,34.0,2,148882.54,1,1.0,1.0,84560.88,0,2,1.0,1.0,2.0,0.0,0.0,1.0
4,4,15766172,Chiemenam,716,33.0,5,0.0,2,1.0,1.0,15068.83,0,2,2.0,3.0,6.0,0.0,1.0,1.0


The preprocessed testing dataset:


Unnamed: 0,id,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,AgeCategory,CreditScoreCategory,BalanceCategory,SalaryCategory,Geography_Germany,Geography_Spain,Gender_Male
0,165034,15773898,Lucchese,586,23.0,2,0.0,2,0.0,1.0,160976.75,2,1.0,3.0,0.0,0.0,0.0,0.0
1,165035,15782418,Nott,683,46.0,2,0.0,1,1.0,0.0,72549.27,3,2.0,3.0,2.0,0.0,0.0,0.0
2,165036,15807120,K?,656,34.0,7,0.0,2,1.0,0.0,138882.09,2,1.0,3.0,4.0,0.0,0.0,0.0
3,165037,15808905,O'Donnell,681,36.0,8,0.0,1,1.0,0.0,113931.57,3,2.0,3.0,3.0,0.0,0.0,1.0
4,165038,15607314,Higgins,752,38.0,10,121263.62,1,1.0,0.0,139431.0,3,3.0,1.0,4.0,1.0,0.0,1.0


The submission dataset:


Unnamed: 0,id,Exited
0,165034,0.5
1,165035,0.5
2,165036,0.5
3,165037,0.5
4,165038,0.5


In [13]:
# check the shape of each:

print(f"The shape of the training dataset: {df_train.shape}")

print(f"The shape of the testing dataset: {df_test.shape}")

print(f"The shape of the submission dataset: {submission.shape}")

The shape of the training dataset: (165034, 19)
The shape of the testing dataset: (110023, 18)
The shape of the submission dataset: (110023, 2)


In [15]:
# check the columns of each and print them in a presentable form:

print("The columns of the training dataset:")

for i, col in enumerate(df_train.columns):
    print(f"{i+1}. {col}")

print("\n\nThe columns of the testing dataset:")
for i, col in enumerate(df_test.columns):
    print(f"{i+1}. {col}")

print("\n\nThe columns of the submission dataset:")
for i, col in enumerate(submission.columns):
    print(f"{i+1}. {col}")


The columns of the training dataset:
1. id
2. CustomerId
3. Surname
4. CreditScore
5. Age
6. Tenure
7. Balance
8. NumOfProducts
9. HasCrCard
10. IsActiveMember
11. EstimatedSalary
12. Exited
13. AgeCategory
14. CreditScoreCategory
15. BalanceCategory
16. SalaryCategory
17. Geography_Germany
18. Geography_Spain
19. Gender_Male


The columns of the testing dataset:
1. id
2. CustomerId
3. Surname
4. CreditScore
5. Age
6. Tenure
7. Balance
8. NumOfProducts
9. HasCrCard
10. IsActiveMember
11. EstimatedSalary
12. AgeCategory
13. CreditScoreCategory
14. BalanceCategory
15. SalaryCategory
16. Geography_Germany
17. Geography_Spain
18. Gender_Male


The columns of the submission dataset:
1. id
2. Exited


## 📉 **2. Exploratory Data Analysis (EDA)**

I’ve already performed a detailed EDA in a separate notebook. You can check it out here: [My EDA Notebook](https://www.kaggle.com/code/faizanyousafonly/customer-churn-secrets-an-eda-journey).

We’ll briefly recap key insights from the EDA to set the stage for preprocessing.

## **Observations RECAP**: 🔎

>Data Overview 🗂️

1. **Data Types** 🧬:
    * The dataset includes a mix of numerical (`int64`, `float64`) and categorical (`object`) data types. 
2.  Key columns:
    * `CustomerId` (int64)
    * `Surname` (object)
    * `CreditScore` (int64)
    * `Geography`, `Gender` (object)
    * `Age`, `Tenure`, `Balance`, `NumOfProducts`, `EstimatedSalary` (float64)
    * `HasCrCard`, `IsActiveMember`, `Exited` (int64)

3. **Missing Values** 🚨:
    * No missing values detected in the dataset. This ensures the dataset is complete and ready for analysis without any need for imputation or handling missing data.

4.  **Duplicate Values** 🧩:
    * The dataset does not contain any duplicate records. Each entry represents a unique customer, ensuring the integrity of the analysis.


5. There are `165034` observations in the dataset.

6. **Age Column:** 🎂
    * The average age is about 38 years, so the customer base skews toward young and middle-aged adults.
    * The minimum age is `18 years`. (a teenager)
    * The maximum age is `92 years`. (an old man)

7. **CustomerID:** 🆔
    * The IDs are above `15.5 million`, which suggests that the source system held more than fifteen and a half million customer records (quite astonishing, right? 👀)  
    * The minimum ID in our dataset is `15565701`
    * The maximum ID in our dataset is `15815690`

8. **CreditScore:** 💹
    * The minimum credit score among the customers is: `350`
    * The maximum credit score among the customers is: `850`

9. **Tenure:** 📅
    * The shortest time a customer has been with the bank: `0 years`
    * The longest time a customer has been with the bank: `10 years`

10. **Balance:** 💰
    * The minimum balance of a customer: `0.0` (That's probably ME... 😂)
    * The maximum balance of a customer: `250898.09`
    * The median is also `0.0`, which is curious: why do so many customers have a `Zero balance`? 🤔
      * Either they don't use their bank account anymore
      * Or they might not trust this bank (that's a possibility)

11. **NumOfProducts:** 📦
    * The minimum number of bank products any customer has: `1`
    * The maximum number of bank products any customer has: `4`

12. **HasCrCard:** 💳
    * There are `124428` customers who have a credit card (most of the customers).
    * **`75.4%`** of customers have a credit card.
    * There are `40606` customers who don't have a credit card. 
    * **`24.6%`** of customers don't have a credit card.  

13. **IsActiveMember:** ✌🏻
    *  Inactive users: `82885` (**`50.2%`**)
    *  Active users: `82149` (**`49.8%`**)
       * >NOTE: About half of all customers are inactive, meaning they rarely or never use the bank. That is a likely reason why so many accounts have a `Zero Balance`.

14. **EstimatedSalary:** 💵
    * The minimum salary is: `11.58`
    * The maximum salary is: `199992.48`

15. **Exited** 🚪
    * There are `34921` customers who churned, i.e. left the bank (`21.2%`)
    * There are `130113` customers who stayed with the bank (`78.8%`)
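
The percentages in the recap can be re-derived from the quoted counts. The snippet below is a small arithmetic check, not code from the EDA notebook; the dictionary keys are hypothetical names chosen for illustration.

```python
# Counts quoted in the recap (total observations and three of the groups)
total = 165034
counts = {
    "has_credit_card": 124428,
    "inactive_member": 82885,
    "churned": 34921,
}

# Share of all customers, as a percentage rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(shares)
```

The churn share of roughly one in five is worth keeping in mind later: with classes this uneven, accuracy alone can look good even for a model that rarely predicts churn.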
  
>Groupby Analysis 🔍

1.  **Geography vs. Churn** 🌍🚪  
  - Customers from **Germany** have the highest churn rate.  
  - Customers from **France** show the lowest churn rate.

2.  **Gender vs. Churn** 👥🚪  
  - **Female** customers tend to churn more often than **male** customers.

3.  **Age vs. Churn** 🎂🚪  
  - Customers aged **50 and above** are significantly more likely to leave.

4.  **Tenure vs. Churn** 📅🚪  
  - Customers with a tenure of **1-2 years** are more prone to churn.  
  - Those with **9-10 years** tenure are less likely to exit.

5.  **Balance vs. Churn** 💰🚪  
  - Customers with a **zero balance** have a higher likelihood of exiting.

6.  **Number of Products vs. Churn** 🛒🚪  
  - Customers holding **only one product** are more likely to churn.  
  - Those with **multiple products** show better retention.

7.  **IsActiveMember vs. Churn** 🟢🚪  
  - **Inactive members** have a substantially higher churn rate compared to active ones.  
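
Comparisons like these usually come from a group-wise mean of the target. Here is a minimal sketch on made-up rows (the `toy` DataFrame is hypothetical and far too small to say anything about the real data); it assumes the raw `Geography` column, as it exists before one-hot encoding.

```python
import pandas as pd

# Toy data (hypothetical rows, only to illustrate the calculation)
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "IsActiveMember": [1, 0, 0, 0, 1, 1],
    "Exited": [0, 0, 1, 1, 0, 1],
})

# Because Exited is coded 0/1, its mean within a group is that group's churn rate
churn_by_geo = toy.groupby("Geography")["Exited"].mean()
print(churn_by_geo)
```

Swapping `"Geography"` for any other column (for example `"IsActiveMember"`) gives the corresponding comparison.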

> Feature Engineering 🔧

- **Categorical Encoding** 🗂️:
  - Converted `Geography` and `Gender` into numerical features using one-hot encoding. This process ensures that categorical variables are appropriately represented in the model, with no redundant categories due to the `drop_first=True` option.

- **AgeGroup Feature** 🎂:
  - Created the `AgeGroup` feature (it appears as `AgeCategory` in the preprocessed files loaded above) by binning `Age` with the bin edges [0, 12, 19, 35, 50, 100]. This categorization can help in analyzing trends and patterns across different age groups.

- **BalanceCategory Feature** 💵:
  - Created the `BalanceCategory` feature by binning `Balance` into categories ['No Balance', '0-100K', '100K-200K', '200K-300K', '300K-400K', '400K-500K', '500K-600K', '600K-700K', '700K-800K', '800K-900K', '900K-1M'].
  - This categorization highlights how different balance levels affect customer churn.

- **Feature Scaling** 📈:
  - I have `not scaled` the features: I had already created new binned features from the original columns and then encoded them, so they don't need to be scaled!
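
As a rough illustration of the encoding and binning steps, here is a minimal sketch on a few made-up rows. The `toy` DataFrame is hypothetical, and the exact calls used in the EDA notebook are an assumption. `labels=False` is used here because it yields integer codes, which is consistent with the `AgeCategory` values in the sample rows shown earlier (for example, age 33 maps to 2 and age 40 maps to 3).

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from the competition data)
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [33, 40, 18, 67],
})

# One-hot encode; drop_first=True drops the first level of each column
# (France and Female here), which avoids a redundant dummy column
encoded = pd.get_dummies(
    toy, columns=["Geography", "Gender"], drop_first=True, dtype=float
)

# Bin Age with the edges listed above; intervals are right-inclusive by default,
# and labels=False returns the integer position of each bin
age_bins = [0, 12, 19, 35, 50, 100]
encoded["AgeCategory"] = pd.cut(toy["Age"], bins=age_bins, labels=False)

print(encoded.columns.tolist())
print(encoded["AgeCategory"].tolist())
```

The resulting dummy columns have the same names as those in the preprocessed files (`Geography_Germany`, `Geography_Spain`, `Gender_Male`).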


### Final Observations 🔍  

- ✅ **Data Integrity:** No missing or duplicate values, ensuring a solid foundation for analysis.  
- 📊 **Feature Variability:** High standard deviations in `Balance` and `EstimatedSalary` suggest further investigation is needed.  
- 🌍 **Churn Influences:** Factors like geography, gender, age, and engagement level (active vs inactive) impact customer churn.  
- 🚀 **Feature Engineering Boost:** The newly introduced binned and encoded features are intended to improve model performance and predictive accuracy.  
- 🔄 **Encoding Benefits:** The transformed features give the models consistent numerical inputs, which should make the results easier to compare and interpret.  


## 🛠️ **3. Data Preprocessing**  
- **What’s done:** Identified missing values, outliers, and feature relationships during EDA.  
- **What’s next:** We’ll handle missing values, encode categorical variables, and split the data into training and testing sets.  


> The preprocessing was already done in the previous notebook, but let's check for missing values once more to confirm!

In [16]:
# find the missing values in each dataset:

print("The missing values in the training dataset:")
display(df_train.isnull().sum())

print("---------------------------------------------------------------")

print("The missing values in the testing dataset:")
display(df_test.isnull().sum())



The missing values in the training dataset:


id                     0
CustomerId             0
Surname                0
CreditScore            0
Age                    0
Tenure                 0
Balance                0
NumOfProducts          0
HasCrCard              0
IsActiveMember         0
EstimatedSalary        0
Exited                 0
AgeCategory            0
CreditScoreCategory    0
BalanceCategory        0
SalaryCategory         0
Geography_Germany      0
Geography_Spain        0
Gender_Male            0
dtype: int64

---------------------------------------------------------------
The missing values in the testing dataset:


id                     0
CustomerId             0
Surname                0
CreditScore            0
Age                    0
Tenure                 0
Balance                0
NumOfProducts          0
HasCrCard              0
IsActiveMember         0
EstimatedSalary        0
AgeCategory            0
CreditScoreCategory    0
BalanceCategory        0
SalaryCategory         0
Geography_Germany      0
Geography_Spain        0
Gender_Male            0
dtype: int64

> No missing values in either dataset... `Confirmed`!

## 🤖 **4. Model Building**  
- **What’s done:** Explored different algorithms during EDA to identify the best candidate.  
- **What’s next:** We’ll train the chosen model, tune hyperparameters, and evaluate its performance.  




In [17]:
df_train.columns    

Index(['id', 'CustomerId', 'Surname', 'CreditScore', 'Age', 'Tenure',
       'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
       'EstimatedSalary', 'Exited', 'AgeCategory', 'CreditScoreCategory',
       'BalanceCategory', 'SalaryCategory', 'Geography_Germany',
       'Geography_Spain', 'Gender_Male'],
      dtype='object')

In [None]:
# Remove the identifier columns and the raw columns that we have already
# transformed into binned/encoded features.
# Note: 'Geography' is not in the list because it was already one-hot encoded
# into 'Geography_Germany' and 'Geography_Spain'; dropping it would raise a KeyError.
# The target column 'Exited' stays in the DataFrame for now and is separated in the next cell.

cols_to_drop = ['id', 'CustomerId', 'Surname', 'CreditScore', 'Age', 'Balance', 'EstimatedSalary']

# drop the columns from the training dataset:

df_train.drop(columns=cols_to_drop, inplace=True)


In [None]:
# split the data into training and testing sets

# features: every remaining column except the target
X = df_train.drop(columns=['Exited'])
# target: 1 = churned, 0 = stayed
y = df_train['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"The shape of the training set: {X_train.shape} and the shape of the testing set: {X_test.shape}")
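
Since only about one customer in five churned, one option worth considering is a stratified split, which keeps the churn share the same in both parts. This is a variation on the split in this notebook, not what the notebook itself does; the sketch below uses synthetic data (`toy_X` and `toy_y` are hypothetical) with a 21% positive rate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 210 churners (21%), one random feature
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"feature": rng.normal(size=1000)})
toy_y = pd.Series([1] * 210 + [0] * 790, name="Exited")

# stratify=toy_y asks scikit-learn to preserve the class proportions in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(f"Churn share in train: {y_tr.mean():.3f}, in validation: {y_val.mean():.3f}")
```

Without `stratify`, the two shares can drift apart by chance, which matters more on small datasets than on one with 165034 rows.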
