# Customer Churn Prediction in the Banking Industry

## 1. Introduction<a id='introduction'></a>
Customer churn, the act of customers leaving a business, is a critical challenge faced by many industries, including the banking sector. Understanding and predicting customer churn is crucial for banks to identify potential churners and take proactive measures to retain valuable customers. In this project, we aim to develop a predictive model to identify customers who are likely to churn using the Bank Customer Churn dataset.

The dataset describes bank customers: their demographics, account details, and behavior patterns. Using this dataset, we will analyze and preprocess the data, perform exploratory data analysis, build predictive models, and evaluate their performance.

The objective of this project is to create a robust customer churn prediction model that can assist banks in identifying customers at risk of churn. By accurately identifying potential churners, banks can implement targeted retention strategies, improve customer satisfaction, and minimize revenue loss.

By the end of this project, we aim to have a clear understanding of the customer churn prediction problem in the banking industry and a well-performing predictive model that helps banks make informed decisions to reduce churn.

## 2. Dataset overview<a id='dataset-overview'></a>

The Bank Customer Churn dataset provides information about bank customers and their churn behavior. It contains the following columns:

1. `RowNumber`: The row number in the dataset.
2. `CustomerId`: Unique identifier for each customer.
3. `Surname`: Customer's surname.
4. `CreditScore`: Credit score of the customer.
5. `Geography`: Customer's country/region.
6. `Gender`: Gender of the customer (Male or Female).
7. `Age`: Age of the customer.
8. `Tenure`: Number of years the customer has been with the bank.
9. `Balance`: Account balance of the customer.
10. `NumOfProducts`: Number of bank products the customer has.
11. `HasCrCard`: Whether the customer has a credit card or not (1: Yes, 0: No).
12. `IsActiveMember`: Whether the customer is an active member or not (1: Yes, 0: No). An active member uses other bank offerings such as programs, bonds, or insurance.
13. `EstimatedSalary`: Salary of the customer, as estimated by the bank.
14. `Exited`: Whether the customer has exited (churned) or not (1: Yes, 0: No).

The dataset contains various features that capture different aspects of a customer's relationship with the bank, such as their credit score, demographics, tenure, account balance, and product holdings. The target variable, `Exited`, indicates whether a customer has churned or not.

In this project, we will leverage this dataset to build a predictive model that can accurately predict customer churn based on the available features. We will perform data preprocessing, exploratory data analysis, and model building to accomplish this objective.

## 3. Project Roadmap<a id='project-roadmap'></a>

To keep the project systematic, we will follow this roadmap:

1. **Project Setup and Dataset Exploration**: In this initial stage, we will set up our project environment, import the necessary libraries, load the dataset, and explore its structure and contents.

2. **Data Preprocessing**: This stage involves preparing the dataset for analysis by handling missing values if there are any, removing irrelevant features, converting categorical variables into numerical representations, and performing feature scaling.

3. **Exploratory Data Analysis (EDA)**: Here, we will conduct a thorough analysis of the dataset to gain insights into the relationships between the features and the target variable. We will visualize the data, analyze distributions, and explore correlations among the variables.

4. **Feature Engineering**: In this stage, we will enhance the dataset by creating new features or transforming existing ones based on domain knowledge and insights gained from the EDA.

5. **Model Building**: Using the preprocessed and engineered dataset, we will build machine learning models to predict customer churn. We will try different algorithms and techniques, train the models, and evaluate their performance using appropriate evaluation metrics.

6. **Model Evaluation**: Here, we will assess the performance of the trained models and compare their results. We will select the best-performing model to be used for customer churn prediction.

7. **Conclusion and Future Work**: In the final stage, we will summarize our findings, reflect on the project's outcome, and suggest potential improvements or further steps for future work.
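
Step 5 calls for appropriate evaluation metrics. The summary statistics later in this notebook report a mean of 0.2037 for `Exited`, so roughly one customer in five churned. The sketch below, which is only an illustration and not part of the original analysis, rebuilds labels with that proportion and scores a hypothetical baseline that always predicts "no churn". It suggests why accuracy alone can mislead on this dataset and why a metric such as recall is worth tracking alongside it.

```python
import numpy as np

# Labels rebuilt from the summary statistics: 10,000 customers,
# of which 2,037 (20.37%) have Exited == 1.
y_true = np.array([1] * 2037 + [0] * 7963)

# Hypothetical baseline: predict "no churn" (0) for every customer.
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = (y_true == y_pred).mean()

# Recall: share of actual churners that the model flags.
true_positives = int(((y_true == 1) & (y_pred == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"Accuracy: {accuracy:.4f}")  # Accuracy: 0.7963
print(f"Recall:   {recall:.4f}")    # Recall:   0.0000
```

The baseline reaches about 80% accuracy while identifying none of the churners, which is the failure mode the final model has to avoid.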

## 4. Contents<a id='contents'></a>

1. [Introduction](#introduction)
2. [Dataset overview](#dataset-overview)
3. [Project Roadmap](#project-roadmap)


## 5. Project setup and Data exploration<a id='project-setup-and-data-exploartion'></a>

### A. Libraries

In [1]:
# Basic 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [2]:
# Load data into a pandas DataFrame
# Collect every file found under /kaggle/input/<folder>/
filenames = []
for folder in os.listdir('/kaggle/input'):
    for file in os.listdir(f'/kaggle/input/{folder}'):
        filenames.append(f'/kaggle/input/{folder}/{file}')

# Load the data only when exactly one file is present and it is a CSV
if len(filenames) == 1:
    if filenames[0].endswith('.csv'):
        data = pd.read_csv(filenames[0])
        print("File loaded from '{}'".format(os.path.basename(filenames[0])))
    else:
        print("No CSV file found")
elif not filenames:
    print("No files found")
else:
    print("Multiple files found")

File loaded from 'Churn Modeling.csv'


In [3]:
data.head()

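| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |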


### B. Summary Stats

In [4]:
print("Number of observations: {}".format(data.shape[0]))
print("Number of features: {}".format(data.shape[1]))

print("Memory Usage: {}+ MB".format(round((data.memory_usage()/1024/1024).sum(), 1)))

Number of observations: 10000
Number of features: 14
Memory Usage: 1.1+ MB


In [5]:
column_types=pd.DataFrame(data.dtypes, columns=['data_type']).T
column_types

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| data_type | int64 | int64 | object | int64 | object | object | int64 | int64 | float64 | int64 | int64 | int64 | float64 | int64 |


In [6]:
# Summary of numerical data ('number' selects every integer and float dtype, regardless of platform)
data.select_dtypes(include='number').describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [7]:
# Summary of Object data
data.select_dtypes(include = 'object').describe()

| | Surname | Geography | Gender |
|---|---|---|---|
| count | 10000 | 10000 | 10000 |
| unique | 2932 | 3 | 2 |
| top | Smith | France | Male |
| freq | 32 | 5014 | 5457 |


In [8]:
# Number of unique values in Data
data.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64

In [9]:
# Percentage of null values in each feature
print(data.isnull().mean()*100)
if (data.isnull().sum().sum())==0:
    print(" \n There are no missing data")

RowNumber          0.0
CustomerId         0.0
Surname            0.0
CreditScore        0.0
Geography          0.0
Gender             0.0
Age                0.0
Tenure             0.0
Balance            0.0
NumOfProducts      0.0
HasCrCard          0.0
IsActiveMember     0.0
EstimatedSalary    0.0
Exited             0.0
dtype: float64
 
 There are no missing data


## 6. Data Preprocessing<a id='data-preprocessing'></a>

We can see from the summary of the dataset that the data is already in a suitable format for analysis. Further processing of the data can be done after EDA.

The steps we need to consider are:

**1. Handling Missing Values**
- We have already seen that the dataset has no missing data.

**2. Removing Irrelevant Features**
- Certain features such as `RowNumber` and `CustomerId` are just for identification and do not provide valuable information for churn prediction.
- `Surname` is another feature that might not matter for prediction, but it requires further exploration: among the 10,000 customer records there are only 2,932 unique surnames. We can check whether a customer having family members at the same bank contributes to the prediction.



In [10]:
data.drop(['RowNumber', 'CustomerId'], axis=1, inplace=True)
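
The surname question raised above could be explored along the following lines. This is a minimal sketch on made-up toy rows, not the notebook's actual analysis. The column names `SurnameCount` and `SurnameGroup` are hypothetical. A shared surname is only a rough proxy for family ties (the 32 customers named Smith are unlikely to all be related), so any pattern found this way should be read with caution.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook, use `data` instead.
toy = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Hill", "Onio", "Smith", "Hill", "Boni"],
    "Exited":  [0, 1, 0, 1, 0, 0, 0],
})

# Number of customers who share each row's surname (hypothetical feature).
toy["SurnameCount"] = toy.groupby("Surname")["Surname"].transform("count")

# Label each customer by whether the surname appears more than once.
toy["SurnameGroup"] = (toy["SurnameCount"] > 1).map(
    {True: "shared", False: "unique"}
)

# Churn rate within each group.
churn_by_group = toy.groupby("SurnameGroup")["Exited"].mean()
print(churn_by_group)
```

If the two churn rates are close on the real data, `Surname` is probably safe to drop along with the identifier columns.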

**3. Converting Categorical Variables**
- Categorical variables in the dataset, such as `Geography` and `Gender`, will have to be converted to numerical representations using one-hot encoding, which creates binary columns for each category.
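
As a rough illustration of one-hot encoding, the sketch below applies `pd.get_dummies` to a few made-up rows that use the same column names as the dataset. It is one possible approach and not necessarily the encoding used later in the project.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook, use `data` instead.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 42, 39],
})

# Replace each categorical column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Geography_France', 'Geography_Germany', 'Geography_Spain',
#  'Gender_Female', 'Gender_Male']
```

For linear models, passing `drop_first=True` drops one column per category and avoids perfectly collinear features. Tree-based models usually do not need it.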

**4. Feature Scaling**
- Through EDA we can identify which numerical features require scaling, such as `CreditScore`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, and `EstimatedSalary`.
- We can then apply feature scaling techniques, such as standardization or normalization, to ensure that all features are on a similar scale. This step helps prevent certain features from dominating the model due to their larger magnitudes.
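
The two scaling techniques named above can be sketched with plain pandas. The values are `CreditScore` and `Balance` from the five rows shown by `data.head()`. The variable names are illustrative, and this is not necessarily how scaling is done later in the project.

```python
import pandas as pd

# CreditScore and Balance for the first five customers (from data.head()).
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82],
})

# Standardization: each column gets zero mean and unit variance.
# ddof=0 uses the population standard deviation.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = (toy - toy.min()) / (toy.max() - toy.min())

print(standardized.round(3))
print(normalized.round(3))
```

When a model is trained, the means, standard deviations, minima, and maxima should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.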

## 7. Exploratory Data Analysis<a id='exploratory-data-analysis'></a>

In [11]:
data[data.Surname=='Smith']

| | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | Smith | 585 | France | Female | 67 | 5 | 113978.97 | 2 | 0 | 1 | 93146.11 | 0 |
| 479 | Smith | 658 | France | Male | 29 | 4 | 80262.6 | 1 | 1 | 1 | 20612.82 | 0 |
| 745 | Smith | 606 | France | Male | 40 | 5 | 0.0 | 2 | 1 | 1 | 70899.27 | 0 |
| 1064 | Smith | 723 | France | Female | 20 | 4 | 0.0 | 2 | 1 | 1 | 140385.33 | 0 |
| 1756 | Smith | 618 | France | Male | 37 | 2 | 168178.21 | 2 | 0 | 1 | 101273.23 | 0 |
| 2133 | Smith | 688 | France | Female | 32 | 6 | 123157.95 | 1 | 1 | 0 | 172531.23 | 0 |
| 2317 | Smith | 630 | France | Female | 36 | 2 | 110414.48 | 1 | 1 | 1 | 48984.95 | 0 |
| 2541 | Smith | 611 | France | Female | 61 | 3 | 131583.59 | 4 | 0 | 1 | 66238.23 | 1 |
| 3835 | Smith | 718 | Germany | Female | 39 | 7 | 93148.74 | 2 | 1 | 1 | 190746.38 | 0 |
| 4476 | Smith | 778 | France | Male | 33 | 1 | 0.0 | 2 | 1 | 0 | 85439.73 | 0 |
