<center><font size=6> Bank Churn Prediction </font></center>

| Project Submission| Date |
| --- | --- |
| Rob Barker | September 13, 2024 |
| Filename | RobBarker_NN_BankChurn_FC.html/ipynb | 
| Course | Neural Networks |

### Development Environment
* Local development with Visual Studio Code.
* Jupyter Notebook and Python 3.11.7 with Anaconda3. 
* Google Colab/Drive not used.
* Generated HTML using the Jupyter CLI:

   ```jupyter nbconvert --to html PYF_Project_LearnerNotebook_FullCode.ipynb```
* Added --- (markdown) lines for easier readability for myself. 

### Formatting Notes
* Moved helper functions into separate section.
* Added line separators for readability.
---
---

# Problem Statement

### Context

Service businesses such as banks have to worry about customer churn, i.e. customers leaving to join another service provider. It is important to understand which aspects of the service influence a customer's decision in this regard, so that management can concentrate its service-improvement efforts on those priorities.

### Objective

As a data scientist with the bank, you need to build a neural-network-based classifier that can determine whether a customer will leave the bank in the next 6 months.

### Data Dictionary

* CustomerId: Unique ID which is assigned to each customer

* Surname: Last name of the customer

* CreditScore: Credit score summarizing the customer's credit history

* Geography: A customer’s location

* Gender: Gender of the customer

* Age: Age of the customer

* Tenure: Number of years for which the customer has been with the bank

* NumOfProducts: Number of products that the customer has purchased through the bank

* Balance: Account balance

* HasCrCard: Categorical variable indicating whether the customer has a credit card

* EstimatedSalary: Estimated salary

* IsActiveMember: Categorical variable indicating whether the customer is an active member of the bank (that is, uses bank products regularly, makes transactions, etc.)

* Exited: Whether the customer left the bank within six months. It can take two values:
    * 0 = No (customer did not leave the bank)
    * 1 = Yes (customer left the bank)

---
---
# Setup Environment

## Importing necessary libraries

In [2]:
# Installing the libraries with the specified version.
%pip install tensorflow==2.15.0 scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==2.0.3 imbalanced-learn==0.10.1 -q --user
%pip install tabulate==0.9.3 -q --user

Note: you may need to restart the kernel to use updated packages.


In [24]:
# Libraries to help with reading and manipulating data.
import numpy as np
import pandas as pd

# TensorFlow is an open-source platform developed by Google for 
# building and training machine learning and deep learning models.
import tensorflow as tf

# Nicely display all the return columns of the dataframe. 
from tabulate import tabulate

# To suppress warnings.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

## Loading the dataset

In [7]:
# Load dataset.
bankchurn_df_org = pd.read_csv("/Users/barkz/Desktop/GL Projects/Bank-Churn-Prediction/bank-1.csv")
bankchurn_df = bankchurn_df_org.copy()

---
---
# Data Overview

This section will include:
* Data Analysis & Observations
    * Top 5 rows
    * Bottom 5 rows
    * Shape
    * Datatypes
    * Duplicates
    * Missing values
    * Statistical summary
    * Categorical column summary

In [11]:
# Check the first few rows of the original dataset.
bankchurn_df_org.head()

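| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |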


In [12]:
# Retrieve first few rows of the copied dataset.
bankchurn_df.head()

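| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |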


In [13]:
# Retrieve last few rows of the copied dataset.
bankchurn_df.tail()

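| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.0 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.0 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
| 9999 | 10000 | 15628319 | Walker | 792 | France | Female | 28 | 4 | 130142.79 | 1 | 1 | 0 | 38190.78 | 0 |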


In [14]:
# Retrieve number of rows and columns using the shape attribute of the DataFrame.
rows, columns = bankchurn_df.shape

# Print the number of rows and columns from the dataset. Output is formatted with thousands separators.
# The column count is small, but the (,) format is applied to both values for consistency.
print(f'Number of Rows: {rows:,}')
print(f'Number of Columns: {columns:,}')

Number of Rows: 10,000
Number of Columns: 14


**Observations**
* There are 10,000 rows and 14 columns in the dataset.

In [15]:
# Get dataset information.
bankchurn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


**Observations**
* The following datatypes are observed:
  * 2 float64 columns
  * 9 int64 columns
  * 3 object columns

In [37]:
# Count missing values per column. isnull() is an alias of isna(), so a single
# check covers both None and NaN.
missing_values = bankchurn_df.isna().sum()

# Output if there are any missing data points in the dataset.
if missing_values.sum() > 0:
    print("There are missing data points in the dataset.")

    # List columns with missing values.
    print("Columns with missing values:")
    print(missing_values[missing_values > 0])
else:
    print("There are no NaN or null data points in the dataset.")


There are no NaN or null data points in the dataset.


**Observations**
* There are no NaN or null values in the dataset.

In [17]:
# Check for duplicate values.
bankchurn_df.duplicated().sum()

0

**Observation**
* There are no duplicate rows.
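
Because `RowNumber` is different on every row, a full-row `duplicated()` check cannot flag a customer who appears twice. A minimal sketch, assuming `CustomerId` is the intended key, checks that column directly. The `sample_df` frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical frame: customer 102 appears twice, but RowNumber
# makes every full row distinct.
sample_df = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 102],
    "CreditScore": [619, 608, 608],
})

# Full-row check: finds nothing because RowNumber differs on every row.
full_row_duplicates = int(sample_df.duplicated().sum())

# Key-based check: counts repeated customer identifiers.
key_duplicates = int(sample_df.duplicated(subset=["CustomerId"]).sum())

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicate CustomerId values: {key_duplicates}")
```

On the real data the same call would be `bankchurn_df.duplicated(subset=["CustomerId"]).sum()`, run before the identifier columns are dropped.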

In [31]:
# Display float64 values as whole numbers to avoid exponential notation (summary values are rounded).
pd.options.display.float_format = '{:.0f}'.format

# Statistical summary of the dataset.
bankchurn_df.describe().T

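| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RowNumber | 10000 | 5000 | 2887 | 1 | 2501 | 5000 | 7500 | 10000 |
| CustomerId | 10000 | 15690941 | 71936 | 15565701 | 15628528 | 15690738 | 15753234 | 15815690 |
| CreditScore | 10000 | 651 | 97 | 350 | 584 | 652 | 718 | 850 |
| Age | 10000 | 39 | 10 | 18 | 32 | 37 | 44 | 92 |
| Tenure | 10000 | 5 | 3 | 0 | 3 | 5 | 7 | 10 |
| Balance | 10000 | 76486 | 62397 | 0 | 0 | 97199 | 127644 | 250898 |
| NumOfProducts | 10000 | 2 | 1 | 1 | 1 | 1 | 2 | 4 |
| HasCrCard | 10000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| IsActiveMember | 10000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| EstimatedSalary | 10000 | 100090 | 57510 | 12 | 51002 | 100194 | 149388 | 199992 |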


**Observation**

| Column Name | Datatype | Observation |
| --- | --- | --- |
| >> RowNumber | int64 | Sequential row index (1 to 10,000) with no predictive value. Can be dropped from the data set during data pre-processing. |
| >> CustomerId | int64 | Unique client identifier. Can be dropped from the data set during data pre-processing. |
| >> Surname | object | Customer surname; not a useful predictor. Can be dropped from the data set during data pre-processing. |
| CreditScore | int64 | Ranges from 350 to 850 with a mean of 651. |
| >> Geography | object | Categorical data needs to be encoded. |
| >> Gender | object | Categorical data needs to be encoded. |
| Age | int64 | Customers range in age from 18 to 92 with a mean of 39 years. |
| Tenure | int64 | Years with the bank range from 0 to 10 with a mean of 5. Some customers have a tenure of 0, i.e. less than one year with the bank. |
| Balance | float64 | Balances range widely from $0 to $250,898 with a mean of $76,486. At least 25% of customers have a zero balance. |
| NumOfProducts | int64 | Customers hold between 1 and 4 products. The median is 1 and the mean rounds to 2. |
| HasCrCard | int64 | Binary categorical value (0 or 1). |
| IsActiveMember | int64 | Binary categorical value (0 or 1). |
| EstimatedSalary | float64 | Mean salary is $100,090 with a maximum of $199,992. |
| Exited | int64 | Target variable per the business objective: whether the customer will leave the bank in the next 6 months. |
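
The two pre-processing actions flagged in the table (dropping the identifier columns and encoding the categorical columns) could be sketched as follows. This is a minimal illustration on three rows copied from the `head()` and `tail()` outputs above, assuming one-hot encoding with `pd.get_dummies` and `drop_first=True`; the notebook does not yet state which encoding is used.

```python
import pandas as pd

# Three rows copied from the head()/tail() outputs (subset of columns).
sample_df = pd.DataFrame({
    "RowNumber": [1, 2, 9999],
    "CustomerId": [15634602, 15647311, 15682355],
    "Surname": ["Hargrave", "Hill", "Sabbatini"],
    "CreditScore": [619, 608, 772],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
    "Exited": [1, 0, 1],
})

# Drop the identifier-like columns that carry no predictive signal.
features_df = sample_df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# One-hot encode the categorical columns. drop_first=True removes one
# level per column so the remaining dummies are not redundant.
encoded_df = pd.get_dummies(
    features_df, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

print(encoded_df.columns.tolist())
```

With `drop_first=True`, the first level in sorted order (`France`, `Female`) becomes the implicit baseline for each encoded column.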

In [19]:
# Get summary of the categorical columns.
bankchurn_df.describe(include=["object"]).T

| | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| Surname | 10000 | 2932 | Smith | 32 |
| Geography | 10000 | 3 | France | 5014 |
| Gender | 10000 | 2 | Male | 5457 |


In [46]:
# Count the values of the binary HasCrCard column as a list of lists for the table.
table = [[value, count] for value, count in bankchurn_df["HasCrCard"].value_counts().items()]

# Print table using tabulate.
print("Has Credit Card")
print(tabulate(table, headers=["Value", "Count"], tablefmt="grid"))
print("\n")

Has Credit Card
+---------+---------+
|   Value |   Count |
+=========+=========+
|       1 |    7055 |
+---------+---------+
|       0 |    2945 |
+---------+---------+




In [None]:
# Loop through the categorical columns and print unique values.
for n in bankchurn_df.describe(include=["object"]).columns:
    print(f"Unique values in {n} are :")
    
    # Create list of lists for the table.
    table = [[value, count] for value, count in bankchurn_df[n].value_counts().items()]
    
    # Print table using tabulate.
    print(tabulate(table, headers=["Value", "Count"], tablefmt="grid"))
    print("\n")

**Observations:**
* Male (55%) and female (45%) customers are close to an even split.
* Half of the customers live in France (50%).
* About 71% of customers have a credit card (7,055 of 10,000).
* The target `Exited` is treated as imbalanced; the class split should be confirmed with `value_counts()`.
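
The imbalance claim can be checked directly from the target column. The sketch below uses a short hypothetical `exited` series purely for illustration; the counts are not the real class distribution.

```python
import pandas as pd

# Hypothetical target values for illustration only.
exited = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Exited")

# Share of each class as a percentage of all rows.
class_share = exited.value_counts(normalize=True).mul(100).round(1)

print(class_share)
```

On the real data the equivalent call would be `bankchurn_df["Exited"].value_counts(normalize=True)`.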

#### Observations:
- France is the most frequent of the 3 unique values of `Geography`.
- There are more males than females in the dataset: 5,457 customers are male.
- `CreditScore`, `Tenure`, and `EstimatedSalary` have approximately the same mean and median.
- The mean is greater than the median for `Age` (39 vs. 37) and `NumOfProducts` (2 vs. 1 after rounding), which suggests that those variables are right-skewed. Skewness is not informative for the binary variables.
- `Age` ranges from 18 up to 92.
- 50% of customers are aged 37 or younger.
- `Tenure`, the number of years the customer has been with the bank, ranges from 0 to 10 years.
- `NumOfProducts` ranges from 1 to 4, so that variable could be converted to a categorical type.
In [None]:
# Drop `RowNumber`, `CustomerId`, and `Surname`, which do not add value.
bankchurn_df = bankchurn_df.drop(columns=["RowNumber", "CustomerId", "Surname"])
bankchurn_df.head()

In [None]:
# The identifier columns were dropped in the previous cell; dropping them
# again would raise a KeyError, so this cell only checks the shape.
print(f"There are {bankchurn_df.shape[0]} rows and {bankchurn_df.shape[1]} columns.")

---
---
# Exploratory Data Analysis

### Univariate Analysis

### Bivariate Analysis

---
---
# Data Pre-processing

### Dummy Variable Creation

### Train-validation-test Split
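
A minimal sketch of a stratified three-way split, assuming a 60/20/20 ratio and scikit-learn's `train_test_split`. The ratio, the random seed, and the synthetic `X` and `y` arrays are assumptions for illustration; the notebook has not fixed these choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 rows, 3 features, 20% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

# Step 1: hold out 20% of the rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: take 25% of the remainder as validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Passing `stratify` keeps the share of the positive class roughly equal across the three subsets, which matters when the target is imbalanced.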

### Data Normalization
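
A minimal sketch of feature scaling, assuming standardization (zero mean, unit variance) with scikit-learn's `StandardScaler`; the notebook does not state which scaler is intended. The values are the `CreditScore` and `Balance` columns of the five `head()` rows shown earlier, with the first four rows standing in for the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# CreditScore and Balance from the first five rows of the dataset.
X_train = np.array([
    [619.0, 0.0],
    [608.0, 83807.86],
    [502.0, 159660.80],
    [699.0, 0.0],
])
X_val = np.array([[850.0, 125510.82]])

# Fit the scaler on the training data only, then reuse its statistics
# for the validation data to avoid leaking information.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The same fitted `scaler` would also be applied to the test set, so that all three subsets share the training statistics.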

---
---
# Model Building

## Model evaluation criterion

Write down the logic for choosing the metric that would be the best metric for this business scenario.

-


### Neural Network with SGD Optimizer

## Model Performance Improvement

### Neural Network with Adam Optimizer

### Neural Network with Adam Optimizer and Dropout

### Neural Network with Balanced Data (by applying SMOTE) and SGD Optimizer

### Neural Network with Balanced Data (by applying SMOTE) and Adam Optimizer

### Neural Network with Balanced Data (by applying SMOTE), Adam Optimizer, and Dropout

## Model Performance Comparison and Final Model Selection

## Actionable Insights and Business Recommendations

*



<font size=6 color='blue'>Power Ahead</font>
___