Before diving into the backpropagation algorithm, we will work through some hands-on projects to better understand how to create a neural network, train it, and get predictions from it. Once you have seen this in action, the backpropagation algorithm will be easier to digest. We will build three projects using TensorFlow and Keras:
- Customer Churn Prediction using ANN => Binary Classification Problem
- Handwritten Digit Classification using ANN => Multi-class Classification Problem
- Graduate Admission Prediction using ANN => Regression Problem

# Customer Churn Prediction using ANN

## 1. Credit Card Customer Churn Prediction Dataset

The [Kaggle Customer Churn Prediction Dataset](https://www.kaggle.com/datasets/rjmanoj/credit-card-customer-churn-prediction) contains data on both customers who are still with a bank and customers who have exited it.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("data/datasets/DL_05_Customer_Churn_Prediction_using_ANN-Churn_Modelling.csv")

In [3]:
print(f"Shape of dataset: {df.shape}")
df.head()

Shape of dataset: (10000, 14)


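| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |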


After exploring the dataset, we can make the following assumptions:
- The features $RowNumber, CustomerId, Surname$ are identifiers and do not seem to contribute much to the target variable $Exited$.
- $Tenure$ seems to be the number of years the customer is/was with the bank.
- $Balance$ seems to be the account balance.
- $NumOfProducts$ seems to be the number of products used by the customer (for example, Debit Card, Credit Card, Fixed Deposit Account, Recurring Deposit Account, etc.).
- $HasCrCard$ $\implies$ whether the customer has a credit card or not
- $IsActiveMember$ $\implies$ whether the customer is an active member of the bank (possibly based on the transaction history)
- $EstimatedSalary$ $\implies$ salary of the customer as estimated by the bank (possibly using some other ML model)
- $Exited$ $\implies$ target variable denoting whether the customer has exited the bank or not
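
The first assumption above suggests leaving the identifier columns out of the model inputs. Here is a minimal sketch of that step. It assumes a small stand-in DataFrame built from the first rows shown earlier; the names `sample`, `id_columns`, and `features`, and the reduced column set, are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in frame using the first three rows shown above (subset of columns only)
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Age": [42, 41, 42],
    "Exited": [1, 0, 1],
})

# Columns assumed to be pure identifiers with no predictive value
id_columns = ["RowNumber", "CustomerId", "Surname"]

# drop() returns a new DataFrame; the original frame is left unchanged
features = sample.drop(columns=id_columns)
print(features.columns.tolist())
```

On the full dataset, the same `drop(columns=...)` call could be applied to `df`.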

In [4]:
# Check the dataset for any missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


There are no missing values: every column has 10000 non-null entries. The features $Surname, Geography, Gender$ are of type "object" (text), and all other columns have numeric datatypes that suit their data.
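
The same two checks can also be done programmatically rather than by reading the `info()` output. This is a minimal sketch on a hypothetical stand-in frame (`sample`, with only a few of the columns above); on the real data the calls would be made on `df`.

```python
import pandas as pd

# Stand-in frame with a few of the columns shown above
sample = pd.DataFrame({
    "Surname": ["Hargrave", "Hill", "Onio"],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Age": [42, 41, 42],
    "Balance": [0.0, 83807.86, 159660.8],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()
print(missing_per_column)

# List the non-numeric (text) columns, which will need encoding later
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```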

In [5]:
# Check for any duplicated rows
df.duplicated().sum()

0

As you can see, there are no duplicated rows.

In [6]:
# Check how many customers exited the bank
df["Exited"].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

7963 customers are still with the bank, while 2037 customers have already exited. The dataset is therefore imbalanced (roughly 80% versus 20%), which would need to be taken care of to build a really good ML model.

**Note**

Here, we do not correct the dataset imbalance, since we are not concerned with creating a very accurate model. We are more focused on learning the flow of ML model creation, training, and prediction.
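
For reference, the size of the imbalance can be quantified from the counts above. The sketch below also computes per-class weights using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. This weighting is an assumption added for illustration and is not applied anywhere in this notebook; Keras' `model.fit` accepts such a dictionary through its `class_weight` argument.

```python
import pandas as pd

# Class counts taken from the value_counts() output above
counts = pd.Series({0: 7963, 1: 2037}, name="count")

# Fraction of customers in each class
proportions = counts / counts.sum()
print(proportions)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weight = (counts.sum() / (len(counts) * counts)).to_dict()
print(class_weight)
```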

In [7]:
# Get categories in "Geography" column
df["Geography"].value_counts()

Geography
France     5014
Germany    2509
Spain      2477
Name: count, dtype: int64

The $Geography$ column also shows some imbalance: France has about twice as many customers as Germany or Spain.

In [8]:
# Check "Gender" column
df["Gender"].value_counts()

Gender
Male      5457
Female    4543
Name: count, dtype: int64

The $Gender$ column also shows a small imbalance.
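
Since $Geography$ and $Gender$ hold text categories, they have to be converted to numbers before a neural network can use them. One common option is one-hot encoding with `pd.get_dummies`. The sketch below is an assumption about how this could be done, not necessarily the approach taken later; the frame `sample` and its row values are hypothetical, and only the category names come from the outputs above.

```python
import pandas as pd

# Hypothetical rows; only the category names match the dataset
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 35, 39],
})

# One-hot encode both columns; drop_first=True drops one category per
# column (France, Female) to avoid redundant, perfectly correlated inputs
encoded = pd.get_dummies(
    sample,
    columns=["Geography", "Gender"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

With `drop_first=True`, a row with zeros in both `Geography_Germany` and `Geography_Spain` represents France.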

In [None]:
# Deep Learning as subset of ML

from IPython import display
display.Image("data/images/DL_01_Intro-01-DL-subset-of-ML.jpg")