In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle


In [2]:
## Load the dataset
data = pd.read_csv('Churn_Modelling.csv')
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [3]:
## Preprocess the data 
### Drop unnecessary columns
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


LabelEncoder: A scikit-learn tool that converts category labels (text values) into integers.
fit_transform: Runs two steps in a single call:
fit: The encoder collects the unique values in the Gender column ("Female" and "Male"), sorts them, and assigns each an integer in that order ("Female" -> 0, "Male" -> 1).
transform: The encoder then replaces the original text values in the Gender column with the assigned integers.
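
As a quick illustration of the two steps, here is a minimal sketch on a made-up list of values (the sample list is an assumption, not rows from the dataset). It also uses `classes_` and `inverse_transform` to inspect the mapping the encoder learned.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, for illustration only
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)  # fit and transform in one call

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_.tolist())}
print(mapping)
print(codes.tolist())

# inverse_transform turns the integer codes back into the original labels
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Checking `classes_` is a cheap way to confirm which integer stands for which label before relying on the encoded column.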

In [4]:
## Encode categorical variables
label_encoder_gender = LabelEncoder()
data['Gender'] = label_encoder_gender.fit_transform(data['Gender'])

In [6]:
data


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,0,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,502,France,0,42,8,159660.80,3,1,0,113931.57,1
3,699,France,0,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,0,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,1,39,5,0.00,2,1,0,96270.64,0
9996,516,France,1,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,0,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,1,42,3,75075.31,2,1,0,92888.52,1


The next column to encode is Geography.

In [9]:
## One-hot encode the 'Geography' column
from sklearn.preprocessing import OneHotEncoder
onehot_encoder_geo = OneHotEncoder()
geo_encoder = onehot_encoder_geo.fit_transform(data[['Geography']])


In [10]:
geo_encoder

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10000 stored elements and shape (10000, 3)>

In [11]:
geo_encoder.toarray()

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [12]:
onehot_encoder_geo.get_feature_names_out(['Geography'])

array(['Geography_France', 'Geography_Germany', 'Geography_Spain'],
      dtype=object)

In [13]:
geo_encoder_df = pd.DataFrame(geo_encoder.toarray(), columns=onehot_encoder_geo.get_feature_names_out(['Geography']))
geo_encoder_df

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
...,...,...,...
9995,1.0,0.0,0.0
9996,1.0,0.0,0.0
9997,1.0,0.0,0.0
9998,0.0,1.0,0.0


The code above converts the Geography column (containing text categories like "France," "Germany," "Spain") into numeric columns, one for each category. This is called one-hot encoding.

Why We Need This
Most machine learning models, including the neural network built here, need numeric input and cannot work directly with text such as "France" or "Germany." One-hot encoding turns the categories into numbers without implying an order between them, which plain integer codes (0, 1, 2) would do.

What Each Part Does:
OneHotEncoder: A tool to create separate columns for each category in the Geography column.
fit_transform: Learns the categories and converts them into a matrix where:
1 means the row belongs to that category.
0 means it doesn't.
toarray(): Converts the sparse matrix returned by fit_transform into a dense NumPy array for easier use.
get_feature_names_out: Names the new columns (e.g., Geography_France, Geography_Germany).
pd.DataFrame: Creates a new table (dataframe) with the encoded columns.
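Example Output:
If Geography = ["France", "Germany", "France"] and the encoder was fitted on the full dataset (so it also knows "Spain"), the result will be:

Geography_France	Geography_Germany	Geography_Spain
1	0	0
0	1	0
1	0	0
Each category gets its own column, so the model does not read one country as numerically larger than another.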

In [14]:
## Combine the encoded Geography columns with the rest of the data
## pd.concat(axis=1) joins the two DataFrames column-wise, matching rows by index.

data = pd.concat([data.drop('Geography', axis=1), geo_encoder_df], axis=1)

In [15]:
data.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.0,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.8,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.0,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0.0,0.0,1.0
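
One detail worth knowing about the concat step: with axis=1, pandas matches rows by index label, not by position. It works in this notebook because both DataFrames carry the default 0..n-1 index. The sketch below uses made-up frames to show what could go wrong if rows had been filtered first, and how `reset_index(drop=True)` avoids it.

```python
import pandas as pd

# Made-up frames for illustration only
features = pd.DataFrame({"Age": [42, 41, 39]})
encoded = pd.DataFrame({"Geography_France": [1.0, 0.0, 1.0]})

# Both frames share the default index 0..2, so rows line up
combined = pd.concat([features, encoded], axis=1)

# After filtering, the remaining rows keep their old index labels (1 and 2)
filtered = features[features["Age"] < 42]
encoded_filtered = pd.DataFrame({"Geography_France": [0.0, 1.0]})  # index 0 and 1

# Index labels no longer match, so concat pads with NaN
misaligned = pd.concat([filtered, encoded_filtered], axis=1)

# Resetting the index restores positional alignment
aligned = pd.concat([filtered.reset_index(drop=True), encoded_filtered], axis=1)
print(misaligned)
print(aligned)
```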


The code below saves the encoders (label_encoder_gender and onehot_encoder_geo) to files so they can be reused later without being refitted.

Why We Need This
In real-world applications, we might need to use the same encoders (with the same mappings) on new data during prediction. Saving the encoders ensures consistency between training and prediction phases.

What Each Part Does:
with open(..., 'wb'): Opens a file in write-binary mode to store the encoder data.
'label_encoder_gender.pkl': File to save the gender encoder.
'label_encoder_geo.pkl': File to save the geography encoder (despite the file name, it stores the OneHotEncoder, not a LabelEncoder).
pickle.dump(...): Saves (serializes) the encoder object into the file.
label_encoder_gender: Saves the gender label encoder.
onehot_encoder_geo: Saves the geography one-hot encoder.

In [20]:
## Save the encoders (the scaler is saved in a separate cell, once it has been fitted)
with open('label_encoder_gender.pkl', 'wb') as file:
    pickle.dump(label_encoder_gender, file)

with open('label_encoder_geo.pkl', 'wb') as file:
    pickle.dump(onehot_encoder_geo, file)
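
To show the round trip, here is a minimal sketch that saves an encoder and loads it back with `pickle.load`. It writes to a temporary directory rather than the notebook's working directory, and the small encoder is fitted on made-up labels, so nothing here touches the real .pkl files.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Stand-in encoder fitted on made-up labels
encoder = LabelEncoder().fit(["Female", "Male"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_encoder_gender.pkl")

    # 'wb' = write binary: serialize the fitted encoder to disk
    with open(path, "wb") as file:
        pickle.dump(encoder, file)

    # 'rb' = read binary: restore the encoder with the same mapping
    with open(path, "rb") as file:
        loaded_encoder = pickle.load(file)

# The loaded encoder applies the mapping learned before saving
new_codes = loaded_encoder.transform(["Male", "Female", "Male"]).tolist()
print(new_codes)
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.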

In [17]:
## Divide the dataset into independent and dependent variables/features
X = data.drop('Exited', axis=1)
y = data['Exited']

## Split the dataset into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Scale these features using the StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
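
The scaler is fitted on the training split only, and the test split is transformed with the training mean and standard deviation. A minimal sketch with random made-up data (all `*_demo` names are hypothetical) shows what that means numerically:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random made-up data standing in for the real features
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from train
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the train statistics

# The same transformation written by hand, using training statistics only
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns are only close to that, because they are scaled with statistics they did not contribute to.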


In [18]:
X_train

array([[ 0.35649971,  0.91324755, -0.6557859 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [-0.20389777,  0.91324755,  0.29493847, ..., -0.99850112,
         1.72572313, -0.57638802],
       [-0.96147213,  0.91324755, -1.41636539, ..., -0.99850112,
        -0.57946723,  1.73494238],
       ...,
       [ 0.86500853, -1.09499335, -0.08535128, ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.15932282,  0.91324755,  0.3900109 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.47065475,  0.91324755,  1.15059039, ..., -0.99850112,
         1.72572313, -0.57638802]])

In [19]:
with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)

#### ANN Implementation

In [21]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard
import datetime


Sequential:

A model type in which layers are stacked one after another, each feeding its output to the next.

First Layer (Dense(64, ...)):

Dense(64): A fully connected layer with 64 neurons.
activation='relu': Uses the ReLU (Rectified Linear Unit) activation function to introduce non-linearity.
input_shape=(X_train.shape[1],): Specifies the shape of one input row, i.e. the number of features in the training data.

Second Layer (Dense(32, ...)):

Dense(32): A fully connected layer with 32 neurons.
activation='relu': Again uses the ReLU activation for non-linearity.

Output Layer (Dense(1, ...)):

Dense(1): A single neuron, as this is a binary classification problem (output is either 0 or 1).
activation='sigmoid': The sigmoid activation function outputs values between 0 and 1, suitable for binary classification.

What Does This Model Do?

Input Layer: Takes the scaled training features (X_train) as input.
Hidden Layers: Pass the data through two layers of neurons to learn patterns.
Output Layer: Outputs a value between 0 and 1, the predicted probability of class 1 (here, Exited = 1, meaning the customer left).
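
To make the layer descriptions concrete, here is a rough NumPy sketch of a single forward pass through the same 12 -> 64 -> 32 -> 1 layout. The weights are random and untrained, the input batch is made up, and this is not how Keras implements the layers internally; it only illustrates the arithmetic of Dense + ReLU + sigmoid.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and replaces negatives with 0
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
n_features = 12                            # assumed feature count for this sketch
batch = rng.normal(size=(4, n_features))   # 4 made-up, already scaled rows

# Random, untrained weights and zero biases for each Dense layer
W1, b1 = rng.normal(scale=0.1, size=(n_features, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)

h1 = relu(batch @ W1 + b1)       # first hidden layer, shape (4, 64)
h2 = relu(h1 @ W2 + b2)          # second hidden layer, shape (4, 32)
probs = sigmoid(h2 @ W3 + b3)    # output layer, shape (4, 1)

print(probs.ravel())
```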

In [22]:
(X_train.shape[1],)

(12,)

In [25]:
## Build the ANN model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),  # HL1 connected with input layer
    Dense(32, activation='relu'),  # HL2
    Dense(1, activation='sigmoid')  # Output layer
])





In [26]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                832       
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2945 (11.50 KB)
Trainable params: 2945 (11.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
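
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch (the helper name `dense_params` is made up for this example):

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# 12 input features, then the three Dense layers from the summary
layer_sizes = [12, 64, 32, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [832, 2080, 33]
print(total)      # 2945
```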


In [27]:
## Create the optimizer and the loss object (tf was imported above)
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
loss = tf.keras.losses.BinaryCrossentropy()
loss

<keras.src.losses.BinaryCrossentropy at 0x2a56df10e50>

In [28]:
## Compile the model: set the optimizer, the loss to minimize, and the metric to report
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
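
For intuition about what the binary cross-entropy loss measures, here is a small NumPy sketch of the textbook formula, -mean(y*log(p) + (1-y)*log(1-p)). The function name and the clipping constant `eps` are choices made for this example; Keras's own implementation may differ in such details.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so log(0) never occurs
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))

# Confident and correct predictions give a small loss
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])

# Confident but wrong predictions are penalized heavily
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(confident_wrong, 4))
```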

In [30]:
## Set up the TensorBoard callback (EarlyStopping and TensorBoard were imported above)
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorflow_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

In [31]:
## Set up the early stopping callback
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

### Early stopping
Early stopping is a technique that halts training once the model stops improving on validation data. It helps prevent overfitting, which occurs when a model learns patterns specific to the training data but fails to generalize well to unseen data.

### Benefits of Early Stopping:
Prevents Overfitting: Stops training before the model starts memorizing the training data.
Saves Time: Reduces unnecessary training epochs, saving computational resources.
Restores the Best Model: Allows you to use the weights of the model when it performed best during training.

#### monitor='val_loss':
The callback monitors the validation loss (the loss computed on the validation data at the end of each epoch).
Training stops if the validation loss does not improve for the number of epochs set by patience.

#### patience=5:
Training continues for up to 5 epochs after the last improvement in validation loss.
If none of those 5 epochs beats the best value seen so far, training stops.

#### restore_best_weights=True:
Ensures that the model reverts to the weights of the epoch where it achieved the best validation loss.
Without it, the model would keep the weights from the final epoch, which is up to 5 epochs past the best one.
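
The patience rule can be sketched in plain Python. This is a simplified simulation of the logic described above (improvement means a strictly lower validation loss), not Keras's actual callback code; the function name and the list of losses are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (epoch training stops at, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch

    # Patience never ran out: training uses every epoch
    return len(val_losses), best_epoch

# Made-up validation losses: the best value occurs at epoch 3
val_losses = [0.50, 0.45, 0.42, 0.43, 0.44, 0.42, 0.45, 0.46, 0.40]
stopped_at, best_epoch = simulate_early_stopping(val_losses, patience=5)
print(stopped_at, best_epoch)  # 8 3
```

In this made-up run, training stops at epoch 8 and never reaches the lower loss at epoch 9; with restore_best_weights=True the model would end up with the epoch 3 weights.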

In [32]:
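### Training the Model
# Note: the test split doubles as validation data here, so early stopping picks
# its best epoch using the test set and metrics on that set will be optimistic.
history = model.fit(
    X_train, y_train,
    epochs=100,
    validation_data=(X_test, y_test),
    callbacks=[early_stopping_callback, tensorflow_callback],
)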

Epoch 1/100
...
Epoch 22/100


In [33]:
model.save('model.h5')



In [34]:
## Load Tensorboard Extension 
%load_ext tensorboard

In [36]:
%tensorboard --logdir logs/fit/20250104-131951

Reusing TensorBoard on port 6006 (pid 10520), started 0:00:11 ago. (Use '!kill 10520' to kill it.)

In [None]:
## Load the pickle file
