# Customer Churn Prediction: Data Preprocessing and Feature Engineering for ANN

This Jupyter Notebook contains the initial data preprocessing and feature engineering steps for building an Artificial Neural Network (ANN) to predict customer churn. The goal is to prepare the raw `Churn_Modelling.csv` dataset by handling irrelevant features, encoding categorical variables, splitting data, and scaling numerical features. This preprocessed data will then be used to train an ANN in subsequent steps.

## Key Steps Covered:

1.  **Environment Setup**: Installation of `ipykernel` for Jupyter Notebook execution.
2.  **Library Imports**: Essential libraries like `pandas` for data manipulation, `sklearn.model_selection` for `train_test_split`, `sklearn.preprocessing` for `StandardScaler` and `LabelEncoder`, and `pickle` for saving preprocessing objects.
3.  **Data Loading**: Reading the `Churn_Modelling.csv` dataset into a pandas DataFrame.
4.  **Feature Dropping**: Removing identifier columns that carry no predictive signal: `RowNumber`, `CustomerId`, and `Surname`.
5.  **Categorical Feature Encoding**:
    * **Label Encoding**: Applying `LabelEncoder` to the `Gender` column (`Female` becomes 0 and `Male` becomes 1, because classes are sorted alphabetically).
    * **One-Hot Encoding**: Applying `OneHotEncoder` to the `Geography` column (France, Spain, Germany) so that no artificial ordering is imposed on the countries. The encoder returns a sparse matrix, which is converted to a dense array, wrapped in a DataFrame, and concatenated with the main dataset.
6.  **Data Splitting**: Dividing the processed dataset into independent features (`X`) and the dependent target (`y`, the `Exited` column).
7.  **Train-Test Split**: Splitting the data into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) using `train_test_split`.
8.  **Feature Scaling**: Fitting `StandardScaler` on `X_train` and applying the same transformation to `X_test`, so that each feature has zero mean and unit variance on the training set.
9.  **Saving Preprocessing Objects**: Pickling the `LabelEncoder` for `Gender`, the `OneHotEncoder` for `Geography`, and the `StandardScaler` so that deployment code can apply exactly the same transformations.

This notebook prepares a clean, numerical, and scaled dataset, making it ready for training a deep learning model.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,LabelEncoder
import pickle

In [2]:
## Load the dataset
# Read the CSV file into a pandas DataFrame
data=pd.read_csv("Churn_Modelling.csv")

# Display the first few rows of the DataFrame to inspect the data
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [3]:
## Preprocess the data
### Drop irrelevant columns
# Drop 'RowNumber', 'CustomerId', and 'Surname' columns as they are not relevant for prediction.
# axis=1 indicates that we are dropping columns.
data=data.drop(['RowNumber','CustomerId','Surname'],axis=1)

# Display the DataFrame after dropping columns
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [4]:
## Encode categorical variables

# Initialize LabelEncoder for the 'Gender' column
label_encoder_gender=LabelEncoder()

# Fit and transform the 'Gender' column. Classes are sorted alphabetically, so 'Female' becomes 0 and 'Male' becomes 1.
data['Gender']=label_encoder_gender.fit_transform(data['Gender'])

# Display the DataFrame to see the encoded 'Gender' column
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,0,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,502,France,0,42,8,159660.80,3,1,0,113931.57,1
3,699,France,0,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,0,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,1,39,5,0.00,2,1,0,96270.64,0
9996,516,France,1,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,0,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,1,42,3,75075.31,2,1,0,92888.52,1
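
The integer assigned to each label can be read back from the fitted encoder instead of being inferred from the output. The following is a minimal sketch on toy values (the list `genders` is illustrative and not taken from the dataset); it relies on `LabelEncoder.classes_`, which stores the labels in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'Gender' column (illustrative, not the real data)
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # → {'Female': 0, 'Male': 1}
print(codes.tolist())  # → [1, 0, 0, 1]
```

This matches the table above, where the `Female` rows show 0 and the `Male` rows show 1.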


In [5]:
## One-hot encode 'Geography'

# Import OneHotEncoder for one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder. By default it returns a sparse matrix,
# which is converted to a dense array with .toarray() below.
onehot_encoder_geo=OneHotEncoder()

# Fit and transform the 'Geography' column. It is passed as a DataFrame ([[...]]) because OneHotEncoder expects 2D input.
geo_encoder=onehot_encoder_geo.fit_transform(data[['Geography']]).toarray()

# Display the resulting one-hot encoded array
geo_encoder

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [6]:
# Get the names of the new features created by one-hot encoding
onehot_encoder_geo.get_feature_names_out(['Geography'])

array(['Geography_France', 'Geography_Germany', 'Geography_Spain'],
      dtype=object)

In [7]:
# Create a DataFrame from the one-hot encoded array with appropriate column names
geo_encoded_df=pd.DataFrame(geo_encoder,columns=onehot_encoder_geo.get_feature_names_out(['Geography']))

# Display the one-hot encoded DataFrame
geo_encoded_df

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
...,...,...,...
9995,1.0,0.0,0.0
9996,1.0,0.0,0.0
9997,1.0,0.0,0.0
9998,0.0,1.0,0.0


In [8]:
## Combine the one-hot encoded columns with the original data

# Drop the original 'Geography' column and append the encoded columns side by side (axis=1)
data=pd.concat([data.drop('Geography',axis=1),geo_encoded_df],axis=1)


data.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.0,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.8,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.0,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0.0,0.0,1.0
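
The concatenation works here because both frames share the default 0..n-1 index. `pd.concat` with `axis=1` aligns rows by index label, so the sketch below shows what could happen if rows had been filtered first. The frame contents and the index values 10 to 12 are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (for example, after filtering rows)
data = pd.DataFrame({"CreditScore": [619, 608, 502]}, index=[10, 11, 12])
geo = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
cols = ["Geography_France", "Geography_Germany", "Geography_Spain"]

# A fresh DataFrame gets a default RangeIndex, so concat aligns on mismatched labels
misaligned = pd.concat([data, pd.DataFrame(geo, columns=cols)], axis=1)

# Reusing the original index keeps each encoded row next to its source row
aligned = pd.concat([data, pd.DataFrame(geo, columns=cols, index=data.index)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # → (6, 4) 12
print(aligned.shape, int(aligned.isna().sum().sum()))        # → (3, 4) 0
```

Passing `index=data.index` when building the encoded DataFrame is one defensive option. It is not needed for the notebook as written.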


In [9]:
## Save the encoders (the scaler is saved after it has been fitted)
with open('label_encoder_gender.pkl','wb') as file:
    pickle.dump(label_encoder_gender,file)

with open('onehot_encoder_geo.pkl','wb') as file:
    pickle.dump(onehot_encoder_geo,file)


In [10]:
data.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.0,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.8,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.0,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0.0,0.0,1.0


In [11]:
## Divide the dataset into independent and dependent features
X=data.drop('Exited',axis=1)
y=data['Exited']

## Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

## Scale the features: fit on the training set only, then reuse those statistics on the test set
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
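
As an illustration of why the test set is only transformed and never fitted, the sketch below reproduces `StandardScaler` by hand on toy arrays (the values are illustrative). The standardized value is z = (x - mean) / std, where both statistics come from the training split and the standard deviation is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature splits (illustrative values only)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# Manual check: z = (x - train_mean) / train_std, with population std (ddof=0)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_test_scaled, manual))  # → True
```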


In [14]:
X_train

array([[ 0.35649971,  0.91324755, -0.6557859 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [-0.20389777,  0.91324755,  0.29493847, ..., -0.99850112,
         1.72572313, -0.57638802],
       [-0.96147213,  0.91324755, -1.41636539, ..., -0.99850112,
        -0.57946723,  1.73494238],
       ...,
       [ 0.86500853, -1.09499335, -0.08535128, ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.15932282,  0.91324755,  0.3900109 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.47065475,  0.91324755,  1.15059039, ..., -0.99850112,
         1.72572313, -0.57638802]])

In [15]:
with open('scaler.pkl','wb') as file:
    pickle.dump(scaler,file)
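
A pickled preprocessing object should behave exactly like the original once it is loaded again. The sketch below checks this with an in-memory round trip (`pickle.dumps` / `pickle.loads`) on a toy scaler, where the notebook writes to `scaler.pkl` on disk. The arrays are illustrative.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy data (illustrative values only)
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# In-memory round trip; the notebook writes to 'scaler.pkl' instead
restored = pickle.loads(pickle.dumps(scaler))

sample = np.array([[2.0]])
same = np.allclose(scaler.transform(sample), restored.transform(sample))
print(same)  # → True
```

Pickles of scikit-learn objects are generally expected to be loaded with the same library version that created them, so it may be worth recording the version alongside the files.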

In [24]:
data

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.00,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.80,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.00,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.10,0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,1,39,5,0.00,2,1,0,96270.64,0,1.0,0.0,0.0
9996,516,1,35,10,57369.61,1,1,1,101699.77,0,1.0,0.0,0.0
9997,709,0,36,7,0.00,1,0,1,42085.58,1,1.0,0.0,0.0
9998,772,1,42,3,75075.31,2,1,0,92888.52,1,0.0,1.0,0.0


### ANN Implementation

# Artificial Neural Network (ANN) Training for Customer Churn Prediction

This Jupyter Notebook focuses on building and training an Artificial Neural Network (ANN) using TensorFlow/Keras for the customer churn prediction project. Building upon the previous data preprocessing steps (where `StandardScaler`, `LabelEncoder`, and `OneHotEncoder` pickle files were generated), this notebook will demonstrate how to construct a sequential ANN model, configure its training process, and visualize its performance.

## Key Steps Covered:

1.  **Library Imports**: Importing TensorFlow, Keras layers (`Sequential`, `Dense`), optimizers (`Adam`), loss functions (`BinaryCrossentropy`), and callbacks (`EarlyStopping`, `TensorBoard`).
2.  **Model Architecture**: Defining a **sequential ANN model** with an input layer, multiple hidden layers (using `Dense` layers with `relu` activation), and a single-neuron output layer with `sigmoid` activation for binary classification.
3.  **Model Compilation**: Configuring the model with an **optimizer** (Adam), a **loss function** (binary cross-entropy), and **metrics** (accuracy) to guide the learning process.
4.  **Callback Setup**:
    * **TensorBoard**: Setting up TensorBoard callbacks to log training progress, enabling visualization of metrics like loss and accuracy over epochs.
    * **Early Stopping**: Implementing early stopping to prevent overfitting by monitoring validation loss and stopping training if no improvement is observed for a specified number of epochs.
5.  **Model Training**: Fitting the ANN model to the preprocessed training data, passing the held-out test split (`X_test`, `y_test`) as validation data for monitoring, and incorporating the defined callbacks.
6.  **Model Saving**: Saving the trained ANN model to an HDF5 file for later deployment or inference.
7.  **TensorBoard Visualization**: Demonstrating how to launch and navigate TensorBoard within the notebook to visually analyze training and validation metrics.

In [16]:
# Import TensorFlow and Keras modules for building the ANN
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping,TensorBoard
import datetime



In [17]:
# Shape of a single input sample: a one-element tuple holding the number of features.
# The trailing comma is what makes (n,) a tuple rather than a parenthesized integer.
# This is used as the input_shape of the first Dense layer.
(X_train.shape[1],)

(12,)

In [18]:
## Build Our ANN Model
# Initialize a Sequential model (layers are added one after another)
model=Sequential([
    Dense(64,activation='relu',input_shape=(X_train.shape[1],)), ## HL1, connected to the input layer
    Dense(32,activation='relu'), ## HL2
    Dense(1,activation='sigmoid')  ## output layer
]

)

2025-06-23 14:49:04.767458: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2025-06-23 14:49:04.767507: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2025-06-23 14:49:04.767516: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2025-06-23 14:49:04.767599: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-06-23 14:49:04.767656: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
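
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass that a 12 → 64 → 32 → 1 network of this shape computes. It is not how Keras implements it internally. The weights are random placeholders and the input rows are fake, so only the shapes and the activation functions carry meaning.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 12  # matches X_train.shape[1] in this notebook

# Randomly initialised weights, purely illustrative (not trained values)
W1, b1 = rng.normal(scale=0.1, size=(n_features, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)

X = rng.normal(size=(5, n_features))  # five fake, already-scaled customers
h1 = relu(X @ W1 + b1)                # hidden layer 1
h2 = relu(h1 @ W2 + b2)               # hidden layer 2
probs = sigmoid(h2 @ W3 + b3)         # one churn probability per row

print(probs.shape)  # → (5, 1)
```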


In [19]:
# Display the model summary, showing layer types, output shapes, and number of parameters
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                832       
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2945 (11.50 KB)
Trainable params: 2945 (11.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
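
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A short sketch (the helper `dense_params` is a hypothetical name introduced for illustration):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [12, 64, 32, 1]  # input features, then the three Dense layers
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # → [832, 2080, 33] 2945
```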


In [20]:
# TensorFlow was already imported as `tf` in the imports cell above

# Define the optimizer (Adam with a learning rate of 0.01)
opt=tf.keras.optimizers.Adam(learning_rate=0.01)

# Define the loss function for binary classification (Binary Crossentropy)
loss=tf.keras.losses.BinaryCrossentropy()
loss



<keras.src.losses.BinaryCrossentropy at 0x32e8df4c0>

In [None]:
## Compile the model
# Specify the optimizer, the loss object defined above, and the metrics to monitor during training
model.compile(optimizer=opt,loss=loss,metrics=['accuracy'])
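
For intuition about what the loss measures, binary cross-entropy averages -[y·log(p) + (1 - y)·log(1 - p)] over the samples. The NumPy sketch below is an illustrative re-implementation, not the Keras code. The clipping constant `eps` is an assumption made here to avoid `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with clipping to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Same labels, opposite predictions: the loss punishes confident mistakes heavily
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(confident_wrong, 4))  # → 0.1054 2.3026
```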

In [None]:
## Set up TensorBoard
# EarlyStopping, TensorBoard, and datetime were already imported in the imports cell above

# Define the log directory for TensorBoard to store training logs
log_dir="logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# Create a TensorBoard callback instance
tensorflow_callback=TensorBoard(log_dir=log_dir,histogram_freq=1)

In [None]:
## Set up Early Stopping
# Create an EarlyStopping callback instance
# monitor='val_loss': Monitor the validation loss
# patience=10: Stop training if validation loss doesn't improve for 10 epochs
# restore_best_weights=True: Restore model weights from the epoch with the best validation loss
early_stopping_callback=EarlyStopping(monitor='val_loss',patience=10,restore_best_weights=True)
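
The patience rule can be sketched in plain Python. This is a simplified model of the behaviour described in the comments above. It ignores options such as `min_delta`, and the function name `early_stop` and the loss values are hypothetical.

```python
def early_stop(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran for every epoch supplied
    return len(val_losses), best_epoch

losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.44, 0.45]
print(early_stop(losses, patience=3))  # → (6, 3)
```

With `restore_best_weights=True`, the weights kept at the end would be those from the best epoch (epoch 3 in this toy run), not those from the epoch at which training stopped.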


In [24]:
### Train the model
# Train the model using the training data
# validation_data: Data to evaluate the model's performance at the end of each epoch
# epochs=100: Maximum number of epochs to train
# callbacks: List of callbacks to apply during training (TensorBoard and EarlyStopping)
history=model.fit(
    X_train,y_train,validation_data=(X_test,y_test),epochs=100,
    callbacks=[tensorflow_callback,early_stopping_callback]
)

Epoch 1/100


2025-06-23 14:49:33.741999: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
2025-06-23 14:49:33.767713: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Adam/AssignAddVariableOp.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100


In [25]:
# Save the trained model to an HDF5 file
model.save('model.h5')



In [None]:
## Load Tensorboard Extension
# Load the TensorBoard notebook extension to visualize training logs
%load_ext tensorboard

In [28]:
# Launch TensorBoard to view the training logs from the specified directory
%tensorboard --logdir logs/fit

Reusing TensorBoard on port 6006 (pid 38582), started 0:00:09 ago. (Use '!kill 38582' to kill it.)

In [None]:
### Load the pickle file
