# Example of an end-to-end (e2e) ANN implementation
- The idea is to predict the probability of a customer leaving (churning) using an ANN model

## Covers:
> **Preprocessing of data**
- Read the data from a CSV file which contains customer information and whether each customer exited or not
- Remove irrelevant data which is not important for prediction
- Encode categorical data (data which has only certain values) for binary categorization using LabelEncoder()
- Use One-Hot Encoding (OHE) for non-binary categorical data using OneHotEncoder(), assuming that there is no ordinal relationship in the data (in general, OHE is preferred for categorical data)
- Evaluate the unique feature set created by OHE, remove original column and concatenate new feature data
- Divide the dataset into dependent (output) and independent (input) features
- Based on dependent and independent data, generate training and testing dataset using train_test_split()
- Scale the data using StandardScaler()
- Save the fitted encoders and scaler in pickle file format

> **ANN Implementation**
- Create a sequential ANN model using keras.models Sequential()
  - Add hidden layers using keras.layers Dense()
  - Add output layer using keras.layers Dense()
- Setup Tensorboard to graphically monitor the training process
- Setup early stopping (stop the training process if the learning progress is minimal)
- Train the model
- Fine tune the training process by analyzing Tensorboard graphs
- Save the trained model in .h5 format
- Use model for predicting


## Preprocessing of data

#### Preprocess the data so that unnecessary data is removed and text is converted into numbers

In [88]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle


In [89]:
#load the data
data = pd.read_csv('Churn_Modelling.csv')
data


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


#### Remove irrelevant data which is not important for prediction

In [90]:
## Preprocess the data
### Drop irrelevant data
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


#### Encode categorical data (data which has only certain values) for binary categorization using LabelEncoder()

In [91]:
# Encoding of categorical data (data which has only certain values)
# In this case, Gender has only two values (Male, Female), which can be binary encoded (Female -> 0, Male -> 1)
# LabelEncoder() can be used for this. For ordinal data (data which has a certain order), OrdinalEncoder() is used instead, since it lets you specify the category order
label_encover_gender = LabelEncoder()
data['Gender'] = label_encover_gender.fit_transform(data['Gender'])
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,0,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,502,France,0,42,8,159660.80,3,1,0,113931.57,1
3,699,France,0,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,0,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,1,39,5,0.00,2,1,0,96270.64,0
9996,516,France,1,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,0,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,1,42,3,75075.31,2,1,0,92888.52,1
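
As a small illustration of the binary encoding step, the sketch below runs LabelEncoder on a toy list standing in for the Gender column (the toy values are an assumption, not the real data). It shows that the classes are sorted alphabetically, which is why Female maps to 0 and Male to 1 in the table above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for data['Gender'] (illustrative only)
genders = ["Female", "Male", "Female", "Male", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the learned labels in sorted order: index 0 -> 'Female', index 1 -> 'Male'
class_names = list(encoder.classes_)
print(class_names)
print(list(codes))

# inverse_transform maps the integer codes back to the original labels
restored = list(encoder.inverse_transform(codes))
print(restored)
```

The fitted encoder is what gets pickled later, so the same mapping can be reapplied at prediction time.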


In [92]:
# For Geography (France, Spain, Germany), we cannot use LabelEncoder: with the assigned values (0, 1, 2), the ANN might treat a larger number as more important (for example, Spain = 2 over France = 0). In this case we can use One-Hot Encoding, assuming that there is no ordinal relationship in the data
# In general, for categorical data, OHE is preferred
from sklearn.preprocessing import OneHotEncoder
onehot_encoder_geo = OneHotEncoder()
geo_encoder = onehot_encoder_geo.fit_transform(data[['Geography']])
geo_encoder

<10000x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10000 stored elements in Compressed Sparse Row format>

#### Use One-Hot Encoding for non-binary categorical data using OneHotEncoder(), assuming that there is no ordinal relationship in the data (in general, OHE is preferred for categorical data)

In [93]:
# Get the list of features created by OHE
one_hot_encoded_features = onehot_encoder_geo.get_feature_names_out(['Geography'])
one_hot_encoded_features

array(['Geography_France', 'Geography_Germany', 'Geography_Spain'],
      dtype=object)

In [None]:
## Save the encoders in a pickle file
with open('label_encoder_gender.pkl', 'wb') as file:
    pickle.dump(label_encover_gender, file)

with open('onehot_encoder_geo.pkl', 'wb') as file:
    pickle.dump(onehot_encoder_geo, file)

#### Evaluate the unique feature set created by OHE, remove original column and concatenate new feature data

In [95]:
# Get the OHE data in data frame format, this would be concatenated in data later
geo_encoded_df = pd.DataFrame(geo_encoder.toarray(), columns = one_hot_encoded_features)
geo_encoded_df

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
...,...,...,...
9995,1.0,0.0,0.0
9996,1.0,0.0,0.0
9997,1.0,0.0,0.0
9998,0.0,1.0,0.0


In [96]:
# Remove 'Geography' column and add new OHE features
data = pd.concat([data.drop('Geography', axis=1), geo_encoded_df], axis=1)
# print from beginning
data.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.0,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.8,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.0,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,0.0,0.0,1.0


#### Divide the dataset into dependent (output) and independent (input) features

In [97]:
## Divide the dataset into dependent (output) and independent (input) features
X = data.drop('Exited', axis=1)
Y = data['Exited']

#### Based on dependent and independent data, generate training and testing dataset using train_test_split()

In [98]:
# Split the data into training and testing set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2 ,random_state=42) # => train_size=0.8
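
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on toy arrays (the `*_demo` names and values are illustrative, not the notebook's data). It shows the 80/20 split sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.array([0, 1] * 5)

# test_size=0.2 keeps 20% of the rows for testing, 80% for training
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The same random_state gives the same shuffle, hence the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te2)

print(X_tr.shape, X_te.shape)
print(same_split)
```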

#### Scale the data using StandardScaler()

In [99]:
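# Standardize the features (zero mean, unit variance) so that no feature dominates training because of its scale
scalar = StandardScaler()
X_train = scalar.fit_transform(X_train)
# Reuse the statistics learned from the training set; refitting on the test set would scale the two sets inconsistently
X_test = scalar.transform(X_test)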
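
A scaler is normally fitted on the training data only and then applied unchanged to the test data. The sketch below uses toy numbers (an assumption for illustration) to show the difference between reusing the training statistics and refitting on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # training mean is 2.0
test = np.array([[10.0], [20.0]])         # far outside the training range

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean/std from the training data
test_scaled = scaler.transform(test)         # reuses the training mean/std

# Refitting on the test set recentres it around its own mean,
# hiding the fact that these values are far from the training data
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())
```

The refitted version maps the test values to -1 and 1, while the correctly transformed values stay large, which is the information the model should actually see.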

#### Save the scaler in pickle file format

In [None]:
## Save the fitted scaler in pickle file format
with open('scaler.pkl', 'wb') as file:
    pickle.dump(scalar, file)

## ANN Implementation

In [101]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard
import datetime

In [102]:
X_train.shape[1]

12

#### Create a sequential ANN model using keras.models Sequential()

> Architecture
- Contains input of size 12
- Hidden Layer 1:
  - Perceptrons: 64
  - Activation Function: relu
  - input size:12
- Hidden Layer 2:
  - Perceptrons: 32
  - Activation Function: relu
  - input size:64
- Output Layer:
  - Perceptrons: 1
  - Activation Function: sigmoid
  - input size:32
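
The architecture above fixes the number of trainable parameters. As a quick check, the sketch below counts them by hand (the helper name `dense_params` is made up for this example): each dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the units of each layer
layer_sizes = [12, 64, 32, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)   # [832, 2080, 33] 2945
```

These numbers should match the "Param #" column that `model.summary()` prints for the same layers.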

In [103]:
# Create a sequential ANN model using keras.models Sequential()
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
]
)

model.summary()



In [104]:
import tensorflow

# Define your own optimizer (for backpropagation and weight update)
backpropagation_opt = tensorflow.keras.optimizers.Adam(learning_rate=0.01)

# Define your own loss function
loss_function = tensorflow.keras.losses.BinaryCrossentropy()

# Compile the model based on defined loss function and backpropagation optimizer
model.compile(optimizer=backpropagation_opt, loss=loss_function, metrics=['accuracy'])

# Using standard loss function and optimizer
#model.compile(optimizer='Adam', loss="binary_crossentropy", metrics=['accuracy'])
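
For intuition about the loss used here, binary cross-entropy can be computed by hand. The sketch below is a plain-Python version written for illustration; the function name and the clipping constant `eps` are assumptions, and it is not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# One positive sample predicted at 0.9 and one negative sample predicted at 0.2
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
print(round(loss, 4))   # 0.1643
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily.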

#### Setup Tensorboard to graphically monitor the training process
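- Pass the training log directory (log_dir) to the callback; TensorBoard reads the log files written there to plot the graphs
- histogram_freq: how often (in epochs) weight histograms are computed for the layers; 1 means every epoch, 0 disables them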


In [105]:
## Setup the tensor board
log_dir = "logs/fit" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorflow_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)
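
The log directory above is built by plain string concatenation, so the timestamp is appended directly to "logs/fit". A possible variation, sketched below, puts each run in its own subdirectory so that pointing TensorBoard at the parent directory shows all runs side by side. The helper name `make_log_dir` is hypothetical.

```python
import datetime
import os

def make_log_dir(root: str = "logs/fit") -> str:
    """Return a per-run log directory such as logs/fit/<YYYYmmdd-HHMMSS>."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    return os.path.join(root, stamp)

log_dir_example = make_log_dir()
print(log_dir_example)
```

With this layout, `%tensorboard --logdir logs/fit` would list every timestamped run under that directory.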

#### Setup early stopping
- stop the training process when the monitored metric stops improving (here, when val_loss no longer decreases)
- monitor : the metric to watch, here the loss on the validation data (val_loss)
- patience : number of consecutive epochs without improvement to tolerate before training is stopped
- restore_best_weights : after stopping, restore the weights from the epoch with the best monitored value
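
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras code; the function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

losses = [0.36, 0.35, 0.34, 0.35, 0.35, 0.36]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)   # 5 2
```

Here training stops at epoch index 5, after three epochs without improvement, and with restore_best_weights the model would go back to the weights from epoch index 2.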

In [106]:
## Setup early stopping
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Updated patience from 5 to 10 after analyzing Tensorboard data

In [107]:
## Train the model
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=100,
                    callbacks=[tensorflow_callback, early_stopping_callback]
                    )

Epoch 1/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.8239 - loss: 0.4334 - val_accuracy: 0.8540 - val_loss: 0.3587
Epoch 2/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8499 - loss: 0.3619 - val_accuracy: 0.8540 - val_loss: 0.3526
Epoch 3/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8624 - loss: 0.3378 - val_accuracy: 0.8530 - val_loss: 0.3449
Epoch 4/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8697 - loss: 0.3338 - val_accuracy: 0.8580 - val_loss: 0.3525
Epoch 5/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8622 - loss: 0.3423 - val_accuracy: 0.8535 - val_loss: 0.3463
Epoch 6/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8631 - loss: 0.3227 - val_accuracy: 0.8620 - val_loss: 0.3392
Epoch 7/100
...

In [108]:
## Load TensorBoard. The % prefix marks an IPython magic command (here, loading the TensorBoard notebook extension)
%reload_ext tensorboard
%tensorboard --logdir logs/fit20250209-201221

Launching TensorBoard...

In [109]:
## Save the model
model.save('test_model.h5')



#### Use the model for prediction

In [110]:
## Load the model, encoders, and scaler
from tensorflow.keras.models import load_model

prediction_model = load_model('test_model.h5')

# Load the encoders and scaler (file names must match the ones used when saving)
with open('onehot_encoder_geo.pkl', 'rb') as file:
    label_encoder_geo = pickle.load(file)

with open('label_encoder_gender.pkl', 'rb') as file:
    label_encoder_gender = pickle.load(file)

with open('scaler.pkl', 'rb') as file:
    scalar = pickle.load(file)



In [111]:
# Example input data
input_data = {
    'CreditScore': 100,
    'Geography': 'France',
    'Gender': 'Male',
    'Age': 60,
    'Tenure': 3,
    'Balance': 600,
    'NumOfProducts': 2,
    'HasCrCard': 1,
    'IsActiveMember': 0,
    'EstimatedSalary': 5000
}

#### Need to encode the data in the same way as it was done on the training data

In [112]:
## OHE of 'Geography' data
geo_encoded_input = label_encoder_geo.transform([[input_data['Geography']]]).toarray()
geo_encoded_df = pd.DataFrame(geo_encoded_input, columns=label_encoder_geo.get_feature_names_out(['Geography']))
geo_encoded_df



Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain
0,1.0,0.0,0.0


In [113]:
input_df=pd.DataFrame([input_data])
input_df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,100,France,Male,60,3,600,2,1,0,5000


In [114]:
## Encode categorical variables
input_df['Gender']=label_encoder_gender.transform(input_df['Gender'])
input_df


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,100,France,1,60,3,600,2,1,0,5000


In [115]:
## Concatenate the one-hot encoded columns
input_df=pd.concat([input_df.drop("Geography",axis=1),geo_encoded_df],axis=1)
input_df

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain
0,100,1,60,3,600,2,1,0,5000,1.0,0.0,0.0


In [116]:
## Scaling the input data
input_scaled=scalar.transform(input_df)
input_scaled

array([[-5.59108471,  0.90911166,  2.02494929, -0.69844549, -1.24633823,
         0.80510537,  0.63367318, -1.0502616 , -1.63125121,  0.98019606,
        -0.57581067, -0.56349184]])
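
The manual encode, concatenate, and scale steps above have to be repeated in exactly the same order for every new input. One possible alternative, sketched below on toy data, is to bundle the steps in a scikit-learn ColumnTransformer. This is not the notebook's method: the column subset, the toy values, and the choice to scale only the numeric columns are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy training frame with a subset of the columns (illustrative values)
train_df = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 42, 39],
})

# Each entry: (name, transformer, columns). sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    [
        ("geo", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
        ("gender", OrdinalEncoder(), ["Gender"]),
        ("num", StandardScaler(), ["CreditScore", "Age"]),
    ],
    sparse_threshold=0,
)
preprocess.fit(train_df)

# A new input goes through the identical steps with a single call
new_customer = pd.DataFrame(
    [{"CreditScore": 100, "Geography": "France", "Gender": "Male", "Age": 60}]
)
features = preprocess.transform(new_customer)
print(features.shape)
print(features[0, :4])
```

A single fitted object like this can be pickled once instead of saving each encoder and the scaler separately.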

#### Predict the probability of churn

In [117]:
## Predict the probability of churn with the reloaded model
prediction=prediction_model.predict(input_scaled)
prediction

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 87ms/step


array([[0.24189691]], dtype=float32)

In [118]:
prediction_proba = prediction[0][0]

if prediction_proba > 0.5:
    print('The customer is likely to churn.')
else:
    print('The customer is not likely to churn.')

The customer is not likely to churn.
