## ML Final Project

**Scenario:** You work at a multinational bank that is aiming to increase its market share in 
Europe. Recently, the number of customers using the bank's services has declined, 
and the bank is worried that existing customers have stopped 
using it as their main bank. <br> 

As a data scientist, you are tasked with finding the 
reasons behind customer churn (when a customer stops using the bank as their main bank) and with predicting which customers will churn. <br> 

The marketing team, 
in particular, is interested in your findings and wants to better understand existing 
customer behavior and, if possible, predict customer churn. Your results will help the 
marketing team use its budget wisely to target potential churners. To achieve 
this objective, in this exercise you will import the banking data (Churn_Modelling.csv) 
provided by the bank and apply machine learning to the problem.

Data dictionary

- CustomerId: Unique ID of each customer
- CredRate: Credit score of the customer
- Geography: Country the customer is from
- Gender
- Age
- Tenure: How long the customer has been with the bank
- Prod Number: Number of products the customer has with the bank
- HasCrCard: Whether the customer has a credit card (1 is yes)
- ActMem: Whether the customer is an active member (1 is yes)
- EstimatedSalary: Estimated annual salary of the customer
- Exited: Whether the customer has churned (1 is yes)

In [24]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb




** The bank is worried about existing customers leaving and has tasked me with analysing what may have contributed to the decline.
** 1. First, I will import the main data set and inspect it, find any missing values, and drop or fill the NaN entries (data preprocessing).

In [25]:
data = pd.read_csv("Churn_Modelling.csv")

In [26]:
data

Unnamed: 0,CustomerId,CredRate,Geography,Gender,Age,Tenure,Balance,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited
0,15634602,619,France,Female,42.0,2,0.00,1,1,1,101348.88,1
1,15647311,608,Spain,Female,41.0,1,83807.86,1,0,1,112542.58,0
2,15619304,502,France,Female,42.0,8,159660.80,3,1,0,113931.57,1
3,15701354,699,France,Female,39.0,1,0.00,2,0,0,93826.63,0
4,15737888,850,Spain,Female,43.0,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,15606229,771,France,Male,39.0,5,0.00,2,1,0,96270.64,0
9996,15569892,516,France,Male,35.0,10,57369.61,1,1,1,101699.77,0
9997,15584532,709,France,Female,36.0,7,0.00,1,0,1,42085.58,1
9998,15682355,772,Germany,Male,42.0,3,75075.31,2,1,0,92888.52,1


In [27]:
data.describe()

Unnamed: 0,CustomerId,CredRate,Age,Tenure,Balance,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited
count,10000.0,10000.0,9994.0,10000.0,10000.0,10000.0,10000.0,10000.0,9996.0,10000.0
mean,15690940.0,650.5288,38.925255,5.0128,76485.889288,1.5302,0.7055,0.5151,100074.744083,0.2037
std,71936.19,96.653299,10.489248,2.892174,62397.405202,0.581654,0.45584,0.499797,57515.774555,0.402769
min,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,50974.0775,0.0
50%,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100168.24,0.0
75%,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [28]:
df= data.drop(["CustomerId","Balance"],axis=1)

In [29]:
df

Unnamed: 0,CredRate,Geography,Gender,Age,Tenure,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited
0,619,France,Female,42.0,2,1,1,1,101348.88,1
1,608,Spain,Female,41.0,1,1,0,1,112542.58,0
2,502,France,Female,42.0,8,3,1,0,113931.57,1
3,699,France,Female,39.0,1,2,0,0,93826.63,0
4,850,Spain,Female,43.0,2,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39.0,5,2,1,0,96270.64,0
9996,516,France,Male,35.0,10,1,1,1,101699.77,0
9997,709,France,Female,36.0,7,1,0,1,42085.58,1
9998,772,Germany,Male,42.0,3,2,1,0,92888.52,1


** I decided to drop the "CustomerId" and "Balance" columns, as I judged that they may not be relevant to the later predictions.

In [30]:
df.isna().sum()

CredRate           0
Geography          0
Gender             4
Age                6
Tenure             0
Prod Number        0
HasCrCard          0
ActMem             0
EstimatedSalary    4
Exited             0
dtype: int64

** This summarises the missing values (NaN) in each column.
Only 14 values are missing (Gender 4, Age 6, EstimatedSalary 4) out of 10,000 rows, so I simply drop the rows that contain NaN.
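Before dropping rows, it can help to measure how much data the drop costs and to compare it with filling the gaps. The sketch below uses a small made-up frame (the values are illustrative, not taken from the bank data), and the imputation branch is an alternative that the notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Female", None, "Male", "Male", "Male"],
    "Age": [42.0, 50.0, np.nan, 35.0, 36.0],
    "EstimatedSalary": [101348.88, 10062.80, 80181.12, np.nan, 42085.58],
})

# Share of rows that contain at least one missing value.
rows_with_na = toy.isna().any(axis=1)
share_dropped = rows_with_na.mean()

# Option 1: drop incomplete rows, as the notebook does.
dropped = toy.dropna()

# Option 2 (alternative, not used in the notebook): fill numeric gaps
# with the median and categorical gaps with the most frequent value.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["EstimatedSalary"] = filled["EstimatedSalary"].fillna(
    filled["EstimatedSalary"].median()
)
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

print(f"share of rows with NaN: {share_dropped:.2f}")
print(f"rows kept after dropna: {len(dropped)} of {len(toy)}")
```

When the share of incomplete rows is as small as it is in this data set, dropping them is unlikely to change the results much; imputation matters more when many rows would be lost.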

In [31]:
(df[df.isna().any(axis=1)])


Unnamed: 0,CredRate,Geography,Gender,Age,Tenure,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited
6,822,France,,50.0,7,2,1,1,10062.8,0
10,528,France,Male,,6,2,0,0,80181.12,0
11,497,Spain,Male,,3,2,1,0,76390.01,0
12,476,France,Female,,10,2,1,0,26260.98,0
37,804,Spain,Male,,7,1,0,1,98453.45,0
38,850,France,Male,,7,1,1,1,40812.9,0
39,582,Germany,Male,,6,2,0,1,178074.04,0
74,519,France,Male,36.0,9,2,0,1,,0
77,678,France,Female,32.0,9,1,1,1,,0
87,729,France,Male,30.0,9,2,1,0,,0


** This displays the rows that contain at least one missing value (NaN).

In [32]:
df = df.dropna()
df

Unnamed: 0,CredRate,Geography,Gender,Age,Tenure,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited
0,619,France,Female,42.0,2,1,1,1,101348.88,1
1,608,Spain,Female,41.0,1,1,0,1,112542.58,0
2,502,France,Female,42.0,8,3,1,0,113931.57,1
3,699,France,Female,39.0,1,2,0,0,93826.63,0
4,850,Spain,Female,43.0,2,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39.0,5,2,1,0,96270.64,0
9996,516,France,Male,35.0,10,1,1,1,101699.77,0
9997,709,France,Female,36.0,7,1,0,1,42085.58,1
9998,772,Germany,Male,42.0,3,2,1,0,92888.52,1


In [33]:
df.Exited.value_counts()

0    7950
1    2036
Name: Exited, dtype: int64

** This shows how many customers exited (2,036) versus stayed (7,950), so roughly 20% of customers churned.
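The class counts also give a useful reference point for the accuracy figures reported later. The short calculation below uses only the two counts printed above; the "majority baseline" is the accuracy of a model that always predicts that the customer stayed.

```python
# Counts copied from the value_counts() output above.
stayed, exited = 7950, 2036
total = stayed + exited

churn_rate = exited / total
# Accuracy of a trivial model that always predicts "stayed" (class 0).
majority_baseline = stayed / total

print(f"churn rate: {churn_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because about 80% of customers stayed, an accuracy in the mid-80s is only a modest improvement over this baseline, which is why recall and F1 for the churn class are worth reading alongside accuracy.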

In [34]:
df = pd.get_dummies(df)
df

Unnamed: 0,CredRate,Age,Tenure,Prod Number,HasCrCard,ActMem,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42.0,2,1,1,1,101348.88,1,1,0,0,1,0
1,608,41.0,1,1,0,1,112542.58,0,0,0,1,1,0
2,502,42.0,8,3,1,0,113931.57,1,1,0,0,1,0
3,699,39.0,1,2,0,0,93826.63,0,1,0,0,1,0
4,850,43.0,2,1,1,1,79084.10,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39.0,5,2,1,0,96270.64,0,1,0,0,0,1
9996,516,35.0,10,1,1,1,101699.77,0,1,0,0,0,1
9997,709,36.0,7,1,0,1,42085.58,1,1,0,0,1,0
9998,772,42.0,3,2,1,0,92888.52,1,0,1,0,0,1


** The result after dummy (one-hot) encoding the categorical columns Geography and Gender.
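Full one-hot encoding produces one column per category, so `Gender_Female` and `Gender_Male` carry the same information. As a sketch on a made-up frame (not the bank data), `drop_first=True` removes one redundant column per categorical feature. Tree-based models are not harmed by the redundancy, so this is optional here.

```python
import pandas as pd

# Toy frame with the same two categorical columns as the churn data.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42.0, 41.0, 39.0, 36.0],
})

# Full encoding, as in the notebook: one column per category.
full = pd.get_dummies(toy, dtype=int)

# Reduced encoding: the first category of each feature is dropped
# and becomes the implicit reference level.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```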

In [35]:
X = df.drop('Exited', axis=1) # features

In [36]:
y = df.Exited # targets

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 111)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

** Preparing the features and target, splitting them into training and test sets, and scaling the features (the scaler is fitted on the training set only).
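With an imbalanced target, a purely random split can leave the training and test sets with slightly different churn rates. One option, sketched below on synthetic data, is to pass `stratify=y` to `train_test_split` so that both parts keep the same class ratio. The notebook itself does not stratify, so this is a suggestion rather than a description of what was run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with a 20% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# stratify=y keeps the 20/80 ratio in both the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=111, stratify=y
)

print(f"train positive share: {y_tr.mean():.3f}")
print(f"test positive share:  {y_te.mean():.3f}")
```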

## Random Forest

In [38]:
classifier = RandomForestClassifier(random_state = 111) 

classifier.fit(X_train, y_train)

RandomForestClassifier(random_state=111)

In [39]:
result_rf = classifier.predict(X_test)
print(classification_report(y_test , result_rf))
roc_auc_score(y_test, result_rf)

              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1977
           1       0.74      0.44      0.55       520

    accuracy                           0.85      2497
   macro avg       0.80      0.70      0.73      2497
weighted avg       0.84      0.85      0.84      2497



0.7011239835025873

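** The ROC AUC is about 0.70. Because it is computed from hard class predictions rather than predicted probabilities, it equals the mean of the two recalls (0.96 and 0.44) and is not the share of correct predictions; the accuracy is 0.85.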
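The difference between scoring hard labels and scoring probabilities can be shown with a small dependency-free sketch. `auc_from_scores` is a helper written for this illustration (a rank-based ROC AUC); the labels and probabilities are made up.

```python
def auc_from_scores(y_true, scores):
    """Rank-based ROC AUC: the chance that a random positive scores
    higher than a random negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.45, 0.7, 0.9]
hard = [int(p >= 0.5) for p in proba]  # what .predict() would return

auc_proba = auc_from_scores(y_true, proba)
auc_hard = auc_from_scores(y_true, hard)

# Recall of each class from the hard predictions.
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
recall_pos = sum(h for y, h in zip(y_true, hard) if y == 1) / n_pos
recall_neg = sum(1 - h for y, h in zip(y_true, hard) if y == 0) / n_neg
balanced_accuracy = (recall_pos + recall_neg) / 2

print(auc_proba, auc_hard, balanced_accuracy)
```

With hard labels the AUC collapses to the mean of the two recalls, which discards the ranking information in the probabilities. In the notebook, the corresponding change would be to pass `classifier.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of the output of `predict`.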

In [40]:
base_model_rf = RandomForestClassifier(random_state = 111)
param_dict_rf = {'n_estimators' : [10  , 20 , 50 , 100] , 
                 'max_depth' : [5 ,6 ,7 , 9 , 10]}
grid_model_rf = GridSearchCV(param_grid= param_dict_rf , 
                             estimator= base_model_rf , cv= 5 , verbose=1,n_jobs=-1) 
grid_model_rf.fit(X_train, y_train)
grid_model_rf.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   10.3s finished


{'max_depth': 9, 'n_estimators': 100}

** Using GridSearchCV to search for better hyperparameters; it recommends max_depth=9 and n_estimators=100.

In [41]:
classifier = RandomForestClassifier(random_state = 111, n_estimators = 9, max_depth=100) 

classifier.fit(X_train, y_train)
result_rf = classifier.predict(X_test)
print(classification_report(y_test , result_rf))
roc_auc_score(y_test, result_rf)

              precision    recall  f1-score   support

           0       0.87      0.93      0.90      1977
           1       0.64      0.46      0.54       520

    accuracy                           0.83      2497
   macro avg       0.76      0.70      0.72      2497
weighted avg       0.82      0.83      0.82      2497



0.6978410373137233

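** The result is almost the same as before (ROC AUC about 0.70, accuracy 0.83). Note that the cell above passes the two values swapped (n_estimators=9, max_depth=100) instead of the grid search result (n_estimators=100, max_depth=9), so these scores do not reflect the tuned model.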
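Retyping the best parameters by hand makes it easy to mix them up: the grid search reported `max_depth=9` and `n_estimators=100`, while the cell in `In [41]` uses `n_estimators=9` and `max_depth=100`. A minimal sketch of two ways to avoid retyping is given below, using synthetic data and a smaller grid than the notebook's so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=111)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=111),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
)
search.fit(X, y)

# Option 1: reuse the model that GridSearchCV already refitted on all
# of the training data with the best parameters (refit=True by default).
tuned = search.best_estimator_

# Option 2: build a fresh model by unpacking the parameter dictionary.
rebuilt = RandomForestClassifier(random_state=111, **search.best_params_)
rebuilt.fit(X, y)

print(search.best_params_)
```

Applied to the notebook, `grid_model_rf.best_estimator_.predict(X_test)` would evaluate the tuned forest directly.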

## KNN

In [42]:
base_knn = KNeighborsClassifier()
param_dict_knn =  {'n_neighbors' : range(1,10)
                  }
grid_model_knn = GridSearchCV(param_grid= param_dict_knn , 
                              estimator= base_knn, 
                              cv= 3, verbose=1, n_jobs=-1)
grid_model_knn.fit(X_train_scaled, y_train)  
grid_model_knn.best_params_

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    2.3s finished


{'n_neighbors': 9}

In [43]:
knn = KNeighborsClassifier(n_neighbors = 9)
knn.fit(X_train_scaled, y_train)
result_knn = knn.predict(X_test_scaled)
print(classification_report(y_test , result_knn))
roc_auc_score(y_test, result_knn)

              precision    recall  f1-score   support

           0       0.84      0.96      0.90      1977
           1       0.68      0.33      0.44       520

    accuracy                           0.83      2497
   macro avg       0.76      0.64      0.67      2497
weighted avg       0.81      0.83      0.80      2497



0.6422673242286292

** The ROC AUC is about 0.64 (accuracy 0.83), and recall for the churn class is only 0.33.
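Two details of the KNN search may be worth revisiting. The best value, `n_neighbors=9`, sits at the upper edge of the grid, so a wider range might find a better setting. Also, the scaler was fitted on the whole training set before cross-validation, so each validation fold influenced the scaling. The sketch below, on synthetic data, puts the scaler inside a `Pipeline` so that it is refitted within every fold; the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is a pipeline step, so each CV fold fits its own scaler
# on that fold's training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step>__<param>".
# The range is wider than in the notebook, so the optimum is less
# likely to land on the edge of the grid.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": range(1, 31, 2)},
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
```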

## XGBoost

In [44]:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', random_state =42)

clf_xgb.fit(X_train, y_train);

In [45]:
param_grid1 = {
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.1, 0.01, 0.05],
    'gamma': [0, 0.25, 1.0],
    'scale_pos_weight': [1, 2, 3, 4]  
}



model = GridSearchCV(param_grid= param_grid1, 
                             estimator= clf_xgb , cv= 5, verbose=1, n_jobs=-1)

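# Older XGBoost releases accept early_stopping_rounds and eval_metric in fit();
# newer releases expect them in the XGBClassifier constructor instead.
xgb_m = model.fit(X_train, y_train,
          # Stop adding trees once the validation AUC has not improved for 10 rounds.
          early_stopping_rounds=10,
          eval_metric='auc',   # early stopping is based on this metric
          # NOTE: the test set is used for early stopping here, so it is no longer
          # a fully held-out set; a separate validation split would avoid this.
          eval_set=[(X_test, y_test)],
          verbose=False)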

xgb_m.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   21.6s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   51.5s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  1.4min finished


{'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'scale_pos_weight': 1}
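When `scoring` is not set, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. On an imbalanced target this tends to favour settings that do not up-weight the minority class, which may be why `scale_pos_weight=1` was selected. The sketch below shows how a different metric can be requested. To stay free of the XGBoost dependency it swaps in `LogisticRegression` with `class_weight` as a stand-in for `XGBClassifier` with `scale_pos_weight`, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 80% negatives).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# class_weight plays the role that scale_pos_weight plays in XGBoost.
param_grid = {"class_weight": [None, "balanced", {0: 1, 1: 4}]}

# scoring="roc_auc" ranks candidates by AUC on predicted probabilities
# instead of by accuracy.
search_auc = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
search_auc.fit(X, y)

print(search_auc.best_params_, round(search_auc.best_score_, 3))
```

Other built-in scorer names such as `"recall"` or `"f1"` can be used in the same way if catching churners matters more than overall ranking quality.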

In [46]:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', random_state = 42, learning_rate = 0.1,max_depth=3)

clf_xgb.fit(X_train, y_train)
result_xgb = clf_xgb.predict(X_test)
print(classification_report(y_test , result_xgb))
roc_auc_score(y_test, result_xgb)

              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1977
           1       0.79      0.46      0.58       520

    accuracy                           0.86      2497
   macro avg       0.83      0.71      0.75      2497
weighted avg       0.85      0.86      0.85      2497



0.712407104781915

** The ROC AUC is about 0.71 (accuracy 0.86), the highest of the three models.

Conclusion: Of the three ML models tried, XGBoost gives the highest scores (ROC AUC 0.71, accuracy 0.86, churn-class F1 0.58), so I will choose it 
as my ML model.
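The comparison behind this conclusion can be laid out side by side. The numbers below are copied from the outputs above (class 1 is the churn class); the Random Forest row is the default model from `In [39]`, and the dictionary layout is just one convenient way to arrange them.

```python
# Metrics copied from the classification reports and ROC AUC outputs above.
reported = {
    "Random Forest": {"roc_auc": 0.7011, "accuracy": 0.85, "recall_1": 0.44, "f1_1": 0.55},
    "KNN": {"roc_auc": 0.6423, "accuracy": 0.83, "recall_1": 0.33, "f1_1": 0.44},
    "XGBoost": {"roc_auc": 0.7124, "accuracy": 0.86, "recall_1": 0.46, "f1_1": 0.58},
}

# Find the best model for each metric.
best_by_metric = {
    metric: max(reported, key=lambda name: reported[name][metric])
    for metric in ["roc_auc", "accuracy", "recall_1", "f1_1"]
}

for metric, name in best_by_metric.items():
    print(f"{metric:>9}: {name} ({reported[name][metric]})")
```

XGBoost leads on every metric listed, although its recall of 0.46 means that just over half of the actual churners in the test set are still missed.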