# ML Modeling

## Libraries

In [34]:
# main libraries
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# metrics
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, make_scorer, confusion_matrix

# ML classifier models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# model selection (CV)
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import randint

## Machine Learning

### Separating the DF into X and Y

In [2]:
bank = pd.read_csv("../data/bank_processed_data.csv", index_col=0)
bank.head()

Unnamed: 0,Customer_Age,Dependent_count,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,...,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Education_Level_encoded,Income_Category_encoded,Card_Category_encoded,x0_Married,x0_Single,x0_Unknown,x1_Existing Customer,x2_M
0,45,3,39,5,1,3,12691.0,777,1.335,1144,...,1.625,0.061,2.0,3.0,0.0,1.0,0.0,0.0,1.0,1.0
1,49,5,44,6,1,2,8256.0,864,1.541,1291,...,3.714,0.105,4.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2,51,3,36,4,1,0,3418.0,0,2.594,1887,...,2.333,0.0,4.0,4.0,0.0,1.0,0.0,0.0,1.0,1.0
3,40,4,34,3,4,1,3313.0,2517,1.405,1171,...,2.333,0.76,2.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,40,3,21,5,1,0,4716.0,0,2.175,816,...,2.5,0.0,1.0,3.0,0.0,1.0,0.0,0.0,1.0,1.0


In [86]:
# separating X and y
X = bank.drop(columns="x1_Existing Customer")
y = bank["x1_Existing Customer"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 42)

In [87]:
# Scaling the data

scaler = StandardScaler() # initialize the scaler

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Choosing the Models

The target column is **x1_Existing Customer**: **1** if the customer is still a customer and **0** otherwise. Since this is a binary (True/False) decision, classifiers are the appropriate family of supervised ML models.

I will start by fitting the different models without parameter tuning, to identify which ones perform best.

In [88]:
# Initializing the models

neigh = KNeighborsClassifier()
tree = DecisionTreeClassifier()
gradient = GradientBoostingClassifier()
RF = RandomForestClassifier()
adaboost = AdaBoostClassifier()
extra_tree = ExtraTreesClassifier()
support_vector = SVC(class_weight="balanced", probability=True)


models = [neigh, tree, RF, adaboost, gradient, extra_tree, support_vector]
model_names = ["KNeighbors", "DecisionTree", "RandomForest", "AdaBoost", 
               "GradientBoost", "ExtraTress", "SVC"]

The classes are not balanced: existing customers represent 83.8% of the sample, so accuracy is not a relevant metric here.

In this scenario I will focus mainly on *recall*, to ensure that the model finds the samples of each label, and on *precision*. The main focus is the label **0.0**, the customers who have already churned, because we want the model to predict possible future cases so that we can act before churn happens.

Last but not least, the *macro avg* will also be taken into account: we want the **0.0** samples to be classified correctly, but the results for **1.0** should be good too. The goal is to find the right balance between those metrics.
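To make the point about accuracy concrete, here is a small sketch that uses the class sizes of the 20% test split (325 churned and 1701 existing customers, the support values in the classification reports). It assumes a hypothetical baseline that labels every sample as an existing customer; the helper name `recall_for_label` is illustrative and not part of the notebook.

```python
# Class sizes of the test split (support column of the classification reports).
n_churn, n_customer = 325, 1701

# Ground truth: 0 = churned, 1 = existing customer.
y_true = [0] * n_churn + [1] * n_customer

# Hypothetical baseline: always predict "existing customer".
y_pred = [1] * len(y_true)


def recall_for_label(y_true, y_pred, label):
    """Share of samples with the given true label that were predicted as that label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
churn_recall = recall_for_label(y_true, y_pred, 0)
customer_recall = recall_for_label(y_true, y_pred, 1)
macro_recall = (churn_recall + customer_recall) / 2

print(f"accuracy={accuracy:.3f} churn_recall={churn_recall:.3f} macro_recall={macro_recall:.3f}")
```

The baseline reaches roughly 84% accuracy while finding none of the churned customers, which is why the per-class and macro-averaged metrics are more informative here.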

#### Finding the best classification model

In [89]:
time_to_train = []
macro_precision = []
macro_recall = []
macro_F1 = []
report_dict = []

for model in models:    
    start = time.time()
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    # metrics
    recall = recall_score(y_test, y_pred, average="macro")
    precision = precision_score(y_test, y_pred, average="macro")
    f1 = f1_score(y_test, y_pred, average="macro")
    
    clasf_report = classification_report(y_test, y_pred, target_names=["Churn", "Customer"])
    clasf_report_dict = classification_report(y_test, y_pred, output_dict=True)
    
    print(f"Classification Report of {model} | Precision {round(precision,2)} | Recall {round(recall,2)} | F1 {round(f1,2)}:")
    print(f"{clasf_report}")
    elapsed = time.time() - start  # covers fit, predict and metric computation
    print(f"Training time of {elapsed}\n")
    
    # appending to empty lists
    time_to_train.append(elapsed)
    macro_precision.append(round(precision,2))
    macro_recall.append(round(recall,2))
    macro_F1.append(round(f1,2))
    report_dict.append(clasf_report_dict)

Classification Report of KNeighborsClassifier() | Precision 0.86 | Recall 0.76 | F1 0.79:
              precision    recall  f1-score   support

       Churn       0.79      0.54      0.64       325
    Customer       0.92      0.97      0.94      1701

    accuracy                           0.90      2026
   macro avg       0.86      0.76      0.79      2026
weighted avg       0.90      0.90      0.90      2026

Training time of 0.6211016178131104

Classification Report of DecisionTreeClassifier() | Precision 0.88 | Recall 0.89 | F1 0.88:
              precision    recall  f1-score   support

       Churn       0.79      0.82      0.80       325
    Customer       0.97      0.96      0.96      1701

    accuracy                           0.94      2026
   macro avg       0.88      0.89      0.88      2026
weighted avg       0.94      0.94      0.94      2026

Training time of 0.06283164024353027

Classification Report of RandomForestClassifier() | Precision 0.95 | Recall 0.92 | F1 0.9

*IMPORTANT*

The `precision` is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively **the ability of the classifier not to label as positive a sample that is negative**.

The `recall` is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively **the ability of the classifier to find all the positive samples**. Note that in binary classification, recall of the positive class is also known as `sensitivity`; recall of the negative class is `specificity`.

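The priority in this work is the recall of the churn class (label **0.0**), that is, detecting as many churn cases as possible. Since churn is encoded as the negative class, this quantity is the **specificity** under the definition above rather than the sensitivity. The other metrics still matter as well.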
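Because the label encoding decides which ratio is which, the sketch below computes them by hand from a confusion matrix. The counts are hypothetical (chosen only so that the class sizes match the 325 / 1701 test split), and the positive class is assumed to be label 1 (existing customer), which is the default `pos_label` of scikit-learn's binary metrics.

```python
# Hypothetical confusion-matrix counts, with label 1 (existing customer) as the positive class.
tp = 1678  # customers predicted as customers
fn = 23    # customers predicted as churned
tn = 277   # churned predicted as churned
fp = 48    # churned predicted as customers

precision_customer = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # recall of the positive class (label 1)
specificity = tn / (tn + fp)  # recall of the negative class (label 0)

# The same matrix read from the point of view of the churn class (label 0).
precision_churn = tn / (tn + fn)
recall_churn = tn / (tn + fp)

print(f"recall_churn={recall_churn:.3f} specificity={specificity:.3f} sensitivity={sensitivity:.3f}")
```

With this encoding, the churn recall that the notebook tracks as `recall_0` is the same number as the specificity.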

In [90]:
precision_0 = []
recall_0 = []
f1_0 = []
precision_1 = []
recall_1 = []
f1_1 = []

for report in report_dict:
    # Info of churn label
    precision_0.append(round(report["0.0"]["precision"],2))
    recall_0.append(round(report["0.0"]["recall"],2))
    f1_0.append(round(report["0.0"]["f1-score"],2))
    
    # Info of current customers
    precision_1.append(round(report["1.0"]["precision"],2))
    recall_1.append(round(report["1.0"]["recall"],2))
    f1_1.append(round(report["1.0"]["f1-score"],2))

With all this information, I will create a DF to better visualize the different models and identify which ones give the best results for identifying churned customers.

In [91]:
best_models_DF = pd.DataFrame({"model":model_names,
                               "training_time":time_to_train,
                               "precision_macro":macro_precision,
                               "recall_macro":macro_recall,
                               "f1_macro":macro_F1,
                               "precision_0":precision_0,
                               "recall_0":recall_0,
                               "f1_0":f1_0,
                               "precision_1":precision_1,
                               "recall_1":recall_1,
                               "f1_1":f1_1
                               })

In [94]:
top3 = best_models_DF.sort_values(by=["f1_0"], ascending=False).reset_index(drop=True).iloc[:3]
top3

Unnamed: 0,model,training_time,precision_macro,recall_macro,f1_macro,precision_0,recall_0,f1_0,precision_1,recall_1,f1_1
0,GradientBoost,1.438149,0.95,0.92,0.93,0.92,0.85,0.89,0.97,0.99,0.98
1,RandomForest,0.823797,0.95,0.92,0.93,0.92,0.85,0.88,0.97,0.99,0.98
2,AdaBoost,0.411902,0.93,0.92,0.93,0.89,0.86,0.87,0.97,0.98,0.98


In [81]:
# check with Guillem
best_models_DF.sort_values(by=["f1_0"], ascending=False).reset_index(drop=True)

Unnamed: 0,model,training_time,precision_macro,recall_macro,f1_macro,precision_0,recall_0,f1_0,precision_1,recall_1,f1_1
0,GradientBoost,1.438178,0.95,0.91,0.93,0.94,0.84,0.89,0.97,0.99,0.98
1,RandomForest,0.839755,0.94,0.89,0.91,0.92,0.8,0.85,0.96,0.99,0.97
2,AdaBoost,0.407914,0.92,0.9,0.91,0.87,0.82,0.84,0.97,0.98,0.97
3,DecisionTree,0.058816,0.86,0.86,0.86,0.77,0.77,0.77,0.96,0.96,0.96
4,SVC,5.158185,0.83,0.91,0.86,0.68,0.89,0.77,0.98,0.92,0.95
5,ExtraTress,0.529571,0.92,0.82,0.86,0.91,0.64,0.75,0.93,0.99,0.96
6,KNeighbors,0.646728,0.86,0.73,0.77,0.81,0.49,0.61,0.91,0.98,0.94


The three models that best identify both the churned customers and the current customers are **GradientBoost**, **AdaBoost** and **RandomForest**.

Now that we know which models work best, let's tune them to improve their results.

### Tuning Models

#### GradientBoost

In [38]:
# start_time = time.time()

# gradient_params = {"loss":["deviance", "exponential"],
#                   "criterion":["friedman_mse", "mse", "mae"],
#                   "max_features":["auto", "sqrt", "log2"],
#                   "n_estimators":randint(low=50, high=300),
#                   "max_depth":randint(low=2, high=8),
#                   "max_leaf_nodes":randint(low=5, high=15)}

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# gradient_search = RandomizedSearchCV(gradient,
#                                      gradient_params,
#                                      n_iter=10,
#                                      n_jobs=-1,
#                                      cv=10,
#                                      scoring=scorers,
#                                      refit=False,
#                                      random_state=42
#                                     )

# gradient_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

--- 811.6379034519196 seconds ---


From the results obtained, I will create a DF to visualize the scores for each metric. After that, I will pick the parameters that performed best and pass them to GridSearchCV.
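With several scorers and `refit=False`, the search object does not expose `best_params_`, so the winning configuration has to be read from `cv_results_`. The sketch below shows one way to do that. It runs on synthetic stand-in data (an assumption, not the bank dataset) with a deliberately small search space so that it finishes quickly.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data (an assumption; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=42
)

scorers = {
    "recall_score": make_scorer(recall_score, average="macro"),
    "f1_score": make_scorer(f1_score, average="macro"),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"n_estimators": randint(low=10, high=40), "max_depth": randint(low=2, high=4)},
    n_iter=4,
    cv=3,
    scoring=scorers,
    refit=False,  # no refit, so the best candidate is looked up manually below
    random_state=42,
)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)

# Rank 1 marks the best candidate for the chosen metric.
best_row = results.sort_values("rank_test_f1_score").iloc[0]
best_params = best_row["params"]
print(best_params, round(best_row["mean_test_f1_score"], 3))
```

Sorting by a different `rank_test_*` column (for example the recall rank) selects the candidate that is best for that metric instead.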

In [43]:
gradient_results = pd.DataFrame(gradient_search.cv_results_)

gradient_results = gradient_results[["params", "mean_test_precision_score", "rank_test_precision_score",
                                     "mean_test_recall_score", "rank_test_recall_score",
                                     "mean_test_f1_score", "rank_test_f1_score"]]

gradient_results

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.926291,6,0.846002,6,0.877009,6
1,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.953618,4,0.90978,4,0.929682,4
2,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.910651,7,0.784373,8,0.829159,8
3,"{'criterion': 'mse', 'loss': 'exponential', 'm...",0.9603,3,0.927914,3,0.942923,3
4,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.962142,2,0.937456,2,0.949083,2
5,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.962713,1,0.938994,1,0.950104,1
6,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.909723,8,0.790855,7,0.831012,7
7,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.877599,10,0.684356,10,0.730526,10
8,"{'criterion': 'mae', 'loss': 'deviance', 'max_...",0.888624,9,0.747059,9,0.791634,9
9,"{'criterion': 'friedman_mse', 'loss': 'exponen...",0.947826,5,0.884469,5,0.912026,5


In [46]:
gradient_results["params"][5]

{'criterion': 'friedman_mse',
 'loss': 'deviance',
 'max_depth': 4,
 'max_features': 'log2',
 'max_leaf_nodes': 14,
 'n_estimators': 269}

In [210]:
start_time = time.time()

gradient_params = {"loss":["deviance"],
                  "criterion":["friedman_mse"],
                  "max_features":["log2"],
                  "n_estimators":[269],
                  "max_depth":[4],
                  "max_leaf_nodes":[14]}

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_gradient_search = GridSearchCV(gradient,
                               gradient_params,
                               n_jobs=-1,
                               cv=50,
                               scoring=scorers,
                               refit=False)

best_gradient_search.fit(X_train_scaled, y_train)

print("--- %s seconds ---" % (time.time() - start_time))

--- 15.260923147201538 seconds ---


After fitting the gradient-boosting model with the best parameters, we obtain the training time, which is considerably low. Next, we can create **best_gradient** and predict on the test set to obtain the overall *precision*, *recall* and *F1 Score* of the model.

The following models use the same structure.
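Since every parameter in the grid above has a single value, GridSearchCV evaluates exactly one candidate, so this step amounts to cross-validating a fixed model. A minimal sketch of the same idea with `cross_validate` follows. It uses synthetic stand-in data and smaller `cv` and `n_estimators` values than the notebook (all assumptions made to keep the example quick).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in data (an assumption; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=42
)

scorers = {
    "precision_score": make_scorer(precision_score, average="macro"),
    "recall_score": make_scorer(recall_score, average="macro"),
    "f1_score": make_scorer(f1_score, average="macro"),
}

# One fixed configuration, mirroring the shape of the tuned model.
model = GradientBoostingClassifier(
    max_features="log2", n_estimators=30, max_depth=4, max_leaf_nodes=14, random_state=42
)

cv_scores = cross_validate(model, X_demo, y_demo, cv=5, scoring=scorers)

# Each scorer yields one score per fold under the key "test_<scorer name>".
for name in scorers:
    fold_scores = cv_scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The per-fold spread gives a sense of how stable each metric is before the model is evaluated on the held-out test set.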

In [211]:
best_gradient = GradientBoostingClassifier(loss="deviance", criterion="friedman_mse",
                                          max_features="log2", n_estimators=269,
                                          max_depth=4, max_leaf_nodes=14)

In [212]:
best_gradient.fit(X_train_scaled, y_train)

GradientBoostingClassifier(max_depth=4, max_features='log2', max_leaf_nodes=14,
                           n_estimators=269)

In [213]:
y_pred_gradient = best_gradient.predict(X_test_scaled)

#### AdaBoost

In [96]:
# start_time = time.time()

# adaboost_params = {"algorithm":["SAMME", "SAMME.R"],
#                   "n_estimators":randint(low=10, high=200)
#                   }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# adaboost_search = RandomizedSearchCV(adaboost,
#                                      adaboost_params,
#                                      n_iter=10,
#                                      n_jobs=-1,
#                                      cv=10,
#                                      scoring=scorers,
#                                      refit=False,
#                                      random_state=42
#                                     )

# adaboost_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

--- 14.541921854019165 seconds ---


In [99]:
adaboost_results = pd.DataFrame(adaboost_search.cv_results_)

adaboost_results = adaboost_results[["params", "mean_test_precision_score", "rank_test_precision_score",
                                     "mean_test_recall_score", "rank_test_recall_score",
                                     "mean_test_f1_score", "rank_test_f1_score"]]

adaboost_results

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'algorithm': 'SAMME', 'n_estimators': 189}",0.936063,1,0.912258,2,0.923323,2
1,"{'algorithm': 'SAMME', 'n_estimators': 24}",0.906809,10,0.833152,10,0.863271,10
2,"{'algorithm': 'SAMME', 'n_estimators': 81}",0.930603,7,0.888706,8,0.907679,8
3,"{'algorithm': 'SAMME', 'n_estimators': 30}",0.911296,9,0.846614,9,0.873831,9
4,"{'algorithm': 'SAMME', 'n_estimators': 131}",0.933799,6,0.905646,4,0.918695,4
5,"{'algorithm': 'SAMME', 'n_estimators': 84}",0.928539,8,0.890884,7,0.907993,7
6,"{'algorithm': 'SAMME', 'n_estimators': 97}",0.933977,5,0.896619,6,0.913639,6
7,"{'algorithm': 'SAMME', 'n_estimators': 109}",0.934408,4,0.902094,5,0.916889,5
8,"{'algorithm': 'SAMME.R', 'n_estimators': 161}",0.934634,3,0.921834,1,0.927895,1
9,"{'algorithm': 'SAMME', 'n_estimators': 159}",0.935324,2,0.911031,3,0.9224,3


In [100]:
adaboost_results["params"][8]

{'algorithm': 'SAMME.R', 'n_estimators': 161}

In [214]:
start_time = time.time()

adaboost_params = {"algorithm":["SAMME.R"],
                  "n_estimators":[161]
                  }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_adaboost_search = GridSearchCV(adaboost,
                               adaboost_params,
                               n_jobs=-1,
                               cv=50,
                               scoring=scorers,
                               refit=False)

best_adaboost_search.fit(X_train_scaled, y_train)

print("--- %s seconds ---" % (time.time() - start_time))

--- 13.742919921875 seconds ---


In [215]:
best_adaboost = AdaBoostClassifier(algorithm="SAMME.R", n_estimators=161)

In [216]:
best_adaboost.fit(X_train_scaled, y_train)

AdaBoostClassifier(n_estimators=161)

In [217]:
y_pred_adaboost = best_adaboost.predict(X_test_scaled)

#### RandomForest

In [120]:
# start_time = time.time()

# RF_params = {"criterion":["gini", "entropy"],
#              "max_features":["auto", "sqrt", "log2"],
#              "class_weight":["balanced", "balanced_subsample"],          
#              "n_estimators":randint(low=10, high=400),
#              "max_depth":randint(low=2, high=20),
#              "min_samples_split":randint(low=2, high=40)
#             }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# RF_search = RandomizedSearchCV(RF,
#                                RF_params,
#                                n_iter=10,
#                                n_jobs=-1,
#                                cv=50,
#                                scoring=scorers,
#                                refit=False,
#                                random_state=42)

# RF_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

--- 178.9842643737793 seconds ---


In [121]:
RF_results = pd.DataFrame(RF_search.cv_results_)

RF_results = RF_results[["params", "mean_test_precision_score", "rank_test_precision_score",
                         "mean_test_recall_score", "rank_test_recall_score",
                         "mean_test_f1_score", "rank_test_f1_score"]]

RF_results

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'class_weight': 'balanced', 'criterion': 'ent...",0.937418,2,0.922512,6,0.928718,2
1,"{'class_weight': 'balanced', 'criterion': 'gin...",0.912491,5,0.934996,1,0.922076,5
2,"{'class_weight': 'balanced_subsample', 'criter...",0.789057,10,0.884323,10,0.821278,10
3,"{'class_weight': 'balanced_subsample', 'criter...",0.869792,8,0.926401,4,0.893274,8
4,"{'class_weight': 'balanced_subsample', 'criter...",0.909635,6,0.932632,3,0.91965,7
5,"{'class_weight': 'balanced', 'criterion': 'ent...",0.948699,1,0.907981,8,0.925835,3
6,"{'class_weight': 'balanced', 'criterion': 'gin...",0.90883,7,0.934426,2,0.919668,6
7,"{'class_weight': 'balanced', 'criterion': 'ent...",0.936011,3,0.925385,5,0.929491,1
8,"{'class_weight': 'balanced', 'criterion': 'ent...",0.935234,4,0.916906,7,0.924709,4
9,"{'class_weight': 'balanced_subsample', 'criter...",0.803499,9,0.892898,9,0.83512,9


In [128]:
RF_results["params"][6]

{'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': 10,
 'max_features': 'log2',
 'min_samples_split': 19,
 'n_estimators': 397}

In [218]:
start_time = time.time()

RF_params = {"class_weight":["balanced"],
             "criterion":["gini"],
             "max_features":["log2"],
             "max_depth":[10],
             "min_samples_split":[19],
             "n_estimators":[397]
            }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_RF_search = GridSearchCV(RF,
                              RF_params,
                              n_jobs=-1,
                              cv=50,
                              scoring=scorers,
                              refit=False)

best_RF_search.fit(X_train_scaled, y_train)

print("--- %s seconds ---" % (time.time() - start_time))

--- 30.310616731643677 seconds ---


In [219]:
best_RF = RandomForestClassifier(class_weight="balanced", criterion="gini", max_features="log2",
                                 max_depth=10, min_samples_split=19, n_estimators=397)

In [220]:
best_RF.fit(X_train_scaled, y_train)

RandomForestClassifier(class_weight='balanced', max_depth=10,
                       max_features='log2', min_samples_split=19,
                       n_estimators=397)

In [221]:
y_pred_RF = best_RF.predict(X_test_scaled)

### Analyzing the results

In [222]:
# Analyzing initial results of the models

# GradientBoost
gradient.fit(X_train_scaled, y_train)
y_first_pred_gradient = gradient.predict(X_test_scaled)

#AdaBoost
adaboost.fit(X_train_scaled, y_train)
y_first_pred_adaboost = adaboost.predict(X_test_scaled)

# RF
RF.fit(X_train_scaled, y_train)
y_first_pred_RF = RF.predict(X_test_scaled)

In [223]:
# Comparing GradientBoost
print(f"GradientBoost\nInitial:\n{pd.DataFrame(classification_report(y_test, y_first_pred_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing AdaBoost
print(f"AdaBoost\nInitial:\n{pd.DataFrame(classification_report(y_test, y_first_pred_adaboost, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_adaboost, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing RandomForest
print(f"RandomForest\nInitial:\n{pd.DataFrame(classification_report(y_test, y_first_pred_RF, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_RF, output_dict=True, target_names=['Churn', 'Customer']))}")

GradientBoost
Initial:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.923333     0.972190  0.964956     0.947762      0.964353
recall       0.852308     0.986479  0.964956     0.919393      0.964956
f1-score     0.886400     0.979282  0.964956     0.932841      0.964383
support    325.000000  1701.000000  0.964956  2026.000000   2026.000000

Modified:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.935897     0.980747   0.97384     0.958322      0.973552
recall       0.898462     0.988242   0.97384     0.943352      0.973840
f1-score     0.916797     0.984480   0.97384     0.950639      0.973623
support    325.000000  1701.000000   0.97384  2026.000000   2026.000000



AdaBoost
Initial:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.893891     0.972595  0.960513     0.933243      0.959969
recall       0.855385     0.980600  0.960513     0.917992      0.960513
f1-score 

Best model between **GradientBoost** (?)

### Changing train_test_split

This scenario uses the same three methods, as they performed best. The structure is the following:
* RandomizedSearchCV to find the best parameters. If the parameters are the same as with the original test_size, this cell will be removed.
* Obtain the best parameters from RandomizedSearchCV (in case they are different).
* Run the GridSearchCV.
* Compute the predictions for the different models / tests, to compare them later with the original test_size sample.
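The split, scale, fit and predict part of these steps can also be written as a loop over test sizes. The sketch below uses synthetic stand-in data and an untuned GradientBoostingClassifier; the helper `evaluate_split` is a hypothetical name, and the search steps are left out for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (an assumption; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.2, 0.8], random_state=42
)


def evaluate_split(X, y, test_size):
    """Split, scale with a scaler fitted on the training part only, fit, and score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    scaler = StandardScaler()  # a fresh scaler for every split
    X_tr_scaled = scaler.fit_transform(X_tr)
    X_te_scaled = scaler.transform(X_te)

    model = GradientBoostingClassifier(n_estimators=30, random_state=42)
    model.fit(X_tr_scaled, y_tr)
    y_pred = model.predict(X_te_scaled)

    return {
        "test_size": test_size,
        "recall_0": recall_score(y_te, y_pred, pos_label=0),
        "f1_macro": f1_score(y_te, y_pred, average="macro"),
    }


summary = [evaluate_split(X_demo, y_demo, size) for size in (0.1, 0.2, 0.3)]
for row in summary:
    print(row)
```

Creating the scaler inside the function avoids accidentally reusing one that was fitted on a different split.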

#### 10% Test Size

In [122]:
X_train_10, X_test_10, y_train_10, y_test_10 = train_test_split(X, y, test_size = 0.1, stratify=y ,random_state = 42)

In [123]:
# Scaling the data

scaler_10 = StandardScaler() # initialize the scaler

X_train_scaled_10 = scaler_10.fit_transform(X_train_10)
X_test_scaled_10 = scaler_10.transform(X_test_10)

##### GradientBoost

In [186]:
start_time = time.time()

gradient_params_10 = {"loss":["deviance"],
                      "criterion":["friedman_mse"],
                      "max_features":["log2"],
                      "n_estimators":[269],
                      "max_depth":[4],
                      "max_leaf_nodes":[14]}

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_gradient_search_10 = GridSearchCV(gradient,
                                       gradient_params_10,
                                       n_jobs=-1,
                                       cv=50,
                                       scoring=scorers,
                                       refit=False
                                      )

best_gradient_search_10.fit(X_train_scaled_10, y_train_10)

print("--- %s seconds ---" % (time.time() - start_time))

--- 17.027616024017334 seconds ---


In [187]:
best_gradient_10 = GradientBoostingClassifier(loss="deviance", criterion="friedman_mse",
                                              max_features="log2", n_estimators=269,
                                              max_depth=4, max_leaf_nodes=14)

In [188]:
best_gradient_10.fit(X_train_scaled_10, y_train_10)

GradientBoostingClassifier(max_depth=4, max_features='log2', max_leaf_nodes=14,
                           n_estimators=269)

In [189]:
y_pred_gradient_10 = best_gradient_10.predict(X_test_scaled_10)

##### AdaBoost

In [190]:
start_time = time.time()

adaboost_params_10 = {"algorithm":["SAMME.R"],
                      "n_estimators":[161]
                     }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_adaboost_search_10 = GridSearchCV(adaboost,
                                       adaboost_params_10,
                                       n_jobs=-1,
                                       cv=50,
                                       scoring=scorers,
                                       refit=False
                                      )

best_adaboost_search_10.fit(X_train_scaled_10, y_train_10)

print("--- %s seconds ---" % (time.time() - start_time))

--- 15.418502807617188 seconds ---


In [191]:
best_adaboost_10 = AdaBoostClassifier(algorithm="SAMME.R", n_estimators=161)

In [192]:
best_adaboost_10.fit(X_train_scaled_10, y_train_10)

AdaBoostClassifier(n_estimators=161)

In [193]:
y_pred_adaboost_10 = best_adaboost_10.predict(X_test_scaled_10)

##### RandomForest

In [145]:
start_time = time.time()

RF_params_10 = {"criterion":["gini", "entropy"],
                "max_features":["auto", "sqrt", "log2"],
                "class_weight":["balanced", "balanced_subsample"],
                "n_estimators":randint(low=10, high=400),
                "max_depth":randint(low=2, high=20),
                "min_samples_split":randint(low=2, high=40)
               }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

RF_search_10 = RandomizedSearchCV(RF,
                                  RF_params_10,
                                  n_iter=10,
                                  n_jobs=-1,
                                  cv=50,
                                  scoring=scorers,
                                  refit=False,
                                  random_state=42
                                 )

RF_search_10.fit(X_train_scaled_10, y_train_10)

print("--- %s seconds ---" % (time.time() - start_time))

--- 199.81752181053162 seconds ---


In [146]:
RF_results_10 = pd.DataFrame(RF_search_10.cv_results_)

RF_results_10 = RF_results_10[["params", "mean_test_precision_score", "rank_test_precision_score",
                               "mean_test_recall_score", "rank_test_recall_score",
                               "mean_test_f1_score", "rank_test_f1_score"]]

RF_results_10

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'class_weight': 'balanced', 'criterion': 'ent...",0.935569,3,0.924724,7,0.929239,3
1,"{'class_weight': 'balanced', 'criterion': 'gin...",0.913592,5,0.938146,1,0.924556,5
2,"{'class_weight': 'balanced_subsample', 'criter...",0.793462,10,0.888467,10,0.825799,10
3,"{'class_weight': 'balanced_subsample', 'criter...",0.871615,8,0.928083,4,0.895344,8
4,"{'class_weight': 'balanced_subsample', 'criter...",0.909588,6,0.93776,2,0.921974,6
5,"{'class_weight': 'balanced', 'criterion': 'ent...",0.949398,1,0.909672,8,0.927001,4
6,"{'class_weight': 'balanced', 'criterion': 'gin...",0.905827,7,0.934247,3,0.918355,7
7,"{'class_weight': 'balanced', 'criterion': 'ent...",0.934053,4,0.926703,5,0.929573,1
8,"{'class_weight': 'balanced', 'criterion': 'ent...",0.935842,2,0.925161,6,0.929427,2
9,"{'class_weight': 'balanced_subsample', 'criter...",0.802038,9,0.892698,9,0.833891,9


In [148]:
RF_results_10["params"][1]

{'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': 12,
 'max_features': 'log2',
 'min_samples_split': 25,
 'n_estimators': 382}

In [194]:
start_time = time.time()

RF_params_10 = {"class_weight":["balanced"],
                "criterion":["gini"],
                "max_features":["log2"],
                "max_depth":[12],
                "min_samples_split":[25],
                "n_estimators":[382]
               }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_RF_search_10 = GridSearchCV(RF,
                                 RF_params_10,
                                 n_jobs=-1,
                                 cv=50,
                                 scoring=scorers,
                                 refit=False
                                )

best_RF_search_10.fit(X_train_scaled_10, y_train_10)

print("--- %s seconds ---" % (time.time() - start_time))

--- 31.965221881866455 seconds ---


In [195]:
best_RF_10 = RandomForestClassifier(class_weight="balanced", criterion="gini", max_features="log2",
                                 max_depth=12, min_samples_split=25, n_estimators=382)

In [196]:
best_RF_10.fit(X_train_scaled_10, y_train_10)

RandomForestClassifier(class_weight='balanced', max_depth=12,
                       max_features='log2', min_samples_split=25,
                       n_estimators=382)

In [197]:
y_pred_RF_10 = best_RF_10.predict(X_test_scaled_10)

#### 30% Test Size

In [153]:
X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(X, y, test_size = 0.3, stratify=y ,random_state = 42)

In [154]:
# Scaling the data

scaler_30 = StandardScaler() # initialize the scaler

X_train_scaled_30 = scaler_30.fit_transform(X_train_30)
X_test_scaled_30 = scaler_30.transform(X_test_30)

##### GradientBoost

In [155]:
# start_time = time.time()

# gradient_params_30 = {"loss":["deviance", "exponential"],
#                       "criterion":["friedman_mse", "mse", "mae"],
#                       "max_features":["auto", "sqrt", "log2"],
#                       "n_estimators":randint(low=50, high=300),
#                       "max_depth":randint(low=2, high=8),
#                       "max_leaf_nodes":randint(low=5, high=15)
#                      }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# gradient_search_30 = RandomizedSearchCV(gradient,
#                                         gradient_params_30,
#                                         n_iter=10,
#                                         n_jobs=-1,
#                                         cv=10,
#                                         scoring=scorers,
#                                         refit=False,
#                                         random_state=42
#                                        )

# gradient_search_30.fit(X_train_scaled_30, y_train_30)

# print("--- %s seconds ---" % (time.time() - start_time))

--- 643.497720003128 seconds ---


In [156]:
gradient_results_30 = pd.DataFrame(gradient_search_30.cv_results_)

gradient_results_30 = gradient_results_30[["params", "mean_test_precision_score", "rank_test_precision_score",
                                           "mean_test_recall_score", "rank_test_recall_score",
                                           "mean_test_f1_score", "rank_test_f1_score"]]

gradient_results_30

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.932549,6,0.857462,6,0.889038,6
1,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.951894,4,0.901801,4,0.924365,4
2,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.921782,7,0.803202,7,0.846937,7
3,"{'criterion': 'mse', 'loss': 'exponential', 'm...",0.958854,3,0.92366,3,0.940043,3
4,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.960722,2,0.933591,1,0.946251,1
5,"{'criterion': 'friedman_mse', 'loss': 'devianc...",0.961395,1,0.931212,2,0.945293,2
6,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.919618,8,0.801238,8,0.842993,8
7,"{'criterion': 'mae', 'loss': 'exponential', 'm...",0.875599,10,0.68045,10,0.726741,10
8,"{'criterion': 'mae', 'loss': 'deviance', 'max_...",0.893381,9,0.747646,9,0.794207,9
9,"{'criterion': 'friedman_mse', 'loss': 'exponen...",0.94709,5,0.88293,5,0.910884,5


In [161]:
gradient_results_30["params"][4]

{'criterion': 'friedman_mse',
 'loss': 'deviance',
 'max_depth': 5,
 'max_features': 'sqrt',
 'max_leaf_nodes': 10,
 'n_estimators': 285}

In [198]:
start_time = time.time()

gradient_params_30 = {"loss":["deviance"],
                      "criterion":["friedman_mse"],
                      "max_features":["sqrt"],
                      "n_estimators":[285],
                      "max_depth":[5],
                      "max_leaf_nodes":[10]}

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_gradient_search_30 = GridSearchCV(gradient,
                                       gradient_params_30,
                                       n_jobs=-1,
                                       cv=50,
                                       scoring=scorers,
                                       refit=False
                                      )

best_gradient_search_30.fit(X_train_scaled_30, y_train_30)

print("--- %s seconds ---" % (time.time() - start_time))

--- 12.058336734771729 seconds ---


In [199]:
best_gradient_30 = GradientBoostingClassifier(loss="deviance", criterion="friedman_mse",
                                              max_features="sqrt", n_estimators=285,
                                              max_depth=5, max_leaf_nodes=10)

In [200]:
best_gradient_30.fit(X_train_scaled_30, y_train_30)

GradientBoostingClassifier(max_depth=5, max_features='sqrt', max_leaf_nodes=10,
                           n_estimators=285)

In [201]:
y_pred_gradient_30 = best_gradient_30.predict(X_test_scaled_30)

##### AdaBoost

In [166]:
# start_time = time.time()

# adaboost_params_30 = {"algorithm":["SAMME", "SAMME.R"],
#                       "n_estimators":randint(low=10, high=200)
#                      }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# adaboost_search_30 = RandomizedSearchCV(adaboost,
#                                         adaboost_params_30,
#                                         n_iter=10,
#                                         n_jobs=-1,
#                                         cv=10,
#                                         scoring=scorers,
#                                         refit=False,
#                                         random_state=42
#                                        )

# adaboost_search_30.fit(X_train_scaled_30, y_train_30)

# print("--- %s seconds ---" % (time.time() - start_time))

--- 13.467572689056396 seconds ---


In [167]:
adaboost_results_30 = pd.DataFrame(adaboost_search_30.cv_results_)

adaboost_results_30 = adaboost_results_30[["params", "mean_test_precision_score", "rank_test_precision_score",
                                           "mean_test_recall_score", "rank_test_recall_score",
                                           "mean_test_f1_score", "rank_test_f1_score"]]

adaboost_results_30

Unnamed: 0,params,mean_test_precision_score,rank_test_precision_score,mean_test_recall_score,rank_test_recall_score,mean_test_f1_score,rank_test_f1_score
0,"{'algorithm': 'SAMME', 'n_estimators': 189}",0.941839,1,0.912465,2,0.926109,1
1,"{'algorithm': 'SAMME', 'n_estimators': 24}",0.904759,10,0.832266,10,0.861997,10
2,"{'algorithm': 'SAMME', 'n_estimators': 81}",0.93307,7,0.895373,7,0.912438,7
3,"{'algorithm': 'SAMME', 'n_estimators': 30}",0.913258,9,0.845387,9,0.874004,9
4,"{'algorithm': 'SAMME', 'n_estimators': 131}",0.937733,2,0.905016,4,0.92009,3
5,"{'algorithm': 'SAMME', 'n_estimators': 84}",0.933742,5,0.891751,8,0.91069,8
6,"{'algorithm': 'SAMME', 'n_estimators': 97}",0.933657,6,0.899683,6,0.915225,6
7,"{'algorithm': 'SAMME', 'n_estimators': 109}",0.93477,4,0.900725,5,0.916334,5
8,"{'algorithm': 'SAMME.R', 'n_estimators': 161}",0.931817,8,0.917057,1,0.923947,2
9,"{'algorithm': 'SAMME', 'n_estimators': 159}",0.934959,3,0.905743,3,0.919298,4


In [170]:
adaboost_results_30["params"][0]

{'algorithm': 'SAMME', 'n_estimators': 189}

In [202]:
start_time = time.time()

adaboost_params_30 = {"algorithm":["SAMME"],
                      "n_estimators":[189]
                     }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_adaboost_search_30 = GridSearchCV(adaboost,
                                       adaboost_params_30,
                                       n_jobs=-1,
                                       cv=50,
                                       scoring=scorers,
                                       refit=False
                                      )

best_adaboost_search_30.fit(X_train_scaled_30, y_train_30)

print("--- %s seconds ---" % (time.time() - start_time))

--- 12.744485139846802 seconds ---


In [203]:
best_adaboost_30 = AdaBoostClassifier(algorithm="SAMME", n_estimators=189)

In [204]:
best_adaboost_30.fit(X_train_scaled_30, y_train_30)

AdaBoostClassifier(algorithm='SAMME', n_estimators=189)

In [205]:
y_pred_adaboost_30 = best_adaboost_30.predict(X_test_scaled_30)

##### RandomForest

In [206]:
start_time = time.time()

RF_params_30 = {"class_weight":["balanced"],
                "criterion":["gini"],
                "max_features":["log2"],
                "max_depth":[12],
                "min_samples_split":[25],
                "n_estimators":[382]
               }

scorers = {"precision_score": make_scorer(precision_score, average="macro"),
           "recall_score": make_scorer(recall_score, average="macro"),
           "f1_score": make_scorer(f1_score, average="macro")
          }

best_RF_search_30 = GridSearchCV(RF,
                                 RF_params_30,
                                 n_jobs=-1,
                                 cv=50,
                                 scoring=scorers,
                                 refit=False
                                )

best_RF_search_30.fit(X_train_scaled_30, y_train_30)

print("--- %s seconds ---" % (time.time() - start_time))

--- 24.18022394180298 seconds ---


In [207]:
best_RF_30 = RandomForestClassifier(class_weight="balanced", criterion="gini", max_features="log2",
                                    max_depth=12, min_samples_split=25, n_estimators=382)

In [208]:
best_RF_30.fit(X_train_scaled_30, y_train_30)

RandomForestClassifier(class_weight='balanced', max_depth=12,
                       max_features='log2', min_samples_split=25,
                       n_estimators=382)

In [209]:
y_pred_RF_30 = best_RF_30.predict(X_test_scaled_30)

#### New sample results vs. the original ones

In [224]:
# Comparing GradientBoost
print(f"GradientBoost\ntest_size=0.2:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.1:\n{pd.DataFrame(classification_report(y_test_10, y_pred_gradient_10, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.3:\n{pd.DataFrame(classification_report(y_test_30, y_pred_gradient_30, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing AdaBoost
print(f"AdaBoost\ntest_size=0.2:\n{pd.DataFrame(classification_report(y_test, y_pred_adaboost, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.1:\n{pd.DataFrame(classification_report(y_test_10, y_pred_adaboost_10, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.3:\n{pd.DataFrame(classification_report(y_test_30, y_pred_adaboost_30, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing RandomForest
print(f"RandomForest\ntest_size=0.2:\n{pd.DataFrame(classification_report(y_test, y_pred_RF, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.1:\n{pd.DataFrame(classification_report(y_test_10, y_pred_RF_10, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.3:\n{pd.DataFrame(classification_report(y_test_30, y_pred_RF_30, output_dict=True, target_names=['Churn', 'Customer']))}")

GradientBoost
test_size=0.2:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.935897     0.980747   0.97384     0.958322      0.973552
recall       0.898462     0.988242   0.97384     0.943352      0.973840
f1-score     0.916797     0.984480   0.97384     0.950639      0.973623
support    325.000000  1701.000000   0.97384  2026.000000   2026.000000

test_size=0.1:
                Churn    Customer  accuracy    macro avg  weighted avg
precision    0.921212    0.987028  0.976308     0.954120      0.976438
recall       0.932515    0.984706  0.976308     0.958611      0.976308
f1-score     0.926829    0.985866  0.976308     0.956347      0.976366
support    163.000000  850.000000  0.976308  1013.000000   1013.000000

test_size=0.3:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.935412     0.973745  0.968082     0.954579      0.967590
recall       0.860656     0.988632  0.968082     0.924644      0.968082
f1-score

* **`test_size` reduced**

Looking at the *Churn* customers, when the `test_size` is reduced the **recall** improves and the **precision** worsens. This does not happen with *AdaBoost*, which is why I will focus on the other two models.

Comparing *RandomForest* and *GradientBoost*, we can conclude that the best model is **GradientBoost**: the `recall` for the churned customers is the same for both models, but the `precision` drops significantly for *RandomForest*. The overall metrics for the *Customer* class are also better with *GradientBoost*.
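Reading the reports side by side is easier if only the *Churn* row is kept for each split. The helper below is a minimal sketch and not part of the original notebook: `churn_rows` is a hypothetical name, and it assumes each report is a dict shaped like the output of `classification_report(..., output_dict=True, target_names=['Churn', 'Customer'])`. The sample dicts hold the *GradientBoost* values printed above, rounded to four decimals.

```python
def churn_rows(reports, label="Churn"):
    """Return one (split, precision, recall, f1) tuple per report dict."""
    rows = []
    for split, report in reports.items():
        metrics = report[label]  # per-class entry of the report
        rows.append((split, metrics["precision"], metrics["recall"], metrics["f1-score"]))
    return rows


# Churn rows copied from the GradientBoost output above (rounded)
gradient_reports = {
    "test_size=0.2": {"Churn": {"precision": 0.9359, "recall": 0.8985, "f1-score": 0.9168, "support": 325}},
    "test_size=0.1": {"Churn": {"precision": 0.9212, "recall": 0.9325, "f1-score": 0.9268, "support": 163}},
}

for split, precision, recall, f1 in churn_rows(gradient_reports):
    print(f"{split}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In the notebook itself, the same function could presumably be fed the real reports, one per split and per model, instead of the hand-copied dicts.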

* **`test_size` increased**

On the other hand, when we increase the `test_size`, the `recall` for the *Churn* customers goes down for all three models. Although in some cases the `precision` improves, the overall `f1_score` shows that the results are worse with that sample, so we discard it.
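The F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed in each report. The sketch below does this for the *Churn* class of *GradientBoost* only, since those are the values visible in the output above. `f1_from` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of the Churn class for GradientBoost, as printed above
churn_gradient = {
    0.1: (0.921212, 0.932515),
    0.2: (0.935897, 0.898462),
    0.3: (0.935412, 0.860656),
}

f1_by_test_size = {size: f1_from(p, r) for size, (p, r) in churn_gradient.items()}

for size, f1 in sorted(f1_by_test_size.items()):
    print(f"test_size={size}: Churn F1 = {f1:.4f}")
```

Between `test_size=0.2` and `test_size=0.3` the precision barely moves while the recall drops, so the harmonic mean falls with it.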

* **conclusions**

Because the results improve with the smaller test sample, it is better to stay with `test_size = 0.1` rather than `0.2`. The comparison also shows that the results are better for the **GradientBoost** model.

We could keep reducing the test sample to see whether the numbers improve further, but that would not be good practice: with each reduction there are fewer test observations against which to judge whether the predictions are good.
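The caution about shrinking the test sample can be given a rough number. The sketch below treats recall as a binomial proportion over the true *Churn* cases in the test set, which is a simplifying assumption and not something the notebook computes. `recall_standard_error` is a hypothetical helper, and the inputs are the *GradientBoost* recall and support values printed above.

```python
import math


def recall_standard_error(recall, n_positive):
    """Binomial standard error of a recall estimated on n_positive true cases."""
    return math.sqrt(recall * (1 - recall) / n_positive)


# Churn recall and support for GradientBoost, as printed above
se_20 = recall_standard_error(0.898462, 325)  # test_size=0.2
se_10 = recall_standard_error(0.932515, 163)  # test_size=0.1

print(f"test_size=0.2: recall 0.898 +/- {se_20:.3f}")
print(f"test_size=0.1: recall 0.933 +/- {se_10:.3f}")
```

Under this approximation the smaller test set gives the noisier recall estimate, which is one way to see why reducing the sample further would make each comparison less reliable.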