# ML Modeling

## Libraries

In [17]:
# main libraries
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# metrics
from sklearn.metrics import (classification_report, recall_score, precision_score, f1_score,
                             confusion_matrix, accuracy_score, make_scorer)

# ML classifier models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from collections import Counter

# model selection (CV)
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, StratifiedKFold
from scipy.stats import randint

## Machine Learning

### Separating the DF into X and Y

In [2]:
bank = pd.read_csv("../data/bank_processed_data.csv", index_col=0)
bank.head()

| | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | ... | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Education_Level_encoded | Income_Category_encoded | Card_Category_encoded | x0_Married | x0_Single | x0_Unknown | x1_Existing Customer | x2_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | ... | 1.625 | 0.061 | 2.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | ... | 3.714 | 0.105 | 4.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.594 | 1887 | ... | 2.333 | 0.0 | 4.0 | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 40 | 4 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | ... | 2.333 | 0.76 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 40 | 3 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.175 | 816 | ... | 2.5 | 0.0 | 1.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [3]:
# separating X and y
X = bank.drop(columns="x1_Existing Customer")
y = bank["x1_Existing Customer"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 42)

In [4]:
# Scaling the data

scaler = StandardScaler() # initialize the scaler

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Choosing the Models

The target column is **x1_Existing Customer**: **1** if the person is still a customer and **0** otherwise. Because this is a binary (True/False) decision, the appropriate supervised models are classifiers.

I will start by testing several models without parameter tuning, to identify which ones perform best.

In [5]:
# Initializing the models

neigh = KNeighborsClassifier()
tree = DecisionTreeClassifier()
gradient = GradientBoostingClassifier()
RF = RandomForestClassifier()
adaboost = AdaBoostClassifier()
extra_tree = ExtraTreesClassifier()
support_vector = SVC()


models = [neigh, tree, RF, adaboost, gradient, extra_tree, support_vector]
model_names = ["KNeighbors", "DecisionTree", "RandomForest", "AdaBoost", 
               "GradientBoost", "ExtraTrees", "SVC"]

The classes are not balanced: current customers make up roughly 84% of the sample, so accuracy on its own is not informative here.

In this scenario I will focus on `recall`, to make sure the model finds the labels it should, together with `precision`. The main focus is the label **0.0**, the customers who have already churned: we want the model to flag likely future churners so that we can act before the churn happens.

Last but not least, `macro avg` will also be taken into consideration: the **0.0** class must be classified correctly, but the results for **1.0** should be good too. The goal is a reasonable balance between those metrics.
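To see why accuracy is a weak guide here, consider a hypothetical baseline that labels every row as a current customer. The sketch below uses the class counts of the 20% test split reported further down in this notebook (325 churned, 1701 current); the baseline itself is only an illustration and is not one of the models tested.

```python
# Class counts of the test split (the `support` row of the classification reports)
n_churn, n_customer = 325, 1701

# Hypothetical baseline: predict "current customer" (1.0) for every row
correct = n_customer                       # every customer is right, every churner is wrong
accuracy = correct / (n_churn + n_customer)

recall_churn = 0 / n_churn                 # no churner is ever flagged
recall_customer = n_customer / n_customer  # every customer is found
macro_recall = (recall_churn + recall_customer) / 2

print(f"accuracy     = {accuracy:.4f}")
print(f"recall churn = {recall_churn:.4f}")
print(f"macro recall = {macro_recall:.4f}")
```

The baseline looks acceptable on accuracy while finding no churner at all, which is what the per-class recall and the macro average expose.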

*NOTES*

The `precision` is the ratio TP / (TP + FP) where TP is the number of true positives and FP the number of false positives. The precision is intuitively **the ability of the classifier not to label as positive a sample that is negative**.

The `recall` is the ratio TP / (TP + FN) where TP is the number of true positives and FN the number of false negatives. The recall is intuitively **the ability of the classifier to find all the positive samples**. Note that in binary classification, recall of the positive class is also known as `sensitivity`; recall of the negative class is `specificity`.

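The most important goal for this work is to increase the recall of the churn class, that is, to detect as many churn cases as possible. Because churn is encoded as **0**, this value is the **sensitivity** only when churn is treated as the positive class; with **1** as the positive label it corresponds to the specificity. The other metrics remain important as well.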
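As a worked example of these ratios, the snippet below treats churn (label 0.0) as the positive class. The four counts are an assumption: they are back-calculated from the untuned GradientBoost scores on the 20% test split (churn precision 0.9233 and recall 0.8523, with 325 churned and 1701 current customers), not read from a confusion matrix in this notebook.

```python
# Counts with churn as the positive class (assumed, derived from the reported scores)
tp, fn = 277, 48      # churners flagged / churners missed      (277 + 48 = 325)
fp, tn = 23, 1678     # customers flagged as churn / customers kept (23 + 1678 = 1701)

precision = tp / (tp + fp)       # how many flagged rows are real churners
recall = tp / (tp + fn)          # sensitivity when churn is the positive class
specificity = tn / (tn + fp)     # recall of the other class (current customers)
macro_recall = (recall + specificity) / 2
accuracy = (tp + tn) / (tp + fn + fp + tn)

for name, value in [("precision", precision), ("recall", recall),
                    ("specificity", specificity), ("macro recall", macro_recall),
                    ("accuracy", accuracy)]:
    print(f"{name:<12} = {value:.4f}")
```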

#### Finding the best classification model

In [6]:
def top3_classifier_model(models):
    """
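    Input: Models to test (relies on the global X_train_scaled, X_test_scaled, y_train, y_test and model_names)
    Output: DF with the top 3 models, ranked by macro F1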
    """
    # first batch of empty lists    
    time_to_train = []
    accuracy = []
    macro_precision = []
    macro_recall = []
    macro_F1 = []
    report_dict = []

    for model in models:    
        start = time.time()
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)

        # metrics
        accuracy_ = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred, average="macro")
        precision = precision_score(y_test, y_pred, average="macro")
        f1 = f1_score(y_test, y_pred, average="macro")
        clasf_report_dict = classification_report(y_test, y_pred, output_dict=True)

        # appending to the lists (the elapsed time covers fit, predict and metric computation)
        time_to_train.append((time.time() - start))
        accuracy.append(round(accuracy_,4))
        macro_precision.append(round(precision,4))
        macro_recall.append(round(recall,4))
        macro_F1.append(round(f1,4))
        report_dict.append(clasf_report_dict)
    
    # second batch
    precision_0 = []
    recall_0 = []
    f1_0 = []
    precision_1 = []
    recall_1 = []
    f1_1 = []

    for report in report_dict:
        # Info of churn label
        precision_0.append(round(report["0.0"]["precision"],4))
        recall_0.append(round(report["0.0"]["recall"],4))
        f1_0.append(round(report["0.0"]["f1-score"],4))

        # Info of current customers
        precision_1.append(round(report["1.0"]["precision"],4))
        recall_1.append(round(report["1.0"]["recall"],4))
        f1_1.append(round(report["1.0"]["f1-score"],4))
        
    # creating DF
    best_models_DF = pd.DataFrame({"model":model_names,
                                   "training_time":time_to_train,
                                   "accuracy":accuracy,
                                   "precision_macro":macro_precision,
                                   "recall_macro":macro_recall,
                                   "f1_macro":macro_F1,
                                   "precision_0":precision_0,
                                   "recall_0":recall_0,
                                   "f1_0":f1_0,
                                   "precision_1":precision_1,
                                   "recall_1":recall_1,
                                   "f1_1":f1_1
                                  })
    
    # getting top 3 models
    top3 = best_models_DF.sort_values(by=["f1_macro"], ascending=False).reset_index(drop=True).iloc[:3]
    
    return top3

In [8]:
top3_classifier_model(models)

| | model | training_time | accuracy | precision_macro | recall_macro | f1_macro | precision_0 | recall_0 | f1_0 | precision_1 | recall_1 | f1_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GradientBoost | 1.475824 | 0.965 | 0.9478 | 0.9194 | 0.9328 | 0.9233 | 0.8523 | 0.8864 | 0.9722 | 0.9865 | 0.9793 |
| 1 | RandomForest | 0.866706 | 0.963 | 0.9461 | 0.9132 | 0.9287 | 0.9223 | 0.84 | 0.8792 | 0.9699 | 0.9865 | 0.9781 |
| 2 | AdaBoost | 0.38694 | 0.9605 | 0.9332 | 0.918 | 0.9254 | 0.8939 | 0.8554 | 0.8742 | 0.9726 | 0.9806 | 0.9766 |
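A more compact way to build this kind of ranking is sketched below. It is only an illustration on synthetic data: the `candidates` dictionary, the reduced model list and the `make_classification` stand-in for the bank data are assumptions. Pairing each name with its model in one dictionary avoids keeping two parallel lists in the same order, and `recall_score(..., pos_label=0)` returns the churn-class recall directly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (class 0 is the minority)
X_demo, y_demo = make_classification(n_samples=600, weights=[0.16, 0.84], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Name and model live together, so they cannot get out of sync
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "f1_macro": f1_score(y_te, y_hat, average="macro"),
        "recall_0": recall_score(y_te, y_hat, pos_label=0),  # recall of the minority class
    })

ranking = pd.DataFrame(rows).sort_values("f1_macro", ascending=False).reset_index(drop=True)
print(ranking)
```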


The three models that best classify both the churned and the current customers are **GradientBoost**, **AdaBoost** and **RandomForest**.

Now that we know which models work best, let's tune them to improve their results.

### Tuning Models

#### GradientBoost

In [63]:
# start_time = time.time()

# gradient_params = {"loss":["deviance", "exponential"],
#                   "criterion":["friedman_mse", "mse", "mae"],
#                   "max_features":["auto", "sqrt", "log2"],
#                   "n_estimators":randint(low=50, high=300),
#                   "max_depth":randint(low=2, high=8),
#                   "max_leaf_nodes":randint(low=5, high=15)
#                   }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# gradient_search = RandomizedSearchCV(gradient,
#                                      gradient_params,
#                                      n_iter=10,
#                                      n_jobs=-1,
#                                      cv=StratifiedKFold(),
#                                      scoring=scorers,
#                                      refit=False,
#                                      random_state=42
#                                     )

# gradient_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

Once the search has finished, I will put the results in a DF to see the scores for each metric. After that, I will take the parameters that performed best and pass values around them to GridSearchCV.
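The random-search step can be sketched in a form that runs in a few seconds. The synthetic data, the small parameter ranges and `n_iter=4` are assumptions chosen for speed and are not the values used in this notebook. The scorer names become part of the `cv_results_` column names (`mean_test_<name>`, `rank_test_<name>`).

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic, imbalanced data set (assumption, stands in for the bank data)
X_demo, y_demo = make_classification(n_samples=300, weights=[0.2, 0.8], random_state=42)

param_distributions = {
    "n_estimators": randint(low=10, high=40),
    "max_depth": randint(low=2, high=5),
    "max_features": ["sqrt", "log2"],
}

scorers = {
    "precision_score": make_scorer(precision_score, average="macro", zero_division=0),
    "recall_score": make_scorer(recall_score, average="macro"),
    "f1_score": make_scorer(f1_score, average="macro"),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=4,
    cv=StratifiedKFold(n_splits=3),
    scoring=scorers,
    refit=False,   # with several scorers, refit must be False or the name of one scorer
    random_state=42,
)
search.fit(X_demo, y_demo)

# One row per sampled parameter set, with the mean score of every metric
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_precision_score", "mean_test_recall_score",
     "mean_test_f1_score", "rank_test_f1_score"]
]
print(results.sort_values("rank_test_f1_score"))
```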

In [64]:
# gradient_results = pd.DataFrame(gradient_search.cv_results_)

# gradient_results = gradient_results[["params", "mean_test_precision_score", "rank_test_precision_score",
#                                      "mean_test_recall_score", "rank_test_recall_score",
#                                      "mean_test_f1_score", "rank_test_f1_score"]]

# gradient_results

In [65]:
# gradient_results["params"][4]

Once we know the best parameters for each model, we will create a function for each one of them that trains the model and returns its **y_pred**. Each **y_pred** will be stored in a variable so that we can later build the `classification_report` and `confusion_matrix`.

The structure will be the same for future models and variations.

In [62]:
def gradient_gridsearch(X_train, y_train):
    """
    Input: X_train and y_train for model training
    Output: Best estimator found by GridSearchCV
    
    """
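    # Inputting the different parameters obtained from the RandomizedSearchCV
    # (the value names follow the scikit-learn version used for this run; later
    # releases rename "deviance" to "log_loss" and "mse" to "squared_error")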
    gradient_params = {"loss":["deviance", "exponential"],
                       "criterion":["friedman_mse", "mse"],
                       "max_features":["log2","sqrt"],
                       "n_estimators":[260, 265, 270, 275, 280],
                       "max_depth":[3, 4, 5],
                       "max_leaf_nodes":[10, 12, 14, 16]
                      }

    # Running the best_gradient with GridSearchCV
    best_gradient = GridSearchCV(gradient,
                                 gradient_params,
                                 n_jobs=-1,
                                 cv=StratifiedKFold(),
                                 scoring="f1_macro"
                                )
    
    best_gradient.fit(X_train, y_train)
    
    best = best_gradient.best_estimator_
    
    return best

In [63]:
gradient_gridsearch(X_train_scaled, y_train)

GradientBoostingClassifier(max_depth=5, max_features='log2', max_leaf_nodes=14,
                           n_estimators=260)

In [64]:
def gradient_trainer(X_train, y_train, X_test, loss, criterion, max_features, n_estimators, max_depth, max_leaf_nodes):
    """
    Input: Train/test data and the GradientBoostingClassifier parameters selected with GridSearchCV
    Output: y_pred
    """
    # Creating the best_gradient 
    best_gradient = GradientBoostingClassifier(loss=loss, criterion=criterion,
                                               max_features=max_features, n_estimators=n_estimators,
                                               max_depth=max_depth, max_leaf_nodes=max_leaf_nodes
                                              )
    
    # Model fit
    best_gradient.fit(X_train, y_train)
    
    # Model predict
    y_pred = best_gradient.predict(X_test)
    
    return y_pred

In [65]:
y_pred_gradient = gradient_trainer(X_train_scaled,
                                   y_train,
                                   X_test_scaled,
                                   "deviance",
                                   "friedman_mse",
                                   "log2",
                                   265,
                                   5,
                                   14
                                  )
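As an alternative to copying the selected parameters by hand, the fitted search object can be reused directly. The sketch below is a minimal illustration on synthetic data with a deliberately tiny grid; the grid values and variable names are assumptions. With the default `refit=True`, `GridSearchCV` retrains the best configuration on the whole training set and exposes it as `best_estimator_`, so the parameters used for prediction are exactly the ones reported in `best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the bank data
X_demo, y_demo = make_classification(n_samples=400, weights=[0.2, 0.8], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"n_estimators": [20, 40], "max_depth": [2, 3]},   # tiny grid, for speed only
    cv=StratifiedKFold(n_splits=3),
    scoring="f1_macro",
)
grid.fit(X_tr, y_tr)

print(grid.best_params_)

# refit=True is the default, so best_estimator_ is already trained on all of X_tr
y_hat = grid.best_estimator_.predict(X_te)

# Rows are the true classes, columns the predicted classes, in the order given by `labels`
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
print(cm)
print(classification_report(y_te, y_hat, target_names=["Churn", "Customer"]))
```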

#### AdaBoost

In [13]:
# start_time = time.time()

# adaboost_params = {"algorithm":["SAMME", "SAMME.R"],
#                   "n_estimators":randint(low=10, high=200)
#                   }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# adaboost_search = RandomizedSearchCV(adaboost,
#                                      adaboost_params,
#                                      n_iter=10,
#                                      n_jobs=-1,
#                                      cv=10,
#                                      scoring=scorers,
#                                      refit=False,
#                                      random_state=42
#                                     )

# adaboost_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

In [14]:
# adaboost_results = pd.DataFrame(adaboost_search.cv_results_)

# adaboost_results = adaboost_results[["params", "mean_test_precision_score", "rank_test_precision_score",
#                                      "mean_test_recall_score", "rank_test_recall_score",
#                                      "mean_test_f1_score", "rank_test_f1_score"]]

# adaboost_results

In [15]:
# adaboost_results["params"][8]

In [42]:
def adaboost_gridsearch(X_train, y_train):
    """
    Input: X_train and y_train for model training
    Output: Best estimator found by GridSearchCV
    
    """
    # Inputting the different parameters obtained from the RandomizedSearchCV
    adaboost_params = {"algorithm":["SAMME", "SAMME.R"],
                       "n_estimators":[155, 158, 160, 161, 164]
                      }

    # Running the best_adaboost with GridSearchCV
    best_adaboost = GridSearchCV(adaboost,
                                 adaboost_params,
                                 n_jobs=-1,
                                 cv=StratifiedKFold(),
                                 scoring="f1_macro"
                                )
    
    best_adaboost.fit(X_train, y_train)
    
    best = best_adaboost.best_estimator_
    
    return best

In [43]:
adaboost_gridsearch(X_train_scaled, y_train)

AdaBoostClassifier(n_estimators=155)

In [47]:
def adaboost_trainer(X_train, y_train, X_test, algorithm, n_estimators):
    """
    Input: Train/test data and the AdaBoostClassifier parameters selected with GridSearchCV
    Output: y_pred
    """
    # Creating the best_adaboost 
    best_adaboost = AdaBoostClassifier(algorithm=algorithm, n_estimators=n_estimators)
    
    # Model fit
    best_adaboost.fit(X_train, y_train)
    
    # Model predict
    y_pred = best_adaboost.predict(X_test)
    
    return y_pred

In [48]:
y_pred_adaboost = adaboost_trainer(X_train_scaled,
                                   y_train,
                                   X_test_scaled,
                                   "SAMME.R",
                                   155
                                  )

#### RandomForest

In [18]:
# start_time = time.time()

# RF_params = {"criterion":["gini", "entropy"],
#              "max_features":["auto", "sqrt", "log2"],
#              "class_weight":["balanced", "balanced_subsample"],          
#              "n_estimators":randint(low=10, high=400),
#              "max_depth":randint(low=2, high=20),
#              "min_samples_split":randint(low=2, high=40)
#             }

# scorers = {"precision_score": make_scorer(precision_score, average="macro"),
#            "recall_score": make_scorer(recall_score, average="macro"),
#            "f1_score": make_scorer(f1_score, average="macro")
#           }

# RF_search = RandomizedSearchCV(RF,
#                                RF_params,
#                                n_iter=10,
#                                n_jobs=-1,
#                                cv=50,
#                                scoring=scorers,
#                                refit=False,
#                                random_state=42)

# RF_search.fit(X_train_scaled, y_train)

# print("--- %s seconds ---" % (time.time() - start_time))

In [19]:
# RF_results = pd.DataFrame(RF_search.cv_results_)

# RF_results = RF_results[["params", "mean_test_precision_score", "rank_test_precision_score",
#                          "mean_test_recall_score", "rank_test_recall_score",
#                          "mean_test_f1_score", "rank_test_f1_score"]]

# RF_results

In [20]:
# RF_results["params"][6]

In [50]:
def RF_gridsearch(X_train, y_train):
    """
    Input: X_train and y_train for model training
    Output: Best estimator found by GridSearchCV
    
    """
    # Inputting the different parameters obtained from the RandomizedSearchCV    
    RF_params = {"criterion":["gini", "entropy"],
                 "max_features":["auto", "sqrt", "log2"],
                 "class_weight":["balanced", "balanced_subsample"],               
                 "n_estimators":[390, 395, 400, 405],
                 "max_depth":[8, 10, 12],
                 "min_samples_split":[16, 18, 20, 22]
                }

    # Running the best_RF with GridSearchCV
    best_RF = GridSearchCV(RF,
                           RF_params,
                           n_jobs=-1,
                           cv=StratifiedKFold(),
                           scoring="f1_macro"
                          )
    
    best_RF.fit(X_train, y_train)
    
    best = best_RF.best_estimator_
    
    return best

In [51]:
RF_gridsearch(X_train_scaled, y_train)

RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=12, min_samples_split=18, n_estimators=390)

In [52]:
def RF_trainer(X_train, y_train, X_test, class_weight, criterion, max_features, max_depth, min_samples_split, n_estimators):
    """
    Input: Train/test data and the RandomForestClassifier parameters selected with GridSearchCV
    Output: y_pred
    """
    # Creating the best_RF
    best_RF = RandomForestClassifier(class_weight=class_weight, criterion=criterion, max_features=max_features,
                                     max_depth=max_depth, min_samples_split=min_samples_split, n_estimators=n_estimators)
    
    # Model fit
    best_RF.fit(X_train, y_train)
    
    # Model predict
    y_pred = best_RF.predict(X_test)
    
    return y_pred

In [53]:
y_pred_RF = RF_trainer(X_train_scaled,
                       y_train, 
                       X_test_scaled, 
                       "balanced", 
                       "entropy", 
                       "auto", 
                       12, 
                       18, 
                       390
                      )

### Analyzing the results

In [66]:
def initial_predictions(X_train, y_train, X_test):
    """
    Input: Variables for training a model
    Output: y_pred for three different models    
    """
    #GradientBoost
    gradient.fit(X_train, y_train)
    y_pred_initial_gradient = gradient.predict(X_test)
    
    #AdaBoost
    adaboost.fit(X_train, y_train)
    y_pred_initial_adaboost = adaboost.predict(X_test)
    
    # RF
    RF.fit(X_train, y_train)
    y_pred_initial_RF = RF.predict(X_test)
    
    return y_pred_initial_gradient, y_pred_initial_adaboost, y_pred_initial_RF

In [67]:
y_pred_initial_gradient, y_pred_initial_adaboost, y_pred_initial_RF = initial_predictions(X_train_scaled, 
                                                                                          y_train, 
                                                                                          X_test_scaled
                                                                                         )

In [68]:
# Comparing GradientBoost
print(f"GradientBoost\nInitial:\n{pd.DataFrame(classification_report(y_test, y_pred_initial_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing AdaBoost
print(f"AdaBoost\nInitial:\n{pd.DataFrame(classification_report(y_test, y_pred_initial_adaboost, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_adaboost, output_dict=True, target_names=['Churn', 'Customer']))}\n\n\n")

# Comparing RandomForest
print(f"RandomForest\nInitial:\n{pd.DataFrame(classification_report(y_test, y_pred_initial_RF, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Modified:\n{pd.DataFrame(classification_report(y_test, y_pred_RF, output_dict=True, target_names=['Churn', 'Customer']))}")

GradientBoost
Initial:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.923333     0.972190  0.964956     0.947762      0.964353
recall       0.852308     0.986479  0.964956     0.919393      0.964956
f1-score     0.886400     0.979282  0.964956     0.932841      0.964383
support    325.000000  1701.000000  0.964956  2026.000000   2026.000000

Modified:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.932692     0.980163  0.972853     0.956428      0.972548
recall       0.895385     0.987654  0.972853     0.941519      0.972853
f1-score     0.913658     0.983895  0.972853     0.948776      0.972628
support    325.000000  1701.000000  0.972853  2026.000000   2026.000000



AdaBoost
Initial:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.893891     0.972595  0.960513     0.933243      0.959969
recall       0.855385     0.980600  0.960513     0.917992      0.960513
[output truncated]

Looking at the `classification_report`, we will focus on the **Churn** column. At first sight, the best models are `GradientBoost` and `RandomForest`. Let's compare them in more detail:
1. **`GradientBoost`**: The `precision` is 0.9327, the `recall` is 0.8954, the `f1-score` is 0.9137 and the `accuracy` is 0.9729.
2. **`RandomForest`**: The `precision` is 0.8521, the `recall` is 0.9046, the `f1-score` is 0.8776 and the `accuracy` is 0.9595.

Although the `recall` is slightly higher for `RandomForest`, its `precision` is much lower than that of `GradientBoost`. Since the `f1-score` combines both metrics into a single, more conservative value and is clearly higher for `GradientBoost`, we can conclude that **`GradientBoost`** is the best model overall.
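The `f1-score` is the harmonic mean of `precision` and `recall`, so it is pulled towards the lower of the two. The short check below recomputes it for the Churn class of both models; the GradientBoost inputs are the unrounded values from the report above, and the RandomForest inputs are the rounded values quoted in the list.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Churn-class precision and recall
f1_gradient = f1_from(0.932692, 0.895385)   # GradientBoost
f1_forest = f1_from(0.8521, 0.9046)         # RandomForest

print(f"GradientBoost f1 = {f1_gradient:.4f}")
print(f"RandomForest  f1 = {f1_forest:.4f}")
```

The small recall advantage of RandomForest does not compensate for its much lower precision, which is why its harmonic mean ends up lower.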

For further testing, we will focus only on that model.

### Changing train_test_split

As we already ran the initial `RandomizedSearchCV` to find the value ranges where `GridSearchCV` should focus, it is not necessary to repeat that step.

What we will do is the following:
* Repeat the split with two other test sizes, **0.1** and **0.3**.
* Obtain the best parameters from the grid search and see whether they differ from those of the previous model.
* Train the models with the new samples.
* Compare the new `classification_report` with the previous one and see if there is an improvement. **If the model improves, we will use the best one for further analysis**.
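The steps above can be sketched as a single loop. This is a simplified illustration on synthetic data: it skips the grid search, uses a small `GradientBoostingClassifier`, and the names `split_scaler` and `scores` are assumptions. Each split gets its own scaler, fitted on the training part only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (class 0 is the minority)
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.16, 0.84], random_state=42)

scores = {}
for test_size in (0.1, 0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, stratify=y_demo, random_state=42
    )
    split_scaler = StandardScaler()              # a fresh scaler for every split
    X_tr_s = split_scaler.fit_transform(X_tr)    # fitted on the training part only
    X_te_s = split_scaler.transform(X_te)        # the test part reuses those statistics

    model = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_tr_s, y_tr)
    y_hat = model.predict(X_te_s)
    scores[test_size] = {
        "n_test": len(y_te),
        "recall_0": recall_score(y_te, y_hat, pos_label=0),
        "f1_macro": f1_score(y_te, y_hat, average="macro"),
    }

for test_size, row in scores.items():
    print(test_size, row)
```

Keep in mind that each test size is scored on a different set of rows, so small differences between the reports partly reflect the rows that happened to land in each test set.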

#### 10% Test Size

In [79]:
X_train_10, X_test_10, y_train_10, y_test_10 = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

In [80]:
# Scaling the data
X_train_scaled_10 = scaler.fit_transform(X_train_10)
X_test_scaled_10 = scaler.transform(X_test_10)

In [81]:
# Obtaining best params
gradient_gridsearch(X_train_scaled_10, y_train_10)

GradientBoostingClassifier(criterion='mse', max_depth=5, max_features='sqrt',
                           max_leaf_nodes=16, n_estimators=265)

In [82]:
# Storing y_pred
y_pred_gradient_10 = gradient_trainer(X_train_scaled_10,
                                      y_train_10, 
                                      X_test_scaled_10, 
                                      "deviance", 
                                      "mse", 
                                      "sqrt", 
                                      265, 
                                      5, 
                                      16
                                     )

#### 30% Test Size

In [83]:
X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [84]:
# Scaling the data
X_train_scaled_30 = scaler.fit_transform(X_train_30)
X_test_scaled_30 = scaler.transform(X_test_30)

In [85]:
# Obtaining best params
gradient_gridsearch(X_train_scaled_30, y_train_30)

GradientBoostingClassifier(max_depth=5, max_features='log2', max_leaf_nodes=14,
                           n_estimators=270)

In [86]:
# Storing y_pred
y_pred_gradient_30 = gradient_trainer(X_train_scaled_30,
                                      y_train_30,
                                      X_test_scaled_30,
                                      "deviance",
                                      "friedman_mse",
                                      "log2",
                                      270,
                                      5,
                                      14
                                     )

#### New sample results VS Previous sample

In [87]:
# Comparing GradientBoost
print(f"GradientBoost\ntest_size=0.2:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.1:\n{pd.DataFrame(classification_report(y_test_10, y_pred_gradient_10, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"test_size=0.3:\n{pd.DataFrame(classification_report(y_test_30, y_pred_gradient_30, output_dict=True, target_names=['Churn', 'Customer']))}")

GradientBoost
test_size=0.2:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.932692     0.980163  0.972853     0.956428      0.972548
recall       0.895385     0.987654  0.972853     0.941519      0.972853
f1-score     0.913658     0.983895  0.972853     0.948776      0.972628
support    325.000000  1701.000000  0.972853  2026.000000   2026.000000

test_size=0.1:
                Churn    Customer  accuracy    macro avg  weighted avg
precision    0.914634    0.984688  0.973346     0.949661      0.973416
recall       0.920245    0.983529  0.973346     0.951887      0.973346
f1-score     0.917431    0.984108  0.973346     0.950770      0.973379
support    163.000000  850.000000  0.973346  1013.000000   1013.000000

test_size=0.3:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.944072     0.974537  0.970056     0.959304      0.969645
recall       0.864754     0.990200  0.970056     0.927477      0.970056
[output truncated]

* **`test_size` reduced**

Looking at the *Churn* customers, when the `test_size` is reduced the **recall** improves and the **precision** gets worse. This doesn't happen with *AdaBoost*, which is why I will focus on the other models.

Comparing *RandomForest* and *GradientBoost*, we can conclude that the best model is **GradientBoost**: the `recall` for the churned customers is the same on both models, but the `precision` drops significantly on *RandomForest*. It is also worth mentioning that the overall metrics for the *Customers* are better on *GradientBoost* too.

* **`test_size` increased**

On the other hand, when we increase the `test_size`, the `recall` for the *Churn* customers goes down on all three models. Although in some cases the `precision` improves, the overall `f1-score` shows that the results are worse with that sample, so we discard it.

* **conclusions**

As the results improve with the smaller test set, it is better to stay with **`test_size = 0.1`** rather than `0.2`.

We could keep reducing the test set to see if the numbers improve further, but that wouldn't be good practice: the fewer rows the test set holds, the higher the variance of the measured metrics, so an apparent improvement could be noise and we would risk overfitting our choices to one particular split.
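A rough way to put a number on that variance is the binomial standard error of the churn recall. This is an approximation that treats every churned customer in the test set as an independent trial, which is an assumption; the recall and support values are taken from the GradientBoost reports above.

```python
from math import sqrt

def recall_standard_error(recall, n_positive):
    """Binomial standard error of a recall measured on n_positive cases."""
    return sqrt(recall * (1 - recall) / n_positive)

# Churn recall and churn support from the GradientBoost reports
se_20 = recall_standard_error(0.895385, 325)   # test_size=0.2
se_10 = recall_standard_error(0.920245, 163)   # test_size=0.1

print(f"test_size=0.2: recall 0.8954 +/- {se_20:.3f}")
print(f"test_size=0.1: recall 0.9202 +/- {se_10:.3f}")
```

Halving the number of churned customers in the test set makes the estimate visibly less precise, so differences of a couple of percentage points between splits should be read with care.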

### Resampling

Although the evaluation relies on the `macro avg` instead of the `weighted avg`, it is worth checking how the model performs with **resampling**. Two kinds of resampling will be tested:

1. **Over Sampling**: Synthetic samples will be created for the train set using the `SMOTE` method, growing the minority class until it matches the size of the majority class.
2. **Under Sampling**: Original data will be removed from the train set using the `NearMiss` method, reducing the majority class to the same size as the minority.
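To give an intuition for what "synthetic samples" means, the sketch below imitates the core idea of SMOTE with plain NumPy: each new point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration and not the `imblearn` implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise squared distances between minority samples (a point is not its own neighbour)
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]      # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_new)             # sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour of each
    lam = rng.random((n_new, 1))                      # position along the segment, in [0, 1)
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

# Tiny demonstration with 20 random "minority" points in 2 dimensions
demo_minority = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_like_samples(demo_minority, n_new=30)
print(synthetic.shape)
```

Because every new point lies between two existing minority points, oversampling of this kind adds no information from outside the region the minority class already occupies.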

#### Over Sampling

In [88]:
# initializing SMOTE

smote = SMOTE()

X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# checking that the amount is the same for each value

print('Original dataset shape:', Counter(y_train))
print('Resample dataset shape:', Counter(y_train_smote))

Original dataset shape: Counter({1.0: 6799, 0.0: 1302})
Resample dataset shape: Counter({0.0: 6799, 1.0: 6799})


In [89]:
# Scaling the data
X_train_scaled_smote = scaler.fit_transform(X_train_smote)

In [90]:
# Obtaining best params
gradient_gridsearch(X_train_scaled_smote, y_train_smote)

GradientBoostingClassifier(criterion='mse', loss='exponential', max_depth=4,
                           max_features='sqrt', max_leaf_nodes=16,
                           n_estimators=265)

In [97]:
# Storing y_pred
y_pred_gradient_smote = gradient_trainer(X_train_scaled_smote,
                                         y_train_smote,
                                         X_test_scaled,
                                         "exponential",
                                         "mse",
                                         "sqrt",
                                         265,
                                         4,
                                         16
                                        )
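One detail to watch in this step: `scaler.fit_transform(X_train_smote)` re-fits the shared scaler on the resampled training set, while `X_test_scaled` still carries the scaling learned from the original `X_train`, so the model is trained and evaluated on differently scaled features. A minimal sketch of one way to keep both on the same scale follows, on synthetic data. The names `scaler_res`, `X_tr_res` and the random undersampling step are illustrative assumptions standing in for the notebook's resamplers; this is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (class 0 is the minority)
X_demo, y_demo = make_classification(n_samples=600, weights=[0.16, 0.84], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Stand-in resampler (assumption): random undersampling of the majority class
rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y_tr == 0)
majority_idx = rng.choice(np.flatnonzero(y_tr == 1), size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, majority_idx])
X_tr_res, y_tr_res = X_tr[keep], y_tr[keep]

# A dedicated scaler: fitted on the resampled training data...
scaler_res = StandardScaler()
X_tr_res_scaled = scaler_res.fit_transform(X_tr_res)
# ...and the SAME scaler applied to the untouched test data
X_te_res_scaled = scaler_res.transform(X_te)

print(X_tr_res_scaled.shape, X_te_res_scaled.shape)
```

The test set itself is never resampled; only the transformation learned from the resampled training data is applied to it.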

#### Under Sampling

In [92]:
# initializing NearMiss

nm = NearMiss()

X_train_nm, y_train_nm = nm.fit_resample(X_train, y_train)

# checking that the amount is the same for each value

print('Original dataset shape:', Counter(y_train))
print('Resample dataset shape:', Counter(y_train_nm))

Original dataset shape: Counter({1.0: 6799, 0.0: 1302})
Resample dataset shape: Counter({0.0: 1302, 1.0: 1302})


In [93]:
# Scaling the data
X_train_scaled_nm = scaler.fit_transform(X_train_nm)

In [94]:
# Obtaining best params
gradient_gridsearch(X_train_scaled_nm, y_train_nm)

GradientBoostingClassifier(criterion='mse', max_depth=4, max_features='log2',
                           max_leaf_nodes=14, n_estimators=280)

In [98]:
# Storing y_pred
y_pred_gradient_nm = gradient_trainer(X_train_scaled_nm,
                                      y_train_nm,
                                      X_test_scaled,
                                      "deviance",
                                      "mse",
                                      "log2",
                                      280,
                                      4,
                                      14
                                     )

#### Results

In [99]:
# Classification Report
print(f"GradientBoost\ntest_size=0.1:\n{pd.DataFrame(classification_report(y_test_10, y_pred_gradient_10, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Over Sampling:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient_smote, output_dict=True, target_names=['Churn', 'Customer']))}\n")
print(f"Under Sampling:\n{pd.DataFrame(classification_report(y_test, y_pred_gradient_nm, output_dict=True, target_names=['Churn', 'Customer']))}")

GradientBoost
test_size=0.1:
                Churn    Customer  accuracy    macro avg  weighted avg
precision    0.914634    0.984688  0.973346     0.949661      0.973416
recall       0.920245    0.983529  0.973346     0.951887      0.973346
f1-score     0.917431    0.984108  0.973346     0.950770      0.973379
support    163.000000  850.000000  0.973346  1013.000000   1013.000000

Over Sampling:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.229358     1.000000  0.461007     0.614679      0.876378
recall       1.000000     0.358025  0.461007     0.679012      0.461007
f1-score     0.373134     0.527273  0.461007     0.450204      0.502547
support    325.000000  1701.000000  0.461007  2026.000000   2026.000000

Under Sampling:
                Churn     Customer  accuracy    macro avg  weighted avg
precision    0.280799     0.936000   0.64462     0.608400      0.830896
recall       0.778462     0.619048   0.64462     0.698755      0.644620
f1-scor