# Demo 1: A Machine Learning Classification Task

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate, train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

import warnings
warnings.filterwarnings("ignore")

## Dataset Description

This [dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) is the Bank Marketing Data Set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). It contains 20 input variables. The output variable indicates whether the client has subscribed to a term deposit.

To get a better understanding of the dataset, we first read it in and split it into train and test portions:

In [2]:
df = pd.read_csv("../datasets/bank-additional-full.csv", sep=";")

train_df, test_df = train_test_split(df, test_size=0.3, random_state=4801)

Recall one of the most important principles in machine learning: the test data must NEVER influence the training phase at ANY stage. Thus, we split the data into training and test portions before doing anything else.
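
A minimal sketch of this principle on a made-up frame (the column names and values are assumptions for illustration, not the bank data): the split happens first, and a fitted statistic such as a scaler's mean is learned from the training rows only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the bank dataset.
toy_df = pd.DataFrame({"x": range(10), "y": ["no"] * 7 + ["yes"] * 3})

# Split first, before any statistic is computed from the data.
toy_train, toy_test = train_test_split(toy_df, test_size=0.3, random_state=0)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(toy_train[["x"]])
toy_test_scaled = scaler.transform(toy_test[["x"]])

# The two portions never share a row.
shared_rows = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(shared_rows))
```

The scaler's mean here equals the mean of the training rows, so nothing about the test rows leaks into the fitted transformer.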

Let's have a glimpse at the training dataset: 

In [3]:
train_df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
17762,53,entrepreneur,married,university.degree,unknown,yes,no,cellular,jul,tue,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1,yes
36082,25,admin.,single,university.degree,no,no,no,cellular,may,tue,...,1,999,1,failure,-1.8,92.893,-46.2,1.266,5099.1,no
19760,32,admin.,married,high.school,no,no,no,cellular,aug,fri,...,8,999,0,nonexistent,1.4,93.444,-36.1,4.966,5228.1,no
20614,35,admin.,single,university.degree,no,yes,yes,cellular,aug,wed,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.965,5228.1,no
38193,80,retired,divorced,basic.4y,no,no,no,cellular,oct,tue,...,1,999,0,nonexistent,-3.4,92.431,-26.9,0.744,5017.5,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15819,36,technician,married,unknown,unknown,yes,no,cellular,jul,mon,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.960,5228.1,no
1198,48,blue-collar,married,basic.6y,no,yes,no,telephone,may,thu,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
775,41,blue-collar,married,basic.4y,no,yes,no,telephone,may,wed,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.856,5191.0,no
18670,31,services,single,high.school,no,no,no,cellular,jul,thu,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.968,5228.1,no


The target is `y`: whether the client has subscribed to a term deposit.

We can group the features into the following types: 
1.  Numerical features (`age`, `duration`, `campaign`, `pdays`, `previous`, `emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m`, `nr.employed`)
2.  Binary features that need preprocessing because of the `unknown` category (`default`, `housing`, `loan`)
3.  Categorical features with an `unknown` category (`job`, `marital`, `education`)
4.  Other categorical features (`contact`, `month`, `day_of_week`, `poutcome`)

## Feature Engineering

Note that the feature `pdays` uses the value `999` as a placeholder meaning that the client was not previously contacted. Treated as a number, this placeholder would distort the feature, so we create a new feature to replace the original one. 
The new feature is `is_contacted_before`: it is equal to `1` if `pdays < 999` and `0` otherwise. 

In [4]:
train_df["is_contacted_before"] = train_df["pdays"].apply(lambda pdays: 0 if pdays == 999 else 1)

test_df["is_contacted_before"] = test_df["pdays"].apply(lambda pdays: 0 if pdays == 999 else 1)
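
The row-wise `apply` in the cell above can also be written as a vectorized comparison. The sketch below uses a made-up four-row frame (an assumption for illustration) to show that both forms give the same flag.

```python
import pandas as pd

# Hypothetical toy frame; 999 is the "never contacted" placeholder from the text.
toy = pd.DataFrame({"pdays": [999, 3, 999, 12]})

# Row-wise version, as in the cell above.
via_apply = toy["pdays"].apply(lambda pdays: 0 if pdays == 999 else 1)

# Vectorized version: compare the whole column at once, then cast to int.
via_compare = (toy["pdays"] != 999).astype(int)

print(via_compare.tolist())  # [0, 1, 0, 1]
```

The vectorized form avoids a Python-level function call per row, which tends to matter more as the frame grows.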

## Preprocessing

### Split into X and y

In [5]:
X_train, y_train = train_df.drop(columns=["y"]), train_df["y"]
X_test, y_test = test_df.drop(columns=["y"]), test_df["y"]

### Transform columns

Note that the attribute information section of the dataset explicitly says that `duration` should be discarded if the intention is to build a realistic predictive model: the duration of a call is not known before the call is made, and once the call has ended the outcome `y` is already known. Thus, we are going to drop this feature. 

We then classify the features into different types: 

In [6]:
numeric_feats = ["age", "campaign", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]
binary_feats = ["default", "housing", "loan"]
categorical_feats_1 = ["job", "marital", "education"]
categorical_feats_2 = ["contact", "month", "day_of_week", "poutcome"]
drop_feats = ["duration", "pdays"]

Next, we build a column transformer that applies a different preprocessor to each group of features: 
1.  `numeric_feats`: `StandardScaler()` (scale features)
2.  `binary_feats`: `FunctionTransformer()` (treat "unknown" as "no", then replace `yes` with `1` and `no` with `0`)
3.  `categorical_feats_1`: `FunctionTransformer()` -> `SimpleImputer()` -> `OneHotEncoder()` (replace "unknown" with `np.nan`, then impute the null values using the most frequent value, followed by one-hot encoding)
4.  `categorical_feats_2`: `OneHotEncoder()`

In [7]:
def process_yes_no_unknown(df):
    # Map "yes" to 1 and every other value ("no", "unknown") to 0.
    # A new frame is returned, so the input is not modified in place.
    return (df == "yes").astype(int)

In [8]:
def replace_unknown_with_nan(df):
    # Turn the "unknown" category into a missing value for the imputer.
    # A new frame is returned, so the input is not modified in place.
    return df.replace("unknown", np.nan)

In [9]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats), 
    (FunctionTransformer(process_yes_no_unknown), binary_feats), 
    (
        make_pipeline(
            FunctionTransformer(replace_unknown_with_nan), 
            SimpleImputer(strategy="most_frequent"), 
            OneHotEncoder()
        ), categorical_feats_1
    ), 
    (OneHotEncoder(), categorical_feats_2), 
    ("drop", drop_feats)
)

In [10]:
X_train_transformed = preprocessor.fit_transform(X_train)

Let's have a glimpse of what the transformed dataframe of `X_train` looks like: 

In [11]:
pd.DataFrame(X_train_transformed)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,42,43,44,45,46,47,48,49,50,51
0,1.240030,-0.568335,-0.349766,0.836939,0.591584,-0.474459,0.771529,0.842072,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,-1.439447,-0.568335,1.634901,-1.197993,-1.180119,-1.229769,-1.356666,-0.937039,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,-0.769578,1.970674,-0.349766,0.836939,-0.227721,0.949841,0.774409,0.842072,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-0.482491,-0.568335,-0.349766,0.836939,-0.227721,0.949841,0.773833,0.842072,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,3.823811,-0.568335,-0.349766,-2.215459,-1.978681,2.935228,-1.657320,-2.062430,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.386795,-0.568335,-0.349766,0.836939,0.591584,-0.474459,0.770953,0.842072,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
28827,0.761552,-0.568335,-0.349766,0.646164,0.722949,0.885100,0.710477,0.330405,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
28828,0.091683,-0.205620,-0.349766,0.646164,0.722949,0.885100,0.711052,0.330405,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
28829,-0.865273,0.157096,-0.349766,0.836939,0.591584,-0.474459,0.775561,0.842072,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


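After processing, the number of features has increased from `20` to `52`. These 52 columns come from the four feature groups only: `is_contacted_before` is not listed in any group, so the column transformer drops it (its default is `remainder="drop"`). 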
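
A minimal sketch of how `make_column_transformer` treats columns that are not listed in any transformer, using a made-up frame rather than the bank data. It assumes one would keep an engineered flag by listing it under `"passthrough"`; whether to do so in this demo is a modelling choice.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one scaled column and one engineered flag.
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0], "is_contacted_before": [0, 1, 0]})

# Only "age" is listed, so the flag column falls under remainder="drop".
dropping = make_column_transformer((StandardScaler(), ["age"]))

# Listing the flag with "passthrough" keeps it unchanged in the output.
keeping = make_column_transformer(
    (StandardScaler(), ["age"]),
    ("passthrough", ["is_contacted_before"]),
)

dropped_shape = dropping.fit_transform(toy).shape
kept_shape = keeping.fit_transform(toy).shape
print(dropped_shape, kept_shape)  # (3, 1) (3, 2)
```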

## Modelling

After preprocessing, we can now build our models.

### Baseline

Let's start with our baseline model by using `DummyClassifier`. 

In [12]:
model1 = make_pipeline(
    preprocessor, 
    DummyClassifier()
)

model1.fit(X_train, y_train)

In [13]:
model1.score(X_train, y_train)

0.8869966355658839

Our baseline results show that always assigning the most frequent label gives an accuracy of 88.7%. This means that 88.7% of the training data is labelled "no" while the remaining 11.3% is labelled "yes". 
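
The link between the baseline score and the class proportions can be checked on a small made-up label vector (an assumption for illustration). The sketch passes `strategy="most_frequent"` explicitly so that the behaviour does not depend on the library's default.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 8 "no" and 2 "yes".
toy_y = pd.Series(["no"] * 8 + ["yes"] * 2)
toy_X = pd.DataFrame({"x": range(10)})  # the dummy model ignores the features

# Share of each class in the labels.
class_share = toy_y.value_counts(normalize=True)

# Always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_accuracy = dummy.score(toy_X, toy_y)

print(class_share["no"], baseline_accuracy)  # 0.8 0.8
```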

### Random Forest Classifier

`RandomForestClassifier` is a widely used classifier based on ensembling: it trains many decision trees on random samples of the data and combines their predictions. Let's fit the classifier on our data and obtain the results of a 10-fold cross-validation. 

In [14]:
model_rf = make_pipeline(
    preprocessor, 
    RandomForestClassifier(random_state=9542)
)

In [15]:
results = cross_validate(
    model_rf, X_train, y_train, cv=10, n_jobs=-1, return_train_score=True
)

In [16]:
pd.DataFrame(results)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,14.645738,0.685794,0.886616,0.995684
1,14.620667,0.674158,0.882414,0.995838
2,14.837813,0.620339,0.895942,0.995607
3,14.601188,0.605053,0.89282,0.995915
4,15.132909,0.53331,0.889698,0.995607
5,14.74743,0.523042,0.892126,0.995992
6,14.832412,0.525353,0.891086,0.995992
7,14.83998,0.531398,0.899757,0.995992
8,5.439169,0.187747,0.897329,0.995645
9,5.310481,0.188646,0.897329,0.995645


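The training scores (about 0.996) are far above the validation scores (about 0.89), which are themselves barely above the 0.887 baseline, so our model is overfitting. Let's try to tune some hyperparameters in this model. Note that the search below does not set `scoring`, so candidates are compared by accuracy rather than by f1-score. 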

In [17]:
param_grid = {
    "randomforestclassifier__n_estimators": [32, 64, 128], 
    "randomforestclassifier__criterion": ["gini", "entropy"], 
    "randomforestclassifier__max_depth": [5, 10, None], 
}

In [18]:
grid_search = GridSearchCV(model_rf, param_grid, n_jobs=-1, cv=10, return_train_score=True)
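
If F1 on the `yes` class is the intended metric, a scorer has to be passed through the `scoring` argument. The sketch below is one way to do that, under stated assumptions: the data is synthetic (`make_classification`), the grid is deliberately tiny, and the names `f1_yes` and `toy_search` are made up for illustration. With string labels the positive class is named explicitly, since the plain `"f1"` scorer assumes a positive label of `1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced toy data with string labels like the bank target.
X_toy, y_int = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)
y_toy = np.where(y_int == 1, "yes", "no")

# Build an F1 scorer that treats "yes" as the positive class.
f1_yes = make_scorer(f1_score, pos_label="yes")

toy_search = GridSearchCV(
    RandomForestClassifier(n_estimators=16, random_state=0),
    {"max_depth": [3, None]},
    scoring=f1_yes,
    cv=3,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```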

In [19]:
grid_search.fit(X_train, y_train)

In [20]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_randomforestclassifier__criterion,param_randomforestclassifier__max_depth,param_randomforestclassifier__n_estimators,params,split0_test_score,split1_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,2.896919,0.036993,0.22676,0.022291,gini,5.0,32,"{'randomforestclassifier__criterion': 'gini', ...",0.895284,0.892473,...,0.899684,0.899877,0.899954,0.899491,0.900108,0.898952,0.898528,0.899183,0.899526,0.000479
1,5.017316,0.135188,0.264921,0.038058,gini,5.0,64,"{'randomforestclassifier__criterion': 'gini', ...",0.895284,0.893167,...,0.8998,0.900031,0.899915,0.900262,0.900647,0.899067,0.898759,0.899645,0.899838,0.000582
2,10.046763,0.889708,0.496332,0.04175,gini,5.0,128,"{'randomforestclassifier__criterion': 'gini', ...",0.895978,0.893861,...,0.900455,0.8998,0.899453,0.899607,0.900262,0.89926,0.899106,0.899761,0.899818,0.000485
3,5.112277,0.233509,0.322317,0.026729,gini,10.0,32,"{'randomforestclassifier__criterion': 'gini', ...",0.897365,0.893514,...,0.918105,0.917566,0.916256,0.91799,0.917373,0.916063,0.91668,0.917181,0.917454,0.000934
4,8.11928,0.713949,0.3724,0.094702,gini,10.0,64,"{'randomforestclassifier__criterion': 'gini', ...",0.898405,0.893167,...,0.918491,0.917874,0.917566,0.918761,0.918105,0.916179,0.917373,0.91799,0.918078,0.000909
5,14.13434,0.722982,0.46709,0.034886,gini,10.0,128,"{'randomforestclassifier__criterion': 'gini', ...",0.898405,0.89282,...,0.918452,0.917797,0.917836,0.91826,0.918105,0.916949,0.917682,0.917836,0.918174,0.000766
6,5.517294,0.318041,0.252358,0.02482,gini,,32,"{'randomforestclassifier__criterion': 'gini', ...",0.889043,0.883802,...,0.993526,0.993294,0.993718,0.994027,0.994027,0.993795,0.99368,0.993526,0.993695,0.000284
7,9.242735,0.091161,0.346754,0.020018,gini,,64,"{'randomforestclassifier__criterion': 'gini', ...",0.88939,0.882761,...,0.995144,0.995568,0.995221,0.995607,0.995761,0.995568,0.995221,0.995298,0.995391,0.000205
8,21.368661,2.543377,0.784183,0.148941,gini,,128,"{'randomforestclassifier__criterion': 'gini', ...",0.886963,0.882414,...,0.995645,0.995953,0.995684,0.996069,0.996031,0.996031,0.995799,0.995722,0.995849,0.000155
9,3.533499,0.455487,0.27281,0.059374,entropy,5.0,32,{'randomforestclassifier__criterion': 'entropy...,0.896325,0.893861,...,0.899723,0.898759,0.899607,0.899337,0.899723,0.898451,0.898297,0.899414,0.899348,0.000619


In [21]:
grid_search.best_params_

{'randomforestclassifier__criterion': 'entropy',
 'randomforestclassifier__max_depth': 10,
 'randomforestclassifier__n_estimators': 128}

Our best tuned hyperparameters are: `criterion="entropy"`, `max_depth=10`, `n_estimators=128`. 

Let's fit this model again and see the results: 

In [22]:
model_rf_best = make_pipeline(
    preprocessor, 
    RandomForestClassifier(random_state=9542, criterion="entropy", max_depth=10, n_estimators=128)
)

In [23]:
model_rf_best.fit(X_train, y_train)
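
As an alternative to rebuilding the pipeline by hand, a fitted `GridSearchCV` exposes the winning model as `best_estimator_`, which is retrained on the full training data when `refit=True` (the default). A minimal sketch on synthetic data, with a decision tree standing in for the random forest to keep it fast (both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, not the bank dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4]}, cv=3
)
toy_search.fit(X_toy, y_toy)

# refit=True (the default) retrains the best candidate on all of X_toy.
best_model = toy_search.best_estimator_
same_depth = best_model.get_params()["max_depth"] == toy_search.best_params_["max_depth"]
toy_pred = best_model.predict(X_toy[:3])

print(same_depth, toy_pred.shape)
```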

## Results on Test Set

Now, let's predict our results using the test set: 

In [24]:
y_pred = model_rf_best.predict(X_test)

Next, we generate the classification report: 

In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          no       0.91      0.98      0.95     10975
         yes       0.64      0.23      0.34      1382

    accuracy                           0.90     12357
   macro avg       0.77      0.61      0.64     12357
weighted avg       0.88      0.90      0.88     12357



Note that accuracy is not the only metric that we can use. With classes as imbalanced as ours, precision and recall tell us more about how the model handles the "yes" class. 

Define $TP = \text{True Positives}$, $FP = \text{False Positives}$, $TN = \text{True Negatives}$ and $FN = \text{False Negatives}$. 

Then, the precision is $\dfrac{TP}{TP + FP}$. Precision answers the question, "out of all the samples I identified as positive, what proportion are actually positive?"
The recall is $\dfrac{TP}{TP + FN}$. Recall answers the question, "out of all the ground-truth positive samples, what proportion did I identify as positive?"
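
The two formulas can be checked by hand against a confusion matrix. The labels and predictions below are made up for illustration and are not the bank results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions.
y_true = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"]
y_hat = ["yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no"]

# With labels=["no", "yes"], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["no", "yes"]).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(precision, recall)  # 0.5 0.25
```

Here one of the two predicted positives is correct (precision 0.5), while only one of the four actual positives is found (recall 0.25). The same values come out of `precision_score` and `recall_score` with `pos_label="yes"`.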

From the results above, we can see that we are performing poorly on the recall metric for the "yes" class: the model identifies only 23% of the clients who actually subscribed to a term deposit. In the business context, this means that a bank relying on this model would overlook most of the clients who are likely to subscribe. 