# Module 5: Model Development

In this module, we go through the following steps:
1. Prepare dataset(s) for training:
    - Create training and test datasets
    - Create an additional, downsampled training set
2. Create some helper functions, e.g., for quick performance evaluation
3. As a baseline, train a logistic regression model
4. Train a random forest model for comparison
5. Save all models and datasets for later use

## Configuration

In [1]:
# basic configuration, put these lines at the top of each notebook
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
# plotting configuration (basically just change plot size)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (16, 10)

In [3]:
# show all columns of our data frames
import pandas as pd
pd.options.display.max_columns = None
pd.set_option("display.precision", 2)
pd.options.display.max_rows = 100

## Data preparation

### Data loading

In [4]:
!ls -lh data/

total 6768016
-rw-r--r--  1 felix  staff   697M Aug 26 15:10 data_raw.csv
-rw-r--r--  1 felix  staff   1.8G Aug 28 09:53 data_raw.feather
-rw-r--r--  1 felix  staff    19M Aug 28 17:00 feats_clean.feather
-rw-r--r--  1 felix  staff    94M Aug 28 17:02 feats_final.feather
-rw-r--r--  1 felix  staff    21M Aug 28 17:00 feats_raw.feather
-rwxr-xr-x@ 1 felix  staff    25M Aug 16 09:26 train_identity.csv
-rwxr-xr-x@ 1 felix  staff   652M Aug 16 09:26 train_transaction.csv


In [5]:
DATA_PATH = 'data/'
data = pd.read_feather(f'{DATA_PATH}feats_final.feather')
data.shape

(542547, 154)

In [6]:
data.head()

Unnamed: 0,isFraud,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,P_emaildomain,R_emaildomain,DeviceType,day,hour,dist1*TransactionAmt,ProductCD_C,ProductCD_H,ProductCD_R,ProductCD_S,ProductCD_W,card1_1,card1_2,card1_3,card1_4,card1_5,card1_6,card1_7,card1_8,card1_9,card1_10,card1_11,card2_1,card2_2,card2_3,card2_4,card2_5,card2_6,card2_7,card2_8,card2_9,card2_10,card2_11,card3_1,card3_2,card3_3,card3_4,card3_5,card3_6,card3_7,card3_8,card3_9,card3_10,card3_11,card4_american express,card4_discover,card4_mastercard,card4_missing_value,card4_visa,card5_1,card5_2,card5_3,card5_4,card5_5,card5_6,card5_7,card5_8,card5_9,card5_10,card5_11,card6_charge card,card6_credit,card6_debit,card6_debit or credit,addr1_1,addr1_2,addr1_3,addr1_4,addr1_5,addr1_6,addr1_7,addr1_8,addr1_9,addr1_10,addr1_11,addr2_1,addr2_2,addr2_3,addr2_4,addr2_5,addr2_6,addr2_7,addr2_8,addr2_9,addr2_10,addr2_11,P_emaildomain_anonymous.com,P_emaildomain_aol.com,P_emaildomain_att.net,P_emaildomain_comcast.net,P_emaildomain_gmail.com,P_emaildomain_hotmail.com,P_emaildomain_icloud.com,P_emaildomain_msn.com,P_emaildomain_other,P_emaildomain_outlook.com,P_emaildomain_yahoo.com,R_emaildomain_anonymous.com,R_emaildomain_aol.com,R_emaildomain_comcast.net,R_emaildomain_gmail.com,R_emaildomain_hotmail.com,R_emaildomain_icloud.com,R_emaildomain_msn.com,R_emaildomain_other,R_emaildomain_outlook.com,R_emaildomain_yahoo.com,R_emaildomain_yahoo.com.mx,DeviceType_desktop,DeviceType_missing_value,DeviceType_mobile,day_0,day_1,day_2,day_3,day_4,day_5,day_6,hour_0,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,hour_10,hour_11,hour_12,hour_13,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
0,0,0.31,W,11,11,1,mastercard,4,credit,3,1,0.24,gmail.com,other,missing_value,0,0,0.07,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0.38,W,11,4,1,visa,3,debit,6,1,0.61,outlook.com,other,missing_value,0,0,0.23,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0.37,W,11,11,1,mastercard,5,debit,11,1,0.24,yahoo.com,other,missing_value,0,0,0.09,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0.37,H,11,10,1,mastercard,4,credit,11,1,0.24,gmail.com,other,mobile,0,0,0.09,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0.36,W,11,3,1,visa,1,debit,9,1,0.39,gmail.com,other,missing_value,0,0,0.14,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


First, we drop the columns that we one-hot encoded in the previous module, because the encoded columns already carry the same information. The list below also drops the interaction feature `dist1*TransactionAmt`.

In [7]:
cols_to_drop = ['ProductCD', 'card1', 'card2', 'card3',
       'card4', 'card5', 'card6', 'addr1', 'addr2', 'P_emaildomain',
       'R_emaildomain', 'DeviceType', 'day', 'hour', 'dist1*TransactionAmt']

In [8]:
data = data.drop(columns=cols_to_drop)
print(data.shape)

(542547, 139)
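To make the redundancy concrete, here is a minimal sketch on a toy data frame. The values are made up, and `pd.get_dummies` is only one way to produce such columns; we do not know exactly how the previous module created them.

```python
import pandas as pd

# Toy frame with one categorical column (hypothetical values).
toy = pd.DataFrame({
    "isFraud": [0, 1, 0],
    "ProductCD": ["W", "H", "W"],
})

# One-hot encode and keep the original column next to its dummies,
# which mirrors the layout of feats_final.feather.
dummies = pd.get_dummies(toy["ProductCD"], prefix="ProductCD")
encoded = pd.concat([toy, dummies], axis=1)

# The original column carries no extra information and can be dropped.
encoded = encoded.drop(columns=["ProductCD"])
print(list(encoded.columns))
```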


### Data splits

Now, we split our dataset into a training and a test dataset. There are several approaches for doing this. Here, we use a random sample containing 20% of all observations as the test set. We can use a random sample, because our data does not constitute time-series data. For time-series data, it is common to assign contiguous periods to the training and test sets instead.

In [9]:
from sklearn.model_selection import train_test_split
# hold out 20% of the rows as the test set
data_train, data_test = train_test_split(data, test_size=0.2)
print(data_train.shape)
print(data_test.shape)

(434037, 139)
(108510, 139)
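A purely random split can shift the share of the rare class slightly between the two parts, and it differs on every run. As a hedged alternative (not what the notebook does), the sketch below uses the `stratify` and `random_state` arguments of `train_test_split` on synthetic data; the toy frame and its 5% positive rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 5% positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "isFraud": np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)],
})

# stratify keeps the class ratio (almost) identical in both parts;
# random_state makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["isFraud"], random_state=42
)
print(toy_train["isFraud"].mean(), toy_test["isFraud"].mean())
```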


Next, we need to look at the distribution of our target variable (remember: binary variable that indicates whether a transaction is fraudulent) in both our training and testing datasets. We can do so using the `value_counts()` method on the respective column.

In [10]:
data_train.isFraud.value_counts()

0    418964
1     15073
Name: isFraud, dtype: int64


In [11]:
data_test.isFraud.value_counts()

0    104791
1      3719
Name: isFraud, dtype: int64

Our dataset is clearly imbalanced: only about 3.5% of the training transactions are fraudulent (15,073 of 434,037). This can cause problems; for example, a model can simply predict the more common class and achieve superficially good performance. There are various methods to deal with the so-called _class imbalance problem_. The most common are:
- Don't do anything about it, as many ML models can cope with moderately imbalanced datasets
- Downsampling: create a more balanced dataset by reducing the size of the bigger class (e.g., using random sampling)
- Upsampling: create a more balanced dataset by increasing the size of the smaller class (e.g., by resampling)
- Cost weighting: assign higher costs to misclassifications of the smaller class
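
As a counterpart to the downsampling used in this module, the following sketch shows upsampling with `sklearn.utils.resample` on a toy data frame. The data and column values are made up. Resampling of this kind should only be applied to the training data, never to the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 8 non-fraudulent and 2 fraudulent rows (hypothetical values).
toy = pd.DataFrame({
    "amount": range(10),
    "isFraud": [0] * 8 + [1] * 2,
})
majority = toy[toy["isFraud"] == 0]
minority = toy[toy["isFraud"] == 1]

# Draw with replacement until the minority class matches the majority class.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=27
)
upsampled = pd.concat([majority, minority_upsampled])
print(upsampled["isFraud"].value_counts())
```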

Here, we will create an additional training dataset that is downsampled from the original one. Specifically, we restrict the non-fraudulent class to four times the size of the fraudulent class.

In [12]:
imbalanced = data_train
imbalanced.isFraud.value_counts()

0    418964
1     15073
Name: isFraud, dtype: int64

In [13]:
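from sklearn.utils import resample

# separate the two classes of the training set
not_fraud = imbalanced[imbalanced.isFraud == 0]
fraud = imbalanced[imbalanced.isFraud == 1]

# randomly keep four non-fraudulent transactions per fraudulent one
not_fraud_downsampled = resample(not_fraud,
                                replace=False,
                                n_samples=len(fraud) * 4,
                                random_state=27)
downsampled = pd.concat([not_fraud_downsampled, fraud])
downsampled.isFraud.value_counts()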

0    60292
1    15073
Name: isFraud, dtype: int64

As you can see, we get a more balanced distribution (an 80%/20% split).

## Model setup

### Helper functions

Before training, we obviously have to remove the target variable from the training data. Since we have to do this for each training run, we write a helper function for this.

In [14]:
def split_data(data):
    X = data.drop(columns=['isFraud'])
    y = data.isFraud
    return X, y

In addition, we want to be able to quickly grasp the overall performance of a trained model. Thus, we write a helper function that prints the most important classification metrics for a given model and dataset. Analyzing multiple metrics makes sense in the case of an imbalanced dataset, because the commonly used _accuracy score_ can be misleading in these cases. For illustration, imagine the case where one class corresponds to 99% of examples in a dataset. A naive classifier that always predicts this class would achieve 99% accuracy, but obviously does not constitute a very good model.

In [15]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score

def evaluate_model(model, data):
    # print common classification metrics for a fitted model on a labelled dataset
    X, y = split_data(data)
    pred = model.predict(X)
    print("Accuracy: {:.4f}".format(accuracy_score(y, pred)))
    print("Precision: {:.4f}".format(precision_score(y, pred)))
    print("Recall: {:.4f}".format(recall_score(y, pred)))
    print("F1: {:.4f}".format(f1_score(y, pred)))
    # note: AUC is computed from hard class predictions, not from predicted probabilities
    print("AUC: {:.4f}".format(roc_auc_score(y, pred)))
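
The 99% example from the paragraph above can be reproduced in a few lines. This is a small illustration with made-up labels, not part of the workshop data: a predictor that always outputs the majority class reaches high accuracy while its recall is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 negatives and 10 positives (hypothetical 99%/1% split).
y_true = np.array([0] * 990 + [1] * 10)

# A "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"Accuracy: {acc:.2f}, Recall: {rec:.2f}")
```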

### Dummy classifier

In order to test our methods, we create a dummy classifier that randomly picks between the two classes. This dummy classifier is part of the _scikit-learn_ package. In order to train the model, we invoke the `fit` method on the created classifier, providing training data and labels as arguments.

In [16]:
from sklearn.dummy import DummyClassifier

X, y = split_data(imbalanced)
dummy = DummyClassifier(strategy='uniform')
dummy.fit(X, y)

DummyClassifier(constant=None, random_state=None, strategy='uniform')

After fitting, we can now evaluate the trained model on both versions of the training data set (imbalanced and downsampled), as well as the test set.

In [17]:
evaluate_model(dummy, imbalanced)

Accuracy: 0.5001
Precision: 0.0354
Recall: 0.5101
F1: 0.0662
AUC: 0.5049


In [18]:
evaluate_model(dummy, downsampled)

Accuracy: 0.4967
Precision: 0.1983
Recall: 0.4981
F1: 0.2836
AUC: 0.4973


In [19]:
evaluate_model(dummy, data_test)

Accuracy: 0.5022
Precision: 0.0345
Recall: 0.5015
F1: 0.0646
AUC: 0.5018


As you would expect, we get around 50% accuracy and an AUC value of 0.5.

## Logistic regression

Next, we want to train a logistic regression model that should provide a more realistic baseline for subsequent model training. We will train models on both versions of the training set.

### Imbalanced dataset

In [20]:
from sklearn.linear_model import LogisticRegression

X, y = split_data(imbalanced)
log_imb = LogisticRegression()
log_imb.fit(X, y)
evaluate_model(log_imb, imbalanced)



Accuracy: 0.9654
Precision: 0.5945
Recall: 0.0159
F1: 0.0309
AUC: 0.5077


In [None]:
evaluate_model(log_imb, data_test)

We can see that the model is biased towards the bigger class (high accuracy, low AUC score).
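
Note that `evaluate_model` passes hard class predictions to `roc_auc_score`, so the reported AUC reflects a single decision threshold. `roc_auc_score` also accepts scores such as predicted probabilities, which measures how well the model ranks examples. The sketch below compares both variants on synthetic data generated with `make_classification`; the data and the `max_iter` setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data (about 5% positives), not the workshop data.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# AUC from hard 0/1 predictions, as evaluate_model computes it.
auc_labels = roc_auc_score(y_toy, model.predict(X_toy))

# AUC from the predicted probability of the positive class.
auc_scores = roc_auc_score(y_toy, model.predict_proba(X_toy)[:, 1])
print(f"labels: {auc_labels:.4f}, probabilities: {auc_scores:.4f}")
```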

### Downsampled dataset

In [21]:
X, y = split_data(downsampled)
log_ds = LogisticRegression()
log_ds.fit(X, y)
evaluate_model(log_ds, downsampled)



Accuracy: 0.8348
Precision: 0.6803
Recall: 0.3283
F1: 0.4429
AUC: 0.6449


In [22]:
evaluate_model(log_ds, data_test)

Accuracy: 0.9397
Precision: 0.2345
Recall: 0.3356
F1: 0.2761
AUC: 0.6484


The model trained on the downsampled dataset achieves better performance, as can be seen from the higher recall and AUC score.
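
Cost weighting is an alternative to downsampling that keeps all training examples. In scikit-learn, `LogisticRegression` supports this through its `class_weight` parameter. The following is a sketch on synthetic data, not a result for our fraud dataset; the data and the `max_iter` setting are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (about 5% positives), not the workshop data.
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=10, weights=[0.95], random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
# class_weight="balanced" weights errors inversely to class frequency.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_toy, y_toy)

recall_plain = recall_score(y_toy, plain.predict(X_toy))
recall_weighted = recall_score(y_toy, weighted.predict(X_toy))
print(f"recall plain: {recall_plain:.3f}, weighted: {recall_weighted:.3f}")
```

Higher recall from cost weighting usually comes with lower precision, so both should be checked on the test set.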

## Random forest

Now, we want to train a more complex model that is often a good choice for classification problems on tabular data: a random forest. A random forest constitutes an ensemble of decision trees where predictions of single trees are combined in order to derive a more robust prediction. Again, we train models on both the imbalanced and downsampled version of the training set in order to see which one works better for our use case.
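
To illustrate how the predictions of single trees are combined, the sketch below averages the per-tree class probabilities of a small forest and compares the result with the forest's own `predict_proba`. It uses synthetic data; the fitted trees are exposed through the `estimators_` attribute of scikit-learn's `RandomForestClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Each fitted tree is available in forest.estimators_.
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])

# Averaging the per-tree class probabilities reproduces the forest's output.
manual_proba = tree_probas.mean(axis=0)
print(np.allclose(manual_proba, forest.predict_proba(X_toy)))
```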

### Imbalanced dataset

In [23]:
from sklearn.ensemble import RandomForestClassifier

To begin with, we leave the default parameters mostly untouched and simply specify the number of trees to train (here: 10).

In [24]:
X, y = split_data(imbalanced)
rf_imb = RandomForestClassifier(n_estimators=10)
rf_imb.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [25]:
evaluate_model(rf_imb, imbalanced)

Accuracy: 0.9953
Precision: 0.9907
Recall: 0.8739
F1: 0.9286
AUC: 0.9368


In [26]:
evaluate_model(rf_imb, data_test)

Accuracy: 0.9738
Precision: 0.7458
Recall: 0.3582
F1: 0.4839
AUC: 0.6769


For the first time we can observe overfitting, marked by the large performance gap between training and test data. We can see the potential of this modelling approach though, as the performance on the training data set is quite impressive.
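
One common way to reduce this kind of overfitting is to constrain the individual trees, for example with `max_depth` or `min_samples_leaf`. The sketch below compares the gap between training and test accuracy for a default and a constrained forest on noisy synthetic data. The data and the chosen parameter values are assumptions, not tuned settings for our fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y assigns random labels to 10% of the samples).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=20, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

gaps = {}
for name, params in {
    "default": {},
    "constrained": {"min_samples_leaf": 20, "max_depth": 8},
}.items():
    forest = RandomForestClassifier(n_estimators=50, random_state=0, **params)
    forest.fit(X_tr, y_tr)
    # Gap between training and test accuracy as a rough overfitting signal.
    gaps[name] = forest.score(X_tr, y_tr) - forest.score(X_te, y_te)
    print(f"{name}: train-test gap = {gaps[name]:.3f}")
```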

### Downsampled dataset

In [60]:
X, y = split_data(downsampled)
rf_ds = RandomForestClassifier(n_estimators=50, max_features=0.7)
rf_ds.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=0.7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [61]:
evaluate_model(rf_ds, downsampled)

Accuracy: 0.9986
Precision: 0.9984
Recall: 0.9946
F1: 0.9965
AUC: 0.9971


In [62]:
evaluate_model(rf_ds, data_test)

Accuracy: 0.9515
Precision: 0.3765
Recall: 0.6319
F1: 0.4719
AUC: 0.7974


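As with the imbalanced dataset, the model fits the training data almost perfectly and overfits. On the test set (which is still imbalanced), however, recall rises from 0.36 to 0.63 and AUC from 0.68 to 0.80, at the cost of lower precision (0.38 instead of 0.75). Note that the two forests also differ in their hyperparameters, so the improvement cannot be attributed to downsampling alone.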

### Grid search for hyperparameter tuning

Manual hyperparameter tuning can be quite time-consuming. In the case of random forests, we can vary a lot of parameters, for example:
- number of trees to train
- maximum number of features considered when looking for the best split
- minimum number of examples in each leaf
- maximum depth of each decision tree

In the following, we will focus on the first two hyperparameters and perform a simple grid search in order to find the best combination. This is a brute-force approach (i.e., every possible hyperparameter combination is tested) that can become computationally expensive, especially when more hyperparameters are included, but it will suffice for starters.

In [64]:
from sklearn.model_selection import GridSearchCV

In [65]:
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [0.5, 0.7],
}
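
To get a feeling for the cost of this grid, we can enumerate the combinations with scikit-learn's `ParameterGrid`. This is a small sketch; the fold count of 5 matches the `cv=5` setting used in the grid search in this section.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100],
    'max_features': [0.5, 0.7],
}

# Every combination of the listed values is one candidate model.
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# With cv=5, each combination is fitted five times (plus one final refit).
n_fits = len(combinations) * 5
print(n_fits)
```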

In [66]:
# X, y still hold the downsampled training data from the previous section
rf = RandomForestClassifier()
rf_cv = GridSearchCV(rf, param_grid=param_grid, scoring='roc_auc', cv=5)
rf_cv.fit(X, y)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [67]:
evaluate_model(rf_cv, downsampled)

Accuracy: 0.9993
Precision: 0.9989
Recall: 0.9974
F1: 0.9981
AUC: 0.9986


In [68]:
evaluate_model(rf_cv, data_test)

Accuracy: 0.9524
Precision: 0.3810
Recall: 0.6227
F1: 0.4728
AUC: 0.7934


We can see that the best model from grid search is on par with our previously best-performing model (test AUC of 0.79 vs. 0.80).
We will stop model training here; further testing is left to the workshop participants.
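
As a starting point for further testing, a fitted `GridSearchCV` object exposes the winning combination and the cross-validated scores through `best_params_`, `best_score_` and `cv_results_`. The sketch below shows this on synthetic data with a deliberately small grid; the data and grid values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "max_features": [0.5, 0.7]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_toy, y_toy)

# Winning combination and its mean cross-validated AUC.
print(search.best_params_)
print(f"{search.best_score_:.4f}")

# Mean AUC per combination, in the order they were tried.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, f"{score:.4f}")
```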

## Save models and datasets

Reproducible experimentation in machine learning requires datasets and trained models to be saved. In the following, we will save our data in the efficient _feather_ format, and our models using the serialization functionality of the _joblib_ library.

In [81]:
data_train.reset_index(drop=True).to_feather(f"{DATA_PATH}data_train.feather")
data_test.reset_index(drop=True).to_feather(f"{DATA_PATH}data_test.feather")

In [83]:
from joblib import dump, load

In [84]:
dump(rf_ds, f'{DATA_PATH}random_forest.joblib') 

['data/random_forest.joblib']
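
To check that a saved model can be restored, we can load it again and compare predictions. The sketch below does this round trip with a small forest on synthetic data in a temporary directory; the file name `toy_forest.joblib` is hypothetical. Since joblib files are based on pickle, only load files from sources you trust.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_forest.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```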

In [87]:
!ls -lh data/

total 7053944
-rw-r--r--  1 felix  staff   697M Aug 26 15:10 data_raw.csv
-rw-r--r--  1 felix  staff   1.8G Aug 28 09:53 data_raw.feather
-rw-r--r--  1 felix  staff    17M Sep 10 14:29 data_test.feather
-rw-r--r--  1 felix  staff    66M Sep 10 14:29 data_train.feather
-rw-r--r--  1 felix  staff    19M Aug 28 17:00 feats_clean.feather
-rw-r--r--  1 felix  staff    94M Aug 28 17:02 feats_final.feather
-rw-r--r--  1 felix  staff    21M Aug 28 17:00 feats_raw.feather
-rw-r--r--  1 felix  staff    57M Sep 10 14:32 random_forest.joblib
-rwxr-xr-x@ 1 felix  staff    25M Aug 16 09:26 train_identity.csv
-rwxr-xr-x@ 1 felix  staff   652M Aug 16 09:26 train_transaction.csv
