# Module 5: Model Development

In this module, we carry out the following steps:
1. Prepare dataset(s) for training:
    - Create training and test datasets
    - Create an additional, downsampled training set
2. Create some helper functions, e.g., for quick performance evaluation
3. As a baseline, train a logistic regression model
4. Train a random forest model for comparison
5. Save all models and datasets for later use

## Configuration

In [None]:
# basic configuration, put these lines at the top of each notebook
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
# plotting configuration (basically just change plot size)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (16, 10)

In [None]:
# show all columns of our data frames
import pandas as pd
pd.options.display.max_columns = None
pd.set_option("display.precision", 2)
pd.options.display.max_rows = 100

## Data preparation

### Data loading

In [None]:
!ls -lh tmp/

In [None]:
DATA_PATH = 'tmp/'
data = pd.read_feather(f'{DATA_PATH}feats_final.feather')
data.shape

In [None]:
data.head()

First, we drop the columns that we one-hot encoded in the previous module. We do this because they duplicate information that is already contained in the encoded columns.

In [None]:
cols_to_drop = ['ProductCD', 'card1', 'card2', 'card3',
       'card4', 'card5', 'card6', 'addr1', 'addr2', 'P_emaildomain',
       'R_emaildomain', 'DeviceType', 'day', 'hour', 'dist1*TransactionAmt']

In [None]:
data = data.drop(columns=cols_to_drop)
print(data.shape)

### Data splits

Now, we split our dataset into a training and a test dataset. There are several approaches for doing this. Here, we use a random sample that contains 10% of all observations as the test set. We can use a random sample because we do not treat our data as time-series data. For time-series data, it is instead common to assign contiguous periods to the training and test data.

In [None]:
from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data, test_size=0.1)
print(data_train.shape)
print(data_test.shape)
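The split above is purely random, so the share of fraudulent transactions can differ slightly between the two parts and changes on every run. As a minimal sketch on synthetic data (the `toy` data frame and its `amount` column are made up for illustration), the `stratify` and `random_state` arguments of `train_test_split` can be used to keep the class ratio stable and the split repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transaction data: 1,000 rows, 5% positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.exponential(scale=100.0, size=1000),
    "isFraud": np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)],
})

# stratify keeps the class ratio (almost) identical in both parts,
# random_state makes the split repeatable across notebook runs.
toy_train, toy_test = train_test_split(
    toy, test_size=0.1, stratify=toy["isFraud"], random_state=27
)

print(toy_train["isFraud"].mean(), toy_test["isFraud"].mean())
```

Whether stratification is worth it depends on the dataset size. With many observations, a plain random split usually produces very similar class ratios anyway.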

Next, we need to look at the distribution of our target variable (remember: binary variable that indicates whether a transaction is fraudulent) in both our training and testing datasets. We can do so using the `value_counts()` method on the respective column.

In [None]:
data_train.isFraud.value_counts(normalize=True)

In [None]:
data_test.isFraud.value_counts(normalize=True)

Obviously, our dataset is imbalanced: we have far more non-fraudulent examples than fraudulent ones. This can cause problems, e.g., a model can simply predict the more common class and achieve superficially good performance. There are various methods to deal with the so-called _class imbalance problem_. The most common are:
- Don't do anything about it, as many ML models can cope with moderately imbalanced datasets
- Downsampling: create a more balanced dataset by reducing the size of the bigger class (e.g., using random sampling)
- Upsampling: create a more balanced dataset by increasing the size of the smaller class (e.g., by resampling)
- Cost weighting: assign higher costs to misclassifications of the smaller class
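Cost weighting is not used in the rest of this module, but a short sketch may help to make the idea concrete. Many scikit-learn classifiers accept a `class_weight` argument, and `class_weight="balanced"` weights each class inversely to its frequency. The data below is synthetic and the names `X_toy`, `y_toy` and `weighted_model` are chosen for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced toy data: 950 negatives and 50 positives.
rng = np.random.default_rng(0)
y_toy = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]
X_toy = rng.normal(size=(1000, 3)) + y_toy[:, None]  # shift the positive class

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
print(dict(zip([0, 1], weights)))

# The same weighting can be handed to a model directly.
weighted_model = LogisticRegression(class_weight="balanced")
weighted_model.fit(X_toy, y_toy)
```

In this toy setting, misclassifying a positive example costs roughly 19 times as much as misclassifying a negative one, which mirrors the 950:50 class ratio.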

Here, we will create an additional training dataset that is downsampled from the original one. In detail, we sample the non-fraudulent class so that it is four times as large as the fraudulent class.

In [None]:
imbalanced = data_train
imbalanced.isFraud.value_counts()

In [None]:
not_fraud = data_train.loc[data_train.isFraud == 0]
fraud = data_train.loc[data_train.isFraud == 1]
print(not_fraud.shape)
print(fraud.shape)

In [None]:
from sklearn.utils import resample

not_fraud_downsampled = resample(not_fraud,
                                replace = False,
                                n_samples = len(fraud)*4,
                                random_state = 27)
downsampled = pd.concat([not_fraud_downsampled, fraud])
downsampled.isFraud.value_counts()

As you can see, we get a more balanced distribution (an 80%/20% split).

**Exercise:** Create a completely balanced dataset (i.e., 50%/50% split between fraudulent and valid transactions).

## Model setup

### Helper functions

Before training, we obviously have to remove the target variable from the training data. Since we have to do this for each training run, we write a helper function for this.

In [None]:
def split_data(data):
    X = data.drop(columns=['isFraud'])
    y = data.isFraud
    return X, y

In addition, we want to be able to quickly grasp the overall performance of a trained model. Thus, we write a helper function that prints the most important classification metrics for a given model and dataset (we will go into detail about these metrics in the next module). Analyzing multiple metrics makes sense in the case of an imbalanced dataset, because the commonly used _accuracy score_ can be misleading in these cases. For illustration, imagine the case where one class corresponds to 99% of examples in a dataset. A naive classifier that always predicts this class would achieve 99% accuracy, but obviously does not constitute a very good model.

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score

def evaluate_model(model, data):
    X, y = split_data(data)
    pred = model.predict(X)
    probs = model.predict_proba(X)
    print("Accuracy: {:.4f}".format(accuracy_score(y, pred)))
    print("Precision: {:.4f}".format(precision_score(y, pred)))
    print("Recall: {:.4f}".format(recall_score(y, pred)))
    print("F1: {:.4f}".format(f1_score(y, pred)))
    print("AUC: {:.4f}".format(roc_auc_score(y, probs[:,1])))
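The accuracy pitfall described above can be checked with a few lines. The following sketch builds a made-up label vector with a 99%/1% class split and scores a "classifier" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 990 negatives and 10 positives, i.e. a 99%/1% class split.
y_true = np.r_[np.zeros(990, dtype=int), np.ones(10, dtype=int)]
# A naive "classifier" that always predicts the majority class.
y_naive = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_naive)
rec = recall_score(y_true, y_naive)
# zero_division=0 avoids a warning, since no positive is ever predicted.
f1 = f1_score(y_true, y_naive, zero_division=0)
print(f"Accuracy: {acc:.4f}  Recall: {rec:.4f}  F1: {f1:.4f}")
```

Accuracy comes out at 0.99, while recall and F1 are both 0, because not a single positive example is found.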

### Dummy classifier

In order to test our methods, we create a dummy classifier that randomly picks between the two classes. This dummy classifier is part of the _scikit-learn_ package. In order to train the model, we invoke the `fit` method on the created classifier, providing training data and labels as arguments.

In [None]:
from sklearn.dummy import DummyClassifier

X, y = split_data(imbalanced)
dummy = DummyClassifier(strategy='uniform')
dummy.fit(X, y)

After fitting, we can now evaluate the trained model on both versions of the training data set (imbalanced and downsampled), as well as the test set.

In [None]:
evaluate_model(dummy, imbalanced)

In [None]:
evaluate_model(dummy, downsampled)

In [None]:
evaluate_model(dummy, data_test)

As you would expect, we get around 50% accuracy and an AUC value of 0.5.

## Logistic regression

Next, we want to train a logistic regression model that should provide a more realistic baseline for subsequent model training. We will train the model on our downsampled dataset in order to avoid a biased model. For performance evaluation, we will focus on the AUC score since it was used in the original Kaggle competition.

In [None]:
from sklearn.linear_model import LogisticRegression

X, y = split_data(downsampled)
log_ds = LogisticRegression()
log_ds.fit(X, y)
evaluate_model(log_ds, downsampled)

In [None]:
evaluate_model(log_ds, data_test)

**Exercise:** Train a logistic regression model on your previously created, completely balanced dataset. How does its performance compare to the results observed above?

## Random forest

Now, we want to train a more complex model that is often a good choice for classification problems on tabular data: a random forest. A random forest constitutes an ensemble of decision trees where predictions of single trees are combined in order to derive a more robust prediction. Again, we train models on the downsampled version of the training set.
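To illustrate how the predictions of single trees are combined, the following sketch (on synthetic data from `make_classification`, not on our fraud dataset) averages the class probabilities of the individual trees by hand and compares the result with the forest's own output. It relies on the fitted trees being available in the `estimators_` attribute:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic classification problem, for illustration only.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Average the class probabilities of the individual trees by hand ...
manual = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0
)
# ... and compare them with what the forest itself reports.
matches = np.allclose(manual, forest.predict_proba(X_toy))
print(matches)
```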

In [None]:
from sklearn.ensemble import RandomForestClassifier

To begin with, we leave the default parameters mostly untouched and simply specify the number of trees to train (here: 50) and the maximum share of features to consider when searching for the best split (here: 70%).

In [None]:
X, y = split_data(downsampled)
rf_ds = RandomForestClassifier(n_estimators=50, max_features=0.7)
rf_ds.fit(X, y)

In [None]:
evaluate_model(rf_ds, downsampled)

In [None]:
evaluate_model(rf_ds, data_test)

For the first time we can observe overfitting, marked by the large performance gap between training and test data. We can see the potential of this modelling approach though, as the performance on the training data set is quite impressive.

**Exercise:** Train a random forest model on your previously created, completely balanced dataset. How does its performance compare to the results observed above?

## Grid search for hyperparameter tuning

Manual hyperparameter tuning can be quite time-consuming. In the case of random forests, we can vary a lot of parameters, for example:
- number of trees to train
- maximum number of features considered at each split
- minimum number of examples in each leaf
- maximum depth of each decision tree

In the following, we will focus on the first two hyperparameters and perform a simple grid search in order to find the best model within the grid. This is a brute-force approach (i.e., testing every possible hyperparameter combination) that can be computationally expensive (especially when more hyperparameters are included), but it will suffice for a start.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [0.5, 0.7],
}

In [None]:
rf = RandomForestClassifier()
rf_cv = GridSearchCV(rf, param_grid=param_grid, scoring='roc_auc', cv=5)
rf_cv.fit(X, y)
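To get a feel for the cost of this brute-force search, the combinations can be enumerated with `ParameterGrid`. The sketch below repeats the grid from above under the illustrative name `param_grid_demo` and counts the model fits for 5-fold cross-validation:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the snippet is self-contained.
param_grid_demo = {
    'n_estimators': [50, 100],
    'max_features': [0.5, 0.7],
}
cv_folds = 5

combos = list(ParameterGrid(param_grid_demo))
for params in combos:
    print(params)

# One fit per combination and fold, plus one final refit of the best
# combination on the full training data (GridSearchCV default refit=True).
n_fits = len(combos) * cv_folds + 1
print(n_fits)
```

Adding a third hyperparameter with only three candidate values would already triple the number of combinations, which is why larger grids quickly become expensive.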

In [None]:
evaluate_model(rf_cv, downsampled)

In [None]:
evaluate_model(rf_cv, data_test)

We can see that the best model from grid search is on par with our previously best-performing model.
We will stop model training here; further testing is left to the workshop participants.
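A fitted `GridSearchCV` object also records which combination won and how it scored during cross-validation. The following sketch uses a small synthetic dataset and a reduced grid (both chosen for illustration, so that it runs quickly) to show the `best_params_`, `best_score_` and `cv_results_` attributes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and a reduced grid to keep the run time short.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_features': [0.5, 0.7]},
    scoring='roc_auc',
    cv=3,
)
search.fit(X_toy, y_toy)

# Winning combination and its mean cross-validated AUC.
print(search.best_params_)
print(round(search.best_score_, 4))

# Mean cross-validated AUC for every combination in the grid.
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params, round(score, 4))
```

The same attributes are available on `rf_cv` after fitting.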

**Exercise:** Perform hyperparameter tuning for a random forest model on your previously created, completely balanced dataset. How does the performance compare to the results observed above?

## Save models and datasets

Reproducible experimentation in machine learning requires datasets and trained models to be saved. In the following, we will save our data in the efficient _feather_ format, and our models using the serialization functionality of the _joblib_ library.

In [None]:
data_train.reset_index(drop=True).to_feather(f"{DATA_PATH}data_train.feather")
data_test.reset_index(drop=True).to_feather(f"{DATA_PATH}data_test.feather")

In [None]:
from joblib import dump, load

In [None]:
dump(rf_ds, f'{DATA_PATH}random_forest.joblib') 
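A saved model can be restored with `load`. The sketch below performs a full round trip with a small model trained on synthetic data and a temporary directory (the file name `random_forest_demo.joblib` is made up for this example) and checks that the restored model produces the same predictions:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Write the model to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_demo.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should behave exactly like the original one.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Since joblib files are based on Python's pickle mechanism, they should only be loaded from trusted sources, ideally with the same library versions that were used for saving.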

In [None]:
!ls -lh tmp/