# Stratified K-fold Cross Validation with SMOTE

A showcase of stratified k-fold cross validation with oversampled data. The Credit Card Fraud Detection data was obtained from:

https://www.kaggle.com/mlg-ulb/creditcardfraud

See the link above for details of the data. This notebook extends the notebook on imbalanced data [here](https://github.com/allannsze/demos/blob/main/python/imbalanced_data.ipynb).

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from imblearn.over_sampling import SMOTE

In [2]:
df = pd.read_csv('data/creditcard.csv')

# Drop duplicates and Time column
df.drop_duplicates(inplace=True)
df.drop('Time', axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


## Stratification

Cross validation is generally used to assess a model's predictive power on unseen data. In k-fold cross validation, the data is split into k folds. In each of the k runs, one fold is held out as the test set and the model is fitted on the remaining k-1 folds. For each run, the model is evaluated on the held-out fold using the chosen metrics. The metrics from the k runs are averaged to give a cross validation score.
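
As a minimal sketch of the splitting scheme, the snippet below runs `KFold` on ten toy samples and records how often each sample lands in a test fold. The array `X_demo` and the counters are illustrative names, not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples with two features each (illustrative data only)
X_demo = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)

test_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(X_demo):
    # Each run trains on k-1 folds and tests on the single held-out fold
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(fold_sizes)   # (train size, test size) for each run
print(test_counts)  # number of times each sample was used for testing
```

Every sample is used for testing exactly once and for training in the other four runs.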

In cases of severe class imbalance, k-fold cross validation may run into issues and produce misleading results. The problem arises when a particular fold contains few or no observations of the minority class (in this case the fraudulent transactions). The folds must therefore be stratified so that each fold contains approximately the same proportion of the minority class as the full data set, which is what stratified k-fold cross validation does. Keep in mind that we also wish to oversample the training data within each cross validation run.
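
To illustrate the difference, here is a small sketch on a deliberately ordered toy data set (1% minority class, all positives placed at the start, no shuffling). The names `X_demo`, `y_demo` and `minority_counts` are hypothetical and the setup is an extreme case chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
n_samples = 1000
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:10] = 1  # 10 minority samples, clustered at the start of the data
X_demo = rng.normal(size=(n_samples, 3))

def minority_counts(splitter):
    """Number of minority-class samples in each test fold."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain_counts = minority_counts(KFold(n_splits=5))
strat_counts = minority_counts(StratifiedKFold(n_splits=5))

print('KFold:          ', plain_counts)
print('StratifiedKFold:', strat_counts)
```

With plain `KFold`, four of the five test folds contain no minority samples at all, so a metric such as AUC cannot be computed on them. `StratifiedKFold` spreads the minority samples evenly across the folds.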

Usually, this can be done using `Pipeline` from the `imblearn` package to combine:
* `StratifiedKFold` from `sklearn`
* `SMOTE` from `imblearn`
* Your model
* Parameter tuning e.g. Grid search

Instead, let's do this manually, using AUC as the metric:

In [3]:
X = df.drop('Class', axis=1)
y = df['Class']

In [4]:
# Model Config
skf = StratifiedKFold(n_splits=10, shuffle=True)
scaler = StandardScaler()
smote = SMOTE(random_state=42)
lr = LogisticRegression()

score_method = 'AUC'
skf_scores = []
fold = 1

# Loop through each CV run
print('Performing cross validation...')
print(f'Scoring method: {score_method}')
for train_index, test_index in skf.split(X, y):
    # Train test split; copy so the Amount assignment below does not modify a slice of X
    X_train_fold, X_test_fold = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]

    # Normalise Amount column, fitting the scaler on the training fold only
    X_train_fold['Amount'] = scaler.fit_transform(X_train_fold['Amount'].values.reshape(-1, 1))
    X_test_fold['Amount'] = scaler.transform(X_test_fold['Amount'].values.reshape(-1, 1))

    # Oversample the training fold only, then fit on the resampled data
    X_train_sm, y_train_sm = smote.fit_resample(X_train_fold, y_train_fold)
    lr.fit(X_train_sm, y_train_sm)
    y_pred_fold = lr.predict(X_test_fold)

    # Scoring metric (computed from hard class labels, not predicted probabilities)
    score = roc_auc_score(y_test_fold, y_pred_fold)
    print(f'Fold {fold} Cross Validation Score: {score:.3f}')
    skf_scores.append(score)
    fold += 1
print(f'Average Cross Validation Score: {np.mean(skf_scores):.3f}')

Performing cross validation...
Scoring method: AUC
Fold 1 Cross Validation Score: 0.933
Fold 2 Cross Validation Score: 0.945
Fold 3 Cross Validation Score: 0.945
Fold 4 Cross Validation Score: 0.945
Fold 5 Cross Validation Score: 0.964
Fold 6 Cross Validation Score: 0.925
Fold 7 Cross Validation Score: 0.934
Fold 8 Cross Validation Score: 0.934
Fold 9 Cross Validation Score: 0.945
Fold 10 Cross Validation Score: 0.934
Average Cross Validation Score: 0.940
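
Note that the loop above passes the output of `lr.predict` (hard 0/1 labels) to `roc_auc_score`. For binary labels this value coincides with balanced accuracy at the default threshold. A threshold-independent AUC is usually computed from predicted probabilities instead. The sketch below compares the two on synthetic data; all names (`X_demo`, `clf`, and so on) are illustrative and the numbers it prints are not results for the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% minority class), for illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: a single point on the ROC curve
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from positive-class probabilities: the full ROC curve
auc_from_probs = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(f'AUC from labels:        {auc_from_labels:.3f}')
print(f'Balanced accuracy:      {bal_acc:.3f}')
print(f'AUC from probabilities: {auc_from_probs:.3f}')
```

In the manual loop, the equivalent change would be to score `lr.predict_proba(X_test_fold)[:, 1]` instead of `y_pred_fold`.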


Summary:
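* Only use the training data to fit the scaler
* Only oversample the training set and leave the test set unchanged
* Oversampling should generally be performed after the stratified k-fold split, otherwise synthetic samples derived from test observations can leak into the training set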