# [obsolete] Cross Validation

Datasets:

1. [Credit Card Fraud Detection](https://www.kaggle.com/dalpozz/creditcardfraud)

Dataset download:

1. https://www.kaggle.com/dalpozz/creditcardfraud/downloads/creditcardfraud.zip

References:

+ [Cross-validation @ scikit-learn](http://scikit-learn.org/stable/modules/cross_validation.html)
+ https://www.kaggle.com/arathee2/achieving-100-accuracy
+ https://www.kaggle.com/edunuke/binary-decision-tree-with-cross-validation/notebook
+ [Why every statistician should know about cross-validation](https://robjhyndman.com/hyndsight/crossvalidation/)

More:

+ https://www.kaggle.com/maximilianhahn/manager-skill-for-cross-validation-pipelines
+ https://www.kaggle.com/alexandrebarachant/simple-grasp-cross-validation/code
+ https://www.kaggle.com/solomonk/proper-cross-validation

In [1]:
import os, itertools
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

DATASET = lambda fname: os.path.join('datasets', fname)

## Load dataset

Check dataset dimension and class distribution.

In [2]:
assert os.path.exists(DATASET('creditcard.csv'))
dataset_original = pd.read_csv(DATASET('creditcard.csv'))

def brief_dataset(ds):
    """ Print a brief report of the dataset """
    print('shape', ds.shape,
          'distribution', { # a mapping from class to class subset shape
              cls:ds.loc[lambda i: i.Class == cls, :].shape[0] # filter by class
                  for cls in ds.Class.unique() # uniquely enumerate all classes
          })
    
brief_dataset(dataset_original)

shape (284807, 31) distribution {0: 284315, 1: 492}
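The printed distribution corresponds to a fraud share of 492 / 284807, roughly 0.17%. As a minimal sketch, the same per-class count can be obtained with `value_counts`. The `toy` frame and the `class_distribution` helper below are hypothetical stand-ins, not part of the notebook; only the `Class` column name is taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for creditcard.csv (same 'Class' column name).
toy = pd.DataFrame({'Amount': [1.0, 2.5, 3.0, 4.2, 5.1, 6.0],
                    'Class':  [0, 0, 0, 0, 0, 1]})

def class_distribution(ds):
    """Return a {class: row count} mapping, like the one brief_dataset prints."""
    return ds['Class'].value_counts().to_dict()

counts = class_distribution(toy)
fraud_ratio = counts[1] / len(toy)  # share of rows labelled as fraud
print(counts, 'fraud ratio: {:.4f}'.format(fraud_ratio))
```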


Take the first few rows for a preview of the dataset.

In [None]:
dataset_original[:5]

## Resample for training

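Resample the dataset to get a more balanced class distribution: downsample the normal class and upsample the fraud class (with replacement) to 4000 rows each.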

API ref: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

In [3]:
NORMAL = lambda i: i.Class == 0
FRAUD = lambda i: i.Class == 1

subset_normal = dataset_original.loc[NORMAL, :].sample(n = 4000) # downsample
subset_fraud = dataset_original.loc[FRAUD, :].sample(n = 4000, replace = True) # upsample
dataset_resampled = pd.concat([subset_normal, subset_fraud]).reset_index(drop=True)
brief_dataset(dataset_resampled)

shape (8000, 31) distribution {0: 4000, 1: 4000}
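One caveat with this approach: because the fraud rows are upsampled with `replace = True` before any train/test split, copies of the same fraud row can end up on both sides of a later split, which may inflate the test score. A possible alternative, sketched here on hypothetical toy data (the `toy` frame and its sizes are assumptions for illustration), is to hold out the test set first and resample only the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical imbalanced toy data: 950 normal rows, 50 fraud rows.
toy = pd.DataFrame({'V1': rng.normal(size=1000),
                    'Class': [0] * 950 + [1] * 50})

# 1. Hold out the test set first, keeping the class ratio (stratify).
train, test = train_test_split(toy, test_size=0.3, stratify=toy.Class, random_state=0)

# 2. Resample only the training rows.
train_normal = train.loc[train.Class == 0].sample(n=200, random_state=0)  # downsample
train_fraud = train.loc[train.Class == 1].sample(n=200, replace=True, random_state=0)  # upsample
train_balanced = pd.concat([train_normal, train_fraud]).reset_index(drop=True)

# No original row index appears on both sides of the split.
overlap = set(train.index) & set(test.index)
balanced_counts = train_balanced.Class.value_counts().to_dict()
print(balanced_counts, 'overlap:', len(overlap))
```

The test set keeps the original imbalance here, so accuracy alone is less informative on it than on a balanced set.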


## SVM Training

API ref:

+ [Indexing @ pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing)
+ [SVM API @ scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [4]:
import sklearn.model_selection
from sklearn import svm

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    dataset_resampled.iloc[:,:-1], # features
    dataset_resampled.Class, # labels
    test_size=0.3 # test dataset fraction
)

print('shapes:', [i.shape for i in (X_train, X_test, y_train, y_test)])

classifier = svm.SVC(kernel='rbf', C=1, gamma=0.1).fit(X_train, y_train)
classifier.score(X_test, y_test)

shapes: [(5600, 30), (2400, 30), (5600,), (2400,)]


0.99624999999999997
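The score above is plain accuracy on unscaled features. RBF kernels are generally sensitive to feature scale, and for fraud detection recall on the fraud class is often of more interest than accuracy. The following is a sketch only, on hypothetical synthetic data from `make_classification` rather than the credit card frame, of how scaling and per-class metrics could be added around the same `SVC` settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the resampled credit card frame.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features, then fit the same kind of RBF SVM as in the cell above.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma=0.1))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
recall = recall_score(y_test, y_pred)  # recall of the positive class (label 1)
print('accuracy', accuracy, 'recall', recall)
print(classification_report(y_test, y_pred))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which also keeps it safe to pass to cross-validation helpers.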

## Computing cross-validated metrics

Create another sample for cross-validation. Cross-validating on the original dataset (more than 280,000 rows) takes too long, so the cell below draws a random sample of 5000 rows instead.

In [None]:
# dataset_cv = dataset_original.sample(n = 5000).reset_index(drop=True)
# sklearn.model_selection.cross_val_score(
#     classifier,
#     dataset_cv.iloc[:,:-1], # features
#     dataset_cv.Class, # labels
#     cv=10 # 10-fold cross validation
# )
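A runnable variant of the commented-out call might look like the sketch below. It uses hypothetical synthetic data (about 10% positives) instead of `dataset_cv`, and passes an explicit `StratifiedKFold` so that each fold keeps the class ratio; with a very rare class, a small random sample may otherwise contain fewer positive rows than folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Hypothetical imbalanced synthetic data (about 10% positives).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# 10-fold cross-validation with shuffled, class-stratified folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel='rbf', C=1, gamma=0.1), X, y, cv=cv)
print(scores.shape, scores.mean())
```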

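1. Divide the normal and fraud subsets into folds separately, so every fold keeps roughly the original class ratio
2. Recombine the folds into training and test sets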

In [5]:
N_FOLDS = 10
# Shuffle
dataset_cv = dataset_original.sample(frac = 1).reset_index(drop=True)

# Divide folds
folds_normal = np.array_split(dataset_cv.loc[NORMAL, :], N_FOLDS)
folds_fraud = np.array_split(dataset_cv.loc[FRAUD, :], N_FOLDS)
print('Normal folds', [i.shape[0] for i in folds_normal])
print('Fraud folds', [i.shape[0] for i in folds_fraud]) 

# Cross validation
for i in range(N_FOLDS):
    # Split training/test datasets
    train = np.concatenate([np.concatenate([folds_normal[j], folds_fraud[j]]) for j in range(N_FOLDS) if j!=i])
    test = np.concatenate([folds_normal[i], folds_fraud[i]])
    X_train = train[:,:-1]
    X_test = test[:,:-1]
    y_train = train[:,-1]
    y_test = test[:,-1]
    print(i, 'X_train', X_train.shape, 'X_test', X_test.shape, 'y_train', y_train.shape, 'y_test', y_test.shape)
    
    # classifier = svm.SVC(kernel='rbf', C=1, gamma=0.1).fit(X_train, y_train)
    # classifier.score(X_test, y_test)

Normal folds [28432, 28432, 28432, 28432, 28432, 28431, 28431, 28431, 28431, 28431]
Fraud folds [50, 50, 49, 49, 49, 49, 49, 49, 49, 49]
0 X_train (256325, 30) X_test (28482, 30) y_train (256325,) y_test (28482,)
1 X_train (256325, 30) X_test (28482, 30) y_train (256325,) y_test (28482,)
2 X_train (256326, 30) X_test (28481, 30) y_train (256326,) y_test (28481,)
3 X_train (256326, 30) X_test (28481, 30) y_train (256326,) y_test (28481,)
4 X_train (256326, 30) X_test (28481, 30) y_train (256326,) y_test (28481,)
5 X_train (256327, 30) X_test (28480, 30) y_train (256327,) y_test (28480,)
6 X_train (256327, 30) X_test (28480, 30) y_train (256327,) y_test (28480,)
7 X_train (256327, 30) X_test (28480, 30) y_train (256327,) y_test (28480,)
8 X_train (256327, 30) X_test (28480, 30) y_train (256327,) y_test (28480,)
9 X_train (256327, 30) X_test (28480, 30) y_train (256327,) y_test (28480,)
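The manual per-class fold splitting above is close in spirit to what scikit-learn's `StratifiedKFold` provides. As a sketch on hypothetical toy labels (1000 normal and 50 fraud rows, chosen so the counts divide evenly by the number of folds), each test fold receives the same number of rows from each class:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 1000 normal rows and 50 fraud rows.
y = np.array([0] * 1000 + [1] * 50)
X = np.zeros((len(y), 1))  # feature values are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count (normal, fraud) rows in each test fold.
    fold_counts.append(tuple(int(n) for n in np.bincount(y[test_idx], minlength=2)))
print(fold_counts)
```

On the real dataset the class counts do not divide evenly by 10, so fold sizes would differ by a row or two, as in the printed output above.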
