## Assignment 8

This assignment uses the Credit Card Fraud Detection dataset, which is available on [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

In this notebook, transactions are classified as fraudulent or non-fraudulent using support vector machines (SVMs).  
Techniques for handling imbalanced data are also implemented.
The first part of the workbook focuses on handling imbalanced data. The algorithm part starts [here](#svm_section).

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
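
As a quick check of these figures, the fraud rate follows directly from the two counts quoted above. The snippet is a small illustration only; the variable names are made up for this example.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

fraud_rate = n_fraud / n_total
print(f"fraud rate: {fraud_rate:.3%}")  # roughly 0.17%

# A classifier that always predicts "not fraud" already reaches this accuracy,
# which is why accuracy alone says little on this dataset.
baseline_accuracy = 1 - fraud_rate
print(f"always-negative accuracy: {baseline_accuracy:.3%}")
```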

It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features that have not been transformed are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount, which can be used, for instance, for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

In [1]:
# if needed install packages uncommenting and executing the following commands
# !pip install imblearn

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
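# plot_roc_curve is not used below and was removed in scikit-learn 1.2; make_scorer is needed by the grid-search cell
from sklearn.metrics import classification_report, confusion_matrix, make_scorer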
from imblearn.under_sampling import TomekLinks, RandomUnderSampler, NearMiss
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.pipeline import Pipeline
from tqdm import tqdm
import os 

os.chdir('C:/Users/anast/OneDrive/Desktop/MSc/MachineLearning/Assignments/Asgmt8_SVM/')

In [3]:
data_file = 'creditcard.csv'

data = pd.read_csv(data_file)

**Scaling**  
The `Time` and `Amount` variables need to be scaled. The rest of the variables (the principal components) are already scaled.


The target variable is extremely imbalanced: only 0.17% of the transactions in the dataset are fraudulent. As a result, a model may fit the non-fraudulent examples almost exclusively and fail to recognise fraud. There are different ways to handle this issue.  
Here I will experiment with:
* Undersampling
* Oversampling
* Combination of both

Some nice guides can be found on [DataCamp](https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data?utm_source=adwords_ppc&utm_campaignid=898687156&utm_adgroupid=48947256715&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034349&utm_targetid=aud-390929969673:dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061579&gclid=CjwKCAiAq8f-BRBtEiwAGr3Dgc65y799jXfSyX1UAugegeLHUDk7lb6izpB-coR1udmOQvHoN76s2xoCpg8QAvD_BwE), [KDnuggets](https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html) and [Machine Learning Mastery](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/).

In [4]:
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['Time'] = scaler.fit_transform(data['Time'].values.reshape(-1,1))

In [5]:
X = data.drop(columns='Class')
y = data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=133, stratify=y)
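
The scaling cell above fits the scaler on the full dataset before the split, so the test rows contribute to the scaling statistics. A stricter alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows. The sketch below illustrates the idea with NumPy on made-up amounts (the data and variable names are illustrative, not taken from the dataset). Calling `StandardScaler.fit` on the training split and `transform` on both splits has the same effect.

```python
import numpy as np

rng = np.random.default_rng(0)
amount = rng.exponential(scale=80.0, size=1_000)  # made-up transaction amounts

# Hypothetical 80/20 split of the raw column.
train_amount, test_amount = amount[:800], amount[800:]

# Estimate the scaling statistics on the training part only ...
mu = train_amount.mean()
sigma = train_amount.std()

# ... and apply the same statistics to both parts.
train_scaled = (train_amount - mu) / sigma
test_scaled = (test_amount - mu) / sigma

print(train_scaled.mean().round(6), train_scaled.std().round(6))
```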

### Under-sampling

**Random under-sampling** 
 
In random under-sampling, only a random subset of the majority class examples is retained, while all observations of the minority class are kept.

**Pros**: Improves the runtime of the model by reducing the number of training samples, which helps when the training set is very large.   
**Cons**: There is a high risk of information loss, as only a small subset of the majority class training examples is used.

In [6]:
# train_data = pd.concat([X_train, y_train], axis=1)

# data_fraud = train_data[train_data['Class']==1]
# data_no_fraud = train_data[train_data['Class']==0]

# fraud_count = data_fraud['Class'].count()

# # undersample majority class
# data_no_fraud = resample(data_no_fraud, replace=False, n_samples=int(fraud_count*3), random_state=909)

# data_undersampled = pd.concat([data_fraud, data_no_fraud])

# X_train = data_undersampled.drop(columns='Class')
# y_train = data_undersampled['Class']
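
The commented-out cell above keeps three non-fraud rows per fraud row. The same idea can be sketched on toy data with NumPy alone. The arrays `X_toy` and `y_toy` are made up for illustration, and the 3:1 ratio simply mirrors the cell above.

```python
import numpy as np

rng = np.random.default_rng(909)

# Toy labels: 1,000 majority (0) and 20 minority (1) examples.
y_toy = np.array([0] * 1_000 + [1] * 20)
X_toy = rng.normal(size=(len(y_toy), 2))

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority example and a random subset of the majority class,
# three majority examples per minority example.
kept_majority = rng.choice(majority_idx, size=3 * len(minority_idx), replace=False)
keep_idx = np.concatenate([minority_idx, kept_majority])

X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]
print(np.bincount(y_under))  # [60 20]
```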

**Tomek Links**  

Undersampling can also be achieved using [Tomek links](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.TomekLinks.html). A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbours. Finding them is fairly expensive, since it requires the pairwise distances between all examples. Once the links are found, the majority class member of each link is typically removed, which gives the classifier a cleaner decision boundary. Depending on the configuration, samples from the majority class, the minority class or both can be removed.  
Random undersampling can also be applied to the resulting dataset, as discussed [here](https://www.hilarispublisher.com/open-access/classification-of-imbalance-data-using-tomek-link-tlink-combined-with-random-undersampling-rus-as-a-data-reduction-method-2229-8711-S1111.pdf).

In [7]:
# tl = TomekLinks(sampling_strategy='majority')

# X_tl, y_tl = tl.fit_resample(X_train, y_train)  # resample the training split only
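
To make the definition concrete, here is a minimal from-scratch sketch that finds Tomek links as mutual nearest neighbours with different labels. It is not the imbalanced-learn implementation; the function name `tomek_links` and the toy arrays are assumptions made for this example, and the brute-force distance matrix only suits small datasets.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek links."""
    # Pairwise Euclidean distances; the diagonal is masked out so a point
    # is never its own nearest neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbours with different labels; i < j avoids duplicates.
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links

# Toy data: a minority point (index 4) sits right next to majority point 3.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [3.0, 3.0], [3.1, 3.0]])
y_toy = np.array([0, 0, 0, 0, 1])

links = tomek_links(X_toy, y_toy)
print(links)  # [(3, 4)]

# Dropping the majority member of each link roughly mimics sampling_strategy='majority'.
drop = [i if y_toy[i] == 0 else j for i, j in links]
X_clean = np.delete(X_toy, drop, axis=0)
y_clean = np.delete(y_toy, drop)
```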

**Near Miss**  

Near Miss is another undersampling method. It selects majority class examples based on their distance to minority class examples. There are three versions of the technique:
* NearMiss-1: selects the majority class examples with the lowest mean distance to their three closest minority class examples.
* NearMiss-2: selects the majority class examples with the lowest mean distance to their three furthest minority class examples.
* NearMiss-3: for each minority class example, selects the majority class examples that are closest to it.

In [8]:
# miss = NearMiss(sampling_strategy='majority')

# X_miss, y_miss = miss.fit_resample(X_train, y_train)  # resample the training split only
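
The NearMiss-1 rule can be sketched in a few lines of NumPy. This is an illustration of the selection rule under simple assumptions (binary labels with 1 as the minority class, brute-force distances), not the library's implementation; `nearmiss1` and the toy data are hypothetical names for this example.

```python
import numpy as np

def nearmiss1(X, y, n_keep, k=3):
    """Keep the n_keep majority (label 0) rows with the smallest mean
    distance to their k closest minority (label 1) rows."""
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    # Distance from every majority row to every minority row.
    dist = np.linalg.norm(X[maj][:, None, :] - X[mino][None, :, :], axis=2)
    # Mean of the k smallest distances per majority row.
    mean_k = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    kept_maj = maj[np.argsort(mean_k)[:n_keep]]
    return np.sort(np.concatenate([kept_maj, mino]))

# Toy 1-D data: minority points near 0, majority points spread to the right.
X_toy = np.array([[0.0], [0.1], [0.2], [0.5], [1.0], [4.0], [9.0]])
y_toy = np.array([1, 1, 1, 0, 0, 0, 0])

kept = nearmiss1(X_toy, y_toy, n_keep=2)
print(kept)  # [0 1 2 3 4]
```

The two majority points closest to the minority cluster (indices 3 and 4) survive, while the distant ones are dropped.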

### Over-sampling

**SMOTE**

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique in which synthetic samples are generated for the minority class. It helps to reduce the overfitting caused by random oversampling, which merely duplicates existing examples. New instances are generated in feature space by interpolating between minority class instances that lie close together.

Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.

In [9]:
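# n_jobs is deprecated for SMOTE in recent imbalanced-learn releases, so it is omitted
# sm = SMOTE(sampling_strategy=0.3, random_state=133)

# X_sm, y_sm = sm.fit_resample(X_train, y_train)  # resample the training split only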
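
The interpolation step described above can be sketched as follows. This is a simplified illustration, not the imbalanced-learn implementation; `smote_sketch` and the random minority rows are assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class; exclude self-matches.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))      # random minority example
        j = rng.choice(neighbours[i])     # one of its k nearest neighbours
        lam = rng.random()                # position along the segment, in [0, 1)
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

X_min = np.random.default_rng(1).normal(size=(20, 2))  # made-up minority rows
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because every synthetic row lies on a segment between two existing minority rows, the new points stay inside the region already occupied by the minority class.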

**ADASYN**  

Adaptive Synthetic Sampling (ADASYN) is a generalised form of the SMOTE algorithm. It also oversamples the minority class by generating synthetic instances. The difference is that it uses a density distribution $r_i$ to decide how many synthetic instances to generate for each minority example: more are generated for examples that are harder to learn, i.e. those with many majority class neighbours. In this way it adaptively shifts the decision boundary towards the examples that are difficult to learn.

In [10]:
# ada = ADASYN(random_state=133)  # n_jobs is deprecated for ADASYN in recent imbalanced-learn releases

# X_ada, y_ada = ada.fit_resample(X_train, y_train)
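
The weighting idea can be illustrated with a short sketch: for each minority row, the share of majority examples among its k nearest neighbours is computed, normalised so the shares sum to one, and used to split the number of synthetic rows. This is a simplified reading of the algorithm, not the library code. The function `adasyn_allocation` and the toy data are hypothetical, and the sketch assumes at least one minority row has a majority neighbour (otherwise the normalisation divides by zero).

```python
import numpy as np

def adasyn_allocation(X, y, n_new, k=5):
    """Number of synthetic rows assigned to each minority (label 1) row."""
    mino = np.flatnonzero(y == 1)
    # Distances from each minority row to every row in the dataset.
    dist = np.linalg.norm(X[mino][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(len(mino)), mino] = np.inf     # ignore the row itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # r_i: share of majority examples among the k nearest neighbours.
    r = (y[neighbours] == 0).mean(axis=1)
    r_hat = r / r.sum()                           # normalise to a distribution
    return np.rint(r_hat * n_new).astype(int)

# Toy 1-D data: minority row 0 sits among majority rows, rows 1-3 form a minority cluster.
X_toy = np.array([[0.0], [5.0], [5.1], [5.2], [0.1], [0.2], [-0.1]])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0])

allocation = adasyn_allocation(X_toy, y_toy, n_new=12, k=3)
print(allocation)  # [6 2 2 2]
```

The isolated minority row surrounded by majority examples receives half of the synthetic rows, while each row in the minority cluster receives fewer.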

**Combining over- and under-sampling techniques**  


In [11]:
under = RandomUnderSampler(sampling_strategy=0.01, random_state=101)
miss = NearMiss(sampling_strategy=0.08, n_neighbors=3, n_jobs=-1)
tomek = TomekLinks(sampling_strategy='majority', n_jobs=-1)
# n_jobs is deprecated for ADASYN in recent imbalanced-learn releases, so it is omitted here
ada = ADASYN(sampling_strategy=0.1, random_state=359)

imbalance_pipe = Pipeline(steps=[('under_sampling', under), 
                                 ('nearmiss', miss),
                                 #('tomek_links', tomek), 
                                 ('ada', ada)
                                 ])

X_train_resampled, y_train_resampled = imbalance_pipe.fit_resample(X_train, y_train) 
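
Each float `sampling_strategy` above is the target ratio of minority to majority examples after that step. The rough arithmetic below traces the expected class counts through the pipeline, starting from the 394 fraud rows in the training split. It is an approximation written for illustration (the helper names are made up), and ADASYN in particular only approximately reaches its target ratio.

```python
n_fraud, n_legit = 394, 227_845 - 394   # training-split counts

def undersample_majority(n_min, ratio):
    """Majority size after under-sampling to n_min / n_maj == ratio."""
    return int(n_min / ratio)

def oversample_minority(n_maj, ratio):
    """Minority size after over-sampling to n_min / n_maj == ratio."""
    return int(n_maj * ratio)

n_legit = undersample_majority(n_fraud, 0.01)   # random under-sampling
after_random = (n_fraud, n_legit)
n_legit = undersample_majority(n_fraud, 0.08)   # NearMiss
after_nearmiss = (n_fraud, n_legit)
n_fraud = oversample_minority(n_legit, 0.1)     # ADASYN (approximate)
after_adasyn = (n_fraud, n_legit)

print(after_random, after_nearmiss, after_adasyn)
```

Under these assumptions the SVMs are trained on roughly five thousand rows rather than the full training split, which keeps the fits fast.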

<a id='svm_section'></a>

### SVM  
Support vector machines (SVMs) are supervised classification algorithms. The classifier separates the data points by finding the hyperplane with the largest margin between the classes. In other words, the best hyperplane is the one whose distance to the nearest element of each class is the largest. Support vectors are the data points that lie closest to the hyperplane; they are the points that determine the classifier.
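
For reference, the standard soft-margin formulation of this idea is sketched below. The notation is an assumption introduced here: labels $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, slack variables $\xi_i$, and the penalty $C$, which corresponds to the `C` argument of `SVC`.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

The margin has width $2/\lVert w\rVert$, so minimising $\lVert w\rVert^2$ maximises it, while a larger $C$ penalises points that violate the margin more heavily.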

When the data are not linearly separable, SVMs use the kernel trick. A kernel implicitly maps the low-dimensional input space into a higher-dimensional space, where a problem that was not linearly separable may become so. The mapping is never computed explicitly: the kernel returns the inner product the points would have in that space. This makes it possible to build more accurate classifiers for non-linear separation problems.  
Common kernels:
* Linear kernel
* Polynomial kernel
* RBF kernel (radial basis function)
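
A small numerical check can make the kernel trick more tangible. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D inputs and shows that it equals an ordinary dot product after an explicit feature map. The functions `poly2_kernel`, `phi` and `rbf_kernel` are written for this example; scikit-learn's `poly` kernel has the more general form `(gamma * <x, z> + coef0) ** degree`.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def rbf_kernel(x, z, gamma=0.3):
    """RBF kernel: similarity decays with the squared Euclidean distance."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly2_kernel(x, z)        # computed in the 2-D input space
k_explicit = np.dot(phi(x), phi(z))    # computed in the 3-D feature space
print(k_implicit, k_explicit)
print(rbf_kernel(x, x))                # a point is maximally similar to itself
```

Both routes give the same value, but the kernel never builds the higher-dimensional vectors, which is what keeps SVMs tractable for kernels such as RBF whose feature space is infinite-dimensional.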


Note: _Although SVMs are usually presented as classifiers, they can be used for both classification and regression tasks._   
Some notes on the maths behind SVMs can be found on [analyticsvidhya](https://www.analyticsvidhya.com/blog/2020/10/the-mathematics-behind-svm/) and in this lecture by [Andrew Ng](https://www.youtube.com/watch?v=QKc3Tr7U4Xc).

In [12]:
c_values = [0.1, 10, 0.1, 10, 0.1, 10, 100]
kernels = ['poly', 'poly', 'rbf', 'rbf', 'sigmoid', 'sigmoid', 'sigmoid']
gammas = [0.2, 6, 0.3, 5, 0.5, 2, 5]
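# degree only affects the 'poly' kernel; 0 is a placeholder for the other kernels
degrees = [2, 5, 0, 0, 0, 0, 0]

# one (C, kernel, gamma, degree) tuple per configuration
params = list(zip(c_values, kernels, gammas, degrees))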

In [13]:
print(len(y_train), sum(y_train), sum(y_train)/len(y_train))

227845 394 0.001729245759178389


In [14]:
accuracy = []
precission = []
recall = []
f1 = []

# Fit one SVM per configuration on the resampled training data
# and evaluate it on the untouched test set
for C, kernel, gamma, degree in tqdm(params, position=0):
    clf = SVC(C=C,
              kernel=kernel,
              gamma=gamma,
              degree=degree,
              max_iter=1000,  # hard cap on solver iterations to keep the runtime short
              tol=0.01)
    clf.fit(X_train_resampled, y_train_resampled)
    preds = clf.predict(X_test)
    accuracy.append(accuracy_score(y_test, preds))
    precission.append(precision_score(y_test, preds))
    recall.append(recall_score(y_test, preds))
    f1.append(f1_score(y_test, preds))

100%|██████████| 7/7 [00:49<00:00,  7.08s/it]


In [15]:
results_df = pd.DataFrame(data = {'C': c_values,
                                  'Kernel':kernels,
                                  'Gamma': gammas,
                                  'Degree': [x if x>0 else '-' for x in degrees],
                                  'Accuracy': accuracy,
                                  'Precision': precission,
                                  'Recall': recall,
                                  'F1': f1})
results_df

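|   | C | Kernel | Gamma | Degree | Accuracy | Precision | Recall | F1 |
|---|---|--------|-------|--------|----------|-----------|--------|----|
| 0 | 0.1 | poly | 0.2 | 2 | 0.929866 | 0.022777 | 0.94898 | 0.044487 |
| 1 | 10.0 | poly | 6.0 | 5 | 0.356325 | 0.002476 | 0.928571 | 0.004939 |
| 2 | 0.1 | rbf | 0.3 | - | 0.99828 | 0.0 | 0.0 | 0.0 |
| 3 | 10.0 | rbf | 5.0 | - | 0.998385 | 0.875 | 0.071429 | 0.132075 |
| 4 | 0.1 | sigmoid | 0.5 | - | 0.945543 | 0.013286 | 0.418367 | 0.025754 |
| 5 | 10.0 | sigmoid | 2.0 | - | 0.932885 | 0.007405 | 0.285714 | 0.014437 |
| 6 | 100.0 | sigmoid | 5.0 | - | 0.941329 | 0.013205 | 0.44898 | 0.025656 |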


In [16]:
# take the first configuration as the best SVM model, since it has the highest recall
clf = SVC(C=0.1, 
          kernel='poly', 
          gamma=0.2, 
          degree=2,
          max_iter=1000,
          tol=0.01)
clf.fit(X_train_resampled, y_train_resampled)
preds = clf.predict(X_test)

In [17]:
print(classification_report(y_test, preds))
print('Confusion matrix')
print(confusion_matrix(y_test, preds))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96     56864
           1       0.02      0.95      0.04        98

    accuracy                           0.93     56962
   macro avg       0.51      0.94      0.50     56962
weighted avg       1.00      0.93      0.96     56962

Confusion matrix
[[52874  3990]
 [    5    93]]
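
The figures in the report can be reproduced by hand from the confusion matrix, which makes the trade-off explicit: almost every fraud is caught, but most alerts are false alarms. The variable names below are chosen for this illustration.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 52874, 3990
fn, tp = 5, 93

precision_fraud = tp / (tp + fp)   # share of flagged transactions that are fraud
recall_fraud = tp / (tp + fn)      # share of frauds that are flagged
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)
acc = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_fraud:.4f} recall={recall_fraud:.4f} "
      f"f1={f1_fraud:.4f} accuracy={acc:.4f}")
```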


**Hyperparameter search**

For imbalanced datasets, hyperparameter search is not an easy or straightforward procedure. 
Grid search is usually done with cross-validation. If the validation set is taken from the undersampled or oversampled data, the validation scores will be overly optimistic: good results on the validation set may not be followed by good results on the test set. Therefore, the validation set should be neither undersampled nor oversampled; only the training folds should be resampled.  
Some interesting notes are on [researchgate](https://www.researchgate.net/post/should_oversampling_be_done_before_or_within_cross-validation) and [stackexchange](https://datascience.stackexchange.com/questions/61858/oversampling-undersampling-only-train-set-only-or-both-train-and-validation-set).
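
One way to set this up is sketched below with NumPy only: the data are split into stratified folds first, and resampling is applied to the training part of each split, never to the validation fold. The helpers `stratified_folds` and `undersample`, the toy labels and the 0.5 ratio are all assumptions made for this example; model fitting is left out.

```python
import numpy as np

def stratified_folds(y, n_splits, rng):
    """Split indices into n_splits folds that preserve the class ratio."""
    folds = [[] for _ in range(n_splits)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for f, part in enumerate(np.array_split(idx, n_splits)):
            folds[f].extend(part.tolist())
    return [np.array(f) for f in folds]

def undersample(idx, y, ratio, rng):
    """Randomly drop majority (label 0) rows until minority/majority == ratio."""
    mino = idx[y[idx] == 1]
    maj = idx[y[idx] == 0]
    n_maj = min(len(maj), int(len(mino) / ratio))
    return np.concatenate([mino, rng.choice(maj, size=n_maj, replace=False)])

rng = np.random.default_rng(133)
y_toy = np.array([1] * 30 + [0] * 2_970)      # 1% positives, made-up labels

folds = stratified_folds(y_toy, n_splits=3, rng=rng)
train_rates, val_rates = [], []
for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    train_idx = undersample(train_idx, y_toy, ratio=0.5, rng=rng)  # training part only
    # A model would be fitted on train_idx and scored on the untouched val_idx.
    train_rates.append(y_toy[train_idx].mean())
    val_rates.append(y_toy[val_idx].mean())

print(np.round(train_rates, 3), np.round(val_rates, 3))
```

The validation folds keep the original 1% positive rate while the training parts are rebalanced. In practice, the `imblearn` `Pipeline` applies its sampler steps only during `fit`, so passing a pipeline of samplers plus `SVC` to `GridSearchCV` and fitting it on the unresampled `X_train`, `y_train` should give the same separation.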

TODO: Create cross validation, with the correct validation data.


In [18]:
# params = {'C': np.arange(0.1,0.5,0.1),
#           'gamma':np.arange(0.1,0.5,0.1),
#           'degree':range(1,4)
#           }

# scoring = {'Accuracy': make_scorer(accuracy_score),
#            'Precision': make_scorer(precision_score),
#            'Recall': make_scorer(recall_score),
#            'F1': make_scorer(f1_score),
#            }


# clf = SVC(kernel='poly', max_iter=1000, tol=0.01)

# grid = GridSearchCV(clf, 
#                     param_grid=params,
#                     cv=3,
#                     scoring=scoring, 
#                     refit='F1', 
#                     return_train_score=True,
#                     verbose=1,
#                     n_jobs=-1)
# grid.fit(X_train, y_train)

# print(grid.best_score_)
# print(grid.best_estimator_)

#### TODO  
Make this a notebook of different under-sampling and over-sampling techniques.