## Assignment 8

This assignment uses the Credit Card Fraud Detection dataset, which can be found on [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

This notebook uses support vector machines (SVMs) to classify transactions as fraudulent or non-fraudulent.  
Techniques for handling imbalanced data are also implemented.
The first part of the notebook focuses on handling imbalanced data. The algorithm and the tuning part starts here (TODO: add link to specific cell)

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
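
To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also hints at why plain accuracy can be misleading here: a hypothetical baseline that labels every transaction as non-fraudulent would be right almost every time while catching no fraud at all. The variable names are illustrative.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

# Share of the positive (fraud) class, in percent.
fraud_share = n_frauds / n_transactions * 100

# Accuracy of a hypothetical baseline that always predicts "no fraud".
baseline_accuracy = (n_transactions - n_frauds) / n_transactions * 100

print(f"Fraud share: {fraud_share:.2f}%")
print(f"Always-negative baseline accuracy: {baseline_accuracy:.2f}%")
```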

It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

In [28]:
# if needed, install the package by uncommenting and executing the following command
# !pip install imblearn

In [59]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE, ADASYN
import os 

os.chdir('C:/Users/anast/OneDrive/Desktop/MSc/MachineLearning/Assignments/Asgmt8_SVM/')

In [30]:
data_file = 'creditcard.csv'

data = pd.read_csv(data_file)

**Scaling**  
The 'Time' and 'Amount' variables need to be scaled. The rest of the variables (the principal components) are already scaled.
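
As a minimal sketch of what standardization does, the snippet below computes z-scores by hand: subtract the mean and divide by the population standard deviation, which is the transformation `StandardScaler` applies with its default settings. The `standardize` helper and the sample amounts are hypothetical and only serve as an illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts, for illustration only.
amounts = [2.5, 10.0, 49.9, 120.0, 1500.0]
scaled = standardize(amounts)

# After scaling, the values have mean 0 and standard deviation 1 (up to rounding).
print([round(z, 3) for z in scaled])
```

In practice the mean and standard deviation are usually estimated on the training split only and then reused to transform the test split.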


The target variable is extremely imbalanced: only 0.17% of all the transactions in the dataset are fraudulent. This may lead a model to fit mostly the non-fraudulent examples and fail to recognise fraud. There are different ways to handle such issues.  
Here I will experiment with:
* Undersampling
* Oversampling
* Combination of both

Some nice guides can be found on [DataCamp](https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data?utm_source=adwords_ppc&utm_campaignid=898687156&utm_adgroupid=48947256715&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034349&utm_targetid=aud-390929969673:dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061579&gclid=CjwKCAiAq8f-BRBtEiwAGr3Dgc65y799jXfSyX1UAugegeLHUDk7lb6izpB-coR1udmOQvHoN76s2xoCpg8QAvD_BwE), [KDnuggets](https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html) and [Machine Learning Mastery](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/).

### Under-sampling

**Random under-sampling** 
 
In random under-sampling, only a subset of the majority class examples is retained, while all the observations of the minority class are kept.

**Pros**: Improves the runtime of the model by reducing the number of training samples, which helps when the training set is very large.   
**Cons**: There is a high risk of information loss, as only a small subset of the majority class training examples is used.


In [31]:
data_fraud = data[data['Class']==1]
data_no_fraud = data[data['Class']==0]

fraud_count = data_fraud['Class'].count()

# undersample majority class
data_no_fraud = resample(data_no_fraud, replace=False, n_samples=int(fraud_count*3), random_state=909)

data_undersampled = pd.concat([data_fraud, data_no_fraud])

In [32]:
scaler = StandardScaler()
data_undersampled['Amount'] = scaler.fit_transform(data_undersampled['Amount'].values.reshape(-1,1))
data_undersampled['Time'] = scaler.fit_transform(data_undersampled['Time'].values.reshape(-1,1))

In [33]:
X = data_undersampled.drop(columns='Class')
y = data_undersampled['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=133, stratify=y)

**Tomek Links**  

Undersampling can also be achieved using [Tomek links](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.TomekLinks.html). Tomek links are pairs of examples of opposite classes in close vicinity. It is a fairly expensive algorithm, since it has to compute pairwise distances between all examples. After this calculation, the majority elements of each Tomek link are removed, which gives the classifier a cleaner decision boundary. Samples from the majority class, the minority class or both can be removed.  
Undersampling can also be performed on the resulting dataset as discussed [here](https://www.hilarispublisher.com/open-access/classification-of-imbalance-data-using-tomek-link-tlink-combined-with-random-undersampling-rus-as-a-data-reduction-method-2229-8711-S1111.pdf).
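
A minimal sketch of the idea, assuming the usual definition of a Tomek link as two examples of opposite classes that are each other's nearest neighbor. This is a brute-force illustration on hypothetical toy data, not the `imblearn` implementation; the helper names `nearest_neighbor` and `find_tomek_links` are made up for this example.

```python
from math import dist

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))

def find_tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        # i < j avoids reporting the same pair twice.
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

# Toy 2-D data (hypothetical): label 0 = majority, 1 = minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.0, 3.2)]
labels = [0, 0, 0, 1, 0, 0]

links = find_tomek_links(points, labels)

# Under-sample by dropping the majority member of every link.
to_drop = {i if labels[i] == 0 else j for i, j in links}
kept = [(p, label) for k, (p, label) in enumerate(zip(points, labels))
        if k not in to_drop]
print(links, to_drop)
```

The nested nearest-neighbor search makes the quadratic cost mentioned above visible: every example is compared with every other example.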

In [34]:
# tl = TomekLinks(sampling_strategy='majority')

# X_tl, y_tl = tl.fit_resample(X,y)

### Over-sampling

**SMOTE**

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique in which synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling. It works in feature space, generating new instances by interpolating between positive instances that lie close together.

Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.
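
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration on hypothetical 2-D points, not the `imblearn` implementation; the function `smote_sample` and its parameters are assumptions made for this example.

```python
import random
from math import dist

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority example.
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # 2. Find its k nearest minority neighbors (excluding itself).
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbors = sorted(others, key=lambda p: dist(base, p))[:k]
        # 3. Pick one neighbor and a random position on the segment between the two.
        neighbor = rng.choice(neighbors)
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical 2-D minority examples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_sample(minority, k=3, n_new=4)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority examples, the new points stay inside the region already covered by the minority class.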

In [35]:
# sm = SMOTE(sampling_strategy=0.3, random_state=133, n_jobs=-1)

# X_sm, y_sm = sm.fit_resample(X,y)

**ADASYN**  

Adaptive Synthetic Sampling Approach (ADASYN) is a generalized form of the SMOTE algorithm. Again, this algorithm aims to oversample the minority class by generating synthetic instances for it. The difference is that it considers a density distribution, `r_i`, which determines the number of synthetic instances generated for each minority sample, so that samples which are difficult to learn receive more of them. In this way, it adaptively shifts the decision boundary towards the samples that are difficult to learn.
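
A rough sketch of the weighting step, assuming `r_i` is the fraction of majority-class examples among the k nearest neighbors of minority example i, normalized so that the weights sum to one. It only shows how the synthetic samples are allocated, not how they are generated, and it is not the `imblearn` implementation; `adasyn_allocation` and the toy data are hypothetical.

```python
from math import dist

def adasyn_allocation(points, labels, k=5, n_new=10):
    """Return {minority index: number of synthetic samples to generate for it}."""
    minority_idx = [i for i, label in enumerate(labels) if label == 1]
    ratios = {}
    for i in minority_idx:
        # k nearest neighbors of points[i] among all other points.
        neighbors = sorted((j for j in range(len(points)) if j != i),
                           key=lambda j: dist(points[i], points[j]))[:k]
        # r_i: fraction of those neighbors that belong to the majority class.
        ratios[i] = sum(labels[j] == 0 for j in neighbors) / k
    total = sum(ratios.values())
    if total == 0:
        # No minority point has majority neighbors: nothing to allocate.
        return {i: 0 for i in minority_idx}
    # Normalize r_i into a distribution and scale by the requested total.
    return {i: round(ratios[i] / total * n_new) for i in minority_idx}

# Toy data (hypothetical): seven majority points (0) and four minority points (1).
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
          (10.0, 10.0), (10.2, 10.0), (10.0, 10.2),
          (0.1, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0] * 7 + [1] * 4

allocation = adasyn_allocation(points, labels, k=3, n_new=12)
print(allocation)
```

The minority point at index 7 sits in the middle of majority points, so it receives the largest share; the three minority points that cluster together receive fewer.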

In [36]:
# ada = ADASYN(random_state=133, n_jobs=-1)

# X_ada, y_ada = ada.fit_resample(X_train, y_train)

I will be working with the undersampled data, so that there are fewer examples to train on and the training time of the SVM is reduced.

**TODO**  
Write a few things about SVM

In [67]:
c_values = [0.1, 10, 0.1, 10, 0.1, 10, 100]
kernels = ['poly', 'poly', 'rbf', 'rbf', 'sigmoid', 'sigmoid', 'sigmoid']
gammas = [0.2, 6, 0.3, 5, 0.5, 2, 5]
degrees = [2, 5, 0, 0, 0, 0, 0]

params = list(zip(c_values, kernels, gammas, degrees))

In [68]:
def print_metrics(y_test, predictions):  
    print(f'Accuracy {accuracy_score(y_test, predictions)*100:.2f}%'
            f'\nPrecision {precision_score(y_test, predictions)*100:.2f}% '
            f'\nRecall {recall_score(y_test, predictions)*100:.2f}% '
            f'\nF1 {f1_score(y_test, predictions)*100:.2f}% ')

In [70]:
for c, kernel, gamma, degree in params:
    # degree only affects the 'poly' kernel; the other kernels ignore it
    clf = SVC(C=c, kernel=kernel, gamma=gamma, degree=degree)
    clf.fit(X_train, y_train)
    print(f'\nSVM trained on parameters: C={c}, kernel={kernel}, gamma={gamma}, degree={degree}.')
    print('The model achieved the following metrics:')
    print_metrics(y_test, clf.predict(X_test))


SVM trained on parameters: C=0.1, kernel=poly, gamma=0.2, degree=2.
The model achieved the following metrics:
Accuracy 95.38%
Precision 92.36% 
Recall 88.96% 
F1 90.62% 

SVM trained on parameters: C=10, kernel=poly, gamma=6, degree=5.
The model achieved the following metrics:
Accuracy 94.77%
Precision 89.57% 
Recall 89.57% 
F1 89.57% 

SVM trained on parameters: C=0.1, kernel=rbf, gamma=0.3, degree=0.
The model achieved the following metrics:
Accuracy 74.92%
Precision 0.00% 
Recall 0.00% 
F1 0.00% 

SVM trained on parameters: C=10, kernel=rbf, gamma=5, degree=0.
The model achieved the following metrics:
Accuracy 76.15%
Precision 100.00% 
Recall 4.91% 
F1 9.36% 

SVM trained on parameters: C=0.1, kernel=sigmoid, gamma=0.5, degree=0.
The model achieved the following metrics:
Accuracy 76.77%
Precision 53.37% 
Recall 58.28% 
F1 55.72% 

SVM trained on parameters: C=10, kernel=sigmoid, gamma=2, degree=0.
The model achieved the following metrics:
Accuracy 81.23%
Precision 62.73% 
Recall 61

#### TODO  
Make this a notebook of different under-sampling, over-sampling techniques.