## Assignment 8

This assignment uses the Credit Card Fraud Detection dataset, which can be found on [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

In this workbook, transactions are classified as fraudulent or non-fraudulent using support vector machines (SVMs).  
Techniques for handling imbalanced data are also implemented.

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
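
The imbalance can be checked directly from the two counts quoted above. The snippet below is a minimal sketch that uses only those figures; the variable names are illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description
n_transactions = 284_807
n_frauds = 492

# Share of fraudulent transactions and number of legitimate rows per fraud
fraud_share = n_frauds / n_transactions
legit_per_fraud = (n_transactions - n_frauds) / n_frauds

print(f"Fraud share: {fraud_share:.4%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

Once the data is loaded, the same proportions can be read off with `data['Class'].value_counts(normalize=True)`.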

It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount, which can be used, for example, for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

In [4]:
# if needed install packages uncommenting and executing the following commands
# !pip install imbalanced-learn

In [40]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from imblearn.under_sampling import TomekLinks
import os 

os.chdir('C:/Users/anast/OneDrive/Desktop/MSc/MachineLearning/Assignments/Asgmt8_SVM/')

In [7]:
data_file = 'creditcard.csv'

data = pd.read_csv(data_file)

In [11]:
data.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

**Scaling**  
The 'Time' and 'Amount' variables need to be scaled. The remaining variables (the principal components) are already scaled.


In [21]:
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['Time'] = scaler.fit_transform(data['Time'].values.reshape(-1,1))
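
As a rough illustration of what the scaling step does, the sketch below standardises both columns in a single call on a small made-up frame (the `toy` values are invented for the example). Fitting both columns at once keeps the mean and variance of each column in one fitted scaler, whereas reusing one scaler for two separate `fit_transform` calls leaves it holding only the statistics of the last column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real 'Time' and 'Amount' columns
toy = pd.DataFrame({
    "Time": [0.0, 60.0, 120.0, 180.0],
    "Amount": [1.0, 10.0, 100.0, 1000.0],
})

# Standardise both columns in one call: each column is centred and
# divided by its own (population) standard deviation
scaler = StandardScaler()
toy[["Time", "Amount"]] = scaler.fit_transform(toy[["Time", "Amount"]])

print(toy.mean().round(6))        # column means after scaling
print(toy.std(ddof=0).round(6))   # column standard deviations after scaling
```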

In [None]:
X = data.drop(columns='Class')
y = data['Class']

The target variable is extremely imbalanced: only 0.17% of the transactions in the dataset are fraudulent. This can lead a model to favour the non-fraudulent class and fail to recognise fraud. There are different ways to handle this issue.  
Here I will experiment with:
* Undersampling
* Oversampling

Some nice guides can be found on [DataCamp](https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data?utm_source=adwords_ppc&utm_campaignid=898687156&utm_adgroupid=48947256715&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034349&utm_targetid=aud-390929969673:dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061579&gclid=CjwKCAiAq8f-BRBtEiwAGr3Dgc65y799jXfSyX1UAugegeLHUDk7lb6izpB-coR1udmOQvHoN76s2xoCpg8QAvD_BwE), [KDnuggets](https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html) and [Machine Learning Mastery](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/).
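
Random oversampling, the second option listed above, mirrors undersampling: minority rows are drawn with replacement until the classes are the desired size. The following is a minimal sketch on a made-up frame, assuming the same `resample` helper from scikit-learn; the `toy` data and variable names are illustrative only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 majority rows (Class 0) and 2 minority rows (Class 1)
toy = pd.DataFrame({"Amount": range(10), "Class": [0] * 8 + [1] * 2})
majority = toy[toy["Class"] == 0]
minority = toy[toy["Class"] == 1]

# Draw minority rows with replacement until they match the majority count
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=909
)

toy_oversampled = pd.concat([majority, minority_upsampled])
print(toy_oversampled["Class"].value_counts())
```

Because the minority rows are duplicated rather than newly generated, oversampling is normally applied to the training split only, so that copies of the same row do not end up in both training and test data.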

**Random under-sampling**  
In random under-sampling, only a subset of the majority class examples is retained, while all observations of the minority class are kept.

**Pros**: Improves the runtime of the model by reducing the number of training samples, which helps when the training set is very large.   
**Cons**: There is a high risk of information loss, as only a small subset of the majority class training examples is used.


In [38]:
data_fraud = data[data['Class']==1]
data_no_fraud = data[data['Class']==0]

fraud_count = data_fraud['Class'].count()

# undersample majority class: keep 1.75 non-fraud rows per fraud row, sampled without replacement
data_no_fraud = resample(data_no_fraud, replace=False, n_samples=int(fraud_count*1.75), random_state=909)

data_undersampled = pd.concat([data_fraud, data_no_fraud])
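
To see what the 1.75 factor implies, the arithmetic below works out the size and class balance of the undersampled set from the counts given in the dataset description. It is only a sketch; the variable names are illustrative.

```python
# Counts from the dataset description
n_transactions = 284_807
fraud_count = 492

# Majority rows kept by the undersampling step (same formula as in the cell above)
n_majority_kept = int(fraud_count * 1.75)
n_majority_dropped = (n_transactions - fraud_count) - n_majority_kept

total_rows = fraud_count + n_majority_kept
fraud_share = fraud_count / total_rows

print(n_majority_kept, total_rows, n_majority_dropped)
print(f"Fraud share after undersampling: {fraud_share:.1%}")
```

The number of dropped rows makes the information-loss drawback concrete: almost all non-fraudulent transactions are discarded.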

Undersampling can also be achieved using [Tomek links](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.TomekLinks.html). A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbours.
The algorithm removes the majority class example from each Tomek link, which can give a classifier a cleaner decision boundary.
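
A Tomek link can be found by looking for two points with opposite labels that are mutual nearest neighbours. The sketch below illustrates that idea on a tiny one-dimensional example with scikit-learn's `NearestNeighbors`; it is not the imbalanced-learn implementation, and the `X_toy` and `y_toy` arrays are invented for the illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: one minority point (label 1) sitting next to a majority point
X_toy = np.array([[0.0], [1.2], [2.0], [2.4], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 0, 0])

# n_neighbors=2 because each point's closest neighbour is the point itself
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)
nearest = nn.kneighbors(X_toy, return_distance=False)[:, 1]

# A Tomek link: opposite labels and each point is the other's nearest neighbour
tomek_pairs = [
    (i, int(j))
    for i, j in enumerate(nearest)
    if y_toy[i] != y_toy[j] and nearest[j] == i and i < j
]
print(tomek_pairs)

# Drop the majority-class (label 0) member of every link
to_drop = [idx for pair in tomek_pairs for idx in pair if y_toy[idx] == 0]
X_cleaned = np.delete(X_toy, to_drop, axis=0)
y_cleaned = np.delete(y_toy, to_drop)
print(X_cleaned.ravel(), y_cleaned)
```

Only the majority point at 2.0 is removed here; the points at 5.0 and 6.0 are also mutual nearest neighbours, but they share a label and therefore do not form a Tomek link.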

In [44]:
tl = TomekLinks(sampling_strategy='majority')

X_tl, y_tl = tl.fit_resample(X,y)
# compare the number of rows before and after removing Tomek links
print(X.shape, X_tl.shape)

In [43]:
TomekLinks()

TomekLinks()