# MAP583 - Data Camp
# Course project
# Credit Card Fraud Detection

We chose this dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud/data). It contains credit card transaction data, and the objective is to predict which transactions are fraudulent.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

%matplotlib inline
# %matplotlib notebook
style.use('ggplot')

## Loading dataset

In [None]:
data = pd.read_csv("../data/creditcard.csv")
data.head()

In [None]:
# Check if there is null data
# data.isnull().sum()

In [None]:
# data.describe()

## Removing 'Time' column and normalizing (scaling) data

In [None]:
data.drop(['Time'], axis=1, inplace=True)

In [None]:
labels = data['Class']

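The following remapping is necessary because of the label convention used by the SVDD library: points of the target class (non-fraud) are labeled +1 and outliers (fraud) are labeled -1.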

In [None]:
labels_svm = labels.copy()
labels_svm[labels == 1] = -1 # fraud
labels_svm[labels == 0] = 1 # non-fraud
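The same remapping can also be written as a single `map` call. The sketch below uses a small made-up Series rather than the real `Class` column, so the values are purely illustrative.

```python
import pandas as pd

# Toy labels in the Kaggle convention: 1 = fraud, 0 = non-fraud (made-up values).
labels = pd.Series([0, 0, 1, 0, 1], name="Class")

# One-class convention: +1 = non-fraud (inlier), -1 = fraud (outlier).
labels_svm = labels.map({0: 1, 1: -1})

print(labels_svm.tolist())
```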

In [None]:
from sklearn.preprocessing import StandardScaler

scaled_features = StandardScaler().fit_transform(data.values)
scaled_data = pd.DataFrame(scaled_features,
                           index=data.index,
                           columns=data.columns)

In [None]:
scaled_data.drop(['Class'], axis=1, inplace=True)
scaled_data.describe()
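An equivalent way to get the same result is to drop `Class` before scaling, so the label column is never standardized in the first place. This is a minimal sketch on a made-up four-row frame; the column names `V1` and `Amount` mirror the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the real data.
data = pd.DataFrame({
    "V1": [1.0, 2.0, 3.0, 4.0],
    "Amount": [10.0, 200.0, 35.0, 5.0],
    "Class": [0, 0, 1, 0],
})

# Separate the features from the label, then scale the features only.
features = data.drop(columns=["Class"])
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      index=features.index,
                      columns=features.columns)

# Each scaled column has (numerically) zero mean and unit variance.
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```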

## Check target class

In [None]:
class_counts = labels_svm.value_counts()
print(class_counts)

# Plot the class counts as a bar chart (log scale on the y-axis)
class_counts.plot(kind='bar')
plt.title("Fraud distribution")
plt.xlabel("Class")
plt.ylabel("Frequency (log)")
plt.yscale('log')

In [None]:
print('Baseline: {:.3f}%'.format(len(labels_svm[labels_svm == 1]) / len(labels_svm) * 100))

A classifier that labels every transaction as non-fraud already reaches 99.827% accuracy, so any model that scores below this baseline isn't doing very well.
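To make the baseline concrete, the sketch below scores a trivial "always non-fraud" classifier on made-up labels (2 frauds among 1000 transactions, not the real class ratio). It reaches high accuracy while detecting no fraud at all, which is why accuracy alone can be misleading here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 1 = non-fraud, -1 = fraud, with 2 frauds among 1000 transactions.
y_true = np.ones(1000, dtype=int)
y_true[:2] = -1

# Trivial classifier: predict "non-fraud" for every transaction.
y_pred = np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Recall on the fraud class: fraction of frauds that were actually flagged.
fraud_recall = recall_score(y_true, y_pred, pos_label=-1)

print(accuracy, fraud_recall)
```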

## Splitting data

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2  # "Pareto rule", 80/20
X_train, X_test, y_train, y_test = train_test_split(scaled_data,
                                                    labels_svm,
                                                    test_size=test_size)
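With so few frauds, a purely random split can leave the test set with a different fraud ratio than the training set. One possible variation (an assumption, not what the cell above does) is to pass `stratify` and `random_state` to `train_test_split`; the sketch uses synthetic data with a 1% fraud rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 points, 10 of them labeled as fraud (-1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.ones(1000, dtype=int)
y[:10] = -1

# stratify=y keeps the class ratio identical in both splits;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print((y_train == -1).sum(), (y_test == -1).sum())
```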

## We will use only non-fraud points to train SVDD
The library implements only plain SVDD. It has no implementation of SVDD-neg, the variant that also incorporates negative examples, so the fraud points cannot be used during training.

LIBSVM:

https://github.com/cjlin1/libsvm

https://github.com/cjlin1/libsvm/tree/master/python # Python bindings

https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html

https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf # article

https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf # guide

SVDD:

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_svdd_and_finding_the_smallest_sphere_containing_all_data

One-class SVM:

http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html

Other people have asked the same question, but apparently no one has SVDD-neg implemented:
https://www.reddit.com/r/MachineLearning/comments/396o0n/experience_training_support_vector_data/

MATLAB library (related to the creator of SVDD):

https://www.tudelft.nl/ewi/over-de-faculteit/afdelingen/intelligent-systems/pattern-recognition-bioinformatics/pattern-recognition-laboratory/data-and-software/dd-tools/

In [None]:
non_fraud_X_train = X_train[y_train==1].values.tolist()
non_fraud_y_train = y_train[y_train==1].values.tolist()
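For comparison, scikit-learn's `OneClassSVM` follows the same pattern of fitting on inliers only and predicting +1 or -1. The sketch below is not the SVDD model used in this notebook; it uses synthetic 2-D data, and the `nu` and `gamma` values are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic data: one cluster of inliers and a few points far away from it.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(500, 2))
outliers = rng.normal(8.0, 0.5, size=(5, 2))

# Fit on inliers only, as is done with the non-fraud training points.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(inliers)

pred_in = model.predict(inliers)    # mostly +1
pred_out = model.predict(outliers)  # -1 for points far from the training data

print((pred_in == 1).mean(), pred_out.tolist())
```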

In [None]:
from svm import *
from svmutil import *

In [None]:
problem = svm_problem(non_fraud_y_train,
                      non_fraud_X_train,
                      isKernel=False) # set to True if precomputed Kernel

In [None]:
# {'C': 6.325283529810813e-06, 'kernel': {'coef0': 1.9690049850021658, 'gamma': 0.9417836463715797, 'type': 3}}
param = svm_parameter()
param.svm_type = 5  # SVDD (svm_type added by the LIBSVM SVDD extension)
param.kernel_type = 3  # sigmoid kernel: tanh(gamma * u'v + coef0)
param.degree = 2  # only used by the polynomial kernel; ignored here
param.gamma = 0.9417836463715797
param.C = 6.325283529810813e-06
param.coef0 = 1.9690049850021658
param.eps = 0.001  # tolerance of the stopping criterion
param.cross_validation = False
param.nr_fold = 0
model = svm_train(problem, param)
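In LIBSVM, `kernel_type = 3` selects the sigmoid kernel, `tanh(gamma * u'v + coef0)`. The sketch below evaluates that formula directly in NumPy with the `gamma` and `coef0` values from the parameter cell; the helper name `sigmoid_kernel` and the two feature vectors are made up for illustration.

```python
import numpy as np

# Values copied from the parameter cell.
gamma = 0.9417836463715797
coef0 = 1.9690049850021658


def sigmoid_kernel(u, v, gamma=gamma, coef0=coef0):
    """Sigmoid kernel as defined by LIBSVM: tanh(gamma * <u, v> + coef0)."""
    return np.tanh(gamma * np.dot(u, v) + coef0)


# Made-up feature vectors.
u = np.array([0.5, -1.0, 0.2])
v = np.array([0.1, 0.3, -0.4])

k_uv = sigmoid_kernel(u, v)
# With a zero vector the inner product vanishes, leaving tanh(coef0).
k_zero = sigmoid_kernel(np.zeros(3), v)

print(k_uv, k_zero)
```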

In [None]:
y_test = y_test.values.tolist()
X_test = X_test.values.tolist()

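Since the fraud (negative) examples in the training split are not used for training, we add them to the test set. This may be questionable, because it raises the proportion of frauds in the test set above that of the original data.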
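A minimal sketch of that step, assuming the splits are still pandas objects (in this notebook `X_test` and `y_test` have already been converted to lists, where concatenating with `+` would be the equivalent). The tiny frames and the `_aug` names are made up for illustration.

```python
import pandas as pd

# Made-up splits: +1 = non-fraud, -1 = fraud.
X_train = pd.DataFrame({"V1": [0.1, 0.2, 5.0, 0.3]})
y_train = pd.Series([1, 1, -1, 1])
X_test = pd.DataFrame({"V1": [0.0, 4.0]})
y_test = pd.Series([1, -1])

# Select the frauds that were left out of training...
fraud_mask = y_train == -1

# ...and append them to the test set.
X_test_aug = pd.concat([X_test, X_train[fraud_mask]], ignore_index=True)
y_test_aug = pd.concat([y_test, y_train[fraud_mask]], ignore_index=True)

print(y_test_aug.tolist())
```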