# SVM - Banknote Dataset

This task will look at performing classification on a dataset known as the *Banknote Authentication Dataset*.

The dataset consists of 1372 samples, each with the following 5 attributes (4 input features and the class label):
1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. kurtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)

The output (class) is either a 0 (genuine note), or a 1 (forged note). The task is therefore a binary classification task.

More information on the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

In [None]:
import csv
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from random import shuffle

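def read_banknote_file(filename="datasets/data_banknote_authentication.csv"):
    """Read the banknote CSV and return (features, labels)."""
    x = []
    y = []

    with open(filename, newline="") as csv_file:
        csv_reader = csv.reader(csv_file)
        for row in csv_reader:
            x.append(list(map(float, row[:-1])))  # first 4 columns: features
            # Last column: class label. Kept as a plain int so y is a flat list,
            # which is the 1-D shape scikit-learn expects for labels.
            y.append(int(row[-1]))

    return x, y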

x, y = read_banknote_file()
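
As an alternative to the row-by-row loop above, a file with the same layout can also be loaded with NumPy. The sketch below is not part of the original notebook: it uses a few made-up rows in place of the real file, and it assumes the data is comma-separated with no header row. The names `X_demo` and `y_demo` are illustrative only.

```python
import io
import numpy as np

# Made-up rows standing in for the real CSV (4 features, then the class label).
sample_csv = io.StringIO(
    "3.6,8.7,-2.8,-0.4,0\n"
    "4.5,8.2,-2.5,-1.5,0\n"
    "-1.4,-4.9,6.5,0.3,1\n"
    "-2.5,-6.1,8.8,-0.9,1\n"
)

data = np.loadtxt(sample_csv, delimiter=",")
X_demo = data[:, :-1]             # all columns except the last: features
y_demo = data[:, -1].astype(int)  # last column: class label as a flat 1-D array

print(X_demo.shape, y_demo.shape)
```

To read the real file, the `io.StringIO` object would be replaced by the path passed to `read_banknote_file`.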

For this task we will split the dataset into a training dataset and a test dataset, using an 80/20 ratio. Before we do this, we will shuffle the data to randomize the order of the samples.

In [None]:
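def shuffle_data(x, y):
    # Shuffle features and labels together so each sample keeps its label
    combined = list(zip(x, y))
    shuffle(combined)
    x_shuffled, y_shuffled = zip(*combined)
    return list(x_shuffled), list(y_shuffled)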

def split_data(x, y, train_ratio=0.8):
    pivot = int(train_ratio * len(x))
    return x[:pivot], x[pivot:], y[:pivot], y[pivot:]

x, y = shuffle_data(x, y)

x_train, x_test, y_train, y_test = split_data(x, y)    
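
The manual shuffle and split above can also be done in one call with scikit-learn's `train_test_split`. The following is a minimal sketch rather than the notebook's own method: it uses synthetic data from `make_classification` as a stand-in for the banknote samples, and the `stratify` argument (an addition not used above) keeps the class ratio the same in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the banknote data: 1372 samples, 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=1372, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Shuffling is on by default; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` gives the same split on every run, which the `random.shuffle` approach does not do unless the random seed is set first.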

Build and train our classifier **on the train data**:

In [None]:
svm = SVC(C=0.1, kernel='rbf', gamma=0.001)

svm.fit(x_train, y_train)
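
The `gamma` argument sets the width of the RBF kernel, which scikit-learn defines as K(a, b) = exp(-gamma * ||a - b||^2). As a rough illustration, the sketch below computes this by hand for two made-up points (not taken from the dataset) and compares the result with `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up 4-dimensional points (not taken from the dataset).
a = np.array([[1.0, 0.0, 2.0, -1.0]])
b = np.array([[0.0, 1.0, 0.0, 1.0]])

gamma = 0.001  # same value as passed to SVC above

# Manual computation: exp(-gamma * squared Euclidean distance)
sq_dist = np.sum((a - b) ** 2)
manual = np.exp(-gamma * sq_dist)

# Library computation for comparison
library = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual, library)
```

With a `gamma` this small, even points that are some distance apart have a kernel value close to 1, which tends to produce a very smooth decision boundary.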

Now we will test the trained model on the **training** dataset AND the **test** dataset.

Training dataset:

In [None]:
accuracy = svm.score(x_train, y_train)
print(f'Training accuracy: {accuracy * 100:.2f}%')

Test dataset:

In [None]:
accuracy = svm.score(x_test, y_test)
print(f'Test accuracy: {accuracy * 100:.2f}%')

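Testing on the training dataset is generally considered bad practice, as the model has already seen those samples during training and the score says little about how well it generalizes. As a result, the accuracy on the training dataset is almost always at least as high as the accuracy on the test dataset, so the test accuracy is the figure to rely on.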
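
The gap between training and test accuracy can be made visible with a deliberately overfitted model. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the banknote dataset, and the values `C=1000` and `gamma=50` are chosen purely to force the model to memorize the training samples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# A very large gamma makes the RBF kernel so narrow that the model
# effectively memorizes the individual training points.
overfit_model = SVC(C=1000, kernel="rbf", gamma=50)
overfit_model.fit(X_tr, y_tr)

train_acc = overfit_model.score(X_tr, y_tr)
test_acc = overfit_model.score(X_te, y_te)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```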

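Finally, we use a grid search with cross-validation to look for the combination of `kernel`, `C` and `gamma` that gives the best accuracy.

In [None]: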
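from sklearn.model_selection import GridSearchCV

# Hyperparameter values to try; every combination is evaluated
# (gamma has no effect on the linear kernel)
parameters = {'kernel': ('linear', 'rbf'),
              'C': [0.0001, 0.001, 0.01, 0.1, 1],
              'gamma': [0.0001, 0.001, 0.01, 0.1, 1]}

# Use the SVC class imported earlier, so the trained model stored in `svm`
# is not overwritten by importing the sklearn.svm module under the same name
clf = GridSearchCV(SVC(), parameters)

clf.fit(x, y)  # cross-validate every configuration

print("Best parameters:", clf.best_params_)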
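
One possible follow-up, sketched below under stated assumptions, is to run the grid search on the training split only and then score the selected model on the held-out test split, so that the test samples play no part in choosing the hyperparameters. The sketch uses synthetic data and a smaller grid than the one above to keep it quick; `search`, `param_grid` and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# Reduced grid (an assumption made for speed, not the notebook's grid).
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.01, 0.1, 1],
    "gamma": [0.01, 0.1, 1],
}

# 5-fold cross-validation on the training split only.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
# score() uses the best model, refitted on the whole training split.
print("Held-out test accuracy:", search.score(X_te, y_te))
```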