# Performance Analysis

We use the Hamming distance to analyse our models. The Hamming distance between two binary arrays $x$ and $y$ of length $n$ is the number of positions at which the corresponding symbols differ:
$$H(x,y) = \sum\limits_{i=1}^n \mathbb{1}(x_i \neq y_i) $$
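
As a small illustration of the definition, the sketch below counts mismatches between two made-up binary arrays using NumPy. The arrays `x` and `y` are invented for this example and are not taken from our data.

```python
import numpy as np

# Two hypothetical binary label arrays (illustrative values only)
x = np.array([0, 1, 1, 0, 1, 0])
y = np.array([0, 1, 0, 0, 1, 1])

# Hamming distance: number of positions where the arrays differ
hamming_count = int(np.count_nonzero(x != y))

# Dividing by the length gives the fraction of mismatches,
# which is the quantity scipy.spatial.distance.hamming returns
hamming_fraction = hamming_count / len(x)

print(hamming_count)     # 2 (positions 2 and 5 differ)
print(hamming_fraction)
```

The distinction between the count and the fraction matters in the cells below, where the fraction is multiplied by the number of test points to recover the count.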

We chose this metric because it can be applied uniformly across the models and relates directly to accuracy: dividing the Hamming distance by the number of test points gives the misclassification rate, and accuracy is one minus that rate. Some of the models predicted the attack type, whereas others predicted normal versus non-normal behaviour, so we needed a standardised test that works for both types of model.
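
One way to put both kinds of model on the same footing is to collapse attack-type predictions into a binary normal versus non-normal label before computing the distance. The sketch below assumes the label for benign traffic is the string `'normal'` and uses invented attack names; neither is taken from our data files, and `to_binary` is a hypothetical helper.

```python
import numpy as np

def to_binary(labels, normal_label="normal"):
    """Map each label to 0 if it equals normal_label, else 1 (hypothetical helper)."""
    return (np.asarray(labels) != normal_label).astype(int)

# Invented multi-class predictions and ground truth, for illustration only
pred_types = ["normal", "dos", "probe", "normal", "dos"]
true_types = ["normal", "probe", "probe", "dos", "dos"]

pred_bin = to_binary(pred_types)
true_bin = to_binary(true_types)

# Hamming distance on the binary labels: only normal/non-normal mistakes count
binary_hamming = int(np.count_nonzero(pred_bin != true_bin))
print(pred_bin, true_bin, binary_hamming)
```

Note that after collapsing, predicting one attack type for a different attack type no longer counts as an error, so this comparison measures detection of non-normal behaviour rather than attack classification.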

In [3]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy as sp
from scipy.spatial import distance

## Matt

In [4]:
pred_matt = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/test_predictions_matt.csv')
test_labels = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/test_labels.csv')

pred_matt = np.array(pred_matt['0'])
test_labels = np.array(test_labels['label'])

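# distance.hamming returns the fraction of positions that differ
hamm_dist_matt = distance.hamming(pred_matt,test_labels)
accuracy_matt = 1 - hamm_dist_matt
# round before casting so floating-point error cannot truncate the count
hamm_dist_matt = int(round(hamm_dist_matt * len(test_labels)))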

pd.DataFrame([[hamm_dist_matt, accuracy_matt]], columns = ['Hamming Distance', 'Accuracy'], index = ['Matt'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Matt | 3 | 0.999939 |
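
The count of mismatches can also be computed directly, which avoids converting the normalised distance back into an integer. The following is a minimal sketch; `hamming_summary` is a hypothetical helper name and the arrays are invented stand-ins for the prediction and label columns.

```python
import numpy as np

def hamming_summary(pred, truth):
    """Return (Hamming distance as a count, accuracy) for two equal-length arrays."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    if pred.shape != truth.shape:
        raise ValueError("pred and truth must have the same shape")
    count = int(np.count_nonzero(pred != truth))
    accuracy = 1 - count / pred.size
    return count, accuracy

# Invented data: 10 labels with one wrong prediction
truth = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred = truth.copy()
pred[4] = 1

count, accuracy = hamming_summary(pred, truth)
print(count, accuracy)
```

Working with the integer count first means the accuracy is derived from an exact value rather than the other way round.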


## Alex

In [9]:
''' R code to get the Hamming distance for Alex's model's result'''
## library(class)
## library(caret)

# pr1 <- knn(KTT_train,KTT_test,cl=KTT_target_category,k=1, use.all=FALSE)
## [...]
## [...]
# pr37 <- knn(KTT_train,KTT_test,cl=KTT_target_category,k=1, use.all=FALSE)

# h<- vector(length=37)
# h[1]<-hamming.distance(as.vector(pr1), kt$Behaviour[1:1333])
## [...]
## [...]
# h[37]<-hamming.distance(as.vector(pr37), kt$Behaviour[46790:48122])

# m <- sum(h) ## The total Hamming distance

# Convert each batch's Hamming distance to an accuracy: (1333 - h) / 1333
# h <- -h
# h <- h+1333
# h <- h/1333
# summary(h) ## mean = 98.2% accuracy

pred_alex_df = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/KNN-Performance.csv')

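hamm_dist_alex = pred_alex_df.iat[0,0]
# the CSV stores accuracy as a percentage string such as '98.20%'; convert it to a fraction to match the other models
accuracy_alex = float(str(pred_alex_df.iat[0,1]).rstrip('%')) / 100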

pred_alex_df

| | Hamming Distance | Accuracy |
|---|---|---|
| 0 | 886 | 98.20% |
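
The R snippet above converts each batch's Hamming distance `h` into an accuracy via `(1333 - h) / 1333`. A Python sketch of the same arithmetic follows; the batch size of 1333 comes from the R code, while the per-batch distances here are invented placeholders, not Alex's results.

```python
import numpy as np

BATCH_SIZE = 1333  # batch length used in the R code

# Hypothetical Hamming distances for three batches (placeholders, not real results)
h = np.array([20, 31, 12])

total_hamming = int(h.sum())                    # corresponds to m <- sum(h)
batch_accuracy = (BATCH_SIZE - h) / BATCH_SIZE  # corresponds to the three h <- ... steps
mean_accuracy = batch_accuracy.mean()           # the mean that summary(h) reports

print(total_hamming)
print(batch_accuracy, mean_accuracy)
```

Because every batch has the same size, the mean of the batch accuracies equals one minus the total Hamming distance divided by the total number of test points. If all 37 batches contain 1333 points, as the R code suggests, a total distance of 886 gives 1 - 886/49321 ≈ 0.9820, which agrees with the reported 98.20%.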


## Luke

## Gabriel

I trained two models: one using basic logistic regression and the other tuned with `GridSearchCV` and cross-validation. Although the grid-search version took almost 80 times as long to train as the basic model, the gain in accuracy was small: the Hamming distances of the two models differ by only three misclassified test points (73 versus 76).

In [35]:
pred_gabe_grid = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/grid_y_pred_gabe.csv')
pred_gabe_basic = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/reg_y_pred_gabe.csv')
test_labels = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/y_test_gabe.csv')

pred_gabe_grid = np.array(pred_gabe_grid['0'])
pred_gabe_basic = np.array(pred_gabe_basic['0'])
test_labels = np.array(test_labels['0'])

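# grid-search model: fraction of mismatches, accuracy, then mismatch count
hamm_dist_gabe = distance.hamming(pred_gabe_grid,test_labels)
accuracy_gabe = 1 - hamm_dist_gabe
# round before casting so floating-point error cannot truncate the count
hamm_dist_gabe = int(round(hamm_dist_gabe * len(test_labels)))

# basic logistic regression model
hamm_dist_gabe2 = distance.hamming(pred_gabe_basic,test_labels)
accuracy_gabe2 = 1 - hamm_dist_gabe2
hamm_dist_gabe2 = int(round(hamm_dist_gabe2 * len(test_labels)))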

pd.DataFrame([[hamm_dist_gabe, accuracy_gabe],[hamm_dist_gabe2, accuracy_gabe2]], columns = ['Hamming Distance', 'Accuracy'], index = ['Gabe Grid','Gabe Basic'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Gabe Grid | 73 | 0.998522 |
| Gabe Basic | 76 | 0.998462 |


## Comparison

In [None]:
# hamm_dist_luke and accuracy_luke must be defined in the Luke section before this cell can run
pd.DataFrame(
    [[hamm_dist_matt, accuracy_matt],
     [hamm_dist_alex, accuracy_alex],
     [hamm_dist_luke, accuracy_luke],
     [hamm_dist_gabe, accuracy_gabe]],
    columns = ['Hamming Distance', 'Accuracy'],
    index = ['Matt', 'Alex', 'Luke', 'Gabriel'])
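
Since the comparison cell has not been run, the sketch below shows one way to assemble the table from the values printed in the sections above. Luke's results are not available in this notebook, so his row is filled with `NaN` as a placeholder rather than a real result; the dictionary `results` is a hypothetical name, and Gabriel's row uses the grid-search model, as in the cell above.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs shown earlier; Luke's entry is a placeholder
results = {
    "Matt": (3, 0.999939),
    "Alex": (886, 0.9820),
    "Luke": (np.nan, np.nan),
    "Gabriel": (73, 0.998522),
}

comparison = pd.DataFrame.from_dict(
    results, orient="index", columns=["Hamming Distance", "Accuracy"]
)

# Rank the models with available results by accuracy, best first
ranked = comparison.dropna().sort_values("Accuracy", ascending=False)
print(ranked)
```

The models were evaluated on test sets of different sizes, so accuracy is the fairer column to rank by; the raw Hamming distances are only directly comparable between models scored on the same test set.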