![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 5: One-Class-SVM.

#####################################################################################

Double-click to write down your name and surname.

**Name:**
Alexander

**Surname:**
Kruskal

**Honour Pledge** <p>
    
    
Declaration: <p>
    
    
I declare that this assessment item is my own work, except where acknowledged, and has not been submitted for academic credit elsewhere or previously, or produced independently of this course (e.g. for a third party such as your place of employment) and acknowledge that the assessor of this item may, for the purpose of assessing this item: 

    a. Reproduce this assessment item and provide a copy to another member of the University; and/or 
    b. Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking). 

#####################################################################################

# Anomaly detection via One-Class-SVM using the breast cancer data set

Suppose that we have a data set in which one class (e.g. malignant) is so underrepresented that there isn't enough data to fit a two-class classifier. Imagine that this is a very rare type of cancer and we do not have many cancer cases (malignant tumours) to obtain samples, and therefore, we do not have enough samples of malignant tumours <font color=red>**to train**</font> our algorithm. 

Thus, imagine that we have no images of malignant tumours available to train our model, only benign cases.


In **anomaly detection** one approach to solve our problem is as follows: 
1. Learn the distribution of the normal cases, benign cases in this example. Thus, we train our model with benign cases ONLY.
2. For newly incoming data, compare how well these fit with the learnt model. The newly incoming data will contain both benign and malignant cases.
3. As the model has been trained to identify ONLY benign cases, everything that is not benign in the final model will be classified as "not benign", thus malignant. Therefore, we have to test our model with benign and malignant cases in order to assess its performance.

The result is an algorithm which proposes the tumours which are least likely to be normal. If datasets are large but anomalies are very few, this could save the doctor a lot of time. 

Other applications of this approach could be
* credit card fraud detection: for each fraudulent transaction, there are thousands of valid transactions
* directing the attention of Department of Health/AIHW/your company case officers browsing suspicious matter reports
* ...

This is our first example of unsupervised learning!

We will use [`OneClassSVM()`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) from sklearn. 
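
The three steps listed earlier can be sketched on synthetic data. The snippet below trains a `OneClassSVM` on "normal" points only, predicts on a mix of normal points and anomalies, and then ranks the new points by `decision_function`, where a lower score means "less like the training data". The two-dimensional Gaussian data, the parameter values and all variable names are assumptions made for illustration; they are not the breast cancer data or the settings used in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Step 1: learn the "normal" distribution from normal cases only (synthetic data).
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
model = OneClassSVM(nu=0.05, gamma=0.1)  # illustrative values, not tuned
model.fit(normal_train)

# Step 2: new incoming data contains both normal points and anomalies.
normal_new = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
anomalies = rng.normal(loc=8.0, scale=1.0, size=(5, 2))
new_data = np.vstack([normal_new, anomalies])  # anomalies are rows 50..54

# Step 3: predict returns +1 for "looks normal" and -1 for "does not".
labels = model.predict(new_data)

# Rank new points from most to least anomalous (lowest score first).
scores = model.decision_function(new_data).ravel()
most_anomalous = np.argsort(scores)[:5]

print("points flagged as -1:", int((labels == -1).sum()))
print("five most anomalous rows:", sorted(most_anomalous.tolist()))
```

In a screening setting, the ranked list is what would be handed to the doctor: the cases at the top are the ones least likely to be normal.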

In [1]:
import warnings; warnings.simplefilter('ignore')

### <font color='blue'> Question 1 (25 marks): </font>

1. <font color='blue'> Extract all the predictors for the benign cases into a matrix `X_B` and for the malignant cases into a matrix `X_M`. Make sure the output, y_B=1 and y_M=-1.
2. <font color='blue'> Split the benign cases into a 70% training and 30% testing set. 
3. <font color='blue'> Remember that we will have only one class to train our model, just benign cases. 
4. The way the final model works is like this: because the model has been trained to identify ONLY benign cases, everything that is not benign in the final model, will be classified as "not benign", thus malignant. You can write it as X_B_train, X_B_test, y_train, y_test= train_test_split(...). 

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm               import OneClassSVM
from sklearn.metrics           import confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing     import StandardScaler

In [3]:
def outcome_split(data_frame, column_names):
    """
    Split a dataframe into features and outcome.
    
    data_frame: pandas DataFrame
    column_names: list of column names; the first is the outcome column,
                  the remaining names are simply dropped from the features
    
    return: (data_frame without column_names, outcome values as a NumPy array)
    """
    
    outcome = data_frame[column_names[0]].values
    data_frame = data_frame.drop(columns=column_names)
    return (data_frame, outcome)

In [4]:
# get data
cancer = pd.read_csv('data/breast-cancer-wisconsin-data/data.csv', sep=',')

#Change diagnosis to 1 where B and -1 where M
cancer['diagnosis'] = cancer['diagnosis'].apply(lambda x: 1 if x == 'B' else -1)

#split by diagnosis
X_B = cancer[cancer.diagnosis == 1]
X_M = cancer[cancer.diagnosis == -1]

#split outcome from features and remove diagnosis and id
X_B, y_B = outcome_split(X_B,['diagnosis', 'id'])
X_M,y_M = outcome_split(X_M,['diagnosis', 'id'])



In [6]:
#split data
from sklearn.model_selection import train_test_split
X_B_train, X_B_test, y_train, y_test = train_test_split(X_B, y_B.ravel(), stratify=y_B, random_state=0, test_size = 0.30)

### <font color='blue'> Question 2: Fit the model. No need to tune the parameters. Set nu to 0.01 and gamma to 0.005. (25 marks)</font>

In [7]:
#scale data
scaler = StandardScaler()
scaler.fit(X_B_train)
X_B_train_scaled = scaler.transform(X_B_train)

#One-class SVM (y_train is ignored by OneClassSVM.fit; it is accepted only for API consistency)
one_class_svm_cancer = OneClassSVM(nu=0.01, gamma=0.005)
one_class_svm_cancer.fit(X_B_train_scaled, y_train)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=0.005, kernel='rbf',
      max_iter=-1, nu=0.01, random_state=None, shrinking=True, tol=0.001,
      verbose=False)
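
To get a feel for what `nu` controls, the sketch below fits a `OneClassSVM` on synthetic data with several values of `nu` and reports the fraction of training points that end up outside the learnt boundary. In scikit-learn, `nu` is documented as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so a small value such as 0.01 asks for a boundary that encloses almost all of the benign training cases. The data, the `gamma` value and the helper name `training_outlier_fraction` are assumptions for illustration only, and the observed fractions need not match `nu` exactly.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(300, 2))  # synthetic stand-in for scaled benign cases


def training_outlier_fraction(nu, gamma=0.1):
    """Fit on X_train and return the share of training points predicted as -1."""
    model = OneClassSVM(nu=nu, gamma=gamma).fit(X_train)
    return float((model.predict(X_train) == -1).mean())


fractions = {nu: training_outlier_fraction(nu) for nu in (0.01, 0.1, 0.5)}
for nu, frac in fractions.items():
    print(f"nu={nu}: {frac:.3f} of training points flagged as outliers")
```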

### <font color='blue'> Question 3: Before carrying out the final test, we have to assess how our model performs on the benign cases test set (X_B_test). How good is the model at predicting the right class for data? (15 marks)</font>

In [8]:
X_B_test_scaled = scaler.transform(X_B_test)

In [9]:
y_pred_test = one_class_svm_cancer.predict(X_B_test_scaled)

In [10]:
print("Confusion matrix of the benign only test set:")
cm = confusion_matrix(y_true = y_test, y_pred = y_pred_test)
print(cm)

Confusion matrix of the benign only test set:
[[  0   0]
 [  5 103]]
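
To read this output, recall that `confusion_matrix` puts the true labels in rows and the predicted labels in columns, with the labels in sorted order (here -1, then 1). The toy example below, with made-up predictions for four benign cases, shows why the first row is all zeros when the test set contains no malignant cases. The `labels` argument is passed explicitly so that both classes always appear, even if one of them is absent from the data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Four benign cases (label 1); the third is wrongly predicted as an outlier (-1).
y_true = np.array([1, 1, 1, 1])
y_pred = np.array([1, 1, -1, 1])

cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
print(cm)
# Row 0: true -1 (malignant) -> no such cases, so both counts are 0.
# Row 1: true  1 (benign)    -> 1 predicted as -1, 3 predicted as 1.
```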


In [11]:
print("Precision, Recall, F1-score for positive and negative classes of the benign only test set:")
print(classification_report(y_test, y_pred_test))
print('Accuracy for benign only test: {:.3f}.'.format(accuracy_score(y_test, y_pred_test)))

Precision, Recall, F1-score for positive and negative classes of the benign only test set:
             precision    recall  f1-score   support

         -1       0.00      0.00      0.00         0
          1       1.00      0.95      0.98       108

avg / total       1.00      0.95      0.98       108

Accuracy for benign only test: 0.954.


<b> Write your thoughts here (100 words max):</b>
#####################################################################################################################

On the benign-only test set the model performs well. Because this set contains no malignant cases, the scores for class -1 are reported as 0.00 and carry no information. Recall for the benign class (0.95) is the same quantity as the accuracy, so accuracy is the most useful single score here.

The confusion matrix shows 103 benign cases correctly recognised as benign and 5 wrongly flagged as outliers (false negatives, if benign, label 1, is taken as the positive class). A handful of misclassifications is a reassuring sign: many would suggest extreme over-fitting to the training set, while none at all would suggest extreme under-fitting.

Without testing many models I cannot be sure these parameters are optimised, but before introducing malignant cases the model seems to fit well, with an accuracy of 95.4%.


#####################################################################################################################

### <font color='blue'> Question 4: How good is the model at differentiating malignant from benign cases? (35 marks)</font>

In [12]:
#combine malignant cases with test set
X_test_with_M = pd.concat([X_B_test, X_M])
y_test_with_M = np.concatenate((y_test, y_M), axis=0)

#scale new data and predict
X_test_with_M_scaled = scaler.transform(X_test_with_M)
y_pred_test_with_M = one_class_svm_cancer.predict(X_test_with_M_scaled)

In [13]:
print("Confusion matrix of the test set:")
cm_M = confusion_matrix(y_true = y_test_with_M, y_pred = y_pred_test_with_M)
print(cm_M)

Confusion matrix of the test set:
[[173  39]
 [  5 103]]


In [14]:
print("Precision, Recall, F1-score for positive and negative classes of the combined benign and malignant test set:")
print(classification_report(y_test_with_M, y_pred_test_with_M))
print('Accuracy for combined benign and malignant test: {:.3f}.'.format(accuracy_score(y_test_with_M, y_pred_test_with_M)))

Precision, Recall, F1-score for positive and negative classes of the combined benign and malignant test set:
             precision    recall  f1-score   support

         -1       0.97      0.82      0.89       212
          1       0.73      0.95      0.82       108

avg / total       0.89      0.86      0.87       320

Accuracy for combined benign and malignant test: 0.863.


<b> Write your thoughts here (100 words max):</b>
#####################################################################################################################

Firstly, I can be fairly sure that I combined the malignant and benign cases correctly: the benign row of the confusion matrix (5 and 103) is the same as before, and the added malignant cases appear only in the other row.

Secondly, the model does a good job overall. Accuracy is 86.3%. For the benign class (1), precision is 73% (the share of benign predictions that are actually benign), recall is 95% (the share of actual benign cases predicted as benign) and the F1-score is 82%. For the malignant class (-1), precision is 97%, recall is 82% and the F1-score is 89%. The F1-score is the harmonic mean of precision and recall.

With some tuning I would hope to raise the benign precision above 73%, because missing a rare cancer could be devastating compared with ordering some extra tests. The confusion matrix shows where this relatively low precision comes from: 39 malignant cases were predicted as benign, against 173 that were correctly flagged.


#####################################################################################################################
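
The figures in the classification report can be recomputed by hand from the confusion matrix of the combined test set, which is a useful check on how each score is defined. The sketch below does this in plain Python; rows are true labels and columns are predicted labels, both in the order -1 (malignant), 1 (benign). The helper name `prf` is made up for this example.

```python
# Confusion matrix from the combined test set: rows = true, columns = predicted,
# label order [-1 (malignant), 1 (benign)].
cm = [[173, 39],
      [5, 103]]


def prf(cm, k):
    """Precision, recall and F1 for the class at index k of a 2x2 matrix."""
    true_pos = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as k
    actual_k = cm[k][0] + cm[k][1]      # row sum: everything truly k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


malignant = prf(cm, 0)
benign = prf(cm, 1)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

print("malignant (-1): precision %.2f, recall %.2f, f1 %.2f" % malignant)
print("benign    ( 1): precision %.2f, recall %.2f, f1 %.2f" % benign)
print("accuracy: %.4f" % accuracy)
```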