![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 5: One-Class-SVM.

# Anomaly detection via One-Class-SVM using the breast cancer data set

Suppose that we have a data set in which one class (e.g. malignant) is so underrepresented that there isn't enough data to fit a two-class classifier. For instance, the cancer may be so rare that too few cases are available to train an algorithm on them.


In **anomaly detection** one approach is as follows: 
1. Learn the distribution of the normal cases.
2. For newly incoming data, assess how well each observation fits the learnt distribution.
3. Rank the new data by quality of fit with the learnt distribution, so that the worst-fitting observations come first.

The result is an algorithm that presents, e.g. to a doctor, the measurements that are least likely to be normal. If data sets are large but anomalies are very few, this could save the doctor a lot of time.
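The three steps listed above can be sketched on synthetic data. This is only a minimal illustration, not the solution to the exercises below: the data come from `make_blobs`, the new observations are made up, and the parameter values (`nu=0.05`, `gamma="scale"`) are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the "normal" cases: one cluster around the origin.
X_normal, _ = make_blobs(n_samples=200, centers=[[0.0, 0.0]],
                         cluster_std=1.0, random_state=0)

# Step 1: learn the distribution of the normal cases.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

# Step 2: score new observations; a higher score means a better fit.
X_new_demo = np.array([[0.1, -0.2], [0.5, 0.4], [6.0, 6.0], [-5.0, 7.0]])
scores = detector.decision_function(X_new_demo)

# Step 3: rank so that the worst-fitting observations come first.
ranking = np.argsort(scores)
for i in ranking:
    print(f"observation {i}: score = {scores[i]:.3f}")
```

The two points far from the cluster should appear at the top of the ranking, which is the order in which a reviewer would inspect them.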

Other applications of this approach could be:
* credit card fraud detection: for each fraudulent transaction, there are thousands of valid transactions
* directing the attention of case officers (at the Department of Health, AIHW, or your company) to the most suspicious matter reports
* ...

We will use [`OneClassSVM()`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) from sklearn. 
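As a quick orientation, the sketch below shows the two pieces of the `OneClassSVM` interface that matter most here: `predict()` returns `+1` for inliers and `-1` for outliers, and the `nu` parameter is an upper bound on the fraction of training points treated as outliers. The data are synthetic and `nu=0.1` is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))      # synthetic "normal" data

nu = 0.1                               # illustrative value, not a recommendation
model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_toy)

labels = model.predict(X_toy)          # +1 = inlier, -1 = outlier
outlier_fraction = np.mean(labels == -1)
print(f"Fraction of training points flagged as outliers: {outlier_fraction:.3f}")
```

In practice the flagged fraction on the training data tends to sit close to `nu`, although the solver's tolerance means it need not match exactly.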

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm               import *
from sklearn.datasets          import make_blobs
from sklearn.metrics           import *
from sklearn.preprocessing     import *
from sklearn.model_selection   import *
from sklearn.pipeline          import *

We have loaded several submodules of `sklearn`. To learn as you go: 

* The documentation of each method can be accessed by pressing Shift+TAB with the cursor inside the parentheses, e.g. in `SVC()`.
* `sklearn` has more detailed documentation online, see e.g. <http://scikit-learn.org/stable/modules/svm.html>. 
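Outside the notebook interface, the same documentation can be reached from plain Python. The snippet below is a small sketch using the standard-library `inspect` module; it lists the constructor parameters of `OneClassSVM` and prints the first line of its docstring.

```python
import inspect
from sklearn.svm import OneClassSVM

# In a notebook, `OneClassSVM?` or help(OneClassSVM) shows the full docstring.
# Here we print only the constructor parameters and the docstring's first line.
params = list(inspect.signature(OneClassSVM).parameters)
print(params)
print(OneClassSVM.__doc__.strip().splitlines()[0])
```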

In [3]:
# get data
cancer = pd.read_csv('data/breast-cancer-wisconsin-data/data.csv', sep=',')

In [8]:
y = (cancer.diagnosis == 'M')   # boolean mask: True for malignant cases
print(f"There are {y.sum()} malignant and {(~y).sum()} benign cases.")

There are 212 malignant and 357 benign cases.


In [9]:
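# predictors: every column except the label
X = cancer.drop(['diagnosis'], axis=1)
# labels as an (n, 1) array of diagnosis strings (this overwrites the boolean mask above)
y = cancer[['diagnosis']].values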

### <font color='blue'> Question 1 (25 marks): </font>
1. <font color='blue'> Extract all the predictors for the benign cases into a matrix `X_B` and for the malignant cases into a matrix `X_M`.</font>
2. <font color='blue'> Split the benign cases into an 80% training set and a 20% test set. Remember that we train on a single class, the benign cases (anything the final classifier does not recognise as benign is treated as malignant), so there is no need for a `y` output. You can write it as `X_B_train, X_B_test = train_test_split(...)`.</font>
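As a reminder of how `train_test_split` behaves when it is given a single array, here is a small sketch on a made-up matrix. `X_demo` and the `random_state` value are hypothetical and unrelated to the cancer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # hypothetical 10 x 2 predictor matrix

# With one input array, train_test_split returns exactly two pieces.
X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.2, random_state=0)

print(X_demo_train.shape, X_demo_test.shape)   # (8, 2) (2, 2)
```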

<font color='blue'> START CODE HERE </font>

In [0]:
# Write Python code here:


<font color='blue'> END CODE HERE </font>

### <font color='blue'> Question 2 (25 marks):</font>
1. <font color='blue'>Using `Pipeline()`, create an "estimator" for the learnt distribution, which 
    * standardises the input
    * fits a `OneClassSVM()` such that the decision boundary encloses the smallest volume that contains the most likely 95% of the data points. 
2. <font color='blue'>Fit this estimator to the training data. 
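For reference, the general shape of a scaling-plus-model pipeline can be sketched on synthetic data. The step names (`"scale"`, `"svm"`), the data, and `nu=0.1` are illustrative assumptions; the value required by the question is left for you to work out.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Synthetic data whose two features live on very different scales.
X_demo = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(300, 2))

demo_estimator = Pipeline([
    ("scale", StandardScaler()),     # zero mean, unit variance per feature
    ("svm", OneClassSVM(nu=0.1)),    # nu=0.1 is an illustrative value
])
demo_estimator.fit(X_demo)

# The fitted scaler stores the training means and reuses them for new data.
print(demo_estimator.named_steps["scale"].mean_)
demo_pred = demo_estimator.predict(X_demo[:5])   # +1 = inlier, -1 = outlier
print(demo_pred)
```

Wrapping both steps in one `Pipeline` means the scaling learnt on the training data is applied automatically whenever the estimator scores new observations.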

<font color='blue'> START CODE HERE </font>

In [0]:
# Write Python code here:


<font color='blue'> END CODE HERE </font>

### <font color='blue'> Question 3 (25 marks):</font>
<font color='blue'>Confirm that your anomaly detector indeed classifies approximately 95% of the training data as "normal". 
How does it classify the test data? 

<font color='blue'> START CODE HERE </font>

On the training data:

In [0]:
# Write Python code here:


0.0456140350877193

On the test data:

In [0]:
# Write Python code here:


0.041666666666666664

<font color='blue'> END CODE HERE </font>

### <font color='blue'> Question 4 (25 marks): </font>
<font color='blue'>Pool the test data and the malignant data into `X_new` by stacking the rows of `X_B_test` and `X_M` (concatenation, not element-wise addition). 
Then output the 10 observations from `X_new` which are most likely to be outliers (and thus malignant), and compare with their actual labels.</font>

<font color='blue'>Steps: 
1. <font color='blue'>Make a list in which each item pairs the anomaly detector's score for a new observation with that observation's actual label.</font>
2. <font color='blue'>Sort this list by its first entry (the score).</font>
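The mechanics of pooling and sorting can be sketched with made-up numbers. Everything below is hypothetical: the two small tables stand in for the benign test set and the malignant set, and `demo_scores` are invented values that follow the `sklearn` convention that a lower score means a more anomalous observation.

```python
import numpy as np
import pandas as pd

# Two tiny hypothetical tables standing in for X_B_test and X_M.
demo_benign = pd.DataFrame({"f1": [1.0, 1.2], "f2": [0.9, 1.1]})
demo_malignant = pd.DataFrame({"f1": [3.0, 2.8, 3.5], "f2": [2.9, 3.1, 3.3]})

# Stack the rows; `+` on DataFrames would add element-wise instead.
demo_new = pd.concat([demo_benign, demo_malignant], ignore_index=True)
demo_labels = ["B"] * len(demo_benign) + ["M"] * len(demo_malignant)

# Invented detector scores (lower = more anomalous), one per row of demo_new.
demo_scores = np.array([0.8, 0.5, -1.2, -0.3, -2.0])

# Pair each score with its label, then sort by score in ascending order.
ranked = sorted(zip(demo_scores, demo_labels), key=lambda pair: pair[0])
for score, label in ranked[:3]:
    print(f"{score:+.1f}  {label}")
```

Sorting in ascending order puts the most anomalous observations first, so slicing the head of the list gives the candidates to inspect.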

<font color='blue'> START CODE HERE </font>

In [0]:
# Write Python code here:


<font color='blue'> END CODE HERE </font>