#Binary Classification with K-Nearest Neighbour

Today we will look at a breast cancer dataset and use the KNN (k-nearest neighbours) algorithm to classify tumors as either malignant or benign.

The dataset is historical, anonymized patient data from the US and contains information on 10 different attributes of patient tumors.

Before we get started, we need to import some libraries:

1. NumPy - Fundamental package for scientific computing with Python
2. Matplotlib - Python 2D plotting library
3. Pandas - Library providing high-performance, easy-to-use data structures and data analysis tools
4. scikit-learn (sklearn) - Simple and efficient tools for data mining and data analysis

In [0]:
# Import all the necessary libraries
from __future__ import print_function
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

##Load Dataset
First of all we want to read the CSV file that contains the dataset we will be using.

In [0]:
# Load the tabular data into the notebook
data = pd.read_csv('https://ai-camp-content.s3.amazonaws.com/breast_cancer.csv')

In [5]:
# Now let's view the first 10 rows in the dataset
data.head(10)

Unnamed: 0,id-number,clump_thickness,uniformity_of_cell_size,uniformity_of_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,diagnosis
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
5,1017122,8,10,10,8,7,10,9,7,1,4
6,1018099,1,1,1,1,2,10,3,1,1,2
7,1018561,2,1,2,1,2,1,3,1,1,2
8,1033078,2,1,1,1,2,1,1,1,5,2
9,1033078,4,2,1,1,2,1,2,1,1,2


##Manipulating the Data

A Pandas DataFrame has the indexer *iloc*, which selects rows and columns by integer position.

The syntax is `data.iloc[<row_selection>, <column_selection>]`

Below are some examples of using iloc to refresh the topic.

ToDo:
- Use iloc to separate the X and Y values of our dataset

In [14]:
# Examples:
#data.iloc[0] # first row of data frame
#data.iloc[-1] # last row of data frame
data.iloc[:5] # first 5 rows of data frame

#data.iloc[:, 0] # first column of data frame (id-number)
#data.iloc[:, -1] # last column of data frame (diagnosis)
#data.iloc[:, :5] # first 5 columns of data frame (id-number -> marginal_adhesion)

#data.iloc[0:5, 0:5] # first 5 rows, first 5 columns

Unnamed: 0,id-number,clump_thickness,uniformity_of_cell_size,uniformity_of_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,diagnosis
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [0]:
# Now we use iloc to separate the X and Y values of our dataset
# X values are columns 1 to 9 (the slice 1:10 stops BEFORE column 10, so column 10 is not included)
X = data.iloc[:,1:10]

# Y values are in column 10
Y = data.iloc[:,10]

##Re-label Y Values
Currently the values in the diagnosis column are either 2 (benign) or 4 (malignant). To make them easier to work with, we will re-label the values as 0 for benign and 1 for malignant. To do this we can use scikit-learn's `LabelEncoder`.

In [0]:
#Create the label encoder
labelencoder_Y = LabelEncoder()

#Call fit_transform on the labelencoder to encode the diagnosis column
Y = labelencoder_Y.fit_transform(Y)
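
To see what the encoder does, here is a minimal sketch on a handful of made-up diagnosis values (the `sample` list is illustrative and not taken from the dataset). `LabelEncoder` sorts the distinct labels and numbers them from 0, which is why 2 becomes 0 and 4 becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical diagnosis values: 2 = benign, 4 = malignant
sample = [2, 4, 2, 2, 4]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample)

print(encoder.classes_)  # the original labels, in sorted order
print(encoded)           # each label replaced by its position in classes_

# inverse_transform maps the encoded values back to the original labels
print(encoder.inverse_transform(encoded))
```

Keeping the fitted encoder around is useful if you later want to translate predictions back into the original 2/4 coding.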

##Split the Data
We need to split the dataset into training and test data. The training set is much larger than the test set because the model generally performs better with more data to learn from. The test set only needs to be a smaller percentage of the overall set, as it is used only to check how well the model predicts on data it has not seen.

Use scikit-learn's `train_test_split` function, which we used in the last tutorial, to split up the data.

ToDo:
- Split the train and test data

In [0]:
# Split the dataset into a training set (80%) and a test set (20%) using train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
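
The split above is random, so the exact accuracy will vary from run to run. As a rough sketch on made-up data (the arrays `X_demo` and `y_demo` and the seed value 42 are assumptions for illustration), passing `random_state` makes the split repeatable, and `stratify` keeps the benign/malignant ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 samples, 9 features, 70 benign (0) and 30 malignant (1)
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([0] * 70 + [1] * 30)

# random_state fixes the shuffle; stratify keeps the 70/30 class ratio in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print(y_te.mean())  # fraction of malignant samples in the test part
```

Whether to fix the seed is a design choice: it helps when comparing settings fairly, but a single fixed split can also hide how much the score depends on which rows end up in the test set.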

##Define and fit the model
We now need to create the classifier that we will use and train it with our data. 

To Do:
- Define the model (KNeighborsClassifier)
- Fit the model
- Predict the diagnosis values for the test data given
- Find out how accurate our model was 

In [27]:
# Create our classifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

# Fit a model using the classifier and our training data
classifier.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
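
With `metric = 'minkowski'` and `p = 2`, the classifier measures Euclidean distance between points. The following is a minimal sketch of the idea behind a KNN prediction: find the k closest training points and take a majority vote of their labels. The helper `knn_predict` and the toy points are hypothetical; scikit-learn's own implementation uses faster neighbour-search structures and may break ties differently.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(x_train, y_train, x_new, k=3):
    """Hypothetical helper: classify one point by majority vote of its k nearest neighbours."""
    # Euclidean distance (Minkowski with p = 2) from x_new to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' labels
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())

# Toy training data (made up): two well-separated clusters
x_toy = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

new_point = np.array([2, 2])
manual = knn_predict(x_toy, y_toy, new_point, k=3)

# The same query with scikit-learn, using the same distance settings as this notebook
model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
model.fit(x_toy, y_toy)
library = int(model.predict(new_point.reshape(1, -1))[0])

print(manual, library)
```

Note that "fitting" a KNN model mostly means storing the training data; the real work happens at prediction time.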

In [33]:
# Predict the diagnosis values on the test dataset and print the predictions
y_prediction = classifier.predict(x_test)

y_prediction

array([1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0])

In [29]:
y_test

array([1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 1, 0, 0])

##Analysis of Model Performance
Now that we have our model, we need to look at how well it performed. A well-fitting model is one where the difference between the actual values and the predicted values is small and unbiased across the training, validation and test data sets.

###Confusion Matrix
A confusion matrix is a table that is used to describe how well a classification model performs on a set of labelled test data. An example confusion matrix for a binary classifier is:

![](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)

- True positives (TP) - actual positives that are correctly identified as positive
- True negatives (TN) - actual negatives that are correctly identified as negative
- False positives (FP) -  actual negatives that are incorrectly identified as positive
- False negatives (FN) - actual positives that are incorrectly identified as negative
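
As a small sketch of how these four counts can be read off in code (the `actual` and `predicted` lists are made up for illustration): scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. This may differ from the layout used in other diagrams.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = malignant (positive), 0 = benign (negative)
actual    = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(actual, predicted)

# For a binary problem, ravel() flattens the matrix row by row
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```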

###Scores
From a confusion matrix, there are many different scores that can be computed to analyse the classifier's performance. The most important are:

**Accuracy** - the ratio of correctly predicted examples to the total examples

\begin{equation*}
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}

**Precision** - the ratio of correct positive predictions to the total predicted positives. For example, if a breast cancer screening comes back *positive*, how likely are you to *actually* have breast cancer?

\begin{equation*}
Precision = \frac{TP}{TP + FP}
\end{equation*}

**Recall** - the ratio of correct positive predictions to the total positive examples. For example, if you *actually* have breast cancer, how likely is a screening to come back *positive* rather than miss it?

\begin{equation*}
Recall = \frac{TP}{TP + FN}
\end{equation*}

In some situations precision might be more important than recall, or vice versa. If both are important, we can use the F1 score.

**F1 score** - the harmonic mean of precision and recall, used when we need a balance between the two

\begin{equation*}
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}
\end{equation*}
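
As a worked sketch of these formulas, the snippet below plugs in hypothetical counts (chosen for illustration, not the results from this notebook). It also checks that the F1 score computed from precision and recall agrees with the count form 2TP / (2TP + FP + FN).

```python
# Hypothetical counts for illustration (not the results from this notebook)
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Two equivalent ways of writing the F1 score
f1_from_scores = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print("Accuracy: ", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:   ", round(recall, 3))
print("F1:       ", round(f1_from_scores, 3), round(f1_from_counts, 3))
```

Here precision is higher than recall because there are more false negatives than false positives; F1 falls between the two.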

In [34]:
# Now that we have our model and predictions, we need to look at how accurate it was
# scikit-learn comes with built-in functions for evaluating models

print("Accuracy:\n%s" % metrics.accuracy_score(y_test, y_prediction))

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(y_test, y_prediction)))

print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_prediction))

# Note: support = the number of occurrences of each label in y_test

Accuracy:
0.9854014598540146
Classification report for classifier KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        88
           1       0.98      0.98      0.98        49

    accuracy                           0.99       137
   macro avg       0.98      0.98      0.98       137
weighted avg       0.99      0.99      0.99       137


Confusion matrix:
[[87  1]
 [ 1 48]]


##Exercises

1. Experiment with the train/test ratio to see how it changes how well the model performs.
    What are the highest and lowest accuracies you can achieve? How did you achieve them? Why is this the case?

In [0]:
# Your code:




2. Using the equations above, manually calculate the accuracy, precision, recall and F1 score.

In [0]:
# Your code:

