# CS342 Machine Learning
# Lab 3: _K_-NN classification

## Department of Computer Science, University of Warwick

In this lab, we will explore the use and implementation of a _K_-NN classifier and _k_-fold cross-validation.

# Data files for the lab

If working on one of the DCS machines, the data may be found here:

```/modules/cs342/2020/lab3/data/diabetes.data ```

You may load the data directly from that directory.

If you are using your own machine, copy the data across by running the following command in a terminal window, using the remote node that corresponds to your username. The name of the remote node is formed from the last two digits of your username, as remote-nn: for example, if your username is u1234567, you would connect to remote-67.dcs.warwick.ac.uk. Remember to substitute your own USERNAME and the corresponding REMOTE_NN:

```scp USERNAME@REMOTE_NN.dcs.warwick.ac.uk:/modules/cs342/2020/lab3/data/* .```

After entering your DCS password, this will copy the data to your current working directory. You should now have the following files:
```
├──[your working directory]
   └── diabetes.data
```
**Please make sure to use the correct path to these files when working on your own machine. The scripts below assume you are working on the DCS machines. Recall that the *.ipynb file (this file) should be in your working directory.**

### _K_-NN classification

 
We will use the Diabetes dataset from the UCI Machine Learning Repository (file *diabetes.data*). Our goal is to predict if female patients will test positive for diabetes given 8 attributes, including age and blood pressure. For more details on the dataset see: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Import the dataset ( _diabetes.data_ ) into a Pandas data frame and standardise the attributes: for each attribute, or feature, compute its mean and standard deviation (see Lab 1) and replace each feature value by:

(value - mean)/standard_deviation. 

Note that the last column corresponds to the class label: 1 for the positive class and 0 for the negative class. Also note that the _*.data_ file has no header row. By default, Pandas treats the first row of the file as the column names, which would silently drop the first sample. This behaviour can be disabled by setting the header argument to None. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
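The effect of the header argument and of the standardisation formula can be checked on a tiny example first. The sketch below is only an illustration: the three-row text `raw` and all of its values are invented, and the label is assumed to sit in the last column, as in the lab data.

```python
import io
import pandas as pd

# Hypothetical three-row file with no header line (values invented for illustration).
raw = "1,10,0\n2,20,1\n3,30,0\n"

# Default behaviour: the first row is consumed as column names.
with_default = pd.read_csv(io.StringIO(raw))
# header=None: every row is kept as data and columns are numbered 0, 1, 2.
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)    # (2, 3)
print(without_header.shape)  # (3, 3)

# Standardise the features only; the label (last column) is left untouched.
features = without_header.drop(2, axis=1)
standardised = (features - features.mean()) / features.std()
print(standardised)
```

Note that `DataFrame.std()` uses the sample standard deviation (dividing by n - 1) by default, so each standardised column has mean 0 and sample standard deviation 1.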

In [1]:
import pandas as pd

# import diabetes dataset (the file has no header row)
diabetes = pd.read_csv("./diabetes.data", header=None)

# separate the 8 attributes from the class label (column 8);
# the class label must NOT be standardised
features = diabetes.drop(8, axis=1)
y = diabetes[8]

# standardise attributes (subtract the mean and divide by the standard deviation)
X = (features - features.mean()) / features.std()

print(X)

            0         1         2         3         4         5         6  \
0    0.639530  0.847771  0.149543  0.906679 -0.692439  0.203880  0.468187   
1   -0.844335 -1.122665 -0.160441  0.530556 -0.692439 -0.683976 -0.364823   
2    1.233077  1.942458 -0.263769 -1.287373 -0.692439 -1.102537  0.604004   
3   -0.844335 -0.997558 -0.160441  0.154433  0.123221 -0.493721 -0.920163   
4   -1.141108  0.503727 -1.503707  0.906679  0.765337  1.408828  5.481337   
..        ...       ...       ...       ...       ...       ...       ...   
763  1.826623 -0.622237  0.356200  1.721613  0.869464  0.115094 -0.908090   
764 -0.547562  0.034575  0.046215  0.405181 -0.692439  0.609757 -0.398023   
765  0.342757  0.003299  0.149543  0.154433  0.279412 -0.734711 -0.684747   
766 -0.844335  0.159683 -0.470426 -1.287373 -0.692439 -0.240048 -0.370859   
767 -0.844335 -0.872451  0.046215  0.655930 -0.692439 -0.201997 -0.473476   

            7  
0    1.425067  
1   -0.190548  
2   -0.105515  
3   -1.0408

Given our two classes (negative and positive), write a function that takes as input your predicted targets and the true targets (i.e., the ground truth) and computes the *Accuracy* of the classifier, defined as:



\begin{equation*}
 \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN},
\end{equation*}

where $TP$ = No. of True Positives (model predicts positive and true value is positive), $FP$ = No. of False Positives (model predicts positive and true value is negative), $TN$ = No. of True Negatives (model predicts negative and true value is negative), and $FN$ = No. of False Negatives (model predicts negative and true value is positive).
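As a worked check of the definition, the four counts can be tallied explicitly on a small example. This is a minimal sketch: the helper names `confusion_counts` and `accuracy_from_counts` and the toy label vectors are hypothetical and only serve to illustrate the formula.

```python
import numpy as np

def confusion_counts(predicted, true):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    tp = int(np.sum((predicted == 1) & (true == 1)))
    fp = int(np.sum((predicted == 1) & (true == 0)))
    tn = int(np.sum((predicted == 0) & (true == 0)))
    fn = int(np.sum((predicted == 0) & (true == 1)))
    return tp, fp, tn, fn

def accuracy_from_counts(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / (TP + FN + FP + TN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Toy labels, invented for illustration.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(toy_pred, toy_true)
print(counts)                         # (3, 1, 3, 1)
print(accuracy_from_counts(*counts))  # 0.75
```

Since TP + TN is simply the number of predictions that match the ground truth, the same value is obtained by taking the mean of `predicted == true`.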

Perform _K_-NN classification using the scikit-learn implementation (*sklearn.neighbors.KNeighborsClassifier*) for _K_ ∈ {1, 2, 3, 4, 5}. Use 10-fold cross-validation (see _sklearn.model_selection_) to choose the best value of _K_. Make sure to display the accuracy of each classifier. Which classifier is the most accurate according to 10-fold cross-validation?

**Hint:** You may find the *KFold* class in *sklearn.model_selection* useful for keeping track of the samples assigned to each fold when performing 10-fold cross-validation.
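To see what *KFold* returns, it can help to run it on a small array before using it on the real data. The sketch below assumes a hypothetical array `toy_X` of 10 samples and uses 5 folds rather than 10, purely to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical samples with 2 features each (values are just 0..19).
toy_X = np.arange(20).reshape(10, 2)

# split() yields one (train indices, test indices) pair per fold.
kf = KFold(n_splits=5)
folds = list(kf.split(toy_X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test samples {test_idx.tolist()}, "
          f"{len(train_idx)} training samples")
```

Without `shuffle=True` the folds are consecutive blocks of rows, so every sample is used for testing exactly once and the split is the same on every run. The indices are positional, which is why they are typically used with `.iloc` on a Pandas data frame.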

In [2]:
import numpy
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

# define function to calculate accuracy

def accuracy(predicted: numpy.ndarray, true: pd.Series) -> float:
    # fraction of predictions that match the ground truth, i.e. (TP + TN) / total
    correct = numpy.asarray(predicted) == numpy.asarray(true)
    return float(correct.mean())


# define functions to perform K-NN classification with 10-fold cross-validation

def knn(k: int, X: pd.DataFrame, y: pd.Series, test: pd.DataFrame) -> numpy.ndarray:
    # fit a K-NN classifier on the training fold and predict the held-out fold
    neigh = KNeighborsClassifier(n_neighbors=k)
    neigh.fit(X, y)
    return neigh.predict(test)

def tenfold(k: int, X: pd.DataFrame, y: pd.Series) -> float:
    # consecutive, unshuffled folds, so the result is the same on every run
    kf = KFold(n_splits=10)
    accuracies = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        prediction = knn(k, X_train, y_train, X_test)
        accuracies.append(accuracy(prediction, y_test))
    # mean accuracy over the 10 folds
    return sum(accuracies) / len(accuracies)


# cross-validate with K = 1, 2, 3, 4, and 5

for k in range(1, 6):
    print(f"k = {k}:", tenfold(k, X, y))

k = 1: 0.708321941216678
k = 2: 0.7173786739576214
k = 3: 0.7460526315789473
k = 4: 0.73298017771702
k = 5: 0.7421394395078605
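
To answer the question of which classifier is the most accurate, the mean accuracies printed above can be compared directly. A minimal sketch, with the values copied from that output into a hypothetical dictionary `cv_accuracy`:

```python
# Mean 10-fold cross-validation accuracy for each K, copied from the output above.
cv_accuracy = {
    1: 0.708321941216678,
    2: 0.7173786739576214,
    3: 0.7460526315789473,
    4: 0.73298017771702,
    5: 0.7421394395078605,
}

# Pick the K with the highest mean accuracy.
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(f"best K = {best_k} (accuracy {cv_accuracy[best_k]:.4f})")
```

On these folds _K_ = 3 scores highest, with _K_ = 5 close behind (about 0.746 against 0.742), so the margin between the two is small.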
