# K-nearest neighbours classification
## Instructions:
* Go through the notebook and complete the tasks. 
* Make sure you understand the examples given. If you need help, refer to the Essential readings or the documentation link provided, or go to the Topic 2 discussion forum. 
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done.
 
**Task 1:**
Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).


In [2]:
%matplotlib inline

from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset 
print(iris.DESCR)

#Set X to the samples-by-features matrix and y to the target labels
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.4,X.shape)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
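
Because the noise in Task 1 is drawn at random, each run of that cell perturbs the data differently, so later results change from run to run. Below is a minimal sketch of one way to make the noise reproducible. It assumes NumPy's `default_rng` generator and an arbitrary seed of 0, neither of which is used in this notebook, and it works on a small stand-in array rather than the iris data. The helper `add_noise` is hypothetical.

```python
import numpy as np

# Stand-in for the iris feature matrix (3 samples, 4 features); illustrative values only
X_demo = np.array([[5.1, 3.5, 1.4, 0.2],
                   [7.0, 3.2, 4.7, 1.4],
                   [6.3, 3.3, 6.0, 2.5]])

def add_noise(X, sd=0.4, seed=0):
    """Return a copy of X with zero-mean Gaussian noise of standard deviation sd added."""
    rng = np.random.default_rng(seed)  # seeded generator: same seed gives the same noise
    return X + rng.normal(0, sd, X.shape)

noisy_a = add_noise(X_demo)
noisy_b = add_noise(X_demo)

# Both calls use the same seed, so the two noisy copies are identical
print(np.allclose(noisy_a, noisy_b))
```

Changing the `seed` argument gives a different, but still repeatable, perturbation.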

**Task 2:**
1.	How many data samples do we have?
2.	Print the value below using shape on ```X``` appropriately. 

In [3]:
#Enter code here
X.shape
print(X.shape[0])

150


**Task 3:**
1.	How many features do we have?
2.	Print the value below using shape on ```X``` appropriately. 


In [4]:
#Enter code here
print(X.shape[1])

4


**Task 4:**
1.	How many classes do we have?
2.	Print the value below using ```np.unique``` appropriately. 


In [5]:
#Enter code here
print(len(np.unique(y)))

3


**Task 5:**
1.	How many samples do we have that belong to class 1?
2.	Print this in the cell below using the ```np.where``` function appropriately. 


In [6]:
#Enter code here
ones = np.where(y==1)
print(len(ones[0]))

50
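
The count from Task 5 can also be obtained for every class at once. Here is a short sketch using the `return_counts` option of `np.unique`, with a toy label vector `y_demo` (hypothetical) standing in for `y`:

```python
import numpy as np

# Toy labels standing in for y (hypothetical values)
y_demo = np.array([0, 0, 1, 2, 1, 1, 2, 0, 1])

# np.where returns a tuple of index arrays; element 0 holds the matching positions
idx_class1 = np.where(y_demo == 1)[0]
n_class1 = len(idx_class1)

# np.unique with return_counts=True returns each distinct label and how often it occurs
labels, counts = np.unique(y_demo, return_counts=True)

print(n_class1)
print(dict(zip(labels.tolist(), counts.tolist())))
```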


**Task 6:** 

Assume we want to generate a list of shuffled indices of our data. Use the function ```numpy.random.permutation``` to do that. In the cell below, you can already see how to create a list of indices that is not shuffled.


In [7]:
#L=list(range(X.shape[0]))
#print(L)
#Enter code here
L1 = list(np.random.permutation(range(X.shape[0])))
print(L1)

[48, 104, 107, 132, 136, 49, 58, 108, 13, 106, 117, 120, 12, 20, 6, 31, 37, 124, 15, 59, 86, 114, 5, 77, 44, 17, 105, 22, 9, 11, 39, 149, 94, 36, 72, 18, 109, 147, 0, 80, 8, 73, 57, 55, 122, 70, 90, 112, 53, 115, 7, 148, 61, 54, 24, 99, 66, 27, 81, 10, 138, 96, 79, 85, 123, 128, 103, 127, 113, 52, 131, 121, 134, 41, 4, 118, 68, 51, 100, 116, 62, 91, 29, 78, 75, 28, 87, 141, 111, 34, 3, 137, 65, 139, 74, 144, 83, 102, 146, 140, 60, 56, 63, 130, 97, 19, 33, 1, 126, 2, 42, 21, 67, 133, 76, 125, 95, 30, 71, 64, 142, 26, 98, 143, 145, 69, 93, 32, 92, 119, 35, 82, 88, 16, 47, 135, 14, 23, 45, 38, 84, 46, 43, 50, 129, 89, 110, 40, 101, 25]
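
Shuffled indices are useful because the same permutation can be applied to both the features and the labels, which keeps every sample paired with its label. The sketch below builds a manual train/test split this way. The helper `shuffle_split`, the toy arrays and the seed are all assumptions made for illustration; none of them appear in the notebook.

```python
import numpy as np

def shuffle_split(X, y, test_fraction=0.2, seed=0):
    """Split X and y into train and test parts using one shared index permutation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])            # shuffled row indices
    n_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Indexing X and y with the same index arrays keeps rows and labels aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # row i is [2i, 2i+1]
y_demo = np.arange(10)                  # label i belongs to row i

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Since row `i` of `X_demo` starts with `2*i`, checking that `X_tr[:, 0]` equals `2 * y_tr` is a quick way to confirm that the pairing survived the shuffle.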


**Task 7:**
Here is an example of using the k-NN classifier. We split our data into training and test sets (holding out 20% of the samples for testing), fit the classifier on the training data, and evaluate it on the test data. 
Go through the code and make sure you understand it.
Now do the same for the next cell, which prints the confusion matrix and the total accuracy. 
You can find some documentation to help you here: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html. 
Note that for this lab, we use the Euclidean distance along with 5 neighbours.


In [51]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier, with 5 neighbors and use the Euclidean distance
knn=KNeighborsClassifier(n_neighbors=5, metric='euclidean')
#define training and testing data, fit the classifier
knn.fit(X_train,y_train)
#predict values for test data based on training data
y_pred=knn.predict(X_test)
#print values
print(y_test) # true values
print(y_pred) # predicted values


[2 2 1 0 0 0 0 1 1 0 2 1 1 1 2 2 1 1 2 2 2 0 0 1 0 2 0 2 1 0]
[1 2 1 0 0 0 0 1 1 0 1 1 1 1 2 2 1 1 2 2 2 0 0 1 0 2 0 2 2 0]
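
To make concrete what the classifier does at prediction time, here is a from-scratch sketch of the k-NN rule. It computes the Euclidean distance from a query point to every training point, takes the k closest, and returns the most common label among them. This is an illustration under simplifying assumptions (brute-force search, non-negative integer labels, vote ties resolved in favour of the smallest label), not scikit-learn's implementation. The function `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    preds = []
    for q in X_query:
        # Euclidean distance from q to every training point
        dists = np.sqrt(np.sum((X_train - q) ** 2, axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(dists)[:k]
        # Count votes per label; argmax picks the smallest label when counts tie
        votes = np.bincount(y_train[nearest])
        preds.append(np.argmax(votes))
    return np.array(preds)

# Two well-separated toy clusters (hypothetical data)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.1, 0.1], [5.0, 5.1]])

pred_toy = knn_predict(X_toy, y_toy, queries, k=3)
print(pred_toy)  # first query lies in cluster 0, second in cluster 1
```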


In [52]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))


[[10  0  0]
 [ 0  9  1]
 [ 0  2  8]]
0.9
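
In the confusion matrix, rows correspond to the true class and columns to the predicted class, so correct predictions lie on the diagonal. The sketch below re-enters the matrix printed above by hand (so it reflects that particular run only) and recovers the accuracy as the diagonal sum divided by the total count. The name `C_demo` is an assumption made for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
C_demo = np.array([[10, 0, 0],
                   [0, 9, 1],
                   [0, 2, 8]])

correct = np.trace(C_demo)   # sum of the diagonal: correctly classified samples
total = C_demo.sum()         # all test samples
acc_demo = correct / total

print(correct, total, acc_demo)
```

The off-diagonal entries show where the errors occur: here every mistake is a confusion between classes 1 and 2.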


**Task 8:**
Write your <b>own</b> functions that return the confusion matrix given the true and predicted labels, as well as the accuracy. To do so, fill in the code in the next two cells. 


In [53]:
# model answer

#create a matrix with entries equal to zero, and subsequently build the confusion matrix
#the method should return the confusion matrix in a numpy array
def myConfMat(y_ground,y_pred,classno):
    C= np.zeros((classno,classno),dtype=int)
    for i in range(0,len(y_ground)):
            
    #complete this line - assign the appropriate value to C[i,j]
    #(rows index the true class, columns the predicted class)
           
        C[y_ground[i],y_pred[i]]+=1
    return C

#note: len(np.unique(y))  indicates the dimensions of the confusion matrix (why?)
print(myConfMat(y_test,y_pred,len(np.unique(y))))

[[10  0  0]
 [ 0  9  1]
 [ 0  2  8]]
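
The same matrix can also be built without an explicit loop. Below is a sketch using `np.add.at`, which accumulates correctly when the same (row, column) pair occurs more than once. The function name `conf_mat_vectorised` and the toy labels are hypothetical, and labels are assumed to be integers from 0 to `classno - 1`.

```python
import numpy as np

def conf_mat_vectorised(y_ground, y_pred, classno):
    """Confusion matrix with rows indexed by true class and columns by predicted class."""
    C = np.zeros((classno, classno), dtype=int)
    # Unbuffered in-place add: each (true, predicted) pair increments its cell by 1
    np.add.at(C, (y_ground, y_pred), 1)
    return C

# Toy labels (hypothetical)
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

C_vec = conf_mat_vectorised(y_true_demo, y_pred_demo, 3)
print(C_vec)
```

Plain fancy-indexed assignment such as `C[y_ground, y_pred] += 1` would not work here, because repeated index pairs are only counted once.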


In [46]:
same = np.where(y_test == y_pred)
len(same[0])

26

In [54]:
# model answers

#use the numpy function where to return the accuracy given the true/predicted labels.  i.e., #correct/#total
def myAccuracy(y_test,y_pred):
    same = np.where(y_test == y_pred)
    A = len(same[0]) / len(y_test)      
    return A
    
    
print(myAccuracy(y_test,y_pred))



0.9
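
Accuracy can also be written as the mean of an element-wise comparison, since `True` counts as 1 and `False` as 0. The sketch below copies the true and predicted labels printed in Task 7 by hand, so it reflects that particular run. The names `y_true_run` and `y_pred_run` are assumptions made for illustration.

```python
import numpy as np

# Labels copied from the Task 7 output (one particular random split)
y_true_run = np.array([2, 2, 1, 0, 0, 0, 0, 1, 1, 0,
                       2, 1, 1, 1, 2, 2, 1, 1, 2, 2,
                       2, 0, 0, 1, 0, 2, 0, 2, 1, 0])
y_pred_run = np.array([1, 2, 1, 0, 0, 0, 0, 1, 1, 0,
                       1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
                       2, 0, 0, 1, 0, 2, 0, 2, 2, 0])

matches = (y_true_run == y_pred_run)   # boolean array, True where the prediction is correct
n_correct = int(matches.sum())
acc_run = matches.mean()               # fraction of True entries

print(n_correct, len(y_true_run), acc_run)
```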


**Optional task:** Write your own functions to calculate per-class precision and recall. Compare these to the sklearn functions ``precision_score`` and ``recall_score`` on your y_test and y_pred values.

In [55]:
classes = np.unique(y_test)
C = myConfMat(y_test,y_pred,len(np.unique(y)))
for i in range(0, len(classes)):
    tp = C[i,i]          #true positives for class i
    col = sum(C[:,i])    #all samples predicted as class i
    prec = tp / col      #precision for class i

In [56]:
#hint: you can use the output from your myConfMat function above

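def myPrecision(y_ground,y_pred):
    classes = np.unique(y_ground)
    #confusion matrix for these labels (assumes integer labels 0..K-1, all present in y_ground)
    C = myConfMat(y_ground,y_pred,len(classes))
    precision = np.zeros(classes.shape)
    for i in range(0, len(classes)):
        tp = C[i,i]
        col = sum(C[:,i])   #all samples predicted as class i
        precision[i] = tp / col        
    return precision


def myRecall(y_ground,y_pred):
    classes = np.unique(y_ground)
    C = myConfMat(y_ground,y_pred,len(classes))
    recall = np.zeros(classes.shape) 
    for i in range(0, len(classes)):
        tp = C[i,i]
        row = sum(C[i,:])   #all samples that truly belong to class i
        recall[i] = tp / row        
    return recall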

print('classes:      %s' % np.unique(y_pred) )    
print('my precision: %s' % myPrecision(y_test,y_pred))
print('my recall:    %s' % myRecall(y_test,y_pred))


classes:      [0 1 2]
my precision: [1.         0.81818182 0.88888889]
my recall:    [1.  0.9 0.8]


In [57]:
from sklearn.metrics import precision_score, recall_score 
# check that your functions do the same thing as the library versions

print('library precision: %s' % precision_score(y_test,y_pred,average=None))
print('library recall: %s' % recall_score(y_test,y_pred,average=None))



library precision: [1.         0.81818182 0.88888889]
library recall: [1.  0.9 0.8]
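
The per-class loops can be collapsed into array operations on the confusion matrix. The sketch below re-enters the matrix from Task 7 by hand and divides its diagonal by the column sums (precision) and the row sums (recall). It assumes every class is predicted at least once and occurs at least once, so no division by zero arises. The names `C_demo`, `precision_vec` and `recall_vec` are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the Task 7 output (rows: true class, columns: predicted class)
C_demo = np.array([[10, 0, 0],
                   [0, 9, 1],
                   [0, 2, 8]])

tp = np.diag(C_demo)                      # true positives per class
precision_vec = tp / C_demo.sum(axis=0)   # divide by column sums: everything predicted as the class
recall_vec = tp / C_demo.sum(axis=1)      # divide by row sums: everything truly in the class

print('precision:', precision_vec)
print('recall:   ', recall_vec)
```

For this matrix the values are 10/10, 9/11 and 8/9 for precision, and 10/10, 9/10 and 8/10 for recall, which agree with the outputs shown above.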
