# 5. Classification and Cross-Validation

**Instructions:**
* go through the notebook and complete the **tasks** .  
* Make sure you understand the examples given!
* When a question allows a free-form answer (e.g., ``what do you observe?``) create a new markdown cell below and answer the question in the notebook.
* ** Save your notebooks when you are done! **

In the previous lab, we loaded up the iris dataset for flower classification, and performed simple exploratory data analysis, i.e., we visualized the data available (features given class labels) in order to understand characteristics of the data (e.g., that some classes are easier to be separated from others based on some features, etc.)

If you don't remember much about this, please revisit the corresponding lab (Lab 2) before moving on.

In this lab, we will go through the process of actually training a classifier on a dataset (training set), and evaluating the performance of the classifier on unknown data (test set)

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).

In [2]:
%matplotlib inline


from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset (uncomment next line to do so)
print(iris.DESCR)

#Set X a samples times features matrix, Y equal to the targets (targets = classes to use as labels)
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=np.random.normal(0,0.05,X.shape)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many data samples do we have?  Print the value below using ``shape`` on X appropriately.

In [3]:
print(X.shape[0])

150


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many features do we have?  Print the value below using ``shape`` on X appropriately.

In [4]:
print(X.shape[1]) 

4


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many classes do we have?  Print the value below using ``np.unique`` appropriately.

In [5]:
classes = np.unique(y) # y = iris.data (see above)
print("Classes:")
print(classes)
print("Number of Classes:")
print(classes.size)

Classes:
[0 1 2]
Number of Classes:
3


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many samples do we have that belong to class 1?  Use the ``np.where`` function appropriately on y to print this in the cell below.

In [6]:
indecesOfClassifiedAsZero = np.where(y==0)
print(indecesOfClassifiedAsZero[0].size)

50


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Assume we want to generate a list of shuffled indices of our data.  Use the function ``numpy.random.permute`` to do that.  In the cell below, you can already see how to create a list of indices that is **not** shuffled.

In [7]:
L=list(range(X.shape[0]))
print(L)
#Enter code here

shuffle = np.random.permutation(L)
print()
print(shuffle)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]

[ 49  68 142  37  23  18  36 119  66  63 120  48  26  73  17  96  20  15
 118  81 148  38  80 124 126  88   6  42 114 143 138  51  28  74 116  82
  69 144  34  44  59 107  35 123  92 125 127  65  60  32 147  64  33  71
 111 109  11  70  19  29  62 102   4  10  79  24 134  41 113   9  13 139
  43  84  40  16  98 141 121  58   1  46 117 108  25 132 112  52 1

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Here is an example of using the k-NN classifier.  We split our data to training and testing (with a 0.2 percentage for our test data), fit on the training data, test on the testing data.  Go through the code and make sure you understand it.  Subsequently, do the same for the next cell, that prints the confusion matrix and the total accuracy.  (documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

**note: for this lab, we use the euclidean distance along with 10 neighbours**

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test (python allows to distribute N returned values across N variables)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier, with 5 neighbors and use the euclidian distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
#define training and testing data, fit the classifier
knn.fit(X_train,y_train)
#predict values for test data based on training data
y_pred=knn.predict(X_test)
#print values
print()
print(y_test) # true values
print(y_pred) # predicted values


[0 0 0 2 0 0 1 2 0 0 1 1 2 1 2 0 0 2 1 2 2 2 2 0 1 0 2 0 2 0]
[1 0 0 0 0 0 1 2 2 1 0 1 2 1 1 1 0 0 1 2 1 0 2 0 0 0 1 1 0 1]


In [9]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[7 5 1]
 [2 4 0]
 [4 3 4]]
0.5


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your **own** functions that return the confusion matrix given the true and predicted labels, as well as the accuracy.  To do so, fill in the code in the next two cells.

In [10]:
#create a matrix with entries equal to zero, and subsequently build the confusion matrix
#the method should return the confusion matrix in a numpy array
def myConfMat(y_test,y_pred,classno):
    C=np.matlib.zeros((classno, classno)) # initialize the confusion matrix to zeros (X by Y)
    for i in range (0, y_test.size): # loop through the test data (actual results)
        for x in range(0, classno): # loop through the x axis of the confusion matrix
            for y in range(0, classno): # loop through the y of the confusion matrix
                if((y_test[i] == x) and (y_pred[i] == y)): # test 
                     C[x,y] += 1;
            
    #loop through all results and update the confusion matrix
    return C

#note: len(np.unique(y))  indicates the dimensions of the confusion matrix (why?)
print(myConfMat(y_test,y_pred,len(np.unique(y))))

[[ 7.  5.  1.]
 [ 2.  4.  0.]
 [ 4.  3.  4.]]


In [14]:
#use the numpy function where to return the accuracy given the true/predicted labels.  i.e., #correct/#total
def myAccuracy(y_test,y_pred):
    matches = 0
    attempts = y_test.size
    for i in range (0, attempts):
        if(y_test[i] == y_pred[i]):
            matches+=1
        
    accuracy=matches/attempts
    
    return accuracy
    
    
print(myAccuracy(y_test,y_pred))

0.5


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your own cross-validation function.  In this case, we are using a fixed distance (euclidean) and a fixed number of neighbours (10) so we do **not** need to create a validation set.

Your function (see cell below) firstly splits the indices of each of our data into bins according to the number of folds (here: 5-fold).

Then, you should loop through all folds, split the data into training and testing by selecting the appropriate bins (see slides on cross-validation), train on training data and save the test result as the accuracy for each fold (see list accuracy_fold).  This is the list that your function should return in the end.  Remember that the ``extend`` function extends a list with more values.  

The final print call in the end of the cell should print the list of accuracies, with five values, one for each fold.

In [None]:
def myCrossVal(X,y,foldK):
    accuracy_fold=[] #list to store accuracies folds
    
    
    #TASK: use the function np.random.permutation to generate a list of shuffled indices from in the range (0,number of data)
    #(you did this already in a task above)
    #indices=#...
    #print(indices)
    
    #TASK: use the function array_split to split the indices to k different bins:
    #uncomment line below
    #bins=
    #print(bings)
    
    
    #loop through folds
    for i in range(0,foldK):
        foldTrain=[] # list to save current indices for training
        foldTest=[]  # list to save current indices for testing
        #TASK: take bin i for testing, rest for training.  Can use the function extend to add indices to foldTrain and foldTest
        #train kNN classifier
        #test on test data
        #append the new accuracy to your accuracy_fold list.  You can use accuracy_score or your myAccuracy function.
    return accuracy_fold;
    
accuracy_fold=myCrossVal(X,y,5)
print(accuracy_fold)

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Print the average accuracy and standard deviation of your results over the 5 folds. (functions ``mean`` and ``std``)

In [None]:
#######################################