# Exercise 1 | TKO_2096 Application of Data Analysis 2021

#### Nested cross-validation for K-nearest neighbors
- Use Python 3 to implement nested cross-validation for the k-nearest neighbors (kNN) method so that the number of neighbors k is selected automatically from the range 1 to 10. In other words, the base learning algorithm is kNN, but the actual learning algorithm, whose prediction performance is evaluated with nested CV, is kNN with automatic CV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- As a kNN implementation, you can use sklearn: http://scikit-learn.org/stable/modules/neighbors.html, but you can also use your own kNN implementation if you want more control over what happens in the learning process. The CV implementation should be easy to modify, since the forthcoming exercises involve different problem-dependent CV variations.
- Run the nested CV implementation on the iris data and report the resulting classification accuracy. Hint: you can use the nested CV example in the sklearn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html as a starting point and compare your nested CV implementation with it, but do NOT use the ready-made CV implementations of sklearn, as the idea of the exercise is to learn to split the data on your own. The other exercises need more sophisticated data splitting, which is not necessarily available in libraries.
- Return your solution for each exercise BOTH as a Jupyter Notebook file and as a PDF file made from it.
- Return the report to the course page on **Monday 1st of February** at the latest.  
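
The nested structure described in the first bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the course's reference solution: the function names (`knn_predict`, `cv_accuracy`, `nested_cv`), the plain-NumPy kNN used in place of sklearn, the choice of 4 inner folds, the fixed seed, and the synthetic two-cluster data are all hypothetical. The key point is that the inner loop, which selects k, only ever sees the outer training indices.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Plain-NumPy kNN: majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Labels are assumed to be non-negative integers (as in iris.target)
    return np.array([np.bincount(row).argmax() for row in votes])

def cv_accuracy(X, y, folds, k):
    """Mean accuracy of kNN with a fixed k over the given index folds."""
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
        scores.append(np.mean(pred == y[val_idx]))
    return float(np.mean(scores))

def nested_cv(X, y, n_outer=5, n_inner=4, k_values=range(1, 11), seed=0):
    """Outer loop estimates performance; inner loop selects k."""
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(X)), n_outer)
    scores = []
    for i, test_idx in enumerate(outer_folds):
        train_idx = np.concatenate(
            [f for j, f in enumerate(outer_folds) if j != i]
        )
        # Model selection uses the outer training indices only
        inner_folds = np.array_split(train_idx, n_inner)
        best_k = max(k_values, key=lambda k: cv_accuracy(X, y, inner_folds, k))
        # Refit on the whole outer training set and score on the held-out fold
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], best_k)
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))

# Synthetic stand-in data: two well-separated clusters (not the iris data)
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0, 0.5, (30, 2)),
                    demo_rng.normal(10, 0.5, (30, 2))])
y_demo = np.array([0] * 30 + [1] * 30)
demo_accuracy = nested_cv(X_demo, y_demo)
print(demo_accuracy)
```

Working with index arrays rather than copies of the data is one way to keep the splitting easy to modify for later exercises, since only the code that builds `outer_folds` and `inner_folds` would need to change.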

## Import libraries

In [31]:
#In this cell import all libraries you need. For example: 
import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

## Results of the nested cross-validation

In [30]:
#In this cell run your script for nested CV and print the result.
# 

def split_data(X, y, k):
    # Split the data into k equal-sized folds; if len(X) is not divisible by k,
    # the leftover samples are not included in any fold.
    foldsize = int(len(X)/k)
    X_folds = []
    y_folds = []
    for i in range(k):
        X_fold = []
        y_fold = []
        while len(X_fold) < foldsize: # add rows to the fold until the fold size is met
            X_fold.append(X[0])
            y_fold.append(y[0])
            X = X[1:] # since we'll shuffle the data, just remove the first row from the array
            y = y[1:]
        X_folds.append(X_fold)
        y_folds.append(y_fold)
    return X_folds, y_folds # returns k folds

def find_k(X_folds, y_folds): # this is the inner loop function to find the best k
    k_acc = []
    for i in range(1,11): # k is an integer, 1 <= k <= 10
        acc = []
        for j in range(len(X_folds)): # cross-validate this k, using every fold once as the validation set
            X_dev = X_folds[j] # choose the jth fold as a validation set
            X_train = np.delete(X_folds, j, 0) # delete the jth fold from the rest -> training set
            X_train = np.concatenate([x for x in X_train]) # concatenate training folds into 2D array
            y_dev = y_folds[j] # same for y
            y_train = np.delete(y_folds, j, 0)
            y_train = np.concatenate([y for y in y_train])
            knn = KNeighborsClassifier(n_neighbors=i) # KNN with k = i
            knn.fit(X_train, y_train)
            pred = knn.predict(X_dev)
            acc.append(accuracy_score(y_dev, pred)) # add the accuracy score with this split to a list
        k_acc.append(np.average(acc)) # add the average of the accuracies with this k into a list
    best_k = k_acc.index(max(k_acc))+1
    return best_k

def evaluate(folds): # the outer loop function, takes the number of folds as an argument
    acc = []
    X_folds, y_folds = split_data(X,y,folds) # splitting the data into folds
    for i in range(folds): # outer loop for evaluation
        X_test = X_folds[i] # take the ith fold as a test fold and delete it from the rest -> train
        X_train = np.delete(X_folds, i, 0)
        y_test = y_folds[i]
        y_train = np.delete(y_folds, i, 0)
        k = find_k(X_train, y_train) # find the best k for this train set
        X_train = np.concatenate([x for x in X_train]) # concatenate into 2D array to evaluate
        y_train = np.concatenate([y for y in y_train])
        knn = KNeighborsClassifier(n_neighbors=k) # KNN with the best k
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        acc.append(accuracy_score(y_test, pred)) # add accuracy score with the best k for this split
    print(np.average(acc)) # prints out the accuracy average for the whole data

#load the data
iris = load_iris()
X, y = iris.data, iris.target
X, y = shuffle(X, y)

evaluate(5) # give the function an integer as an argument
evaluate(10) # data size is 150, so these result in an even split
evaluate(8) # this runs as well, but 150 is not divisible by 8: fold size is 18, so 6 samples are left out

0.9733333333333334
0.96
0.9791666666666667
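
The three accuracies above depend on the unseeded `shuffle` call, so they will differ between runs, and the 8-fold run leaves out the remainder samples. A possible alternative, sketched here with a hypothetical helper `make_folds` and an arbitrarily chosen seed, is to shuffle the sample indices with a seeded generator and split them with `np.array_split`, which keeps every sample and lets fold sizes differ by at most one.

```python
import numpy as np

def make_folds(n_samples, n_folds, seed=0):
    """Return n_folds index arrays that together cover every sample once."""
    rng = np.random.default_rng(seed)         # fixed seed: same split on every run
    shuffled = rng.permutation(n_samples)     # shuffled sample indices
    return np.array_split(shuffled, n_folds)  # fold sizes differ by at most one

folds = make_folds(150, 8)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)  # [19, 19, 19, 19, 19, 19, 18, 18]

# For comparison: equal-sized folds of int(150 / 8) = 18 rows use only 144 samples
dropped = 150 - 8 * int(150 / 8)
print(dropped)  # 6
```

The index arrays can then be used to select rows, for example `X[folds[0]]` and `y[folds[0]]`, instead of slicing rows off the front of the data.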
