# Hw 3: Nearest Neighbors
## Submit to Dropbox Hw3 by Oct 21 

### Part 1. 

Your programs should read the provided data files. In these files, each instance is described on a single line. The feature values are separated by
commas, and the last value on each line is the class label (for classification). Lines starting with '%' are comments.
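
As a minimal sketch of reading this format, assuming pandas is available, `read_csv` can skip the '%' comment lines, and the last column can then be split off as the class label. The helper name `load_instances` and the sample rows are hypothetical, made up for illustration.

```python
import io

import pandas as pd


def load_instances(source):
    """Read comma-separated instances and return (features, labels).

    `load_instances` is a hypothetical helper, not part of the assignment.
    """
    # comment="%" drops everything from a '%' to the end of the line;
    # with header=None, fully commented lines are skipped like blank lines.
    table = pd.read_csv(source, header=None, comment="%")
    features = table.iloc[:, :-1].to_numpy(dtype=float)  # all but the last column
    labels = table.iloc[:, -1].to_numpy()  # last column is the class label
    return features, labels


# Made-up sample in the described format (not real yeast data).
sample = io.StringIO(
    "% feature1, feature2, class\n"
    "0.5,0.25,A\n"
    "0.1,0.75,B\n"
)
X, y = load_instances(sample)
print(X.shape, list(y))
```

The same call should work with a file path such as `yeast_train.txt` in place of the in-memory sample.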

- Your programs should implement a k-nearest neighbor learner as a function, according to the following guidelines:
  - Assume that for classification tasks, the class attribute is named 'class' and it is the last attribute listed among all the attributes.
  - Assume that all features will be numeric.
  - Use Euclidean distance to compute distances between instances.
  - Implement basic k-NN.
  - If multiple instances are tied for a place among the k nearest neighbors, the tie should be broken in favor of the instances that come first in the data file.
  - If there is a tie in the class predicted by the k-nearest neighbors, then among the classes that have the same number of votes, the tie should be broken in favor of the class that comes first in the data file.
- You should include a function myKNN that accepts three arguments as follows:
  - myKNN(traindata,testdata, k)
  - The myKNN function should use the training set and the given value of k to make classifications/predictions for every instance in the test set. This can be called from a main calling function.
  - The main program should use p-fold cross validation (with p = 10) on the training data alone to select the value of k to use for the test set: evaluate candidate values k1, k2, k3, … (set them to any values you like) and select the one that results in the minimal cross-validated error within the training set.
  - To measure error, you should use mean absolute error. The following link shows how to use cross validation with Python, including generating indices for each fold.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
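
The two tie-breaking rules in the guidelines above can be illustrated on a toy example. This is only a sketch: the distances and labels are made up, and it assumes that "comes first in the data file" means the class whose first occurrence in the training data is earliest.

```python
import numpy as np

# Made-up distances from one test instance to five training instances,
# listed in data-file order; instances 1 and 3 are tied at 0.5.
distances = np.array([0.9, 0.5, 0.2, 0.5, 0.7])
labels = np.array(["B", "A", "B", "A", "C"])
k = 2

# Rule 1: a stable sort keeps equal distances in file order,
# so instance 1 is preferred over instance 3.
order = np.argsort(distances, kind="stable")
neighbors = order[:k]

# Rule 2: count the votes, then among the classes with the most votes
# pick the one whose first occurrence in the file is earliest (assumption).
votes = labels[neighbors]
classes, counts = np.unique(votes, return_counts=True)
tied = classes[counts == counts.max()]
first_seen = {c: int(np.flatnonzero(labels == c)[0]) for c in tied}
prediction = min(tied, key=first_seen.get)

print(neighbors, prediction)
```

Here the two nearest neighbors cast one vote each for "B" and "A", and "B" wins because it appears earlier in the label list.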

- As output, your programs should print the value of k used for the test set on the first line, and then the predictions for the test-set instances.
- For each instance in the test set, your program should print one line of output with spaces separating the fields.
- For a classification task, each output line should list the predicted class label, and actual class label.
- This should be followed by a line listing the number of correctly classified test instances, and the total number of instances in the test set.
- This should be followed by a line listing the mean absolute error for the test instances, and the total number of instances in the test set.
- Copy and paste this output into the .docx file you will submit to Canvas.
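
One possible way to assemble the output lines listed above is sketched below. The helper name `report` and the labels are hypothetical, and the sketch assumes that for class labels the absolute error of a prediction is 0 when it is correct and 1 otherwise.

```python
import numpy as np


def report(k, predicted, actual):
    """Build the output lines described above (hypothetical helper)."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    lines = [str(k)]  # first line: the value of k used
    # One line per test instance: predicted label, then actual label.
    lines += [f"{p} {a}" for p, a in zip(predicted, actual)]
    correct = int(np.sum(predicted == actual))
    total = len(actual)
    # Assumption: 0/1 error per instance, so MAE is the fraction misclassified.
    mae = (total - correct) / total
    lines.append(f"{correct} {total}")
    lines.append(f"{mae} {total}")
    return lines


# Made-up labels for illustration.
out = report(5, ["A", "B", "C"], ["A", "C", "C"])
print("\n".join(out))
```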

You should test your code on the following two data sets:
- yeast_train.txt
- yeast_test.txt

### Part 2.

For this part you will explore the effect of the k parameter on predictive accuracy.

- For the yeast data set, draw a plot showing how test-set accuracy varies as a function of k. Your plot should show accuracy for k = 1, 5, 10, 15, 20 after p-fold cross validation (where p=10).
- For the yeast data set, construct confusion matrices for the k = 1 and k = 15 test-set results (you don’t need to do cross validation for this). Show these confusion matrices and briefly discuss what the matrices tell you about the effect of k on the misclassifications. For an introduction to confusion matrices, see:

http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

The Python code for confusion matrices can be found at
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
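
As a small illustration of that function, the sketch below builds a confusion matrix from made-up labels (not yeast results). In scikit-learn's convention, rows correspond to the actual classes and columns to the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for six test instances.
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
class_names = ["A", "B", "C"]

# cm[i, j] counts instances of actual class i predicted as class j.
cm = confusion_matrix(actual, predicted, labels=class_names)
print(cm)

# The diagonal holds the correct predictions.
correct = int(cm.trace())
print(correct, "of", len(actual), "correct")
```

Comparing the off-diagonal cells of the k = 1 and k = 15 matrices shows which pairs of classes are confused more or less often as k changes.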

Put the results from both parts in the .docx file and submit it to Dropbox.

In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import KFold
import scipy.stats as sp
import statistics as st

In [17]:
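def normalize(data):
    """Standardize each column to zero mean and unit variance, in place."""
    n, m = data.shape
    avg = np.mean(data, axis=0)
    for i in range(0, m):
        temp = data[:, i] - avg[i]
        s = np.std(data[:, i])  # population standard deviation (ddof=0)
        data[:, i] = temp / s
    return data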

In [43]:
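def myKNN(train, label_train, test, k):
    # Plain arrays give positional indexing; a pandas Series would index by label.
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    label_train = np.asarray(label_train)
    n2 = test.shape[0]
    distance = euclidean_distances(test, train)  # n2 by n1
    y_test = np.empty(n2, dtype=object)  # dtype=str would truncate labels to one character

    # Position of each class's first appearance in the training data, used for vote ties.
    classes, first_seen = np.unique(label_train, return_index=True)
    first_pos = dict(zip(classes, first_seen))

    for i in range(n2):
        # A stable sort breaks distance ties in favor of earlier training instances.
        ind = np.argsort(distance[i, :], kind="stable")
        k_top_labels = label_train[ind[:k]]
        values, counts = np.unique(k_top_labels, return_counts=True)
        tied = values[counts == counts.max()]
        # Majority vote; among tied classes, take the one seen first in the training data.
        y_test[i] = min(tied, key=lambda c: first_pos[c])

    return y_test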

In [25]:
train = pd.read_csv("./data/yeast_train.txt", header=None)
label_train = train.iloc[:, 8]
train = train.iloc[:,0:8]
n1, m1 = train.shape
print(train.shape)

(1039, 8)


In [26]:
test = pd.read_csv("./data/yeast_test.txt", header=None)
label_test = test.iloc[:, 8]
test = test.iloc[:,0:8]
n2, m2 = test.shape
print(test.shape)

(445, 8)


In [27]:
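# Stack train and test so both are standardized with the same statistics.
# Note: the mean and std therefore also depend on the test rows.
data = np.concatenate((train, test), axis = 0)
data = normalize(data)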

In [28]:
train = data[0:n1, :]
test = data[n1:n1+n2, : ]

In [44]:
k_results = []
accuracy_results = []
# Homework
# k_choices = np.arange(10, 100, step=10)
k_choices = np.array([1, 5, 10, 15, 20])
# Fold indices are positions, so index a plain array rather than a pandas Series.
labels = np.asarray(label_train)

for k in k_choices:
    kf = KFold(n_splits=10)

    for train_index, test_index in kf.split(train):
        # myKNN returns the predicted labels for the held-out fold
        y_test = myKNN(train=train[train_index], label_train=labels[train_index], test=train[test_index, :], k=k)
        acc = np.mean(y_test == labels[test_index])
        k_results.append(k)
        accuracy_results.append(acc)

df = pd.DataFrame({"k": k_results, "accuracy": accuracy_results})
df = df.groupby("k").mean().reset_index()
# With 0/1 error per instance, the mean absolute error equals 1 - accuracy.
df["error"] = 1 - df["accuracy"]
best = df.loc[df["error"].idxmin()]
best_k = int(best["k"])
mae = best["error"]
print(f"Best k: {best_k}")
print(f"Mean Absolute Error: {mae:.2f}")

In [None]:
plt.figure(figsize=(12, 10))
sns.lineplot(x = k_results, y = accuracy_results, marker="o")
plt.xlabel("K values")
plt.ylabel("Accuracy Scores")
plt.show()