
Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
# Name format: Last name, First name
NAME = ""
COLLABORATORS = ""

# Hands-On KNN
***

In this notebook we'll investigate Scikit-Learn's implementation of K-Nearest Neighbors. In addition, we'll look at how we can evaluate the performance of our classifiers with a confusion matrix.

**Names**:

**At the end of class, each student should upload this notebook to Canvas to receive participation points.**


**Ack**: Based on initial work by Chris Ketelsen, Chenhao Tan

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import helpers

### Part 1: Classifying Iris Species 
***

In this problem we'll use K-Nearest Neighbors to classify species of irises based on certain physical characteristics.  The so-called [_iris dataset_](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a popular dataset for prototyping classification algorithms. We can load the iris dataset from Scikit-Learn directly. The dataset contains four features (sepal length, sepal width, petal length, and petal width) and three classes defined by the species of iris: setosa, versicolor, and virginica. We'll only use the two sepal dimensions so that we can easily visualize the data in a 2-D plot.

Execute the following code cell to load training and validation sets for the iris data set and then plot the data.    

In [None]:
X_train, y_train, X_valid, y_valid, target_names = helpers.load_iris()
print("classes = ", target_names)
helpers.plot_iris(X_train, y_train)

**Part A: Basic description**: How many examples are in the training set?  How many examples belong to each of the three classes? Write code to print the shapes of the matrices X_train and y_train.

Hint: `X_train` and `y_train` are Numpy n-dimensional arrays. Use the `.shape` attribute (it is an attribute, not a function, so no parentheses).
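
As a quick illustration of the hint, here is a minimal sketch on a made-up array. The names `X_toy` and `y_toy` are hypothetical and not part of the notebook. It shows that `shape` is read without parentheses, and that `np.bincount` is one possible way to count examples per class, assuming the labels are non-negative integers.

```python
import numpy as np

# Hypothetical toy data: 5 examples with 2 features each, labels in {0, 1, 2}
X_toy = np.zeros((5, 2))
y_toy = np.array([0, 2, 1, 2, 2])

# shape is an attribute of the array, so it is read without parentheses
print(X_toy.shape)
print(y_toy.shape)

# Count how many examples carry each label (assumes non-negative integer labels)
class_counts = np.bincount(y_toy)
print(class_counts)
```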

In [None]:
#BEGIN: Part A

# YOUR CODE HERE

#END: Part A

**Part B: Use Sk-Learn Classifier**: Next we'll train a KNN classifier to predict iris species based on the sepal measurement features.  The KNN classifier in Scikit-Learn is called [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).  Go now and check out the documentation. Define and fit a model with $K=3$ to the training set.  The `plot_knn_boundary` function will then plot the KNN decision boundary against the data. 
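
To build some intuition for what a KNN classifier computes, here is a rough from-scratch sketch in NumPy. It is not Scikit-Learn's implementation (which may use different data structures and tie-breaking), and the function `knn_predict` and the toy arrays are hypothetical names introduced only for illustration. It assumes Euclidean distance and non-negative integer labels.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote; np.argmax returns the first maximum, so ties go to the smallest label
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3))
```

Note that there is no real "training" step here: the model simply stores the training data and does all of its work at prediction time.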

In [None]:
from sklearn.neighbors import KNeighborsClassifier


def build_model(X_train, y_train, num_neighbors: int):
    '''
    build_model
    args:
          X_train : Numpy array, shape (m,k)
          y_train : Numpy array, shape (m,)
          num_neighbors : python integer, the k in "knn"
    returns: sk-learn KNeighborsClassifier object that has been fitted
    '''
    # model = ClassifierModel()
    #BEGIN B
    raise NotImplementedError()

    #END B
    return model

In [None]:
# These tests are a sanity-check that your code meets the spec. They are not exhaustive tests.
knn = build_model(X_train, y_train, 3)

assert isinstance(knn, KNeighborsClassifier)
ex_0_prediction = knn.predict(X_train[0][None, :])
assert (ex_0_prediction == 2)

ax = helpers.plot_iris(X_train, y_train)
helpers.plot_decision_surface(X_train, knn, ax)

**Part C**: Play with the value of $K$ above.  How does the character of the decision boundary change with $K$? 
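
As a small thought experiment before you look at the plots, the sketch below uses made-up 1-D data with a single "noisy" point. The helper `knn_vote` and the arrays are hypothetical and only serve as an illustration. It queries a location right next to the noisy point for several values of $K$, so you can see how the vote changes on this toy data.

```python
import numpy as np

def knn_vote(x_train, y_train, x_query, k):
    # 1-D version of KNN: rank training points by absolute distance, then majority vote
    nearest = np.argsort(np.abs(x_train - x_query), kind="stable")[:k]
    return int(np.argmax(np.bincount(y_train[nearest])))

# Made-up 1-D data: class 0 on the left, class 1 on the right,
# plus one noisy class-1 point at x=1.5 sitting among the class-0 points
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 1.5, 6.0, 7.0, 8.0])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Query a location right next to the noisy point for several values of k
for k in (1, 3, 5):
    print("k =", k, "prediction =", knn_vote(x_toy, y_toy, 1.6, k))
```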

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(15, 15))
ax = ax.reshape(-1)

for idx, k in enumerate(range(1, 12, 2)):
    print("Building model with k={}".format(k))

    #BEGIN C
    # YOUR CODE HERE
    raise NotImplementedError()
    # Put all your code above this line.
    #END C
    # Here's how to plot multiple subplots in a clean manner
    helpers.plot_decision_surface(X_train, model, ax[idx])

**Part D**: Until this point we've been plotting the KNN decision boundary against the training data, but really we're interested in how our model does on the validation set.  Train a 1-NN classifier and plot its decision boundary against the validation data. How many points in total are misclassified?  Which species get confused with each other the most?

Write code below, then put your answers to the above questions in the "analysis" cell below the code cell.
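
If you would like to double-check a visual count, one possible approach is to compare predicted and true labels element-wise. The sketch below uses made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical names, not notebook variables).

```python
import numpy as np

# Made-up true and predicted labels for six examples
y_true_toy = np.array([0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 2, 2, 1, 2])

# Boolean mask that is True wherever the prediction is wrong
mistakes = y_true_toy != y_pred_toy
num_misclassified = int(np.sum(mistakes))

# (true label, predicted label) pairs for the wrong predictions
confused_pairs = list(zip(y_true_toy[mistakes].tolist(),
                          y_pred_toy[mistakes].tolist()))

print(num_misclassified)
print(confused_pairs)
```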

In [None]:
#BEGIN D
# YOUR CODE HERE
raise NotImplementedError()
#END D

YOUR ANSWER HERE

**Part E: Confusion Matrix**: Counting misclassified points becomes much more difficult when our data sets are very large.  One convenient method for analyzing misclassification is to construct the so-called confusion matrix. The confusion matrix is a `(# classes)` $\times$ `(# classes)` matrix such that the entry $C_{ij}$ is the number of examples with _true_ label $i$ that were predicted to have label $j$.
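
To make the definition concrete, here is a minimal sketch that builds $C$ directly from the definition on made-up labels. The function name `confusion_from_definition` and the toy arrays are hypothetical, and the sketch assumes labels are integers in `0 .. num_classes - 1`.

```python
import numpy as np

def confusion_from_definition(y_true, y_pred, num_classes):
    # C[i, j] counts examples whose true label is i and predicted label is j
    C = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        C[true_label, predicted_label] += 1
    return C

# Made-up labels for seven examples
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 2, 2, 1, 2])

C = confusion_from_definition(y_true_toy, y_pred_toy, num_classes=3)
print(C)
```

Correct predictions land on the diagonal, so every off-diagonal entry is a count of one specific kind of mistake, and the entries sum to the number of examples.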

We can compute the confusion matrix using Scikit-Learn's [confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function. Read the documentation and then fill in the missing code to compute the confusion matrix for the validation data and the 1-NN classifier.  Do the entries in $C$ agree (roughly) with the visual counts you made above? 

Step-by-step:
1. Call the `build_model` function that you wrote to create a 1-NN classifier.
2. Call the `predict` method of your classifier on the validation data, `X_valid`
3. Use the `confusion_matrix` function (see the docs) to calculate the matrix.
4. Print the matrix.

In [None]:
from sklearn.metrics import confusion_matrix


def calc_confusion_matrix(model, X_eval, y_eval):
    ''' calc_confusion_matrix
    Args:
        model : a fitted KNeighborsClassifier model. The output of build_model.
        X_eval : Numpy array, shape (m, features), data you wish to use as input to the model
        y_eval : Numpy array, shape (m,), true labels to compare against the model's predictions
    '''
    #BEGIN E
    # conf_matrix = None
    # Your code here (~2 lines) - predict on X_eval, then get conf_matrix between y_eval and your predictions
    # YOUR CODE HERE
    raise NotImplementedError()
    #END E
    return conf_matrix

In [None]:
# Sanity check on k=1 for your code
nn1_model = build_model(X_train, y_train, 1)
conf_matrix = calc_confusion_matrix(nn1_model, X_valid, y_valid)
print("Confusion on validation data, k=1")
print(conf_matrix)
assert (conf_matrix.sum() == X_valid.shape[0])

**Part F**: Vary the number of nearest neighbors used in KNN above and recompute the confusion matrix.  Describe your results. Does there seem to be a particular setting that works better than the others for the validation data?
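
One possible way to compare settings with a single number is overall accuracy, which can be read off a confusion matrix as the diagonal sum divided by the total. The sketch below uses two made-up matrices purely for illustration. They are not results from the iris data, and `accuracy_from_confusion` is a hypothetical helper.

```python
import numpy as np

def accuracy_from_confusion(conf_matrix):
    # Correct predictions lie on the diagonal; all entries together count every example
    return np.trace(conf_matrix) / conf_matrix.sum()

# Made-up confusion matrices keyed by a hypothetical value of k
toy_matrices = {
    1: np.array([[5, 0, 0], [0, 3, 2], [0, 2, 3]]),
    3: np.array([[5, 0, 0], [0, 4, 1], [0, 1, 4]]),
}

for k, cm in toy_matrices.items():
    print("k = {}: accuracy = {:.3f}".format(k, accuracy_from_confusion(cm)))
```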

In [None]:
for idx, k in enumerate(range(1, 12, 2)):
    print("Building model with k={}".format(k))
    #BEGIN F
    # YOUR CODE HERE
    raise NotImplementedError()
    #END F

**Part G: Calculate per-class misclassification rate**: Now that we have a confusion matrix, we would like to get a summary of our performance for each class. One way to do this is to calculate the misclassification rate per class. Remember that the rows correspond to the true labels of the validation data. The sum across the first row is the total number of validation examples in the first class.

Now, write a function to calculate the per-class misclassification rate. It should return a single 3-element vector.

HINT: If you have trouble with this, be sure to do the Numpy review as described in Homework 0. There are any number of ways to do this, but we used one call to `.sum()`, one call to `np.diag()`, and one division operation on the matrix.
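
As a worked illustration under the definition of $C_{ij}$ given in Part E, the misclassification rate for class $i$ can be written as one minus the fraction of row $i$ that lies on the diagonal:

$$
m_i = 1 - \frac{C_{ii}}{\sum_{j} C_{ij}}
$$

For a made-up confusion matrix (not taken from the iris data), this gives:

$$
C = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}
\quad\Rightarrow\quad
m = \left(1 - \tfrac{2}{2},\; 1 - \tfrac{1}{2},\; 1 - \tfrac{2}{3}\right)
  = \left(0,\; \tfrac{1}{2},\; \tfrac{1}{3}\right)
$$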

In [None]:
def get_per_class_miss_rate(conf_matrix):
    '''
    get_per_class_miss_rate
    Args:
        conf_matrix - an MxM Numpy array, the output of calc_confusion_matrix
    '''
    per_class_miss_rate = np.zeros((3,))
    #BEGIN G
    # YOUR CODE HERE
    raise NotImplementedError()
    #END G

    return per_class_miss_rate

In [None]:
# Sanity check your code:
nn_model = build_model(X_train, y_train, 3)
cm = calc_confusion_matrix(nn_model, X_valid, y_valid)
miss_rates = get_per_class_miss_rate(cm)
print(miss_rates)
assert (np.allclose(miss_rates, [0.0, 0.4285, 0.3333], atol=1e-3))

In [None]:
# Print the per-class error rates for different values of k
for idx, k in enumerate(range(1, 12, 2)):
    print("Building model with k={}".format(k))
    knn = build_model(X_train, y_train, k)
    cm = calc_confusion_matrix(knn, X_valid, y_valid)
    pcmr = get_per_class_miss_rate(cm)
    print("Per-class misclassification rate: {}".format(pcmr))