## Let's walk through a k-Nearest Neighbors Classifier and find the best value for k.  

## In this Kaggle notebook, I'm testing my ability to run a k-NN classifier on a new dataset.



In [None]:
# Import some libraries. We probably won't need all of these, but it doesn't hurt to have them ready.
%matplotlib inline
import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from IPython.display import Image
import sys
from sklearn import datasets
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier 



In [None]:
# load the data
digits = datasets.load_digits()

In [None]:
# Look at the available keys, then print the dataset description
print(digits.keys())
print(digits.DESCR)

In [None]:
# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)
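The two shapes printed above describe the same pixels in two layouts: `images` holds each digit as an 8x8 grid, while `data` holds each digit as one row of 64 values. The sketch below checks that relationship. The name `flattened` is introduced here only for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a single row of 64 pixel values.
# `flattened` is an illustrative name, not one used elsewhere in the notebook.
flattened = digits.images.reshape(len(digits.images), -1)

print(flattened.shape)
# Compare the flattened images with the feature matrix the classifier will use.
print(np.array_equal(flattened, digits.data))
```

This is why `digits.data` can be passed straight to the classifier: every image is already a flat feature vector.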


In [None]:
# Display the image stored at index 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')


In [None]:
# Create feature and target arrays
X = digits.data
y = digits.target


### Let's split the data into training and testing sets.  Kaggle competitions often provide this split for us, but it is good to know how to do it on our own in the real world.

In [None]:
# Split into training and test sets: hold out 20% for testing,
# and stratify on y so both sets keep the same mix of digit classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
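To see what `stratify=y` buys us, here is a small sketch that compares the share of each digit in the full data, the training set, and the test set. The names `full_share`, `train_share`, and `test_share` are made up for this check and do not appear in the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of samples belonging to each digit class (0-9) in each set.
# These three names are illustrative only.
full_share = np.bincount(y) / len(y)
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)

print(len(X_train), len(X_test))
for digit in range(10):
    print(digit, round(full_share[digit], 3), round(train_share[digit], 3), round(test_share[digit], 3))
```

With stratification the three columns should be nearly identical, so no digit ends up over- or under-represented in the test set.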


### Fit the k-NN classifier with 7 neighbors.  You can choose another number here.

### We'll go through some steps below to determine which number might work best for this data.  For now we will use 7 to see what happens.  

In [None]:
# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)


In [None]:
# Fit the classifier to the training data
knn.fit(X_train, y_train)


In [None]:
# Print the accuracy
print(knn.score(X_test, y_test))
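For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below recomputes that number by hand so it is clear what is being measured. The names `y_pred` and `manual_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target
)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict a label for every test image, then count the fraction that match.
y_pred = knn.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

print(manual_accuracy)
print(knn.score(X_test, y_test))
```

Both lines should print the same value.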


### Not bad! About 98 percent accuracy on the test set.

In [None]:
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))


## Finding a good k value

### Let's now compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you can develop an intuition for overfitting and underfitting a model.
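One way to build that intuition is to look at the two ends of the range. With k = 1, every training point is its own nearest neighbor, so training accuracy is close to perfect no matter how well the model generalizes. The sketch below prints the gap between training and testing accuracy for k = 1 and k = 8. The dictionary `gap_by_k` is a hypothetical helper, not something from the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target
)

# Hypothetical helper: maps k -> (training accuracy, testing accuracy).
gap_by_k = {}
for k in (1, 8):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    gap_by_k[k] = (train_acc, test_acc)
    print(f"k={k}: train={train_acc:.4f}, test={test_acc:.4f}, gap={train_acc - test_acc:.4f}")
```

A large gap between training and testing accuracy hints at overfitting, while low accuracy on both hints at underfitting.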

#### Let's now streamline some of the steps we did above into a for loop. 

In [None]:
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)



In [None]:
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()


As we can see from the plot, the test accuracy appears to be highest when using 3 and 5 neighbors.  7 isn't too bad, but using 8 or more neighbors seems to result in a simpler model that underfits the data, although 8 is the largest value tested here.
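Picking k by looking at test accuracy means the test set has influenced the model choice. A common alternative, which is not what this notebook does, is to choose k with cross-validation on the training set and only then score once on the test set. Here is a minimal sketch of that idea, assuming 5 folds. The names `cv_accuracy`, `best_k`, and `final_knn` are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target
)

neighbors = np.arange(1, 9)
cv_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data (5 is an assumed choice).
    cv_accuracy[i] = cross_val_score(knn, X_train, y_train, cv=5).mean()

# Pick the k with the highest cross-validated accuracy (first one wins on ties).
best_k = int(neighbors[np.argmax(cv_accuracy)])

# Refit on the full training set and evaluate once on the held-out test set.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_knn.score(X_test, y_test))
```

The k chosen this way may or may not match the one the plot suggests. The point is only that the test set stays untouched until the final score.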