## k-Nearest Neighbors

Adapted from Killian W; rewritten to use sklearn, with multiple other changes (PG, Sep 2019).

Data is available as [digits.npy](https://courses.cit.cornell.edu/info3950_2023sp/digits.npy)
and [faces.npy](https://courses.cit.cornell.edu/info3950_2023sp/faces.npy)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def plotimage(M, c=2): # plot image M with a colored box around it; c selects the color: 0=red, 1=green, 2=blue
    B = np.zeros((ydim+2, xdim+2)) #with boundary
    B[1:-1,1:-1] = M.reshape(ydim, xdim) #fill interior
    color = ['#FF0000','#00FF00','#0000FF'][int(c)] # R, G, B
    plt.hlines([0,ydim+1],0,xdim+1, color=color, lw=3)
    plt.vlines([0,xdim+1],0,ydim+1, color=color, lw=3)
    plt.imshow(B, cmap='gray', interpolation='none')
    plt.axis('off')
    
def onclick(event):
    # each click shows the next four test images, one per row, with their three nearest training neighbors
    global who
    preds = knn3.predict(X_test[who:who+4]) == y_test[who:who+4] #true or false for the four
    nbrs = knn3.kneighbors(X_test[who:who+4])[1] #just the neighbors, not distances
    
    for i in range(4): #rows
        plt.subplot(4,4, 4*i+1) #first column
        plotimage(X_test[who], 2 if preds[i] else 0)  #blue box if classified correctly, red if not
        if i==0: plt.title('TEST')
        
        for j in range(3): #second through fourth columns
            plt.subplot(4,4,4*i+j+2) 
            plotimage(X_train[nbrs[i,j]], y_train[nbrs[i,j]]==y_test[who]) #green if the neighbor's label matches the test label, else red
            if i==0: plt.title('{}-NN'.format(j+1))
        who+=1 #advance to the next test image

In [None]:
#X_train, y_train, X_test, y_test, xdim, ydim = np.load('faces.npy', allow_pickle=True)
X_train, y_train, X_test, y_test, xdim, ydim = np.load('digits.npy', allow_pickle=True) #first try digits data
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
knn3

There are 7291 training images and 2007 test images in the digits data, each 16x16=256 pixels.<br>
Pixel values range from -1 to 1.<br>
Each is labelled by the digit 0-9.
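
To make concrete what `knn3.predict` does with these 256-pixel vectors, here is a minimal brute-force sketch of 3-nearest-neighbor classification in plain NumPy. It assumes Euclidean distance and an unweighted majority vote, which match the `KNeighborsClassifier` defaults (`metric='minkowski'`, `p=2`, `weights='uniform'`). The function name `knn_predict` and the synthetic data are illustrative stand-ins, not part of this notebook or of `digits.npy`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force k-NN: Euclidean distance, then majority vote among the k closest."""
    # squared distances between every query row and every training row
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # indices of the k nearest training points, closest first
    nbrs = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nbrs]  # labels of those neighbors, shape (n_query, k)
    # majority vote; bincount(...).argmax() breaks ties toward the smaller label
    preds = np.array([np.bincount(row).argmax() for row in votes])
    return preds, nbrs

# synthetic stand-in for the digits data: 256 "pixels" in [-1, 1], labels 0-9
rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(10, 256))   # one prototype per class (assumption)
y_tr = np.repeat(np.arange(10), 20)            # 200 training labels
X_tr = np.clip(centers[y_tr] + 0.3 * rng.normal(size=(200, 256)), -1, 1)
y_te = np.repeat(np.arange(10), 5)             # 50 test labels
X_te = np.clip(centers[y_te] + 0.3 * rng.normal(size=(50, 256)), -1, 1)

preds, nbrs = knn_predict(X_tr, y_tr, X_te, k=3)
print('accuracy on synthetic data:', (preds == y_te).mean())
```

The returned `nbrs` array plays the same role as the second output of `knn3.kneighbors(...)` used in `onclick`: row `i` lists the training indices of the nearest neighbors of query `i`.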

In [None]:
# might need this to suppress warnings about scipy mode change to keepdims
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning) # ignore future warnings

In [None]:
%matplotlib notebook 
who=0
fig=plt.figure(figsize=(6,6))
cid = fig.canvas.mpl_connect('button_press_event', onclick)
onclick(None)
print('Click to cycle through test images (test image: blue=classified correctly, red=misclassified; neighbors: green=same label as test image, red=different)')

In [None]:
X_train, y_train, X_test, y_test, xdim, ydim = np.load('faces.npy', allow_pickle=True) #now try faces
#X_train, y_train, X_test, y_test, xdim, ydim = np.load('digits.npy', allow_pickle=True)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

knn3 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
knn3

There are 280 training images and 120 test images in the face data, each 31x38=1178 pixels.<br>
Pixels range in value from 0 to 245.<br>
The data consist of 10 images of each of 40 subjects, with 7 images per subject in the training set and the other 3 in the test set.
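
As a rough illustration of that 7/3 per-subject split, the sketch below builds one from scratch on random stand-in data. How `faces.npy` was actually split is not documented here, so the random shuffling, the 0-39 subject labels, and all variable names (`images`, `labels`, `Xf_train`, ...) are assumptions for illustration only.

```python
import numpy as np

n_subjects, n_per_subject, n_train_each = 40, 10, 7

# hypothetical stand-in: random "images" of 31x38 = 1178 pixels, grouped by subject
rng = np.random.default_rng(0)
images = rng.integers(0, 246, size=(n_subjects * n_per_subject, 31 * 38))
labels = np.repeat(np.arange(n_subjects), n_per_subject)

train_idx, test_idx = [], []
for s in range(n_subjects):
    idx = rng.permutation(np.flatnonzero(labels == s))  # shuffle this subject's 10 images
    train_idx.extend(idx[:n_train_each])                # 7 go to the training set
    test_idx.extend(idx[n_train_each:])                 # the other 3 go to the test set

Xf_train, yf_train = images[train_idx], labels[train_idx]
Xf_test, yf_test = images[test_idx], labels[test_idx]
print(Xf_train.shape, Xf_test.shape)
```

Splitting within each subject, rather than over the pooled 400 images, guarantees that every subject appears in both sets, so each test face has training images of the same person available as potential neighbors.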

In [None]:
%matplotlib notebook 
who=0
fig=plt.figure(figsize=(6,6))
cid = fig.canvas.mpl_connect('button_press_event', onclick)
onclick(None)
print('Click to cycle through test images (test image: blue=classified correctly, red=misclassified; neighbors: green=same label as test image, red=different)')

Notes on the data: The underlying 16x16 digit data can be found here:<br>
https://cs.nyu.edu/~roweis/data/usps_16x16.mat<br>
as linked from here:<br>
https://cs.nyu.edu/~roweis/data/_old_list<br>
with the description:
<blockquote>
Handwritten digits from US post office, "0" through "9".<br>
Two cell arrays, one for training, one for testing. d{1} is "one"... d{9} is "nine"...d{10} is "zero".<br>
7291 train cases, 2007 test cases<br>
Warning: test data comes from totally different distribution than training data. Use at your own risk.
</blockquote>

The face images are originally from:<br>
https://cam-orl.co.uk/facedatabase.html<br>
with the description:<br>
<blockquote>
Our Database of Faces, (formerly 'The ORL Database of Faces'), contains a set of face images taken between April 1992 and April 1994 at the lab. The database was used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department.<br>
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). A [preview image](https://cam-orl.co.uk/facesataglance.html) of the Database of Faces is available.<br>
... The size of each image is 92x112 pixels, with 256 grey levels per pixel.
</blockquote>
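
The face images used in this notebook are 31x38, about a third of the original 92x112 resolution in each direction. One simple way to get exactly that size is to keep every third row and column, as sketched below. Whether `faces.npy` was actually produced this way is an assumption; the array `full` is a random stand-in for one original image.

```python
import numpy as np

# hypothetical stand-in for one original image: 112 rows x 92 columns, 256 grey levels
rng = np.random.default_rng(0)
full = rng.integers(0, 256, size=(112, 92), dtype=np.uint8)

small = full[::3, ::3]          # keep every third row and every third column
print(small.shape, small.size)  # rows x columns, and total pixel count

flat = small.reshape(-1)        # flattened pixel vector, one row of a data matrix
```

Averaging over 3x3 blocks instead of subsampling would reduce aliasing, but it needs padding or cropping because neither 92 nor 112 is a multiple of 3.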

In [None]:
%matplotlib inline
plt.figure(figsize=(12,4))
t=4 #subject 4
for i,M in enumerate(X_train[(y_train==t).flatten()]): #the 7 training images, boxed in blue
    plt.subplot(2,5,i+1)
    plotimage(M)
for i,M in enumerate(X_test[(y_test==t).flatten()]): #the 3 test images, boxed in green, in slots 8-10
    plt.subplot(2,5,i+8)
    plotimage(M,1)

In [None]:
plt.figure(figsize=(6,6)) #enlarged to see pixels
plt.imshow(X_train[266].reshape(ydim, xdim), cmap='gray', interpolation='none') #ydim=38 rows, xdim=31 columns
plt.axis('off');

In [None]:
#same image at original 92x112 pixel resolution, with 256 grey levels per pixel
from IPython.display import Image
Image('s8.6.png', width=265)