# $k$NN Classifiers

In class, we saw that using $k$NN density estimation leads to a simple rule for classification: draw a ball of radius $r_k(x)$ around $x$, and return the label that occurs most often within the ball. This notebook will implement this simple idea.
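
Written out, the rule reads roughly as follows. The notation is an assumption introduced here rather than taken from the lecture: the training data are pairs $(x_i, y_i)$ for $i = 1, \dots, n$, and $r_k(x)$ is the distance from $x$ to its $k$-th nearest training point.

$$
\hat{y}(x) = \arg\max_{c} \; \bigl|\{\, i : \lVert x_i - x \rVert \le r_k(x),\ y_i = c \,\}\bigr|
$$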

In [None]:
import numpy as np
import sklearn.datasets
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (8,8)

## Moon data

`sklearn` comes with some data out of the box. The following code generates two crescent moons, with some noise:

In [None]:
moons_features_array, moons_labels = sklearn.datasets.make_moons(200, noise=.3)

`moons_features_array` is a 2-d NumPy array. You have experience working with 1-d NumPy arrays from DSC 10, but maybe not so much practice with 2-d arrays, so I'll go ahead and convert it to a list of 1-d arrays right off the bat:

In [None]:
moons_features = list(moons_features_array)
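
As a quick illustration of what this conversion does, here is a small sketch on a made-up array (`toy_array` is a hypothetical stand-in for `moons_features_array`). Calling `list` on a 2-d array iterates over its first axis, so it should give one 1-d array per point:

```python
import numpy as np

# hypothetical stand-in for moons_features_array: 3 points, 2 features each
toy_array = np.array([[0.0, 1.0],
                      [2.0, 3.0],
                      [4.0, 5.0]])

# list() iterates over the first axis, so we get one 1-d array per point
toy_points = list(toy_array)

print(len(toy_points))      # number of points
print(toy_points[1])        # the second point, a 1-d array
print(toy_points[1].shape)  # shape of a single point
```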

`moons_labels` is a 1-d array of labels (either 0 or 1), telling us which crescent moon the point belongs to.

In [None]:
moons_labels

Let's take a look at the data:

In [None]:
moons_x_1, moons_x_2 = moons_features_array.T

In [None]:
plt.scatter(moons_x_1, moons_x_2, c=moons_labels)

A yellow point has label 1, while a purple point has label 0.

## The Classifier

Now we write a function which will take in a point $z$, along with the data, and return a predicted label. The first thing the function must do is find all points which are within radius $r_k(z)$ of $z$; that is, it finds the $k$ closest points.

In [None]:
def k_closest_points(z, features, k=3):
    """Find the k closest points to z in the features.
    
    Returns a list of pairs. Each pair contains:
    
        (distance to z, index of point)
    
    """
    # find the distance from z to every point
    distances = []
    for ix, x in enumerate(features):
        distance = np.sum((x - z)**2)**(1/2)
        distances.append((distance, ix))
    
    return sorted(distances)[:k]
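
To see the recipe on numbers that can be checked by hand, here is a minimal sketch on a made-up dataset. The names `toy_features`, `pairs`, and `r_k` are hypothetical, and `np.linalg.norm` is used as a stand-in for the explicit square root of the sum of squares:

```python
import numpy as np

# hypothetical toy data: four points on the x-axis
toy_features = [np.array([1.0, 0.0]),
                np.array([3.0, 0.0]),
                np.array([2.0, 0.0]),
                np.array([10.0, 0.0])]
z = np.array([0.0, 0.0])
k = 3

# same recipe as above: (distance, index) pairs, sorted by distance
pairs = sorted((float(np.linalg.norm(x - z)), ix) for ix, x in enumerate(toy_features))
closest = pairs[:k]

# r_k(z) is the distance from z to its k-th closest point
r_k = closest[-1][0]

print(closest)
print(r_k)
```

Here the three closest points are at distances 1, 2, and 3, so $r_k(z) = 3$ and the point at distance 10 falls outside the ball.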

Let's check that this function does what we expect. The code below plots the point $z$ in red, a circle of radius $r_k(z)$ around $z$, and all points within the circle in orange. Try changing $z$ and $k$ and see what happens.

In [None]:
z = [0, 0]
k = 6

plt.scatter(moons_x_1, moons_x_2)
plt.scatter(*z, color='red')

closest = k_closest_points(z, moons_features, k=k)

for _, ix in closest:
    x, y = moons_features[ix]
    plt.scatter(x, y, color='orange')
    
r = closest[-1][0]
circle = plt.Circle(z, r, fill=False)
plt.gca().add_artist(circle)
plt.gca().set_aspect('equal')

Now we can write our classifier function. It should return the label that is found most frequently within the circle. Because the labels are either 0 or 1 here, this amounts to summing up the labels and returning 1 if the sum is greater than $k/2$. Try changing $z$ and $k$ and see what happens.

In [None]:
def knn_classify(z, features, labels, k=3):
    closest = k_closest_points(z, features, k)
    votes = [labels[ix] for _,ix in closest]
    return int(sum(votes) > k/2)
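
One consequence of the strict inequality is worth checking: when $k$ is even and the votes are split evenly, the rule predicts 0. The sketch below applies the same comparison to two made-up vote lists (`tied_votes` and `majority_votes` are hypothetical):

```python
# hypothetical vote lists for k = 6 neighbors with labels in {0, 1}
k = 6
tied_votes = [1, 1, 1, 0, 0, 0]
majority_votes = [1, 1, 1, 1, 0, 0]

# same rule as knn_classify: predict 1 only if strictly more than half vote 1
tied_prediction = int(sum(tied_votes) > k / 2)
majority_prediction = int(sum(majority_votes) > k / 2)

print(tied_prediction)
print(majority_prediction)
```

Choosing an odd $k$ avoids ties altogether in the two-class case.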

In [None]:
z = [0, 1]
k = 6

plt.scatter(moons_x_1, moons_x_2, c=moons_labels)
plt.scatter(*z, color='red')

closest = k_closest_points(z, moons_features, k=k)
    
r = closest[-1][0]
circle = plt.Circle(z, r, fill=False)
plt.gca().add_artist(circle)
plt.gca().set_aspect('equal')

prediction = knn_classify(z, moons_features, moons_labels, k=k)
print('Prediction:', prediction)

## MNIST data

Now let's look at a slightly larger and more interesting dataset: the MNIST dataset of handwritten digits. We'll use this as an opportunity to re-write the $k$NN classifier "correctly", using fast NumPy functions.

In [None]:
mnist_data = np.load('mnist.npz')
mnist_train_features = mnist_data['train'].T.astype(float)
mnist_train_labels = mnist_data['train_labels'].flatten()
mnist_test_features = mnist_data['test'].T.astype(float)
mnist_test_labels = mnist_data['test_labels'].flatten()

Our data is now in a $60,000 \times 784$ array. There are 60,000 examples, each being a 784-dimensional vector.

In [None]:
mnist_train_features.shape

In [None]:
mnist_train_features[0]

Each of these vectors is actually a 28x28 image, "flattened" into a vector. We can reshape and visualize it:

In [None]:
plt.imshow(mnist_train_features[33_000].reshape(28, -1), cmap='gray')
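
The `-1` in `reshape(28, -1)` asks NumPy to infer the second dimension from the length of the vector. A small sketch, using a hypothetical array `flat` in place of a real image:

```python
import numpy as np

# hypothetical stand-in for one flattened MNIST image: 784 values
flat = np.arange(784, dtype=float)

# -1 lets NumPy infer the remaining dimension: 784 / 28 = 28
image = flat.reshape(28, -1)

print(image.shape)
# flattening row by row recovers the original vector
print(np.array_equal(image.flatten(), flat))
```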

## The (fast) classifier

Now we will re-write `k_closest_points` and `knn_classify` from earlier, but faster. The new classifier is named `knn_classifier`.

In [None]:
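import scipy.spatial

def k_closest_points(z, features, k=3):
    """Return the indices of the k points in features closest to z (in no particular order)."""
    # distance_matrix lives in scipy.spatial; here it returns a 1 x n matrix
    distances = scipy.spatial.distance_matrix([z], features).flatten()
    # partitioning at k - 1 puts the k smallest distances first, even when k == len(features)
    return np.argpartition(distances, k - 1)[:k]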
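
Unlike the sorted list returned by the first version, `np.argpartition` only guarantees that the $k$ smallest distances come first; it does not order them. If nearest-first order is needed, one option is to sort just those $k$ indices. The sketch below uses made-up distances and partitions at `k - 1`, which also selects the $k$ smallest entries:

```python
import numpy as np

# hypothetical distances from a query point to six training points
distances = np.array([5.0, 1.0, 4.0, 0.5, 3.0, 2.0])
k = 3

# indices of the k smallest distances, in no particular order
unordered = np.argpartition(distances, k - 1)[:k]

# sorting just those k indices by distance recovers nearest-first order
nearest_first = unordered[np.argsort(distances[unordered])]

print(sorted(unordered.tolist()))
print(nearest_first.tolist())
```

Sorting only $k$ entries keeps the cost small compared with sorting all of the distances.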

Let's try it out. We saw that vector #33,000 is a five. What are its closest neighbors?

In [None]:
closest = k_closest_points(mnist_train_features[33_000], mnist_train_features, k=7)
for ix in closest:
    plt.figure()
    plt.imshow(mnist_train_features[ix].reshape(28, -1), cmap='gray')

Next, we re-write the classification function to use fast NumPy functions:

In [None]:
def knn_classifier(z, features, labels, k=5):
    closest_ix = k_closest_points(z, features, k)
    closest_labels = labels[closest_ix]
    values, counts = np.unique(closest_labels, return_counts=True)
    return values[np.argmax(counts)]
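
To see how the vote is counted, here is the same `np.unique` / `np.argmax` combination applied to a made-up set of neighbor labels (`closest_labels` below is hypothetical):

```python
import numpy as np

# hypothetical labels of the k = 5 nearest neighbors of some digit
closest_labels = np.array([5, 3, 5, 8, 5])

# values holds the distinct labels in sorted order; counts[i] is how often values[i] occurs
values, counts = np.unique(closest_labels, return_counts=True)

# the predicted label is the one with the largest count
prediction = values[np.argmax(counts)]

print(values)
print(counts)
print(prediction)
```

Because `np.argmax` returns the first maximum and `values` is sorted, a tie is broken in favor of the smallest label.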

Let's try it out on unseen data:

In [None]:
ix = 6500

plt.imshow(mnist_test_features[ix].reshape(-1, 28), cmap='gray')

prediction = knn_classifier(
    mnist_test_features[ix], 
    mnist_train_features, 
    mnist_train_labels
)

print('Prediction:', prediction)

We now run the classifier on 50 random unseen examples (the default value of `trials`). What fraction does it get right?

In [None]:
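def estimate_knn_classification_error(train_features, train_labels, test_features, test_labels, k=3, trials=50):
    """Classify `trials` random test points and return the fraction classified correctly.

    Despite the name, the returned value is the accuracy, not the error rate.
    """
    correct = 0
    for i in range(trials):
        # pick a random test example (sampled with replacement)
        ix = np.random.randint(len(test_features))
        prediction = knn_classifier(
            test_features[ix], 
            train_features, 
            train_labels,
            k=k  # pass k through; otherwise knn_classifier silently uses its own default
        )
        if prediction == test_labels[ix]:
            correct += 1
    return correct / trials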

In [None]:
estimate_knn_classification_error(
    mnist_train_features, 
    mnist_train_labels, 
    mnist_test_features, 
    mnist_test_labels
)
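
The function above draws test indices with `np.random.randint`, so the same test point can be picked more than once. A possible alternative, sketched here with hypothetical sizes and an arbitrary seed, is to draw distinct indices with a NumPy random generator:

```python
import numpy as np

# hypothetical sizes: 10,000 test points, 50 trials
n_test = 10_000
trials = 50

# a seeded generator makes the estimate reproducible; the seed is arbitrary
rng = np.random.default_rng(0)

# replace=False guarantees that no test index is drawn twice
test_indices = rng.choice(n_test, size=trials, replace=False)

print(len(test_indices), len(set(test_indices.tolist())))
```

With only a few dozen trials out of thousands of test points, repeats are rare either way, so this mostly matters for reproducibility.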

## Adding noisy dimensions

The MNIST data is nice in two ways: it has very little noise, and there aren't spurious dimensions to "confuse" our classifier. Let's add a bunch of noisy dimensions and see how our $k$NN classifier performs.

In [None]:
NUMBER_OF_NEW_ROWS = 28*3
NOISE_MU = 200
NOISE_SIGMA = 50

In [None]:
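def add_noisy_dimensions(data):
    """Append NUMBER_OF_NEW_ROWS rows of 28 pixels to each image, then add noise."""
    # pad each example on the right with zero-valued columns (the new dimensions)
    noisy_data = np.pad(data, [[0, 0], [0, NUMBER_OF_NEW_ROWS * 28]], 'constant')
    # note: the noise is added to every column, original pixels included
    noisy_data += np.random.normal(NOISE_MU, NOISE_SIGMA, noisy_data.shape)
    # keep values in the valid pixel range
    return np.clip(noisy_data, 0, 255)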
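
The pad widths passed to `np.pad` are given per axis as `[before, after]`. The sketch below, on a hypothetical 2-by-3 array, shows that `[[0, 0], [0, 4]]` leaves the number of examples unchanged and appends 4 zero-valued columns on the right:

```python
import numpy as np

# hypothetical toy data: 2 examples with 3 features each
toy = np.arange(6, dtype=float).reshape(2, 3)

# axis 0: no padding; axis 1: 0 columns before, 4 columns after
padded = np.pad(toy, [[0, 0], [0, 4]], 'constant')

print(padded.shape)
print(padded)
```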

In [None]:
noisy_train_features = add_noisy_dimensions(mnist_train_features)
noisy_test_features = add_noisy_dimensions(mnist_test_features)

In [None]:
plt.imshow(noisy_train_features[33_000].reshape(-1, 28))

How does adding noise affect the nearest neighbors?

In [None]:
for ix in k_closest_points(noisy_train_features[33_000], noisy_train_features, k=7):
    plt.figure()
    plt.imshow(noisy_train_features[ix].reshape(-1, 28))
    plt.title(f'Label: {mnist_train_labels[ix]}')

The additional noisy dimensions must adversely affect the accuracy...

In [None]:
estimate_knn_classification_error(
    noisy_train_features, 
    mnist_train_labels, 
    noisy_test_features, 
    mnist_test_labels,
    trials=20
)

**Question**: How can we remove extra noisy dimensions from the data?

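The last part of this notebook uses the forest cover type data in `covtype.npz`, which has the following columns (index: name / type / unit / description):

- 0: Elevation / quantitative / meters / Elevation in meters 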
- 1: Aspect / quantitative / azimuth / Aspect in degrees azimuth 
- 2: Slope / quantitative / degrees / Slope in degrees 
- 3: Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features 
- 4: Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features 
- 5: Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway 
- 6: Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice 
- 7: Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice 
- 8: Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice 
- 9: Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points 
- 10-13: Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
- 14-53: Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation 
- 54: Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation

In [None]:
data = np.load('covtype.npz')['data']

In [None]:
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

In [None]:
cov = np.cov(standardized.T)
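
Since every column was standardized first, this covariance matrix is essentially the correlation matrix of the original columns. The sketch below checks that on made-up data (`toy` and the seed are hypothetical). The two differ only by a factor of $n/(n-1)$, because `std()` divides by $n$ while `np.cov` divides by $n - 1$ by default:

```python
import numpy as np

# hypothetical toy data: 200 examples, 3 features, the third correlated with the first
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 3))
toy[:, 2] += 2 * toy[:, 0]

standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# np.cov expects one variable per row, hence the transpose
cov = np.cov(standardized.T)
corr = np.corrcoef(toy.T)

# the covariance of standardized data equals the correlation, up to n / (n - 1)
n = len(toy)
print(np.allclose(cov, corr * n / (n - 1)))
```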

In [None]:
cov[np.diag_indices_from(cov)] = 0

In [None]:
K = 10
plt.matshow(np.abs(cov[:K,:K]))

In [None]:
plt.matshow(np.abs(cov))