# Nearest Neighbor Classification - manual implementation

### Import

In [None]:
import numpy as np

### Data

In this homework notebook we use **nearest neighbor classification** to classify back injuries for patients in a hospital, based on measurements of the shape and orientation of their pelvis and spine.

The data set contains information from **310** patients. For each patient, there are six measurements (the x) and a label (the y). The label has **3** possible values: `'NO'` (normal), `'DH'` (herniated disk), or `'SL'` (spondylolisthesis).

We now load the dataset. We divide the data into a training set of 248 patients and a separate test set of 62 patients. The following arrays are created:

* **`trainx`** : The training data's features, one point per row.
* **`trainy`** : The training data's labels.
* **`testx`** : The test data's features, one point per row.
* **`testy`** : The test data's labels.

We will use the training set (`trainx` and `trainy`), with nearest neighbor classification, to predict labels for the test data (`testx`). We will then compare these predictions with the correct labels, `testy`.

Note that the three labels are coded as `0. = 'NO', 1. = 'DH', 2. = 'SL'`.
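As a small illustration of this coding, the sketch below maps label strings to their numeric codes and back. The names `label_names`, `label_to_code`, and `sample_labels` are hypothetical and the sample labels are made up; only the mapping itself comes from the text above.

```python
# Hypothetical helper names; the codes follow the mapping stated above.
label_names = ['NO', 'DH', 'SL']
label_to_code = {name: float(code) for code, name in enumerate(label_names)}

# Made-up labels, for illustration only (not taken from the data set)
sample_labels = ['DH', 'SL', 'NO', 'SL']

codes = [label_to_code[name] for name in sample_labels]   # strings -> numeric codes
decoded = [label_names[int(c)] for c in codes]            # numeric codes -> strings

print(codes)
print(decoded)
```

Storing the labels as floats keeps them in the same numeric array as the six measurements, which is why the codes appear as `0.`, `1.`, `2.`.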

In [None]:
data_file = !find ../.. | grep -i column_3C.dat
data_file[0]

In [None]:
!head ../../_data/column_3C.dat

In [None]:
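# Load data set and code labels as 0 = 'NO', 1 = 'DH', 2 = 'SL'
labels = ['NO', 'DH', 'SL']
# Depending on the NumPy version, the converter receives bytes or str, so handle both
to_code = lambda s: labels.index(s.decode() if isinstance(s, bytes) else s)
data = np.loadtxt(data_file[0], converters={6: to_code})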

In [None]:
# Separate features from labels
x = data[:, 0:6]
y = data[:, 6]

### Train-test split

In [None]:
training_indices = list(range(0,20)) + list(range(40,188)) + list(range(230,310))
test_indices = list(range(20,40)) + list(range(188,230))

trainx = x[training_indices,:]
trainy = y[training_indices]
testx = x[test_indices,:]
testy = y[test_indices]

### KNN distance metrics

To compute nearest neighbors in our data set, we first need to be able to compute distances between data points.
A natural distance function is the _Euclidean distance_, also known as the _L2 distance_: for two vectors $x, y \in \mathbb{R}^d$, it is defined as
$$\|x - y\|_2 = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}.$$

Another distance metric is the _Manhattan distance_ (or _taxicab distance_), also known as the _L1 distance_:
$$\|x - y\|_1 = \sum_{i=1}^d |x_i - y_i|.$$
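As a quick worked example of the two definitions, the sketch below computes both distances for a pair of made-up 2-dimensional vectors and cross-checks them against `np.linalg.norm`. The names `a`, `b`, `l1`, and `l2` are illustrative only and are chosen so they do not clash with the notebook's `x` and `y`.

```python
import numpy as np

# Two made-up points in R^2, for illustration only
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

l1 = np.sum(np.abs(a - b))          # |1 - 4| + |2 - 6| = 3 + 4 = 7
l2 = np.sqrt(np.sum((a - b) ** 2))  # sqrt(3^2 + 4^2) = sqrt(25) = 5

# Cross-check against NumPy's built-in vector norms
assert np.isclose(l1, np.linalg.norm(a - b, ord=1))
assert np.isclose(l2, np.linalg.norm(a - b, ord=2))

print(l1, l2)
```

The L1 distance is never smaller than the L2 distance for the same pair of points, so the two metrics can rank candidate neighbors differently.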

In [None]:
def L1(x, y):
    """L1 (Manhattan) distance.
    :param x, y: vectors x, y
    :returns: distance"""
    return np.sum(np.abs(x-y))

In [None]:
def L2(x, y):
    """L2 (Euclidean) distance.
    :param x, y: vectors x, y
    :returns: distance"""
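    # Take the square root so the value matches the Euclidean definition;
    # the nearest neighbor is unchanged because sqrt is monotonic.
    return np.sqrt(np.sum(np.square(x-y)))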

In [None]:
def find_NN(trainx, samplex, norm):
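    '''Compute distances from samplex to every training point using norm.
    :param trainx: training features, one point per row
    :param samplex: vector x
    :param norm: distance function, e.g. L1 or L2
    :returns: the index of its nearest neighbor in trainx'''
    distances = [norm(samplex, trainx[i]) for i in range(len(trainx))]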
    return np.argmin(distances)

In [None]:
def NN_classifier(trainx, trainy, samplex, norm):
    """Find the nearest neighbor of samplex and return its class.
    :param samplex: vector x
    :returns: predicted class
    """
    index = find_NN(trainx, samplex, norm)
    return trainy[index]

In [None]:
def NN(trainx, trainy, testx, norm=L2):
    """Nearest Neighbor Classifier
    :params: trainx, trainy, testx, norm=L2
    :returns: predicted labels for testset
    """
    predicted_values = [NN_classifier(trainx, trainy, samplex, norm) for samplex in testx]
    return np.asarray(predicted_values)
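
The per-sample loops above are easy to follow but slow on larger data sets. A minimal sketch of a vectorized alternative, assuming all pairwise differences fit in memory, uses NumPy broadcasting. The function `nn_vectorized` and the `toy_*` arrays are hypothetical names with made-up values; they are not part of the original notebook.

```python
import numpy as np

def nn_vectorized(train_features, train_labels, test_features, order=2):
    """Hypothetical vectorized 1-NN: label of the closest training row for each test row."""
    # diffs[i, j, :] = test_features[i] - train_features[j], shape (n_test, n_train, d)
    diffs = test_features[:, np.newaxis, :] - train_features[np.newaxis, :, :]
    # Vector norm along the feature axis gives an (n_test, n_train) distance matrix
    distances = np.linalg.norm(diffs, ord=order, axis=2)
    # Index of the nearest training point for each test point, then look up its label
    return train_labels[np.argmin(distances, axis=1)]

# Made-up toy data: three training points with labels 0, 1, 2
toy_trainx = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
toy_trainy = np.array([0.0, 1.0, 2.0])
toy_testx = np.array([[1.0, 1.0], [9.0, 8.0], [1.0, 9.0]])

toy_pred = nn_vectorized(toy_trainx, toy_trainy, toy_testx)
print(toy_pred)
```

Passing `order=1` switches the sketch to the Manhattan distance. The trade-off is memory: the intermediate array has `n_test * n_train * d` entries.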

#### test

In [None]:
testy_L2 = NN(trainx, trainy, testx, L2)

assert( type( testy_L2).__name__ == 'ndarray' )
assert( len(testy_L2) == 62 ) 
assert( np.all( testy_L2[50:60] == [ 0.,  0.,  0.,  0.,  2.,  0.,  2.,  0.,  0.,  0.] )  )
assert( np.all( testy_L2[0:10] == [ 0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.] ) )

In [None]:
testy_L1 = NN(trainx, trainy, testx, L1)

assert( type( testy_L1).__name__ == 'ndarray' )
assert( len(testy_L1) == 62 ) 
assert( not all(testy_L1 == testy_L2) )
assert( all(testy_L1[50:60]== [ 0.,  2.,  1.,  0.,  2.,  0.,  0.,  0.,  0.,  0.]) )
assert( all( testy_L1[0:10] == [ 0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.]) )

### Test errors and the confusion matrix

Let's see if the L1 and L2 distance functions yield different error rates for nearest neighbor classification of the test data.

In [None]:
def error_rate(testy, testy_pred):
    return float(sum(testy!=testy_pred))/len(testy) 

print("Error rate of NN_L1: ", error_rate(testy, testy_L1) )
print("Error rate of NN_L2: ", error_rate(testy, testy_L2) )

We will now look a bit more deeply into the specific types of errors made by nearest neighbor classification, by constructing the <font color="magenta">*confusion matrix*</font>.

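Since there are three labels, the confusion matrix is a 3x3 matrix whose entry at row *i*, column *j* counts the test points with correct label *i* that were classified as *j*. The diagonal holds the correct predictions and every off-diagonal entry is a misclassification. For example, the entry at row DH, column SL, contains the number of test points whose correct label was DH but which were classified as SL.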

<img style="width:200px" src="../../_data/confusion_matrix.png">
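
To make the row and column convention concrete, here is a minimal sketch that builds a confusion matrix for a handful of made-up labels. The arrays `true_labels` and `pred_labels` and the names `toy_cm` and `toy_errors` are hypothetical; the labels use the same 0 = NO, 1 = DH, 2 = SL coding as the rest of the notebook.

```python
import numpy as np

# Made-up labels, for illustration only (0 = NO, 1 = DH, 2 = SL)
true_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])
pred_labels = np.array([0, 1, 1, 1, 2, 2, 2, 0])

n_classes = 3
toy_cm = np.zeros((n_classes, n_classes), dtype=int)
# Add 1 at (row = true label, column = predicted label) for every test point;
# np.add.at accumulates correctly when the same cell appears more than once.
np.add.at(toy_cm, (true_labels, pred_labels), 1)

# Misclassifications are everything off the diagonal
toy_errors = toy_cm.sum() - np.trace(toy_cm)

print(toy_cm)
print(toy_errors)
```

In this toy example the entry at row 1 (DH), column 2 (SL) is 1, meaning one point whose correct label was DH was classified as SL.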




In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(testy, testy_L1)
cm
cm.diagonal()

In [None]:
def confusion(testy, testy_pred, diag_zeros=False):
    """Get Confusion Matrix.
    :params: correct labels, predicted NN labels
    :param diag_zeros: if True, set the diagonal to zero (drop the correct predictions)
    :returns: 3x3 np.array representing the confusion matrix"""
    # Size the matrix from all labels seen, so a class that is never predicted keeps its column
    n = len(np.unique(np.concatenate([testy, testy_pred])))
    conf_matrix = np.zeros((n, n))
    for (y, p), _ in np.ndenumerate(conf_matrix):
        if not (y == p and diag_zeros):
            conf_matrix[y, p] = sum(1 for label, pred in zip(testy, testy_pred) if label == y and pred == p)
    return conf_matrix

In [None]:
confusion(testy, testy_L1)
confusion(testy, testy_L2)

In [None]:
confusion(testy, testy_L2)[0,1]

### Misclassifications

In [None]:
errors_L2 = confusion(testy, testy_L2, diag_zeros=True)  # zero the diagonal so only misclassifications remain
errors_L2.sum()   # total number of misclassifications
errors_L2.sum(0)  # by column (prediction) - sum over rows
errors_L2.sum(1)  # by row (label) - sum over columns

### Most frequent misclassification

In [None]:
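errors_L2 = confusion(testy, testy_L2, diag_zeros=True)  # ignore correct predictions
np.unravel_index(np.argmax(errors_L2), errors_L2.shape)  # (true label, predicted label) of the most frequent error
errors_L2.max()  # how many times that error occurs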

### Classification differences in norms L1 vs. L2

In [None]:
sum(testy_L1 != testy_L2)

In [None]:
(confusion(testy, testy_L1) != confusion(testy, testy_L2)).sum()