# Nearest neighbor for spinal cord injury

in this particular project, we are willing to use nearest neighbor algorithm to classify the type of injury that the patient has in the hospital, predicated on the measurements of the shape and orientation of their spine and pelvis. 

The dataset holds information from 310 patiens. For each patient, there are six measurements (6 independent variables) and only one label (dependent variable), the label has three possible values, `’NO’` (normal), `’DH’` (herniated disk), or `’SL’` (spondilolysthesis)

# 1. Setup notebook

We are going to import all the necessary packages particularly for the first section of the project, Notice that we will not import any of Sklearn packages, Because, our main mission is to implement a nearest neighbor classifier manually.

In [1]:
import numpy as np

We load our dataset. We slice our data into a training set of 210 patients and testing set of 100 patients. The following array are going to be created after breaking the data into two uneven chunks:


*   `’x_train’` : represents the training data's features, has one observation per row.
*   `’y_train’` : represents the training data's labels.
*   `’x_test’` : represents the testing data's features, has one observation per row.
*   `’y_test’` : represents the testing data's labels.

We will use the training sets (`x_train`,`y_train`), with nearest neighbor classification to predict the labels for the test data (`x_test`), and will compare the estimated labels against the true ones.

The three values associated with the label will be endoded as following : 


*   0 = `NO`
*   1 = `DH`
*   2 = `SL`






In [2]:
# Mount google drive files in google colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# load our dataset and attribute to each label a number.
labels = [b'NO', b'DH', b'SL']
data = np.loadtxt('/content/drive/MyDrive/Colab Notebooks/NN_spine/NN_spine/column_3C.dat', converters={6: lambda s: labels.index(s)})

#Seprate features from labels.
x = data[:,0:6]
y = data[:,6]

# Divide the data into training and test set.
training_indices = list(range(0,20)) + list(range(40,188)) + list(range(230,310))
test_indices = list(range(20,40)) + list(range(188,230))

x_train = x[training_indices,:]
y_train = y[training_indices]
x_test = x[test_indices,:]
y_test = y[test_indices]

# 2. Nearest neighbor classifier based on the euclidean distance (L2)

To compute the nearest neighbor in our dataset, we need to be able to calculate the distance between data points.

`Euclidean distance `: for two vectors $x, y \in \mathbb{R}^d$, their Euclidean distance is defined as 
$$\|x - y\| = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}.$$
Often we get red of the square root, and simply compute _squared Euclidean distance_:
$$\|x - y\|^2 = \sum_{i=1}^d (x_i - y_i)^2.$$
For the purposes of nearest neighbor computations, the two are equivalent: for three vectors $x, y, z \in \mathbb{R}^d$, we have $\|x - y\| \leq \|x - z\|$ if and only if $\|x - y\|^2 \leq \|x - z\|^2$.

Now we just need to be able to compute squared Euclidean distance. The following function does so.


In [4]:
## Build a function that help us to compute distance between two given vectors
def squared_dist (x,y):
  return np.sqrt(np.sum(np.square(x - y)))

# Compute distance between the first and fifth vector in our training set:
print ('the actual distance between the first and fifth observation is :', squared_dist(x_train[0,:],x_train[4,:] ))

the actual distance between the first and fifth observation is : 25.069339839732518


# 3. Compute the nearest neighbors

Since we've defined our distance function, we can turn to the nearest neighbor classification

In [49]:
# take a vector and return the index of its nearest neighbor in training set
def find_NN (x_test):
  # Compute the distances between the vector x_test and each vector of training set:
  for i in range(len(x_test)):
    for j in range(len(y_train)): 
      distances = [squared_dist(x_test, x_train[j,])]
  # Get the index of smallest distance
  return np.argmin(distances)


## Take a vector and return the label of its nearest neighbor in training set
def NN_Classifier (x_test):
  # Get the index of its nearest neighbor 
  index = find_NN(x_test)
  # Return corresponding class
  return y_train[index]

# 4. Processing the full test set

Let's apply our nearest neighbor classifier over the entire test set.

Note to classify each test set, our code takes a full pass over each of the 248 training examples. Thus we should not expect testing to be very fast. 

In [55]:
## Predict on each test data point (and time it!)
import time
t_before = time.time()
test_predictions = [NN_Classifier(x_test)]
t_after = time.time()

# how much error our classifier makes to predict the class of each observation in test set
err_positions = np.not_equal(test_predictions, y_test)
error = 1 - float(np.sum(err_positions))/len(y_test)

print("Error of nearest neighbor classifier: ", error)
print("Classification time (seconds): ", t_after - t_before)

Error of nearest neighbor classifier:  0.32258064516129037
Classification time (seconds):  0.16842055320739746


In [7]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, test_predictions)
cm

array([[17,  1,  2],
       [10, 10,  0],
       [ 0,  0, 22]])

## 5.Simple cross validation 
We are going to split our data into a training and test set in the ratio of 80:20, We are going to keep 80 percent of our data to train our neareast neighbor and the remaining percentage to assess the performance of our model in predicting the labels of our test data set. We're going to train our model with different values of k, then we're going to capture its accuracy on our test data. 

In [8]:
# Import the relevant packages 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

In [9]:
# Split our data into train and test set 
X_1, X_test1, y_1, y_test1 = train_test_split(x, y, test_size = 0.20)
# Split the train set into cross validation train and cross validation test
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size = 0.2)

for i in range (1,50,2):
  #instantiate learning model ()
  knn = KNeighborsClassifier(n_neighbors = i)
  # fitting the model on cross validation train set
  knn.fit(X_tr, y_tr)
  # Predicting the labels on cross validation train
  pred1 = knn.predict(X_cv)
  # Assessing the accuracy of our model
  acc = accuracy_score(y_cv, pred1, normalize = True) * float(100)
  print('\n CV ACCURACY FOR K = %d is %d%%' % (i, acc))


 CV ACCURACY FOR K = 1 is 82%

 CV ACCURACY FOR K = 3 is 80%

 CV ACCURACY FOR K = 5 is 78%

 CV ACCURACY FOR K = 7 is 78%

 CV ACCURACY FOR K = 9 is 80%

 CV ACCURACY FOR K = 11 is 88%

 CV ACCURACY FOR K = 13 is 84%

 CV ACCURACY FOR K = 15 is 84%

 CV ACCURACY FOR K = 17 is 84%

 CV ACCURACY FOR K = 19 is 82%

 CV ACCURACY FOR K = 21 is 82%

 CV ACCURACY FOR K = 23 is 82%

 CV ACCURACY FOR K = 25 is 88%

 CV ACCURACY FOR K = 27 is 84%

 CV ACCURACY FOR K = 29 is 84%

 CV ACCURACY FOR K = 31 is 84%

 CV ACCURACY FOR K = 33 is 88%

 CV ACCURACY FOR K = 35 is 90%

 CV ACCURACY FOR K = 37 is 90%

 CV ACCURACY FOR K = 39 is 92%

 CV ACCURACY FOR K = 41 is 90%

 CV ACCURACY FOR K = 43 is 90%

 CV ACCURACY FOR K = 45 is 92%

 CV ACCURACY FOR K = 47 is 88%

 CV ACCURACY FOR K = 49 is 86%


In [10]:
# We pick the optimal k at which we get the highest accuracy score
## instatiate the learning model (k = 5)
knn = KNeighborsClassifier(n_neighbors = 5)
## Fitting our model on train data
knn.fit(X_1, y_1)
## Predicting the response on train data
pred2 = knn.predict(X_test1)
## Eaxamine the accuracy of our model 
acc2 = accuracy_score(y_test1, pred2) * float(100)
print('\n The accuracy of our nearest neighbor classifier for k = 5 is : %f%%' %(acc2))


 The accuracy of our nearest neighbor classifier for k = 5 is : 80.645161%


## 6. 10 fold cross validation 

In [12]:
# import the important package
import matplotlib.pyplot as plt

# Create a list of odd numbers 
List = list(range(1,50))
Neighbor = list(filter(lambda x : x % 2 != 0, List))

# Empty list in which we're going to store cv scores
cv_scores = []

# Perform 10-fold cross validation 
for k in Neighbor : 
  knn = KNeighborsClassifier(n_neighbors = k)
  scores = cross_val_score(knn, X_1, y_1, cv = 10, scoring = 'accuracy')
  cv_scores.append(scores.mean())

# Compute the misclassification error
MSE = [1 - x for x in cv_scores]

# Find the optimal k 
optimal_k = Neighbor[MSE.index(min(MSE))]
print('\n The optimal number of neighbor is %d.' % optimal_k)


 The optimal number of neighbor is 7.
