This lab on K-Nearest Neighbors is a Python adaptation of pp. 163-167 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Originally adapted by Jordi Warmenhoven (github.com/JWarmenhoven/ISLR-python), modified by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

In [1]:
import pandas as pd
import numpy as np

# K-Nearest Neighbors
In this lab, we will perform KNN classification on the Smarket dataset from ISLR. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.

For each date, we have recorded the percentage returns for each of the five previous trading days (Lag1 through Lag5).

We have also recorded:

- Volume (the number of shares traded on the previous day, in billions)
- Today (the percentage return on the date in question)
- Direction (whether the market was Up or Down on this date)

We can use the head() function to look at the first few rows:

In [4]:
df = pd.read_csv('Smarket_lab.csv', index_col=0, parse_dates = True)
df.head()


Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
1,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
3,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
4,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
5,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


Today we're going to try to predict Direction using percentage returns from the previous two days (Lag1 and Lag2). We'll build our model using the KNeighborsClassifier class, which is part of the neighbors submodule of scikit-learn (sklearn). We'll also grab a couple of useful tools from the metrics submodule:

In [6]:
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, classification_report

In the R version of this lab, knn() forms predictions using a single command. scikit-learn's KNeighborsClassifier instead follows the same two-step pattern as its other estimators: we first fit() the model on the training data and then call predict() on the test data, and the two calls can be chained on one line. Either way, we need four inputs:

- A matrix containing the predictors associated with the training data, labeled X_train below.
- A matrix containing the predictors associated with the test data, labeled X_test below.
- A vector containing the class labels for the training observations, labeled y_train below.
- A value for K, the number of nearest neighbors to be used by the classifier.
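
To make the role of these four inputs concrete, here is a minimal from-scratch sketch of the KNN rule: for each test point, find the K closest training points and take a majority vote over their labels. This is an illustration only, not how scikit-learn implements KNeighborsClassifier; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)


def knn_predict(X_train, y_train, X_test, k):
    """Classify each test point by majority vote among its k nearest training points."""
    predictions = []
    for x in X_test:
        # Indices of the k training points closest to x
        nearest = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x))[:k]
        # Majority vote over the labels of those neighbors
        # (ties go to whichever tied label was counted first)
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy data, invented for illustration (not taken from Smarket)
toy_X_train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
toy_y_train = ["Down", "Down", "Up", "Up"]
toy_X_test = [(0.05, 0.1), (1.0, 0.9)]

toy_pred = knn_predict(toy_X_train, toy_y_train, toy_X_test, k=3)
print(toy_pred)  # ['Down', 'Up']
```

The sketch sorts all training points for every test point, which is fine for a toy example but slow on large data sets.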

We'll first build a boolean condition that selects the observations from 2001 through 2004, which we'll use to train the model. The complementary condition gives us a held-out data set of observations from 2005 on which we will test. We'll also pull out our training and test labels.

In [55]:
# Boolean mask: True for 2001-2004 (training), False for 2005 (test)
train = df.Year < 2005

X_train = df.loc[train, ['Lag1', 'Lag2']]
y_train = df.loc[train, 'Direction']

X_test = df.loc[~train, ['Lag1', 'Lag2']]
y_test = df.loc[~train, 'Direction']

Now the neighbors.KNeighborsClassifier() function can be used to predict the market’s movement for the dates in 2005.

In [56]:
knn = neighbors.KNeighborsClassifier(n_neighbors = 1)
pred = knn.fit(X_train, y_train).predict(X_test)

The confusion_matrix() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. The classification_report() function gives us some summary statistics on the classifier's performance:

In [57]:
print(confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred, digits=3))

[[43 58]
 [68 83]]
              precision    recall  f1-score   support

        Down      0.426     0.387     0.406       111
          Up      0.550     0.589     0.568       141

    accuracy                          0.500       252
   macro avg      0.488     0.488     0.487       252
weighted avg      0.495     0.500     0.497       252
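
Because the confusion matrix is printed transposed (`.T`), its rows are the predicted classes and its columns are the true classes, in the order (Down, Up). As a quick worked check, the sketch below recomputes a few of the reported figures from the four counts; the variable names are introduced here for illustration only.

```python
# Transposed confusion matrix for K=1, copied from the printed output:
# rows = predicted class, columns = true class, in the order (Down, Up)
cm_t = [[43, 58],
        [68, 83]]

total = sum(sum(row) for row in cm_t)

# Accuracy: correct predictions sit on the diagonal
accuracy = (cm_t[0][0] + cm_t[1][1]) / total

# Precision for Down: of the days predicted Down, the share that were truly Down
precision_down = cm_t[0][0] / sum(cm_t[0])

# Recall for Down: of the truly Down days, the share that were predicted Down
recall_down = cm_t[0][0] / (cm_t[0][0] + cm_t[1][0])

print(f"accuracy={accuracy:.3f}")              # accuracy=0.500
print(f"precision_down={precision_down:.3f}")  # precision_down=0.426
print(f"recall_down={recall_down:.3f}")        # recall_down=0.387
```

These values agree with the Down row and the accuracy line of the classification report.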



The results using K=1 are not very good, since only 50% of the observations are correctly predicted. Of course, it may be that K=1 results in an overly flexible fit to the data. Below, we repeat the analysis using K=3.

In [58]:
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
pred = knn.fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred, digits=3))

[[48 55]
 [63 86]]
              precision    recall  f1-score   support

        Down      0.466     0.432     0.449       111
          Up      0.577     0.610     0.593       141

    accuracy                          0.532       252
   macro avg      0.522     0.521     0.521       252
weighted avg      0.528     0.532     0.529       252
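
With K=3 the accuracy rises slightly, from 0.500 to 0.532. One way to judge whether that is good is to compare it with a naive rule that always predicts the more common class in the test set. That baseline is not part of the original lab; the sketch below is an added illustration that uses only the support counts and accuracies printed above.

```python
# Class counts in the 2005 test set, read from the "support" column
support = {"Down": 111, "Up": 141}
n_test = sum(support.values())

# Accuracy of a hypothetical rule that always predicts the majority class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / n_test

# Test accuracies reported in the classification reports for K=1 and K=3
knn_accuracy = {1: 0.500, 3: 0.532}

print(f"always predict {majority_class}: {baseline_accuracy:.3f}")
for k, acc in knn_accuracy.items():
    print(f"K = {k}: {acc:.3f} ({acc - baseline_accuracy:+.3f} vs. baseline)")
```

Always predicting Up would be right on roughly 56% of the 2005 days, so both KNN fits fall short of this simple baseline.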



In [61]:
for k in range(1,10):
    print(f"K = {k}")
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    pred = knn.fit(X_train, y_train).predict(X_test)

    print(confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred, digits=3))

K = 1
[[43 58]
 [68 83]]
              precision    recall  f1-score   support

        Down      0.426     0.387     0.406       111
          Up      0.550     0.589     0.568       141

    accuracy                          0.500       252
   macro avg      0.488     0.488     0.487       252
weighted avg      0.495     0.500     0.497       252

K = 2
[[74 93]
 [37 48]]
              precision    recall  f1-score   support

        Down      0.443     0.667     0.532       111
          Up      0.565     0.340     0.425       141

    accuracy                          0.484       252
   macro avg      0.504     0.504     0.479       252
weighted avg      0.511     0.484     0.472       252

K = 3
[[48 55]
 [63 86]]
              precision    recall  f1-score   support

        Down      0.466     0.432     0.449       111
          Up      0.577     0.610     0.593       141

    accuracy                          0.532       252
   macro avg      0.522     0.521     0.521       252
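
The full reports get long as K grows. A more compact alternative, sketched here, records only the test accuracy for each K. The helper name `accuracy_by_k` is hypothetical, and the demo runs on random stand-in data rather than Smarket, so its numbers say nothing about the lab's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def accuracy_by_k(X_train, y_train, X_test, y_test, ks):
    """Return {k: test accuracy} for each k in ks."""
    scores = {}
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k)
        pred = knn.fit(X_train, y_train).predict(X_test)
        scores[k] = accuracy_score(y_test, pred)
    return scores


# Synthetic stand-in data (pure noise, NOT the Smarket returns)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = rng.choice(["Down", "Up"], size=300)

demo_scores = accuracy_by_k(
    X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:], range(1, 10)
)
for k, acc in demo_scores.items():
    print(f"K = {k}: accuracy = {acc:.3f}")
```

In the notebook, the same helper could be called as `accuracy_by_k(X_train, y_train, X_test, y_test, range(1, 10))`.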

### It looks like KNN might not be the right approach for classifying this dataset.