# Imports

In [1]:
import pandas as pd
import numpy as np

# K-Nearest Neighbors

In this lab, we will perform KNN on the ${\tt Smarket}$ dataset. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the
beginning of 2001 until the end of 2005. For each date, we have recorded
the percentage returns for each of the five previous trading days, ${\tt Lag1}$
through ${\tt Lag5}$. We have also recorded ${\tt Volume}$ (the number of shares traded on the previous day, in billions), ${\tt Today}$ (the percentage return on the date
in question) and ${\tt Direction}$ (whether the market was ${\tt Up}$ or ${\tt Down}$ on this
date). We can use the ${\tt head(...)}$ function to look at the first few rows:

In [2]:
df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)
df.head()

Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
2001-01-01,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2001-01-01,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2001-01-01,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
2001-01-01,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
2001-01-01,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


Today we're going to try to predict ${\tt Direction}$ using percentage returns from the previous two days (${\tt Lag1}$ and ${\tt Lag2}$). We'll build our model using the ${\tt KNeighborsClassifier()}$ class, which is part of the
${\tt neighbors}$ submodule of scikit-learn (${\tt sklearn}$). We'll also grab a couple of useful tools from the ${\tt metrics}$ submodule:

In [3]:
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, classification_report

Like other scikit-learn estimators, ${\tt KNeighborsClassifier()}$ follows a
two-step approach: we first fit the model with ${\tt fit()}$ and then use the
fitted model to make predictions with ${\tt predict()}$. For KNN, fitting
essentially just stores the training data; the nearest neighbors are found
when we predict. Across the two steps, we need four inputs.
   1. A matrix containing the predictors associated with the training data,
labeled ${\tt X\_train}$ below.
   2. A matrix containing the predictors associated with the data for which
we wish to make predictions, labeled ${\tt X\_test}$ below.
   3. A vector containing the class labels for the training observations,
labeled ${\tt y\_train}$ below.
   4. A value for $K$, the number of nearest neighbors to be used by the
classifier.
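
To make the role of these four inputs concrete, here is a minimal NumPy sketch of KNN prediction. It is an illustration only, not scikit-learn's implementation: the helper name ${\tt knn\_predict}$, the toy data, and the tie-breaking rule are all assumptions made for this example.

```python
import numpy as np

def knn_predict(X_train, X_test, y_train, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training observations for each test row
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]

    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        # Assumption: ties go to the label that sorts first
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy data, made up for illustration (not the Smarket data)
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_tr = np.array(["Down", "Down", "Up", "Up"])
X_te = np.array([[0.05, 0.1], [1.0, 0.9]])

toy_pred = knn_predict(X_tr, X_te, y_tr, k=3)
print(toy_pred)
```

The brute-force distance matrix is fine for small data like this; for larger data sets a library implementation is the better choice.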

We'll use the observations from 2001 through 2004 to train the model, and hold out the observations from 2005 as a test set. Because the rows are indexed by date, we can select both subsets by slicing on the year. We'll also pull out our training and test labels.

In [4]:
X_train = df[:'2004'][['Lag1','Lag2']]
y_train = df[:'2004']['Direction']

X_test = df['2005':][['Lag1','Lag2']]
y_test = df['2005':]['Direction']

Now the ${\tt neighbors.KNeighborsClassifier()}$ function can be used to predict the market’s movement for
the dates in 2005.

In [5]:
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
pred = knn.fit(X_train, y_train).predict(X_test)

The ${\tt confusion\_matrix()}$ function can be used to produce a **confusion matrix** in order to determine how many observations were correctly or incorrectly classified. The ${\tt classification\_report()}$ function gives us some summary statistics on the classifier's performance:

In [6]:
print(confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred, digits=3))

[[43 58]
 [68 83]]
             precision    recall  f1-score   support

       Down      0.426     0.387     0.406       111
         Up      0.550     0.589     0.568       141

avg / total      0.495     0.500     0.497       252
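
The summary statistics in the report can be recomputed by hand from the transposed confusion matrix printed above. The sketch below hard-codes those four counts; because of the ${\tt .T}$, rows are predicted classes and columns are true classes, in the order ${\tt Down}$, ${\tt Up}$. The variable names are chosen for this example only.

```python
import numpy as np

# Transposed confusion matrix printed above for K = 1:
# rows = predicted (Down, Up), columns = true (Down, Up)
cm_t = np.array([[43, 58],
                 [68, 83]])

correct = np.trace(cm_t)        # 43 + 83 correctly classified days
total = cm_t.sum()              # 252 test days in 2005
accuracy = correct / total

# Precision for Down: of the days predicted Down, how many were Down?
down_precision = cm_t[0, 0] / cm_t[0].sum()      # 43 / 101
# Recall for Down: of the days that were Down, how many were predicted Down?
down_recall = cm_t[0, 0] / cm_t[:, 0].sum()      # 43 / 111

print(round(accuracy, 3), round(down_precision, 3), round(down_recall, 3))
```

The precision and recall values agree with the ${\tt Down}$ row of the classification report.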



The results using $K = 1$ are not very good, since only 50% of the observations
are correctly predicted. Of course, it may be that $K = 1$ results in an
overly flexible fit to the data. Below, we repeat the analysis using $K = 3$.

In [7]:
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
pred = knn.fit(X_train, y_train).predict(X_test)
print(confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred, digits=3))

[[48 55]
 [63 86]]
             precision    recall  f1-score   support

       Down      0.466     0.432     0.449       111
         Up      0.577     0.610     0.593       141

avg / total      0.528     0.532     0.529       252



The results have improved slightly. Try looping through a few other $K$ values to see if you can get any further improvement:

In [9]:
# Your code here
for k_val in range(1,10):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k_val)
    pred = knn.fit(X_train, y_train).predict(X_test)
    print(confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred, digits=3))

[[43 58]
 [68 83]]
             precision    recall  f1-score   support

       Down      0.426     0.387     0.406       111
         Up      0.550     0.589     0.568       141

avg / total      0.495     0.500     0.497       252

[[74 93]
 [37 48]]
             precision    recall  f1-score   support

       Down      0.443     0.667     0.532       111
         Up      0.565     0.340     0.425       141

avg / total      0.511     0.484     0.472       252

[[48 55]
 [63 86]]
             precision    recall  f1-score   support

       Down      0.466     0.432     0.449       111
         Up      0.577     0.610     0.593       141

avg / total      0.528     0.532     0.529       252

[[71 82]
 [40 59]]
             precision    recall  f1-score   support

       Down      0.464     0.640     0.538       111
         Up      0.596     0.418     0.492       141

avg / total      0.538     0.516     0.512       252

[[40 59]
 [71 82]]
             precision    recall  f1-score   

It looks like ${\tt KNN}$ might not be the right approach for classifying this dataset.

# An Application to Caravan Insurance Data
Let's see how the ${\tt KNN}$ approach performs on the ${\tt Caravan}$ data set. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is
${\tt Purchase}$, which indicates whether or not a given individual purchases a
caravan insurance policy. In this data set, only 6% of people purchased
caravan insurance.

In [10]:
df2 = pd.read_csv('Caravan.csv')
df2["Purchase"].value_counts()

No     5474
Yes     348
Name: Purchase, dtype: int64

In [11]:
df2.head()

Unnamed: 0.1,Unnamed: 0,MOSTYPE,MAANTHUI,MGEMOMV,MGEMLEEF,MOSHOOFD,MGODRK,MGODPR,MGODOV,MGODGE,...,APERSONG,AGEZONG,AWAOREG,ABRAND,AZEILPL,APLEZIER,AFIETS,AINBOED,ABYSTAND,Purchase
0,1,33,1,3,2,8,0,5,1,3,...,0,0,0,1,0,0,0,0,0,No
1,2,37,1,2,2,8,1,4,1,4,...,0,0,0,1,0,0,0,0,0,No
2,3,37,1,2,2,8,0,4,2,4,...,0,0,0,1,0,0,0,0,0,No
3,4,9,1,3,3,3,2,3,2,4,...,0,0,0,1,0,0,0,0,0,No
4,5,40,1,4,2,10,1,4,1,4,...,0,0,0,1,0,0,0,0,0,No


Because the ${\tt KNN}$ classifier predicts the class of a given test observation by
identifying the observations that are nearest to it, the scale of the variables
matters. Any variables that are on a large scale will have a much larger
effect on the distance between the observations, and hence on the ${\tt KNN}$
classifier, than variables that are on a small scale. 

For instance, imagine a
data set that contains two variables, salary and age (measured in dollars
and years, respectively). As far as ${\tt KNN}$ is concerned, a difference of \$1,000
in salary is enormous compared to a difference of 50 years in age. Consequently,
salary will drive the ${\tt KNN}$ classification results, and age will have
almost no effect. 

This is contrary to our intuition that a salary difference
of \$1,000 is quite small compared to an age difference of 50 years. Furthermore,
the importance of scale to the ${\tt KNN}$ classifier leads to another issue:
if we measured salary in Japanese yen, or if we measured age in minutes,
then we’d get quite different classification results from what we get if these
two variables are measured in dollars and years.
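
The salary and age example can be checked numerically. The sketch below uses three made-up individuals (the values are assumptions for illustration) and compares raw Euclidean distances, first in dollars and years, then with age converted to minutes.

```python
import numpy as np

# Hypothetical people: (salary in dollars, age in years)
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 25.0])   # differs from a by $1,000 in salary
c = np.array([50_000.0, 75.0])   # differs from a by 50 years in age

# In dollars and years, the salary gap dominates the distance
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)

# Measuring age in minutes instead reverses which pair looks closer
minutes_per_year = 365.25 * 24 * 60
unit_scale = np.array([1.0, minutes_per_year])
dist_ab_min = np.linalg.norm((a - b) * unit_scale)
dist_ac_min = np.linalg.norm((a - c) * unit_scale)
print(dist_ab_min, dist_ac_min)
```

In the first case ${\tt b}$ is the farther neighbor of ${\tt a}$; in the second, ${\tt c}$ is. Nothing about the people changed, only the units.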

A good way to handle this problem is to **standardize** the data so that all
variables are given a mean of zero and a standard deviation of one. Then
all variables will be on a comparable scale. The ${\tt scale()}$ function from the ${\tt preprocessing}$ submodule of scikit-learn does just
this. In standardizing the data, we exclude the ${\tt Purchase}$ column, because it is
qualitative.

In [12]:
from sklearn import preprocessing
y = df2.Purchase
X = df2.drop('Purchase', axis=1).astype('float64')
X_scaled = preprocessing.scale(X)
print(np.std(X_scaled))

1.0


Now every column of ${\tt X\_scaled}$ has a standard deviation of one and
a mean of zero.
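
Note that ${\tt np.std(X\_scaled)}$ without an ${\tt axis}$ argument returns a single number computed over all entries. To check each column separately, pass ${\tt axis=0}$. The sketch below does this on a made-up two-column matrix, standardizing by hand with the population standard deviation, which is my understanding of what ${\tt scale()}$ does with its default settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix with two columns on very different scales (made up for illustration)
X_toy = np.column_stack([rng.normal(50_000, 10_000, size=200),
                         rng.normal(40, 12, size=200)])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Per-column checks: one mean and one standard deviation per predictor
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(col_means, col_stds)
```

The same two ${\tt axis=0}$ calls can be applied to ${\tt X\_scaled}$ to verify every predictor at once.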

We'll now split the observations into a test set, containing the first 1,000
observations, and a training set, containing the remaining observations.

In [13]:
X_train = X_scaled[1000:,:]
y_train = y[1000:]
X_test = X_scaled[:1000,:]
y_test = y[:1000]

Let's fit a ${\tt KNN}$ model on the training data using $K = 1$, and evaluate its
performance on the test data.

In [14]:
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
pred = knn.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, pred, digits=3))

             precision    recall  f1-score   support

         No      0.948     0.937     0.943       941
        Yes      0.157     0.186     0.171        59

avg / total      0.902     0.893     0.897      1000



The KNN error rate on the 1,000 test observations is just under 11%. At first glance, this may appear to be fairly good. However, since only about 6% of customers purchased insurance, we could get the error rate down to about 6% by always predicting ${\tt No}$ regardless of the values of the predictors!
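
The comparison with the trivial classifier can be worked out from the classification report above. The sketch below uses the ${\tt support}$ counts and the weighted-average recall, which equals overall accuracy; the variable names are chosen for this example only.

```python
# Counts taken from the "support" column of the report above
n_no, n_yes = 941, 59
n_test = n_no + n_yes

# A classifier that always predicts "No" is wrong only on the "Yes" cases
baseline_error = n_yes / n_test

# The support-weighted average recall in the report equals overall accuracy
knn_accuracy = 0.893
knn_error = 1 - knn_accuracy

print(f"always-No error: {baseline_error:.3f}, KNN (K=1) error: {knn_error:.3f}")
```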

Suppose that there is some non-trivial cost to trying to sell insurance
to a given individual. For instance, perhaps a salesperson must visit each
potential customer. If the company tries to sell insurance to a random
selection of customers, then the success rate will be only 6%, which may
be far too low given the costs involved. 

Instead, the company would like
to try to sell insurance only to customers who are likely to buy it. So the
overall error rate is not of interest. Instead, the fraction of individuals that
are correctly predicted to buy insurance is of interest.

It turns out that ${\tt KNN}$ with $K = 1$ does far better than random guessing
among the customers that are predicted to buy insurance:

In [16]:
print(confusion_matrix(y_test, pred).T)

[[882  48]
 [ 59  11]]


Among the 70 customers predicted to purchase insurance, 11, or 15.7%, actually do.
This is more than double the roughly 6% rate that one would obtain from random guessing.
Let's see if increasing $K$ helps! Try out a few different $K$ values below. Feeling adventurous? Write a function that figures out the best value for $K$.

In [16]:
for k_val in range(1,10):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k_val)
    pred = knn.fit(X_train, y_train).predict(X_test)
    print(confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred, digits=3))

[[882  48]
 [ 59  11]]
             precision    recall  f1-score   support

         No      0.948     0.937     0.943       941
        Yes      0.157     0.186     0.171        59

avg / total      0.902     0.893     0.897      1000

[[931  57]
 [ 10   2]]
             precision    recall  f1-score   support

         No      0.942     0.989     0.965       941
        Yes      0.167     0.034     0.056        59

avg / total      0.897     0.933     0.912      1000

[[921  53]
 [ 20   6]]
             precision    recall  f1-score   support

         No      0.946     0.979     0.962       941
        Yes      0.231     0.102     0.141        59

avg / total      0.903     0.927     0.913      1000

[[938  58]
 [  3   1]]
             precision    recall  f1-score   support

         No      0.942     0.997     0.969       941
        Yes      0.250     0.017     0.032        59

avg / total      0.901     0.939     0.913      1000

[[934  55]
 [  7   4]]
             precision   


[[941  59]
 [  0   0]]
             precision    recall  f1-score   support

         No      0.941     1.000     0.970       941
        Yes      0.000     0.000     0.000        59

avg / total      0.885     0.941     0.912      1000
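
One possible approach to the "adventurous" exercise is sketched below. It scores each $K$ by the fraction of predicted ${\tt Yes}$ cases that really are ${\tt Yes}$, which is the quantity of interest here. The helper names ${\tt yes\_precision}$ and ${\tt scan\_k}$ are hypothetical, and the data are synthetic stand-ins, not the ${\tt Caravan}$ data. When no ${\tt Yes}$ is predicted at all, as in the last matrix above, the score is reported as NaN rather than zero.

```python
import numpy as np
from sklearn import neighbors

def yes_precision(y_true, y_pred, positive="Yes"):
    """Fraction of predicted positives that are truly positive (NaN if none)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    predicted_pos = y_pred == positive
    if predicted_pos.sum() == 0:
        return float("nan")
    return float((y_true[predicted_pos] == positive).mean())

def scan_k(X_train, y_train, X_test, y_test, k_values, positive="Yes"):
    """Hypothetical helper: positive-class precision for each candidate K."""
    results = {}
    for k in k_values:
        knn = neighbors.KNeighborsClassifier(n_neighbors=k)
        pred = knn.fit(X_train, y_train).predict(X_test)
        results[k] = yes_precision(y_test, pred, positive)
    return results

# Synthetic stand-in data (NOT the Caravan data)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(600, 2))
y_syn = np.where(X_syn[:, 0] + 0.5 * rng.normal(size=600) > 1.0, "Yes", "No")

# First 200 rows as the test set, the rest for training
results = scan_k(X_syn[200:], y_syn[200:], X_syn[:200], y_syn[:200], range(1, 10))
for k, p in results.items():
    print(k, p)
```

Choosing $K$ by its score on the test set gives an optimistic estimate of performance; a separate validation set or cross-validation would be a more careful way to pick it.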

