# Classification - k-nearest-neighbors - Education Attainment

In [1]:
# Import feature subset with Education Column and one hot encoded values

from sklearn import neighbors, datasets
import pandas as pd

originalDF = pd.read_csv('educationFeatureSubset.csv')
dfOHE = pd.read_csv('oheTransformedData.csv')
dfOHE.fillna(0, inplace=True)

X = dfOHE

#separate target values
y = originalDF['Education_Attainment'].values

# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=8)

# fit the model
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=8, p=2,
           weights='uniform')

With knn, you can determine membership probabilities for each of the 9 labels. The predict() function simply picks the most likely label.

In [2]:
# Which education level is predicted for: years on internet (1-3), web ordering (yes), Not_Purchasing_Security, age (35)?
# call the "predict" method:
result = knn.predict([[1,0,0,0,0,1,0,0,0,0,0,0,1,35],])

print(result)

['College']


In [3]:
knn.predict_proba([[1,0,0,0,0,1,0,0,0,0,0,0,1,35],]) 

array([[0.5  , 0.   , 0.   , 0.   , 0.125, 0.   , 0.   , 0.25 , 0.125]])
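
The relationship between `predict()` and `predict_proba()` can be illustrated on a small toy data set. This is a minimal sketch, not the survey data: the arrays `X_toy`, `y_toy` and `queries` are made up for illustration. With uniform weights, each probability is the fraction of the $k$ neighbours carrying that label, and the predicted label is the entry of `classes_` with the highest probability.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three small clusters with string labels
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [5, 5], [5, 6], [6, 5],
                  [9, 0], [9, 1]])
y_toy = np.array(['College', 'College', 'College',
                  'Masters', 'Masters', 'Masters',
                  'Other', 'Other'])

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

queries = np.array([[0.2, 0.2], [5.2, 5.2], [8.0, 1.0]])
proba = model.predict_proba(queries)   # one row per query, one column per class
labels = model.predict(queries)

# Columns of proba follow the order of model.classes_
by_hand = model.classes_[np.argmax(proba, axis=1)]

print(model.classes_)
print(proba)
print(labels, by_hand)
```

The third query has two `Other` neighbours and one `Masters` neighbour, so its probabilities are 0, 1/3 and 2/3 and the predicted label is `Other`.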

In the next block of code, we take each pair of predictors from the columns of the one-hot-encoded data set, and use the k-nearest-neighbour algorithm with $k = 3, 5, 7$.

In [4]:
import pylab as pl
import numpy as np
from matplotlib.colors import ListedColormap
import itertools
import re, string

import sys
sys.path.append('../resources')
from w6support import plot_2d_class

# Make sure the pic subdirectory exists
import os, errno
try:
    os.makedirs('pic')
except OSError as e:
    if e.errno != errno.EEXIST:
        raise

# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFDDDD', '#DDFFDD', '#DDDDFF'])
cmap_bold = ListedColormap(['#FF2222', '#22FF22', '#8888FF'])

#predNames = list(iris.data) # https://stackoverflow.com/a/19483025, except iris.data is an array, not a dataframe
predNames = dfOHE.columns
df=pd.DataFrame(dfOHE, columns=predNames)
nTrain = df.shape[0]
y = originalDF['Education_Attainment'].values
pattern = re.compile(r'[\W_]+', re.UNICODE) # raw string avoids an invalid-escape warning; https://stackoverflow.com/a/1277047
for neighborCnt in range(3,8,2): # from 3 to a maximum of 8, in steps of 2, so 3,5,7
  knn = neighbors.KNeighborsClassifier(n_neighbors=neighborCnt)
  for twoCols in itertools.combinations(predNames, 2): # https://stackoverflow.com/a/374645
    X = df[list(twoCols)]  # we only take two features at a time
    colNames = X.columns
    c1 = colNames[:1][0] # first of 2
    c2 = colNames[-1:][0] # last of 2
    c1 = pattern.sub("",c1.title()) # Make titlecase, then remove non-alphanumeric characters
    c2 = pattern.sub("",c2.title())
    knn.fit(X, y)
    plotTitle = "k = %i %s fit to the %s dataset" % (neighborCnt, "nearest-neighbours", "Occupation")
    fileTitle = "../../pic/k_%i_%s_%s_%s_%s.pdf" % (neighborCnt, "nearest-neighbours", "Occupation", c1, c2)
    print("Plotting file %s" % (fileTitle))
    plot_2d_class(X, y, nTrain, knn, plotTitle, fileTitle, cmap_light, cmap_bold)


Plotting file ../../pic/k_3_nearest-neighbours_Occupation_YearsOnInternet0_YearsOnInternet1.pdf


ValueError: 'c' argument must either be valid as mpl color(s) or as numbers to be mapped to colors. Here c = ['Masters' 'Some_College' 'College' ... 'Special' 'College' 'Masters'].

Error in callback <function install_repl_displayhook.<locals>.post_execute at 0x1a1ffccd90> (for post_execute):


TypeError: iteration over a 0-d array
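
The `ValueError` above reports that the colour argument received the string labels themselves. One possible workaround, sketched here under the assumption that the plotting helper expects numeric class labels (the internals of `plot_2d_class` are not shown in this notebook), is to encode the labels as integers first. The array `y_str` below is a made-up sample, not the real target column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of string labels, standing in for the real target
y_str = np.array(['Masters', 'Some_College', 'College',
                  'Special', 'College', 'Masters'])

encoder = LabelEncoder()
y_num = encoder.fit_transform(y_str)   # integer codes, in sorted label order

print(encoder.classes_)                # the label each code stands for
print(y_num)

# The original strings can be recovered for titles and legends
y_back = encoder.inverse_transform(y_num)
```

Note that the colour maps defined in the cell above list only three colours, so they would probably also need extending to cover all nine education classes.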

## Model Validation

The k-nearest-neighbours classification "model" should be validated. Clearly, the parameter $k$ is critical to its performance. Generally, smaller values of $k$ fit the training set more accurately (less bias) but generalise less well to test data (more variance). The opposite applies to larger values of $k$.
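
The trade-off can be seen by comparing training and test accuracy over several values of $k$. The following is a minimal sketch on synthetic data from `make_classification`, not the survey data; the sample size, feature counts and the list of $k$ values are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data
    scores[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("k = %2i  train accuracy = %.3f  test accuracy = %.3f"
          % (k, scores[k][0], scores[k][1]))
```

With $k = 1$ the training accuracy is 1 because every training point is its own nearest neighbour; the test accuracy is the more honest guide when choosing $k$.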

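With $k$ set to its minimum value ($k = 1$), the classifier fits the training set exactly and the confusion matrix is optimal, provided that no two identical feature vectors carry different labels. The cell below uses `n_neighbors=2`, and its predictions do not reproduce the training labels: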

In [5]:
from sklearn.neighbors import KNeighborsClassifier
X, y = dfOHE, originalDF['Education_Attainment'].values
knn1 = KNeighborsClassifier(n_neighbors=2)
knn1.fit(X, y)
y_pred1 = knn1.predict(X)
print(np.all(y == y_pred1))

False
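
One reason a nearest-neighbour classifier can fail to reproduce its own training labels, even with $k = 1$, is that identical feature vectors may carry different labels, which is plausible for one-hot-encoded survey data. The sketch below uses made-up one-dimensional data to show both situations; `fits_training_exactly` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fits_training_exactly(X, y):
    """Fit a 1-nearest-neighbour model and check it on its own training data."""
    model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    return bool(np.all(model.predict(X) == y))

labels = np.array([0, 1, 0, 1])

# Every feature vector is distinct: each point is its own nearest neighbour
X_unique = np.array([[0.0], [1.0], [2.0], [3.0]])

# Two pairs of identical feature vectors with conflicting labels
X_dupes = np.array([[0.0], [0.0], [1.0], [1.0]])

exact_unique = fits_training_exactly(X_unique, labels)
exact_dupes = fits_training_exactly(X_dupes, labels)
print(exact_unique, exact_dupes)
```

In the second case both rows of each duplicated pair receive the same prediction, so at least one of them must disagree with its recorded label.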


The *confusion matrix* highlights where classification differences arise, as these occur on the off-diagonal elements of the matrix:

In [6]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
print(accuracy_score(y, y_pred1))
print(confusion_matrix(y, y_pred1))
print(classification_report(y, y_pred1, digits=3))

0.3442817570241393
[[1777  119   24  284  178   49   70  316    0]
 [ 135   43    3   16   32    1    5   20    0]
 [  19    4  124   44    3    2    0    5    0]
 [ 370   28  100  458   54   26   17  193    0]
 [ 760   89   20  131  197   35   34  138    0]
 [  52    9   24   38    3   19    1   17    0]
 [ 143   16    5   24   24   11   17   31    0]
 [1328   96   57  583  227   82   73  845    0]
 [ 197   13    5   99   35   14    9   88    0]]
              precision    recall  f1-score   support

     College      0.372     0.631     0.468      2817
    Doctoral      0.103     0.169     0.128       255
     Grammar      0.343     0.617     0.440       201
 High_School      0.273     0.368     0.313      1246
     Masters      0.262     0.140     0.183      1404
       Other      0.079     0.117     0.095       163
Professional      0.075     0.063     0.068       271
Some_College      0.511     0.257     0.342      3291
     Special      0.000     0.000     0.000       460

   mic



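Here the classifier does not reproduce the training labels: the accuracy score is about 0.344 and the off-diagonal terms of the confusion matrix are large. The last column is all zeros, so the class `Special` is never predicted, which is why its precision is reported as 0. A perfect fit to the training set (accuracy score of 1, off-diagonal terms all 0) would in any case be "too good to be true" as an estimate of performance on new data.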

Note:

1. The _Recall_ of the $i^{\mbox{th}}$ class is $R_i \equiv c_{ii} / \sum_j c_{ij}$, which is the ratio of the $i^{\mbox{th}}$ diagonal element to the sum of the elements of the confusion matrix $C = \{c_{ij}\}$ in that _row_ (rows correspond to true labels).
2. The _Precision_ of the $j^{\mbox{th}}$ class is $P_j \equiv c_{jj} / \sum_i c_{ij}$, which is the ratio of the $j^{\mbox{th}}$ diagonal element to the sum of the elements of the confusion matrix $C = \{c_{ij}\}$ in that _column_ (columns correspond to predicted labels).
3. The $F_1$-score of the $i^{\mbox{th}}$ class is defined as $F_1 = 2\frac{R_i P_i}{R_i + P_i}$.
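
These quantities can be recomputed by hand from the confusion matrix printed above, using the scikit-learn convention that rows are true labels and columns are predicted labels. The sketch below copies that matrix into an array; classes with an empty row or column get a score of 0 here, which is a choice made for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
C = np.array([
    [1777, 119,  24, 284, 178,  49,  70, 316, 0],
    [ 135,  43,   3,  16,  32,   1,   5,  20, 0],
    [  19,   4, 124,  44,   3,   2,   0,   5, 0],
    [ 370,  28, 100, 458,  54,  26,  17, 193, 0],
    [ 760,  89,  20, 131, 197,  35,  34, 138, 0],
    [  52,   9,  24,  38,   3,  19,   1,  17, 0],
    [ 143,  16,   5,  24,  24,  11,  17,  31, 0],
    [1328,  96,  57, 583, 227,  82,  73, 845, 0],
    [ 197,  13,   5,  99,  35,  14,   9,  88, 0],
], dtype=float)

diag = np.diag(C)
row_sums = C.sum(axis=1)   # number of true members of each class (support)
col_sums = C.sum(axis=0)   # number of times each class was predicted

# Guard against division by zero for classes that are never predicted
recall = np.divide(diag, row_sums, out=np.zeros_like(diag), where=row_sums > 0)
precision = np.divide(diag, col_sums, out=np.zeros_like(diag), where=col_sums > 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom,
               out=np.zeros_like(diag), where=denom > 0)
accuracy = diag.sum() / C.sum()

print(np.round(precision, 3))
print(np.round(recall, 3))
print(np.round(f1, 3))
print(accuracy)
```

For the first class (`College`) this gives a precision of 0.372, a recall of 0.631 and an $F_1$-score of 0.468, in agreement with the classification report.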

To test how the model generalizes to unseen data, we hold back some of the training data by splitting it into a _training set_ and a _testing set_. We hold back 20% and stratify based on the data labels $y$, so each of the row counts in the confusion matrix should be about 20% of the corresponding class count (for `College`, $0.2 \times 2817 \approx 563$).
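
The effect of stratification can be checked on a small synthetic example. This sketch uses made-up, imbalanced labels (100, 50 and 10 members per class) rather than the survey data, and fixes `random_state` only so that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 of 'a', 50 of 'b', 10 of 'c'
demo_labels = np.repeat(['a', 'b', 'c'], [100, 50, 10])
demo_features = np.arange(len(demo_labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    demo_features, demo_labels, test_size=0.2,
    stratify=demo_labels, random_state=0)

# Count how many members of each class landed in each part
classes, test_counts = np.unique(l_test, return_counts=True)
_, train_counts = np.unique(l_train, return_counts=True)

print(dict(zip(classes, test_counts)))
print(dict(zip(classes, train_counts)))
```

Each class contributes 20% of its members to the test set (20, 10 and 2 respectively), so even the smallest class is represented in both parts.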

In [7]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y)
knn1.fit(Xtrain, ytrain)
ypred1s = knn1.predict(Xtest)
print(accuracy_score(ytest, ypred1s))
print(confusion_matrix(ytest, ypred1s))
print(classification_report(ytest, ypred1s, digits=3))

0.2888229475766568
[[324  19   0  56  74   2  20  68   1]
 [ 31   5   0   2   8   0   2   3   0]
 [  3   0  22   8   5   0   0   2   0]
 [ 78   2  24  70  19   2   2  52   0]
 [139  28   0  23  51   0  10  30   0]
 [ 10   2   6   5   4   0   0   6   0]
 [ 33   0   0   5  11   0   3   2   0]
 [284  15  12 143  57   5  31 109   2]
 [ 44   2   0  16  11   1   6  12   0]]
              precision    recall  f1-score   support

     College      0.342     0.574     0.429       564
    Doctoral      0.068     0.098     0.081        51
     Grammar      0.344     0.550     0.423        40
 High_School      0.213     0.281     0.243       249
     Masters      0.212     0.181     0.196       281
       Other      0.000     0.000     0.000        33
Professional      0.041     0.056     0.047        54
Some_College      0.384     0.166     0.231       658
     Special      0.000     0.000     0.000        92

   micro avg      0.289     0.289     0.289      2022
   macro avg      0.178     0.212

Note the confusion (off-diagonal nonzero elements) between `College` and `Some_College`: 284 `Some_College` test observations are predicted as `College`, while only 109 are classified correctly. For comparison, we look at the confusion matrix when $k = 3$. Firstly, we try with all the training data (not holding any observations back for a test set).

In [8]:
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X, y)
y_pred3 = knn3.predict(X)
print(accuracy_score(y, y_pred3))
print(confusion_matrix(y, y_pred3))
print(classification_report(y, y_pred3, digits=3))

0.3745548080728136
[[1755   65   32  196  235    3    5  521    5]
 [ 133   35    4   11   37    0    1   34    0]
 [  19    1   94   74    1    0    0   12    0]
 [ 367   15   72  376   43    1    0  371    1]
 [ 786   48   19   95  237    2    5  210    2]
 [  56    2   21   32   11    8    0   33    0]
 [ 145   10    7   20   25    3    7   54    0]
 [1253   63   63  437  182   14    4 1271    4]
 [ 210    9    8   68   31    4    2  125    3]]
              precision    recall  f1-score   support

     College      0.372     0.623     0.465      2817
    Doctoral      0.141     0.137     0.139       255
     Grammar      0.294     0.468     0.361       201
 High_School      0.287     0.302     0.294      1246
     Masters      0.296     0.169     0.215      1404
       Other      0.229     0.049     0.081       163
Professional      0.292     0.026     0.047       271
Some_College      0.483     0.386     0.429      3291
     Special      0.200     0.007     0.013       460

   mic

Note that only 3786 of the 10108 observations (about 37%) are classified the same as their recorded labels, even though the classifier has seen every one of them during training. With this many disagreements, they cannot be dismissed as outliers; it seems more likely that the available features separate the education classes only weakly.

Now we try holding back 20% of the training set for use as test observations, leaving 80% of the training data to train the classifier. We then look at what happens to the confusion matrix. Note that the result depends on which observations fall into the test set, so the accuracy can vary from one random split to another.

In [9]:
knn3.fit(Xtrain, ytrain)
ypred3s = knn3.predict(Xtest)
print(accuracy_score(ytest, ypred3s))
print(confusion_matrix(ytest, ypred3s))
print(classification_report(ytest, ypred3s, digits=3))

0.3239366963402572
[[309  17   1  52  63   0   3 108  11]
 [ 30   6   0   1   8   0   0   5   1]
 [  3   0  17  13   1   0   0   5   1]
 [ 77   0  15  64  13   0   1  78   1]
 [134  18   0  27  44   0   5  49   4]
 [ 11   2   5   5   1   0   0   8   1]
 [ 25   0   0   8  13   0   1   7   0]
 [247   8  10 123  43   0   6 214   7]
 [ 44   2   0  20   7   0   1  18   0]]
              precision    recall  f1-score   support

     College      0.351     0.548     0.428       564
    Doctoral      0.113     0.118     0.115        51
     Grammar      0.354     0.425     0.386        40
 High_School      0.204     0.257     0.228       249
     Masters      0.228     0.157     0.186       281
       Other      0.000     0.000     0.000        33
Professional      0.059     0.019     0.028        54
Some_College      0.435     0.325     0.372       658
     Special      0.000     0.000     0.000        92

   micro avg      0.324     0.324     0.324      2022
   macro avg      0.194     0.205

