# K-Nearest Neighbors (KNN)

The process so far has been to use tree-based algorithms to generate classification models.  Now let's try something completely different:  classification based on how "near" records are to one another.
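
To make "near" concrete, here is a minimal from-scratch sketch of the idea, assuming Euclidean distance and a simple majority vote.  The function `knn_predict` and the toy arrays are hypothetical and exist only for illustration; the rest of this notebook uses sklearn's `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training record
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest records
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

near_first_cluster = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
near_second_cluster = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(near_first_cluster, near_second_cluster)
```

Nothing is "learned" here in the usual sense: the training data is simply stored, and all of the work happens at prediction time.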

One of the key decisions to make with KNN is how many neighbors to compare.  Typically, we want the number to be large enough that we get a good idea of what it means to be near another data point, but small enough that we don't include too many records.  At the extreme, if k = 100 and we only have 101 data points, every point is a neighbor of every other point, so each prediction is just the majority class of everything else and the process falls apart.

For the sake of simplicity, we'll start with 3.  You can also try other values of k and see if you get different results.
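
One possible way to compare values of k is cross-validation.  The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the campus dataset), and the names `X_demo`, `y_demo`, and `scores_by_k` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the campus dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean 5-fold cross-validated accuracy for several odd values of k
scores_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(scores_by_k, key=scores_by_k.get)
for k, score in scores_by_k.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Odd values of k are a common choice for two-class problems because they avoid tied votes.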

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

clf = KNeighborsClassifier(n_neighbors=3)

Now let's prep the data.  Because we'll do it the same way for each model, we only need to do this once.  I'll also remove the bits where we analyze the data, as we've seen it enough times already.

In [None]:
campus_data = "../data/CampusRecruitment.csv"
df = pd.read_csv(campus_data, header=0)
y = df['status']
X = df.drop(['status', 'salary'], axis=1)

## Pre-Processing

For this dataset, we want to use the columns leading up to `status` to determine if different college graduates were placed at a job.  Because the salary is determined by the placement status, we can't use it to predict if a new graduate will be placed, so we'll have to drop that column.  Note that if we were interested in doing a regression analysis, we could try to predict the salary given placement, but we're keeping it classy and sticking to classification algorithms only.

Unlike the heart attack dataset, this dataset includes non-numeric features.

In [None]:
df

Before we can feed this data into the kNN algorithm (or pretty much any other classification algorithm), we need to convert any text data into numeric data.  There are a few common techniques for encoding.  The technique we will use for our dataset is called one-hot encoding.  What it does is "pivot" the categorical data, so that each distinct categorical value gets its own feature.  For example, `gender` has two values, M and F.  One-hot encoding will create two new features, one for `gender=M` and one for `gender=F`.  We need to do this for each of the non-numeric features.
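
As a small illustration of that pivot, here is a sketch on a tiny made-up frame.  The names `toy`, `toy_enc`, and `toy_encoded` are hypothetical, and the four rows are invented for the example.

```python
import pandas as pd
from sklearn import preprocessing

# Tiny hypothetical frame with one categorical column
toy = pd.DataFrame({"gender": ["M", "F", "F", "M"]})

toy_enc = preprocessing.OneHotEncoder()
toy_encoded = toy_enc.fit_transform(toy).toarray()

# Categories are sorted, so the columns are gender=F, then gender=M
print(toy_enc.categories_)
print(toy_encoded)
```

Each row has exactly one 1, marking which category that record belongs to.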

In [None]:
enc = preprocessing.OneHotEncoder()

# Fit the input features to our encoder
enc.fit(X)

# Perform the transformation on our dataset
X = enc.transform(X).toarray()
X.shape

The `shape` here shows that we have the same number of rows as before (215), but the number of columns went from 15 to 873.  This huge increase came about because `OneHotEncoder` treats every column we hand it as categorical, so each distinct value in each column, numeric or text, gets its own feature.
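
One way to keep the column count down is to one-hot encode only the text columns and scale the numeric ones with sklearn's `ColumnTransformer`.  This is a sketch of an alternative, not what this notebook does.  The frame `toy` is made up, and the column names other than `gender` are assumptions about the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature frame; column names other than `gender` are assumed,
# and all values are made up
toy = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "workex": ["Yes", "No", "No", "Yes"],
    "ssc_p": [67.0, 79.3, 65.0, 56.0],
    "mba_p": [58.8, 66.3, 57.8, 59.4],
})
categorical_cols = ["gender", "workex"]
numeric_cols = ["ssc_p", "mba_p"]

ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
toy_features = ct.fit_transform(toy)
print(toy_features.shape)  # 2 + 2 one-hot columns plus 2 scaled numeric columns
```

Scaling matters for a distance-based method like kNN, since a feature with a large numeric range would otherwise dominate the distance.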

In [None]:
X

By contrast, I'm going to perform a simple label encoding of the `status`.

In [None]:
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
y
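
For reference, here is a small sketch of what label encoding does and how to map the integers back.  The label strings in `toy_status` are assumptions; the real `status` values may differ.

```python
from sklearn import preprocessing

# Hypothetical labels; the real `status` values may differ
toy_status = ["Placed", "Not Placed", "Placed", "Placed"]

toy_le = preprocessing.LabelEncoder()
toy_y = toy_le.fit_transform(toy_status)

print(toy_le.classes_)                   # classes in sorted order; index = encoded value
print(toy_y)
print(toy_le.inverse_transform([0, 1]))  # map integers back to the original labels
```

Because the classes are sorted, the integer assigned to each label depends on alphabetical order, not on which label is "positive".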

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

In [None]:
X_train

Now let's train the kNN model.

In [None]:
clf = clf.fit(X_train, y_train)

## How'd We Do?

Let's first use the `accuracy_score` function in sklearn to see just how well we did.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Now let's review the confusion matrix and classification report.

In [None]:
confusion_matrix(y_test, predicted)

In [None]:
print(classification_report(y_test, predicted))
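
As a reminder of how to read these outputs, here is a sketch on made-up labels and predictions (`y_true_demo` and `y_pred_demo` are hypothetical).  For a two-class problem, sklearn's confusion matrix puts true classes on the rows and predicted classes on the columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = positive class, 0 = negative class)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many we found
print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The precision and recall columns in `classification_report` are these same ratios, computed once per class.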

## Comparing to XGBoost

Out of curiosity, how does this fare versus a tree-based algorithm like XGBoost?

In [None]:
import xgboost as xgb

clf = xgb.XGBClassifier(max_depth=8, n_estimators=55, use_label_encoder=False, eval_metric='logloss')
clf = clf.fit(X_train, y_train)

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

It looks like the kNN technique actually does better than XGBoost here.  We could try other hyperparameters (other values of max depth, number of estimators, and so on) to narrow this gap if we wanted, but that's a pretty big difference in accuracy.  Let's see if we can understand why the XGBoost model didn't work out as well.

In [None]:
confusion_matrix(y_test, predicted)

In [None]:
print(classification_report(y_test, predicted))

The model was a bit worse at correctly guessing non-placements as well as correctly guessing placements.  In other words, this is worse across the board.

The moral of the story is that we shouldn't forget about non-tree methods of classification!