# Chapter 7: k-Nearest Neighbors (k-NN)


> (c) 2019-2020 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.
>
> Date: 2020-03-08
>
> Python Version: 3.8.2
> Jupyter Notebook Version: 5.6.1
>
> Packages:
>   - pandas: 1.0.1
>   - scikit-learn: 0.22.2
>
> The assistance from Mr. Kuber Deokar and Ms. Anuja Kulkarni in preparing these solutions is gratefully acknowledged.


In [1]:
# Import required packages for this chapter
from pathlib import Path
import math

import pandas as pd

from sklearn.metrics import pairwise
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

%matplotlib inline

In [3]:
# Working directory:
#
# We assume that the data files are kept in the same directory as the notebook.
# If you keep your data in a different folder, change the argument of `Path`.
DATA = Path('.')
# Then load data using
#
# pd.read_csv(DATA / 'filename.csv')

# Problem 7.3 Predicting Housing Median Prices

The file _BostonHousing.csv_ contains information on over 500 census tracts in Boston, where for each tract multiple variables
are recorded. The last column (CAT. MEDV) was derived from MEDV, such that it obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the median value (MEDV) of a tract, given the information in the first 12 columns.
Partition the data into training (60%) and validation (40%) sets.

__7.3.a.__ Perform a k-NN prediction with all 12 predictors (ignore the CAT. MEDV column), trying values of k from 1 to 5. Make sure to normalize the data. What is the best k? What does it mean?

__Answer__

#### Data preparation
Load the data and remove unnecessary columns (CAT. MEDV). Split the data into training (60%) and validation (40%) sets (use `random_state=1`).

In [4]:
# Load the data
house_df = pd.read_csv(DATA / 'BostonHousing.csv')

# Drop the CAT. MEDV column
house_df = house_df.drop(columns=['CAT. MEDV'])

# Make sure that the result is as expected
house_df.head()

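| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |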


In [5]:
# split dataset into training (60%) and validation (40%) sets
train_df, valid_df = train_test_split(house_df, test_size=0.4, random_state=1)
print('Training set:', train_df.shape, 'Validation set:', valid_df.shape)

Training set: (303, 13) Validation set: (203, 13)


In [6]:
# Normalize the training and validation sets. The transformation is fitted on the training set only.
# If the integer columns are not converted to real numbers (float64) first, StandardScaler
# may emit a DataConversionWarning. This is expected.
outcome = 'MEDV'
predictors = list(house_df.columns)
predictors.remove(outcome)

scaler = preprocessing.StandardScaler()
scaler.fit(train_df[predictors])

# Transform the predictors of the training and validation sets
train_X = scaler.transform(train_df[predictors])
train_y = train_df[outcome]
valid_X = scaler.transform(valid_df[predictors])
valid_y = valid_df[outcome]
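
To make the normalization step concrete, here is a minimal pure-Python sketch of the same z-score idea for a single predictor. It is not scikit-learn's implementation; the data values and the helper name `standardize` are hypothetical. It assumes, as `StandardScaler` does, that the population standard deviation (ddof=0) is used, and it shows that the mean and standard deviation are learned from the training values only and then reused for validation values.

```python
import statistics

# Hypothetical values of a single predictor, split into two sets
train_values = [6.0, 6.5, 7.0, 5.5, 6.0]
valid_values = [6.2, 7.4]

# The statistics are learned from the training set only
train_mean = statistics.fmean(train_values)
train_std = statistics.pstdev(train_values)  # population std (ddof=0)


def standardize(values):
    """Apply the training-set z-score transformation to a list of values."""
    return [(v - train_mean) / train_std for v in values]


train_z = standardize(train_values)
valid_z = standardize(valid_values)

print(train_z)
print(valid_z)
```

After the transformation the training values have mean 0 and standard deviation 1. The validation values generally do not, because they are scaled with the training statistics rather than their own.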

In [9]:
# Train a regressor for different values of k
results = []
for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k).fit(train_X, train_y)
    results.append({
        'k': k,
        'RMSE': math.sqrt(mean_squared_error(valid_y, knn.predict(valid_X)))
    })

# Convert results to a pandas data frame
results = pd.DataFrame(results)
results

| | k | RMSE |
|---|---|---|
| 0 | 1 | 5.403228 |
| 1 | 2 | 4.778562 |
| 2 | 3 | 4.671801 |
| 3 | 4 | 4.789219 |
| 4 | 5 | 5.014823 |


The best value is `k = 3`, which gives the lowest validation RMSE (4.67). This means that, for a given record, MEDV is predicted by averaging the MEDV values of the 3 closest training records, where closeness is measured by the Euclidean distance between the vectors of normalized predictor values.

__7.3.b.__ Predict the MEDV for a tract with the following information, using the best k:

CRIM = 0.2, ZN = 0, INDUS = 7, CHAS = 0, NOX = 0.538, RM = 6, AGE = 62, DIS = 4.7, RAD = 4, TAX = 307, PTRATIO = 21, LSTAT = 10.

__Answer__

In [30]:
# new tract
newTract = pd.DataFrame([{'CRIM': 0.2, 'ZN': 0, 'INDUS': 7, 'CHAS': 0, 'NOX': 0.538, 'RM': 6, 'AGE': 62, 'DIS': 4.7, 'RAD': 4, 
                          'TAX': 307, 'PTRATIO': 21, 'LSTAT': 10}],
                       columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT'])
newTract

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0 | 7 | 0 | 0.538 | 6 | 62 | 4.7 | 4 | 307 | 21 | 10 |


In [31]:
# normalize new record
newTractNorm = pd.DataFrame(scaler.transform(newTract), 
                               columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT'])

newTractNorm

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.403622 | -0.481603 | -0.620687 | -0.293294 | -0.153758 | -0.358814 | -0.243285 | 0.400608 | -0.640284 | -0.604731 | 1.197866 | -0.421956 |


In [32]:
# train knn model with k=3
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(train_X, train_y)


KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                    weights='uniform')

In [33]:
# predict value of new tract
knn.predict(newTractNorm)

array([18.76666667])

The predicted median value (MEDV) for the new tract is 18.77, that is, about $18,770, since MEDV is recorded in thousands of dollars.

__7.3.c.__ If we used the above k-NN algorithm to score the training data, what would be the error of the training set?

__Answer__

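With k = 1 the training error would be zero, because each training record is its own nearest neighbor (distance 0) and is therefore predicted by its own MEDV, barring duplicate predictor vectors with different outcomes. With the best k = 3 found above, each training record is still one of its own 3 nearest neighbors, so the training error is generally not exactly zero, but it is biased downward and understates the error to be expected on new data.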
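
The effect of scoring the training data with itself can be illustrated with a small sketch. The five 1-D records below are hypothetical, and `knn_predict` and `training_rmse` are illustrative helpers, not part of the notebook or of scikit-learn. The sketch compares the training RMSE for k = 1 and k = 3.

```python
import math

# Hypothetical 1-D training data: (predictor value, outcome)
train = [(1.0, 10.0), (2.0, 14.0), (3.0, 11.0), (4.0, 20.0), (5.0, 18.0)]


def knn_predict(x, k):
    """Average the outcomes of the k training records closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def training_rmse(k):
    """RMSE obtained when the training records themselves are scored."""
    squared_errors = [(y - knn_predict(x, k)) ** 2 for x, y in train]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rmse_k1 = training_rmse(1)  # every record is matched to itself
rmse_k3 = training_rmse(3)  # every record is averaged with two neighbors
print(rmse_k1, rmse_k3)
```

With k = 1 every record is predicted by its own outcome, so the error is zero. With k = 3 the record's own outcome is only one of the three values averaged, so the error is positive but still optimistic.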

__7.3.d.__ Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?

__Answer__

The reported validation error is the error of the best k among the several values of k tried on the validation data, so that k is tuned to the particular validation set used to select it. The same k may be less suitable for new data, which means the validation error tends to understate the error on new data.

__7.3.e.__ If the purpose is to predict MEDV for several thousands of new tracts, what would be the disadvantage of using k-NN prediction? List the operations that the algorithm goes through in order to produce each prediction.

__Answer__

k-NN does not yield a fitted rule that can simply be applied to each new record. Instead, the whole "model building" process, a search through the training data, has to be repeated for every new record to be predicted, which is slow when several thousand tracts must be scored.

Specifically, for each new record the algorithm must calculate the distance from the new record to every training record, select the k closest training records, average the MEDV values of those k records, and assign that average to the new record as its prediction. This process is then repeated for each of the new records.
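
The operations listed above can be sketched as a brute-force search in plain Python. This is an illustrative sketch, not the way scikit-learn implements `KNeighborsRegressor`; the function name `knn_predict` and the small data set are hypothetical, and the predictors are assumed to be normalized already.

```python
import math


def knn_predict(train_X, train_y, new_x, k):
    """Predict the outcome for one new record by brute-force k-NN."""
    # 1. Compute the Euclidean distance to every training record
    distances = [math.dist(row, new_x) for row in train_X]
    # 2. Find the indices of the k closest training records
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # 3. Average their outcome values
    return sum(train_y[i] for i in nearest) / k


# Hypothetical, already-normalized training data
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [20.0, 22.0, 24.0, 50.0]

# 4. Repeat the whole search for every new record
new_records = [(0.1, 0.1), (4.0, 4.0)]
predictions = [knn_predict(train_X, train_y, x, k=3) for x in new_records]
print(predictions)
```

In this sketch the work per prediction grows with the number of training records, so scoring thousands of tracts means thousands of full passes over the training set. Library implementations can reduce this cost with tree-based neighbor searches, but the search still happens at prediction time.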