# k-NN Regression - Predicting Housing Median Prices

This program is a solution to Problem 7.3 in Chapter 7 of the following book:

Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python, First Edition.

Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel

© 2020 John Wiley & Sons, Inc. Published 2020 by John Wiley & Sons, Inc.

## Importing Libraries

In [112]:
import numpy as np
import pandas as pd

import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

Printing versions of libraries

In [113]:
print('numpy version: {}'.format(np.__version__))
print('pandas version: {}'.format(pd.__version__))
print('sklearn version: {}'.format(skl.__version__))

numpy version: 1.23.5
pandas version: 1.5.3
sklearn version: 1.2.1


## Loading Dataset

In [114]:
df = pd.read_csv('BostonHousing.csv')
print('Shape: ', df.shape)
df.head()

Shape:  (506, 14)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV,CAT. MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0,0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6,0
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7,1
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4,1
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2,1


Dropping the 'CAT. MEDV' column because it is not required for this prediction task; the numerical outcome MEDV is the target.

In [115]:
df = df.drop(['CAT. MEDV'], axis=1)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2


Partitioning the data into training (60%) and validation (40%) sets.

In [116]:
trainData, validData = train_test_split(df, test_size=0.4, random_state=26)

## Normalizing Dataset

In [117]:
# initialize normalized training, validation, and complete data frames
# use the training data to learn the transformation.
scaler = preprocessing.StandardScaler()
scaler.fit(trainData.iloc[:, :-1]) 

# Transform the full dataset with the scaler fitted on the training data
dfNorm = pd.concat([pd.DataFrame(
    scaler.transform(df.iloc[:, :-1]), 
    columns= df.columns.tolist()[:-1]),
                       df[['MEDV']]], axis=1)

display(dfNorm.head())

# Select rows by index label (the labels come from df's default RangeIndex)
trainNorm = dfNorm.loc[trainData.index]
validNorm = dfNorm.loc[validData.index]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,-0.449562,0.409767,-1.324554,-0.251312,-0.157308,0.383503,-0.187168,0.201756,-0.979759,-0.667473,-1.533571,-1.063564,24.0
1,-0.446734,-0.428676,-0.617972,-0.251312,-0.793718,0.168181,0.319255,0.63647,-0.863705,-0.993859,-0.334603,-0.493105,21.6
2,-0.446737,-0.428676,-0.617972,-0.251312,-0.793718,1.236404,-0.338726,0.63647,-0.863705,-0.993859,-0.334603,-1.193837,34.7
3,-0.446052,-0.428676,-1.343851,-0.251312,-0.895175,0.974941,-0.904293,1.17923,-0.747651,-1.114743,0.097025,-1.343308,33.4
4,-0.441111,-0.428676,-1.343851,-0.251312,-0.895175,1.183273,-0.593785,1.17923,-0.747651,-1.114743,0.097025,-1.015568,36.2
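
As a minimal sketch of what the standardization step computes, the example below applies z-scores by hand with NumPy. The toy arrays are hypothetical and are not taken from BostonHousing.csv. The key point is that the mean and standard deviation are learned from the training rows only and then reused for every other partition; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical toy data: two predictors, three training rows, two validation rows.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
valid = np.array([[2.0, 40.0], [4.0, 20.0]])

# Learn the mean and standard deviation from the training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both partitions.
train_z = (train - mu) / sigma
valid_z = (valid - mu) / sigma

print(train_z.mean(axis=0))  # approximately 0 for each column
print(valid_z)               # validation rows are not centered on 0
```

The validation rows do not end up with mean 0 and standard deviation 1, which is expected: they are expressed on the training set's scale.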


## Finding best value of k

In [118]:
train_X = trainNorm.drop('MEDV', axis=1)
train_y = trainNorm['MEDV']
valid_X = validNorm.drop('MEDV', axis=1)
valid_y = validNorm['MEDV']

# Train a k-NN regressor for different values of k
results = []
for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k).fit(train_X, train_y)
    results.append({
        'k': k,
        'RMSE': mean_squared_error(valid_y, knn.predict(valid_X), squared=False)
    })
    
# Convert results to a pandas data frame
results = pd.DataFrame(results)
print(results)

   k      RMSE
0  1  5.273145
1  2  4.668889
2  3  4.748908
3  4  5.017560
4  5  5.110195


## Result

The table above shows that the root-mean-squared error (RMSE) on the validation set is lowest at k = 2 (RMSE ≈ 4.67). Therefore, the best value of k is 2.
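
The same choice can be read off programmatically instead of by eye. The sketch below rebuilds the results table from the RMSE values printed above and picks the row with the smallest RMSE; the variable names `best` and `best_k` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# RMSE values copied from the validation results table above.
results = pd.DataFrame({
    'k': [1, 2, 3, 4, 5],
    'RMSE': [5.273145, 4.668889, 4.748908, 5.017560, 5.110195],
})

# idxmin returns the row label of the smallest RMSE.
best = results.loc[results['RMSE'].idxmin()]
best_k = int(best['k'])
print('Best k:', best_k)
```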

## Predicting MEDV for a new tract

Creating a data frame for the new tract

In [119]:
newTract = pd.DataFrame([{'CRIM': 0.2, 
                          'ZN': 0, 
                          'INDUS': 7, 
                          'CHAS': 0, 
                          'NOX': 0.538, 
                          'RM': 6, 
                          'AGE': 62, 
                          'DIS': 4.7, 
                          'RAD': 4, 
                          'TAX': 307, 
                          'PTRATIO': 21, 
                          'LSTAT': 10}])
display(newTract)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.2,0,7,0,0.538,6,62,4.7,4,307,21,10


Normalizing the new tract with the scaler fitted on the training data

In [120]:
newTractNorm = pd.DataFrame(scaler.transform(newTract), columns=newTract.columns)
display(newTractNorm)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,-0.423468,-0.428676,-0.628363,-0.251312,-0.157308,-0.420461,-0.305457,0.504088,-0.631596,-0.600987,1.200076,-0.375174


Predicting the MEDV for the new tract using the best k = 2

In [121]:
knn = KNeighborsRegressor(n_neighbors=2).fit(train_X, train_y)
print('MEDV: ',knn.predict(newTractNorm)[0])

MEDV:  20.15


Scoring the training data to find the training-set error

In [122]:
print('RMSE', mean_squared_error(train_y, knn.predict(train_X), squared=False))

RMSE 2.401001922328073


Computing validation data error

In [123]:
print('RMSE', mean_squared_error(valid_y, knn.predict(valid_X), squared=False))

RMSE 4.668888750815089


## Result

Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?

The validation set was used to choose k: k = 2 was selected precisely because it gave the lowest RMSE on these records. The reported validation error is therefore the minimum over several candidate models, which biases it downward. As a result, it is overly optimistic compared to the error to expect when applying this k-NN predictor to new data.
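
One common way to get a less optimistic estimate is to hold out a third partition that plays no part in fitting the model or in choosing k. The sketch below illustrates the idea on synthetic data; the data, the 50/30/20 proportions, and the helper `rmse` are assumptions for illustration and are not part of the original solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): 500 rows, 3 predictors, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Assumed proportions: 50% training, 30% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.4, random_state=1)

def rmse(model, X_part, y_part):
    """Root-mean-squared error of a fitted model on one partition."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))

# Fit on the training partition and choose k on the validation partition only.
models = {k: KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
          for k in range(1, 6)}
valid_rmse = {k: rmse(m, X_valid, y_valid) for k, m in models.items()}
best_k = min(valid_rmse, key=valid_rmse.get)

# The test partition was never used for fitting or for choosing k.
test_rmse = rmse(models[best_k], X_test, y_test)
print('best k:', best_k)
print('validation RMSE:', valid_rmse[best_k])
print('test RMSE:', test_rmse)
```

The test RMSE is the figure to quote as the expected error on new data, since nothing about the chosen model was tuned against it.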

The disadvantage of using k-NN to predict MEDV for several thousand new tracts is the cost of prediction. k-NN is a "lazy learner": there is no real model-fitting step, and the time-consuming computation is deferred to prediction time. For every record to be predicted, its distance to every training record is computed only at prediction time, so scoring thousands of tracts can be slow and may not be feasible when predictions are needed in real time with limited computational resources.
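
To make the per-prediction cost concrete, here is a brute-force sketch of k-NN prediction in NumPy. The function `knn_predict` and the toy data are hypothetical; the function averages the targets of the k nearest training rows by Euclidean distance with equal weights. Library implementations may use tree-based indexes to cut the search cost, so treat this as an illustration of the idea rather than of any library's internals.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict one record as the mean target of its k nearest training rows."""
    # One Euclidean distance per training record: the work grows with
    # the number of training rows for every single query.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return float(train_y[nearest].mean())

# Hypothetical toy data: four training records with two predictors.
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
train_y = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_predict(train_X, train_y, np.array([0.9, 0.1]), k=2)
print(prediction)  # 15.0: the mean of the two closest targets, 20.0 and 10.0
```

Nothing is precomputed here, so every new tract repeats the full pass over the training data.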