# 7.3 Predicting Housing Median Prices

The file BostonHousing.csv contains information on over 500 census tracts in Boston, where for each tract multiple variables are recorded. The last column (CAT.MEDV) was derived from MEDV, such that it obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the median value (MEDV) of a tract, given the information in the first 12 columns.

Partition the data into training (60%) and validation (40%) sets.

- a. Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV column), trying values of k from 1 to 5. Make sure to normalize the data. What is the best k? What does it mean?
- b. Predict the MEDV for a tract with the following information, using the best k:

| CRIM | ZN   | INDUS | CHAS | NOX  | RM   | AGE  | DIS  | RAD  | TAX  | PTRATIO | LSTAT |
| ---  | ---  | ---   | ---  | ---  | ---  | ---  | ---  | ---  | ---  | ---     | ---   |
| 0.2  | 0    | 7     | 0    | 0.538| 6    | 62   | 4.7  | 4    | 307  | 21      | 10    | 

- c. If we used the above k-NN algorithm to score the training data, what would be the error of the training set?
- d. Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?
- e. If the purpose is to predict MEDV for several thousands of new tracts, what would be the disadvantage of using k-NN prediction? List the operations that the algorithm goes through in order to produce each prediction.

Notes
1. If you are interested in reproducibility of results, check the accuracy only for each odd k.
2. Partitioning such a small dataset is unwise in practice, as results will heavily rely on the particular partition. For instance, if you use a different partitioning, you might obtain a different “optimal” k. We use this example for illustration only.
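
The operations that part (e) asks about can be sketched in a few lines. This is a minimal brute-force illustration, not what scikit-learn does internally; the function name `knn_predict` and the toy data are made up for this example, and the records are assumed to be on a common scale already.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Brute-force k-NN regression for a single query record (illustrative only)."""
    # 1. Compute the Euclidean distance from the query to every training record
    distances = [math.dist(row, query) for row in train_X]
    # 2. Sort the training records by distance and keep the k closest
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # 3. Average the outcome values of those k neighbors
    return sum(train_y[i] for i in nearest) / k

# Toy, made-up data for illustration only
toy_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
toy_y = [10.0, 20.0, 30.0, 100.0]

toy_prediction = knn_predict(toy_X, toy_y, (0.1, 0.1), k=3)
print(toy_prediction)
```

Every prediction repeats all three steps against the full training set, which is why scoring thousands of new tracts can become slow.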
 

In [2]:
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
#
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge
from dmba import regressionSummary, exhaustive_search, gainsChart, liftChart
from dmba import backward_elimination, forward_selection, stepwise_selection
from dmba import adjusted_r2_score, AIC_score, BIC_score
# additional libraries needed for k-NN regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error
import math

no display found. Using non-interactive Agg backend


In [3]:
bostonhousing_df = pd.read_csv('BostonHousing.csv')
bostonhousing_df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV,CAT. MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,4.98,24.0,0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6,0
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7,1
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4,1
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,9.67,22.4,0
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,9.08,20.6,0
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,5.64,23.9,0
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,6.48,22.0,0


In [4]:
# create a list of predictor variables by removing the outcome columns
excludeColumns = ('CAT. MEDV', 'MEDV')
predictors = [s for s in bostonhousing_df.columns if s not in excludeColumns]
outcome = 'MEDV'

# partition data
X = bostonhousing_df[predictors]
y = bostonhousing_df[outcome]

trainData, validData = train_test_split(bostonhousing_df, test_size=0.4, random_state=26)

## newTract
# | CRIM | ZN   | INDUS | CHAS | NOX  | RM   | AGE  | DIS  | RAD  | TAX  | PTRATIO | LSTAT |
# | ---  | ---  | ---   | ---  | ---  | ---  | ---  | ---  | ---  | ---  | ---     | ---   |
# | 0.2  | 0    | 7     | 0    | 0.538| 6    | 62   | 4.7  | 4    | 307  | 21      | 10    | 
newTract = pd.DataFrame([{'CRIM': 0.2, 'ZN': 0, 'INDUS': 7, 'CHAS': 0, 'NOX': 0.538, 'RM': 6, 'AGE': 62, 'DIS': 4.7, 'RAD': 4, 'TAX': 307, 'PTRATIO': 21, 'LSTAT': 10}])

print('Training  : ', trainData.shape)
print('Validation: ', validData.shape)


Training  :  (303, 14)
Validation:  (203, 14)


In [5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

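# Normalize the data: standardize every column to zero mean and unit variance.
# Note: the scaler is fit on all rows (training and validation) and on MEDV too,
# so MEDV and any error computed from it are in standardized units.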
scaler = StandardScaler()
normbh_df = pd.DataFrame(scaler.fit_transform(bostonhousing_df),index=bostonhousing_df.index, columns=bostonhousing_df.columns)
normbh_df


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV,CAT. MEDV
0,-0.419782,0.284830,-1.287909,-0.272599,-0.144217,0.413672,-0.120013,0.140214,-0.982843,-0.666608,-1.459000,-1.075562,0.159686,-0.446153
1,-0.417339,-0.487722,-0.593381,-0.272599,-0.740262,0.194274,0.367166,0.557160,-0.867883,-0.987329,-0.303094,-0.492439,-0.101524,-0.446153
2,-0.417342,-0.487722,-0.593381,-0.272599,-0.740262,1.282714,-0.265812,0.557160,-0.867883,-0.987329,-0.303094,-1.208727,1.324247,2.241386
3,-0.416750,-0.487722,-1.306878,-0.272599,-0.835284,1.016303,-0.809889,1.077737,-0.752922,-1.106115,0.113032,-1.361517,1.182758,2.241386
4,-0.412482,-0.487722,-1.306878,-0.272599,-0.835284,1.228577,-0.511180,1.077737,-0.752922,-1.106115,0.113032,-1.026501,1.487503,2.241386
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,-0.413229,-0.487722,0.115738,-0.272599,0.158124,0.439316,0.018673,-0.625796,-0.982843,-0.803212,1.176466,-0.418147,-0.014454,-0.446153
502,-0.415249,-0.487722,0.115738,-0.272599,0.158124,-0.234548,0.288933,-0.716639,-0.982843,-0.803212,1.176466,-0.500850,-0.210362,-0.446153
503,-0.413447,-0.487722,0.115738,-0.272599,0.158124,0.984960,0.797449,-0.773684,-0.982843,-0.803212,1.176466,-0.983048,0.148802,-0.446153
504,-0.407764,-0.487722,0.115738,-0.272599,0.158124,0.725672,0.736996,-0.668437,-0.982843,-0.803212,1.176466,-0.865302,-0.057989,-0.446153
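
A common alternative is to fit the scaler on the training predictors only, so that no validation information leaks into the normalization and the outcome stays in its original units. The sketch below uses made-up stand-in data (`demo_df`, `x1`, `x2`, `y` are hypothetical names and values, not the Boston data) and is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data; column names and values are illustrative only
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'x1': rng.normal(50, 10, 100),
    'x2': rng.normal(0.5, 0.1, 100),
    'y': rng.normal(22, 9, 100),
})
demo_predictors = ['x1', 'x2']

demo_train, demo_valid = train_test_split(demo_df, test_size=0.4, random_state=1)

# Fit the scaler on the training predictors only
demo_scaler = StandardScaler().fit(demo_train[demo_predictors])

# Apply the same transformation to both partitions; the outcome y is left unscaled
demo_train_X = pd.DataFrame(demo_scaler.transform(demo_train[demo_predictors]),
                            index=demo_train.index, columns=demo_predictors)
demo_valid_X = pd.DataFrame(demo_scaler.transform(demo_valid[demo_predictors]),
                            index=demo_valid.index, columns=demo_predictors)

# Training columns have mean ~0; validation columns are close to 0 but not exactly 0
print(demo_train_X.mean().round(6))
print(demo_valid_X.mean().round(6))
```

With this setup the RMSE would be reported in the original units of the outcome rather than in standard deviations.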


In [6]:
# use KNeighborsRegressor from scikit-learn for k-NN regression
from sklearn.neighbors import NearestNeighbors, KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Select the normalized rows that belong to each partition
trainNorm = normbh_df.loc[trainData.index]
validNorm = normbh_df.loc[validData.index]

# create a list of predictor variables by removing the outcome columns
excludeColumns = ('CAT. MEDV', 'MEDV')
predictors = [s for s in normbh_df.columns if s not in excludeColumns]
outcome = 'MEDV'

train_X = trainNorm[predictors]
train_y = trainNorm[outcome]
valid_X = validNorm[predictors]
valid_y = validNorm[outcome]

# Train a k-NN regressor for each value of k from 1 to 5
results = []

for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k).fit(train_X, train_y)
    results.append({
        'k': k,
        # validation RMSE (MEDV is standardized, so this is in standard deviations)
        'RMSE': math.sqrt(mean_squared_error(valid_y, knn.predict(valid_X)))
        })

# Convert results to a pandas data frame
results = pd.DataFrame(results)
print(results)

   k      RMSE
0  1  0.571678
1  2  0.498648
2  3  0.513700
3  4  0.540960
4  5  0.555734
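
The best k is the one with the lowest validation RMSE. As a small sketch, the values printed above can be re-entered by hand and the minimum picked programmatically; `demo_results`, `best_row`, and `best_k` are names introduced here for illustration only.

```python
import pandas as pd

# RMSE values copied from the validation results printed above
demo_results = pd.DataFrame({
    'k': [1, 2, 3, 4, 5],
    'RMSE': [0.571678, 0.498648, 0.513700, 0.540960, 0.555734],
})

# Row with the smallest validation RMSE
best_row = demo_results.loc[demo_results['RMSE'].idxmin()]
best_k = int(best_row['k'])
print(best_k)  # prints 2
```

In the notebook itself the same lookup could be applied directly to the `results` data frame.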


### Predicting

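The model with k=2 has the lowest validation RMSE (0.499), followed by k=3 (0.514); k=1 (0.572) and k=5 (0.556) perform worst. The best k is therefore 2, meaning that averaging the MEDV of the two nearest training tracts gave the most accurate predictions on this validation partition. These RMSE values are in standardized units because MEDV was scaled along with the predictors.

To keep the lookup simple, the next cell uses k=1 and retrieves the single training tract closest to newTract, even though k=1 did not score best.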

In [7]:
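# Find the single nearest training record (k=1) using the 12 predictor columns
knn = NearestNeighbors(n_neighbors=1)
knn.fit(trainNorm.iloc[:, 0:12])
# Caution: newTract holds raw values while trainNorm is standardized, so newTract
# should be put on the same scale for this distance search to be meaningful
distances, indices = knn.kneighbors(newTract)

# indices is a list of lists, we are only interested in the first element
trainNorm.iloc[indices[0], :]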

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV,CAT. MEDV
491,-0.408212,-0.487722,2.422565,-0.272599,0.469104,-0.429726,1.074822,-0.916009,-0.637962,1.798194,0.76034,0.759313,-0.972224,-0.446153
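
A new record has to be transformed with the same scaler as the training predictors before its neighbors are searched. The following is a minimal sketch under stated assumptions: the data is randomly generated (only three columns, borrowing the names RM, LSTAT, and TAX for readability), all `demo_*` names are hypothetical, and `n_neighbors=2` mirrors the lowest-RMSE k from the validation table. It does not reproduce the notebook's numbers.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in training data (not the Boston data)
rng = np.random.default_rng(1)
demo_cols = ['RM', 'LSTAT', 'TAX']
demo_train_X = pd.DataFrame({
    'RM': rng.normal(6.3, 0.7, 60),
    'LSTAT': rng.normal(12.0, 6.0, 60),
    'TAX': rng.normal(400.0, 150.0, 60),
})
demo_train_y = pd.Series(rng.normal(22.0, 9.0, 60), name='MEDV')

# Fit the scaler on the training predictors and standardize them
demo_scaler = StandardScaler().fit(demo_train_X)
demo_train_norm = pd.DataFrame(demo_scaler.transform(demo_train_X), columns=demo_cols)

demo_knn = KNeighborsRegressor(n_neighbors=2).fit(demo_train_norm, demo_train_y)

# Standardize the new record with the SAME scaler before predicting
demo_new = pd.DataFrame([{'RM': 6.0, 'LSTAT': 10.0, 'TAX': 307.0}])
demo_new_norm = pd.DataFrame(demo_scaler.transform(demo_new), columns=demo_cols)

demo_pred = demo_knn.predict(demo_new_norm)
distances, indices = demo_knn.kneighbors(demo_new_norm)

# The prediction is the mean outcome of the two nearest training records
print(demo_pred[0], indices[0])
```

Because the outcome is left unscaled in this sketch, the prediction comes out directly in the units of the outcome variable.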

