# **KNN for a regression problem**
*continuous and categorical variables*

Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Subset of columns used for the example: LotArea, OverallQual, YearBuilt, RoofStyle, CentralAir, SalePrice.

LotArea, OverallQual, and YearBuilt are treated as continuous variables; RoofStyle and CentralAir are categorical.

In [3]:
# importing necessary libraries
import numpy as np
import csv
import matplotlib.pyplot as plt
import pandas as pd

In [5]:
# reading given data and visualizing it
df = pd.read_csv("mixedDataExample.tsv",sep='\t')
df.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,RoofStyle,CentralAir,SalePrice
0,8450,7,2003,Gable,Y,208500
1,9600,6,1976,Gable,Y,181500
2,11250,7,2001,Gable,Y,223500
3,9550,7,1915,Gable,Y,140000
4,14260,8,2000,Gable,Y,250000


KNN is a distance-based method. Usually, the Euclidean distance is used to compare records. Features measured on very different scales contribute unequally to the distance: a feature with large values (such as LotArea) dominates the others. Thus, standardization (subtracting the mean and dividing by the standard deviation) is required.
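To illustrate the scale problem, the sketch below compares the first two records shown above using their raw values. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Raw values of records 0 and 1 from the table above:
# (LotArea, OverallQual, YearBuilt)
record_a = np.array([8450.0, 7.0, 2003.0])
record_b = np.array([9600.0, 6.0, 1976.0])

# Squared per-feature differences that make up the Euclidean distance
squared_diffs = (record_a - record_b) ** 2
raw_distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance contributed by each feature
shares = squared_diffs / squared_diffs.sum()

print(f"raw distance: {raw_distance:.2f}")
print("share per feature (LotArea, OverallQual, YearBuilt):", shares.round(4))
```

Without scaling, LotArea accounts for more than 99.9% of the squared distance between these two records, so OverallQual and YearBuilt barely influence which neighbours are chosen.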

In [94]:
# standardize a column using the mean and standard deviation of the initial (train) dataset,
# so that the same transformation can later be applied to a new record
def standardize_data(processed_dataset, initial_dataset, column_name):
    mean = initial_dataset[column_name].mean()
    std = initial_dataset[column_name].std()
    processed_dataset[column_name] = (processed_dataset[column_name] - mean) / std

In [95]:
# standardize data 
prep_data = df.copy()
standardize_data(prep_data, df, "LotArea")
standardize_data(prep_data, df, "OverallQual")
standardize_data(prep_data, df, "YearBuilt")
prep_data.head(10)

Unnamed: 0,LotArea,OverallQual,YearBuilt,RoofStyle,CentralAir,SalePrice
0,-0.207071,0.651256,1.050634,Gable,Y,208500
1,-0.091855,-0.071812,0.15668,Gable,Y,181500
2,0.073455,0.651256,0.984415,Gable,Y,223500
3,-0.096864,0.651256,-1.862993,Gable,Y,140000
4,0.37502,1.374324,0.951306,Gable,Y,250000
5,0.360493,-0.794879,0.71954,Gable,Y,143000
6,-0.043364,1.374324,1.083743,Gable,Y,307000
7,-0.013508,0.651256,0.057352,Gable,Y,200000
8,-0.440508,0.651256,-1.333243,Gable,Y,129900
9,-0.310264,-0.794879,-1.068368,Gable,Y,118000


In distance-based methods, categorical variables with no ordinal relationship between the categories can be converted to a one-hot encoding. Boolean variables are encoded as 1 (true) and 0 (false).
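A small sketch of why one-hot encoding suits unordered categories: every pair of distinct categories ends up at the same Euclidean distance, whereas integer codes such as 0, 1, 2 would make some pairs look further apart than others. The toy frame below is an assumption made for illustration; only the category names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with three roof styles (illustrative, not the full dataset)
roofs = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Flat"]})

# One-hot encode; dtype=int gives 0/1 columns instead of booleans
encoded = pd.get_dummies(roofs["RoofStyle"], prefix="roof", dtype=int)
vectors = encoded.to_numpy()

# Pairwise Euclidean distances between the encoded categories
dist = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)

print(encoded)
print(dist.round(3))
```

Each off-diagonal entry of `dist` equals the square root of 2, so no roof style is treated as closer to another by accident of encoding.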

In [96]:
prepared_data = prep_data.copy()
# change boolean to numerical (Y -> 1, N -> 0); direct assignment avoids chained indexing
prepared_data["CentralAir"] = prepared_data["CentralAir"].map({"Y": 1, "N": 0})
# apply one hot encoding to roof style (dtype=int yields 0/1 rather than booleans)
roof_one_hot = pd.get_dummies(prepared_data["RoofStyle"], prefix='roof', dtype=int)
# concat prepared data and roof_one_hot and drop variable "RoofStyle"
salePrice_column = prepared_data["SalePrice"]
prepared_data = prepared_data.drop(columns = ["SalePrice"])
prepared_data = pd.concat([prepared_data, roof_one_hot, salePrice_column], axis = 1)
prepared_data = prepared_data.drop(columns = ["RoofStyle"])
prepared_data.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,CentralAir,roof_Flat,roof_Gable,roof_Gambrel,roof_Hip,roof_Mansard,roof_Shed,SalePrice
0,-0.207071,0.651256,1.050634,1,0,1,0,0,0,0,208500
1,-0.091855,-0.071812,0.15668,1,0,1,0,0,0,0,181500
2,0.073455,0.651256,0.984415,1,0,1,0,0,0,0,223500
3,-0.096864,0.651256,-1.862993,1,0,1,0,0,0,0,140000
4,0.37502,1.374324,0.951306,1,0,1,0,0,0,0,250000


In [97]:
# initialize K value (number of nearest neighbours)
K = 8

In [98]:
# query example (categorical variables are already encoded: CentralAir as 0/1, RoofStyle as one-hot)
query = pd.DataFrame(np.array([[18500, 5, 1960, 1, 0, 1, 0, 0, 0, 0]]),
                     columns=["LotArea", "OverallQual", "YearBuilt", "CentralAir", "roof_Flat", "roof_Gable", "roof_Gambrel", "roof_Hip", "roof_Mansard", "roof_Shed"])
# standardize query 
prepared_query = query.copy()
standardize_data(prepared_query, df, "LotArea")
standardize_data(prepared_query, df, "OverallQual")
standardize_data(prepared_query, df, "YearBuilt")
prepared_query.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,CentralAir,roof_Flat,roof_Gable,roof_Gambrel,roof_Hip,roof_Mansard,roof_Shed
0,0.799816,-0.794879,-0.37307,1,0,1,0,0,0,0


In [99]:
# calculate the distance between the query and each record in the data
distances_indices = []

for i in range(np.shape(prepared_data)[0]):
    # use  Euclidean distance
    distance = np.linalg.norm(prepared_data.iloc[[i],:-1].to_numpy()-prepared_query.iloc[[0]].to_numpy())
    distances_indices.append((distance, i))

sorted_distances_indices = sorted(distances_indices)
k_nearest_distances_indices = sorted_distances_indices[:K]
# print k nearest distances and indices
print(k_nearest_distances_indices)
# select the k most similar records from the original (unscaled) data
indices = [index for _, index in k_nearest_distances_indices]

df_nearest = df.iloc[indices]
df_nearest

[(0.08165376753649693, 1151), (0.16319336941084817, 41), (0.23984351495167971, 28), (0.2625506123314917, 887), (0.37893494555279605, 556), (0.39777345772510625, 1077), (0.45309915443624454, 668), (0.46591719361260137, 1173)]


Unnamed: 0,LotArea,OverallQual,YearBuilt,RoofStyle,CentralAir,SalePrice
1151,17755,5,1959,Gable,Y,149900
41,16905,5,1959,Gable,Y,170000
28,16321,5,1957,Gable,Y,207500
887,16466,5,1955,Gable,Y,135500
556,14850,5,1957,Gable,Y,141000
1077,15870,5,1969,Gable,Y,138800
668,14175,5,1956,Gable,Y,168000
1173,18030,5,1946,Gable,Y,200500


After the distances are calculated and the K nearest neighbours are selected, the predicted price is the mean SalePrice of those K neighbours.

In [100]:
print("Predicted price for ")
print(query)
print("predicted price: {0:7.2f}".format(df_nearest["SalePrice"].mean()))

Predicted price for 
   LotArea  OverallQual  YearBuilt  ...  roof_Hip  roof_Mansard  roof_Shed
0    18500            5       1960  ...         0             0          0

[1 rows x 10 columns]
predicted price: 163900.00
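The distance loop and the averaging step can also be written in vectorized form. The following is a minimal sketch, assuming the features are already scaled and encoded as a NumPy array; `knn_predict` and the toy values are hypothetical and do not come from the notebook.

```python
import numpy as np

def knn_predict(features, targets, query, k):
    """Return the mean target of the k rows of `features` closest to `query`."""
    # Euclidean distance from the query to every row at once
    distances = np.linalg.norm(features - query, axis=1)
    # Indices of the k smallest distances (stable sort keeps ties in row order)
    nearest = np.argsort(distances, kind="stable")[:k]
    return targets[nearest].mean(), nearest

# Tiny made-up example: 4 records with 2 already-scaled features
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
targets = np.array([100.0, 200.0, 300.0, 900.0])

prediction, nearest = knn_predict(features, targets, np.array([0.1, 0.0]), k=2)
print(prediction, nearest)
```

With the notebook's data, the same function could be called with `prepared_data.iloc[:, :-1].to_numpy(dtype=float)` as `features` and `df["SalePrice"].to_numpy()` as `targets`, provided the column order matches the query.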


**To do**: 
Evaluate the accuracy of the method by splitting the data into train and test datasets. Use the train dataset to compute the standardization parameters, then predict SalePrice for the houses in the test dataset. Calculate the mean squared error between the predicted and actual values on the test dataset.
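One possible starting point for this exercise is sketched below. It is not a reference solution. To stay self-contained it uses synthetic data (hypothetical values, not the Kaggle file) and only the three continuous columns. The 80/20 split ratio and the train-mean baseline are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (hypothetical values, NOT the Kaggle file)
n = 200
data = pd.DataFrame({
    "LotArea": rng.uniform(5000, 20000, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "YearBuilt": rng.integers(1900, 2011, n).astype(float),
})
data["SalePrice"] = (
    5 * data["LotArea"] + 20000 * data["OverallQual"]
    + 500 * (data["YearBuilt"] - 1900) + rng.normal(0, 10000, n)
)

# Shuffle and split 80/20 into train and test sets
shuffled = data.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = int(0.8 * n)
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

feature_cols = ["LotArea", "OverallQual", "YearBuilt"]

# Standardization parameters come from the train set only
mean, std = train[feature_cols].mean(), train[feature_cols].std()
x_train = ((train[feature_cols] - mean) / std).to_numpy()
x_test = ((test[feature_cols] - mean) / std).to_numpy()
y_train = train["SalePrice"].to_numpy()
y_test = test["SalePrice"].to_numpy()

# Predict each test record as the mean price of its K nearest train records
K = 8
predictions = np.empty(len(x_test))
for i, row in enumerate(x_test):
    distances = np.linalg.norm(x_train - row, axis=1)
    nearest = np.argsort(distances)[:K]
    predictions[i] = y_train[nearest].mean()

# Mean squared error on the test set, next to a naive train-mean baseline
mse = np.mean((predictions - y_test) ** 2)
baseline_mse = np.mean((y_train.mean() - y_test) ** 2)
print(f"KNN MSE: {mse:.0f}, baseline (train mean) MSE: {baseline_mse:.0f}")
```

Computing the mean and standard deviation from the train set alone keeps information about the test records out of the transformation, which mirrors how `standardize_data` is applied to the query above.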