# **KNN for a regression problem**
*continuous variables*

Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Subset of columns used in this example: LotArea, OverallQual, YearBuilt, SalePrice.

Data are interpreted as continuous variables.

In [1]:
# importing necessary libraries
import numpy as np
import csv
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# reading given data and visualizing it
df = pd.read_csv("../data/mixedDataExample.tsv",sep='\t')
df = df[['LotArea', 'OverallQual', 'YearBuilt', 'SalePrice']]
df.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,SalePrice
0,8450,7,2003,208500
1,9600,6,1976,181500
2,11250,7,2001,223500
3,9550,7,1915,140000
4,14260,8,2000,250000


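KNN is a distance-based method. Usually, the Euclidean distance is used to compare records. Features measured on very different scales (for example, 8450 for LotArea versus 7 for OverallQual) contribute unequally to the distance, so the large-scale feature dominates it. Each feature is therefore standardized (z-score): the feature mean is subtracted and the result is divided by the feature standard deviation.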

In [3]:
# standardize a column using the mean and standard deviation of the initial (train) dataset,
# so that the same transformation can later be applied to a new record (query)
def standardize_data(processed_dataset, initial_dataset, column_name):
    processed_dataset[column_name] = (processed_dataset[column_name] - initial_dataset[column_name].mean()) / initial_dataset[column_name].std()

In [4]:
# standardize data 
prepared_data = df.copy()
standardize_data(prepared_data, df, "LotArea")
standardize_data(prepared_data, df, "OverallQual")
standardize_data(prepared_data, df, "YearBuilt")
prepared_data.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,SalePrice
0,-0.207071,0.651256,1.050634,208500
1,-0.091855,-0.071812,0.15668,181500
2,0.073455,0.651256,0.984415,223500
3,-0.096864,0.651256,-1.862993,140000
4,0.37502,1.374324,0.951306,250000
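To make the effect of scale concrete, the sketch below compares rows 0 and 1 from the two tables above and reports how much each feature contributes to the squared Euclidean distance before and after standardization. The helper name `squared_share` is hypothetical and only used for this illustration; the numbers are copied from the printed outputs.

```python
import numpy as np

# Rows 0 and 1 as printed above: LotArea, OverallQual, YearBuilt
raw_a = np.array([8450.0, 7.0, 2003.0])
raw_b = np.array([9600.0, 6.0, 1976.0])
std_a = np.array([-0.207071, 0.651256, 1.050634])
std_b = np.array([-0.091855, -0.071812, 0.156680])

def squared_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature.

    Hypothetical helper for illustration only.
    """
    squared_diffs = (a - b) ** 2
    return squared_diffs / squared_diffs.sum()

raw_share = squared_share(raw_a, raw_b)
std_share = squared_share(std_a, std_b)

print("raw shares (LotArea, OverallQual, YearBuilt):", np.round(raw_share, 4))
print("standardized shares:                         ", np.round(std_share, 4))
```

On the raw values almost the entire distance comes from LotArea, whereas after standardization OverallQual and YearBuilt account for most of it.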


In [5]:
# initialize K value (number of nearest neighbours)
K = 8

In [7]:
# query example
query =  pd.DataFrame(np.array([[8500, 5, 2000]]),
                   columns=["LotArea","OverallQual", "YearBuilt"])
# standardize query 
prepared_query = query.copy()
standardize_data(prepared_query, df, "LotArea")
standardize_data(prepared_query, df, "OverallQual")
standardize_data(prepared_query, df, "YearBuilt")

In [8]:
# calculate the distance between the query and each record in the data
distances_indices = []

for i in range(np.shape(prepared_data)[0]):
    # use  Euclidean distance
    distance = np.linalg.norm(prepared_data.iloc[[i],:-1].to_numpy()-prepared_query.iloc[[0]].to_numpy())
    distances_indices.append((distance, i))

sorted_distances_indices = sorted(distances_indices)
k_nearest_distances_indices = sorted_distances_indices[:K]
# print k nearest distances and indices
print(k_nearest_distances_indices)
# show the k most similar records (the stored indices are positional, so use iloc)
indices = [index for _, index in k_nearest_distances_indices]

df_nearest = df.iloc[indices]
df_nearest

[(0.13689914249206395, 376), (0.19481153834817005, 838), (0.1986891328279999, 117), (0.20055785815601382, 1079), (0.20323160697296797, 1307), (0.21221561737670938, 1047), (0.2219761902134229, 880), (0.23197367002520608, 613)]


Unnamed: 0,LotArea,OverallQual,YearBuilt,SalePrice
376,8846,5,1996,148000
838,9525,5,1995,144000
117,8536,5,2006,155000
1079,8775,5,1994,126000
1307,8072,5,1994,138000
1047,9245,5,1994,145000
880,7024,5,2005,157000
613,8402,5,2007,147000


Once the distances are calculated and the K nearest neighbours selected, the predicted price is the mean SalePrice of those K neighbours.

In [9]:
print("Predicted price for ")
print(query)
print("predicted price: {0:7.2f}".format(df_nearest["SalePrice"].mean()))

Predicted price for 
   LotArea  OverallQual  YearBuilt
0     8500            5       2000
predicted price: 145000.00
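The steps above (standardize, compute distances, average the targets of the K nearest records) can also be collected into a single vectorized function. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: the name `knn_regress` and the variables `toy` and `toy_query` are hypothetical, and the toy frame contains only the five rows printed by `df.head()`.

```python
import numpy as np
import pandas as pd

def knn_regress(train, query, feature_cols, target_col, k):
    """Predict the target for a single-row query as the mean target of its k nearest train rows.

    Hypothetical helper: features are standardized with the train mean and std,
    and Euclidean distance is used, as in the cells above.
    """
    mean = train[feature_cols].mean()
    std = train[feature_cols].std()
    X = ((train[feature_cols] - mean) / std).to_numpy()
    q = ((query[feature_cols].iloc[0] - mean) / std).to_numpy()
    # one distance per train row, computed without an explicit Python loop
    distances = np.linalg.norm(X - q, axis=1)
    # positional indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return train[target_col].iloc[nearest].mean(), nearest

features = ["LotArea", "OverallQual", "YearBuilt"]
# the five rows shown by df.head()
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
# query with the exact features of row 0
toy_query = pd.DataFrame([[8450, 7, 2003]], columns=features)

prediction, nearest = knn_regress(toy, toy_query, features, "SalePrice", k=1)
print(prediction, nearest)
```

With `k=1` and a query identical to an existing row, the function returns that row's own SalePrice, which is a quick sanity check before applying it to the full dataset.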


**To do**: 
Evaluate the accuracy of the method by splitting the data into train and test datasets. Use the train dataset to compute the standardization parameters (mean and standard deviation) and predict SalePrice for the houses in the test dataset. Calculate the mean squared error between the predicted and actual values on the test dataset.
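
One possible way to approach this evaluation is sketched below. It is only an outline under stated assumptions: `evaluate_knn`, `test_fraction` and `seed` are hypothetical names, the split is a simple random permutation, and the data frame used to exercise the function is synthetic (randomly generated with an invented price formula), not the Kaggle dataset. To evaluate the real data, pass the notebook's `df` instead.

```python
import numpy as np
import pandas as pd

def evaluate_knn(data, feature_cols, target_col, k, test_fraction=0.2, seed=0):
    """Random train/test split, KNN regression on the test rows, and mean squared error.

    Hypothetical helper: standardization parameters come from the train rows only.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test = data.iloc[order[:n_test]]
    train = data.iloc[order[n_test:]]

    # fit the transformation on train, apply it to both train and test
    mean = train[feature_cols].mean()
    std = train[feature_cols].std()
    X_train = ((train[feature_cols] - mean) / std).to_numpy()
    X_test = ((test[feature_cols] - mean) / std).to_numpy()
    y_train = train[target_col].to_numpy()
    y_test = test[target_col].to_numpy()

    # pairwise Euclidean distances, shape (n_test, n_train)
    distances = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(distances, axis=1)[:, :k]
    predictions = y_train[nearest].mean(axis=1)

    mse = np.mean((predictions - y_test) ** 2)
    return mse, predictions, y_test

# Synthetic stand-in data (an assumption for illustration, not real house prices)
data_rng = np.random.default_rng(42)
n_rows = 200
synthetic = pd.DataFrame({
    "LotArea": data_rng.uniform(5000, 15000, n_rows),
    "OverallQual": data_rng.integers(1, 11, n_rows),
    "YearBuilt": data_rng.integers(1900, 2011, n_rows),
})
synthetic["SalePrice"] = (10 * synthetic["LotArea"]
                          + 20000 * synthetic["OverallQual"]
                          + data_rng.normal(0, 5000, n_rows))

feature_cols = ["LotArea", "OverallQual", "YearBuilt"]
mse, predictions, actual = evaluate_knn(synthetic, feature_cols, "SalePrice", k=8)
print("MSE: {0:.2f}  RMSE: {1:.2f}".format(mse, np.sqrt(mse)))
```

The square root of the MSE is printed as well because it is expressed in the same units as SalePrice, which makes it easier to judge. Repeating the evaluation for several values of K would show how sensitive the error is to that choice.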