# K Nearest Neighbor Tutorial
- Author: Congxin (David) Xu
- Date: 2020/12/21

## Description

This tutorial discusses how to implement a K Nearest Neighbor (KNN) model in `Python`.

## Package Dependency

- [`pandas`](https://pandas.pydata.org/)
  - We will mainly use `pandas` for data manipulation and visualization. 
- [`sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
  - Title: scikit-learn: machine learning in Python
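  - This is the package that contains the `sklearn.neighbors.KNeighborsClassifier` class, which performs K-Nearest-Neighbor classification. Its counterpart for a numeric response such as `price` is `sklearn.neighbors.KNeighborsRegressor`, which performs K-Nearest-Neighbor regression.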

## Use Case

- Solving regression-type problems
- Filling in values for missing periods
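
As a rough illustration of the regression use case, the sketch below fits a KNN regressor on a tiny made-up dataset (the numbers are invented for illustration and are not from the housing data). It assumes `sklearn.neighbors.KNeighborsRegressor` is the appropriate estimator for a numeric response; the prediction is simply the mean response of the nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up for illustration): one predictor, one numeric response
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 20.0, 30.0, 100.0, 110.0, 120.0])

# Fit a KNN regressor that averages the 3 nearest neighbors
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)

# The 3 nearest points to 2.5 are x = 2, 3 and 1, so the prediction
# is the mean of their responses (20, 30 and 10)
pred = model.predict(np.array([[2.5]]))
print(pred[0])
```

The same idea can be used to fill in a missing period: treat the period as the predictor and average the responses of its nearest observed neighbors.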

## Caution

- Do not use KNN to predict cases unlike anything in the training data; its predictions are built only from neighbors it has already seen.
- Categorical predictors must be converted to numeric predictors before fitting.
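
One possible way to convert a categorical predictor to numeric columns is one-hot encoding with `pandas.get_dummies`. This is a minimal sketch on a made-up three-row frame; the column names echo the housing data, but the rows are invented, and this is an alternative to the integer coding used later in the tutorial rather than the tutorial's own method.

```python
import pandas

# Made-up rows for illustration only
df = pandas.DataFrame({
    'SqFeet': [1710, 1362, 1694],
    'BldgType': ['1Fam', 'Duplex', '1Fam'],
})

# Replace the categorical column with one 0/1 indicator column per category
encoded = pandas.get_dummies(df, columns=['BldgType'], dtype=int)
print(encoded)
```

With indicator columns, every pair of distinct categories is equally far apart, whereas integer codes imply an ordering between categories that may not be meaningful for a distance-based model.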

## Tutorial

**Load the Required Modules**

In [1]:
import pandas
import sklearn.neighbors

The data we will use is the housing price data from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

- Response Variable: **`price`**

**Read and Preview the Training Data**

In [2]:
train = pandas.read_csv('Data\\realestate-train.csv')
train.head()

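| Unnamed: 0 | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |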


**Read and Preview the Testing Data**

In [3]:
test = pandas.read_csv('Data\\realestate-test.csv')
test.head()

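| Unnamed: 0 | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 6 | 2 | 1516 | Y | 45 | 10004 | 1Fam | 1Story | 6 |
| 1 | 0 | 1 | 0 | 4 | 1 | 616 | Y | 85 | 6000 | 1Fam | 1Story | 7 |
| 2 | 0 | 1 | 1 | 8 | 2 | 1696 | Y | 45 | 13673 | 1Fam | 1Story | 5 |
| 3 | 0 | 2 | 0 | 6 | 3 | 1479 | Y | 34 | 13517 | 1Fam | 2Story | 8 |
| 4 | 0 | 2 | 1 | 8 | 2 | 2217 | Y | 37 | 15865 | 1Fam | 1Story | 6 |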


**For this tutorial, we will just focus on the following predictors:**

- `SqFeet`: *numeric*
- `Age`: *numeric*
- `BldgType`: *categorical*

**The predictors are selected based on intuition and are somewhat arbitrary. The purpose is to show that KNN can work with both numeric and categorical variables.**

In [10]:
# Select the necessary columns in train; .copy() avoids SettingWithCopyWarning
x_train = train[['SqFeet', 'Age', 'BldgType']].copy()
y_train = train[['price']]
# Convert the categorical variable to integer codes
x_train['BldgType'] = pandas.factorize(x_train['BldgType'])[0]


In [8]:
x_train[['Age', 'BldgType']].groupby('BldgType').count()

| BldgType | Age |
|---|---|
| 0 | 962 |
| 1 | 25 |
| 2 | 39 |
| 3 | 96 |
| 4 | 38 |


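In [5]:
# Select the necessary columns in test; .copy() avoids SettingWithCopyWarning
x_test = test[['SqFeet', 'Age', 'BldgType']].copy()
# Convert the categorical variable to integer codes, reusing the order of first
# appearance in train so each building type gets the same code in both sets
# (types absent from train are coded -1)
train_types = train['BldgType'].dropna().unique()
x_test['BldgType'] = pandas.Categorical(x_test['BldgType'], categories=train_types).codes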

**Running the KNN Model**

In [6]:
# Use 5 nearest neighbors as the base case
k = 5
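
To give a feel for what the choice of `k` does, here is a small sketch on made-up data (six invented houses, not the Kaggle data). It assumes `sklearn.neighbors.KNeighborsRegressor` as the estimator; the variable names `new_house`, `pred_k1`, and `pred_k5` are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up toy data: [SqFeet, Age] as predictors, price as response
X = np.array([[1000, 10], [1100, 12], [1200, 8],
              [2000, 30], [2100, 28], [2200, 35]])
y = np.array([100.0, 110.0, 120.0, 200.0, 210.0, 220.0])
new_house = np.array([[1120, 11]])

k = 5

# k = 1 copies the response of the single closest house
pred_k1 = KNeighborsRegressor(n_neighbors=1).fit(X, y).predict(new_house)[0]

# k = 5 averages the five closest houses, including much larger ones
pred_k5 = KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(new_house)[0]

print(pred_k1, pred_k5)
```

A small `k` follows the closest observations tightly, while a larger `k` smooths the prediction by averaging over more, and possibly less similar, neighbors.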
