# K Nearest Neighbor Tutorial
- Author: Congxin (David) Xu
- Date: 2020/12/21

## Description

This tutorial shows how to fit a K-Nearest-Neighbor (KNN) regression model in `Python`.

## Package Dependency

- [`pandas`](https://pandas.pydata.org/)
  - We will mainly use `pandas` to load and manipulate the data.
- [`numpy`](https://numpy.org/)
  - We will mainly use `numpy` for calculations and data manipulation. 
- [`sklearn`](https://scikit-learn.org/stable/)
  - Title: scikit-learn: machine learning in Python
  - This package contains the `sklearn.neighbors.KNeighborsRegressor` class, which performs K-Nearest-Neighbor regression.
  - We will also use the `sklearn.model_selection.GridSearchCV` class to choose `k` by cross-validation.

## Use Case

- Solving regression problems
- Filling in values for missing periods (imputation)

## Caution

- Do not use KNN to predict outside the range of the training data. Each prediction is an average of observed training responses, so the model cannot extrapolate.
- Categorical predictors must be converted to numeric predictors before fitting.
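
The first caution can be illustrated with a minimal sketch. The data below is synthetic (an assumption for illustration, not the housing data): the response is exactly `2 * x`, yet a query far outside the training range simply receives the average of the largest training responses.

```python
import numpy
import sklearn.neighbors

# Synthetic data (illustration only): y = 2 * x for x = 0, 1, ..., 9
x_demo = numpy.arange(10).reshape(-1, 1)
y_demo = 2.0 * numpy.arange(10)

knn_demo = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
knn_demo.fit(x_demo, y_demo)

# x = 100 is far outside the training range. Its three nearest neighbors
# are x = 7, 8, 9, so the prediction is mean(14, 16, 18) rather than 200.
far_prediction = knn_demo.predict(numpy.array([[100]]))[0]
print(far_prediction)  # 16.0
```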

## Tutorial

**Load the Required Modules**

In [1]:
import pandas
import numpy
import sklearn.neighbors
import sklearn.model_selection

The data we will use is the housing price data from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

- Response Variable: **`price`**

**Read and Preview the Training Data**

In [2]:
train = pandas.read_csv('Data\\realestate-train.csv')
train.head()

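|   | price | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|-------|----------|------------|------------|--------------|-------|--------|------------|-----|---------|----------|------------|-----------|
| 0 | 208.5 | 0 | 2 | 0 | 8 | 3 | 1710 | Y | 5 | 8450 | 1Fam | 2Story | 5 |
| 1 | 140.0 | 0 | 3 | 1 | 7 | 1 | 1717 | Y | 91 | 9550 | 1Fam | 2Story | 5 |
| 2 | 250.0 | 0 | 3 | 1 | 9 | 3 | 2198 | Y | 8 | 14260 | 1Fam | 2Story | 5 |
| 3 | 143.0 | 0 | 2 | 0 | 5 | 2 | 1362 | Y | 16 | 14115 | 1Fam | 1.5Fin | 5 |
| 4 | 307.0 | 0 | 2 | 1 | 7 | 2 | 1694 | Y | 3 | 10084 | 1Fam | 1Story | 5 |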


**Read and Preview the Testing Data**

In [3]:
test = pandas.read_csv('Data\\realestate-test.csv')
test.head()

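|   | PoolArea | GarageCars | Fireplaces | TotRmsAbvGrd | Baths | SqFeet | CentralAir | Age | LotSize | BldgType | HouseStyle | condition |
|---|----------|------------|------------|--------------|-------|--------|------------|-----|---------|----------|------------|-----------|
| 0 | 0 | 2 | 0 | 6 | 2 | 1516 | Y | 45 | 10004 | 1Fam | 1Story | 6 |
| 1 | 0 | 1 | 0 | 4 | 1 | 616 | Y | 85 | 6000 | 1Fam | 1Story | 7 |
| 2 | 0 | 1 | 1 | 8 | 2 | 1696 | Y | 45 | 13673 | 1Fam | 1Story | 5 |
| 3 | 0 | 2 | 0 | 6 | 3 | 1479 | Y | 34 | 13517 | 1Fam | 2Story | 8 |
| 4 | 0 | 2 | 1 | 8 | 2 | 2217 | Y | 37 | 15865 | 1Fam | 1Story | 6 |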


**For this tutorial, we will just focus on the following predictors:**

- `SqFeet`: *numeric*
- `Age`: *numeric*
- `BldgType`: *Categorical*

**These predictors were chosen by intuition and are somewhat arbitrary. The purpose is to show that KNN can work with both numeric and categorical variables, once the categorical ones are encoded as numbers.**

In [4]:
# Select the necessary columns in train
x_train = train[['SqFeet', 'Age', 'BldgType']]
y_train = train[['price']]

# Convert the categorical variable to a numeric variable
x_train = x_train.assign(BldgType = pandas.factorize(x_train.BldgType)[0])
x_train.head()

|   | SqFeet | Age | BldgType |
|---|--------|-----|----------|
| 0 | 1710 | 5 | 0 |
| 1 | 1717 | 91 | 0 |
| 2 | 2198 | 8 | 0 |
| 3 | 1362 | 16 | 0 |
| 4 | 1694 | 3 | 0 |
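
Integer codes from `pandas.factorize` put the building types on an arbitrary numeric scale, so KNN treats code 0 as closer to code 1 than to code 2 even though the types have no natural order. One common alternative, sketched here and not used in the rest of this tutorial, is one-hot encoding with `pandas.get_dummies`. The `toy` frame and its values are made up for illustration.

```python
import pandas

# Toy frame standing in for the tutorial's predictors (values are made up)
toy = pandas.DataFrame({
    'SqFeet': [1710, 1262, 1786],
    'Age': [5, 31, 7],
    'BldgType': ['1Fam', 'Duplex', 'TwnhsE'],
})

# One 0/1 indicator column per building type instead of a single integer code
toy_encoded = pandas.get_dummies(toy, columns=['BldgType'], dtype=int)
print(toy_encoded.columns.tolist())
```

With this encoding every pair of distinct building types is the same distance apart, which avoids the artificial ordering.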


In [5]:
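# Select the necessary columns in test
x_test = test[['SqFeet', 'Age', 'BldgType']]
# Convert the categorical variable to numeric codes, reusing the category order
# from the training data so that each building type maps to the same code in
# both sets (types not seen in training get the code -1)
bldg_types = pandas.factorize(train.BldgType)[1]
x_test = x_test.assign(BldgType = pandas.Categorical(x_test.BldgType, categories = bldg_types).codes)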
x_test.head()

|   | SqFeet | Age | BldgType |
|---|--------|-----|----------|
| 0 | 1516 | 45 | 0 |
| 1 | 616 | 85 | 0 |
| 2 | 1696 | 45 | 0 |
| 3 | 1479 | 34 | 0 |
| 4 | 2217 | 37 | 0 |


**Running the KNN Model**

In [6]:
# Choose a base case for 5 nearest neighbors
k = 5

# Set up the model class
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors = k)

# Fit the model
model.fit(X = x_train, y = y_train)

# Print the Parameters
model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
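
The parameters `'metric': 'minkowski'` and `'p': 2` mean that neighbors are found by Euclidean distance. Because `SqFeet` is measured in the thousands while `BldgType` is a small integer code, `SqFeet` dominates that distance. The sketch below, on made-up data, shows how standardizing the predictors in a pipeline can change which neighbor is nearest. It is an optional variation and not the model fitted in this tutorial.

```python
import numpy
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing

# Made-up data: column 0 is in the thousands, column 1 is a 0/1 code
x_demo = numpy.array([[1000.0, 0.0], [1010.0, 1.0], [2000.0, 0.0], [2010.0, 1.0]])
y_demo = numpy.array([10.0, 20.0, 30.0, 40.0])
query = numpy.array([[1008.0, 0.0]])

# KNN on the raw predictors
raw_knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1).fit(x_demo, y_demo)

# KNN on standardized predictors (zero mean, unit variance per column)
scaled_knn = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.neighbors.KNeighborsRegressor(n_neighbors=1),
).fit(x_demo, y_demo)

raw_prediction = raw_knn.predict(query)[0]        # nearest raw point is [1010, 1]
scaled_prediction = scaled_knn.predict(query)[0]  # nearest scaled point is [1000, 0]
print(raw_prediction, scaled_prediction)  # 20.0 10.0
```

On the raw scale a difference of 2 in the first column outweighs a change of category in the second. After standardizing, the category difference matters more.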

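**Report the Mean Squared Error on the Training Data**

This is an in-sample error, so it understates the error to expect on new data.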

In [7]:
numpy.mean((y_train - model.predict(x_train))**2)

price    1666.474348
dtype: float64
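
An error measured on the same rows the model was fitted on tends to be optimistic. The sketch below uses synthetic data (an assumption for illustration) and `k = 1`, where every training point is its own nearest neighbor, to show how far the training error can fall below the error on held-out rows.

```python
import numpy
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors

# Synthetic noisy data (illustration only)
rng = numpy.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = numpy.sin(x_demo[:, 0]) + rng.normal(0, 0.3, size=200)

# Hold out a quarter of the rows for evaluation
x_fit, x_hold, y_fit, y_hold = sklearn.model_selection.train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)

knn_one = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1).fit(x_fit, y_fit)

# With k = 1 each training row predicts itself, so the training MSE is zero
mse_fit = sklearn.metrics.mean_squared_error(y_fit, knn_one.predict(x_fit))
mse_hold = sklearn.metrics.mean_squared_error(y_hold, knn_one.predict(x_hold))
print(mse_fit, mse_hold)
```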

**Use Cross-Validation to Find the Best `k`**
- Reference: https://towardsdatascience.com/building-a-k-nearest-neighbors-k-nn-model-with-scikit-learn-51209555453a

In [8]:
# Create a new KNN object
knn2 = sklearn.neighbors.KNeighborsRegressor()

# Create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': numpy.arange(3, 51)}

# Use gridsearch to test all values for n_neighbors
knn_gscv = sklearn.model_selection.GridSearchCV(knn2, param_grid, cv=10)

# Fit model to data
knn_gscv.fit(X = x_train, y = y_train)

GridSearchCV(cv=10, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
       37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])})

**Report the best `k`**

In [9]:
knn_gscv.best_params_

{'n_neighbors': 12}

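**Return the Mean Cross-Validated Coefficient of Determination ($R^2$)**

For a regressor, `GridSearchCV` scores with $R^2$ by default, and `best_score_` is that score averaged over the 10 folds for the best `k`.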

In [10]:
knn_gscv.best_score_

0.6556953925874873
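
Beyond the single best score, the fitted search object keeps the mean score for every candidate `k` in `cv_results_`. The sketch below, on synthetic data (an assumption for illustration), tabulates those scores and also shows the `scoring` argument, which here switches the criterion from $R^2$ to negative mean squared error.

```python
import numpy
import pandas
import sklearn.model_selection
import sklearn.neighbors

# Synthetic noisy data (illustration only)
rng = numpy.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(120, 1))
y_demo = numpy.sin(x_demo[:, 0]) + rng.normal(0, 0.3, size=120)

search = sklearn.model_selection.GridSearchCV(
    sklearn.neighbors.KNeighborsRegressor(),
    {'n_neighbors': numpy.arange(3, 11)},
    cv=5,
    scoring='neg_mean_squared_error',  # higher (closer to zero) is better
)
search.fit(x_demo, y_demo)

# One row per candidate k, with the score averaged over the folds
results = pandas.DataFrame(search.cv_results_)[['param_n_neighbors', 'mean_test_score']]
print(results.sort_values('mean_test_score', ascending=False).head())
```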

**Making Predictions**

In [11]:
test['predict'] = knn_gscv.predict(x_test)
test.predict.head()

0    154.568500
1     86.237500
2    166.575417
3    165.179167
4    262.100583
Name: predict, dtype: float64
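
With the default `'weights': 'uniform'`, each prediction is the plain average of the responses of the `k` nearest training rows. The sketch below checks this on made-up data by looking up the neighbors with `kneighbors` and averaging their responses by hand.

```python
import numpy
import sklearn.neighbors

# Made-up data (illustration only)
x_demo = numpy.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_demo = numpy.array([10.0, 20.0, 30.0, 100.0, 110.0])

knn_demo = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3).fit(x_demo, y_demo)

query = numpy.array([[2.2]])

# Row positions of the three training points closest to the query
distances, indices = knn_demo.kneighbors(query)

# Average the neighbors' responses by hand and compare with predict()
manual_prediction = y_demo[indices[0]].mean()
model_prediction = knn_demo.predict(query)[0]
print(manual_prediction, model_prediction)  # 20.0 20.0
```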