# Regression: K Nearest Neighbors

Introduction to Machine Learning, BCAM & UPV/EHU course, by Carlos Cernuda, Ekhine Irurozki and Aritz Perez.


## References 

* James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
* Data sets: http://www-bcf.usc.edu/~gareth/ISL/data.html
* scikit-learn library examples: http://scikit-learn.org
* Reference Jupyter notebooks:
    - R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)
    http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html
    - General Assembly's Data Science course in Washington, DC
    https://github.com/justmarkham/DAT4
    - An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) adapted to Python code
    https://github.com/JWarmenhoven/ISLR-python

In [None]:
##########################################################
import numpy as np #scientific computing (n-dim arrays, etc)
import pandas as pd #data analysis library
##########################################################
# Plots:
import matplotlib.pyplot as plt 
import matplotlib.pylab as pylab
import seaborn as sns #visualization library based on matplotlib
%matplotlib inline
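# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6; use whichever is available
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use([style_name])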
params = {'legend.fontsize': 'xx-large',
              'figure.figsize': (15, 5),
              'axes.labelsize': 'xx-large',
              'axes.titlesize':'xx-large',
              'xtick.labelsize':'xx-large',
              'ytick.labelsize':'xx-large'}    
pylab.rcParams.update(params)  #fix the parameters for the plots

pd.set_option('display.notebook_repr_html', False)
##########################################################
# SKLEARN: scikit-learn machine learning tools
from sklearn import neighbors 
#provides functionality for unsupervised and supervised neighbors-based learning methods
##########################################################
np.random.seed(0)

## K Nearest Neighbor 

In [None]:
# Generate non-linear sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() # flatten the targets to a 1-D array
# Add noise to every 5th target
y[::5] += 1 * (0.5 - np.random.rand(8))

# Fit regression model
# select parameter K
n_neighbors = 5

# New samples to predict
test = np.linspace(0, 5, 500)[:, np.newaxis]

# Model
knn_model = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
test_knn_uniform = knn_model.fit(X, y).predict(test)

# Plots
plt.figure(figsize=(8,8))
#
plt.scatter(X, y, c='k', label='data')
#
plt.scatter(test, test_knn_uniform, c='g', label='KNN uniform')
#plt.plot(test, test_knn_uniform, c='g', linewidth=1)
#
plt.axis('tight')
plt.legend()
plt.title("KNN (k = " +str(n_neighbors)+ ", weights = uniform)");
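
With `weights='uniform'`, the prediction at a query point is simply the mean of the targets of its K nearest training points. The sketch below reproduces that by hand for a single query point and compares it with scikit-learn. The query value `2.5`, the local `RandomState`, and the names `x0`, `manual_pred` and `sk_pred` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn import neighbors

# Noise-free sinusoidal data, generated as in the cell above (assumed seed)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

k = 5
x0 = np.array([[2.5]])  # hypothetical query point

# Distance from the query to every training point (single feature)
dist = np.abs(X.ravel() - x0[0, 0])
# Indices of the k closest training points
nearest = np.argsort(dist)[:k]
# Uniform weights: plain average of the neighbours' targets
manual_pred = y[nearest].mean()

sk_model = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
sk_pred = sk_model.fit(X, y).predict(x0)[0]

print(manual_pred, sk_pred)
```

The two printed values should agree up to floating-point error, since both average the same five targets.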

## KNN distance weights

In [None]:
# Generate non-linear sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to every 5th target
y[::5] += 1 * (0.5 - np.random.rand(8))

# Fit regression model
# select parameter K
n_neighbors = 5

# New samples to predict
test = np.linspace(0, 5, 100)[:, np.newaxis]

# Model
knn_model_d = neighbors.KNeighborsRegressor(n_neighbors, weights='distance')
test_knn_distance = knn_model_d.fit(X, y).predict(test)

# Plots
plt.figure(figsize=(8,8))
#
plt.scatter(X, y, c='k', label='data')
#
plt.scatter(test, test_knn_distance, c='r', label='KNN distance')
#plt.plot(test, test_knn_distance, c='r')
#
plt.axis('tight')
plt.legend()
plt.title("KNN (k = " +str(n_neighbors)+ ", weights = distance)");
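
With `weights='distance'`, each neighbour's target is weighted by the inverse of its distance to the query point, so closer neighbours count more. A minimal sketch of that computation for one query point follows. The query value `2.5`, the local `RandomState`, and the variable names are assumptions made for illustration, and the query is assumed not to coincide with a training point (which would give a zero distance).

```python
import numpy as np
from sklearn import neighbors

# Noise-free sinusoidal data, generated as in the cell above (assumed seed)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

k = 5
x0 = np.array([[2.5]])  # hypothetical query point, not a training point

# Distances to all training points and the k nearest ones
dist = np.abs(X.ravel() - x0[0, 0])
nearest = np.argsort(dist)[:k]

# Inverse-distance weights, then a weighted average of the neighbours' targets
w = 1.0 / dist[nearest]
manual_pred = np.sum(w * y[nearest]) / np.sum(w)

sk_model = neighbors.KNeighborsRegressor(n_neighbors=k, weights='distance')
sk_pred = sk_model.fit(X, y).predict(x0)[0]

print(manual_pred, sk_pred)
```

Because the weights grow without bound as the distance shrinks, a query placed exactly on a training point is predicted with that point's own target.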

## Linear regression and KNN 

In what setting will a parametric approach, such as Linear Regression, outperform a non-parametric approach such as KNN? 

The parametric approach will outperform the non-parametric one if the chosen parametric form is close to the true form of $f$.
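
This claim can be checked numerically on simulated data, where the true $f$ is known. The sketch below fits a linear regression and a KNN regressor once with a linear $f$ and once with a sinusoidal $f$, and measures the squared error against the noise-free function on a grid. The helper `compare`, the sample size, the noise level and the seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn import neighbors
from sklearn.linear_model import LinearRegression

def compare(true_f, n_train=100, noise_sd=0.3, k=5, seed=0):
    """Mean squared error of each model against the noise-free true_f on a grid."""
    rng = np.random.RandomState(seed)
    X_train = 5 * rng.rand(n_train, 1)
    y_train = true_f(X_train).ravel() + noise_sd * rng.randn(n_train)

    X_grid = np.linspace(0, 5, 500)[:, np.newaxis]
    f_grid = true_f(X_grid).ravel()

    models = {
        'linear': LinearRegression(),
        'knn': neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform'),
    }
    errors = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_grid)
        errors[name] = float(np.mean((pred - f_grid) ** 2))
    return errors

errors_linear_truth = compare(lambda x: x)  # true f is linear
errors_sine_truth = compare(np.sin)         # true f is sinusoidal

print('linear f:', errors_linear_truth)
print('sine f:  ', errors_sine_truth)
```

Under these assumptions the linear model should have the smaller error when $f$ is linear, and KNN the smaller error when $f$ is sinusoidal.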

In [None]:
# Sinusoidal shape

# Generate non-linear sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to every 5th target
y[::5] += 1 * (0.5 - np.random.rand(8))

# New samples to predict
test = np.linspace(0, 5, 100)[:, np.newaxis]
# KNN (same K as in the previous cells)
n_neighbors = 5
knn_model = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
test_knn_uniform = knn_model.fit(X, y).predict(test)
# KNN distance
knn_model_d = neighbors.KNeighborsRegressor(n_neighbors, weights='distance')
test_knn_distance = knn_model_d.fit(X, y).predict(test)

# Plots
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot (KNN uniform)
#
ax1.scatter(X, y, c='k', label='data')
#
ax1.scatter(test, test_knn_uniform, c='g', label='KNN uniform')
ax1.plot(test, test_knn_uniform, c='g')
#
# Recent seaborn versions require x and y as keyword arguments and 1-D inputs
sns.regplot(x=X.ravel(), y=y, order=1, ci=None, scatter=False, label='Linear', ax=ax1, color='b')
ax1.legend()
# Right plot (KNN distance)
#
ax2.scatter(X, y, c='k', label='data')
ax2.scatter(test, test_knn_distance, c='r', label='KNN distance')
ax2.plot(test, test_knn_distance, c='r')
#
sns.regplot(x=X.ravel(), y=y, order=1, ci=None, scatter=False, label='Linear', ax=ax2, color='b')
ax2.legend();

In [None]:
# Linear shape
# Generate linear sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
# Targets: linear in X plus uniform noise in [0, 1), kept 1-D like in the cells above
y = X.ravel() + np.random.rand(40)

# Fit regression model
n_neighbors = 5

# To predict
test = np.linspace(0, 5, 500)[:, np.newaxis]

# Model
knn_model_uniform = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
knn_model_distance = neighbors.KNeighborsRegressor(n_neighbors, weights='distance')
test_knn_model_uniform = knn_model_uniform.fit(X, y).predict(test)
test_knn_model_distance = knn_model_distance.fit(X, y).predict(test)

# Plots
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot (KNN uniform)
#
ax1.scatter(X, y, c='k', label='data')
#
ax1.scatter(test, test_knn_model_uniform, c='g', label='KNN uniform')
ax1.plot(test, test_knn_model_uniform, c='g')
#
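# Recent seaborn versions require x and y as keyword arguments and 1-D inputs
sns.regplot(x=X.ravel(), y=np.ravel(y), order=1, ci=None, scatter=False, label='Linear', ax=ax1, color='b')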
ax1.legend()
# Right plot (KNN distance)
#
ax2.scatter(X, y, c='k', label='data')
ax2.scatter(test, test_knn_model_distance, c='r', label='KNN distance')
ax2.plot(test, test_knn_model_distance, c='r')
#
sns.regplot(x=X.ravel(), y=np.ravel(y), order=1, ci=None, scatter=False, label='Linear', ax=ax2, color='b')
ax2.legend();

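Danger! Overfitting risk! Although the true relationship is linear, the KNN curves follow the noise in the training data, and with `weights='distance'` the prediction at each training point reproduces its noisy target exactly.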
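
One way to make the overfitting risk concrete is to compare the error on the training points with the error on fresh points drawn from the same process. The sketch below does this for linear data with uniform noise. The helpers `make_data` and `mse`, the sample sizes and the seed are assumptions introduced for illustration.

```python
import numpy as np
from sklearn import neighbors

rng = np.random.RandomState(0)

def make_data(n):
    """Linear targets plus uniform noise in [0, 1) (assumed data-generating process)."""
    X = 5 * rng.rand(n, 1)
    y = X.ravel() + rng.rand(n)
    return X, y

def mse(model, X, y):
    """Mean squared error of a fitted model on (X, y)."""
    return float(np.mean((model.predict(X) - y) ** 2))

X_train, y_train = make_data(40)
X_test, y_test = make_data(200)  # fresh points, never seen during fitting

models = {
    'k=1, uniform': neighbors.KNeighborsRegressor(n_neighbors=1, weights='uniform'),
    'k=5, uniform': neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform'),
    'k=5, distance': neighbors.KNeighborsRegressor(n_neighbors=5, weights='distance'),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (mse(model, X_train, y_train), mse(model, X_test, y_test))
    print(name, 'train MSE:', results[name][0], 'test MSE:', results[name][1])
```

A training error of zero for `k=1` and for `weights='distance'` does not mean the model is perfect: each training point is its own nearest neighbour, so the model memorises the noise, and only the error on fresh points is informative.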