# HOMEWORK: k-Nearest Neighbors

In [None]:
import os

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 100)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 100)

# grid_search and cross_validation were merged into model_selection and later removed from scikit-learn
from sklearn import preprocessing, neighbors, model_selection

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [None]:
df = pd.read_csv('/Users/edwardlee/Desktop/df-sf-32/DS-SF-32/lessons/lesson-8/dataset-boston.csv')

In [None]:
df.head()

The Boston dataset describes housing values in the suburbs of Boston. Its columns are:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sqft
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River binary/dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes (in thousands of dollars)

## Question 1.  
+ Let's first categorize `MEDV` into 4 groups: the bottom 20% as Level 1, the next 30% as Level 2, the next 30% as Level 3, and the top 20% as Level 4.  
+ Please create a new variable `MEDV_Category` that stores the level number
+ Remember the quantile function
+ Remember how to segment your pandas data frame
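
As a hedged illustration of the binning described above, here is a minimal sketch that uses `pd.cut` with quantile-based edges. The `medv` series is synthetic (the integers 1 to 100), and the names `edges` and `levels` are made up for this example; it is one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MEDV column (made-up values, not the Boston data).
medv = pd.Series(np.arange(1, 101), name='MEDV')

# Bin edges at the 20th, 50th, and 80th percentiles, open-ended at both extremes.
edges = [-np.inf, medv.quantile(.2), medv.quantile(.5), medv.quantile(.8), np.inf]

# right=False makes each bin [low, high), so a value equal to an edge
# falls into the higher level.
levels = pd.cut(medv, bins=edges, labels=[1, 2, 3, 4], right=False).astype(int)

# Number of rows per level.
print(levels.value_counts().sort_index())
```

On this evenly spaced series the four levels hold 20, 30, 30, and 20 values, matching the 20/30/30/20 split in the question.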

In [None]:
print(df['MEDV'].quantile(.2))
print(df['MEDV'].quantile(.5))
print(df['MEDV'].quantile(.8))

In [None]:
# Percentile cut points: bottom 20%, next 30%, next 30%, top 20%
p20 = df['MEDV'].quantile(.2)
p50 = df['MEDV'].quantile(.5)
p80 = df['MEDV'].quantile(.8)

def classify_medv(x):
    """Return the level (1 to 4) for a single MEDV value."""
    if x < p20:
        return 1
    elif x < p50:
        return 2
    elif x < p80:
        return 3
    elif x >= p80:
        return 4

df['medv_category'] = df['MEDV'].map(classify_medv)

In [None]:
df.head()

### Our goal is to predict `MEDV_Category` based on `RM`, `PTRATIO`, and `LSTAT`

## Question 2.  

+ First normalize `RM`, `PTRATIO`, and `LSTAT`.  
+ By normalizing, we mean to scale each variable between 0 and 1 with the lowest value as 0 and the highest value as 1

+ Check out the documentation for MinMaxScaler()
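
To make the definition above concrete, the sketch below computes min-max scaling by hand, `(x - min) / (max - min)` per column, and compares it with `MinMaxScaler`. The array `X` holds made-up numbers, and `manual` and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration: two columns on different scales.
X = np.array([[4.0, 12.0],
              [6.0, 18.0],
              [8.0, 22.0]])

# Manual min-max scaling, computed column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MinMaxScaler expects a 2D array (rows are samples, columns are features).
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

Each column is scaled independently, so the smallest value in every column maps to 0 and the largest to 1.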

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
mms = MinMaxScaler()
# MinMaxScaler expects a 2D input, so scale the three columns together;
# each column is still scaled independently to [0, 1]
features = ['RM', 'PTRATIO', 'LSTAT']
df[features] = mms.fit_transform(df[features])

In [None]:
df.head()

## Question 3.  

+ Run a k-NN classifier with 5 nearest neighbors and report your misclassification error; set weights to uniform
+ Calculate your misclassification error on the training set
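
For reference, the misclassification error is the share of predictions that differ from the true labels, which equals one minus the accuracy. A small sketch with hypothetical label arrays (`y_true` and `y_pred` are invented for this example):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array([1, 2, 2, 3, 4, 4, 1, 3])
y_pred = np.array([1, 2, 3, 3, 4, 2, 1, 3])

# Misclassification error: fraction of predictions that differ from the truth.
error_by_hand = np.mean(y_true != y_pred)

# Equivalent formulation: one minus the accuracy.
error_from_accuracy = 1 - accuracy_score(y_true, y_pred)

print(error_by_hand, error_from_accuracy)
```

Two of the eight predictions are wrong here, so both expressions give 0.25.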

In [None]:
df.shape

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')

In [None]:
X = df[['RM', 'PTRATIO', 'LSTAT']]
y = df['medv_category'].values

In [None]:
from sklearn.model_selection import train_test_split

trainX, testX, trainY, testY = train_test_split(X, y, stratify=y, train_size=.80)
print(trainX.shape, testX.shape)
print(trainY.shape, testY.shape)

In [None]:
model = knn.fit(trainX, trainY)

In [None]:
y_predict = model.predict(testX)

In [None]:
model.score(testX, testY)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

cnf_mtx = confusion_matrix(testY, y_predict)
print(cnf_mtx)

# classification_report takes (y_true, y_pred), in that order
print(classification_report(testY, y_predict))

In [None]:
from sklearn.metrics import accuracy_score

print('Error Rate', 1 - accuracy_score(testY, y_predict))

## Question 4. 
+ Is this error reliable? 
+ What could we do to make it better?
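
One way to judge whether a single error figure is reliable is to repeat the evaluation over several held-out folds and look at the spread. The sketch below does this with `cross_val_score` on synthetic data from `make_classification`; the data is a stand-in for the scaled Boston features, and `X_demo`, `y_demo`, `knn_demo`, and `fold_errors` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 3 features (not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, n_classes=4, n_clusters_per_class=1,
                                     random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# Ten accuracy scores, one per held-out fold, converted to error rates.
fold_errors = 1 - cross_val_score(knn_demo, X_demo, y_demo, cv=10)

print('mean error: %.3f, std: %.3f' % (fold_errors.mean(), fold_errors.std()))
```

A large standard deviation across folds suggests that the error from any single split should be read with caution.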

In [None]:
y_predict_full = model.predict(X)
model.score(X, y)

In [None]:
cnf_mtx = confusion_matrix(y, y_predict_full)
print(cnf_mtx)

# classification_report takes (y_true, y_pred), in that order
print(classification_report(y, y_predict_full))

In [None]:
print('Error Rate', 1 - accuracy_score(y, y_predict_full))

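<span style='font-size:1.5em; color:blue'>Using these 3 predictors, the model misclassifies about 23% of the full dataset, which isn't great. This figure is also not fully reliable: 80% of those rows were used to fit the model, and a single train/test split can vary from run to run. Cross-validation gives a steadier estimate, and a grid search over the hyperparameters may improve the model.</span>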

## Question 5.  
+ Now use 10-fold cross-validation to choose the most efficient `k`

In [None]:
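params = {
    'n_neighbors': list(range(2, 30)),
    'weights': ['uniform', 'distance']
}
# GridSearchCV lives in sklearn.model_selection (the old grid_search module was removed)
gs = model_selection.GridSearchCV(knn, params, cv=10, verbose=1)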

In [None]:
gs.fit(trainX, trainY)

## Question 6.  

+ Explain your findings
+ What were your best parameters?
+ What was the best k?
+ What was the best model?
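
When explaining the findings, it can help to look at every parameter combination rather than only the winner. A minimal sketch, assuming a fitted `GridSearchCV` from `sklearn.model_selection` (which exposes `cv_results_`); the data is synthetic and the names `search`, `results`, and `summary` are invented for this example:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 3 features (a stand-in, not the Boston data).
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, n_classes=4, n_clusters_per_class=1,
                                     random_state=0)

grid = {'n_neighbors': list(range(2, 30)), 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=10)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ['param_n_neighbors', 'param_weights', 'mean_test_score', 'std_test_score']
summary = results[columns].sort_values('mean_test_score', ascending=False)

print(summary.head())
print(search.best_params_)
```

If the top few rows have nearly identical mean scores, the "best" `k` is not sharply determined, which is consistent with it changing between runs.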

In [None]:
print('best estimator: ', gs.best_estimator_)
print('best param: ', gs.best_params_)
print('best score: ', gs.best_score_)

## Question 7.  

+ Train your model with the optimal `k` you found above 
+ (don't worry if it changes from time to time - if that is the case use the one that is usually the best)
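
As an alternative to retyping the winning parameters, note that `GridSearchCV` with its default `refit=True` already retrains the best configuration on all of the data passed to `fit`. The sketch below shows this on synthetic data with a deliberately small grid; `X_tr`, `X_te`, `search`, and `best_knn` are illustrative names, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 3 features (not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, n_classes=4, n_clusters_per_class=1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          train_size=.80, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 9, 15]}, cv=5)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already trained on all of X_tr.
best_knn = search.best_estimator_
test_error = 1 - best_knn.score(X_te, y_te)

print(best_knn.n_neighbors, test_error)
```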

In [None]:
# Best parameters reported by the grid search; all other arguments keep their default values
knn_best = neighbors.KNeighborsClassifier(n_neighbors=18, weights='distance')

In [None]:
model_best = knn_best.fit(trainX, trainY)
predict_best = model_best.predict(testX)
print('error: ', 1 - model_best.score(testX, testY))

In [None]:
predict_best_full = model_best.predict(X)
print('error: ', 1 - model_best.score(X, y))

In [None]:
print(confusion_matrix(testY, predict_best))
print('==================================================')
print(confusion_matrix(y, predict_best_full))

In [None]:
print(classification_report(testY, predict_best))
print('==================================================')
print(classification_report(y, predict_best_full))

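<span style='font-size:1.5em; color:blue'>By this measure the tuned model from the grid search performs much better, with a 5.7% error rate on the full dataset. Note that with `weights='distance'` every training row is its own zero-distance neighbor, so the training rows are predicted almost perfectly; the test-set error above is the fairer comparison.</span>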
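
One caveat when reading a full-dataset error for a k-NN model with `weights='distance'`: a training row queried against itself has distance zero, so it receives all the weight and is classified as its own label. The sketch below illustrates the gap between training and held-out error on synthetic data; the data and the names `clf`, `train_error`, and `test_error` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 3 continuous features (not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, n_classes=4, n_clusters_per_class=1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          train_size=.80, random_state=0)

clf = KNeighborsClassifier(n_neighbors=18, weights='distance').fit(X_tr, y_tr)

# Each training row matches itself at distance 0, so the training error is zero.
train_error = 1 - clf.score(X_tr, y_tr)

# Held-out rows have no exact match, so this is the more honest estimate.
test_error = 1 - clf.score(X_te, y_te)

print(train_error, test_error)
```

An error computed on data that includes the training rows is therefore pulled toward zero and says little about how the model generalizes.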

## Question 8.  

+ After training your model with that `k`, 
+ use it to *predict* the class of a neighborhood with `RM = 2`, `PTRATIO = 19`, and `LSTAT = 3.5`
+ If you are confused, check out the sklearn documentation for KNN
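
Because the model is trained on min-max scaled features, a new observation should go through the same fitted scaler before it is passed to `predict`. The following is a minimal sketch under stated assumptions: the training frame `raw` and its labels are randomly generated (not the Boston data), and `scaler`, `clf`, `query`, and `query_scaled` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Made-up raw features and labels, for illustration only.
rng = np.random.RandomState(0)
raw = pd.DataFrame({'RM': rng.uniform(3, 9, 200),
                    'PTRATIO': rng.uniform(12, 22, 200),
                    'LSTAT': rng.uniform(1, 38, 200)})
labels = (raw['RM'] > 6).astype(int)

# Fit one scaler on the raw training columns and keep it for later use.
scaler = MinMaxScaler().fit(raw)
clf = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(raw), labels)

# A new observation must be a 2D row with the same column order,
# transformed by the scaler that was fitted on the training data.
query = pd.DataFrame([[2, 19, 3.5]], columns=raw.columns)
query_scaled = scaler.transform(query)

print(query_scaled)
print(clf.predict(query_scaled))
```

In this synthetic setup `RM = 2` lies below the smallest training value, so it scales to a negative number; a query outside the training range is worth flagging when interpreting the prediction.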

In [None]:
X.columns.tolist()

In [None]:
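# predict() expects a 2D array (one row per sample); the model was fit on min-max scaled features, so raw values should be scaled the same way first
model_best.predict([[2, 19, 3.5]])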

- RM: average number of rooms per dwelling
- PTRATIO: pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes (in thousands of dollars)

<span style='font-size:1.5em; color:blue'>For a neighborhood averaging 2 rooms per dwelling, with a 19:1 pupil-teacher ratio and a 3.5% lower-status population, the model predicts a median home value between the 20th and 50th percentiles.</span>