# Challenge: Model Comparison

You now know two kinds of regression models and two kinds of classifiers, so let's put that knowledge to work and compare models.

Comparing models is something data scientists do all the time. Learning to choose the best model for a given situation is very important.

Find a data set and build a KNN regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Describe the models' behaviors and explain why you favor one model over the other. Is there a situation where you would change your mind, or is one unambiguously better than the other? Lastly, note what it is about the data that causes the better model to outperform the weaker one.

## Data

I will be looking at Boston housing data, found via [Kaggle](https://www.kaggle.com/c/boston-housing/data).

In [1]:
import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor as KNNRegressor
from sklearn import linear_model

# The signed URL below carries an Expires timestamp, so it may no longer resolve;
# in that case, download train.csv from the Kaggle page and pass its local path instead.

data_url = 'https://storage.googleapis.com/kaggle-competitions-data/kaggle/5315/train.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1526780617&Signature=M07xqrJbBfFAY%2FaQFIj%2BHOYLySG%2Bah2%2F3C%2BO0WoyGmqH4AvQWEV%2FtMTQyMsN4SG7Sblmqou10GcPggvFo1iVAbceo8cLvvDqdTnVCUD5I7JaUrHAZET35Pe%2B88jtYtI9oLhxCnmtdjEHaPf7Q%2FlEoA7aEuuQ01N%2FK0jWbMqIUbOulHD0L0G4AnjSKDhkztixjKUJzHGkBoXnHSFo6uzfZ%2FhOf54Znu9YBMKZRr2ZJ6ZRPYrSTmCRZ7v2xCWZieY7ceQJN79BBNlKtUhVc1478O97WzEol1g7wkBVkmhgswxTWIJbxj0XJjfTiJWqYwK46eTyntkBHIM91l6JTcYrpA%3D%3D'

df = pd.read_csv(data_url, header=0)  
df.set_index('ID', inplace=True)

df.head()

| ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 7 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |


In [2]:
# Get rid of unwanted features
df = df.drop(['chas', 'nox', 'rad', 'black'], axis=1)

# Map columns to more descriptive names.
# Note: rename() returns a new DataFrame; the result is not assigned here,
# so df keeps its original short column names (as the output below shows).
df.rename(index=str, columns={
    "crim": "Crime per capita",
    "zn": "Proportion of residential land over 25K sq ft",
    "indus": "Proportion of non-retail acres",
    "rm": "Avg rooms per home",
    "age": "Proportion of occupied homes built pre-1940",
    "dis": "Weighted avg of distances to 5 employment centres",
    "ptratio": "Pupil-teacher ratio",
    "lstat": "% Lower status",
    "medv": "Median value of home"
})

df.head()

| ID | crim | zn | indus | rm | age | dis | tax | ptratio | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 6.575 | 65.2 | 4.09 | 296 | 15.3 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 6.421 | 78.9 | 4.9671 | 242 | 17.8 | 9.14 | 21.6 |
| 4 | 0.03237 | 0.0 | 2.18 | 6.998 | 45.8 | 6.0622 | 222 | 18.7 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 7.147 | 54.2 | 6.0622 | 222 | 18.7 | 5.33 | 36.2 |
| 7 | 0.08829 | 12.5 | 7.87 | 6.012 | 66.6 | 5.5605 | 311 | 15.2 | 12.43 | 22.9 |


In [3]:
# Features: the first nine columns; target: the last column (medv)
X = df.iloc[:, 0:9].values
Y = df.iloc[:, 9].values

# Get correlation
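
The cell above names a correlation step but contains no code for it. A minimal sketch of one way to rank features by their Pearson correlation with the target follows. It assumes a small synthetic frame (`demo`, with made-up values) standing in for the housing data, so the numbers it prints say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing frame (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
tax = rng.normal(400.0, 150.0, n)
medv = 9.0 * rm - 0.6 * lstat + rng.normal(0.0, 2.0, n) - 25.0
demo = pd.DataFrame({"rm": rm, "lstat": lstat, "tax": tax, "medv": medv})

# Pearson correlation of each feature with the target column.
corr_with_target = demo.corr()["medv"].drop("medv")

# Order by absolute strength so strong negative correlations rank high too.
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target.loc[order]
print(target_corr)
```

On the real frame the same two lines would apply with `df` in place of `demo`. Features with a correlation near zero are candidates for dropping, though Pearson correlation only captures linear relationships.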

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)  

In [5]:
# Standardize features (zero mean, unit variance); fit the scaler on the training set only
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  

## Using KNN Regression

In [6]:
regr_1 = KNNRegressor(n_neighbors=5, weights="distance")  
regr_1.fit(X_train, Y_train)  

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='distance')

In [7]:
y_pred = regr_1.predict(X_test)

In [8]:
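from sklearn.model_selection import cross_val_score
# cross_val_score uses the estimator's default scorer, which for a regressor is R^2, not accuracy.
# Note that X and Y here are the full, unscaled arrays, not the standardized X_train.
score = cross_val_score(regr_1, X, Y, cv=5)
print("R^2: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

R^2: -0.23 (+/- 0.66)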
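
The cross-validation call above scores the raw `X`, while the train/test path standardizes the features first. The score it reports is the regressor's default R², which can go negative when a model does worse than predicting the mean. A minimal sketch of one way to keep scaling inside cross-validation is shown below. It assumes synthetic stand-in data (`X_demo`, `y_demo`) and shuffled folds, neither of which comes from the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(6.0, 0.7, 300),      # small-scale feature that drives the target
    rng.normal(400.0, 150.0, 300),  # large-scale feature unrelated to the target
])
y_demo = 9.0 * X_demo[:, 0] + rng.normal(0.0, 1.0, 300)

# The scaler is refit inside every fold, so no held-out statistics leak in.
knn_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
)
# Shuffling guards against any ordering in the rows (an assumption, not the notebook's setup).
folds = KFold(n_splits=5, shuffle=True, random_state=0)

scaled_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=folds)
raw_scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=5, weights="distance"), X_demo, y_demo, cv=folds
)
print("R^2 with scaling:    %0.2f (+/- %0.2f)" % (scaled_scores.mean(), scaled_scores.std() * 2))
print("R^2 without scaling: %0.2f (+/- %0.2f)" % (raw_scores.mean(), raw_scores.std() * 2))
```

KNN is distance-based, so an unscaled feature with a large numeric range can dominate the neighbor search. Whether that explains the negative score on the housing data would need to be checked by rerunning the notebook with a pipeline like this one.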


## Using OLS Regression

In [9]:
regr_2 = linear_model.LinearRegression()
regr_2.fit(X_train, Y_train)  

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

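
With both models fitted, one way to carry out the comparison the challenge asks for is to score each on the same held-out set and check whether their residuals move together. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo`) with a linear term and a curved term; the names and the choice of metrics are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): a linear trend plus a curved term.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] ** 2 + rng.normal(0.0, 0.5, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "KNN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "OLS": LinearRegression(),
}
residuals = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    residuals[name] = y_te - pred
    print("%s  R^2: %0.2f  MAE: %0.2f" % (name, r2_score(y_te, pred), mean_absolute_error(y_te, pred)))

# A low correlation between the residuals suggests the models miss in different ways.
residual_corr = np.corrcoef(residuals["KNN"], residuals["OLS"])[0, 1]
print("Residual correlation: %0.2f" % residual_corr)
```

On the housing data, the same loop would run on the existing `X_train`, `X_test`, `Y_train`, and `Y_test` arrays. Plotting each model's residuals against its predictions is a natural next step for seeing where each one misses.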