## Boston Housing Assignment

In this assignment you'll use linear regression to estimate house prices in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using L2 regularization
> Use sklearn.linear_model.Ridge for L2 regularization (sklearn.linear_model.Lasso applies an L1 penalty instead)
+  Get the best model you can by optimizing the regularization parameter.   

In [185]:
from sklearn import datasets
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [186]:
boston = datasets.load_boston()
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [187]:
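def load_boston():
    # Load the Boston data, standardize the features, and split into train/test sets.
    # Note: the scaler is fit on all rows before the split, so test-set statistics
    # influence the scaling. train_test_split is also called without a fixed
    # random_state, so the split differs from run to run.
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = scaler.fit_transform(X)
    return train_test_split(X, y)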
    

In [188]:
X_train, X_test, y_train, y_test = load_boston()

In [189]:
X_train.shape

(379, 13)
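
The helper above scales the full dataset before splitting it. A minimal sketch of a common alternative is shown here: split first, then fit the scaler on the training rows only, so that no test-set statistics reach the model. The data is a synthetic stand-in generated with `make_regression` (an assumption, chosen only so the example runs on its own), and `random_state=0` is an arbitrary choice that makes the split reproducible.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data (assumption): 506 rows, 13 features.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Split first; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

With the default 75/25 split, the training set has the same shape as the one shown above. The test features will not have exactly zero mean, which is expected: they are scaled with the training statistics.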

### Fitting a Linear Regression

Fitting takes two steps: instantiate a new regression object (line 1), then give that object your training data
(line 2) by calling .fit(independent variables, dependent variable).



In [190]:
clf_LR = LinearRegression()
clf_LR.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
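
To see what `.fit` actually learns, it can help to try it on data where the answer is known. The sketch below uses a made-up, noise-free toy dataset (the names `X_toy`, `y_toy`, and `model` are illustrative, not part of the assignment) and reads the learned weights back from the fitted object's `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumption): y = 3*x0 - 2*x1 + 5, with no noise.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5

model = LinearRegression()
model.fit(X_toy, y_toy)

print(model.coef_)       # one weight per column of X_toy
print(model.intercept_)  # the constant term
```

Because the toy data has no noise, the fit recovers the weights 3 and -2 and the intercept 5 up to floating-point error. On the housing data, `clf_LR.coef_` holds one weight per feature in the same way.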

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the model does not.   

Using the command below, I create a tuple for each observation, combining the real value (y_test) with
the value our regressor predicts (clf_LR.predict(X_test)).

Use a similar format to get your r2 and mse metrics working.  Use the [scikit learn api](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [191]:
# List<Tuple<real, predicted>>
y_pred_LR = clf_LR.predict(X_test)
output_LR = list(zip(y_test, y_pred_LR))
print(output_LR)

[(13.4, 13.134130012297117), (27.5, 15.216812002105762), (22.800000000000001, 28.196814870901264), (24.300000000000001, 29.397187796849643), (23.899999999999999, 24.564737292913385), (8.5, 7.4439331573708145), (26.600000000000001, 27.743503824161198), (44.0, 36.678482548766731), (7.0, -3.8648966360456214), (14.6, 19.173169747246359), (19.300000000000001, 17.895891989188303), (19.100000000000001, 24.789607576218486), (22.199999999999999, 25.837272166013022), (20.699999999999999, 20.941954229197147), (21.699999999999999, 22.702562387372325), (31.5, 31.181523387496341), (50.0, 42.304321183752563), (50.0, 38.553641970179882), (21.800000000000001, 21.339781727220561), (37.299999999999997, 33.616199484075082), (31.5, 31.987213465444206), (32.0, 33.358631879054585), (13.300000000000001, 20.034325331914548), (19.5, 19.917187351133535), (20.600000000000001, 21.493906917716682), (21.0, 21.335950651218063), (45.399999999999999, 38.013741378129403), (32.0, 33.339958215955996), (33.100000000000001,

### Calculate R2 and RMSE

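The goal is to maximize $R^{2}$ (1.0 is a perfect fit) and minimize RMSE (the root of the mean squared error, expressed in the same units as the target).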

In [192]:
R2_LR = r2_score(y_test, y_pred_LR)
print(R2_LR)

0.746414605529


In [193]:
RMSE_LR = np.sqrt(mean_squared_error(y_test, y_pred_LR))
print(RMSE_LR)

5.02534883762
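
For reference, both metrics are simple enough to compute by hand. The sketch below uses the standard definitions, $R^{2} = 1 - SS_{res}/SS_{tot}$ and RMSE as the square root of the mean squared error, on a tiny made-up example. The helper names `r2_by_hand` and `rmse_by_hand` are illustrative and not part of scikit-learn.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse_by_hand(y_true, y_pred):
    """RMSE = square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up values (assumption), small enough to check on paper.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]

print(r2_by_hand(y_true_toy, y_pred_toy))    # 1 - 2/20
print(rmse_by_hand(y_true_toy, y_pred_toy))  # sqrt(2/4)
```

Here the squared errors sum to 2 and the total sum of squares around the mean (6) is 20, so $R^{2}$ is 0.9 and RMSE is the square root of 0.5.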


### Implementing a New Model with L2 Regularization: sklearn.linear_model.Ridge

In [194]:
clf_Ridge = Ridge(alpha=1)
clf_Ridge.fit(X_train, y_train)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
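
What the L2 penalty does can be sketched with plain NumPy. Ridge regression minimizes the squared error plus `alpha` times the squared norm of the weights, and for centered data the solution has the closed form $(X^{T}X + \alpha I)^{-1}X^{T}y$. The helper `ridge_closed_form` and the toy data below are illustrative assumptions, not the solver scikit-learn necessarily uses internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge weights and intercept via (Xc^T Xc + alpha*I)^-1 Xc^T yc on centered data."""
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - X_mean
    yc = y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - X_mean @ w  # the intercept itself is not penalized
    return w, b

# Toy data (assumption): five features with known weights plus a little noise.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
true_w = np.array([4.0, -3.0, 2.0, 0.0, 1.0])
y_toy = X_toy @ true_w + rng.normal(scale=0.5, size=200)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w, b = ridge_closed_form(X_toy, y_toy, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, norms[-1])
```

The weight norm shrinks as `alpha` grows, and `alpha=0` reduces to ordinary least squares. With a small `alpha` such as 1 on standardized features, the weights barely move, which is consistent with Ridge and plain linear regression scoring almost identically.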

In [195]:
y_pred_Ridge = clf_Ridge.predict(X_test)
output_Ridge = list(zip(y_test, y_pred_Ridge))
print(output_Ridge)

[(13.4, 13.111527994897568), (27.5, 15.120329038563343), (22.800000000000001, 28.183844785177406), (24.300000000000001, 29.333326081042976), (23.899999999999999, 24.570402779317625), (8.5, 7.4779368880497188), (26.600000000000001, 27.753864211164856), (44.0, 36.601108072544953), (7.0, -3.8221401917207132), (14.6, 19.137036880889909), (19.300000000000001, 17.89192888751991), (19.100000000000001, 24.749846720927163), (22.199999999999999, 25.810644522324242), (20.699999999999999, 20.973440951793954), (21.699999999999999, 22.736159698028978), (31.5, 31.13472471576722), (50.0, 42.190165884898548), (50.0, 38.529369141422293), (21.800000000000001, 21.293359521318486), (37.299999999999997, 33.531176767471408), (31.5, 31.964653372836501), (32.0, 33.284879548586545), (13.300000000000001, 20.019289798990759), (19.5, 19.917989353804511), (20.600000000000001, 21.450220329991581), (21.0, 21.350525491208366), (45.399999999999999, 37.948894663722214), (32.0, 33.321302525714032), (33.100000000000001, 3

In [196]:
R2_Ridge = r2_score(y_test, y_pred_Ridge)
print(R2_Ridge)

0.746590364036


In [197]:
RMSE_Ridge = np.sqrt(mean_squared_error(y_test, y_pred_Ridge))
print(RMSE_Ridge)

5.02360701627


### Compare Performance

In [198]:
# R2 gain of Ridge over Linear Regression (positive means Ridge is better)
R2_Ridge_Over_LR = R2_Ridge - R2_LR
print(R2_Ridge_Over_LR)

0.000175758507355


In [199]:
# RMSE change of Ridge over Linear Regression (negative means Ridge is better)
RMSE_Ridge_Over_LR = RMSE_Ridge - RMSE_LR
print(RMSE_Ridge_Over_LR)

-0.00174182134774
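
The remaining goal, optimizing the regularization parameter, could be approached with cross-validation over a grid of `alpha` values. The sketch below is one hedged way to do it with `sklearn.linear_model.RidgeCV`. It runs on a synthetic stand-in dataset from `make_regression`, and the grid of 13 log-spaced values and the 5-fold setting are arbitrary assumptions, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data (assumption): 506 rows, 13 features.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Candidate alphas from 0.001 to 1000, evaluated with 5-fold cross-validation
# on the training set; the test set is used only for the final score.
alphas = np.logspace(-3, 3, 13)
search = RidgeCV(alphas=alphas, cv=5)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("best alpha:", search.alpha_)
print("R2:", r2)
print("RMSE:", rmse)
```

Choosing `alpha` on the training folds rather than on `X_test` keeps the final $R^{2}$ and RMSE an honest estimate of performance on unseen data.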
