## Boston Housing Assignment

In this assignment you'll use linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using regularization
> Use sklearn.linear_model.Ridge (L2 penalty) or sklearn.linear_model.Lasso (L1 penalty)
+  Get the best model you can by optimizing the regularization parameter.   
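
The two metrics named in the goals can also be computed by hand, which helps when checking what `r2_score` and `mean_squared_error` return. The sketch below uses made-up numbers rather than the Boston data, and the helper names `mse_by_hand` and `r2_by_hand` are illustrative, not part of scikit-learn.

```python
def mse_by_hand(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values, for illustration only (not from the Boston data).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print("MSE = %.3f, R2 = %.3f" % (mse, r2))
```

An $R^{2}$ of 1 means perfect predictions, 0 means no better than always predicting the mean of `y_true`, and negative values mean worse than that.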

In [73]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [74]:
bean = datasets.load_boston()
print bean.DESCR

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [75]:
def load_boston():
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X=boston.data
    y=boston.target
    X = scaler.fit_transform(X)
    return train_test_split(X,y)
    

In [76]:
X_train, X_test, y_train, y_test = load_boston()

In [77]:
X_train.shape

(379L, 13L)
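
The `load_boston` helper above standardizes the features before splitting. As a rough sketch of what that step does, each column is shifted to mean 0 and scaled to unit variance. The function name `standardize_columns` and the toy rows are assumptions for illustration, and the sketch assumes no column is constant.

```python
import math


def standardize_columns(rows):
    """Return rows with each column shifted to mean 0 and scaled to unit variance."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    # Population standard deviation (divide by n), as StandardScaler does.
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_cols)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(n_cols)] for r in rows]


# Hypothetical two-feature data on very different scales.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print(["%.3f" % v for v in row])
```

One thing worth noting about the helper: it fits the scaler on all 506 rows before splitting, so the test rows influence the scaling statistics. Fitting the scaler on `X_train` only and then transforming `X_test` is a common alternative.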

### Fitting a Linear Regression

It's as easy as instantiating a new regression object (line 1) and giving your regression object your training data
(line 2) by calling .fit(independent variables, dependent variable)



In [78]:

clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
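
To get a feel for what `.fit` computes, here is a minimal sketch of least squares for a single feature, where the slope and intercept have a closed form. This is not scikit-learn's implementation (which handles all 13 features at once); `fit_line` and the toy data are illustrative assumptions.

```python
def fit_line(x, y):
    """Closed-form least squares for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(x, y)
print("intercept = %.1f, slope = %.1f" % (intercept, slope))
```

The fitted `intercept` and `slope` play the same role as `intercept_` and `coef_` on a fitted `LinearRegression` object.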

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working.  Use the [scikit learn api](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [79]:
zip(y_test, clf.predict(X_test))

[(18.600000000000001, 20.170731100221747),
 (19.800000000000001, 21.938157123319456),
 (19.100000000000001, 16.51627876657459),
 (21.399999999999999, 19.930807556998829),
 (16.5, 11.445258747440899),
 (13.800000000000001, 12.083184070024229),
 (10.199999999999999, 16.950684065457001),
 (12.5, 18.838078387058662),
 (7.5, 13.413545183478162),
 (5.0, 7.6500333833425955),
 (27.5, 31.437242280530793),
 (17.300000000000001, 16.036029904229004),
 (44.799999999999997, 38.09818285494643),
 (20.699999999999999, 25.375220850726954),
 (29.600000000000001, 24.727613526245126),
 (36.399999999999999, 33.233094185353828),
 (23.0, 19.897644209517537),
 (21.399999999999999, 21.7936324933723),
 (28.399999999999999, 30.983497929263152),
 (11.699999999999999, 14.76014271690094),
 (28.699999999999999, 28.024318044376802),
 (16.100000000000001, 21.343936834098212),
 (21.399999999999999, 25.078171372149679),
 (18.800000000000001, 20.943829820848062),
 (22.800000000000001, 28.589904398427361),
 (21.89999999999

In [80]:
from sklearn.datasets import *
data = load_boston()
print data.DESCR

In [81]:
from sklearn.datasets import *
from sklearn.linear_model import LinearRegression

data = load_boston()
model = LinearRegression()
model.fit(data.data, data.target)
print model.__dict__
print model.score(data.data, data.target)

{'normalize': False, 'intercept_': 36.491103280361422, 'residues_': 11080.276284149875, 'fit_intercept': True, 'coef_': array([ -1.07170557e-01,   4.63952195e-02,   2.08602395e-02,
         2.68856140e+00,  -1.77957587e+01,   3.80475246e+00,
         7.51061703e-04,  -1.47575880e+00,   3.05655038e-01,
        -1.23293463e-02,  -9.53463555e-01,   9.39251272e-03,
        -5.25466633e-01]), 'copy_X': True, 'rank_': 13, 'singular_': array([  3.94958310e+03,   1.77662274e+03,   6.42864253e+02,
         3.66980826e+02,   1.59116390e+02,   1.18692322e+02,
         9.01718207e+01,   6.93889529e+01,   4.06572735e+01,
         2.44223087e+01,   1.13502686e+01,   5.50918200e+00,
         1.24178413e+00])}
0.740607742865


## Estimating performance on unseen data

To estimate how the model performs on unseen data, we split the data into two sets, the training set and the test set, and compare the $R^{2}$ score on each.

In [82]:
from sklearn.datasets import *
from sklearn.cross_validation import train_test_split
data = load_boston()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.5)
model = LinearRegression()
model.fit(X_train, y_train)
print "Train R2 %f"%model.score(X_train, y_train)
print "Test R2 %f"%model.score(X_test, y_test)

Train R2 0.707253
Test R2 0.764058
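
The split itself is just a shuffle followed by a cut. The sketch below shows the idea with a stand-in list of row indices; `split_rows` is a hypothetical helper, not the scikit-learn function, and the seed value is arbitrary.

```python
import random


def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(506))  # stand-in for the 506 Boston rows
train, test = split_rows(rows, train_fraction=0.5, seed=0)
print("train rows: %d, test rows: %d" % (len(train), len(test)))
```

Because the split is random, the scores change from run to run, so a test $R^{2}$ slightly above the training $R^{2}$, as seen here, may simply reflect a favourable split. Passing `random_state` to `train_test_split` fixes the shuffle and makes the numbers reproducible.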


In [84]:
from sklearn import datasets
import pandas as pd
%pylab inline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import display
from sklearn.neighbors import KNeighborsRegressor

%matplotlib inline

Populating the interactive namespace from numpy and matplotlib


In [85]:
boston = datasets.load_boston()  # load here so this cell does not depend on a later one
train_data, test_data, train_target, test_target = \
        train_test_split(boston.data, boston.target, train_size=.8)
train_data.shape

(404L, 13L)

In [86]:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target

print X.shape
print y.shape

(506L, 13L)
(506L,)


In [87]:
from sklearn import linear_model
linreg = linear_model.LinearRegression()

In [88]:
_=linreg.fit(train_data, train_target)

## MSE performance measure

In [89]:
from sklearn import metrics
mse = metrics.mean_squared_error(test_target, linreg.predict(test_data)) 
print("MSE is {}".format(mse))

MSE is 20.985691564
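
MSE is in squared units of the target, which makes it hard to read directly. Taking the square root gives the RMSE, which is in the target's own units. The short calculation below assumes the target is the median home value in thousands of dollars, as in the standard description of this dataset.

```python
import math

mse = 20.985691564  # test MSE printed above
rmse = math.sqrt(mse)  # back in the units of the target
print("RMSE is %.3f" % rmse)
```

Under that assumption, an RMSE of about 4.58 means predictions are typically off by roughly $4,600.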


## L1 regularization using the Lasso method

In [90]:
# normalize=True rescales the features before fitting; alpha sets the strength of the L1 penalty.
lasso = linear_model.Lasso(normalize=True, alpha=1.0)
lasso.fit(train_data, train_target)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=True, positive=False, precompute='auto', tol=0.0001,
   warm_start=False)
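
Lasso applies an L1 penalty, while Ridge applies an L2 penalty, and the two shrink coefficients differently. For a single centered feature with no intercept both have a closed form, which the sketch below uses to compare them. The objectives in the docstrings follow the forms given in the scikit-learn documentation; the function names and toy data are illustrative assumptions.

```python
def soft_threshold(rho, alpha):
    """Move rho toward zero by alpha, stopping at exactly zero."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coef(x, y, alpha):
    """One-feature Lasso: minimize (1/(2n)) * sum((y - w*x)^2) + alpha * |w|."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    scale = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, alpha) / scale


def ridge_coef(x, y, alpha):
    """One-feature Ridge: minimize sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# Hypothetical centered data with true slope 2.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

for alpha in [0.0, 0.5, 2.0, 100.0]:
    print("alpha=%6.1f  lasso w=%.4f  ridge w=%.4f"
          % (alpha, lasso_coef(x, y, alpha), ridge_coef(x, y, alpha)))
```

In this sketch the Lasso coefficient reaches exactly zero once alpha is large enough, whereas the Ridge coefficient keeps shrinking but stays non-zero.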

In [91]:
lasso.fit(train_data, train_target)
r2_lasso_train = lasso.score(train_data, train_target)
r2_lasso = lasso.score(test_data, test_target)

In [92]:
results = """\
      | TRAINING | TESTING
------+----------+---------
Lasso | {:.2%}   | {:.2%}
---------------------------
""".format(r2_lasso_train, r2_lasso)
print(results)

      | TRAINING | TESTING
------+----------+---------
Lasso | 0.00%   | -0.73%
---------------------------

