#  Boston Housing Assignment
**University of Illinois**
<br>CSC 570 - Data Science Essentials
<br>Author: Arthur Putnam

## Lab Directions
1.  Pull the Boston Housing notebook I've created for this assignment.

2.  Implement scikit-learn's r2 and mse methods to measure the performance of my linear regressor.

3.  Implement either sklearn.linear_model.Ridge or sklearn.linear_model.Lasso.

4.  Optimize the regression model you pick by reviewing the r2 and mse scores and adjusting the regularization parameter.

5.  Turn in the GitHub link to your work.

## Boston Housing Assignment

In this assignment you'll be using linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using L2 (Ridge) or L1 (Lasso) regularization
> Use sklearn.linear_model.Ridge or sklearn.linear_model.Lasso 
+  Get the best model you can by optimizing the regularization parameter.   
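
The last goal, tuning the regularization parameter, can be sketched as a small sweep. This is a minimal sketch rather than the assignment's solution: it assumes synthetic data from `make_regression` in place of the Boston data, a hypothetical `candidate_alphas` grid, and a separate validation split for scoring so that the test set stays untouched.

```python
import math

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, random_state=0)

# Hypothetical grid of regularization strengths to try.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

val_rmse = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    val_rmse[alpha] = math.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# Keep the alpha with the lowest validation RMSE.
best_alpha = min(val_rmse, key=val_rmse.get)
print("best alpha:", best_alpha, "validation RMSE:", val_rmse[best_alpha])
```

Swapping `Ridge` for `Lasso` in the loop would give the same kind of sweep for L1 regularization.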

In [3]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression



In [6]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [7]:
def load_boston():
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X=boston.data
    y=boston.target
    X = scaler.fit_transform(X)
    return train_test_split(X,y)
    

In [8]:
X_train, X_test, y_train, y_test = load_boston()

In [9]:
X_train.shape

(379, 13)
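
The 379 training rows follow from the split size. As a quick check, assuming `train_test_split` is using its default `test_size` of 0.25 and rounds the test share up, the arithmetic works out as follows:

```python
import math

n_samples = 506        # instances listed in the dataset description
test_fraction = 0.25   # assumed default test_size of train_test_split

n_test = math.ceil(n_samples * test_fraction)  # the test share is rounded up
n_train = n_samples - n_test                   # the remainder goes to training

print(n_train, n_test)
```

The remaining rows form the holdout set that `X_test` and `y_test` refer to.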

### Fitting a Linear Regression

It's as easy as instantiating a new regression object (line 1) and giving your regression object your training data
(line 2) by calling .fit(independent variables, dependent variable)



In [103]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
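
To see what `.fit` learns, here is a minimal sketch on hypothetical toy data generated from `y = 2*x + 1`. The names `X_toy`, `y_toy`, and `toy_model` are illustrative only and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2*x + 1 (hypothetical, for illustration only).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)  # independent variables first, dependent variable second

slope = toy_model.coef_[0]          # learned weight for the single feature
intercept = toy_model.intercept_    # learned constant term
prediction = toy_model.predict(np.array([[4.0]]))[0]
print(slope, intercept, prediction)
```

Because the toy data is exactly linear, the fitted slope and intercept should recover the generating values up to floating-point error.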

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working. Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help.

In [11]:
list(zip(y_test, clf.predict(X_test)))

[(8.8000000000000007, 6.4290404175970473),
 (36.399999999999999, 32.77394999678593),
 (14.300000000000001, 13.176844333515538),
 (13.1, 20.077390428506313),
 (26.199999999999999, 23.711948370737346),
 (25.0, 24.622202774862547),
 (15.6, 11.491004387955922),
 (22.800000000000001, 29.409722777222175),
 (17.5, 16.184429668773639),
 (30.5, 30.251584786276389),
 (23.899999999999999, 26.936949721127078),
 (27.5, 33.464502568655789),
 (13.800000000000001, 20.125028633407108),
 (27.5, 24.583723257877384),
 (28.699999999999999, 31.196185464770934),
 (27.899999999999999, 20.372363471296609),
 (24.600000000000001, 29.81517837671565),
 (32.0, 33.880762571219407),
 (29.600000000000001, 25.559478616858108),
 (17.100000000000001, 18.276765623177532),
 (20.100000000000001, 18.680765682227108),
 (14.800000000000001, 14.64819632004486),
 (25.0, 28.959469773454479),
 (24.399999999999999, 23.582347554426669),
 (14.9, 17.556713048051066),
 (29.100000000000001, 31.675180591552625),
 (50.0, 25.15709680303730
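
The same (actual, predicted) pairing is the starting point for any error metric. A small sketch, using the first three pairs from the output above and hypothetical variable names, computes the residuals that MSE and R² are built from:

```python
# First three (actual, predicted) pairs copied from the output above.
pairs = [
    (8.8000000000000007, 6.4290404175970473),
    (36.399999999999999, 32.77394999678593),
    (14.300000000000001, 13.176844333515538),
]

# Residual = actual - predicted; a positive value means the model under-predicted.
residuals = [actual - predicted for actual, predicted in pairs]
squared_errors = [r ** 2 for r in residuals]

for (actual, predicted), r in zip(pairs, residuals):
    print(f"actual={actual:.1f} predicted={predicted:.2f} residual={r:+.2f}")
```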

### RMSE - Root Mean Squared Error


In [86]:
import math

def rmse(y_hat, y):
    # MSE is symmetric in its two arguments, so passing (y_hat, y) gives the same result as (y, y_hat)
    return math.sqrt(mean_squared_error(y_hat, y))

y_hat = clf.predict(X_test)
linear_regression_rmse = rmse(y_hat, y_test)
print("RMSE:", linear_regression_rmse)


RMSE: 5.118757366097946
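
RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the target. A minimal by-hand sketch, with a hypothetical helper name `rmse_by_hand` and toy values, makes the steps explicit:

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (hypothetical): the errors are 1, 0 and -2.
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]

toy_rmse = rmse_by_hand(toy_true, toy_pred)
print("toy RMSE:", toy_rmse)
```

Squaring means the single error of 2 contributes four times as much as the error of 1, which is why RMSE is sensitive to large misses.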


### R² coefficient of determination
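R² is at most 1, and larger is better. It usually falls between 0 and 1, but it can be negative when the model fits worse than always predicting the mean of the target.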

In [88]:
linear_regression_r2 = r2_score(y_test, y_hat)
print("Coefficient of determination:", linear_regression_r2 )

Coefficient of determination: 0.677820660033
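
R² compares the model's squared error with the squared error of always predicting the mean of the actual values. A minimal sketch of that definition, with a hypothetical helper name `r2_by_hand` and toy values, also shows that the score can drop below 0 when predictions are worse than the mean:

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


toy_true = [1.0, 2.0, 3.0]  # hypothetical target values

perfect = r2_by_hand(toy_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2_by_hand(toy_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2_by_hand(toy_true, [3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_)
```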


### Linear Regression Summary

In [89]:
print("RMSE:",linear_regression_rmse)
print("Coefficient of determination:", linear_regression_r2 )

RMSE: 5.118757366097946
Coefficient of determination: 0.677820660033


### Ridge model
"This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression" (cited from: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)

In [82]:
from sklearn.linear_model import Ridge
import numpy as np


def report_ridge_sweep(alphas, label):
    """Fit Ridge once per alpha, score each fit on the test set, and print the extremes."""
    rmse_values = []
    r2_values = []
    for alpha in alphas:
        ridge_model = Ridge(alpha=alpha)
        ridge_model.fit(X_train, y_train)
        ridge_y_hat = ridge_model.predict(X_test)
        rmse_values.append(rmse(ridge_y_hat, y_test))
        r2_values.append(r2_score(y_test, ridge_y_hat))

    print("==========================================")
    print("Best Values for", label)
    # lowest RMSE value in the list: the best fit in this sweep
    index = rmse_values.index(min(rmse_values))
    print("lowest RMSE value in the list")
    print("Best alpha value for RMSE (" + label + "):", alphas[index])
    print("RMSE value:", rmse_values[index])
    print("Coefficient of determination:", r2_values[index])

    print("\n")

    # lowest r2 value in the list: the weakest fit in this sweep,
    # because a higher r2 is better
    print("lowest r2 value in the list")
    index = r2_values.index(min(r2_values))
    print("Alpha value for lowest r2 (" + label + "):", alphas[index])
    print("RMSE value:", rmse_values[index])
    print("Coefficient of determination:", r2_values[index])
    print("==========================================")


report_ridge_sweep(np.arange(0.00001, 1, 0.00001).tolist(), "0.00001-1")
report_ridge_sweep(list(range(1, 100)), "1-100")

Best Values for 0.00001-1
lowest RMSE value in the list
Best alpha value for RMSE (0.00001-1): 0.99999
RMSE value: 5.111349045609309
Coefficient of determination: 0.678752558318


lowest r2 value in the list
Alpha value for lowest r2 (0.00001-1): 1e-05
RMSE value: 5.1187572880624455
Coefficient of determination: 0.677820669856
Best Values for 1-100
lowest RMSE value in the list
Best alpha value for RMSE (1-100): 35
RMSE value: 5.038838234578068
Coefficient of determination: 0.687802492433


lowest r2 value in the list
Alpha value for lowest r2 (1-100): 1
RMSE value: 5.111348975285734
Coefficient of determination: 0.678752567157
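
The l2-norm penalty in the Ridge description pulls the coefficients toward zero as alpha grows. A minimal NumPy sketch of the closed-form solution illustrates this. It ignores the intercept, uses synthetic data, and the helper name `ridge_closed_form` is hypothetical; it is not how scikit-learn necessarily solves the problem internally.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data (an assumption, not the Boston dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=50)

demo_alphas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)) for a in demo_alphas]

for a, n in zip(demo_alphas, norms):
    print(f"alpha={a:>5}: coefficient norm = {n:.4f}")
```

With alpha set to 0 the result is ordinary least squares, and each larger alpha yields a smaller coefficient norm.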


### Lasso model
"Linear Model trained with L1 prior as regularizer (aka the Lasso)" (cited from: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

In [100]:
from sklearn.linear_model import Lasso
import numpy as np


def report_lasso_sweep(alphas, label):
    """Fit Lasso once per alpha, score each fit on the test set, and print the extremes."""
    rmse_values = []
    r2_values = []
    for alpha in alphas:
        lasso_model = Lasso(alpha=alpha)
        lasso_model.fit(X_train, y_train)
        lasso_y_hat = lasso_model.predict(X_test)
        rmse_values.append(rmse(lasso_y_hat, y_test))
        r2_values.append(r2_score(y_test, lasso_y_hat))

    print("==========================================")
    print("Best Values for", label)
    # lowest RMSE value in the list: the best fit in this sweep
    index = rmse_values.index(min(rmse_values))
    print("lowest RMSE value in the list")
    print("Best alpha value for RMSE (" + label + "):", alphas[index])
    print("RMSE value:", rmse_values[index])
    print("Coefficient of determination:", r2_values[index])

    print("\n")

    # lowest r2 value in the list: the weakest fit in this sweep,
    # because a higher r2 is better
    print("lowest r2 value in the list")
    index = r2_values.index(min(r2_values))
    print("Alpha value for lowest r2 (" + label + "):", alphas[index])
    print("RMSE value:", rmse_values[index])
    print("Coefficient of determination:", r2_values[index])
    print("==========================================")


report_lasso_sweep(np.arange(0.00001, 1, 0.00001).tolist(), "0.00001-1")
report_lasso_sweep(list(range(1, 100)), "1-100")

Best Values for 0.00001-1
lowest RMSE value in the list
Best alpha value for RMSE (0.00001-1): 0.12853000000000003
RMSE value: 5.028847093658451
Coefficient of determination: 0.689039331848


lowest r2 value in the list
Alpha value for lowest r2 (0.00001-1): 0.99999
RMSE value: 5.356040351357251
Coefficient of determination: 0.647258725449
Best Values for 1-100
lowest RMSE value in the list
Best alpha value for RMSE (1-100): 1
RMSE value: 5.3560427488577425
Coefficient of determination: 0.647258409657


lowest r2 value in the list
Alpha value for lowest r2 (1-100): 7
RMSE value: 9.020954328599489
Coefficient of determination: -0.000630110804626
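
Unlike the L2 penalty, the L1 penalty can set coefficients exactly to zero. A minimal sketch, assuming the textbook special case of orthonormal features, where the Lasso solution is the least-squares coefficient passed through a soft-threshold. The helper name `soft_threshold` and the toy coefficients are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, and clip to exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


# Hypothetical least-squares coefficients and a threshold tied to alpha.
ols_coefs = [3.0, -0.4, 0.1]
threshold = 0.5

shrunk = [soft_threshold(c, threshold) for c in ols_coefs]
print(shrunk)  # the two small coefficients become exactly 0.0
```

This would be consistent with the result for an alpha of 7 above: an R² of roughly 0 is what a model scores when it predicts a constant, which is what remains once every coefficient has been thresholded to zero.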


### Summary of best of LinearRegression, Ridge and Lasso 

In [104]:
print("----------------------------------------------------------------------")
print("Linear Regression RMSE:",linear_regression_rmse)
print("Linear Regression Coefficient of determination:", linear_regression_r2 )
print("----------------------------------------------------------------------")
# Best Ridge result: alpha=35 (1-100 sweep); best Lasso result: alpha=0.12853 (0.00001-1 sweep)
print("Ridge RMSE value: 5.038838234578068")
print("Ridge Coefficient of determination: 0.687802492433")
print("----------------------------------------------------------------------")
print("Lasso RMSE value: 5.028847093658451")
print("Lasso Coefficient of determination: 0.689039331848")
print("----------------------------------------------------------------------")


----------------------------------------------------------------------
Linear Regression RMSE: 5.118757366097946
Linear Regression Coefficient of determination: 0.677820660033
----------------------------------------------------------------------
Ridge RMSE value: 5.038838234578068
Ridge Coefficient of determination: 0.687802492433
----------------------------------------------------------------------
Lasso RMSE value: 5.028847093658451
Lasso Coefficient of determination: 0.689039331848
----------------------------------------------------------------------
