## Regression Evening Exercise: Housing in Boston

Your assignment is to predict the price of housing in Boston based on the features of the housing.

Since this is a group exercise, there will be opportunities to collaborate, but be sure to write your own code so that you can reinforce the key concepts learned in today's lessons.

1. (__Individual__) Use `sklearn` to fit a multiple linear regression model.
  - (__Group discussion__) How will you decide which features to include?
  
  
2. (__Individual__) What is the coefficient of determination (r-squared) for your model? What about the root mean squared error (RMSE)?


3. (__Group discussion__) Compare your results to make sure you ran your code successfully. Can you improve upon your original model?
  - (__Individual and group discussion__) Look at the correlations of the features to the target variable as well as the coefficients of your original model. How could you use that information to select a better set of features? 
  - (__Group discussion__) What features could you create that might be useful (e.g., categorical features based on binning numeric features)? How do you handle categorical features in a linear regression model?
  - (__Individual__) Update your model with the improvements you discussed as a group.


4. (__Individual__) Make a scatterplot of the observations in the test data, where the x-axis is the actual price and the y-axis is the predicted price from your best model. What does this plot tell you about the model you created?
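
As one illustration of the feature-creation idea in step 3, here is a minimal sketch of binning a numeric column and one-hot encoding the result. The toy values, the bin edges, and the `RM_bin` column name are assumptions made up for this example; only the `RM` name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the column name mirrors the dataset's RM feature.
toy = pd.DataFrame({"RM": [4.5, 5.8, 6.2, 6.9, 7.4, 8.1]})

# Bin the numeric column into three labeled categories.
# The bin edges here are arbitrary and chosen only for illustration.
toy["RM_bin"] = pd.cut(
    toy["RM"], bins=[0, 6, 7, 10], labels=["small", "medium", "large"]
)

# One-hot encode the bins. drop_first=True drops the first category so the
# dummy columns are not perfectly collinear with the model's intercept.
dummies = pd.get_dummies(toy["RM_bin"], prefix="RM", drop_first=True)
print(dummies)
```

The dropped category ("small" here) becomes the baseline, so each remaining dummy coefficient would be read as a difference from that baseline.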

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2; this import requires an older version
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [2]:
seed = 0
np.random.seed(seed)

In [3]:
# load the Boston housing dataset from sklearn
boston = load_boston()
bos = pd.DataFrame(boston.data)

# give our DataFrame the appropriate feature names
bos.columns = boston.feature_names

# Add the target variable to the DataFrame
bos['Price'] = boston.target

In [4]:
# characteristics of the data for your reference
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [7]:
boston.data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [8]:
# use a standard train/test split so model performance is comparable across students
bos_train_X, bos_test_X, bos_train_y, bos_test_y = train_test_split(
    bos.drop(['Price'], axis=1),
    bos['Price'],
    test_size=0.25,
    random_state=seed,
    shuffle=True,
)

#### Your code below

In [9]:
bos_train_X

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
245,0.19133,22.0,5.86,0.0,0.431,5.605,70.2,7.9549,7.0,330.0,19.1,389.13,18.46
59,0.10328,25.0,5.13,0.0,0.453,5.927,47.2,6.9320,8.0,284.0,19.7,396.90,9.22
276,0.10469,40.0,6.41,1.0,0.447,7.267,49.0,4.7872,4.0,254.0,17.6,389.25,6.05
395,8.71675,0.0,18.10,0.0,0.693,6.471,98.8,1.7257,24.0,666.0,20.2,391.98,17.12
416,10.83420,0.0,18.10,0.0,0.679,6.782,90.8,1.8195,24.0,666.0,20.2,21.57,25.79
...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,0.28392,0.0,7.38,0.0,0.493,5.708,74.3,4.7211,5.0,287.0,19.6,391.13,11.74
192,0.08664,45.0,3.44,0.0,0.437,7.178,26.3,6.4798,5.0,398.0,15.2,390.49,2.87
117,0.15098,0.0,10.01,0.0,0.547,6.021,82.6,2.7474,6.0,432.0,17.8,394.51,10.30
47,0.22927,0.0,6.91,0.0,0.448,6.030,85.5,5.6894,3.0,233.0,17.9,392.74,18.80
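
Before choosing features, it can help to rank them by their correlation with the target, as step 3 suggests. The sketch below uses a small synthetic DataFrame so that it runs on its own; the column names `strong`, `weak`, and `noise` are hypothetical stand-ins, not columns of the Boston data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the training data (hypothetical column names).
train = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# The target depends heavily on "strong", slightly on "weak", not on "noise".
train["Price"] = (
    3.0 * train["strong"] + 0.5 * train["weak"] + rng.normal(scale=1.0, size=n)
)

# Pearson correlation of every feature with the target, ranked by magnitude.
corr_with_target = train.corr()["Price"].drop("Price")
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

In this notebook, the same two lines could presumably be applied to `bos_train_X.join(bos_train_y)`. Keep in mind that correlation only captures linear, one-feature-at-a-time relationships, so a low-ranked feature may still help in combination with others.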


In [33]:
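# Lasso adds an L1 penalty (sum of absolute coefficient values), which can shrink some coefficients to exactly zero
# Ridge adds an L2 penalty (sum of squared coefficient values), which shrinks coefficients toward zero but rarely to exactly zero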
#https://towardsdatascience.com/feature-selection-using-regularisation-a3678b71e499
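
To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data where only two of ten features matter. The data, the `alpha=0.5` values, and the step names are assumptions chosen for illustration, not tuned settings for the Boston data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
# Only the first two features influence the target; the rest are pure noise.
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=1.0, size=n_samples)

# Scale first so the penalty treats every feature on the same footing.
lasso = Pipeline([("scaler", StandardScaler()), ("model", Lasso(alpha=0.5))])
ridge = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=0.5))])
lasso.fit(X, y)
ridge.fit(X, y)

lasso_coef = lasso.named_steps["model"].coef_
ridge_coef = ridge.named_steps["model"].coef_

# Lasso tends to zero out the irrelevant features; Ridge only shrinks them.
print("lasso nonzero coefficients:", np.count_nonzero(lasso_coef))
print("ridge nonzero coefficients:", np.count_nonzero(ridge_coef))
```

This is why Lasso is often used as a rough feature-selection tool: the features whose coefficients survive are candidates to keep.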


In [29]:
# mean_squared_error, LinearRegression and StandardScaler are already imported above
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline

# scale the features, then fit an ordinary least-squares regression
pipeline_lr = Pipeline([('scalar1', StandardScaler()), ('lr_regressor', LinearRegression())])
# ('lasso', Lasso())

In [30]:
pipeline_lr.fit(bos_train_X,bos_train_y)

Pipeline(memory=None,
         steps=[('scalar1',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr_regressor',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [31]:
# predict once on the test set and reuse the predictions for every metric
test_pred = pipeline_lr.predict(bos_test_X)
mse = mean_squared_error(bos_test_y, test_pred)
mae = mean_absolute_error(bos_test_y, test_pred)
rmse = np.sqrt(mse)
print('mse :{}'.format(mse))
print('mae :{}'.format(mae))
print('rmse :{}'.format(rmse))

mse :29.782245092302325
mae :3.6683301481357153
rmse :5.457311159564051
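
Question 2 also asks for the coefficient of determination, which the cell above does not compute. A minimal sketch of R-squared on a tiny hand-made example follows, computed both from its definition and with `sklearn.metrics.r2_score`; the four values are invented for illustration and are not results from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example so the arithmetic can be checked by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The library function should agree with the manual calculation.
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("r2 (manual):", r2_manual)
print("r2 (sklearn):", r2)
print("rmse:", rmse)
```

In this notebook, the equivalent call would presumably be `r2_score(bos_test_y, pipeline_lr.predict(bos_test_X))`; a fitted regression pipeline's `score` method reports the same quantity.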


In [32]:
# predict with the fitted pipeline (no object named `lr` is defined in the cells shown)
predict = pipeline_lr.predict(bos_test_X)
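
For question 4, the actual-versus-predicted scatterplot could be sketched as below. The helper name `plot_actual_vs_predicted` and the synthetic arrays are assumptions so that the example runs on its own; in this notebook you would pass the test targets and the pipeline's test predictions instead.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_actual_vs_predicted(y_true, y_pred):
    """Scatter predicted against actual values with a y = x reference line."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, alpha=0.6)

    # Points on the dashed diagonal are perfect predictions.
    lo = min(np.min(y_true), np.min(y_pred))
    hi = max(np.max(y_true), np.max(y_pred))
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")

    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
    ax.set_title("Actual vs. predicted")
    return ax


# Synthetic stand-in values (hypothetical, for illustration only).
rng = np.random.default_rng(0)
actual = rng.uniform(5, 50, size=100)
predicted = actual + rng.normal(scale=4.0, size=100)

ax = plot_actual_vs_predicted(actual, predicted)
```

Points above the diagonal are over-predictions and points below are under-predictions. A systematic bend away from the line at high or low prices would suggest the linear model is missing some structure in the data.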