# Boston Housing Study
### Using data from the Boston Housing Study case as described in "Marketing Data Science: Modeling Techniques for Predictive Analytics with R and Python" (Miller 2015). We use data from the Boston Housing Study to evaluate regression modeling methods within a cross-validation design.

### The Boston Housing Study is a market response study of sorts, with the market being 506 census tracts in the Boston metropolitan area. The objective of the study was to examine the effect of air pollution on housing prices, controlling for the effects of other explanatory variables. The response variable is the median price of homes in the census tract. Table 1 shows variables included in the case. Short variable names correspond to those used in previously published studies.

### Scikit Learn documentation for this assignment:
#### http://scikit-learn.org/stable/modules/model_evaluation.html 
#### http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
#### http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
#### http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
#### http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
#### http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
#### http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
#### http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

### Textbook reference materials:
#### Géron, A. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow. Sebastopol, Calif.: O'Reilly. Chapter 4 Training Models has sections covering linear regression, polynomial regression, and regularized linear models. Sample code from the book is available on GitHub at https://github.com/ageron/handson-ml.

In [None]:
# import base packages into the namespace for this program
import numpy as np
import pandas as pd

# modeling routines from Scikit Learn packages
import sklearn.linear_model 
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score  
from math import sqrt  # for root mean-squared error calculation
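
The imports above pair `sqrt` with `mean_squared_error` so that root mean-squared error (RMSE) can be reported alongside R-squared. A minimal sketch of that calculation follows; the `y_true` and `y_pred` arrays are made-up values for illustration, not results from the Boston data.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical observed and predicted responses, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE is the square root of the mean squared error,
# so it is expressed in the same units as the response
rmse = sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print('RMSE:', round(rmse, 3))
print('R-squared:', round(r2, 3))
```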

In [None]:
# seed value for random number generators to obtain reproducible results
RANDOM_SEED = 1

# although we standardize the X and y variables on input, we still fit the intercept term in the models
# expect the fitted intercept to be close to zero
SET_FIT_INTERCEPT = True
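
Why the intercept should land near zero: ordinary least squares sets the intercept to the mean of y minus the coefficients times the means of the X columns, and standardization makes all of those means zero. The sketch below checks this on synthetic data; the arrays `X_raw` and `y_raw` and their generating coefficients are assumptions made up for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic data with a clearly nonzero intercept (illustrative values only)
rng = np.random.RandomState(1)
X_raw = rng.normal(loc=10.0, scale=3.0, size=(100, 3))
y_raw = 5.0 + X_raw @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

# standardize the response and explanatory variables together,
# with the response in column 0
demo_data = StandardScaler().fit_transform(np.column_stack([y_raw, X_raw]))
y_std, X_std = demo_data[:, 0], demo_data[:, 1:]

# fit with an intercept term; on standardized data it is zero
# up to floating-point error
demo_model = LinearRegression(fit_intercept=True).fit(X_std, y_std)
print('Fitted intercept:', demo_model.intercept_)
```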

In [None]:
# read data for the Boston Housing Study
# creating data frame boston_input
boston_input = pd.read_csv('boston.csv')

In [None]:
# check the pandas DataFrame object boston_input
print('\nboston DataFrame (first and last five rows):')
print(boston_input.head())
print(boston_input.tail())

print('\nGeneral description of the boston_input DataFrame:')
print(boston_input.info())

In [None]:
# drop neighborhood from the data being considered
boston = boston_input.drop(columns='neighborhood')
print('\nGeneral description of the boston DataFrame:')
print(boston.info())

print('\nDescriptive statistics of the boston DataFrame:')
print(boston.describe())

In [None]:
# set up preliminary data for fitting the models 
# the first column is the median housing value response
# the remaining columns are the explanatory variables
prelim_model_data = np.array([
    boston.mv,
    boston.crim,
    boston.zn,
    boston.indus,
    boston.chas,
    boston.nox,
    boston.rooms,
    boston.age,
    boston.dis,
    boston.rad,
    boston.tax,
    boston.ptratio,
    boston.lstat,
]).T

In [None]:
# dimensions of the preliminary model data: y response in column 0, X inputs in the remaining columns
# preliminary data before standardization
print('\nData dimensions:', prelim_model_data.shape)

In [None]:
# standard scores for the columns... along axis 0
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit(prelim_model_data))

In [None]:
# show standardization constants being employed
print(scaler.mean_)
print(scaler.scale_)
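
For reference, `mean_` and `scale_` are the column means and the column standard deviations, where `StandardScaler` uses the population formula (`ddof=0`). A small sketch on a made-up array `toy` shows that the transform is just (x - mean) / scale applied column by column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny made-up array: 3 rows, 2 columns
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

toy_scaler = StandardScaler().fit(toy)

# the same standard scores computed by hand;
# np.std defaults to ddof=0, which matches StandardScaler
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaler.mean_, toy.mean(axis=0)))
print(np.allclose(toy_scaler.scale_, toy.std(axis=0)))
print(np.allclose(toy_scaler.transform(toy), manual))
```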

In [None]:
# the model data will be standardized form of preliminary model data
model_data = scaler.fit_transform(prelim_model_data)

In [None]:
# dimensions of the standardized model data: y response in column 0, X inputs in the remaining columns
# all in standardized units of measure
print('\nDimensions for model_data:', model_data.shape)
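
With the response in column 0 and the explanatory variables in the remaining columns, the cross-validation design described in the introduction could be sketched roughly as follows. This is one possible layout, not the study's own code: the helper name `cv_rmse`, the fold count, and the `alpha` and `l1_ratio` settings are all assumptions, and the demonstration runs on synthetic data rather than `boston.csv`.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cv_rmse(data, regressors, n_folds=10, seed=1):
    """Return mean test-fold RMSE per regressor; column 0 of data is the response."""
    X, y = data[:, 1:], data[:, 0]
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_scores = {name: [] for name in regressors}
    for train_idx, test_idx in kf.split(X):
        for name, reg in regressors.items():
            # fit on the training folds, score on the held-out fold
            reg.fit(X[train_idx], y[train_idx])
            pred = reg.predict(X[test_idx])
            fold_scores[name].append(sqrt(mean_squared_error(y[test_idx], pred)))
    return {name: float(np.mean(s)) for name, s in fold_scores.items()}


# hyperparameter values here are illustrative assumptions, not tuned choices
demo_regressors = {
    'Linear': LinearRegression(fit_intercept=True),
    'Ridge': Ridge(alpha=1.0, fit_intercept=True),
    'Lasso': Lasso(alpha=0.01, fit_intercept=True),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5, fit_intercept=True),
}

# synthetic stand-in with the same layout as model_data (response first)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 12))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 5] + rng.normal(scale=0.5, size=200)
demo_cv_data = np.column_stack([y_demo, X_demo])
demo_cv_data = (demo_cv_data - demo_cv_data.mean(axis=0)) / demo_cv_data.std(axis=0)

demo_scores = cv_rmse(demo_cv_data, demo_regressors)
for name, score in demo_scores.items():
    print(name, 'mean test RMSE:', round(score, 3))
```

Calling `cv_rmse(model_data, demo_regressors)` would apply the same design to the standardized Boston data. Because the response is in standard units, an RMSE well below 1 indicates the model explains a substantial share of the variance. The sketch standardizes before splitting, in the same order as the cells above, so the test folds contribute to the scaling constants.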