# Lab 5 Linear Regression

In this part of the lab we will use linear regression to fit some data about housing prices in Boston.

In [None]:
# notebook magic to display plots
%matplotlib inline
# notebook magic to auto reload imported modules when changes are made to them 
%load_ext autoreload
%autoreload 2

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import sklearn.datasets
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = sklearn.datasets.load_boston()
print(boston.keys())

To read a description of the dataset, uncomment the following line:

In [None]:
# print(boston.DESCR)

In [None]:
# Print column names
print(boston.feature_names)

The data (predictor variables) is stored as a multi-dimensional array.  Let's convert it into a pandas data frame for easier manipulation.

In [None]:
bos = pd.DataFrame(boston.data,columns=boston.feature_names)
bos.head()

The target variable (the y value we wish to predict) is stored separately.  In this case, the target variable is `MEDV` (the median value of owner-occupied homes in \$1000’s).

In [None]:
price = boston.target  

## Exploratory Data Analysis and Summary Statistics
Let's explore this data set. First we use `describe()` to get basic summary statistics for each of the columns.


In [None]:
bos.describe()

Next let's look at some scatter plots to see the relationship between predictor variables and the target variable.  With a pandas data frame, you can access a column of data simply by using dot notation, like so:

In [None]:
crime = bos.CRIM
plt.scatter(crime, price)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")

**TODO:** Plot two more scatter plots: 'RM' vs. price and 'PTRATIO' vs. price.   Please write *descriptive* labels on the figures, as in the example above.  Note: you will need to read the dataset description (`boston.DESCR`) to find out what 'RM' and 'PTRATIO' are!

In [None]:
# todo: your code here

## Linear Regression

Let's use sklearn to fit a linear regression model.  We'll start with a simple linear regression on 'RM'.  In other words, our model is: $price = \beta_0 + \beta_1 \times RM$.

We can select a subset of the columns of a data frame like this:

In [None]:
X = bos[['RM']]

Now let's fit a linear model using the sklearn module.

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

def getCoefficients(X, lm):
    """Given a dataset X and a fitted linear model, returns a nice data frame showing the coefficients."""
    names = ['Intercept'] + list(X.columns)
    coeffs = [lm.intercept_] + list(lm.coef_)
    return pd.DataFrame({'names': names, 'estimatedCoefficients': coeffs})

In [None]:
lm.fit(X, price)
getCoefficients(X, lm)
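
To see how the intercept and slope relate to the model equation, here is a minimal sketch on synthetic data. The data, the column name `x`, and the names `X_demo`, `y_demo`, and `demo_lm` are assumptions made for illustration; they are not part of the Boston dataset. The sketch computes $\beta_0 + \beta_1 \times x$ by hand and checks that it matches `predict`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the Boston dataset):
# y depends linearly on a single predictor 'x', plus some noise.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'x': rng.uniform(3, 9, size=100)})
y_demo = -30.0 + 9.0 * X_demo['x'].values + rng.normal(0, 2, size=100)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: beta_0 + beta_1 * x
manual_pred = demo_lm.intercept_ + demo_lm.coef_[0] * X_demo['x'].values

# Prediction from the fitted model
sklearn_pred = demo_lm.predict(X_demo)

# The two should agree up to floating point error
print(np.allclose(manual_pred, sklearn_pred))
```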

We can also get the $R^2$ goodness of fit:

In [None]:
# execute this line to get some documentation on what score gives you
lm.score?

In [None]:
lm.score(X, price)
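
For intuition about what `score` returns, the following sketch computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand and compares it with `score`. It uses synthetic data, and the names `x_demo`, `y_demo`, and `demo_lm` are illustrative assumptions rather than variables from this lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration only)
rng = np.random.RandomState(1)
x_demo = rng.uniform(0, 10, size=(80, 1))
y_demo = 2.0 + 3.0 * x_demo[:, 0] + rng.normal(0, 1.5, size=80)

demo_lm = LinearRegression().fit(x_demo, y_demo)
y_hat = demo_lm.predict(x_demo)

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_demo - y_hat) ** 2)
# Total sum of squares: variation around the mean of y
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

# Both numbers should match
print(r2_manual, demo_lm.score(x_demo, y_demo))
```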

## Finding a good model

Your task is to find the best fitting model for this dataset.  You can accomplish this by writing some code that does this automatically (see p. 78 of the [ISL book](http://www-bcf.usc.edu/~gareth/ISL/)) or you can use a more manual approach.

You can receive full credit if you implement the "Forward selection" approach described in the ISL book (p. 78). For a stopping criterion, you can simply add up to 5 predictor variables and then stop.  But you are encouraged to get creative (*challenge problem!*).  Here are things you might try:

- use exploratory data analysis to identify interesting patterns in the data that you can exploit
- try some of the other approaches described on p.78
- relax the additive assumption (p. 87 of ISL)
- consider non-linear relationships (p. 90)
- consider other data transformations (e.g., transform a numerical predictor into a categorical one then add the categorical predictor variable to your model, p. 82)
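
The last three ideas all amount to adding new columns to the data frame before fitting. Below is a minimal sketch of how such columns might be built with pandas. The data frame `demo`, its columns `a` and `b`, and the bin edges are hypothetical and chosen only for illustration; which transformations actually help on the Boston data is for you to find out.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with two numerical predictors
rng = np.random.RandomState(2)
demo = pd.DataFrame({
    'a': rng.uniform(0, 10, size=50),
    'b': rng.uniform(0, 5, size=50),
})

features = demo.copy()

# Relaxing the additive assumption: an interaction term a * b
features['a_x_b'] = demo['a'] * demo['b']

# A non-linear relationship: a quadratic term in a
features['a_sq'] = demo['a'] ** 2

# Numerical to categorical: bin 'a' into three ranges, then encode the
# bins as dummy (0/1) columns, dropping the first level as the baseline
a_bins = pd.cut(demo['a'], bins=[0, 3, 7, 10],
                labels=['low', 'mid', 'high'], include_lowest=True)
dummies = pd.get_dummies(a_bins, prefix='a', drop_first=True)
features = pd.concat([features, dummies], axis=1)

print(features.columns.tolist())
```

The resulting `features` frame can be passed to `LinearRegression.fit` in the same way as a plain subset of columns.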

How will we judge if the model is good?  If we add more predictor variables, the $R^2$ on the data used for fitting can never go down, so it's not the best measure.  Instead we will split our dataset into two components, a training set and a test set:

In [None]:
bos, price = sklearn.utils.shuffle(bos, price, random_state=0)  # by shuffling bos and price together, we preserve the relationship between rows
test_size = 250
trainBos = bos[:-test_size]
trainPrice = price[:-test_size]
testBos = bos[-test_size:]
testPrice = price[-test_size:]
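
As an aside, scikit-learn also provides `train_test_split`, which shuffles and splits in one call. The sketch below uses synthetic data with 506 rows (the size of the Boston dataset); the names `X_demo`, `y_demo`, `p1`, and `p2` are assumptions for illustration. Note that it will not select the same rows as the manual split above, so stick with the split above for this lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Boston dataset)
rng = np.random.RandomState(3)
X_demo = pd.DataFrame(rng.normal(size=(506, 2)), columns=['p1', 'p2'])
y_demo = rng.normal(size=506)

# An integer test_size is interpreted as an absolute number of test rows
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=250, random_state=0)

print(X_train.shape, X_test.shape)
```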

Do *not* use the test data when building your model.  Instead, only use it at the end to evaluate your final model.  Here's an illustration of what that might look like:

In [None]:
# train a linear regression model with predictor variables RM and PTRATIO
columns = ['RM', 'PTRATIO']
X = trainBos[columns]
lm.fit(X, trainPrice)
print(lm.score(X, trainPrice))
getCoefficients(X, lm)

In [None]:
# example of evaluating the model on the test data
Xtest = testBos[columns]
lm.score(Xtest, testPrice)
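
To illustrate why the training $R^2$ alone is a poor guide, the sketch below fits nested models on synthetic data in which only the first column is related to the target and the remaining 20 columns are pure noise. All names and numbers here are assumptions for illustration. The training $R^2$ does not decrease as noise columns are added, while the test $R^2$ has no such guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(4)
n = 60

# One real predictor and 20 noise predictors unrelated to y
signal = rng.normal(size=(n, 1))
y_demo = 1.0 + 2.0 * signal[:, 0] + rng.normal(size=n)
noise_cols = rng.normal(size=(n, 20))
X_all = np.hstack([signal, noise_cols])

# First half for training, second half for testing
train, test = slice(0, 30), slice(30, n)

train_scores, test_scores = [], []
for k in [1, 6, 11, 21]:  # number of leading columns used in the model
    lm_k = LinearRegression().fit(X_all[train, :k], y_demo[train])
    train_scores.append(lm_k.score(X_all[train, :k], y_demo[train]))
    test_scores.append(lm_k.score(X_all[test, :k], y_demo[test]))
    print(k, train_scores[-1], test_scores[-1])
```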

**TODO** Go forth and build your model!  Write your code in `lab5.py`.  In the space below, evaluate your model on the test data:

In [None]:
# todo: evaluate your model on the test data

**TODO** Write a brief description of how you fit your model.

**YOUR ANSWER HERE**:  *todo: replace this with your answer*