Linear Regression Tutorial
===

Some problems don't have discrete (categorical) labels (e.g. color, plant species), but rather a continuous range of numbers (e.g. length, price). For these types of problems, regression is usually a good choice. Rather than predicting a categorical label for each example, it fits a continuous line (or plane, or curve) to the data in order to give a prediction as a number.

If you've ever found a "line of best fit" using Excel, you've already used regression!
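
As a small illustration of a "line of best fit", the sketch below fits a straight line to a handful of made-up points with `np.polyfit` and uses it to predict a number for a new input. The data values are invented for this example and are not taken from the housing dataset.

```python
import numpy as np

# Made-up example data (not from the housing dataset): y is exactly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted line to predict a continuous value for a new input
new_x = 5.0
prediction = slope * new_x + intercept
print('slope=%.2f, intercept=%.2f, prediction=%.2f' % (slope, intercept, prediction))
```

Because the points lie exactly on a line, the fit recovers that line; real data is noisy, so the fitted line only approximates it.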

Setup
===
Import numpy (for numerical data), matplotlib.pyplot (for plotting figures), linear_model (for the scikit-learn linear regression algorithm), datasets (to load the Boston housing prices dataset from scikit-learn), cross_validation (to create training and testing sets), and the mean_squared_error and r2_score metrics (to evaluate the model). Then tell matplotlib to print figures in the notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets # Import the linear regression function and dataset from scikit-learn
from sklearn import model_selection as cross_validation # sklearn.cross_validation was renamed to model_selection in scikit-learn 0.18
from sklearn.metrics import mean_squared_error, r2_score

# Print figures in the notebook
%matplotlib inline 

Import the dataset
===
Import the dataset and store it in a variable called boston. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). This dataset is similar to a Python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names']

The data features are stored in boston.data, where each row is data from a suburb near Boston, and each of the 13 columns is a single feature. The 13 feature names (with the label name as the 14th element) are stored in boston.feature_names, and include information such as the average number of rooms per home and the per capita crime rate in the town. The labels, stored in boston.target, are the median housing prices (in thousands of dollars).

Below, we load the labels into y, the data into X, and the names of the features into featureNames. We also print the description of the dataset.

In [None]:
boston = datasets.load_boston()

y = boston.target
X = boston.data
featureNames = boston.feature_names

print(boston.DESCR)

Create Training and Testing Sets
---

In order to see how well our model works, we need to divide our data into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)
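
To make the idea of a 70/30 split concrete, here is a rough sketch of the same operation done by hand with numpy. It is not scikit-learn's implementation; the arrays `X_demo` and `y_demo` are hypothetical stand-ins for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is repeatable

# Hypothetical stand-in data: 10 examples with 3 features each
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

# Shuffle the row indices, then hold out 30% of them for testing
indices = rng.permutation(len(X_demo))
n_test = int(round(0.3 * len(X_demo)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Index features and labels with the same indices so rows stay paired
X_demo_train, X_demo_test = X_demo[train_idx], X_demo[test_idx]
y_demo_train, y_demo_test = y_demo[train_idx], y_demo[test_idx]

print(X_demo_train.shape, X_demo_test.shape)
```

The split is random, so results later in the notebook will vary from run to run. Passing `random_state` to `train_test_split` makes the split repeatable.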

Visualize The Data
===

There are too many features to visualize the whole training set, but we can plot a single feature (e.g. average number of rooms) against the median housing price.

In [None]:
plt.scatter(X_train[:,5], y_train)
plt.ylabel('Median Housing Price')
plt.xlabel('Average Number of Rooms')

Train A Toy Model
===

Here we train the regression on a single feature, then plot the linear regression line on top of the data. We do this by first fitting the regression model on our training data, and then predicting the output of the model for that same training data. These predictions are plotted as a line on top of the training data. 

This can't tell us how well the model will perform on new, unseen data, but it can show us how well the line fits the training data.

In [None]:
regr = linear_model.LinearRegression()
x_train = X_train[:,5][np.newaxis].T # regression expects a (#examples,#features) array shape
regr.fit(x_train, y_train)

plt.scatter(x_train, y_train)
plt.plot(x_train, regr.predict(x_train), c='r')
plt.ylabel('Median Housing Price')
plt.xlabel('Average Number of Rooms')
plt.title('Regression Line on Training Data')
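
The reshaping step in the cell above is easy to miss, so here is a small sketch of what it does. `X_demo` is a hypothetical stand-in for `X_train`; the point is only the array shapes.

```python
import numpy as np

# Hypothetical stand-in for X_train: 4 examples, 6 features
X_demo = np.arange(24).reshape(4, 6)

col_1d = X_demo[:, 5]                        # shape (4,): a flat vector
col_2d = X_demo[:, 5][np.newaxis].T          # shape (4, 1): the form used above
col_reshaped = X_demo[:, 5].reshape(-1, 1)   # shape (4, 1): an equivalent alternative

print(col_1d.shape, col_2d.shape, col_reshaped.shape)
```

Selecting one column yields a 1-D array, while the regression expects one row per example and one column per feature, hence the conversion to a single-column 2-D array.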

Test the Toy Model
===

Next, we will test the ability of our model to predict the median housing price for the neighborhoods in our test set, using only the average number of rooms.

First, we get our predictions for the test data, and plot the fitted line on top of the test data.

In [None]:
x_test = X_test[:,5][np.newaxis].T # regression expects a (#examples,#features) array shape
predictions = regr.predict(x_test)

plt.scatter(x_test, y_test)
plt.plot(x_test, predictions, c='r')
plt.ylabel('Median Housing Price')
plt.xlabel('Average Number of Rooms')
plt.title('Regression Line on Test Data')

Next, we evaluate how well our model worked on the test dataset. Unlike with discrete classifiers (e.g. KNN, SVM), the number of examples it got "correct" isn't meaningful here. We may care if it is thousands of dollars off, but do we care if it's a few cents from the correct answer?

There are many ways to evaluate a regression model, but one popular one is the mean-squared error, or MSE. As the name implies, you take the error for each example (the vertical distance between the point and the predicted line), square each of them, and then average them.

Scikit-learn has a function that does this for you easily.

In [None]:
mse = mean_squared_error(y_test, predictions)

print('The MSE is ' + '%.2f' % mse)
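
For reference, the same quantity can be computed by hand with numpy. The sketch below uses three made-up values (not from the housing dataset) so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred             # per-example errors: 1, 0, -2
mse_by_hand = np.mean(errors ** 2)   # (1 + 0 + 4) / 3

print('The MSE by hand is ' + '%.2f' % mse_by_hand)
```

Squaring makes every error positive and penalizes large misses more heavily than small ones.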

The MSE isn't as intuitive as the accuracy of a discrete classifier, but it is highly useful for comparing the effectiveness of different models. Another option is to look at the $R^2$ score, which you may already be familiar with if you've ever fit a line to data in Excel. A value of 1.0 is a perfect predictor, while 0.0 means the model does no better than always predicting the mean of the labels. The score can even be negative if the model does worse than that.

In [None]:
r2score = r2_score(y_test, predictions)

print('The R^2 score is ' + '%.2f' % r2score)
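
The $R^2$ score can also be written out by hand as one minus the ratio of the model's squared error to the squared error of always predicting the mean. The helper name `r2_by_hand` and the data below are made up for this sketch.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # squared error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up labels, for illustration only
y_true = np.array([3.0, 5.0, 7.0])

good = r2_by_hand(y_true, np.array([2.0, 5.0, 9.0]))      # 1 - 5/8
baseline = r2_by_hand(y_true, np.full(3, y_true.mean()))  # 1 - 8/8
bad = r2_by_hand(y_true, np.array([7.0, 5.0, 3.0]))       # 1 - 32/8

print('%.3f %.3f %.3f' % (good, baseline, bad))
```

The three cases show a reasonable fit, a model that only predicts the mean, and a model that is worse than predicting the mean.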

Train A Model on All Features
===

Next we will train a model on all of the available features and use it to predict the housing prices of our test set. We can then see how this compares to using only a single feature.

In [None]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

predictions = regr.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print('The MSE is ' + '%.2f' % mse)

r2score = r2_score(y_test, predictions)
print('The R^2 score is ' + '%.2f' % r2score)