Linear Regression Tutorial
===

Setup
===
Tell matplotlib to display figures inline in the notebook. Then import numpy (for numerical data), matplotlib.pyplot (for plotting figures), linear_model (for the scikit-learn linear regression algorithm), datasets (to load the Boston housing prices dataset that ships with scikit-learn), and cross_validation (to create training and testing sets).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets # Import the linear regression function and dataset from scikit-learn
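from sklearn import model_selection as cross_validation # The cross_validation module was replaced by model_selection (scikit-learn 0.18+)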

# Print figures in the notebook
%matplotlib inline 

Import the dataset
===
Import the dataset and store it in a variable called boston. Scikit-learn's explanation of the dataset is [here](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). This dataset is similar to a Python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names']

The data features are stored in boston.data, where each row is data from a suburb near Boston, and each of the 13 columns is a single feature. The 13 feature names (with the label name as the 14th element) are stored in boston.feature_names, and include information such as the average number of rooms per home and the per capita crime rate in the town. The labels, stored in boston.target, are the median housing prices (in thousands of dollars).

Below, we load the labels into y, the data into X, and the names of the features into featureNames. We also print the description of the dataset.

In [None]:
boston = datasets.load_boston()

y = boston.target
X = boston.data
featureNames = boston.feature_names

print(boston.DESCR)
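
To find which column of X holds a given feature, one option is to look its name up in the list of feature names. The sketch below is self-contained so that it runs without loading the dataset: it writes out by hand the 13 feature names given in the dataset description, and it uses a made-up row of values. The names `column_of` and `example_row` are hypothetical and introduced only for this illustration.

```python
# The 13 feature names, in column order (written out by hand for this sketch;
# in the notebook they come from boston.feature_names).
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

def column_of(name, names=feature_names):
    """Return the column index of a feature name (hypothetical helper)."""
    return list(names).index(name)

# A made-up row of 13 values, for illustration only (not real data).
example_row = [0.1, 0.0, 7.0, 0, 0.5, 6.2, 65.0, 4.0, 4, 300.0, 18.0, 390.0, 10.0]

rm_index = column_of("RM")    # RM is the average number of rooms per dwelling
print(rm_index)               # column index of the rooms feature
print(example_row[rm_index])  # the rooms value in the made-up row
```

With the real data, the same index would select a whole column, for example X[:, column_of("RM")].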

Create Training and Testing Sets
---

To see how well our regression model works on data it has not seen, we need to divide our data into training and testing sets. Here 30% of the rows are held out for testing.

In [None]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)
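
Conceptually, the split shuffles the row indices and sets aside a fraction of the rows for testing. The sketch below illustrates that idea in plain Python on toy data. It is not scikit-learn's implementation, and `simple_split` and its `seed` parameter are hypothetical names used only here.

```python
import random

def simple_split(X, y, test_size=0.3, seed=0):
    """Shuffle row indices, then hold out a fraction of rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = round(len(X) * test_size)     # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data for illustration only: 10 rows where the label is twice the feature.
X_toy = [[i] for i in range(10)]
y_toy = [2 * i for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy, test_size=0.3)
print(len(X_tr), len(X_te))  # 7 training rows, 3 testing rows
```

Each row keeps its own label after the shuffle, which is the property that matters. The real train_test_split also shuffles at random, so passing its random_state parameter (for example random_state=0) makes the split repeatable between runs.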

Visualize The Data
===

There are too many features to visualize the whole training set at once, but we can plot a single feature (e.g. the average number of rooms) against the median housing price.

In [None]:
# Column 5 is RM, the average number of rooms per dwelling
plt.scatter(X_train[:, 5], y_train)