# Regression

Regression is the supervised learning task of predicting the value of a continuous outcome ("class") variable, _y_, given real-valued input ("feature") data, _X_. The objective of regression is to learn a model of the data that can be used to predict the class value for new or unseen feature data.

A variety of regression algorithms exist. These algorithms have been developed under varying assumptions and employ different concepts. Each algorithm may interact with data differently based upon the size, dimensionality, and noise of the dataset, among other characteristics. These algorithms may have varying degrees of interpretability, variability, and bias.

Here, we'll use the scikit-learn (sklearn) package to explore the use of several regression algorithms. Let's fetch the Boston housing dataset from the UCI Machine Learning repository.

In [1]:
import numpy as np
import pandas as pd

fileURL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pd.read_csv(fileURL, names=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 
                                   'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'], 
                   sep=r'\s+', header=None)  # raw string avoids an invalid-escape warning

X = data.iloc[:, :-1]  # features
y = data.iloc[:, -1]  # class

print(X.head())
print()
print(y.head())

      CRIM  ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

        B  LSTAT  
0  396.90   4.98  
1  396.90   9.14  
2  392.83   4.03  
3  394.63   2.94  
4  396.90   5.33  

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64


## Linear Regression

Perhaps the simplest regression algorithm is linear regression. Let's import the `LinearRegression` class from `sklearn`. Its constructor accepts several optional parameters; here we use the defaults.

In [2]:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()

Every regressor in `sklearn` has a `fit()` method. For a supervised learning algorithm, which learns to map features _X_ to classes _y_, we must input the corresponding _X_ and _y_ data.

In [3]:
regr.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
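Fitting stores the learned parameters on the estimator itself. As a minimal sketch, the cell below fits a linear regressor to synthetic, noise-free data (the names `X_toy`, `y_toy`, and `toy_model` and the generating equation are assumptions made for illustration, not part of the Boston data) and reads back the weights and the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3*x0 - 2*x1 + 5, with no noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

# Fitted parameters live in attributes whose names end with an underscore.
print(toy_model.coef_)       # one weight per feature
print(toy_model.intercept_)  # the bias term
```

Because the toy data contain no noise, the recovered weights should match the generating equation up to floating-point error.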

We call an algorithm fit (or parameterized) to input data a "model." We can now use the fitted model to predict the class value _y_ given new features _X_.

Here, we'll simply see how well the regressor can predict the data on which it was fit. We can predict on feature data using the fitted model's `predict()` method, which returns the corresponding class predictions:

In [4]:
y_pred = regr.predict(X)

We can compare the actual values, `y`, to the predicted values, `y_pred`. Here we compute the mean squared error:

In [5]:
print(np.average((y - y_pred) ** 2))

21.8948311817
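The same quantity is available from `sklearn.metrics.mean_squared_error`. The sketch below checks that the two agree on a tiny hand-made example (the arrays `y_true_toy` and `y_pred_toy` are assumptions for illustration) and also takes the square root, which expresses the error in the units of _y_.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hand-made values for illustration; the errors are 1, 0, and -2.
y_true_toy = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])

mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)       # (1 + 0 + 4) / 3
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)   # same computation
rmse = np.sqrt(mse_sklearn)                                # error in the units of y

print(mse_manual, mse_sklearn, rmse)
```

Note that an error measured on the data used for fitting tends to be optimistic compared with the error on unseen data.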


## Housing Dataset

Let's practice applying what we've learned to a house price dataset for Ames, Iowa.

In [6]:
df_houses = pd.read_excel('http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls')
print(df_houses.head())

   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL           141     31770   Pave   
1      2  526350040           20        RH            80     11622   Pave   
2      3  526351010           20        RL            81     14267   Pave   
3      4  526353030           20        RL            93     11160   Pave   
4      5  527105010           60        RL            74     13830   Pave   

  Alley Lot Shape Land Contour    ...     Pool Area Pool QC  Fence  \
0   NaN       IR1          Lvl    ...             0     NaN    NaN   
1   NaN       Reg          Lvl    ...             0     NaN  MnPrv   
2   NaN       IR1          Lvl    ...             0     NaN    NaN   
3   NaN       Reg          Lvl    ...             0     NaN    NaN   
4   NaN       IR1          Lvl    ...             0     NaN  MnPrv   

  Misc Feature Misc Val Mo Sold Yr Sold Sale Type  Sale Condition  SalePrice  
0          NaN        0       5    20

### 1. Drop or fill missing values. Consider how to handle the following columns, which have many missing values: `Alley`, `Fence`, `Fireplace Qu`, `Misc Feature`, `Pool QC`.

In [7]:
# Code goes here.
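As a hint rather than a solution, here is one possible pattern on a small toy frame. The frame `toy`, its values, and the 0.5 threshold are all assumptions for illustration: columns that are mostly empty are dropped, and remaining numeric gaps are filled with the column median.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: one sparse column, one mostly complete column.
toy = pd.DataFrame({
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'Lot Frontage': [141.0, np.nan, 81.0, 93.0],
    'SalePrice': [200000, 100000, 150000, 250000],
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Drop columns that are more than half empty (the threshold is an arbitrary choice).
sparse_cols = missing_fraction[missing_fraction > 0.5].index
toy_clean = toy.drop(columns=sparse_cols)

# Fill the remaining numeric gaps with each column's median.
toy_clean = toy_clean.fillna(toy_clean.median(numeric_only=True))
print(toy_clean)
```

Dropping is not the only option: when a missing value encodes the absence of a feature rather than an unknown, filling it with an explicit category such as `'None'` may preserve information.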

### 2. Separate the features and the class.

In [8]:
# Code goes here.

### 3. Transform the features to dummy variables.

In [9]:
# Code goes here.
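One common way to do this is `pd.get_dummies`, sketched here on a three-row toy frame built from values in the preview above (the name `toy_features` is an assumption). Categorical columns are expanded into one indicator column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Three rows taken from the preview above: one categorical and one numeric column.
toy_features = pd.DataFrame({
    'MS Zoning': ['RL', 'RH', 'RL'],
    'Lot Area': [31770, 11622, 14267],
})

# Each category of 'MS Zoning' becomes its own indicator column.
toy_dummies = pd.get_dummies(toy_features)
print(toy_dummies.columns.tolist())
```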

### 4. Split the dataset into 80% training data and 20% testing data. Each row should consist of the features, `X`, and the sales price, `y`.

In [10]:
# Code goes here.
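A possible starting point is `train_test_split` from `sklearn.model_selection`. The sketch below applies it to placeholder arrays (`X_demo` and `y_demo` are assumptions for illustration); `test_size=0.2` holds out 20% of the rows, and `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Rows of X and y are shuffled together, then split 80% / 20%.
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_demo_train.shape, X_demo_test.shape)
```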

### 5. Fit a linear regressor on the training data. Use all of the feature data, excluding the sales price, to predict the sales price.

In [11]:
# Code goes here.

### 6. Estimate the mean squared error of the linear regressor.

In [12]:
# Code goes here.
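To see why the error should be estimated on held-out data, the sketch below fits on a training split and reports the mean squared error on both splits. It uses synthetic data (the coefficients, noise level, and all variable names are assumptions for illustration), not the Ames dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

# Fit on the training rows only, then score both splits.
model = LinearRegression().fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))

print(train_mse, test_mse)
```

The test error is the estimate of how the model would perform on new houses; the training error alone can be misleadingly low.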