In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
data = pd.read_csv(url, sep=r"\s+", header=None)

# Add column names to DataFrame
data.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 
                'weight', 'acceleration', 'model year', 'origin', 
                'car name']

# Horsepower has some ? values. Replace them by NaNs.
data["horsepower"] = pd.to_numeric(data["horsepower"], errors="coerce")

# Remove the rows where horsepower is NaN.
data = data.dropna(subset=["horsepower"])

To fit a linear regression, you can use Scikit-Learn (or an `lm` function that you wrote yourself). Scikit-Learn requires all variables to be quantitative, so any categorical variables must first be converted to dummy variables.
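
As a minimal sketch of that conversion, the block below expands a categorical column into dummy variables with `pd.get_dummies`. The small `toy` frame and its values are made up for illustration; it only mimics the fact that `origin` in the auto-mpg data is a region code rather than a quantity.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "horsepower": [130.0, 165.0, 95.0, 88.0],
    "origin": [1, 1, 3, 2],  # a categorical code, not a quantity
})

# Expand `origin` into 0/1 indicator columns. drop_first=True drops one
# level so the dummies are not collinear with the model's intercept.
X_toy = pd.get_dummies(toy, columns=["origin"], drop_first=True, dtype=float)
print(X_toy.columns.tolist())
```

The resulting frame is fully numeric and can be passed to `LinearRegression.fit` directly.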

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

X = data[["cylinders", "displacement", "horsepower"]]
y = data["mpg"]
model.fit(X, y)

model.intercept_, model.coef_

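To estimate the prediction error, you can use cross-validation. The example below uses Scikit-Learn's `cross_val_score` function to do 10-fold cross-validation. With `scoring="neg_mean_squared_error"`, it returns the negated mean squared error of each fold, so values closer to zero are better.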

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
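
To turn the per-fold scores into a single error estimate, one option is to negate and average them. The sketch below does this on synthetic data (`X_demo` and `y_demo` are stand-ins invented for the example, not the auto-mpg data), so it runs without a download.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption made for this example).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=10, scoring="neg_mean_squared_error")

# Scores are negated MSEs (higher is better), so flip the sign
# and average over the 10 folds to estimate the test MSE.
cv_mse = -scores.mean()
print(cv_mse)
```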

# Task 1

Implement forward stepwise regression, using the cross-validation estimate of prediction error to determine which variable to add at each step and when to stop.
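
One possible shape for this procedure is sketched below; it is an illustration under stated assumptions, not a reference solution. The helper names `cv_mse` and `forward_stepwise` are hypothetical, the data are synthetic, and the sketch always adds the first variable because it does not compare against an intercept-only model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_mse(X, y, columns, cv=10):
    """Mean cross-validated MSE of a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), X[columns], y,
                             cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()


def forward_stepwise(X, y, cv=10):
    """Greedily add the column that lowers CV error most; stop when none does."""
    selected, remaining = [], list(X.columns)
    best_err = np.inf
    while remaining:
        # Score every one-variable extension of the current model.
        errs = {c: cv_mse(X, y, selected + [c], cv=cv) for c in remaining}
        candidate = min(errs, key=errs.get)
        if errs[candidate] >= best_err:
            break  # no candidate improves the CV estimate
        selected.append(candidate)
        remaining.remove(candidate)
        best_err = errs[candidate]
    return selected, best_err


# Synthetic stand-in data: only x0 and x1 drive the response.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=200)

selected, err = forward_stepwise(X_demo, y_demo)
print(selected, err)
```

On the notebook's data, the same function could be called with a DataFrame of candidate predictors and `data["mpg"]` as the response.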

# Task 2

Implement backward stepwise regression, using the cross-validation estimate of prediction error to determine which variable to delete at each step and when to stop.

# Task 3

Use [`LassoCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) in Scikit-Learn to fit the entire Lasso path (as a function of $\lambda$) and return the coefficients for the $\lambda$ value that has the lowest cross-validation error.

**Notes:**
- Try fitting the Lasso with the predictor variables (i.e., the `X` matrix) standardized (using `sklearn.preprocessing.StandardScaler`) and unstandardized. Which one seems to give you better prediction accuracy?
- Scikit-Learn calls the parameter $\lambda$ `alpha`.
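
As a rough sketch of the standardized variant, the block below chains `StandardScaler` and `LassoCV` in a pipeline and reads off the selected `alpha_` and coefficients. The data are synthetic, with columns on deliberately different scales (an assumption for the example); note that the reported coefficients are on the standardized scale.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = 2.0 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize the predictors, then fit the Lasso path with 10-fold CV.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=10))
pipe.fit(X_demo, y_demo)

lasso = pipe.named_steps["lassocv"]
print(lasso.alpha_)  # the lambda (alpha) with the lowest mean CV error
print(lasso.coef_)   # coefficients at that alpha, on the standardized scale

# mse_path_ has shape (n_alphas, n_folds); average over folds for the CV curve.
cv_curve = lasso.mse_path_.mean(axis=1)
```

Dropping the `StandardScaler` step gives the unstandardized fit, and the minimum of `cv_curve` can then be compared between the two.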

In [None]:
from sklearn.linear_model import LassoCV