# Ordinary Least Squares Regression

In this exercise, we will build OLS Regression models, both univariate and multivariate. We will also evaluate the goodness of fit of our models. We will use the Auto dataset from https://archive.ics.uci.edu/ml/datasets/auto+mpg. It is provided as <b>Auto.csv</b> in the data directory.

First, we will visualize the data to understand how automobile features may be related. Then, we will focus on predicting miles per gallon (mpg) from horsepower (hp). We will build two iterations of the model:

\begin{align*}
    MPG^{\left(i\right)} &= \beta_0^{sv} + \beta_1^{sv} \times hp^{\left(i\right)} + Z^{\left(i\right)} \\
    MPG^{\left(i\right)} &= \beta_0^{mv} + \beta_1^{mv} \times hp^{\left(i\right)} + \beta_2^{mv} \times \left(hp^{\left(i\right)}\right)^2 + Z^{\left(i\right)}
\end{align*}

Here, $Z^{\left(i\right)}$ denotes the noise term for observation $i$. We will assess each model by visualizing the predictions, residuals, and quantile-quantile plots. Finally, we will select a model based on held-out data.


In [None]:
# Import the necessary libraries
%matplotlib inline

from matplotlib import pyplot as plt
import utils
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
import warnings
import sklearn

# some settings
warnings.filterwarnings('ignore')
plt.rc('font', size = 14)

Read the data as a pandas dataframe. For reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [None]:
# Read the data
filename = 'data/Auto.csv'
df = pd.read_csv(filename)

# (0). Visualize the data

**(0a).** [1 pt] Print the number of observations in the dataset, and the first 5 rows of the dataframe.

**(0b).** [1 pt] Print all the columns in the dataset.

We assume that the "Name" column does not contain relevant information about the target variable, so we will not use it in our analysis. Drop this column from your dataframe (hint: you might find the command dataframe.drop() useful).
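As a rough illustration of the inspection and drop steps above, here is a minimal sketch on a made-up three-row dataframe. The values and the lowercase column label `name` are assumptions for illustration only; check the actual label in your dataframe before dropping.

```python
import pandas as pd

# Toy stand-in for the Auto data (hypothetical values and column names)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0],
    "horsepower": [130, 165, 58],
    "name": ["car a", "car b", "car c"],
})

print(toy.shape[0])        # number of observations (rows)
print(toy.head())          # first 5 rows (only 3 exist here)
print(list(toy.columns))   # all column names

# drop() returns a new DataFrame; the original is left unchanged
toy_numeric = toy.drop(columns=["name"])
print(list(toy_numeric.columns))
```

Because `drop()` does not modify the dataframe in place by default, remember to assign the result back to a variable.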

**(0c).** [1 pt] Produce a scatter plot of all variables against each other.

Feel free to use the <b>scatter_plot_dataframe()</b> function in utils.py. Note: this function call may take a while.

**(0d).** [1 pt] Produce a plot of correlations between all variables.

Feel free to use the <b>correlation_plot()</b> function in utils.py.

**(0e).** [1 pt] Using the plots above, can you identify the highly collinear pairs of variables? 

**A:** (Type your answer here)

**(0f).** [2 pts] Create a function that splits your data into two subsets: a random subset for training and the remainder for testing. Note that the subsets MUST NOT overlap (i.e. if you use sampling, then do so without replacement). You should have roughly 70% of the data in training and the remaining roughly 30% in testing. Call your function and name the outputs **df_train** and **df_test**.

For reproducibility, your function should take in a seed that is used in whichever random generator you choose. Please fill in the function specified below:

In [None]:
def train_test_split(df,
                     seed):
    '''
    Randomly split the pandas dataframe so that 70% of the data are in training and 30% are in testing
    @param df: pandas DataFrame
    @param seed: int, seed for random generator, set for reproducibility
    @return: 1. df_train, pandas DataFrame containing training samples
             2. df_test, pandas DataFrame containing testing samples
    '''
    pass
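One possible way to obtain non-overlapping subsets is to shuffle the row positions once and cut the shuffled order at the 70% mark. The sketch below shows that idea on a toy dataframe; the helper name `split_by_permutation` and its `train_frac` parameter are hypothetical, and other approaches (for example sampling without replacement) work equally well.

```python
import numpy as np
import pandas as pd

def split_by_permutation(df, seed, train_frac=0.7):
    """Hypothetical helper: shuffle row positions once, then cut at train_frac."""
    rng = np.random.default_rng(seed)          # seeded generator for reproducibility
    order = rng.permutation(len(df))           # each row position appears exactly once
    n_train = int(round(train_frac * len(df)))
    return df.iloc[order[:n_train]], df.iloc[order[n_train:]]

toy = pd.DataFrame({"x": np.arange(10)})
train_part, test_part = split_by_permutation(toy, seed=0)
print(len(train_part), len(test_part))
```

Since every row position appears exactly once in the permutation, the two slices cannot share a row, and reusing the same seed reproduces the same split.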

# (1). Single Variable OLS Regression

We will start with a single variable regression model. We hypothesize a linear relationship (with intercept) between 'mpg' and 'horsepower':

\begin{equation*}
    MPG^{\left(i\right)} = \beta_0^{sv} + \beta_1^{sv} \times hp^{\left(i\right)} + Z^{\left(i\right)}
\end{equation*}

**(1a).** [2 pts] On <b>df_train</b> ONLY, build an OLS model using <b>sm.OLS</b> (statsmodels.api was imported as <b>sm</b> above). You do not have to fit the model in this step.

The dependent variable is 'mpg', and the independent variable is 'horsepower'. Include an intercept using the add_constant() function in statsmodels. Store your single-variable model in a variable called <b>sv_model</b>

Hint: http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html

**(1b).** [1 pt] Use statsmodels' <b>fit()</b> function to fit the model. Store the output in a variable called <b>sv_results</b>

**(1c).** [1 pt] Print a summary of the results by calling <b>summary()</b> on the result from <b>fit()</b>. <br>
You can pass an array of columns as 'xname' to show the column names in the output 

**(1d).** [1 pt] What is the $R^2$ (goodness of fit)?

**A**: (Type your answer here)

**(1e).** [1 pt] Compute predictions for your training samples using the regression. Call the <b>predict()</b> function on <b>sv_results</b>. Store your predictions as <b>sv_y_hat</b>.

**(1f).** [2 pts] On the same plot (using different colors):

1. Produce a scatter plot 'mpg' vs 'horsepower'
2. Produce a plot (or scatter plot) of predictions (i.e. sv_y_hat) vs 'horsepower'

Label your plots and axes. 

**(1g).** [1 pt] Does the best fit line appear to do justice to the data?

**A:** (Type your answer here)

**(1h).** [1 pt] Compute the residuals between observed and predicted mpg for your training samples. Store them as <b>sv_residuals</b>.

**(1i).** [1 pt] Produce a residual plot, that is, a scatter plot of the residuals vs 'horsepower'. Label your plot and axes. A good residual plot has roughly the same variance in residuals across different values of $x$.

**(1j).** [1 pt] Produce a histogram of the <b>sv_residuals</b>.

**(1k).** [1 pt] Now produce a QQ-plot of the residuals vs the Normal distribution. A QQ-plot compares two distributions by plotting their quantiles against each other. You can use the <b>sm.qqplot()</b> function: https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html

Include a 45-degree reference line. To make your data comparable to this reference line, set the **fit** parameter to True so that your data are standardized first.

**(1l).** [1 pt] Do the histogram and QQ-plot suggest that the residuals are Normally distributed?

**A:** (Type your answer here)

# (2). Two Variable OLS Regression
We will now add one more variable to the single variable regression model. We hypothesize a linear relationship (with intercept) between 'mpg' and the features 'horsepower' and 'horsepower^2'.

\begin{equation*}
    MPG^{\left(i\right)} = \beta_0^{mv} + \beta_1^{mv} \times hp^{\left(i\right)} + \beta_2^{mv} \times \left(hp^{\left(i\right)}\right)^2 + Z^{\left(i\right)}
\end{equation*}

We first introduce a squared-horsepower column.

In [None]:
square_columns = ['horsepower']

**(2a).** [1 pt] Use the <b>introduce_power_terms()</b> function in <b>utils.py</b> to update both **df_train** and **df_test** with the new squared column. Be sure to provide <b>power=2</b> to <b>introduce_power_terms()</b>. Print the columns in either df_train or df_test and observe the new column you've just added.
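To show what the resulting column might look like, here is a plain-pandas sketch that adds a squared column by hand. This is not the `introduce_power_terms()` function from utils.py, whose signature is not shown here. The new column label `horsepower^2` follows the name used later in this exercise, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({"horsepower": [100.0, 150.0]})

# Add one new column per listed feature, named '<feature>^2'
for col in ["horsepower"]:
    toy[f"{col}^2"] = toy[col] ** 2

print(list(toy.columns))
print(toy)
```

Whatever method you use, apply the same transformation to both the training and the testing dataframe so that their columns match.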

**(2b).** [1 pt] Build an OLS model using <b>sm.OLS</b>. You do not have to fit the model in this step.

The dependent variable is 'mpg', and the independent variables are 'horsepower' and 'horsepower^2'. Include the intercept using the add_constant() function in statsmodels. Store your multivariable model in a variable called <b>mv_model</b>.

**(2c).** [1 pt] Similar to the exercise above: Fit the model. Then, compute predictions and residuals for the training samples. Save your predictions as <b>mv_y_hat</b> and your residuals as <b>mv_residuals</b>.

**(2d).** [2 pts] On the same plot (using different colors):

1. Produce a scatter plot 'mpg' vs 'horsepower'
2. Produce a plot of predictions (i.e. mv_y_hat) vs 'horsepower'

Label your plots and axes.

**(2e).** [1 pt] Does the best fit line now appear to do better justice to the data? Why or why not?

**A:** (Type your answer here) 

**(2f).** [1 pt] Produce a histogram of the <b>mv_residuals</b>.

**(2g).** [1 pt] Produce a QQ-plot vs Normal distribution of the <b>mv_residuals</b>.

**(2h).** [1 pt] Do the histogram and QQ-plot suggest that the residuals are Normally distributed?

**A:** (Type your answer here) 

**(2i).** [4 pts] With the <b>train_test_split()</b> function you created in part (0f), produce 10 random subsets of the data for training and testing. For each random split, pass in a different seed.

Using the random splits, run 10 trials of the single variable and multivariable regressions. In each trial, train both models on the training set and then produce predictions on the corresponding test set. Record the mean squared error on each of the 10 test sets. Print out the average mean squared error on the test sets for both the single variable and multivariable models.

**(2j).** [1 pt] Which of the two models would you prefer, and why? Consider both goodness of fit on the training data and the mean squared error on the test sets.

**A:** (Type your answer here)