# Ordinary Least Squares Regression
In this exercise, we will build simple OLS Regression models, both single and multivariable. We will also evaluate the goodness of fit of our simple models.
For this exercise, we will be using the Auto dataset. It is provided as <b>Auto.csv</b> in the same folder as this notebook.

In [None]:
#Import the necessary libraries
%matplotlib inline

from matplotlib import pyplot as plt
import utils
import numpy as np
import statsmodels.api as sm
from scipy import stats

Read the data as a pandas dataframe. For reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [2]:
# Read the data
filename = 'Auto.csv'
df = utils.read_dataset_from_csv(filename)

(0a). Print the number of observations in the dataset. 

(0b). Print all the columns in the dataset

(0c). Produce a scatter plot of all variables against each other. <br> 
Feel free to use the <b>scatter_plot_dataframe()</b> function in utils.py. <br>
Note: this function call may take a while.

(0d). Produce a plot of correlations between all variables. <br>
Feel free to use the <b>correlation_plot()</b> function in utils.py
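As a rough numeric companion to the correlation plot, the sketch below lists column pairs whose absolute correlation exceeds a cutoff. It uses a synthetic dataframe (`toy`) rather than the Auto data, and the 0.8 threshold is an assumption chosen for illustration, not a rule from this exercise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the Auto dataset)
rng = np.random.default_rng(3)
hp = rng.uniform(50, 200, size=150)
toy = pd.DataFrame({
    "horsepower": hp,
    "weight": 20 * hp + rng.normal(0, 100, size=150),  # built to track horsepower closely
    "acceleration": rng.normal(15, 2, size=150),       # unrelated noise
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

threshold = 0.8  # assumed cutoff for "highly collinear"
cols = corr.columns
# Visit each unordered pair once (upper triangle of the matrix)
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

On the real data, the same loop could be run on `df.corr()`; which pairs show up depends on the cutoff you pick.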

(0e). Using the plots above, can you identify the highly collinear pairs of variables? 

(Provide pairs of highly collinear columns/variables here)

(0f). Produce two subsets of your data: a random subset for training and the remaining for testing. Note that the subsets MUST not overlap (i.e. if you use sampling then do so without-replacement). You should have roughly 70% of the data in training and the remaining roughly 30% in testing. Name your datasets as <b>df_train</b> and <b>df_test</b>.
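One possible way to get non-overlapping subsets is to sample the training rows without replacement and then drop those rows to form the test set. The sketch below shows this on a synthetic dataframe; the names `df_demo`, `df_train_demo` and `df_test_demo` are hypothetical and stand in for `df`, `df_train` and `df_test`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Auto data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"horsepower": rng.uniform(50, 200, size=100)})
df_demo["mpg"] = 40 - 0.15 * df_demo["horsepower"] + rng.normal(0, 2, size=100)

# Draw roughly 70% of the rows without replacement for training
df_train_demo = df_demo.sample(frac=0.7, replace=False, random_state=42)

# Every row that was not drawn for training goes to the test set
df_test_demo = df_demo.drop(df_train_demo.index)

print(len(df_train_demo), len(df_test_demo))
```

Because the test set is built by dropping the training index, the two subsets cannot share a row, provided the dataframe index has no duplicate labels.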

<b> (1). Single Variable OLS Regression.</b> <br>
We will start with a single variable regression model. We hypothesize a simple linear relationship (with intercept) between 'mpg' and 'horsepower'

(1a). On <b>df_train</b> ONLY, build a simple OLS model using <b>statsmodels.OLS</b> <br>
Your dependent variable is 'mpg'; <br>
Independent variable is 'horsepower'. <br>
Do include the  intercept using the add_constant() function in statsmodels. <br>
Hint: http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html <br>
Store your single-variable model in a variable called <b>sv_model</b>

(1b). Use statsmodels' <b>fit()</b> function to fit the model. Store the output in a variable called <b>sv_results</b>

(1c). Print a summary of the results by calling <b>summary()</b> on the result from <b>fit()</b>.<br> 
You can pass an array of columns as <b>xname</b> to show the column names in the output

(1d). Interpret the Results: <br>
i. Are the intercept and 'horsepower' statistically significant? <br>
ii. What is the R^2 (goodness of fit)?

(Type your answer here)

(1e). Produce in-sample predictions using the regression relationship. Call the <b>predict()</b> function on <b>sv_results</b>.
Store your predictions as <b>sv_y_hat</b>.

(1f). On the same plot (using different colors): <br>
1. Produce a scatter plot 'mpg' vs 'horsepower'; <br>
2. Produce a plot (or scatter plot) of predictions (i.e. sv_y_hat) vs 'horsepower'. <br>
Label your plots and axes. 
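A minimal sketch of such an overlay is shown below on synthetic data. To keep it short, the fitted values come from `np.polyfit`, which is only a stand-in here; in the exercise they would be your `sv_y_hat` from statsmodels. The variable names are hypothetical.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in data (hypothetical values, not the Auto dataset)
rng = np.random.default_rng(4)
hp = np.sort(rng.uniform(50, 200, size=60))  # sorted so the line is drawn left to right
mpg = 40 - 0.15 * hp + rng.normal(0, 2, size=60)

# Stand-in for the statsmodels predictions: a degree-1 least-squares fit
slope, intercept = np.polyfit(hp, mpg, deg=1)
y_hat = intercept + slope * hp

fig, ax = plt.subplots()
ax.scatter(hp, mpg, color="tab:blue", label="observed mpg")
ax.plot(hp, y_hat, color="tab:red", label="fitted line")
ax.set_xlabel("horsepower")
ax.set_ylabel("mpg")
ax.set_title("mpg vs horsepower with fitted line")
ax.legend()
```

If the x values are not sorted, `ax.plot` connects points in data order and the line can zigzag; a scatter of the predictions avoids that.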

(1g). Does the best fit line appear to do justice to the data?

(Type your answer here.)

(1h). Produce the residuals between observed mpg and in-sample predictions. Store them as <b>sv_residuals</b>.

(1i). Produce a scatter plot of the residuals vs 'horsepower'. Label your plot and axes.

(1j). Produce a histogram of the <b>sv_residuals</b>.

(1k). Now produce a QQ-plot of the residuals vs the Normal distribution. You can use the <b>probplot()</b> function in the <b>scipy.stats</b> library using default options.
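As a sketch of what `probplot()` computes, the example below runs it on a synthetic sample drawn from a Normal distribution (`resid_demo`, a hypothetical stand-in for `sv_residuals`). Without a `plot` argument it only returns the numbers behind the QQ-plot.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for residuals (hypothetical values drawn from a Normal)
rng = np.random.default_rng(5)
resid_demo = rng.normal(0, 2, size=200)

# Returns the theoretical quantiles, the sorted sample values,
# and a least-squares line through the QQ points with its correlation r
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    resid_demo, dist="norm"
)
print(round(float(r), 3))

# In a notebook, passing plot=plt draws the QQ-plot directly, e.g.
#   from matplotlib import pyplot as plt
#   stats.probplot(resid_demo, dist="norm", plot=plt)
```

Points that hug the straight line (r close to 1) are consistent with Normal residuals; systematic bending at the ends suggests heavy tails or skew.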

(1l). Do the histogram and QQ-plot suggest that the residuals are Normally distributed?

(Type your answer here)

<b>(2). Two Variable OLS Regression. </b> <br>
We will now add one more variable to the single variable regression model. We hypothesize a linear relationship (with intercept) between 'mpg' and two regressors: 'horsepower' and 'horsepower' squared.

We first introduce a squared-horsepower column.

In [6]:
square_columns = ['horsepower']

(2a). Use the <b>introduce_power_terms()</b> function in <b>utils.py</b> to update the dataframe (df) with the new squared column. Be sure to provide <b>power=2</b> to <b>introduce_power_terms()</b>
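The signature of `introduce_power_terms()` is defined in utils.py and is not reproduced here. As a rough picture of the intended effect, the sketch below adds a squared column with plain pandas; it assumes the new column is named 'horsepower^2' (the name used for the regressor in part 2c) and uses a hypothetical three-row dataframe.

```python
import pandas as pd

# Tiny hypothetical dataframe, for illustration only
toy = pd.DataFrame({"horsepower": [70.0, 95.0, 130.0], "mpg": [30.0, 24.0, 18.0]})

square_columns_demo = ["horsepower"]
power = 2

# Add one new column per listed column, named '<column>^<power>' (assumed naming)
for col in square_columns_demo:
    toy[f"{col}^{power}"] = toy[col] ** power

print(list(toy.columns))
# → ['horsepower', 'mpg', 'horsepower^2']
```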

(2b). Print the columns in the dataset and observe the new column you've just added.

(2c). Build a simple OLS model using <b>statsmodels.OLS</b>. <br>
Your dependent variable is 'mpg'; <br>
Independent variables are 'horsepower' and 'horsepower^2'. <br>
Do include the intercept using the add_constant() function in statsmodels. <br>
Hint: http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html <br>
Store your multivariable model in a variable called <b>mv_model</b>

(2d). Similar to the exercise above: fit the model, produce in-sample predictions and residuals. <br>
Save your in-sample predictions as <b>mv_y_hat</b> <br>
Save your residuals as <b>mv_residuals</b>.

(2e). On the same plot (using different colors):<br>
1. Produce a scatter plot 'mpg' vs 'horsepower'; <br>
2. Produce a plot of predictions (i.e. mv_y_hat) vs 'horsepower'. <br>
Label your plots and axes

(2f). Does the best fit line now appear to do better justice to the data? Why or why not?

(Type your answer here)

(2g). Produce a histogram of the <b>mv_residuals</b>.

(2h). Produce a QQ-plot vs Normal distribution of the <b>mv_residuals</b>.

(2i). Do the histogram and QQ-plot suggest that the residuals are Normally distributed?

(Type your answer here)

(2j). Wrap part (0f) into a function <b>train_test_split()</b> that returns a random training set and test set from a dataframe. Recall that the training set should hold roughly 70% of the rows and the test set the remaining 30%.

(2k). Use <b>train_test_split()</b> to produce 10 random subsets of the data for training and testing. Using the random splits, run 10 trials of the single variable and multivariable regression. Train your models on the Train set and then produce predictions on the corresponding Test set. Record the Mean-Squared Error on each of the 10 test sets. Print out the average Mean-Squared Error on the test sets for both the single variable and multivariable models.

(2l). What model among the single-variable and multi-variable ones would you prefer, and why? Consider both the regression fits on the Training data and the Mean-Squared Error on the Test set(s).

(Type your answer here).