Before you turn this problem in, make sure everything runs as expected. In the menubar, select **Kernel** $\rightarrow$ **Restart Kernel and Run All Cells...**. If you do not run a specific cell, you will not receive credit for that question. 

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 5: Least Squares Regression (Single-Variable)

Welcome to your fifth lab of the semester!<br>

This lab aims to get you started with linear regression in Python.

By the end of this lab you should be able to:
* Calculate the coefficients of a single-variable least squares regression problem
* Build a model and predict estimates for a variable of interest
* Evaluate model performance

### Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

### Introduction and Data Source

In this lab, we'll be applying least squares regression to data from the [California Department of Transportation (CalTrans)](https://data.ca.gov/). This dataset is no longer publicly available, so we have downloaded it for you and put it in the `data` folder.  

**Question 1 (1pt):** Load in the .csv file in the "data" folder and save it to a dataframe `df`.

In [None]:
# YOUR CODE HERE
# df = ...
# df

This dataset reports freeway congestion in California, organized by county and route. For this exercise, we'll be looking specifically at the Annual Vehicle Miles Traveled (VMT) field, which represents the total number of miles traveled by vehicles on that route in that county, and the Incidents/ Day field, which represents the average number of traffic incidents per day for that route and county in 2017.

Let's create a model to predict the number of Incidents/Day (i.e., the target variable) as a function of annual VMT (i.e., the independent variable). 

**Question 2 (2pts):** To start, create a scatter plot with Annual VMT on the x-axis and Incidents/Day on the y-axis. What can you say about the general relationship between these two variables?

*Note*: Instead of typing out a long column name every time you need it, you can create a variable that contains that column name as a string. For instance, rather than typing out `df["Annual Vehicle Miles Traveled (VMT)"]`, you can define a variable `vmt`:
```python
vmt = "Annual Vehicle Miles Traveled (VMT)"
df[vmt]
```
Alternatively, you can rename the columns themselves.

We recommend that you use one of these methods, as we'll be using these columns for the rest of the assignment. 
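
As a minimal sketch of both approaches, here is a small made-up dataframe. The values, the exact column names, and the short names `vmt` and `incidents` are illustrative assumptions; the column names in your file may differ.

```python
import pandas as pd

# Toy dataframe with long column names (values are made up for illustration)
toy = pd.DataFrame({
    "Annual Vehicle Miles Traveled (VMT)": [1.0e6, 2.5e6, 4.0e6],
    "Incidents/Day": [0.5, 1.2, 2.1],
})

# Option 1: store the long column name in a variable and index with it
vmt = "Annual Vehicle Miles Traveled (VMT)"
vmt_values = toy[vmt]

# Option 2: rename the columns (returns a new dataframe; `toy` is unchanged)
toy_renamed = toy.rename(columns={
    "Annual Vehicle Miles Traveled (VMT)": "vmt",
    "Incidents/Day": "incidents",
})
print(list(toy_renamed.columns))
```

Either option works; the variable approach leaves the original column names intact, which can help when you refer back to the source file.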

In [None]:
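# YOUR CODE HERE
vmt = ...    # column name for Annual VMT, as a string
i_day = ...  # column name for Incidents/Day, as a string (used again later in the lab)
sns.scatterplot(...)
plt.title(...)
plt.show()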

*Your observations here*

### Estimate the coefficients

In lecture we went over formulas to solve for the coefficients $\beta_0$ and $\beta_1$ in a single-variable least squares regression problem:

$y_i = \beta_0 + \beta_1 x_i + e_i$.

Those formulas are:

$\hat{\beta}_0  =\bar{y} - \hat{\beta}_1\bar{x}$

$\hat{\beta}_1 = \frac{ \sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}$
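
As a quick hand check of these formulas, consider a made-up dataset with $x = (1, 2, 3, 4)$ and $y = (2, 4, 5, 7)$ (illustrative values, not taken from the CalTrans data). The means are $\bar{x} = 2.5$ and $\bar{y} = 4.5$, so:

$\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = 3.75 + 0.25 + 0.25 + 3.75 = 8$

$\sum_{i=1}^n (x_i-\bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5$

$\hat{\beta}_1 = \frac{8}{5} = 1.6$

$\hat{\beta}_0 = 4.5 - 1.6 \times 2.5 = 0.5$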

**Question 3 (1 pt):** Write a function that returns the estimated $\beta_0$ and $\beta_1$ using the summation formulas above, taking the vectors of all $x$ and $y$ observations as input.

In [None]:
def get_betas(x, y):
    # YOUR CODE HERE
    return ...

**Question 4 (1 pt):** Use your function to estimate $\beta_0$ and $\beta_1$ for the independent and response variables of interest in the Caltrans data you loaded.  

In [None]:
(b0, b1) = ... # YOUR CODE HERE
print('Beta values are', b0, 'and', b1)

### Predict the target (dependent) variable

**Question 5 (1 pt):** Use your estimated coefficients to predict Incidents/Day ($\hat{y}_i$) for every observation of annual VMT ($x_i$).

In [None]:
y_hat = ... # YOUR CODE HERE

In [None]:
assert len(y_hat) == len(df) # Your code should return a predicted value of y for every observation in the dataset

**Question 6 (1 pt):** Output a plot that overlays your regression line on a scatterplot of VMT vs. incidents per day. 

In [None]:
# YOUR CODE HERE
plt.scatter(...)
plt.plot(...)
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.legend()
plt.show()

### Model estimation and prediction using scikit-learn tools

We can (and will) also estimate coefficients and predict response variables using the Python package scikit-learn. As we move forward in this class, we will be developing more complicated models and using more than one independent variable. The scikit-learn toolbox gives us a way to run regression (and other!) models quickly and efficiently. Let's walk through an example using single-variable regression.

First, we need to set up some new dependencies.

In [None]:
# Install sklearn
! pip install scikit-learn

In [None]:
# Import packages

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

The `scikit-learn` package has a `linear_model` module from which you can call `LinearRegression()` to create a linear regression object:

`lm = linear_model.LinearRegression()`

The `.fit()` method of `lm` takes two arguments, $X$ and $y$, where $X$ holds the independent variables and $y$ holds the dependent variable, or our "target" data.

*Note*: `scikit-learn` expects $X$ to be two-dimensional, with one row per observation and one column per variable. The `linear_model` functions will raise an error if $X$ has a `shape` of the form `(n,)`, where n is the number of elements in the array; a single variable instead needs the shape `(n,1)`. One option is to get the values from your pandas dataframe and use the `.reshape()` method to get the right dimensions. Alternatively, `scikit-learn` will accept $X$ as a pandas dataframe rather than a pandas series; for example, defining $X$ as `df[['column_name']]` is acceptable, but defining $X$ as `df['column_name']` is not. The target $y$ may have shape `(n,)` or `(n,1)`; in this lab we use `(n,1)`, which makes `.predict()` return an array of shape `(n,1)` as well.
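
To make the shape distinction concrete, here is a minimal sketch on a made-up dataframe (the name `toy`, its columns, and its values are illustrative assumptions, not the lab data):

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, for illustration only)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 5.0, 7.0]})

series_shape = toy["x"].shape    # single brackets give a pandas series, shape (4,)
frame_shape = toy[["x"]].shape   # double brackets give a pandas dataframe, shape (4, 1)

# Reshape the 1-D values into a single column; -1 lets NumPy infer the row count
x_column = toy["x"].values.reshape(-1, 1)
print(series_shape, frame_shape, x_column.shape)
```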

For example, we would initiate and fit a linear regression model for the CalTrans data as follows:

In [None]:
X = df[[vmt]] # define an array of independent variables
y = df[[i_day]] # define an array (usually one-dimension) of target variables
lm_incidents = linear_model.LinearRegression() # initiate a linear regression object
fit_incidents = lm_incidents.fit(X,y) # fit the linear regression object to your X and y data

In the code above, the `.fit()` method estimates the coefficients for our linear model. We can access the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ as follows:

In [None]:
beta0_hat =  fit_incidents.intercept_
beta1_hat = fit_incidents.coef_
print('beta0_hat:', beta0_hat)
print('beta1_hat:', beta1_hat) # If we had more than one x term, .coef_ would return more than one coefficient, i.e., beta1_hat, beta2_hat...
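
Because `y` was passed as a two-dimensional object, `.intercept_` and `.coef_` come back as arrays rather than plain numbers. The sketch below, on made-up toy data (not the CalTrans data), shows one way to pull out scalar values. The names `toy_X`, `toy_y`, and `toy_fit` are illustrative.

```python
import numpy as np
from sklearn import linear_model

# Toy data with shape (4, 1) for both X and y (made-up values)
toy_X = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
toy_y = np.array([2.0, 4.0, 5.0, 7.0]).reshape(-1, 1)

toy_fit = linear_model.LinearRegression().fit(toy_X, toy_y)

# With a 2-D target, intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = toy_fit.intercept_[0]
slope = toy_fit.coef_[0, 0]
print(intercept, slope)  # approximately 0.5 and 1.6
```

Extracting scalars this way can make it easier to compare the `scikit-learn` estimates with values computed from the summation formulas.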

**Question 7 (1 pt):** How do the estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$ that we found using the `scikit-learn` tools compare to those we found using the linear regression equations?

*YOUR ANSWER HERE*

We can also use `scikit-learn` to predict the target variable. The linear regression object we initiated and fit for the CalTrans data has a `.predict()` method, which takes in a matrix $X$ and returns an array of $\hat{y}$ values.

In [None]:
y_pred = fit_incidents.predict(X)

Check that the values for `y_pred` equal the values for `y_hat`, at least to the 8th decimal place.

In [None]:
assert (np.round(y_pred,8) == np.round(y_hat.values.reshape(-1,1),8)).all() 
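
Comparing rounded values works here, but rounding can occasionally separate two nearly identical numbers that fall on either side of a rounding boundary. A tolerance-based comparison such as `np.allclose` is a common alternative. The sketch below uses made-up arrays and an assumed absolute tolerance of `1e-8`; it is not part of the graded check.

```python
import numpy as np

# Two made-up arrays that differ only by tiny floating-point-sized noise
a = np.array([[0.5], [2.1], [3.7]])
b = a + 1e-12

# True when every pair of elements differs by at most atol (rtol disabled here)
match = np.allclose(a, b, rtol=0, atol=1e-8)
print(match)
```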

### Evaluate model performance

**Question 8 (2 pts):** Using the `y_pred` predicted values you developed above, calculate the error term $e_i$ (aka, the residual) for each pair of predictions and observations. The result for `error` should be a 1-dimensional array that has the same length as our number of observations. Then, create a scatter plot with the residual on the y-axis and Annual VMT on the x-axis. Overlay on your plot a dotted horizontal line that crosses the y-axis at zero.
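
One thing to watch for when subtracting predictions from observations is array shape. The sketch below uses made-up NumPy arrays (not the lab data) to show how mixing a `(n,)` array with a `(n,1)` array silently produces an `(n,n)` matrix, and one way to avoid that by flattening with `.ravel()`. Your own objects may be pandas dataframes or series, so treat this as an illustration of the shape issue only.

```python
import numpy as np

obs_1d = np.array([2.0, 4.0, 5.0, 7.0])            # shape (4,)
pred_2d = np.array([[2.1], [3.7], [5.3], [6.9]])   # shape (4, 1)

# Pitfall: (4,) minus (4, 1) broadcasts to a (4, 4) matrix
wrong = obs_1d - pred_2d

# Flatten the predictions first so both arrays have shape (4,)
resid = obs_1d - pred_2d.ravel()
print(wrong.shape, resid.shape)
```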

In [None]:
# YOUR CODE HERE

error = ... 

plt.scatter(...)
plt.axline(...)
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.show()

**Question 9 (1pt):** Visually inspect your residual plot. Are there any regions of the x-domain in which your model seems to be systematically over- or under-estimating the response variable? Is this a sign of variance or bias in your model, and what is one strategy for correcting this issue?

*YOUR ANSWER HERE*

**Question 10 (1 pt):** Calculate the mean squared error (MSE) for your model using the formula below. Your result should be a single, non-negative value.

$MSE = \frac{1}{n} \sum_{i=1}^n e_i^2$

*Hint:* Use the `error` array you created in Question 8.

In [None]:
# YOUR CODE HERE
MSE = ...
print(MSE)

Alternatively, you can use scikit-learn's built-in `mean_squared_error` function to calculate MSE.

In [None]:
MSE = mean_squared_error(df[i_day], y_hat)
MSE
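
As a sanity check on made-up numbers (not the lab data), the MSE formula and `mean_squared_error` should agree. The arrays `y_obs` and `y_fit` below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observations and predictions
y_obs = np.array([2.0, 4.0, 5.0, 7.0])
y_fit = np.array([2.1, 3.7, 5.3, 6.9])

resid = y_obs - y_fit                  # residuals e_i
mse_formula = np.mean(resid ** 2)      # (1/n) * sum of squared residuals
mse_sklearn = mean_squared_error(y_obs, y_fit)
print(mse_formula, mse_sklearn)        # both approximately 0.05
```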

### Train-Test Split

In the previous problem, you evaluated the mean squared error of your model on the same data you used to fit it. To get a better sense of the out-of-sample performance of your modeling approach, you can instead divide the data into train and test sets, fit the model on the training set only, and evaluate it on both the train and test sets.
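
To see what a split looks like mechanically, here is a minimal sketch on 20 made-up observations (the arrays `toy_X` and `toy_y` are illustrative, not the lab data). With `test_size=0.1`, two rows go to the test set and the rest to the training set, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up observations with an exact linear relationship
toy_X = np.arange(20, dtype=float).reshape(-1, 1)
toy_y = 2.0 * toy_X + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=123
)
print(X_tr.shape, X_te.shape)  # (18, 1) (2, 1)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=123
)
same_split = np.array_equal(X_te, X_te2)
```

Rows of `X` and `y` stay paired through the shuffle, so each test observation keeps its own target value.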

**Question 11 (2pts):** Complete the code below to do a train/test split, fit a linear model on the training set, and evaluate the mean squared error (MSE) of your model on both the training set and the test set. 

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)

# Fit a model on the training set 
# YOUR CODE HERE

# Use your model to predict y values for both the train and test set
# YOUR CODE HERE

# Evaluate MSE for both the train and test set
MSE_train = ...
MSE_test = ...

# Print results
print('Train MSE: ' + str(MSE_train))
print('Test MSE: ' + str(MSE_test))

**Question 12 (1 pt):** Which MSE is higher? Why might that be?

*YOUR ANSWER HERE*

# Hooray, you're done! 

Please remember to submit your lab work in .html format on bCourses, after selecting **Kernel** $\rightarrow$ **Restart Kernel and Run All Cells...**.