In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab10.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

# Lab 10: Residuals

## References

* [Sections 15.0 - 15.6 of the Textbook](https://inferentialthinking.com/chapters/15/Prediction.html)
* [Sections 16.0 - 16.3 of the Textbook](https://inferentialthinking.com/chapters/16/Inference_for_Regression.html#)
* [datascience Documentation](https://datascience.readthedocs.io/)
* [Python Quick Reference](https://ccsf-math-108.github.io/materials-sp24/resources/quick-reference.html)

## Assignment Reminders

- Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- For each task marked with a 🔎, write your explanation in the designated space.
- Throughout this assignment and all future ones, do not reassign variables! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!_
- Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.
- View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.

Run the following cell to set up the lab, and make sure you run the cell at the top of the notebook that initializes Otter.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## ✈️ San Francisco International Airport Utility Usage

<img src="sfo.webp" width=80% alt="The SFO International Terminal">

The San Francisco International Airport (SFO) consumes large amounts of electricity, natural gas, and water. The data in `sfo.csv` includes the total monthly consumption of electricity (`'electricity'`), natural gas (`'gas'`), and water (`'water'`). This data was sourced from the [SFO Airport Monthly Utility Consumption for Natural Gas, Water, and Electricity page](https://data.sfgov.org/Energy-and-Environment/SFO-Airport-Monthly-Utility-Consumption-for-Natura/gcjv-3mzf/about_data) on data.sfgov.org. The units for each utility are:

* Electricity: kWh
* Natural Gas: therms
* Water: million gallons

The `'passengers'` column contains the total number of passengers in SFO for the given month.

**Run the following cell to load the data into the table `sfo`.**

In [None]:
sfo = Table.read_table('sfo.csv')
sfo.show(3)

In this lab, you'll create models to predict the usage of electricity from the usage of water and evaluate the model by analyzing the residuals.

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

To start, create a scatter plot showing the relationship between electricity and water usage with the least squares regression line overlaid on the scatter plot. 

**Note:** Use the `fit_line=True` parameter for the `scatter` method.

_Check your graph with a classmate, a tutor, or the instructor since there are no auto-grader tests for this task._

In [None]:
...
plt.title('Electricity vs. Water')
plt.show()

<!-- END QUESTION -->

### Task 02 📍

Assign an array of integers to `electricity_water` where the integers correspond to the following statements that best describe the relationship between electricity and water usage based on the scatter plot.

1. There is a positive association between the variables.
2. There is a negative association between the variables.
3. There is neither a positive association nor a negative association between the variables.
4. The association between the variables is approximately linear.
5. The association between the variables is nonlinear.

In [None]:
electricity_water = make_array(...)

In [None]:
grader.check("task_02")

## Fitting Models

There is some kind of relationship between the water and electricity usage at SFO. Now, you will fit the best linear model and the best quadratic model to the data. This part of the lab is a continuation of what you were learning last week.

### Task 03 📍

A linear model has the form of `linear_predicted_electricity = slope * actual_water + intercept`. 

Create the function `linear_model_rmse` that returns the [RMSE](https://inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html#root-mean-squared-error) for a linear model fit to the electricity and water data in the `sfo` table.

The code provided below the function definition will minimize the RMSE and return the approximate slope and the approximate intercept for the least squares regression line.

**Note:** Remember that we are defining error to be the actual $y$ value minus the predicted $y$ value.
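
As a minimal sketch of the calculation, using made-up numbers rather than the `sfo` data, the RMSE of a candidate line can be computed with NumPy as shown here. The arrays `toy_x` and `toy_y` and the function name `toy_line_rmse` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Made-up data for illustration only (not the SFO data).
# These points lie exactly on the line y = 2x + 1.
toy_x = np.array([1.0, 2.0, 3.0, 4.0])
toy_y = np.array([3.0, 5.0, 7.0, 9.0])

def toy_line_rmse(slope, intercept):
    predicted = slope * toy_x + intercept
    error = toy_y - predicted            # actual minus predicted
    return np.sqrt(np.mean(error ** 2))  # root of the mean of the squared errors

print(toy_line_rmse(2, 1))  # this line passes through every point
print(toy_line_rmse(2, 0))  # every prediction is 1 unit too low
```

The first line fits the toy data perfectly, so its RMSE is 0. The second line misses every point by 1 unit, so its RMSE is 1.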

In [None]:
def linear_model_rmse(slope, intercept):
    actual_electricity = ...
    actual_water = ...
    linear_predicted_electricity = ...
    error = ...
    return ...

# The following code uses the minimize function to determine the optimal 
# slope and intercept for the linear model based on your RMSE function.
slope, intercept = minimize(linear_model_rmse)
print(f'The approximate slope and intercept of the least squares linear \
regression line fit to this data are {slope:.3f} kWh/million gallons and {intercept:.3f} kWh.')

In [None]:
grader.check("task_03")

### Task 04 📍

Using the `slope` and `intercept` values calculated in the previous task, create a function called `linear_predict_electricity`. The function should return the predicted electricity usage for the provided water usage.

In [None]:
def linear_predict_electricity(water_usage):
    return ...

# Apply the function to the data and add the predicted values to the table
sfo = sfo.with_column('linear_electricity', 
                      sfo.apply(linear_predict_electricity, 'water'))
sfo.show(3)

In [None]:
grader.check("task_04")

### Task 05 📍

Next, fit the best quadratic model to the data. (As a reminder, "best" for us in this context means the quadratic model that has the lowest RMSE.) A quadratic model has the form `quadratic_predicted_electricity = a * actual_water ** 2 + b * actual_water + c`.

For this task, we've provided you with the template that mirrors the steps for the linear model. You just need to fill in the details.
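
If you want an independent cross-check of what a best quadratic fit looks like, the sketch below uses NumPy's `np.polyfit` on made-up data. This is a different technique from the `minimize` approach this task asks for: `np.polyfit` finds the polynomial that minimizes the sum of squared errors, which is also the polynomial with the smallest RMSE. The names `toy_x`, `toy_y`, `a_toy`, `b_toy`, `c_toy`, and `toy_quadratic_predict` are hypothetical.

```python
import numpy as np

# Made-up data that follows y = 0.5x^2 - x + 2 exactly (not the SFO data)
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 0.5 * toy_x ** 2 - toy_x + 2

# For degree 2, np.polyfit returns the coefficients from the
# highest power down: a (x^2 term), b (x term), c (constant)
a_toy, b_toy, c_toy = np.polyfit(toy_x, toy_y, 2)

def toy_quadratic_predict(value):
    return a_toy * value ** 2 + b_toy * value + c_toy

print(a_toy, b_toy, c_toy)
print(toy_quadratic_predict(2.0))
```

Because the toy data is exactly quadratic, the fitted coefficients should match 0.5, -1, and 2 up to floating-point rounding.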

In [None]:
def quadratic_model_rmse(a, b, c):
    actual_electricity = ...
    actual_water = ...
    quadratic_predicted_electricity = ...
    error = ...
    return ...
    
a, b, c = ...

def quadratic_predict_electricity(water_usage):
    return ...

# Apply the function to the data and add the predicted values to the table
sfo = sfo.with_column('quadratic_electricity', 
                      sfo.apply(quadratic_predict_electricity, 'water'))
sfo

In [None]:
grader.check("task_05")

### Task 06 📍🔎

<!-- BEGIN QUESTION -->

Run the following code cell to see the predicted electricity values overlaid with the actual data.

In [None]:
sfo.select('electricity', 'water', 'linear_electricity', 'quadratic_electricity').scatter('water')

Based on this graphic, which model do you think fits the general trend of the data better: the linear model or the quadratic model?

_Check your response with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this task._

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Model Evaluation

Great work so far! Now we want you to use the tools you've learned in [Chapter 15](https://inferentialthinking.com/chapters/15/Prediction.html) to be able to decide between the two models. 

### Residuals

We define the residual to be the actual value minus the predicted value. (Yes, this is the same as the term error used above.) Analyzing residuals can be useful to help you decide between two (or more) models for prediction. 
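
The sign of a residual tells you the direction of the miss. Here is a small sketch with made-up values; the array names `toy_actual`, `toy_predicted`, and `toy_residuals` are hypothetical.

```python
import numpy as np

# Made-up values for illustration only
toy_actual = np.array([10.0, 12.0, 15.0])
toy_predicted = np.array([11.0, 12.0, 13.0])

toy_residuals = toy_actual - toy_predicted  # actual minus predicted

# Negative residual: the model predicted too high (overprediction)
# Zero residual:     the prediction was exact
# Positive residual: the model predicted too low (underprediction)
print(toy_residuals)
```

With these numbers the residuals are -1, 0, and 2: one overprediction, one exact prediction, and one underprediction.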

#### Task 07 📍

Calculate the residuals associated with the linear and quadratic predictions. We created the function below called `residual` that returns the residual associated with the provided actual and predicted values.

We provided code to add those residuals to the table `sfo`.

In [None]:
def residual(actual, predicted):
    return actual - predicted

In [None]:
linear_residuals = ...
quadratic_residuals = ...
sfo = sfo.with_columns(
    'linear_residual', linear_residuals,
    'quadratic_residual', quadratic_residuals)
sfo

In [None]:
grader.check("task_07")

#### Task 08 📍🔎

<!-- BEGIN QUESTION -->

Residual plots can be used to do a visual diagnostic of a model. In short, a residual plot is a scatter plot with the predictor variable (water usage) on the horizontal axis and the associated residuals on the vertical axis.

According to [Section 15.5](https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html):

> The residual plot of a good regression shows no pattern. The residuals look about the same, above and below the horizontal line at 0, across the range of the predictor variable.
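
To see what a pattern in the residuals can look like, the sketch below fits a straight line to made-up data that is actually curved. It uses `np.polyfit` with degree 1 as a stand-in for the least squares line; all of the names (`toy_x`, `toy_y`, `slope_toy`, `intercept_toy`, `line_residuals`) are hypothetical.

```python
import numpy as np

# Made-up curved data: y = x^2 for x = -2, -1, 0, 1, 2
toy_x = np.arange(-2.0, 3.0)
toy_y = toy_x ** 2

# Least squares line fit to the curved data
slope_toy, intercept_toy = np.polyfit(toy_x, toy_y, 1)
line_residuals = toy_y - (slope_toy * toy_x + intercept_toy)

print(np.round(line_residuals, 2))
```

The residuals here are positive at both ends of the range and negative in the middle. That U-shape is a pattern, which suggests the line is missing curvature in the data.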

1. Create two residual plots, one for the linear model and one for the quadratic model.
2. Based on a visual inspection of the two plots and the content of Section 15.5, which model would you choose?

_Type your answer here, replacing this text._

In [None]:
sfo.scatter('water', 'linear_residual')
plt.title('(Linear) Residual Plot')
plt.show()

sfo.scatter('water', 'quadratic_residual')
plt.title('(Quadratic) Residual Plot')
plt.show()

<!-- END QUESTION -->

### RMSE

Root mean squared error (RMSE) is a way to summarize the prediction error for a model. You can decide between two models by choosing the model with the smaller RMSE.

#### Task 09 📍

For this task:
1. Create the function `rmse` that uses the `sfo` table that you've built so far and returns the RMSE between the actual electricity values and the electricity values predicted by a model. The argument `predicted_col` tells the function which column in `sfo` contains the predicted values to use in the calculation.
2. After calculating the RMSE associated with each model, assign 'linear' or 'quadratic' to `best_model_by_RMSE` based on which model has the lowest RMSE. 

In [None]:
def rmse(predicted_col):
    actual = ...
    predicted = ...
    error = ...
    return ...
    
linear_rmse = ...
quadratic_rmse = ...
print(f'The linear RMSE is {linear_rmse}. \nThe quadratic RMSE is {quadratic_rmse}.')

best_model_by_RMSE = ...
print(f'You said the {best_model_by_RMSE} model is the best model based on the RMSE value.')

In [None]:
grader.check("task_09")

Great work! As you can see, choosing the best model to predict with is not a simple decision. There are many things to consider when picking a model, and the tools you've learned about are here to help you make a decision.

## Submit your Lab to Canvas

Once you have finished working on the lab questions, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the requirements for a Complete score for this lab assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items `File`, `Save and Export Notebook As...`, and `Html_embed` in the notebook's Toolbar to download an HTML version of this notebook file.
5. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded HTML file.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()