In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)

# Module 6.2 Part 1: Least Squares

Previously, you learned to use the regression equation to generate a line describing the relationship between
two numerical variables. In this notebook, you'll see how this can be accomplished through the method of least squares.

2 videos make up this notebook, for a total run time of 16:10.

1. [Squared Error](#section1) *1 video, total runtime 9:55*
2. [Least Squares](#section2) *1 video, total runtime 6:15*
3. [Check for Understanding](#section3)

Textbook readings:
- [Chapter 15.3: Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)

<a id='section1'></a>
## 1. Squared Error


In this video, you'll learn how to quantify the error between observed and predicted values using the mean squared error (MSE) or root mean squared error (RMSE).

In [None]:
YouTubeVideo('BuTMV2r89Gc')
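
As a small worked illustration of these two measures, the sketch below computes the MSE and RMSE with NumPy. The `observed` and `predicted` arrays are made-up values for illustration only; they are not taken from the Madrid data.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
observed = np.array([10.0, 12.0, 15.0, 11.0])
predicted = np.array([11.0, 12.0, 13.0, 12.0])

errors = observed - predicted   # [-1, 0, 2, -1]
mse = np.mean(errors ** 2)      # mean of [1, 0, 4, 1] is 1.5
rmse = np.sqrt(mse)             # about 1.22, in the same units as the observations

print(mse, rmse)
```

Note that the MSE is in squared units, while the RMSE is back on the scale of the original values.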

Let's revisit the Madrid air quality dataset from Module 6.1. Run the cell below to load the data.

In [None]:
madrid = Table.read_table("madrid_2018.csv")
madrid.show(5)

Write a function `linear_fit` that takes a table `t` and column labels `x` and `y`, and returns an array of y-values predicted from the x-values by the regression line.

In [None]:
def linear_fit(t, x, y):
    ...

<details>
    <summary>Solution</summary>
    
    def linear_fit(t, x, y):
        x_su = standard_units(t.column(x))
        y_su = standard_units(t.column(y))
        r = np.mean(x_su * y_su)
        slope = r * np.std(t.column(y)) / np.std(t.column(x))
        intercept = np.mean(t.column(y)) - slope * np.mean(t.column(x))
        return slope * t.column(x) + intercept
</details>
<br>
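
One way to sanity-check the regression equation is to run it on data where the true line is known. The sketch below is an assumption-laden illustration rather than part of the exercise: it works on plain NumPy arrays instead of a table, the helper `regression_line` is a hypothetical name, and the data are synthetic. It compares the result against `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)

def regression_line(x, y):
    """Slope and intercept from the regression equation (hypothetical helper)."""
    r = np.mean(standard_units(x) * standard_units(y))
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept

# Synthetic data scattered around the line y = 2x + 5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + 5 + rng.normal(0, 1, 200)

slope, intercept = regression_line(x, y)
check_slope, check_intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first

print(slope, intercept)
print(np.allclose([slope, intercept], [check_slope, check_intercept]))
```

The two approaches should agree up to floating-point error, and both should land near the slope of 2 and intercept of 5 used to generate the data.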

Use `linear_fit` to define an array `no2_predictions` that contains `NO_2` predictions from each value of `NO` in `madrid`.

In [None]:
no2_predictions = ...

<details>
    <summary>Solution</summary>
    
    no2_predictions = linear_fit(madrid, "NO", "NO_2")
</details>
<br>

Create a new table `madrid_with_predictions` that resembles `madrid` but with a new column `Predicted NO_2` that contains predicted values of `NO_2` for each value of `NO`. Create a scatterplot that compares predicted and observed `NO_2` values.

In [None]:
madrid_with_predictions = ...
madrid_with_predictions.scatter("NO")

<details>
    <summary>Solution</summary>
    
    madrid_with_predictions = madrid.with_column("Predicted NO_2", no2_predictions)
</details>
<br>

Add a column `Errors` to the table `madrid_with_predictions` that contains the difference between each observed and predicted value of `NO_2`.

In [None]:
madrid_with_predictions =  ...
madrid_with_predictions.show(5)

<details>
    <summary>Solution</summary>
    
    madrid_with_predictions = madrid_with_predictions.with_column("Errors", madrid_with_predictions.column("NO_2") - madrid_with_predictions.column("Predicted NO_2"))
</details>
<br>



<a id='section2'></a>

## 2. Least Squares

In this video, you'll learn how to find the best-fit line for two numerical variables by minimizing the MSE, or,
equivalently, the RMSE.

In [None]:
YouTubeVideo('uBaIf9B3BCQ')

Write a function `find_rmse` that returns the root mean squared error of the `NO_2` predictions made by the line with slope `rmse_slope` and intercept `rmse_intercept`.

In [None]:
def find_rmse(rmse_slope, rmse_intercept):
    ...

<details>
    <summary>Solution</summary>
    
    def find_rmse(rmse_slope, rmse_intercept):
        x = madrid.column("NO")
        y = madrid.column("NO_2")
        predicted = rmse_slope * x  + rmse_intercept
        return (np.mean((y-predicted) ** 2)) ** 0.5
</details>
<br>
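
The same calculation can be written so that it does not depend on the `madrid` table. The sketch below is a hypothetical variant, `rmse_for_line`, that takes the x and y arrays as arguments; the tiny arrays are made up so that the answers can be checked by hand.

```python
import numpy as np

def rmse_for_line(x, y, slope, intercept):
    """RMSE of predicting y from x with the line slope * x + intercept."""
    predicted = slope * x + intercept
    return np.sqrt(np.mean((y - predicted) ** 2))

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect = rmse_for_line(x, y, 2, 1)   # the true line: every error is 0
shifted = rmse_for_line(x, y, 2, 0)   # intercept off by 1: every error is 1

print(perfect, shifted)
```

A line that passes through every point has an RMSE of 0, and shifting that line down by 1 gives an RMSE of exactly 1, since every prediction is then off by 1.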

Use `minimize` to find the slope and intercept of the line that minimizes the RMSE of the `NO_2` predictions in `madrid`. Verify that these match the slope and intercept given by the regression equation.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    minimize(find_rmse)
</details>
<br>
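
To get a feel for what a numerical search for the best line involves, the sketch below minimizes the MSE with plain gradient descent on synthetic data. This is a swapped-in technique chosen for illustration, not necessarily the algorithm that `minimize` uses, and the learning rate and step count are assumptions tuned to this particular data.

```python
import numpy as np

# Synthetic data scattered around the line y = 2x + 5
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 * x + 5 + rng.normal(0, 1, 200)

# Gradient descent on the MSE, starting from the line y = 0
slope, intercept = 0.0, 0.0
learning_rate = 0.01   # assumed value; small enough to be stable for this data
for _ in range(20000):
    errors = y - (slope * x + intercept)
    # Step against the gradient of mean(errors ** 2)
    slope += learning_rate * 2 * np.mean(errors * x)
    intercept += learning_rate * 2 * np.mean(errors)

# Slope and intercept from the regression equation, for comparison
r = np.corrcoef(x, y)[0, 1]
reg_slope = r * np.std(y) / np.std(x)
reg_intercept = np.mean(y) - reg_slope * np.mean(x)

print(slope, intercept)
print(reg_slope, reg_intercept)
```

Both printed lines should show essentially the same slope and intercept: the search over all lines ends up at the regression line.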

<a id='section3'></a>
## 3. Check for Understanding

**A. Why do we use the (root) mean squared error instead of just the mean error?**
 
<details>
    <summary>Solution</summary>
If you use an arbitrary line to calculate your estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. Taking the square root of this value returns it to the original units, which makes it easier to interpret.


</details>
<br>
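
The cancellation described in the answer to A can be seen with a few made-up points. In this sketch (hypothetical data, not from the notebook), a flat line at the mean of `y` and a perfect line both have a mean error of 0, and only the RMSE tells them apart.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

flat_errors = y - np.mean(y)   # poor fit: a flat line at the mean of y
good_errors = y - 2 * x        # perfect fit: every error is 0

# Positive and negative errors cancel, so both means are 0
print(np.mean(flat_errors), np.mean(good_errors))

# Squaring removes the signs, so the RMSE separates the two lines
flat_rmse = np.sqrt(np.mean(flat_errors ** 2))   # sqrt(8), about 2.83
good_rmse = np.sqrt(np.mean(good_errors ** 2))   # 0.0
print(flat_rmse, good_rmse)
```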

**B. True or False? The least squares line is always a better fit than the regression line.**

<details>
    <summary>Solution</summary>
    <b>False.</b> The least squares line and regression line are the same. The regression line is the only line that minimizes mean squared error.
</details>
<br>