In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 6.1 Part 2: Linear Regression

In this notebook, you'll learn how to fit a regression line that captures the linear relationship between two numerical variables.
This is accomplished by extending the concept of correlation to a method called linear regression.

5 videos make up this notebook, for a total run time of 63:03.

1. [Prediction](#section1) *1 video, total runtime 11:53*
2. [Introduction to Linear Regression](#section2) *2 videos, total runtime 25:10*
3. [The Regression Equation](#section3) *2 videos, total runtime 26:00*
4. [Check for Understanding](#section4)

Textbook readings:
- [Chapter 15.2: The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)

<a id='section1'></a>
## 1. Prediction

The following video goes through a simple method for making predictions. Professor Adhikari will walk you through the process of trying to predict children's height from their midparent height.

In [None]:
YouTubeVideo('ojod4DTcFdA')

Run the cell below to load `madrid_2018`, a table that contains daily nitric oxide (`NO`, μg/m³) and nitrogen dioxide (`NO_2`, μg/m³) levels in Madrid, Spain, throughout 2018.

In [None]:
madrid_2018 = Table.read_table("madrid_2018.csv")
madrid_2018.show(5)

Write a function `predict_no2` that takes in one argument, `NO_value`. The function should return the average of the `NO_2` values in `madrid_2018` over all rows whose `NO` value is within 3 μg/m³ of `NO_value`.

In [None]:
def predict_no2(NO_value):
    ...

<details>
    <summary>Solution</summary>
    
    def predict_no2(NO_value):
        NO_range = madrid_2018.where("NO", are.between_or_equal_to(NO_value - 3, NO_value + 3))
        return np.mean(NO_range.column("NO_2"))
</details>
<br>
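
The same local-averaging idea can be sketched in plain NumPy. This is only an illustration: the arrays `no` and `no2` are small made-up values (not rows of `madrid_2018`), and the helper `predict_by_local_average` and its `window` parameter are hypothetical names introduced here.

```python
import numpy as np

# Hypothetical toy data, NOT taken from madrid_2018.
no = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
no2 = np.array([20.0, 24.0, 30.0, 32.0, 40.0, 50.0])

def predict_by_local_average(x_value, x, y, window=3):
    """Average y over all points whose x lies within `window` of x_value."""
    nearby = np.abs(x - x_value) <= window  # boolean mask of neighboring points
    return np.mean(y[nearby])

# The points with NO in [2, 8] are used, so the y values 20, 24, 30, 32 are averaged.
local_prediction = predict_by_local_average(5.0, no, no2)
print(local_prediction)  # 26.5
```

Note that the prediction depends on the choice of window: a wider window smooths more, while a narrower one may leave very few points to average.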

Use your function `predict_no2` to set `no2_predictions` to the predictions for each value of `NO` in `madrid_2018`.

In [None]:
no2_predictions = ...

<details>
    <summary>Solution</summary>
    
    no2_predictions = madrid_2018.apply(predict_no2, "NO")
</details>
<br>

Add `no2_predictions` to the `madrid_2018` table and generate a scatterplot that compares your predicted values of `NO_2` to the observed values of `NO_2`.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    madrid_2018.with_column("Predicted NO_2", no2_predictions).scatter("NO")
</details>
<br>

<a id='section2'></a>

## 2. Introduction to Linear Regression

In the next two videos, you'll learn how to calculate regression estimates for variables that are linearly associated.

### Linear Regression

In [None]:
YouTubeVideo('DS95QoflalM')

### Regression to the Mean

In [None]:
YouTubeVideo('1-5HJ4cGhBI')

Find the regression estimate of the nitrogen dioxide (`NO_2`) level on a day where the nitric oxide (`NO`) level is 10 μg/m³.

In [None]:
def convert_to_standard_units(arr):
    return (arr - np.average(arr)) / np.std(arr)

r = np.mean(convert_to_standard_units(madrid_2018.column("NO")) * convert_to_standard_units(madrid_2018.column("NO_2")))

NO_su = ...
est_NO_2_su = ...
est_NO_2 = ...
est_NO_2

<details>
    <summary>Solution</summary>
    
    NO_su = (10 - np.mean(madrid_2018.column("NO"))) / np.std(madrid_2018.column("NO"))
    est_NO_2_su = r * NO_su
    est_NO_2 = (est_NO_2_su * np.std(madrid_2018.column("NO_2"))) + np.mean(madrid_2018.column("NO_2"))
    est_NO_2
</details>
<br>
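
The three steps (convert to standard units, multiply by `r`, convert back) can be traced on a tiny data set. The sketch below uses made-up arrays `x` and `y` and a hypothetical helper `standard_units`; it is meant to show the mechanics, not the Madrid result.

```python
import numpy as np

# Hypothetical toy data, NOT taken from madrid_2018.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def standard_units(arr):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (arr - np.mean(arr)) / np.std(arr)

# Correlation: the mean of the products of the two variables in standard units.
r = np.mean(standard_units(x) * standard_units(y))

new_x = 4.0
new_x_su = (new_x - np.mean(x)) / np.std(x)  # step 1: x in standard units
est_y_su = r * new_x_su                      # step 2: estimate in standard units
est_y = est_y_su * np.std(y) + np.mean(y)    # step 3: back to original units

print(r, est_y)
```

Because `r` lies between -1 and 1, `est_y_su` is never farther from 0 than `new_x_su`, which is the regression-to-the-mean effect described in the second video.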

<a id='section3'></a>

## 3. The Regression Equation

In these videos, you'll learn how to form the equation of the regression line, in both standard units and original units, and how to interpret its slope.

### Regression Equation

In [None]:
YouTubeVideo('0FR1WREFMb4')

### Interpreting the Slope

In [None]:
YouTubeVideo('Vf2f50AHPGc')

In the cell below, find the slope and intercept of the regression line that predicts `NO_2` values from `NO` values.

In [None]:
def convert_to_standard_units(arr):
    return (arr - np.average(arr)) / np.std(arr)

r = ...

x_avg = ...  
x_sd = ...

y_avg = ...
y_sd = ...

slope = ...
intercept = ...

print("The regression line is: NO_2 = " + str(round(slope, 3)) + " * NO + " + str(round(intercept, 3)))

<details>
    <summary>Solution</summary>
    
    r = np.mean(convert_to_standard_units(madrid_2018.column("NO")) * convert_to_standard_units(madrid_2018.column("NO_2")))

    x_avg = np.mean(madrid_2018.column("NO"))  
    x_sd = np.std(madrid_2018.column("NO"))

    y_avg = np.mean(madrid_2018.column("NO_2"))
    y_sd = np.std(madrid_2018.column("NO_2"))
    
    slope = r * (y_sd/x_sd)
    intercept = y_avg - slope * x_avg
</details>
<br>
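
One way to sanity-check a slope and intercept computed from `r`, the averages, and the SDs is to compare them with a least-squares fit. The sketch below does this on made-up data using `np.polyfit` with degree 1; the arrays are hypothetical and the comparison is offered as a check, not as the method used in the course.

```python
import numpy as np

# Hypothetical toy data, NOT taken from madrid_2018.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Slope and intercept from the regression formulas.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# Least-squares straight line; polyfit returns the highest-degree coefficient first.
fit_slope, fit_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
print(fit_slope, fit_intercept)
```

On this toy data both approaches give a slope of about 0.6 and an intercept of about 2.2.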

How can you interpret the slope of this line?

<details>
    <summary>Solution</summary>
    For each 1 μg/m³ increase in nitric oxide, the regression line predicts an average increase in nitrogen dioxide of 18.807 μg/m³.
</details>
<br>

Write a function `predict_NO2_regression_line` that takes in a nitric oxide level and returns the predicted nitrogen dioxide level using the regression line. Use the function to predict the nitrogen dioxide (`NO_2`) level on a day where the nitric oxide (`NO`) level is 10 μg/m³. Check that this value matches the regression estimate you calculated in the section above.

In [None]:
def predict_NO2_regression_line(NO_val):
    ...

predict_NO2_regression_line(10)

<details>
    <summary>Solution</summary>
    
    def predict_NO2_regression_line(NO_val):
        return slope * NO_val + intercept
</details>
<br>
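
The two predictions should agree because the slope-intercept form is the standard-units estimate rearranged. A short derivation, writing $\bar{x}$, $\bar{y}$ for the averages and $\sigma_x$, $\sigma_y$ for the SDs (notation introduced here for illustration):

$$
\hat{y}
= r \cdot \frac{x - \bar{x}}{\sigma_x} \cdot \sigma_y + \bar{y}
= \underbrace{r \cdot \frac{\sigma_y}{\sigma_x}}_{\text{slope}} \cdot x
+ \underbrace{\bar{y} - r \cdot \frac{\sigma_y}{\sigma_x} \cdot \bar{x}}_{\text{intercept}}
$$

The first expression is the estimate computed in standard units and converted back; the second is `slope * x + intercept`.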

<a id='section4'></a>
## 4. Check for Understanding

**A. True or False? There may be a straight line that is a better fit for the data than the regression line.**
 
<details>
    <summary>Solution</summary>
    <b>False.</b> No matter what the shape of the scatter plot is, the regression line has the smallest root mean squared error of prediction among all straight lines, which is the sense in which it is the "best" fit.
</details>
<br>
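
One common way to make "best" precise is root mean squared error (RMSE). The sketch below, on made-up data, compares the RMSE of the regression line with that of a few nearby lines; the helper `rmse` and the particular offsets tried are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical toy data, NOT taken from madrid_2018.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def rmse(slope, intercept):
    """Root mean squared error of the line slope * x + intercept."""
    predictions = slope * x + intercept
    return np.sqrt(np.mean((y - predictions) ** 2))

# Regression line from the correlation, averages, and SDs.
r = np.corrcoef(x, y)[0, 1]
reg_slope = r * np.std(y) / np.std(x)
reg_intercept = np.mean(y) - reg_slope * np.mean(x)
reg_rmse = rmse(reg_slope, reg_intercept)

# RMSE of other lines obtained by nudging the slope and/or the intercept.
other_rmses = [
    rmse(reg_slope + ds, reg_intercept + di)
    for ds in (-0.5, 0.0, 0.5)
    for di in (-1.0, 0.0, 1.0)
    if (ds, di) != (0.0, 0.0)
]

print(reg_rmse, min(other_rmses))
```

Every nudged line has a larger RMSE than the regression line. Trying a handful of lines is not a proof, but it matches the general result.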

**B. If you convert the units of your variables, will the regression line change?**

<details>
    <summary>Solution</summary>
    Yes. While the correlation coefficient is not dependent on the units of a variable, the mean and standard deviation are. Since these values are used to calculate the slope and intercept of a line, the regression line will change.
</details>
<br>
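
A quick numerical check of this answer, assuming a hypothetical conversion of the predictor from μg/m³ to mg/m³ (division by 1000) on made-up data. The helper `regression_line` is introduced here only for illustration.

```python
import numpy as np

# Hypothetical toy data, NOT taken from madrid_2018.
x_ug = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # predictor in μg/m³
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])     # response, units unchanged

def regression_line(x, y):
    """Return (r, slope, intercept) for predicting y from x."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return r, slope, intercept

x_mg = x_ug / 1000  # the same measurements expressed in mg/m³

r_ug, slope_ug, intercept_ug = regression_line(x_ug, y)
r_mg, slope_mg, intercept_mg = regression_line(x_mg, y)

print(r_ug, r_mg)          # the correlation is the same
print(slope_ug, slope_mg)  # the slope is 1000 times larger in mg/m³
```

The correlation is unchanged, while the slope is rescaled by the conversion factor, so the equation of the line changes even though its predictions for the same physical measurement do not.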