In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)

def correlation(t, x, y):
    "Return the correlation coefficient r of columns x and y in table t."
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

def slope(table, x, y):
    "Return the slope of the regression line for predicting y from x."
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    "Return the intercept of the regression line for predicting y from x."
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

def fit(table, x, y, given_x):
    "Return the regression prediction of y at the value given_x."
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b

# Module 6.2 Part 3: Regression Inference

We can conduct hypothesis tests and produce confidence intervals for the estimates of a linear regression. In this notebook, you'll learn how to perform statistical inference for both the predictions and the slope of a linear regression.

This notebook includes 3 videos, with a total runtime of 27:10.

1. [Regression Model](#section1) *1 video, total runtime 9:38*
2. [Prediction Variability](#section2) *1 video, total runtime 10:18*
3. [The True Slope](#section3) *1 video, total runtime 7:14*
4. [Check for Understanding](#section4)

Textbook readings:
- [Chapter 16: Inference for Regression](https://www.inferentialthinking.com/chapters/16/Inference_for_Regression.html)

<a id='section1'></a>
## 1. Regression Model

The following video will introduce you to uncertainty and estimation in the context of linear trends and regression lines. 

In [None]:
YouTubeVideo('aUsYPrGwdhU')
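
As a rough illustration of the regression model, the sketch below generates points from an assumed true line plus random noise and then fits a regression line to them. The true slope, intercept, noise level, and sample size are made-up values chosen for illustration, and the code uses plain NumPy arrays rather than `datascience` tables.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true line" and noise level (illustrative values only)
true_slope, true_intercept, noise_sd = 0.5, 2.0, 1.0
n = 200

x = rng.uniform(0, 10, n)
# Regression model: each y is the height of the true line at x plus a random error
y = true_slope * x + true_intercept + rng.normal(0, noise_sd, n)

def standard_units(a):
    "Convert an array of numbers to standard units."
    return (a - np.mean(a)) / np.std(a)

# Same formulas as the helper functions defined at the top of this notebook
r = np.mean(standard_units(x) * standard_units(y))
est_slope = r * np.std(y) / np.std(x)
est_intercept = np.mean(y) - est_slope * np.mean(x)

print("true line:       slope", true_slope, "intercept", true_intercept)
print("regression line: slope", round(est_slope, 3), "intercept", round(est_intercept, 3))
```

The regression line is computed only from the noisy sample, so its slope and intercept are estimates that would change if a different sample were drawn.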

<a id='section2'></a>

## 2. Prediction Variability

In the next video, you'll learn how to use bootstrap sampling to generate confidence intervals for predictions produced by linear regression.

In [None]:
YouTubeVideo('SHWRa8-86ks')
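
The bootstrap procedure for a prediction can be sketched on synthetic data as follows. This is a minimal sketch, not the notebook's solution: the data are simulated, `fit_line` is a hypothetical helper, and everything uses NumPy arrays instead of tables. Note that `np.percentile` interpolates between values by default, so its output may differ slightly from the `percentile` function used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Synthetic sample (illustrative values, not the movie data)
n = 150
x = rng.uniform(0, 100, n)
y = 0.03 * x + 5 + rng.normal(0, 0.5, n)

def fit_line(x, y):
    "Hypothetical helper: return (slope, intercept) of the regression line."
    slope = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
    return slope, y.mean() - slope * x.mean()

given_x = 78
repetitions = 1000
predictions = np.empty(repetitions)

for i in range(repetitions):
    # Resample rows with replacement, keeping each (x, y) pair together
    idx = rng.integers(0, n, n)
    a, b = fit_line(x[idx], y[idx])
    predictions[i] = a * given_x + b

# The middle 95% of the bootstrapped predictions forms the interval
left, right = np.percentile(predictions, [2.5, 97.5])
print("Approximate 95% confidence interval:", round(left, 3), round(right, 3))
```

Each resample produces a slightly different regression line, and the spread of the predictions at `given_x` reflects that variability.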

Run the cell below to load `movie_reviews`, a table that contains the `RottenTomatoes` score and `IMDB` score for a selection of movies.

In [None]:
movie_reviews = Table.read_table("movie_reviews.csv")
movie_reviews

Find the predicted `IMDB` score for a movie that received a `RottenTomatoes` score of 78.

In [None]:
...

<details>
    <summary>Solution</summary>

    fit(movie_reviews, "RottenTomatoes", "IMDB", 78)
</details>
<br>

For each of 1000 bootstrapped resamples of `movie_reviews`, find the predicted `IMDB` score for a `RottenTomatoes` score of 78. Store these predictions in an array called `predictions`. Generate a histogram to visualize the distribution of these simulated predictions.

In [None]:
repetitions = 1000
predictions = make_array()

for i in np.arange(repetitions):
    ...

Table().with_column("Predicted IMDB Scores", predictions).hist()

<details>
    <summary>Solution</summary>

    for i in np.arange(repetitions):
        bootstrap_sample = movie_reviews.sample()
        bootstrap_prediction = fit(bootstrap_sample, "RottenTomatoes", "IMDB", 78)
        predictions = np.append(predictions, bootstrap_prediction)

    Table().with_column("Predicted IMDB Scores", predictions).hist()
</details>
<br>

Generate a 95% confidence interval for the predicted `IMDB` score for a movie that received a `RottenTomatoes` score of 78.

In [None]:
imdb_left = ...
imdb_right = ...
print('Approximate 95%-confidence interval: (' + str(round(imdb_left, 3)) + ", "+ str(round(imdb_right, 3)) + ")")

<details>
    <summary>Solution</summary>

    imdb_left = percentile(2.5, predictions)
    imdb_right = percentile(97.5, predictions)
</details>
<br>

<a id='section3'></a>

## 3. The True Slope

In the next video, you'll learn how to use bootstrap sampling to generate confidence intervals for the true slope of a linear regression line.

In [None]:
YouTubeVideo('4Qa1uDn-uHU')
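
The same resampling idea applies to the slope. The sketch below, again on simulated data with an assumed true slope of 0.03, bootstraps the regression slope and checks whether 0 falls inside the resulting 95% interval. The names `regression_slope`, `sim_slopes`, and `reject_null` are hypothetical and chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Synthetic sample with an assumed true slope of 0.03 (illustrative only)
n = 150
x = rng.uniform(0, 100, n)
y = 0.03 * x + 5 + rng.normal(0, 0.5, n)

def regression_slope(x, y):
    "Hypothetical helper: slope = r * SD(y) / SD(x)."
    r = np.corrcoef(x, y)[0, 1]
    return r * np.std(y) / np.std(x)

repetitions = 1000
sim_slopes = np.empty(repetitions)

for i in range(repetitions):
    # Resample (x, y) pairs with replacement and refit the slope
    idx = rng.integers(0, n, n)
    sim_slopes[i] = regression_slope(x[idx], y[idx])

left, right = np.percentile(sim_slopes, [2.5, 97.5])

# Test of "true slope is 0" at the 5% level: reject if 0 is outside the interval
reject_null = not (left <= 0 <= right)
print("Approximate 95% confidence interval:", round(left, 4), round(right, 4))
print("Reject the null hypothesis of slope 0:", reject_null)
```

Using the interval this way ties the 95% confidence level to a p-value cutoff of 0.05.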

For each of 1000 bootstrapped resamples of `movie_reviews`, find the slope of the regression line used to predict `IMDB` scores from `RottenTomatoes` scores. Store these slopes in an array called `bootstrap_slopes`. Generate a histogram to visualize the distribution of these simulated slopes.

In [None]:
repetitions = 1000
bootstrap_slopes = make_array()

for i in np.arange(repetitions):
    ...

Table().with_column("Bootstrapped Regression Slopes", bootstrap_slopes).hist()

<details>
    <summary>Solution</summary>

    for i in np.arange(repetitions):
        bootstrap_sample = movie_reviews.sample()
        bootstrap_slope = slope(bootstrap_sample, "RottenTomatoes", "IMDB")
        bootstrap_slopes = np.append(bootstrap_slopes, bootstrap_slope)

    Table().with_column("Bootstrapped Regression Slopes", bootstrap_slopes).hist()
</details>
<br>

Generate a 95% confidence interval for the true slope of the regression line that predicts `IMDB` scores from `RottenTomatoes` scores.

In [None]:
slope_left = ...
slope_right = ...
print('Approximate 95%-confidence interval: (' + str(round(slope_left, 3)) + ", "+ str(round(slope_right, 3)) + ")")

<details>
    <summary>Solution</summary>

    slope_left = percentile(2.5, bootstrap_slopes)
    slope_right = percentile(97.5, bootstrap_slopes)
</details>
<br>

Suppose we want to test the null hypothesis that the true slope of the regression line predicting `IMDB` from `RottenTomatoes` is 0 against the alternative hypothesis that the true slope is not 0. What conclusion would you draw? Assume a p-value cutoff of 0.05.

<details>
    <summary>Solution</summary>
We reject the null hypothesis at the 0.05 cutoff. The data are consistent with the alternative hypothesis, since 0 is not contained in the 95% confidence interval for the true slope.
</details>
<br>

<a id='section4'></a>
## 4. Check for Understanding

**A. What conditions must be met in order for the regression line to be a good approximation of the true line?**

<details>
    <summary>Solution</summary>
    If the regression model holds and the sample size is large, then the regression line is likely to be close to the true line.
</details>
<br>
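
To see the role of sample size, the sketch below simulates data that follow the regression model and measures how far the estimated slope typically lands from the true slope for a few sample sizes. The true line, noise level, and sample sizes are assumptions made for illustration, and `estimated_slope` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is reproducible

true_slope, true_intercept = 0.5, 2.0  # hypothetical true line

def estimated_slope(n):
    "Hypothetical helper: simulate n points from the model and return the fitted slope."
    x = rng.uniform(0, 10, n)
    y = true_slope * x + true_intercept + rng.normal(0, 1, n)
    return np.polyfit(x, y, 1)[0]  # degree-1 fit returns [slope, intercept]

# Average distance between the estimated and true slope over 200 simulated samples
typical_error = {}
for n in (10, 100, 1000):
    errors = [abs(estimated_slope(n) - true_slope) for _ in range(200)]
    typical_error[n] = np.mean(errors)
    print("sample size", n, "typical slope error", round(typical_error[n], 4))
```

When the model holds, larger samples give slopes that cluster more tightly around the true slope.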