## Final Exam Practice: Bootstrap, Confidence Intervals, and Regression

These exercises review concepts covered in class on the bootstrap, confidence intervals, and regression. They do not cover everything that will be on the final exam, but they are good practice for reviewing these concepts.

Remember, practice exercises are *not* required and will not be turned in for credit, but they are helpful for developing your YData skills!

Credit:  These practice exercises have been adapted from Berkeley's Data8 course.

Let's begin by running the cell below.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

## 1. Quantifying Sampling Errors in Regression


Previously in this class, we've used confidence intervals to quantify uncertainty about estimates as well as to test hypotheses. To run a hypothesis test using a confidence interval, we use the following procedure:
1. Formulate a null hypothesis
2. Formulate an alternative hypothesis 
3. Choose a test statistic and compute the observed value for the test statistic
4. Bootstrap, finding a value of the test stat for each resample
5. Generate a 95% confidence interval from those resampled test stats
6. Based on whether the value specified by the null hypothesis is in the interval, make a conclusion
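
The procedure above can be sketched in plain NumPy. This is only an illustration on made-up data, not the course's `datascience` workflow: the helper name `bootstrap_percentile_interval`, the sample, and the null value of 0 are all assumptions chosen for the example, and the test statistic here is a sample mean rather than a slope.

```python
import numpy as np

def bootstrap_percentile_interval(sample, statistic, num_resamples=2000, level=95, seed=0):
    """Hypothetical helper: percentile bootstrap interval for `statistic`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    stats = np.empty(num_resamples)
    for i in range(num_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choice(sample, size=len(sample), replace=True)
        stats[i] = statistic(resample)
    # Cut off equal tails on each side (2.5% each for a 95% interval).
    tail = (100 - level) / 2
    lower, upper = np.percentile(stats, [tail, 100 - tail])
    return lower, upper

# Made-up sample; the null hypothesis says the population mean is 0.
rng = np.random.default_rng(42)
sample = rng.normal(loc=1.0, scale=2.0, size=200)
null_value = 0

lower, upper = bootstrap_percentile_interval(sample, np.mean)

# Reject the null only if its value falls outside the interval.
reject_null = not (lower <= null_value <= upper)
print("95% interval: [{:g}, {:g}]; reject null: {}".format(lower, upper, reject_null))
```

Using a 95% interval this way corresponds to a 5% P-value cutoff.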

We've also recently covered linear regression, which uses one variable to predict another, correlated variable. An example is predicting the heights of children based on the heights of their parents.

We can combine these two topics to make even more powerful statements about a population given just a sample. We can use the following techniques to do so:
- Bootstrapped interval for the true slope
- Bootstrapped prediction interval for y (given a particular value of x)

This practice further explores these two advanced methods.

Recall the Old Faithful dataset from previous practice exercises on regression. The table contains two pieces of information about each eruption of the Old Faithful geyser in Yellowstone National Park:
1. The duration of the eruption, in minutes.
2. The time between this eruption and the next eruption (the "waiting time"), in minutes.

The dataset is plotted below along with its line of best fit.

In [None]:
faithful = Table.read_table('faithful_inference.csv')
faithful.scatter('duration', fit_line=True)
faithful

### Finding the Bootstrap Confidence Interval for the True Slope

Last time we looked at this dataset, we noticed the apparent linear relationship between duration and wait, and we decided to use regression to predict wait in terms of duration. However, our data are just a sample of all the eruptions that have happened at Old Faithful. As we know, relationships can appear in a sample that don't really exist in the population from which the sample was taken.

**Question 3.1.**
Before we move forward using our linear model, we would like to know whether or not there truly exists a relationship between duration and wait time. If there is no linear association between the two, then we'd expect a correlation of 0, which would give us a slope of 0. Write the null and alternative hypotheses below, based on your knowledge of hypothesis tests you've conducted in the past.

- **Null Hypothesis:** [*Your solution goes here*]
- **Alternative Hypothesis:** [*Your solution goes here*]

We will use the method of confidence intervals to test this hypothesis.

<div class="hide">\pagebreak</div>

**Question 3.2.**
We'll warm up by implementing some familiar functions. You may use these functions throughout this assignment. Start by defining these two functions:

1. `standard_units` should take in an array of numbers and return an array containing those numbers converted to standard units.
2. `correlation` should take in a table with 2 columns and return the correlation between these columns. Hint: you may want to use the `standard_units` function you defined above.
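
As a reminder, the two quantities can be written as follows, assuming the textbook's convention of using the population SD (the default behavior of `np.std`). The subscripted symbols are notation introduced here for illustration: $z_{x,i}$ and $z_{y,i}$ are the $i$-th values of the two columns in standard units, and $n$ is the number of rows.

$$z_i = \frac{x_i - \bar{x}}{\mathrm{SD}_x}, \qquad r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i}$$

In words, the correlation is the mean of the products of the two variables in standard units.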

In [None]:
def standard_units(arr):
    ...

def correlation(tbl):
    ...

<div class="hide">\pagebreak</div>

**Question 3.3.**
Using the functions you just implemented, create a function called `fit_line`.  It should take a table as its argument.  It should return an array containing the slope and intercept of the regression line that predicts the second column in the table using the first.
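
As a reminder, with $x$ as the first column, $y$ as the second, and $r$ as their correlation, the regression line satisfies:

$$\text{slope} = r \cdot \frac{\mathrm{SD}_y}{\mathrm{SD}_x}, \qquad \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$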

In [None]:
def fit_line(tbl):
    ...
    slope = ...
    intercept = ...
    return make_array(slope, intercept)

# This should compute the slope and intercept of the regression
# line predicting wait time from duration in the faithful dataset.
fit_line(faithful)

In [None]:
# Ensure your fit_line function fits a reasonable line 
# to the data in faithful, using the plot below

slope, intercept = fit_line(faithful)
faithful.scatter(0)
plt.plot([min(faithful[0]), max(faithful[0])], 
         [slope*min(faithful[0])+intercept, slope*max(faithful[0])+intercept])
plt.show()

Now we have all the tools we need in order to create a confidence interval quantifying our uncertainty about the true relationship between duration and wait time.

<div class="hide">\pagebreak</div>

**Question 3.4.**
Use the bootstrap to compute 1000 resamples from our dataset. For each resample, compute the slope of the best fit line. Put these slopes in an array called `resampled_slopes`, giving you the empirical distribution of regression line slopes in resamples. Plot a histogram of these slopes.

In [None]:
...

<div class="hide">\pagebreak</div>

**Question 3.5.**
Use your resampled slopes to construct an approximate 95% confidence interval for the true value of the slope.

In [None]:
lower_end = ...
upper_end = ...
print("95% confidence interval for slope: [{:g}, {:g}]".format(lower_end, upper_end))

<div class="hide">\pagebreak</div>

**Question 3.6.**
Based on your confidence interval, would you reject or fail to reject the null hypothesis that the true slope is 0? Why? What P-value cutoff are you using?

*Write your answer here, replacing this text.*

### Finding the Bootstrap Confidence Interval for the Regression Line

Suppose we're tourists at Yellowstone, and we'd like to know how long we'll have to wait for the next Old Faithful eruption.  We decide to use our regression line to make some predictions for the waiting times.  But just as we're uncertain about the slope of the true regression line, we're also uncertain about the predictions we'd make based on the true regression line.

<div class="hide">\pagebreak</div>

**Question 3.7.**
Define the function `fitted_value`.  It should take 2 arguments:

1. A table with 2 columns.  We'll be predicting the values in the second column using the first.
2. A number, the value of the predictor variable for which we'd like to make a prediction.

It should return the prediction made at that value by the regression line fit to the table. Make sure to use your `fit_line` function.

In [None]:
def fitted_value(table, given_x):
    ...

# Here's an example of how fitted_value is used.  This should
# compute the prediction for the wait time of an eruption that lasts 
# two minutes.
two_minutes_wait = fitted_value(faithful, 2)
two_minutes_wait

<div class="hide">\pagebreak</div>

**Question 3.8.**
The park ranger tells us that the most recent eruption lasted 5 minutes. Using your function above, assign the predicted wait time to the variable `most_recent_wait`.

In [None]:
most_recent_wait = ...
most_recent_wait

Juan, a fellow tourist, raises the following objection to your prediction:

> "Your prediction depends on your sample of 272 eruptions.  Couldn't your prediction have been different if you had happened to have a different sample of eruptions?"

Having read section [16.3](https://www.inferentialthinking.com/chapters/16/3/Prediction_Intervals.html) of the textbook, you know just the response!



<font color = "red">
**Note**: The textbook uses the term "prediction interval" to refer to what is usually called a "confidence interval for the regression line". This confidence interval gives a range of values that contains the "true" regression line most of the time; i.e., 95% of the intervals constructed this way will contain the true regression line. By "true regression line" we mean a line fit to a population of all possible data, rather than a line fit to just a sample of data.

The term "prediction interval" is usually used to refer to a range that contains the actual y values most of the time; i.e., it accounts for the predicted y values from the regression line **plus the typical random scatter off the line**. For example, if at x = 10 a 95% prediction interval is between 50 and 90, then 95% of the time the y-values at x = 10 will be between 50 and 90. In class I will try to clarify how the book uses this terminology in a non-standard way, which will hopefully help avoid confusion if you come across it after this class.
</font>
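
To make the distinction concrete, here is a rough sketch on made-up data (not the `faithful` table) using plain NumPy. The confidence interval for the line comes from bootstrapped line heights at a chosen x. The prediction interval is approximated by adding one resampled residual to each bootstrapped line height; that residual step is an assumption made for this illustration, not a method taken from the textbook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y depends linearly on x, plus random scatter off the line.
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 20.0 + rng.normal(scale=5.0, size=n)

x0 = 5.0               # the x value at which we want intervals
num_resamples = 2000
line_heights = np.empty(num_resamples)   # height of each bootstrapped line at x0
simulated_ys = np.empty(num_resamples)   # line height plus one random residual

for i in range(num_resamples):
    # Resample (x, y) pairs with replacement.
    idx = rng.integers(0, n, size=n)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    line_heights[i] = slope * x0 + intercept
    # Scatter off the line: add a randomly chosen residual from this resample.
    residuals = y[idx] - (slope * x[idx] + intercept)
    simulated_ys[i] = line_heights[i] + rng.choice(residuals)

ci_lower, ci_upper = np.percentile(line_heights, [2.5, 97.5])
pi_lower, pi_upper = np.percentile(simulated_ys, [2.5, 97.5])

print("Confidence interval for the line at x0: [{:g}, {:g}]".format(ci_lower, ci_upper))
print("Approximate prediction interval at x0:  [{:g}, {:g}]".format(pi_lower, pi_upper))
```

The second interval is much wider because it reflects the scatter of individual y values around the line, not just the uncertainty about where the line is.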

<div class="hide">\pagebreak</div>

**Question 3.9.**
Define the function `bootstrap_lines`.  It should take two arguments:
1. A table with two columns.  As usual, we'll be predicting the second column using the first.
2. An integer, a number of bootstraps to run.

It should return a *table* whose first column, `"Slope"`, contains the given number of bootstrapped slopes, and whose second column, `"Intercept"`, contains the corresponding bootstrapped intercepts.  Each slope and intercept should come from a regression line that predicts column 2 from column 1 of a resample of the given table.  The table should have 1 row for each bootstrap replication.

In [None]:
def bootstrap_lines(tbl, num_bootstraps):
    ...

# When you're done, this code should produce the slopes
# and intercepts of 1000 regression lines computed from
# resamples of the faithful table.
regression_lines = bootstrap_lines(faithful, 1000)
regression_lines

<div class="hide">\pagebreak</div>

**Question 3.10.**
Create an array called `predictions_for_five`.  It should contain 1000 numbers.  Each number should be the predicted waiting time after an eruption with a duration of 5 minutes, using a different bootstrapped regression line. Hint: use `regression_lines` from the previous question.

In [None]:
predictions_for_five = ...

# This will make a histogram of your predictions:
table_of_predictions = Table().with_column('Predictions at eruptions=5', predictions_for_five)
table_of_predictions.hist('Predictions at eruptions=5', bins=20)

<div class="hide">\pagebreak</div>

**Question 3.11.**
Create a 95 percent confidence interval for where the true regression line is at a duration value of 5.

In [None]:
lower_bound = ...
upper_bound = ...

print('95% Confidence interval for the regression line at x=5 is: (', lower_bound,",", upper_bound, ')')

<div class="hide">\pagebreak</div>

**Question 3.12.**
Look at the scatter plot of the data at the start of this exercise. 
Determine which of the following are true, then set `question_12_choice` to an array consisting of the numbers of statements that are true. For example, if you think that 1 and 2 are true but 3 is false, you'd assign `question_12_choice` to be an array consisting of the values 1 and 2.

Statement 1: This confidence interval covers 95 percent of waiting times of eruptions in `faithful` that had an eruption duration of 5 minutes.

Statement 2: This interval gives a sense of how much actual wait times differ from your prediction.

Statement 3: The confidence interval quantifies our uncertainty in our estimate of what the true regression line would predict.

In [None]:
question_12_choice = []

Great job on the final practice! Make sure to study well and go to office hours if you have questions. It's been a pleasure having you in class!