# Homework 08: Regression Inference, Diagnostics, and Classification

Reading:
- Textbook chapter [13](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/13/prediction.html)
- Textbook chapter [14](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/14/inference-for-regression.html)

Run the cell below to prepare the notebook.

In [None]:
# Run this cell to set up the notebook, but please don't change it.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

from test import *

## 1. Quantifying Sampling Errors in Regression

Previously in this class, we've used confidence intervals to quantify uncertainty about estimates and to test hypotheses. To run a hypothesis test using a confidence interval, we use the following procedure:
1. Formulate a null hypothesis
2. Formulate an alternative hypothesis
3. Choose a test statistic and compute the observed value of the test statistic
4. Bootstrap, computing the value of the test statistic for each resample
5. Generate a 95% confidence interval from those resampled test statistics
6. Make a conclusion based on whether the value specified by the null hypothesis lies in the interval
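The six steps above can be sketched on a simpler statistic than a slope. This is a minimal illustration, assuming a made-up sample of 200 numbers and a hypothetical null value of 4 for the population mean; it uses plain NumPy rather than the `datascience` library, and none of the names below come from this assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample (made-up data): 200 measurements.
sample = rng.normal(loc=5.0, scale=2.0, size=200)

# Steps 1-2: the null says the population mean is 4; the alternative says it is not.
null_value = 4.0

# Step 3: the test statistic is the sample mean.
observed_mean = np.mean(sample)

# Step 4: resample with replacement and recompute the statistic each time.
resampled_means = np.array([
    np.mean(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(2000)
])

# Step 5: the middle 95% of the resampled statistics.
lower, upper = np.percentile(resampled_means, [2.5, 97.5])

# Step 6: reject the null at the 5% cutoff if its value falls outside the interval.
reject_null = not (lower <= null_value <= upper)
print(observed_mean, (lower, upper), reject_null)
```

The same pattern applies to any statistic: only the line that computes the statistic inside the loop changes.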

We've also recently covered using linear regression to predict one variable from another, correlated variable. An example is predicting the heights of children based on the heights of their parents.

Some important formulas for regression follow:

- $r = \text{mean}(x_{su} \cdot y_{su})$
- $\text{slope} = r \cdot \frac{\text{SD of }y}{\text{SD of }x}$
- $\text{intercept} = \text{average of }y - \text{slope}\cdot\text{average of }x$
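As a quick sanity check of these formulas, here is a small sketch on five made-up points. The helper name `to_standard_units` and the data are assumptions for illustration only; the result is compared against NumPy's own least-squares fit, `np.polyfit`.

```python
import numpy as np

def to_standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# The three formulas, applied in order.
r = np.mean(to_standard_units(x) * to_standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# Cross-check: a degree-1 polynomial fit returns [slope, intercept].
ls_slope, ls_intercept = np.polyfit(x, y, 1)
print(slope, ls_slope)
print(intercept, ls_intercept)
```

Note that `np.std` computes the population SD (dividing by n), which is the convention these formulas rely on.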

We can combine these two topics to make even more powerful statements about our population, given just a sample as before. We can use the following techniques to do so:
- Bootstrapped interval for the true slope
- Bootstrapped prediction interval for y (given a particular value of x)

This homework further explores these two advanced methods.

Recall the Old Faithful dataset from our lab on regression. The table contains two pieces of information about each eruption of the Old Faithful geyser in Yellowstone National Park:
1. The duration of the eruption, in minutes.
2. The time between this eruption and the next eruption (the "waiting time"), in minutes.

The dataset is plotted below along with its line of best fit.

In [None]:
faithful = Table.read_table('faithful_inference.csv')
faithful.scatter('duration', fit_line=True)
faithful

Two quick questions:

1. What variable are we trying to predict? What variable are we given?

    We are trying to predict waiting time. We are given duration.


2. Given the regression line above, and given that the wait time following an eruption was 60 minutes, can we say that the eruption is likely to have lasted 2.5 minutes?

    No. This regression line predicts y (wait time) using x (duration). To go from y to x, we would need to create a different regression line.
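To see why the two directions need different lines, consider this sketch on made-up data (the variable names are hypothetical). If a single line worked in both directions, the slope for predicting x from y would be the reciprocal of the slope for predicting y from x, and their product would be 1. From the slope formula, the product is instead $r^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up, imperfectly correlated data.
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)

r = np.corrcoef(x, y)[0, 1]

# Slope of the line predicting y from x, and of the line predicting x from y.
slope_y_on_x = r * np.std(y) / np.std(x)
slope_x_on_y = r * np.std(x) / np.std(y)

# The SD ratios cancel, leaving r squared, which is below 1
# unless the points lie exactly on a line.
product = slope_y_on_x * slope_x_on_y
print(product, r ** 2)
```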


### Finding the Bootstrap Confidence Interval for the True Slope

Last time we looked at this dataset, we noticed the apparent linear relationship between duration and wait, and we decided to use regression to predict wait in terms of duration. However, our data are just a sample of all the eruptions that have happened at Old Faithful. As we know, relationships can appear in a sample that don't really exist in the population from which the sample was taken.

Before we move forward using our linear model, we would like to know whether or not there truly exists a relationship between duration and wait time. If there is no relationship between the two, then we'd expect a correlation of 0, which would give us a slope of 0. Now, write in null and alternative hypotheses, based on your knowledge of hypothesis tests you've conducted in the past.

- **Null Hypothesis:** [*Your solution goes here*]
- **Alternative Hypothesis:** [*Your solution goes here*]

We will use the method of confidence intervals to test this hypothesis.

<div class="hide">\pagebreak</div>
#### Question 1
We'll warm up by implementing some familiar functions. You may use these functions throughout this assignment. Start by defining these two functions:

1. `standard_units` should take in an array of numbers and return an array containing those numbers converted to standard units.
2. `correlation` should take in a table with 2 columns and return the correlation between these columns. Hint: you may want to use the `standard_units` function you defined above.

In [None]:
def standard_units(arr):
    ...


def correlation(tbl):
    ...
    


In [None]:
check1_1(standard_units, correlation)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 2
Using the functions you just implemented, create a function called `fit_line`.  It should take a table as its argument.  It should return an array containing the slope and intercept of the regression line that predicts the second column in the table using the first.

In [None]:
def fit_line(tbl):
    ...
    slope = ...
    intercept = ...
    return make_array(slope, intercept)


# This should compute the slope and intercept of the regression
# line predicting wait time from duration in the faithful dataset.
fit_line(faithful)

In [None]:
# Ensure your fit_line function fits a reasonable line
# to the data in faithful, using the plot below.
# Please uncomment the following code once you have implemented the fit_line function.

# slope, intercept = fit_line(faithful)
# faithful.scatter(0)
# plt.plot([min(faithful[0]), max(faithful[0])], 
#          [slope*min(faithful[0])+intercept, slope*max(faithful[0])+intercept])
# plt.show()

In [None]:
check1_2(fit_line)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Now we have all the tools we need in order to create a confidence interval quantifying our uncertainty about the true relationship between duration and wait time.

<div class="hide">\pagebreak</div>
#### Question 3
Use the bootstrap to compute 1000 resamples from our dataset. For each resample, compute the slope of the best fit line. Put these slopes in the array `resample_slopes`, giving you the empirical distribution of regression line slopes in resamples.

In [None]:
resample_slopes = make_array()
for i in np.arange(1000):
    sample = ...
    resample_line = ...
    resample_slope = ...
    resample_slopes = ...


# Please uncomment the code below once you have completed the code above. 
# Table().with_column("Slope estimate", resample_slopes).hist() # DO NOT CHANGE THIS LINE

In [None]:
check1_3(resample_slopes)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 4
Use your resampled slopes to construct an approximate 95% confidence interval for the true value of the slope.

In [None]:
lower_end = ...
lower_end

upper_end = ...
upper_end

# Please uncomment the code below once you have completed the code above. 
# print("95% confidence interval for slope: [{:g}, {:g}]".format(lower_end, upper_end))

In [None]:
check1_4(lower_end, upper_end)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 5
Based on your confidence interval, would you accept or reject the null hypothesis that the true slope is 0?  Why?  What P-value cutoff are you using? Choose one of the choices below and set the variable `interpret_interval` equal to that value.

1. We would reject the null, since 0 is not within the 95% confidence interval. If we use a 95% confidence interval, we're using a 5% cutoff.
2. We would accept the null, since 0 is within the 95% confidence interval. If we use a 95% confidence interval, we're using a 5% cutoff.
3. We would reject the null, since 0 is not within the 95% confidence interval. If we use a 95% confidence interval, we're using a 2.5% cutoff.

In [None]:
interpret_interval = ...
interpret_interval

In [None]:
check1_5(interpret_interval)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


### Finding the Bootstrap Prediction Interval

Suppose we're tourists at Yellowstone, and we'd like to know how long we'll have to wait for the next Old Faithful eruption.  We decide to use our regression line to make some predictions for the waiting times.  But just as we're uncertain about the slope of the true regression line, we're also uncertain about the predictions we'd make based on the true regression line.
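The idea that a prediction varies from sample to sample can be illustrated with a rough sketch. Everything here is an assumption for illustration: the data are made up, the helper `fitted_at` is hypothetical, and `np.polyfit` stands in for a hand-written line fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: 150 (x, y) pairs scattered around an arbitrary line.
n = 150
x = rng.uniform(1.5, 5.0, size=n)
y = 30 + 10 * x + rng.normal(scale=6, size=n)

def fitted_at(xs, ys, given_x):
    """Height of the least-squares line through (xs, ys) at given_x."""
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope * given_x + intercept

# Refit the line on each bootstrap resample and record its prediction at x = 4.
predictions = np.empty(1000)
for i in range(1000):
    idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
    predictions[i] = fitted_at(x[idx], y[idx], 4.0)

# Middle 95% of the bootstrapped predictions.
lower, upper = np.percentile(predictions, [2.5, 97.5])
print(fitted_at(x, y, 4.0), (lower, upper))
```

Resampling row indices, rather than resampling x and y separately, keeps each x paired with its own y.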

<div class="hide">\pagebreak</div>
#### Question 6
Define the function `fitted_value`.  It should take 2 arguments:

1. A table with 2 columns.  We'll be predicting the values in the second column using the first.
2. A number, the value of the predictor variable for which we'd like to make a prediction.

Make sure to use your `fit_line` function. 

In [None]:
def fitted_value(table, given_x):
    # The staff solution took 4 lines of code.
    ...
    
    

# Here's an example of how fitted_value is used.  This should
# compute the prediction for the wait time of an eruption that lasts
# two minutes.
two_minutes_wait = fitted_value(faithful, 2)
two_minutes_wait

In [None]:
check1_6(two_minutes_wait) 

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 7
The park ranger tells us that the most recent eruption lasted 5 minutes. Using your function above, assign the predicted wait time to the variable `five_minutes_wait`.

In [None]:
five_minutes_wait = ...
five_minutes_wait

In [None]:
check1_7(five_minutes_wait) 

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Juan, a fellow tourist, raises the following objection to your prediction:

> "Your prediction depends on your sample of 272 eruptions.  Couldn't your prediction have been different if you had happened to have a different sample of eruptions?"

Having read [section 14.3](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/14/3/prediction-intervals.html) of the textbook, you know just the response!

<div class="hide">\pagebreak</div>
#### Question 8
Define the function `bootstrap_lines`.  It should take two arguments:
1. A table with two columns.  As usual, we'll be predicting the second column using the first.
2. An integer, a number of bootstraps to run.

It should return a *table* whose first column, `"Slope"`, contains the given number of bootstrapped slopes, and whose second column, `"Intercept"`, contains the corresponding bootstrapped intercepts.  Each slope and intercept should come from a regression line that predicts column 2 from column 1 of a resample of the given table.  The table should have 1 row for each bootstrap replication.

**Hint:** Your code should look very similar to the code you wrote for question 3, with just a few key changes.

In [None]:
def bootstrap_lines(tbl, num_bootstraps):
    ...

    


# When you're done, this code should produce the slopes
# and intercepts of 1000 regression lines computed from
# resamples of the faithful table.
regression_lines = bootstrap_lines(faithful, 1000)
regression_lines

In [None]:
check1_8(regression_lines) 

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 9
Create an array called `predictions_for_five`.  It should contain 1000 numbers.  Each number should be the predicted waiting time after an eruption with a duration of 5 minutes, using a different bootstrapped regression line. Hint: use `regression_lines` from the previous question.

In [None]:
predictions_for_five = ...
predictions_for_five

# This will make a histogram of your predictions:
# Please uncomment the code below once you have completed the code above.
# table_of_predictions = Table().with_column('Predictions at eruptions=5', predictions_for_five)
# table_of_predictions.hist('Predictions at eruptions=5', bins=20)

In [None]:
check1_9(predictions_for_five) 

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 10
Create a 95 percent confidence interval for these predictions.

In [None]:
lower_bound = ...
lower_bound

upper_bound = ...
upper_bound

print('95% Confidence interval for predictions for x=5: (', lower_bound,",", upper_bound, ')')

In [None]:
check1_10(lower_bound, upper_bound)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 11
Look at the scatter plot of the data at the start of this exercise.  Does your confidence interval cover around 95 percent of eruptions in `faithful` that had an eruption duration of 5 minutes? If not, what does this confidence interval mean?

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


## 2. Visual Diagnostics for Linear Regression


### Regression Model Diagnostics
Linear regression isn't always the best way to describe the relationship between two variables. We'd like to develop techniques that will help us decide whether or not to use a linear model to predict one variable based on another.

We will use the insight that if a regression fits a set of points well, then the residuals from that regression line will show no pattern when plotted against the predictor variable. 
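To make this insight concrete, here is a sketch on made-up data comparing a truly linear relationship with a curved one. The helper `residuals_of_line` is hypothetical, and correlating the residuals with `x ** 2` is only one crude numerical stand-in for looking at the plot; it is not a method from the textbook.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=400)

def residuals_of_line(xs, ys):
    """Observed values minus the least-squares line's estimates."""
    slope, intercept = np.polyfit(xs, ys, 1)
    return ys - (slope * xs + intercept)

# Made-up responses: one linear in x, one curved.
y_linear = 2 * x + rng.normal(size=400)
y_curved = x ** 2 + rng.normal(size=400)

res_linear = residuals_of_line(x, y_linear)
res_curved = residuals_of_line(x, y_curved)

# Residuals of a least-squares line always average to 0 and are
# uncorrelated with x, so those checks cannot reveal a bad fit.
# Leftover curvature shows up as correlation with x squared.
curve_signal_linear = np.corrcoef(res_linear, x ** 2)[0, 1]
curve_signal_curved = np.corrcoef(res_curved, x ** 2)[0, 1]
print(curve_signal_linear, curve_signal_curved)
```

In a residual plot, the curved case would appear as a U shape rather than a formless cloud.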

The table below contains information about crime rates and median home values in suburbs of Boston. We will attempt to use linear regression to predict median home value in terms of crime rate.

#### About the dataset
All data are from 1970.  Crime rates are per capita per year; home values are in thousands of dollars.  The crime data come from the FBI, and home values are from the US Census Bureau.  

Run the next cell to load the data and see a scatter plot.

In [None]:
boston = Table.read_table('boston_housing.csv')
boston.scatter('Crime Rate')

<div class="hide">\pagebreak</div>
#### Question 1: Finding Residuals
Write a function called `residuals`.  It should take a single argument, a table.  It should first compute the slope and intercept of the regression line that predicts the second column of that table (accessible as `tbl.column(1)`) using the first column (`tbl.column(0)`).  `residuals` should return an array containing the *residuals* for that regression line. Recall that residuals are given by 

$$\texttt{residual} = \texttt{observed value} - \texttt{regression estimate}.$$

Hint: You may want to use your `fit_line` function from earlier.

In [None]:
def residuals(tbl):
    ...


In [None]:
check2_1(residuals(boston))

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 2:  Plotting a Residual Plot
Make a scatter plot of the residuals for the Boston housing dataset against crime rate. Crime rate should be on the horizontal axis.

In [None]:
...


In [None]:
# DO NOT DELETE THIS CELL


<div class="hide">\pagebreak</div>
#### Question 3: Interpreting the Residual Plot
Does the plot of residuals look roughly like a formless cloud? Or is there some kind of pattern in them? Are they centered around 0? Choose one of the choices below and set the variable `interpret_residual` equal to that value.

1. The residuals **do** look like a formless cloud. They seem to be **high** for towns with very large or very small crime rates, and **low** for intermediate crime rates. They **should** be centered around zero horizontally.
2. The residuals **don't** look like a formless cloud. They seem to be **low** for towns with very large or very small crime rates, and **high** for intermediate crime rates. They **should** be centered around zero horizontally. 
3. The residuals **don't** look like a formless cloud. They seem to be **high** for towns with very large or very small crime rates, and **low** for intermediate crime rates. They **should** be centered around zero horizontally.
4. The residuals **don't** look like a formless cloud. They seem to be **high** for towns with very large or very small crime rates, and **low** for intermediate crime rates. They **shouldn't** be centered around zero horizontally. 

In [None]:
interpret_residual = ...
interpret_residual

In [None]:
check2_3(interpret_residual)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 4: Is Linear Regression a good fit?
Does it seem like a linear model is appropriate for describing the relationship between crime and median home value?

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


[Section 13.6](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/13/6/numerical-diagnostics.html) of the textbook describes some mathematical facts that hold for all regression estimates, regardless of goodness of fit.  One fact is that there is a relationship between the standard deviation of the residuals, the standard deviation of the response variable, and the correlation.  Let us test this.
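The relationship in question is $\text{SD of residuals} = \sqrt{1 - r^2} \cdot \text{SD of }y$. Here is a sketch that checks it numerically on made-up data; the variable names are hypothetical and `np.polyfit` is used as a stand-in for a hand-written line fit.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data, for illustration only.
x = rng.normal(size=300)
y = -1.5 * x + rng.normal(scale=2.0, size=300)

# Direct computation: fit the line, then take the SD of the residuals.
slope, intercept = np.polyfit(x, y, 1)
residual_sd = np.std(y - (slope * x + intercept))

# Computation from the formula, without using the residuals.
r = np.corrcoef(x, y)[0, 1]
residual_sd_from_formula = np.sqrt(1 - r ** 2) * np.std(y)

print(residual_sd, residual_sd_from_formula)
```

The two numbers agree up to floating-point rounding, whether or not a line is a good description of the data.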

Below, we have imported a new table, the Old Faithful data.

In [None]:
old_faithful = Table.read_table('faithful.csv')
old_faithful

The following cell makes a residual plot for this new dataset.

In [None]:
# Please uncomment the code below once you have completed the residuals function
# Table().with_columns('Residual', residuals(old_faithful), 'Duration', old_faithful.column('duration')).scatter('Duration')

<div class="hide">\pagebreak</div>
#### Question 5: Finding the Standard Deviation of Residuals for the Boston dataset
Directly compute the standard deviation of the residuals from the Boston data.  Then compute the same quantity without using the residuals, using the formula described in section 13.6 instead.

In [None]:
boston_residual_sd = ...
boston_residual_sd

boston_residual_sd_from_formula = ...
boston_residual_sd_from_formula

print("Residual SD: {0}".format(boston_residual_sd))
print("Residual SD from the formula: {0}".format(boston_residual_sd_from_formula))

In [None]:
check2_5(boston_residual_sd, boston_residual_sd_from_formula)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


<div class="hide">\pagebreak</div>
#### Question 6: Finding the Standard Deviation of Residuals for the `old_faithful` dataset
Repeat the procedure from Question 5 for the `old_faithful` dataset.

In [None]:
faithful_residual_sd = ...
faithful_residual_sd

faithful_residual_sd_from_formula = ...
faithful_residual_sd_from_formula

print("Residual SD: {0}".format(faithful_residual_sd))
print("Residual SD from the formula: {0}".format(faithful_residual_sd_from_formula))

In [None]:
check2_6(faithful_residual_sd, faithful_residual_sd_from_formula)

In [None]:

#
# AUTOGRADER TEST - DO NOT REMOVE
#


## 3. Submission

To submit your assignment, click the red Submit button above. You may submit as many times as you wish before the deadline. Only your final submission will be graded. No late work will be accepted, so please make sure you submit something before the deadline!

Before you submit, it would be wise to click on the menu item Kernel -> Restart & Run All. That will re-run all your cells from scratch. Take a second look to make sure all your answers are passing the checks. Doing this will help catch any errors in your homework that result from running cells in a strange order.