# Homework 8: Confidence Intervals

**Reading**: 
* [Estimation](https://www.inferentialthinking.com/chapters/13/Estimation.html)

Please complete this notebook by filling in the cells provided. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not re-assign variables anywhere in the notebook!

In [None]:
# Don't change this cell; just run it.   

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Election Polls


During the 2020 election for U.S. Senate representing North Carolina, pollsters asked likely voters whether they had already voted for Cal Cunningham or Thom Tillis. Those who had not yet voted were then asked: "If the election for U.S. Senate representing North Carolina were held today, would you vote for Cal Cunningham or Thom Tillis?" The results of the survey of 660 participants are shown below:

|Responses  | Percentage|
|:------------:|:------------:|
|Cal Cunningham | 46% |
|Thom Tillis | 42% |
|Some other candidate | 5% |
|Not sure / Prefer not to answer | 7% |

We will attempt to estimate the corresponding *parameters*: the percentage of the votes that each response will receive in the entire population (all NC voters). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table `poll` contains the results of the survey.

In [None]:
# Just run this cell
poll = Table.read_table('poll.csv')
poll.group('Response')

**Question 1.** Complete the function `one_resampled_percentage` below. It should return Cal Cunningham's **percentage** of votes after simulating one bootstrap sample of `tbl`.

**Note:** `tbl` will always be in the same format as `poll`.
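
As a rough illustration of what "one bootstrap sample" means, the sketch below resamples a small made-up list of responses with replacement and reports the percentage for one response. It uses only Python's standard library. The names `toy_responses` and `resampled_percentage_of` are hypothetical, and the homework itself expects you to work with the `poll` table instead.

```python
import random

# Hypothetical toy data: 10 responses, NOT the real poll.
toy_responses = ["A"] * 6 + ["B"] * 4

def resampled_percentage_of(responses, label, rng):
    """Percentage of `label` in one bootstrap resample of `responses`."""
    # A bootstrap resample draws with replacement and keeps the original size.
    resample = rng.choices(responses, k=len(responses))
    return 100 * resample.count(label) / len(resample)

rng = random.Random(0)  # seeded only so the sketch is repeatable
one_estimate = resampled_percentage_of(toy_responses, "A", rng)
print(one_estimate)
```

Each call returns a single number between 0 and 100; calling it again generally gives a different value, because the resample is random.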

<!--
BEGIN QUESTION
name: q1_1
manual: false
-->

In [None]:
def one_resampled_percentage(tbl):
    ...

one_resampled_percentage(poll)

**Question 2.** Complete the `percentages_in_resamples` function such that it returns an array of 2,500 bootstrapped estimates of the percentage of voters who will vote for Cal Cunningham. You should use the `one_resampled_percentage` function you wrote above.
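
The general pattern here is "compute one bootstrapped statistic, repeat many times, collect the results". A minimal sketch of that pattern on made-up numeric data follows. It bootstraps a mean rather than a percentage, and `toy_values`, `one_resampled_mean`, and `means_in_resamples` are hypothetical names, not part of the assignment. In the notebook, the skeleton's `make_array()` starting point suggests the same accumulate-in-a-loop structure.

```python
import random

def one_resampled_mean(values, rng):
    """Mean of one bootstrap resample of `values`."""
    resample = rng.choices(values, k=len(values))
    return sum(resample) / len(resample)

def means_in_resamples(values, repetitions, rng):
    """Collect `repetitions` bootstrap estimates of the mean."""
    estimates = []
    for _ in range(repetitions):
        estimates.append(one_resampled_mean(values, rng))
    return estimates

toy_values = [2, 4, 4, 5, 7, 9]  # hypothetical data
rng = random.Random(1)           # seeded only for repeatability
toy_estimates = means_in_resamples(toy_values, 1000, rng)
print(len(toy_estimates))
```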

<!--
BEGIN QUESTION
name: q1_2
manual: false
-->

In [None]:
def percentages_in_resamples():
    percentage_Cal = make_array()
    ...

In the following cell, we run the function you just defined, `percentages_in_resamples`, and create a histogram of the calculated statistic for the 2,500 bootstrap estimates of the percentage of voters who voted for Cal Cunningham. Based on what the original election poll percentages were, does the graph seem reasonable? 

In [None]:
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

**Question 3.** Using the array `resampled_percentages`, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. (Compute the lower and upper ends of the interval, named `Cal_lower_bound` and `Cal_upper_bound`, respectively.)
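
The "middle 95%" leaves 2.5% of the estimates in each tail, so its edges are the 2.5th and 97.5th percentiles. The sketch below shows this on a made-up collection of estimates. `percentile_of` is a hypothetical stand-in written with the standard library (it returns the smallest value at least as large as p% of the values); in the notebook you would use whichever percentile function the course provides.

```python
import math

def percentile_of(p, values):
    """Smallest value that is at least as large as p% of `values`."""
    ordered = sorted(values)
    # 1-based position of the p-th percentile, rounded up.
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

# Hypothetical estimates: the integers 1 through 200.
toy_estimates = list(range(1, 201))

# The middle 95% leaves 2.5% in each tail.
lower = percentile_of(2.5, toy_estimates)
upper = percentile_of(97.5, toy_estimates)
print(lower, upper)  # 5 195
```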

<!--
BEGIN QUESTION
name: q1_3
manual: false
-->

In [None]:
Cal_lower_bound = ...
Cal_upper_bound = ...
print("Bootstrapped 95% confidence interval for the percentage of Cal Cunningham voters in the population: [{:f}, {:f}]".format(Cal_lower_bound, Cal_upper_bound))

**Question 4.** The survey results do not seem to indicate that Cal Cunningham is beating all of the other candidates or responses combined. We would like to use confidence intervals to determine a range of likely values for Cal Cunningham's true lead over all the other candidates or responses combined.

Define the function `one_resampled_difference` that returns **exactly one value** of Cal Cunningham's percentage lead over Thom Tillis, Some other candidate, and Not sure / Prefer not to answer combined from one bootstrap sample of `tbl`.
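
To make "lead over the other responses combined" concrete, here is the arithmetic for the *observed* survey percentages from the table in this section (not a bootstrap sample). The dictionary `observed` is just a hypothetical way of holding those four numbers.

```python
# Percentages reported in the survey table above.
observed = {
    "Cal Cunningham": 46,
    "Thom Tillis": 42,
    "Some other candidate": 5,
    "Not sure / Prefer not to answer": 7,
}

cal = observed["Cal Cunningham"]
# Everything that is not Cal Cunningham, combined.
remaining = sum(pct for name, pct in observed.items() if name != "Cal Cunningham")
observed_lead = cal - remaining
print(cal, remaining, observed_lead)  # 46 54 -8
```

The observed lead is negative, which is why a lead can fall below zero in a resample as well.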

<!--
BEGIN QUESTION
name: q1_4
manual: false
-->

In [None]:
def one_resampled_difference(tbl):
    bootstrap = ...
    Cal_percentage = ...
    Remaining_percentage = ...
    ...

**Question 5.**
Write a function called `leads_in_resamples` that finds 2,500 bootstrapped estimates (the result of calling `one_resampled_difference`) of Cal Cunningham's lead over the other responses combined. Plot a histogram of the resulting samples.

**Note:** Cal Cunningham's lead can be negative.

<!--
BEGIN QUESTION
name: q1_5
manual: true
-->
<!-- EXPORT TO PDF -->

In [None]:
def leads_in_resamples():
    ...

sampled_leads = leads_in_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")

**Question 6.** Use the simulated data from Question 5 to compute an approximate 95% confidence interval for Cal Cunningham's true lead over the other responses combined.

<!--
BEGIN QUESTION
name: q1_6
manual: false
-->

In [None]:
diff_lower_bound = ...
diff_upper_bound = ...
print("Bootstrapped 95% confidence interval for Cal Cunningham's true lead over the other responses combined: [{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))

## 2. Interpreting Confidence Intervals


Dr. McLean computed the following 95% confidence interval for the percentage of Cal Cunningham voters: 

$$[42.3, 49.8]$$

(Your answer may have been a bit different; that doesn't mean it was wrong!)

#### Question 1
Can we say there is a 95% probability that the interval [42.3, 49.8] contains the true percentage of the population that votes for Cal Cunningham for US Senate in North Carolina? Answer "yes" or "no" and explain your reasoning. 

*Note:* ambiguous answers using language like "sometimes" or "maybe" will not receive credit.

<!--
BEGIN QUESTION
name: q2_1
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

#### Question 2

Dr. McLean also created 70%, 90%, and 99% confidence intervals from the same sample, but he forgot to label which interval corresponds to which confidence level! Match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [49.87, 54.0] $\rightarrow$ replace the blank with one of the three confidence levels). **Then**, explain your thought process and how you came up with your answers.

The intervals are below:

* [42.9, 49.2]
* [41.2, 50.9]
* [44.1, 48.2]

<!--
BEGIN QUESTION
name: q2_2
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

#### Question 3
Suppose that we produced 1,200 new samples (each one a uniform random sample of 660 likely voters in North Carolina) from the population and created a 90% confidence interval from each one. Roughly how many of those 1,200 intervals do you expect will actually contain the true percentage of the population?

Assign your answer to `true_percentage_intervals`.
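
One way to build intuition for this kind of question is to simulate it when the truth is known. The sketch below uses deliberately different, made-up numbers (a population that is 40% ones, 100 samples of size 200, and 80% intervals) and counts how many bootstrap intervals contain the true percentage. All names are hypothetical, and `percentile_of` is a standard-library stand-in for a percentile function.

```python
import math
import random

def percentile_of(p, values):
    """Smallest value that is at least as large as p% of `values`."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

def bootstrap_interval(sample, level, repetitions, rng):
    """Middle `level`% of bootstrapped percentages of 1s in `sample`."""
    estimates = []
    for _ in range(repetitions):
        resample = rng.choices(sample, k=len(sample))
        estimates.append(100 * sum(resample) / len(resample))
    tail = (100 - level) / 2
    return percentile_of(tail, estimates), percentile_of(100 - tail, estimates)

rng = random.Random(42)   # seeded only for repeatability
true_percentage = 40      # hypothetical population: 40% ones
num_samples = 100
covered = 0
for _ in range(num_samples):
    # Draw a fresh random sample of 200 from the population.
    sample = [1 if rng.random() < true_percentage / 100 else 0 for _ in range(200)]
    low, high = bootstrap_interval(sample, 80, 300, rng)
    if low <= true_percentage <= high:
        covered += 1
print(covered, "of", num_samples, "intervals contain the true percentage")
```

The count varies from run to run, but it tends to land near the confidence level times the number of samples.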

<!--
BEGIN QUESTION
name: q2_3
manual: false
-->

In [None]:
true_percentage_intervals = ...
true_percentage_intervals

Recall the second bootstrap confidence interval you created, which estimated Cal Cunningham's lead over the other candidates and responses combined. Dr. McLean's 99% confidence interval for the true lead (in the population of all voters) was

$$[-17.2, 2.2]$$

Suppose we are interested in testing a simple yes-or-no question:

> "Is the percentage of votes for Cal Cunningham tied with the percentage of votes for the remaining candidates and responses combined?"

Our null hypothesis is that the percentages are equal, or equivalently, that Cal Cunningham's lead is exactly 0. Our alternative hypothesis is that Cal Cunningham's lead is not equal to 0.  In the questions below, don't compute any confidence interval yourself - use only Dr. McLean's 99% confidence interval.


#### Question 4

Say we use a 5% P-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using Dr. McLean's confidence interval?

Assign `responses_tied` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Dr. McLean's confidence interval

*Hint:* If you're confused, take a look at [this chapter](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html) of the textbook.

<!--
BEGIN QUESTION
name: q2_4
manual: false
-->

In [None]:
responses_tied = ...

#### Question 5
What if, instead, we use a P-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using Dr. McLean's confidence interval?

Assign `cutoff_one_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Dr. McLean's confidence interval

<!--
BEGIN QUESTION
name: q2_5
manual: false
-->

In [None]:
cutoff_one_percent = ...

#### Question 6
What if we use a P-value cutoff of 0.1%? Do we reject the null, fail to reject the null, or are we unable to tell using Dr. McLean's confidence interval?

Assign `cutoff_one_tenth_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using Dr. McLean's confidence interval

<!--
BEGIN QUESTION
name: q2_6
manual: false
-->

In [None]:
cutoff_one_tenth_percent = ...