# Homework 6: Confidence Intervals and Sample Size

Reading: [Chapter 11](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/11/estimation.html) and [Chapter 12](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/12/why-the-mean-matters.html)

### <font color="red">Please run the next cell to import the packages required for the homework.</font>

In [None]:
import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from test import *

## 1. Plot the Vote


Four candidates are running for President of Dataland. A polling company surveys 1000 people selected uniformly at random from among voters in Dataland, and it asks each one who they are planning on voting for. After compiling the results, the polling company releases the following proportions from their sample:

|Candidate  | Proportion|
|:------------:|:------------:|
|Candidate C | 0.47 |
|Candidate T | 0.38 |
|Candidate J | 0.08 |
|Candidate S | 0.03 |
|Undecided   | 0.04 |

These proportions come from a uniform random sample of the population of Dataland. We will attempt to estimate the corresponding *population parameters*: the proportion of all voters in Dataland who plan to vote for each candidate. We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table `votes` contains the results of the survey. Candidates are represented by their initials. Undecided voters are denoted by `U`.

In [None]:
votes = Table().with_column('vote', np.array(['C']*470 + ['T']*380 + ['J']*80 + ['S']*30 + ['U']*40))
num_votes = votes.num_rows
votes.sample()

Below, we have given you code that will use bootstrapped samples from `votes` to compute estimates of the true proportion of voters who are planning on voting for **Candidate C**.

In [None]:
def proportions_in_resamples():
    statistics = make_array()
    for i in np.arange(5000):
        bootstrap = votes.sample()
        sample_statistic = np.count_nonzero(bootstrap.column('vote') == 'C')/num_votes
        statistics = np.append(statistics, sample_statistic)
    return statistics

sampled_proportions = proportions_in_resamples()
Table().with_column('Estimated Proportion Voting for C', sampled_proportions).hist(bins=np.arange(0.2,0.6,0.01))

**Question 1.** Using the array `sampled_proportions`, compute an approximate 95% confidence interval for the true proportion of voters planning on voting for Candidate C.  Compute the lower and upper ends of the interval, named `c_lower_bound` and `c_upper_bound`, respectively.
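
As a reminder of the mechanics, the sketch below finds the middle 90% of a made-up array of simulated statistics. The array `toy_statistics` and the names `toy_lower` and `toy_upper` are hypothetical and unrelated to the poll data. It uses plain NumPy's `np.percentile`, which interpolates between values by default, so its results may differ slightly from the `percentile` function used in the textbook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an array of bootstrap statistics (not the poll data).
toy_statistics = rng.normal(loc=10.0, scale=2.0, size=5000)

# The middle 90% leaves 5% in each tail, so use the 5th and 95th percentiles.
toy_lower = np.percentile(toy_statistics, 5)
toy_upper = np.percentile(toy_statistics, 95)
print(toy_lower, toy_upper)
```

The same idea applies to any confidence level: split the leftover percentage evenly between the two tails.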

In [None]:
c_lower_bound = ...
c_lower_bound

In [None]:
c_upper_bound = ...
c_upper_bound

In [None]:
check1_1(c_lower_bound,c_upper_bound)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


**Question 2.** The survey results seem to indicate that Candidate C is beating Candidate T among voters. We would like to use confidence intervals to determine a range of likely values for her true *lead*. Candidate C's lead over Candidate T is:

$$\text{Candidate C's proportion of the vote} - \text{Candidate T's proportion of the vote}.$$

Using the function `proportions_in_resamples` above as a model, use the bootstrap to compute an approximate distribution for Candidate C's lead over Candidate T. Resample and estimate 5000 times. Plot a histogram of the resulting estimates. Compute a 95% confidence interval for the lead.

In [None]:
bins = np.arange(-0.2,0.2,0.01)

def leads_in_resamples():
    statistics = make_array()
    ...
    return statistics


sampled_leads = leads_in_resamples()
...


In [None]:
lead_lower_bound = ...
lead_lower_bound

In [None]:
lead_upper_bound = ...
lead_upper_bound

In [None]:
check1_2(lead_lower_bound,lead_upper_bound)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## 2. Interpreting Confidence Intervals


The course staff computed the following 95% confidence interval for the proportion of Candidate C voters: 

$$[.439, .5]$$


(Your answer might have been a little different; that doesn't mean you were wrong.)

#### Question 1

Only one of the following statements about how to interpret confidence intervals is true.  Assign `ci_interpretation` to the integer corresponding to the true statement.

1. About 95% of the population is in the interval $[.439, .5]$.
2. About 95% of the sample is in the interval $[.439, .5]$.
3. If we repeatedly sampled voters, bootstrapped, and computed a new confidence interval, about 95% of the repetitions would produce the interval $[.439, .5]$.
4. There is about a 95% probability that the interval $[.439, .5]$ contains the true proportion of the population that is voting for Candidate C.
5. If we repeatedly sampled voters, bootstrapped, and computed a new confidence interval, about 95% of the repetitions would produce an interval containing the true proportion of the population that is voting for candidate C.

In [None]:
ci_interpretation = ...


In [None]:
check2_1(ci_interpretation)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 2
Suppose we had the resources to do a lot of polling of voters.  We collect 10,000 new samples (each one a uniform random sample of 1,000 voters), then bootstrap each sample to create a 99% confidence interval for the proportion of voters that are voting for Candidate C.  Roughly how many of those 10,000 intervals will actually contain the true proportion of the population that is voting for Candidate C?  Assign `new_samples_containing_true_proportion` to your answer.

1. 5000
2. 9500
3. 9900
4. It's impossible to say

In [None]:
new_samples_containing_true_proportion = ...


In [None]:
check2_2(new_samples_containing_true_proportion)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


**Question 3**

The staff also created 80%, 90%, and 99% confidence intervals from one sample, and those intervals appear below but not necessarily in that order. Match each interval to its confidence level. Assign `intervals` to your answer, which should be a list containing the numbers .8, .9, and .99.  The first number in the list should be the confidence for interval 1, the second for interval 2, and the third for interval 3.

* Interval 1: $[.444,.495]$

* Interval 2: $[.450,.490]$

* Interval 3: $[.430,.511]$

In [None]:
intervals = [..., ..., ...]


In [None]:
check2_3(intervals)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Recall the second bootstrap confidence interval you created, estimating Candidate C's lead over Candidate T. Among voters in the sample, her lead was .09. The staff's 95% confidence interval for her true lead (in the population of all voters) was

$$[.032,.15].$$

Suppose we are interested in testing a simple yes-or-no question:

> "Are the candidates tied?"

Our null hypothesis is that the proportions are equal, or, equivalently, that Candidate C's lead is exactly 0. Our alternative hypothesis is that her lead is not equal to 0.  In the questions below, don't compute any confidence interval yourself - use only the staff's 95% confidence interval.

#### Question 4

Suppose we use a 5% P-value cutoff.  Do we reject the null, fail to reject the null, or are we unable to tell using our confidence interval? Choose one of the choices below and set the variable `answer` equal to that value.

1. Reject the null. 
2. Fail to reject the null.
3. Unable to tell using your confidence interval: [0.032, 0.15]. 

In [None]:
answer = ...
answer 

In [None]:
check2_4(answer)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 5

What if, instead, we use a P-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using our confidence interval? Choose one of the choices below and set the variable `answer` equal to that value.

1. Reject the null. 
2. Fail to reject the null.
3. Unable to tell using your confidence interval [0.032, 0.15]. 

In [None]:
answer = ...
answer 

In [None]:
check2_5(answer)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 6
What if we use a P-value cutoff of 10%? Do we reject the null, fail to reject the null, or are we unable to tell using our confidence interval? Choose one of the choices below and set the variable `answer` equal to that value.

1. Reject the null. 
2. Fail to reject the null.
3. Unable to tell using your confidence interval [0.032, 0.15].

In [None]:
answer = ...
answer 

In [None]:
check2_6(answer)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## 3. Grouped Means


Suppose you'd like to know about the ages of people in a small town.  The local government collects this data about everyone in the town, but to ensure that you don't see any individual's age, it only makes public the number of people of each age.  (This could have been done by calling `group` on the original data table.)  So the first few rows of the dataset look something like this:

In [None]:
ages =  Table().with_columns('age', [0, 1, 2, 3, 5, 6], 'count', [2, 5, 1, 4, 10, 1])
ages

In this example, there were 2 people age 0, 5 people age 1, etc. Nobody is age 4.

#### Question 1
After you get the data, you first want to compute the mean age of the people in the town.

Write a function called `grouped_mean`.  It should take as its argument a table like the one above, except that the columns might have different names.  It should return the mean of the numbers in the dataset, assuming the first column contains the numbers themselves and the second column contains the count of each number, as in the example.

*Remember:* Even if you don't know the column name for the first column, you can access it by saying `tbl.column(0)`.

In [None]:
def grouped_mean(t):
    # Replace "return -1" with your own code.
    return -1


grouped_mean(ages)

In [None]:
check3_1(grouped_mean)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 2
Next, you want to summarize how spread out the ages are, so you decide to compute their standard deviation.

Write a function called `grouped_std`.  It should take as its argument a table like the one above, except that the columns might have different names.  It should return the standard deviation of the numbers in the dataset, assuming the first column contains the numbers and the second column contains the count of each number, as in the example.

*Hint:* 
1. Find the squared deviation from the mean for each item.
2. Then the standard deviation is the square root of the mean of that list of squared deviations.

You've already written a function that computes means of grouped numbers, so that should be useful.

__If you need a further hint, here is an example of how to calculate the standard deviation of an array of numbers:__

Suppose we want to find the standard deviation of an array `numbers`.

    numbers = make_array(5, 7, 1, 2, 6)

First, calculate the mean of the array.

    mean_of_numbers = np.mean(numbers)
    
Then compute the squared deviations from the mean for each element in `numbers`.

    squared_deviations = (numbers - mean_of_numbers) ** 2

Then the standard deviation is simply the square root of the mean of that list.

    sd = np.sqrt(np.mean(squared_deviations))

In [None]:
def grouped_std(t):
    # Replace "return -1" with your own code.
    return -1



In [None]:
check3_2(grouped_std)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<div class="hide">\pagebreak</div>
If you're uncertain whether your code for the previous question is correct, you should write some test cases.  Testing your own code on simple cases is an important skill.  Let's practice that.

The built-in NumPy function `np.std` computes the standard deviation of an array of numbers.  It doesn't work for grouped data, so you couldn't have just used it in your answer to question 2!  But we can use it to check `grouped_std` by manually un-grouping some small datasets (duplicating each number once for each count, and putting the duplicated numbers into an array) and calling `np.std` on the result.

|age|count|
|-|-|
|10|1|
|15|2|

$$\longleftrightarrow$$

$$\verb|make_array(10, 15, 15)|$$
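
For larger tables, typing out the duplicates by hand gets tedious. One possible shortcut, sketched here on made-up numbers, is NumPy's `np.repeat`. The names `toy_values`, `toy_counts`, and `toy_ungrouped` are hypothetical and are not part of the assignment.

```python
import numpy as np

# Hypothetical grouped data: the value 1 appears 3 times, the value 4 appears 2 times.
toy_values = np.array([1, 4])
toy_counts = np.array([3, 2])

# np.repeat copies each value as many times as its matching count.
toy_ungrouped = np.repeat(toy_values, toy_counts)
print(toy_ungrouped)  # [1 1 1 4 4]
```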

#### Question 3
For each of the two tables in the following **two** cells, create an array representing the original (un-grouped) dataset it came from, and then use it to verify that `grouped_std` computes the right answer on that table.  We've done most of the first one for you.

In [None]:
example_0 = Table().with_columns(
    "age", make_array(10, 15),
    "count", make_array(1, 2))
grouped_std_0 = grouped_std(example_0)
example_0_ungrouped = make_array(10, 15, 15)

# The standard deviation of example_0_ungrouped, according to NumPy:
numpy_std_0 = ...
numpy_std_0

print("NumPy answer:\t", numpy_std_0, "\nYour answer:\t", grouped_std_0)

In [None]:
example_1 = Table().with_columns(
    "age", make_array(10, 15, 20, 25),
    "count", make_array(1, 2, 3, 0))

# Fill in the rest of the code, as above, so that the last line
# prints out the results of the test.

grouped_std_1 = ...
grouped_std_1

numpy_std_1 = ...
numpy_std_1

print("NumPy answer:\t", numpy_std_1, "\nYour answer:\t", grouped_std_1)

In [None]:
check3_3(grouped_std_0, numpy_std_0, numpy_std_1, grouped_std_1)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<div class="hide">\pagebreak</div>
If your results are different, that means there's an error in your `grouped_std` function (or your `grouped_mean` function).  Go back and fix it!  Each time you make a change, you can rerun the tests you've written to see if you've gotten it right.

## 4. Testing the Central Limit Theorem


The Central Limit Theorem tells us that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*.

That's a pretty big claim, but the theorem doesn't stop there. It further states that the standard deviation of this normal distribution is given by $$\frac{\texttt{sd of the original distribution}}{\sqrt{\texttt{sample size}}}$$ In other words, suppose we start with *any distribution* that has standard deviation $x$, take a sample of size $n$ (where $n$ is a large number) from that distribution with replacement, and compute the mean of that sample. If we repeat this procedure many times, then those sample means will have a normal distribution with standard deviation $\frac{x}{\sqrt{n}}$.

That's an even bigger claim than the first one! The proof of the theorem is beyond the scope of this class, but in this exercise, we will be exploring some data to see the CLT in action.
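
Before working with the homework data, here is a quick sanity check of the SD claim on a made-up population. This is only a sketch: it draws independent values from an exponential distribution with SD 1 (which plays the role of sampling with replacement from a very large, clearly non-normal population), and the names `samples`, `predicted_sd`, and `observed_sd` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100             # size of each sample
repetitions = 5000  # number of samples drawn

# Each row is one sample of size n from an exponential distribution with SD 1.
samples = rng.exponential(scale=1.0, size=(repetitions, n))
sample_means = samples.mean(axis=1)

predicted_sd = 1.0 / np.sqrt(n)     # population SD / sqrt(sample size)
observed_sd = np.std(sample_means)  # SD of the simulated sample means
print(predicted_sd, observed_sd)
```

The two printed numbers should be close but not identical, since the observed SD comes from a finite number of random samples.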

**Question 1.** The CLT only applies when sample sizes are "sufficiently large." This isn't a very precise statement. Is 10 large?  How about 50?  The truth is that it depends both on the original population distribution and just how "normal" you want the result to look. Let's use a simulation to get a feel for how the distribution of the sample mean changes as sample size goes up.

Consider a coin flip. If we say `Heads` is 1 and `Tails` is 0, then there's a 50% chance of getting a 1 and a 50% chance of getting a 0, which is definitely not a normal distribution.  The average of several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply if we compute the sample proportion of heads many times.

Write a function called `sample_size_n` that takes in a sample size $n$. It should return an array that contains 5000 sample proportions of heads, each from $n$ coin flips.

In [None]:
def sample_size_n(n):
    coin = make_array(0, 1)
    sample_proportions = make_array()
    for _ in np.arange(5000):
        # An array of the results of n coin flips (0s and 1s):
        flips = ...
        sample_proportion = ...
        sample_proportions = ...
    return ...



sample_size_n(5)

In [None]:
check4_1(sample_size_n)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<div class="hide">\pagebreak</div>
The code below will use the function you just defined to plot the empirical distribution of the sample mean for various sample sizes. Drag the slider or click on the number to the right to type in a sample size of your choice. The x- and y-scales are kept the same to facilitate comparisons. 

In [None]:
# Just run this cell
from ipywidgets import interact

def outer(f):
    def graph(x):
        bins = np.arange(-0.01,1.05,0.02)
        sample_proportions = f(x)
        Table().with_column('Sample Size: {}'.format(x), sample_proportions).hist(bins=bins)
        plots.ylim(0, 30)
        print('Sample SD:', np.std(sample_proportions))
        plots.show()
    return graph
    
# Please uncomment the code once you have implemented the function "sample_size_n"
# interact(outer(sample_size_n), x=(0, 400, 1), continuous_update=False);

# Notice the shape of the graph as the sample size increases and decreases.
# Min sample size is 0, max is 400
# The graph will refresh a few times when you drag the slider around

You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.  A sample of 50 items looks quite bell-shaped.

**Question 2:** In the plot for a sample size of 10, why are the bars spaced at intervals of .1, with gaps in between?

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


<div class="hide">\pagebreak</div>
Now we will test the second claim of the CLT: that the SD of the sample mean is the SD of the original distribution, divided by the square root of the sample size.

We have imported the flight delay data and computed its standard deviation for you.

In [None]:
united = Table.read_table('united_summer2015.csv')
united_std = np.std(united.column('Delay'))
united_std

**Question 3:** Write a function called `predict_sd`.  It takes a sample size `n` (a number) as its argument.  It returns the predicted standard deviation of the mean delay time for samples of size `n` from the flight delays (represented in the table `united`).

In [None]:
def predict_sd(n):
    # Replace "return -1" with your own code
    return -1


predict_sd(10)

In [None]:
check4_3(predict_sd)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


**Question 4:** Write a function called `empirical_sd` that takes a sample size `n` as its argument. The function should draw 500 samples of size `n`, each with replacement, from the flight delays dataset, and it should return the standard deviation of the means of those 500 samples.

*Hint:* This function will be similar to the `sample_size_n` function you wrote earlier.

In [None]:
def empirical_sd(n):
    sample_means = make_array()
    for i in np.arange(500):
        sample = ...
        sample_mean = ...
        sample_means = ...
    return ...


empirical_sd(10)

In [None]:
check4_4(empirical_sd)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<div class="hide">\pagebreak</div>
The cell below will plot the predicted and empirical SDs for the delay data for various sample sizes.

In [None]:
# Please uncomment the code below once the previous functions have been implemented. 

# sd_table = Table().with_column('Sample Size', np.arange(1,101))
# predicted = sd_table.apply(predict_sd, 'Sample Size')
# empirical = sd_table.apply(empirical_sd, 'Sample Size')
# sd_table = sd_table.with_columns('Predicted SD', predicted, 'Empirical SD', empirical)
# sd_table.scatter('Sample Size')

**Question 5:** The empirical SDs are very close to the predicted SDs, but they're not exactly the same. Explain the reason for the similarities and the reason for the differences.

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


## 5. Polling and the Normal Distribution


Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which would mandate labeling of all horizontal or vertical axes), called Yes on 68.  They want to know how many people will vote for the proposition.

Michelle polls a uniform random sample of all voters, and she finds that 210 of the 400 sampled voters will vote in favor of the proposition.

In [None]:
sample = Table().with_columns(
    "Vote",  make_array("Yes", "No"),
    "Count", make_array(210,   190))
sample_size = sum(sample.column("Count"))
sample_proportions = sample.with_column(
    "Proportion", sample.column("Count") / sample_size)
sample_proportions

She uses 10,000 bootstrap resamples to compute a confidence interval for the proportion of all  voters who will vote Yes.  Run the next cell to see the empirical distribution of Yes proportions in the 10,000 resamples.

In [None]:
resample_yes_proportions = make_array()
for i in np.arange(10000):
    resample = proportions_from_distribution(sample_proportions, "Proportion", sample_size)
    resample_yes_proportions = np.append(resample_yes_proportions, resample.column("Random Sample").item(0))
Table().with_column("Resample Yes proportion", resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))

#### Question 1
How does the Central Limit Theorem (CLT) explain the distribution we see above?

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


<div class="hide">\pagebreak</div>
In a population whose members are 0 and 1, there is a simple formula for the standard deviation of that population:

$$\texttt{standard deviation} = \sqrt{(\text{proportion of 0s}) \times (\text{proportion of 1s})}$$

(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise for those who enjoy algebra.)
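
If you'd rather check the formula numerically than derive it, the sketch below compares it with `np.std` on a made-up population of 30 zeros and 70 ones. The array `toy_population` and the other names are hypothetical and unrelated to the poll.

```python
import numpy as np

# Hypothetical population: 30 zeros and 70 ones.
toy_population = np.array([0] * 30 + [1] * 70)

# For 0/1 data, the mean is the proportion of 1s.
proportion_of_1s = np.mean(toy_population)
proportion_of_0s = 1 - proportion_of_1s

sd_from_formula = np.sqrt(proportion_of_0s * proportion_of_1s)
sd_from_numpy = np.std(toy_population)
print(sd_from_formula, sd_from_numpy)
```

The two values agree up to floating-point rounding.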

#### Question 2
Using only the CLT and the numbers of Yes and No voters in our sample of 400, compute (_algebraically_) a number `approximate_sd` that's the predicted standard deviation of the array `resample_yes_proportions` according to the Central Limit Theorem. **Do not access the data in `resample_yes_proportions` in any way.** 

$\text{SD of sample means} = \frac{\text{population SD}}{\sqrt{\text{sample size}}}$

Remember that if we do not know the population SD, we can use the sample SD as a reasonable approximation in its place.

In [None]:
approximate_sd = ...
approximate_sd

In [None]:
check5_2(approximate_sd)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 3
Compute the standard deviation of the array `resample_yes_proportions` to verify that your answer to question 2 is approximately right.

In [None]:
exact_sd = ...
exact_sd

In [None]:
check5_3(exact_sd)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 4
**Again, without accessing `resample_yes_proportions` in any way**, compute an approximate 95% confidence interval for the proportion of Yes voters in California.

The plotting cell further below (once uncommented) draws your interval as a red bar beneath the histogram of `resample_yes_proportions`; use that to verify that your answer looks right.

In [None]:
lower_limit = ...
lower_limit

In [None]:
upper_limit = ...
upper_limit

In [None]:
check5_4(lower_limit, upper_limit)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# Uncomment and run the code below to plot your confidence interval.

# Table().with_column("Resample Yes proportion", resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))
# plots.plot(make_array(lower_limit, upper_limit), make_array(0, 0), c='r', lw=10);

<div class="hide">\pagebreak</div>
Your confidence interval should overlap the number 0.5.  That means we can't be very sure whether Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.

The Yes on 68 campaign really needs to know whether they're winning.  It's impossible to be absolutely sure without polling the whole population, but they'd be okay if the standard deviation of the sample mean were only 0.005.  They ask Michelle to run a new poll with a sample size that's large enough to achieve that.  (Polling is expensive, so the sample also shouldn't be bigger than necessary.)

Michelle consults Chapter 12 of your textbook.  Instead of making the conservative assumption that the population standard deviation is 0.5 (coding Yes voters as 1 and No voters as 0), she decides to assume that it's equal to the standard deviation of the sample,

$$\sqrt{(\text{Yes proportion in the sample}) \times (\text{No proportion in the sample})}.$$

Under that assumption, Michelle decides that a sample of 9,975 would suffice.

#### Question 5
How did Michelle arrive at that answer? Please be clear in your steps/explanations.

*Write your answer here, replacing this text.*

In [None]:
# DO NOT DELETE THIS CELL


## 6. Submission


To submit your assignment, click the blue Submit button above. You may submit as many times as you wish before the deadline. Only your final submission will be graded. No late work will be accepted, so please make sure you submit something before the deadline!

Before you submit, it would be wise to click on the menu item Kernel -> Restart & Run All. That will re-run all your cells from scratch. Take a second look to make sure all your answers are passing the checks. Doing this will help catch any errors in your homework that result from running cells in a strange order.