# Homework 9: Central Limit Theorem

**Reading**: 
* [Why the mean matters](https://inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html)


**Deadline:** This assignment is due **Sunday, April 17th at 11pm.** 

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you're stuck. The drop-in office hours schedule can be found on [Canvas](https://yale.instructure.com/courses/76050). You can also post questions or start discussions on [Ed Discussion](https://edstem.org/us/courses/18969/discussion/).

When you assign data to a name in a cell that contains information relevant to answering a question, please be sure to "show your work" by printing the value stored in the name (i.e., put the name by itself on a line so that the value in it is printed). 


**Submission:**

Submit your assignment as a .pdf on Gradescope. You can access Gradescope through Canvas on the left side of the class home page. The problems in each homework assignment are numbered. When submitting on Gradescope, please select the pages of your pdf that correspond to each problem. Failure to mark pages correctly will result in points being deducted from your homework score.

**Note**: Please carefully read over all your answers in the .pdf document you create prior to uploading it to Gradescope to make sure all your answers are readable and are not cut off. Points will be taken off for any answer that cannot be read, and it will not be possible to resubmit homework answers. Adding extra space and new lines can help prevent answers from being cut off. If you run into problems, please post questions to Ed Discussion or come to office hours for help.


In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Testing the Central Limit Theorem


The Central Limit Theorem tells us that the probability distribution of the **sum** or **average** of a large random sample drawn with replacement will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*.

That's a pretty big claim, but the theorem doesn't stop there. It further states that the standard deviation of this normal distribution is given by 

$$\frac{\texttt{sd of the original distribution}}{\sqrt{\texttt{sample size}}}$$ 

In other words, suppose we start with *any distribution* that has standard deviation $\sigma$, take a sample of size $n$ (where $n$ is a large number) from that distribution with replacement, and compute the **mean** of that sample. If we repeat this procedure many times, then those sample means will have a roughly normal distribution with standard deviation $\frac{\sigma}{\sqrt{n}}$.

That's an even bigger claim than the first one! The proof of the theorem is beyond the scope of this class, but in this exercise, we will be exploring some data to see the CLT in action.

**Note**: The "standard deviation of sample means" (i.e., $\frac{\sigma}{\sqrt{n}}$) is called the "standard error" and is denoted SE. The textbook does not use this terminology, but it is the common term used by statisticians. Using this term makes it easier to distinguish between the standard deviation of the population ($\sigma$) and the standard deviation of sample means (SE).
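To make the $\frac{\sigma}{\sqrt{n}}$ claim concrete, here is a minimal sketch that uses plain NumPy rather than the `datascience` library. The population (an exponential distribution with $\sigma = 1$), the sample size, the seed, and all variable names are assumptions chosen for illustration; they are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 1, so sigma = 1.
sigma = 1.0
n = 100              # size of each sample
repetitions = 5000   # number of samples drawn

# Each row is one sample of size n; each row mean is one sample mean.
samples = rng.exponential(scale=1.0, size=(repetitions, n))
sample_means = samples.mean(axis=1)

se_from_simulation = np.std(sample_means)   # SD of the simulated sample means
se_from_formula = sigma / np.sqrt(n)        # the CLT's prediction
print('SE from simulation:', se_from_simulation)
print('SE from formula:   ', se_from_formula)
```

The two printed values should be close, even though the population itself is far from bell-shaped.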

**Question 1.1.** The CLT only applies when sample sizes are "sufficiently large." This isn't a very precise statement. Is 10 large?  How about 50?  The truth is that it depends both on the original population distribution and just how "normal" you want the result to look. Let's use a simulation to get a feel for how the distribution of the sample mean changes as sample size goes up.

Consider a coin flip. If we say `Heads` is $1$ and `Tails` is $0$, then there's a 50% chance of getting a 1 and a 50% chance of getting a 0, a distribution that looks nothing like a normal curve. The average of several coin tosses coded this way is equal to the proportion of heads in those tosses, so the CLT should apply if we compute the sample proportion of heads many times.
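As a quick check of the claim that the mean of 0/1 tosses equals the proportion of heads, consider this small sketch. The array of tosses is made up for illustration.

```python
import numpy as np

# Hypothetical record of 8 tosses: 1 = Heads, 0 = Tails.
tosses = np.array([1, 0, 0, 1, 1, 0, 1, 1])

mean_of_tosses = np.mean(tosses)
proportion_heads = np.count_nonzero(tosses) / len(tosses)
print(mean_of_tosses, proportion_heads)
```

Both numbers are the same because adding up the 1s counts the heads, and dividing by the number of tosses turns that count into a proportion.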

Write a function called `sample_size_n` that takes in a sample size $n$. It should return an array that contains 5000 sample proportions of heads, each from $n$ coin flips.

In [None]:
def sample_size_n(n):
    coin_proportions = make_array(.5, .5) # our coin is fair
    heads_proportions = make_array()
    for i in np.arange(5000):
        simulated_proportions = ...
        prop_heads = ...
        heads_proportions = ...
    return heads_proportions


sample_size_n(5)

<div class="hide">\pagebreak</div>
The code below will use the function you just defined to plot the empirical distribution of the sample mean for various sample sizes. Drag the slider or click on the number to the right to type in a sample size of your choice. The x- and y-scales are kept the same to facilitate comparisons. Notice the shape of the graph as the sample size increases and decreases.

In [None]:
# Just run this cell
from ipywidgets import interact

def outer(f):
    def graph(x):
        bins = np.arange(-0.01,1.05,0.02)
        sample_props = f(x)
        Table().with_column('Sample Size: {}'.format(x), sample_props).hist(bins=bins)
        plt.ylim(0, 30)
        print('Sample SD:', np.std(sample_props))
        plt.show()
    return graph
    
interact(outer(sample_size_n), x=(1, 400, 1), continuous_update=False);

# Min sample size is 1 (a sample of size 0 has no proportion of heads), max is 400
# The graph will refresh a few times when you drag the slider around

You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.  A sample of 50 items looks quite bell-shaped.

**Question 1.2.** In the plot for a sample size of 10, why are the bars spaced at intervals of .1, with gaps in between?

*Write your answer here, replacing this text.*

<div class="hide">\pagebreak</div>
Now we will test the second claim of the CLT: that the SD of the sample mean (i.e., the SE) is equal to the SD of the original distribution, divided by the square root of the sample size.

We have imported the flight delay data and computed its standard deviation ($\sigma$) for you.

In [None]:
united = Table.read_table('united_summer2015.csv')
united_std = np.std(united.column('Delay'))
united_std

**Question 1.3.** Write a function called `empirical_se` that takes a sample size `n` as its argument. The function should simulate 500 samples with replacement of size `n` from the flight delays dataset. For each sample, it should calculate the mean flight delay. The function should then return the standard deviation of the **means of those 500 samples** (i.e., it should return the empirical standard error).

*Hint:* This function will be similar to the `sample_size_n` function you wrote earlier.

In [None]:
def empirical_se(n):
    sample_means = make_array()
    for i in np.arange(500):
        sample = ...
        sample_mean = ...
        sample_means = ...
    return np.std(sample_means)

empirical_se(10)

**Question 1.4.** Now, write a function called `predicted_se` that finds the predicted standard deviation of the sample means (i.e., the predicted SE), using the relationship between the SE and sample size that is discussed [here](https://inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html) in the textbook. It takes a sample size `n` (a number) as its argument. It returns the predicted value of the standard deviation of the mean delay time for samples of size `n` from the flight delays (represented in the table `united`).

In [None]:
def predicted_se(n):
    ...

predicted_se(10)

<div class="hide">\pagebreak</div>
The cell below will plot the predicted and empirical SEs for the delay data for various sample sizes.

In [None]:
se_table = Table().with_column('Sample Size', np.arange(1,101))
predicted = se_table.apply(predicted_se, 'Sample Size')
empirical = se_table.apply(empirical_se, 'Sample Size')
se_table = se_table.with_columns('Predicted SE', predicted, 'Empirical SE', empirical)
se_table.scatter('Sample Size')
plt.ylabel("Standard error, SE")

**Question 1.5.** Do our predicted and empirical SE values match? 

*Write your answer here, replacing this text.*

## 2. Polling and the Normal Distribution


**Question 2.1.** Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which would mandate labeling of all horizontal or vertical axes), called Yes on 68.  They want to know how many Californians will vote for the proposition.

Michelle polls a uniform random sample of all California voters, and she finds that 210 of the 400 sampled voters will vote in favor of the proposition. Fill in the code below to form a table with 3 columns: the first two columns should be identical to `sample`. The third column should be named `Proportion` and have the proportion of sampled voters that chose each option.

In [None]:
sample = Table().with_columns(
    "Vote",  make_array("Yes", "No"),
    "Count", make_array(210,   190))
sample_size = ...
sample_with_proportions = ...
sample_with_proportions

**Question 2.2.** She then wants to use 10,000 bootstrap resamples to compute a confidence interval for the proportion of all California voters who will vote Yes. Fill in the next cell to simulate an empirical distribution of Yes proportions with 10,000 resamples. In other words, use bootstrap resampling to simulate 10,000 election outcomes, and populate `resample_yes_proportions` with the Yes proportion of each bootstrap resample. Then, visualize `resample_yes_proportions` with a histogram. You should see a bell-shaped curve centered near the proportion of Yes in the original sample.

In [None]:
resample_yes_proportions = make_array()
for i in np.arange(10000):
    resample = ...
    resample_yes_proportions = ...
Table().with_column("Resample Yes proportion", resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))

**Question 2.3.** Why does the Central Limit Theorem (CLT) apply in this situation, and how does it explain the distribution we see above?

*Write your answer here, replacing this text.*

<div class="hide">\pagebreak</div>
In a population whose members are 0 and 1, there is a simple formula for the standard deviation ($\sigma$) of that population:

$$\texttt{standard deviation} = \sqrt{(\text{proportion of 0s}) \times (\text{proportion of 1s})}$$

(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise for those who enjoy algebra.)
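Without working through the algebra, the formula can be checked numerically. The sketch below uses a made-up population of 30 ones and 70 zeros (not the poll data) and compares the SD computed directly with the SD given by the formula.

```python
import numpy as np

# Hypothetical population: 30 ones and 70 zeros.
population = np.array([1] * 30 + [0] * 70)

proportion_of_ones = np.mean(population)
proportion_of_zeros = 1 - proportion_of_ones

sd_direct = np.std(population)                                   # from the definition
sd_formula = np.sqrt(proportion_of_zeros * proportion_of_ones)   # from the formula
print(sd_direct, sd_formula)
```

The two values agree up to floating-point rounding.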

**Question 2.4.** Using only the CLT and the numbers of Yes and No voters in our sample of 400, compute (*algebraically*) a number `approximate_se` that's the predicted standard deviation of the array `resample_yes_proportions` according to the Central Limit Theorem. **Do not access the data in `resample_yes_proportions` in any way.** Remember that a predicted standard deviation of the sample means (i.e., the predicted standard error, SE) can be computed from the population SD and the size of the sample. 

Also remember that if we do not know the population SD, we can use the sample SD as a reasonable approximation in its place.

In [None]:
approximate_se = ...
approximate_se

**Question 2.5.** Compute the SD of the array `resample_yes_proportions`, which will act as an approximation to the true SE of the possible sample proportions (i.e., you should use `resample_yes_proportions` here). This will help verify whether your answer to Question 2.4 is approximately correct.


In [None]:
resampled_se = ...
resampled_se

**Question 2.6.** **Again, without accessing `resample_yes_proportions` in any way**, compute an approximate 95% confidence interval for the proportion of Yes voters in California using the Central Limit Theorem and `approximate_se`.

The cell below draws your interval as a red bar below the histogram of `resample_yes_proportions`; use that to verify that your answer looks right.

*Hint:* How many SDs correspond to 95% of the distribution promised by the CLT? Recall the discussion in the textbook [here](https://inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html).
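For reference, the area under a normal curve within a given number of SDs of the mean can be computed with the standard library alone. This is a sketch; the helper name `area_within` is an assumption introduced here, not something from the textbook or the `datascience` library.

```python
import math

def area_within(z):
    """Area under the standard normal curve between -z and z SDs."""
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, 'SDs:', round(area_within(z), 4))
```

Comparing the printed areas with 95% shows which whole number of SDs is the usual choice for an approximate 95% interval.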

In [None]:
lower_limit = ...
upper_limit = ...
print('lower:', lower_limit, 'upper:', upper_limit)

In [None]:
# Run this cell to plot your confidence interval.
Table().with_column("Resample Yes proportion", resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))
plt.plot(make_array(lower_limit, upper_limit), make_array(0, 0), c='r', lw=10);

<div class="hide">\pagebreak</div>
Your confidence interval should include the number 0.5. That means we can't be very sure whether Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.

The Yes on 68 campaign really needs to know whether they're winning.  It's impossible to be absolutely sure without polling the whole population, but they'd be okay if the standard error was only 0.005.  They ask Michelle to run a new poll with a sample size that's large enough to achieve that.  (Polling is expensive, so the sample also shouldn't be bigger than necessary.)

Michelle consults Chapter 14 of your textbook.  Instead of making the conservative assumption that the population standard deviation is 0.5 (coding Yes voters as 1 and No voters as 0), she decides to assume that it's equal to the standard deviation of the sample,

$$\sqrt{(\text{Yes proportion in the sample}) \times (\text{No proportion in the sample})}.$$

Under that assumption, Michelle decides that a sample of 9,975 would suffice.
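For comparison, here is a sketch of the same reasoning under the conservative assumption that the population SD is 0.5. It rearranges $\text{SE} = \frac{\sigma}{\sqrt{n}}$ into $n = \left(\frac{\sigma}{\text{SE}}\right)^2$. The function names and the sample sizes in the loop are hypothetical, chosen only for illustration.

```python
import math

def se_for_sample_size(population_sd, n):
    """Predicted SE of a sample mean for samples of size n."""
    return population_sd / math.sqrt(n)

def sample_size_for_se(population_sd, target_se):
    """Sample size at which the predicted SE equals target_se."""
    return (population_sd / target_se) ** 2

conservative_sd = 0.5  # largest possible SD of a 0/1 population

# Quadrupling the sample size halves the predicted SE.
for n in (400, 1600, 6400):
    print(n, se_for_sample_size(conservative_sd, n))

# Sample size needed for an SE of 0.005 under the conservative assumption.
print(sample_size_for_se(conservative_sd, 0.005))
```

Under the conservative assumption the required sample size comes out at about 10,000, slightly more than Michelle's 9,975, because her assumed SD is slightly below 0.5.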

**Question 2.7.** Does Michelle's sample size achieve the desired standard error? What standard error would you achieve with a smaller sample size? A larger sample size? To explore this, first compute the standard error obtained by using Michelle's sample size.

In [None]:
estimated_population_sd = ...
michelle_sample_size = ...
michelle_se = ...
print("With Michelle's sample size, you would predict an SE of %f." % michelle_se)

Then, compute the SE that you would get from a smaller sample size. Ideally, you should pick a number that is significantly smaller, but any sample size smaller than Michelle's will do. 

In [None]:
smaller_sample_size = ...
smaller_se = ...
print("With this smaller sample size, you would predict an SE of %f" % smaller_se)

Finally, compute the SE that you would get from a larger sample size. Here, a number that is significantly larger would make any difference more obvious, but any sample size larger than Michelle's will do. 

In [None]:
larger_sample_size = ...
larger_se = ...
print("With this larger sample size, you would predict an SE of %f" % larger_se)

## 3. Reflection

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on [Reflection on homework 9](https://yale.instructure.com/courses/76050/quizzes/52315).


## 4. Submission

Once you're finished, submit your assignment as a .pdf (download as .html, then print to save as a .pdf) on Gradescope.