<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 31: Designing Experiments

Associated Textbook Sections: [14.6](https://inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html)

## Outline

* [Confidence Intervals](#Confidence-Intervals)
* [Sample Proportions](#Sample-Proportions)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Confidence Intervals

### Graph of the Distribution

<img src="img/lec28_approx_dist_sample_ave.png" width=50%>

### The Key to 95% Confidence 

<img src="img/lec28_95_confidence.png" width=80%>

* For about 95% of all samples, the sample average and population average are within 2 SDs of each other.
* Here "SD" means the SD of the sample average, which equals $(\text{population SD}) / \sqrt{\text{sample size}}$.


### Constructing the Interval

For about 95% of all samples,

* If you stand at the population average and look two SDs on both sides, you will find the sample average.
* Distance is symmetric.
* So if you stand at the sample average and look two SDs on both sides, you will capture the population average.
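
The construction above can be sketched in a few lines. This is only an illustration: the function name `approx_95_interval` and the numbers passed to it are hypothetical, and the population SD is assumed to be known.

```python
import numpy as np

def approx_95_interval(sample_average, population_sd, sample_size):
    """Approximate 95% interval: sample average plus or minus 2 SDs of the sample average."""
    sd_of_sample_average = population_sd / np.sqrt(sample_size)
    return (sample_average - 2 * sd_of_sample_average,
            sample_average + 2 * sd_of_sample_average)

# Hypothetical numbers, chosen only for illustration
left, right = approx_95_interval(sample_average=70, population_sd=10, sample_size=100)
print(left, right)
```

With these made-up inputs the SD of the sample average is 1, so the interval runs from 2 below the sample average to 2 above it.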


### The Interval

<img src="img/lec28_the_interval_mean.png" width=80%>

### Width of the Interval

Total width of a 95% confidence interval for the population average

* $=  4 * \text{SD of the sample average}$
* $=  4 * (\text{population SD}) / \sqrt{\text{sample size}}$
* The $\text{population SD}$ is unknown in practice, so it has to be estimated or bounded.
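
As a rough sketch of how the width responds to the sample size, the snippet below evaluates the width formula for a few sample sizes. The helper name `width_of_95_interval` and the population SD of 10 are assumptions made for the example.

```python
import numpy as np

def width_of_95_interval(population_sd, sample_size):
    """Total width: 4 * (population SD) / sqrt(sample size)."""
    return 4 * population_sd / np.sqrt(sample_size)

hypothetical_sd = 10                       # assumed value, for illustration only
sample_sizes = np.array([100, 400, 1600])  # each size is 4 times the previous one
widths = width_of_95_interval(hypothetical_sd, sample_sizes)
print(widths)
```

Because the sample size sits under a square root, each fourfold increase in the sample size only halves the width.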


---

## Sample Proportions

### Proportions are Averages

* Data: 1 1 1 1 0 0 0 0 0 0 (10 entries)
    * Sum  =  4  (number of 1's)
    * Average  =  4/10  =  0.4 (proportion of 1's)
* If the population consists of 1's and 0's (yes/no answers to a question), then:
    * the population average is the proportion of 1's in the population
    * the sample average is the proportion of 1's in the sample


In [None]:
number_of_ones = 4
zero_one_population = np.append(np.ones(number_of_ones), np.zeros(10 - number_of_ones))
zero_one_population

In [None]:
np.mean(zero_one_population)

### Confidence Interval

<img src="img/lec28_the_interval_prop.png" width=80%>

### Controlling the Width

* Total width of an approximate 95% confidence interval for a population proportion is equal to 

$$4 * (\text{SD of 0/1 population}) / \sqrt{\text{sample size}}$$

* The narrower the interval, the more precise your estimate.
* Suppose you want the total width of the interval to be no more than 1%. How should you choose the sample size?


### The Sample Size for a Given Width

* $0.01  =  4 * (\text{SD of 0/1 population}) / \sqrt{\text{sample size}}$
* Left side: 1%, the max total width that you'll accept
* Right side: formula for the total width
* Re-arrange: $\sqrt{\text{sample size}} =  4 * (\text{SD of 0/1 population}) / 0.01$
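
Squaring both sides of the rearranged equation gives the sample size itself. A minimal sketch, assuming a hypothetical SD of 0.4 for the 0/1 population (the function name `sample_size_for_width` is also an assumption):

```python
import math

def sample_size_for_width(sd_of_population, max_width):
    """Solve max_width = 4 * SD / sqrt(n) for n (not yet rounded up)."""
    return (4 * sd_of_population / max_width) ** 2

hypothetical_sd = 0.4   # assumed value, for illustration only
required = sample_size_for_width(hypothetical_sd, 0.01)

# Plugging the answer back into the width formula should recover the 1% width
recovered_width = 4 * hypothetical_sd / math.sqrt(required)
print(required, recovered_width)
```

In practice the result would be rounded up to a whole number of individuals.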

### Demo: SD of 0/1 Population

Calculate the SD of the 0/1 population

In [None]:
zero_one_population

In [None]:
np.std(zero_one_population)

Let's make a graph with proportion of ones on the x axis and SD on the y axis.

In [None]:
def sd_of_zero_one_population(number_of_ones):
    """SD of a population with number_of_ones ones and (10 - number_of_ones) zeros"""
    zero_one_population = np.append(np.ones(number_of_ones), np.zeros(10 - number_of_ones))
    return np.std(zero_one_population)

In [None]:
possible_ones = np.arange(11)
zero_one_pop = Table().with_columns(
    'Number of ones', possible_ones,
    'Proportion of ones', possible_ones / 10
)
)

In [None]:
sds = zero_one_pop.apply(sd_of_zero_one_population, 'Number of ones')
zero_one_pop = zero_one_pop.with_column('SD', sds)

In [None]:
zero_one_pop.scatter('Proportion of ones', 'SD')
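
The curve in the scatter plot matches a standard fact: a 0/1 population with a proportion $p$ of ones has SD $\sqrt{p(1-p)}$. The sketch below checks this against `np.std` for the eleven populations used above. It uses plain NumPy arrays instead of a `Table`, and the variable names are chosen for this example.

```python
import numpy as np

number_of_ones = np.arange(11)
proportions = number_of_ones / 10

# SD computed directly from each population of ten 0's and 1's
sd_by_definition = np.array([
    np.std(np.append(np.ones(k), np.zeros(10 - k))) for k in number_of_ones
])

# SD from the formula sqrt(p * (1 - p))
sd_by_formula = np.sqrt(proportions * (1 - proportions))

print(np.allclose(sd_by_definition, sd_by_formula))
print(proportions[np.argmax(sd_by_formula)], np.max(sd_by_formula))
```

The largest SD occurs when half the population is ones, and its value there is 0.5.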

### "Worst Case" Population SD

* $\sqrt{\text{sample size}} =  4 * (\text{SD of 0/1 population}) / 0.01$
* SD of 0/1 population is at most 0.5
* To keep the width at or below 1% whatever the true SD is, require $\sqrt{\text{sample size}} \geq 4 * 0.5 / 0.01$
* $\text{sample size} \geq  (4 * 0.5 / 0.01)^ 2   =   40000$
* The sample size should be 40,000 or more
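
To see why planning for the worst case is safe, the sketch below fixes the sample size at 40,000 and computes the interval width for several possible population proportions. The grid of proportions is an assumption made for illustration.

```python
import numpy as np

sample_size = 40000
proportions = np.arange(11) / 10                   # possible proportions of ones
sds = np.sqrt(proportions * (1 - proportions))     # SD of each 0/1 population
widths = 4 * sds / np.sqrt(sample_size)            # width of the 95% interval

# The widest interval occurs at proportion 0.5; every other one is narrower
print(np.max(widths))
print(np.all(widths <= 0.01 + 1e-12))
```

Whatever the true proportion turns out to be, the width stays at or below the 1% target.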


### Example

<img src="./img/lec28_poll_scientific_american.png" alt="Scientific American article with the headline reading: How can a poll of only 1,004 Americans represent 260 million people with only a 3 percent margin of error?" width=40%>

* A researcher is estimating a population proportion based on a random sample of size 1,004.
* With chance at least 95%, the estimate will be correct to within 3%.
* A 3% margin of error translates to an interval width of 6%.
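
One possible way to check these numbers is to use the worst-case SD of 0.5 in the width formula. This is a sketch rather than the lecture's own solution, and the variable names are assumptions.

```python
import numpy as np

sample_size = 1004
worst_case_sd = 0.5   # largest possible SD of a 0/1 population

# Width of the approximate 95% interval in the worst case
worst_case_width = 4 * worst_case_sd / np.sqrt(sample_size)
worst_case_margin = worst_case_width / 2

print(worst_case_width, worst_case_margin)
```

The worst-case width comes out at roughly 0.063, that is, a margin of error of roughly 3 percentage points.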

In [None]:
CI_width = ...
CI_width 

### An Exercise

* A researcher is estimating a population proportion based on a random sample of size 10,000. 
* With a confidence level of 95%, the estimate will be correct to within how many percentage points?
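
A possible approach, again assuming the worst-case SD of 0.5 (the names mirror the exercise cell, but this is only one way to fill it in):

```python
import numpy as np

sample_size = 10000
worst_case_sd = 0.5   # assumed worst case for a 0/1 population

width = 4 * worst_case_sd / np.sqrt(sample_size)   # total width of the interval
margin_of_error = width / 2                        # half the width
margin_of_error_percent = 100 * margin_of_error    # in percentage points
print(margin_of_error_percent)
```

Under this assumption the estimate is correct to within about 1 percentage point.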

In [None]:
width = ...
margin_of_error = ...
margin_of_error_percent = ...
margin_of_error_percent

### Another Example

* I am going to use a 68% confidence interval to estimate a population proportion. 
* I want the total width of my interval to be no more than 2.5\%.
* How large must my random sample be?
* $2 \cdot (\frac{0.5}{\sqrt{\text{sample\_size}}}) = 0.025$
* In other words: $\text{sample\_size} = \left(\frac{2\cdot 0.5}{0.025}\right)^2$
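
The only change from the 95% case is the number of SDs on each side of the estimate: 1 for about 68% confidence instead of 2 for about 95%. A small sketch under the worst-case SD of 0.5, with a hypothetical helper name:

```python
def worst_case_sample_size(max_width, sds_each_side):
    """Sample size for which (2 * sds_each_side) * 0.5 / sqrt(n) equals max_width."""
    total_sds = 2 * sds_each_side   # interval spans this many SDs in total
    return (total_sds * 0.5 / max_width) ** 2

size_68 = worst_case_sample_size(0.025, sds_each_side=1)   # about 1600
size_95 = worst_case_sample_size(0.025, sds_each_side=2)   # about 6400
print(size_68, size_95)
```

Asking for 95% confidence at the same width doubles the number of SDs spanned and therefore quadruples the required sample size.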


In [None]:
sample_size = ...
sample_size

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>