<h1> Lecture 28 

Data Science 8, Spring 2021 </h1>

<h3>
<b>
<ul>
<li>Designing Experiments: What Sample Size to Use?</li><br>
</ul>
</b>
</h3>

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
#Syntax: ![title](image_name.png)
from IPython.display import Image

In [None]:
united = Table.read_table('united.csv')
united

In [None]:
united.hist('Delay', bins = np.arange(-20, 300, 10))

In [None]:
delays = united.column('Delay')
population_delay_median = percentile(50, delays)
population_delay_mean = np.mean(delays)
population_delay_sd = np.std(delays)
print('Population Median Delay:', np.round(population_delay_median,2))
print('Population Mean Delay:', np.round(population_delay_mean,2))
print('Population Delay Standard Deviation:', np.round(population_delay_sd,2))


In [None]:
def one_sample_mean(sample_size):
    """ 
    Takes a sample from the population of flights 
    and computes its mean
    """
    # Recall that Table.sample draws with replacement by default
    # (with_replacement=True), so only the sample size is passed here
    sampled_flights = united.sample(sample_size)
    return np.mean(sampled_flights.column('Delay'))

In [None]:
def many_sample_means(sample_size,num_simulations):
    """Approximate the distribution of the sample mean"""
    means = make_array()
    for i in np.arange(num_simulations):
        mean = one_sample_mean(sample_size)
        means = np.append(means, mean)
    return means

In [None]:
"""Empirical distribution of random sample means"""

def plot_and_summarize_sample_means(sample_size,num_simulations):
    sample_means = many_sample_means(sample_size,num_simulations)
    sample_means_table = Table().with_column('Sample Means', sample_means)
    
    # Print some information about the distribution of the sample means
    print("Sample size: ", sample_size)
    print("Population mean:", population_delay_mean)
    print("Average of sample means: ", np.mean(sample_means))
    print("Population SD:", population_delay_sd)
    print("SD of sample means:", np.std(sample_means))

    # Plot a histogram of the sample means
    sample_means_table.hist(bins=20)
    plots.xlabel('Sample Means')
    plots.title('Sample Size ' + str(sample_size))

In [None]:
plot_and_summarize_sample_means(100,10000)

<h4>Recall that<br>   
$$\textsf{Sample-Mean SD}=\displaystyle \frac{1}{\sqrt{\textsf{Sample Size}}} \times \textsf{Population SD}.$$
</h4>

<h4>The Sample Mean SD is approximately $\displaystyle \frac{1}{10}$ of the Population SD, because the sample size is $100$.  
    
</h4>

In [None]:
plot_and_summarize_sample_means(400,10000)

<h4>The Sample Mean SD is approximately $\displaystyle \frac{1}{20}$ the Population SD, because the sample size is $400$.  
</h4>

In [None]:
plot_and_summarize_sample_means(900,10000)

<h4>Note that the Sample Mean SD is approximately $\displaystyle \frac{1}{30}$ the Population SD, because the sample size is $900$.  
</h4>
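
The three observations above are the formula at work. As a minimal sketch of the arithmetic, the cell below evaluates Population SD / sqrt(Sample Size) for the three sample sizes used so far. The helper name `sample_mean_sd` and the value `hypothetical_population_sd = 40.0` are assumptions made for illustration; they are not taken from `united.csv`.

```python
import numpy as np

def sample_mean_sd(population_sd, sample_size):
    """Theoretical SD of the sample mean: population SD / sqrt(sample size)."""
    return population_sd / np.sqrt(sample_size)

# Assumed value for illustration only; not computed from the flight data
hypothetical_population_sd = 40.0

for n in (100, 400, 900):
    sd = sample_mean_sd(hypothetical_population_sd, n)
    print('Sample size:', n,
          '| Sample-Mean SD:', np.round(sd, 2),
          '| Fraction of Population SD:', np.round(sd / hypothetical_population_sd, 4))
```

Because the sample size sits under a square root, quadrupling it (from 100 to 400) only halves the Sample-Mean SD.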

<h3>Upshot:<br><br> The larger the sample size, the smaller the SD of the sample mean, and so the more confident we can be in our estimate of the population mean.</h3>

## SD of the sample mean

In [None]:
num_simulations=10000

In [None]:
# Warning: this cell will take a long time to run!
sample_sizes = np.arange(100, 950, 50)

sample_mean_sds = make_array()
for n in sample_sizes:
    sample_means = many_sample_means(n,num_simulations)
    sample_mean_sds = np.append(sample_mean_sds, np.std(sample_means))
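
If the loop above is too slow, one possible speed-up is to draw every sample for a given size in a single NumPy call. The cell below is only a sketch under stated assumptions: `delays` is a synthetic, skewed stand-in for `united.column('Delay')`, the function name `many_sample_means_vectorized` is hypothetical, and the number of simulations is reduced to 2,000 to keep memory use small.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for united.column('Delay'); an assumption, not the real data
delays = rng.exponential(scale=40, size=13_000)

def many_sample_means_vectorized(values, sample_size, num_simulations):
    """Draw all samples at once (with replacement) and return one mean per sample."""
    samples = rng.choice(values, size=(num_simulations, sample_size))
    return samples.mean(axis=1)

sample_sizes = np.arange(100, 950, 50)

# SD of the simulated sample means, one entry per sample size
sample_mean_sds = np.array([
    np.std(many_sample_means_vectorized(delays, n, 2_000))
    for n in sample_sizes
])

# Theoretical value from the formula: Population SD / sqrt(sample size)
theoretical_sds = np.std(delays) / np.sqrt(sample_sizes)

print(np.round(sample_mean_sds, 2))
print(np.round(theoretical_sds, 2))
```

The two printed arrays should track each other closely, which is the same comparison the table in the next cell makes.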

In [None]:
sd_table = Table().with_columns(
    'Sample size', sample_sizes,
    'SD of simulated sample means', sample_mean_sds,
    'Theoretical: Pop SD / sqrt(sample size)', population_delay_sd / np.sqrt(sample_sizes),
)
sd_table.show()

In [None]:
sd_table.scatter('Sample size')

SLIDE: Central Limit Theorem

## SD of 0/1 Population ##

<h4>Play with the <tt>number_of_ones</tt> in the cell below&mdash;by trying integers between 0 and 10&mdash;and see how it affects the variance and standard deviation.<br><br>

Compare Variance and SD values for 0 and 10, 1 and 9, 2 and 8, ...</h4>

In [None]:
# Population of size 10

number_of_ones = 8
zero_one_population = np.append(np.ones(number_of_ones), np.zeros(10 - number_of_ones))

print('Variance:', np.round(np.var(zero_one_population),2))
print('Standard Deviation:', np.round(np.std(zero_one_population),2))

zero_one_population
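
Rather than changing `number_of_ones` by hand, the suggested pairs can be compared in one loop. This is a small sketch; the helper name `zero_one_variance_and_sd` is introduced here for illustration and is not part of the lecture code.

```python
import numpy as np

def zero_one_variance_and_sd(number_of_ones, population_size=10):
    """Variance and SD of a population of ones and zeros of the given size."""
    population = np.append(np.ones(number_of_ones),
                           np.zeros(population_size - number_of_ones))
    return np.var(population), np.std(population)

# Compare k ones against (10 - k) ones: 0 and 10, 1 and 9, 2 and 8, ...
for k in range(6):
    var_k, sd_k = zero_one_variance_and_sd(k)
    var_mirror, sd_mirror = zero_one_variance_and_sd(10 - k)
    print(k, 'ones -> Variance:', np.round(var_k, 2), 'SD:', np.round(sd_k, 2),
          '|', 10 - k, 'ones -> Variance:', np.round(var_mirror, 2),
          'SD:', np.round(sd_mirror, 2))
```

Each pair gives the same variance and SD, since swapping the roles of ones and zeros does not change how spread out the values are.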

<h4> Let's make a graph with proportion of ones on the $x$-axis and SD on the $y$-axis:</h4>

<h5>Step 1: Define a function that does two things: <br>
<ul>
<li> Create an array of size 10, consisting of a specified number of ones, and each of the remaining elements equal to zero; and </li><br>
<li> Return the Standard Deviation of the values in the array.</li>
</ul>   
</h5> 

In [None]:
def sd_of_zero_one_population(number_of_ones):
    """SD of a population with number_of_ones ones and (10 - number_of_ones) zeros"""
    zero_one_population = np.append(np.ones(number_of_ones), 
                                    np.zeros(10 - number_of_ones))
    return np.std(zero_one_population)

<h5>Step 2: Create a table of two columns, where<br>
<ul>
<li> the first column lists each possible number of ones in the array, from 0 through 10; and</li><br>
<li> the second column gives the corresponding fraction (proportion) of ones in the array.</li>
</ul>
</h5>

In [None]:
possible_ones = np.arange(11)
zero_one_pop = Table().with_columns(
    'Number of Ones', possible_ones,
    'Proportion of Ones', possible_ones / 10
)
zero_one_pop.show()

<h5>Step 3: Add a column of Standard Deviations, computed by applying the function from Step 1 to the first column (<tt>Number of Ones</tt>) of the table above:</h5>

In [None]:
sds = zero_one_pop.apply(sd_of_zero_one_population, 'Number of Ones')
zero_one_pop = zero_one_pop.with_column('SD', sds)
zero_one_pop.show()

<h4>Question: How does the SD change as the Proportion of Ones varies?</h4>

In [None]:
zero_one_pop.scatter('Proportion of Ones', 'SD')
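
One way to check what the scatter plot suggests: for a population of zeros and ones with proportion $p$ of ones, the SD equals $\sqrt{p(1-p)}$. The sketch below compares that formula with `np.std` on the ten-element populations. The variable names are chosen here for illustration.

```python
import numpy as np

# Proportions 0, 0.1, ..., 1 for populations of size 10
proportions = np.arange(11) / 10

# SD from the closed-form expression sqrt(p * (1 - p))
formula_sds = np.sqrt(proportions * (1 - proportions))

# SD computed directly from each array of ones and zeros
direct_sds = np.array([
    np.std(np.append(np.ones(k), np.zeros(10 - k)))
    for k in range(11)
])

# Proportion at which the SD is largest
largest = proportions[np.argmax(formula_sds)]

print('Formula SDs:', np.round(formula_sds, 3))
print('Direct SDs: ', np.round(direct_sds, 3))
print('SD is largest at proportion', largest)
```

The SD is zero when all values are the same, rises symmetrically, and peaks at 0.5 when the proportion of ones is one half.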