<h1> Lecture 27 

Data Science 8, Spring 2021 </h1>

<h3>
<b>
<ul>
<li>The Normal Distribution</li><br>
<li>The Central Limit Theorem (CLT)</li><br>
<li>Sample Means</li>
</ul>
</b>
</h3>

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
#Syntax: ![title](image_name.png)
from IPython.display import Image

<h2> Central Limit Theorem </h2>

In [None]:
united = Table.read_table('united.csv')
united_bins = np.arange(-20, 301, 10)
united

In [None]:
united.hist('Delay', bins=united_bins)

<h4>Compute the Median, Mean, and Standard Deviation of the Delays</h4>

In [None]:
delays = united.column('Delay')
delay_median = percentile(50, delays)
delay_mean = np.mean(delays)
delay_sd = np.std(delays)
print('Median Delay:', np.round(delay_median,2))
print('Mean Delay:', np.round(delay_mean,2))
print('Delay Standard Deviation:', np.round(delay_sd,2))

<h4><u>Question:</u> Why is the Mean greater than the Median? </h4>
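One way to explore this question is with a small made-up example. The delays below are hypothetical (they are not taken from `united.csv`): most are short, and two are very long. The sketch only compares the two summary statistics on such data.

```python
import statistics

# Hypothetical delays in minutes (NOT from united.csv):
# mostly small values plus two very long delays in the right tail.
toy_delays = [-3, -1, 0, 2, 4, 5, 8, 12, 150, 240]

toy_median = statistics.median(toy_delays)
toy_mean = statistics.mean(toy_delays)

print('Toy median:', toy_median)
print('Toy mean:', toy_mean)
```

The two long delays raise the mean considerably, while the median depends only on the middle of the sorted values.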

<h4>
Assume that we don't have practical access to the complete data. Accordingly, we'll only sample the data. In particular, we'll <br>
<ul>
    <li> write code to generate a sample (with replacement) of flights; </li><br>
    <li> compute the mean delay in the sample, called the <i>Sample Mean of the Delay</i>; and </li><br>
    <li> generate a large number of such samples, from which we draw inferences about the true mean delay.</li><br>
</ul>
The Sample Mean is an estimate of the population mean (i.e., true mean flight delay of all flights).
</h4>

<h3>Generate a Single Sample Mean:</h3>

In [None]:
def one_sample_mean(sample_size):
    """ 
    Takes a sample from the population of flights 
    and computes its mean
    """
    # Recall that, by default, the Table "sample" method
    # draws rows with replacement (with_replacement=True)
    sampled_flights = united.sample(sample_size)
    return np.mean(sampled_flights.column('Delay'))

<h4>Run the function <tt>one_sample_mean</tt> several times:</h4>

In [None]:
one_sample_mean(400)

<h3>Question: How many possible random samples are there?</h3>

<h4>In how many possible ways can we draw a sample of size <tt>sample_size=100</tt> from the data set?
</h4>

<h5>Total number of flights in our data set (total population size):</h5>

In [None]:
united.num_rows

<h5>For each of the 100 flights we draw, we have <tt>united.num_rows</tt> possibilities (recall that we sample with replacement):</h5>

In [None]:
# How many possible (ordered) samples of size 100 are there?
united.num_rows ** 100

<h4>How many random samples can we get of size <tt>sample_size=400</tt>?</h4>

In [None]:
# How many possible (ordered) samples of size 400 are there?
united.num_rows ** 400

Far too many samples to enumerate them all!  So while there *is* a well-defined distribution given by all possible sample means from all samples, it is too hard to compute it exactly.  Instead, we approximate this distribution by drawing 10,000 samples from it (which is much smaller than the ridiculous number above!).  We then draw the histogram of the sample means of those 10,000 samples.
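To get a feel for the size of these counts, the sketch below counts their digits. The population size of 10,000 is a hypothetical round number chosen for illustration; in this notebook the actual value is `united.num_rows`.

```python
# Hypothetical population size, used only for illustration;
# in this notebook the actual value is united.num_rows.
population_size = 10_000

digit_counts = {}
for sample_size in (100, 400):
    # Each of the sample_size draws can be any of the population_size rows
    num_possible_samples = population_size ** sample_size
    digit_counts[sample_size] = len(str(num_possible_samples))
    print('sample_size =', sample_size, '->', digit_counts[sample_size], 'digits')
```

Even with this modest population, the number of possible samples of size 400 has more than a thousand digits, which is why we simulate instead of enumerating.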

<h4>In particular, to understand the variability of the Sample Mean, let's<br>
<ul>
    <li>run a large number (<tt>num_simulations</tt>) of trials, that is, take a large number of samples of identical size (e.g., <tt>sample_size=400</tt>);</li><br>
    <li>compute the mean of each sample (called the "Sample Mean"); and</li><br> 
    <li>observe the distribution of these sample means.</li>
</ul>
</h4>

<h3>Run a Large Number of Trials&mdash;that is, generate a large number of samples:</h3>

In [None]:
def many_sample_means(sample_size,num_simulations):
    """Approximate the distribution of the sample mean"""
    means = make_array()
    for i in np.arange(num_simulations):
        mean = one_sample_mean(sample_size)
        means = np.append(means, mean)
    return means

In [None]:
sample_means_100_10000 = many_sample_means(100,10000)

In [None]:
sample_means_100_10000

In [None]:
len(sample_means_100_10000)

In [None]:
# hist() draws the plot and returns nothing, so there is no table to keep
Table().with_column(
    'Mean of 100 flight delays', sample_means_100_10000).hist(bins=20)

print('Population Mean:', np.round(delay_mean,2))
print('Average of Sample Means:', np.round(np.mean(sample_means_100_10000),2))

<h3>Now let's look at the distribution's dependence on sample size.</h3>
    
<h4>What if each sample contains 400 flights?</h4>

In [None]:
sample_means_400_10000 = many_sample_means(400,10000)
# hist() draws the plot and returns nothing, so there is no table to keep
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_10000).hist(bins=20)

print('Population Mean:', np.round(delay_mean,2))
print('Average of Sample Means:', np.round(np.mean(sample_means_400_10000),2))

SLIDE: Distribution of the Sample Mean

<h4>Now do the same with a sample size of 900 flights.</h4>

In [None]:
sample_means_900_10000 = many_sample_means(900,10000)
# hist() draws the plot and returns nothing, so there is no table to keep
Table().with_column(
    'Mean of 900 flight delays', sample_means_900_10000).hist(bins=20)

print('Population Mean:', np.round(delay_mean,2))
print('Average of Sample Means:', np.round(np.mean(sample_means_900_10000),2))

In [None]:
sample_means_100_400_900_table = Table().with_columns(
    'Mean of 100 flight delays', sample_means_100_10000,
    'Mean of 400 flight delays', sample_means_400_10000,
    'Mean of 900 flight delays', sample_means_900_10000)
sample_means_100_400_900_table

<h4>For comparison, superimpose the histograms for sample sizes 100, 400, and 900:</h4>

In [None]:
sample_means_100_400_900_table.hist(bins=20)

<h4>How do you interpret the picture above?</h4>   
<h5>Think in terms of "center of mass" and spread.</h5>
<h4>What does it tell you about the effect of increasing the sample size?
</h4>

SLIDE: Specifying the Distribution

<h3>Quantifying the effect of sample size on spread:</h3>

In [None]:
np.std(sample_means_100_10000)/np.std(sample_means_400_10000)

In [None]:
np.sqrt(400/100)

<h5>What about sample sizes of 450 and 900?</h5>

In [None]:
sample_means_450_10000 = many_sample_means(450,10000)
np.std(sample_means_450_10000)/np.std(sample_means_900_10000)

In [None]:
np.sqrt(900/450)

<h5>What about sample sizes of 100 and 900?</h5>

In [None]:
np.std(sample_means_100_10000)/np.std(sample_means_900_10000)

In [None]:
np.sqrt(900/100)

<h4>Consider sample sizes $m$ and $n$, where $n>m$.<br><br>
    Then the standard deviation of the sample mean for <tt>sample_size</tt>$=n$ is <u>smaller</u> than the standard deviation of the sample mean for <tt>sample_size</tt>$=m$ by a factor of $\sqrt{\displaystyle \frac{n}{m}}$.</h4>
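The square-root relationship can also be checked without `united.csv`. The sketch below is a minimal illustration that uses only the standard library. It assumes a hypothetical right-skewed population (exponentially distributed values with a mean near 20) standing in for the flight delays; the helper `sd_of_sample_means` is introduced here for illustration and is not part of the notebook.

```python
import math
import random
import statistics

random.seed(8)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed "population" of delays (NOT the united.csv data)
population = [random.expovariate(1 / 20) for _ in range(5_000)]

def sd_of_sample_means(sample_size, num_simulations=2_000):
    """SD of many sample means, each from a sample drawn with replacement."""
    means = [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(num_simulations)
    ]
    return statistics.pstdev(means)

m, n = 100, 400
observed_ratio = sd_of_sample_means(m) / sd_of_sample_means(n)
predicted_ratio = math.sqrt(n / m)

print('Observed ratio of SDs:', round(observed_ratio, 2))
print('Predicted sqrt(n/m):', predicted_ratio)
```

The observed ratio varies a little from run to run (here it is fixed by the seed), but it should land close to $\sqrt{400/100} = 2$.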

<h3>One-Stop Code Cell</h3>

In [None]:
"""Empirical distribution of random sample means"""

def plot_and_summarize_sample_means(sample_size,num_simulations):
    sample_means = many_sample_means(sample_size,num_simulations)
    sample_means_table = Table().with_column('Sample Means', sample_means)
    
    # Print some information about the distribution of the sample means
    print("Sample size: ", sample_size)
    print("Population mean:", delay_mean)
    print("Average of sample means: ", np.mean(sample_means))
    print("Population SD:", delay_sd)
    print("SD of sample means:", np.std(sample_means))

    # Plot a histogram of the sample means
    sample_means_table.hist(bins=20)
    plots.xlabel('Sample Means')
    plots.title('Sample Size ' + str(sample_size))

In [None]:
plot_and_summarize_sample_means(100,10000)

<h4>If I take a sample of size <tt>sample_size=1</tt>, the standard deviation of the sample mean is the Population SD: 39.48.</h4> 

<h4>If I take a sample of size <tt>sample_size=100</tt>, the standard deviation of the sample mean is reduced by a factor of:</h4> 

In [None]:
39.48 / 3.932

<h4>Note that:</h4>

In [None]:
np.sqrt(100)

In [None]:
plot_and_summarize_sample_means(400,10000)

<h4>If I take a sample of size <tt>sample_size=400</tt>, the standard deviation of the sample mean is reduced by a factor of:</h4> 

In [None]:
39.48 / 1.973

<h4>Note that:</h4>

In [None]:
np.sqrt(400)

In [None]:
plot_and_summarize_sample_means(625,10000)

<h4>If I take a sample of size <tt>sample_size=625</tt>, the standard deviation of the sample mean is reduced by a factor of:</h4> 

In [None]:
39.48 / 1.577

<h4>Note that:</h4>

In [None]:
np.sqrt(625)

SLIDE: Variability of the Sample Mean
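The three comparisons above suggest a pattern: the SD of the sample means is roughly the population SD divided by the square root of the sample size. The sketch below redoes that arithmetic in one place, using the population SD (39.48) and the observed SDs (3.932, 1.973, 1.577) quoted in the cells above. Those observed values come from one set of simulation runs, so your own runs will differ slightly.

```python
import math

population_sd = 39.48  # population SD quoted above

# SD of the sample means quoted in the cells above, keyed by sample size
observed_sd = {100: 3.932, 400: 1.973, 625: 1.577}

predicted_sd = {}
for sample_size, observed in observed_sd.items():
    # Predicted SD of the sample mean: population SD / sqrt(sample size)
    predicted_sd[sample_size] = population_sd / math.sqrt(sample_size)
    print('sample_size =', sample_size,
          '| predicted:', round(predicted_sd[sample_size], 3),
          '| observed:', observed)
```

In each case the predicted and observed values agree to within about 0.02.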

<h3>Now let's see the effect of increasing the number of trials&mdash;that is, the number of samples that we take&mdash;for a specified sample size (e.g., <tt>sample_size=400</tt>):</h3>

<h4>Sample Size=400<br>
Number of Trials=10</h4>

In [None]:
sample_means_400_10 = many_sample_means(400,10)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_10).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_10),2))

<h4>Sample Size=400<br>
Number of Trials=50</h4>

In [None]:
sample_means_400_50 = many_sample_means(400,50)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_50).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_50),2))

<h4>Sample Size=400<br>
Number of Trials=100</h4>

In [None]:
sample_means_400_100 = many_sample_means(400,100)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_100).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_100),2))

<h4>Sample Size=400<br>
Number of Trials=1,000</h4>

In [None]:
sample_means_400_1000 = many_sample_means(400,1000)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_1000).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_1000),2))

<h4>Sample Size=400<br>
Number of Trials=10,000</h4>

In [None]:
sample_means_400_10000 = many_sample_means(400,10000)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_10000).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_10000),2))

<h4>Sample Size=400<br>
Number of Trials=50,000</h4>

In [None]:
sample_means_400_50000 = many_sample_means(400,50000)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400_50000).hist(bins=20)

print('Population Average:', np.round(delay_mean,2))
print('Sample Average:', np.round(np.mean(sample_means_400_50000),2))
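To put numbers next to the histograms above, the sketch below repeats the experiment on a hypothetical right-skewed population (exponentially distributed values, not the `united.csv` data), using only the standard library. The helper `simulate_sample_means` is introduced here for illustration. For a fixed sample size of 400, it reports the average and the SD of the sample means for several numbers of trials.

```python
import math
import random
import statistics

random.seed(27)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed population of delays (NOT the united.csv data)
population = [random.expovariate(1 / 20) for _ in range(5_000)]
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)
sample_size = 400

def simulate_sample_means(num_simulations):
    """Means of num_simulations samples, each drawn with replacement."""
    return [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(num_simulations)
    ]

summary = {}
for num_simulations in (10, 100, 2_000):
    means = simulate_sample_means(num_simulations)
    summary[num_simulations] = (statistics.fmean(means), statistics.pstdev(means))
    print(num_simulations, 'trials: average of sample means =',
          round(summary[num_simulations][0], 2),
          '| SD of sample means =', round(summary[num_simulations][1], 2))

print('Population mean:', round(population_mean, 2))
print('Population SD / sqrt(sample_size):',
      round(population_sd / math.sqrt(sample_size), 2))
```

With more trials, the summaries settle down and the histogram becomes smoother, but the spread of the sample means does not shrink: it is governed by the sample size, not by the number of trials.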