<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 28: Sample Means

Associated Textbook Sections: [14.5](https://inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html)

---

## Outline

* [Central Limit Theorem](#Central-Limit-Theorem)
* [Distribution of the Sample Average](#Distribution-of-the-Sample-Average)
* [Center of the Distribution](#Center-of-the-Distribution)
* [Variability of the Sample Average](#Variability-of-the-Sample-Average)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Re-define the standard_units function from an earlier lecture
def standard_units(x):
    """Convert array x to standard units."""
    return (x - np.mean(x)) / np.std(x)

# To create normally distributed values
from scipy.stats import norm

---

## Central Limit Theorem

---

### Sample Averages

* The Central Limit Theorem describes how the normal distribution (a bell-shaped curve) is connected to random sample averages.
* We care about sample averages because they estimate population averages.

---

### Central Limit Theorem

> If the sample is large, and it is drawn at random with replacement, then regardless of the distribution of the population, the probability distribution of the sample sum (or the sample average) is roughly normal.
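
As a minimal sketch of what the theorem claims, the simulation below draws samples with replacement from a strongly right-skewed, made-up population and checks that the sample averages behave like a bell curve. The exponential population, the seed, the sample size of 400, and the 5,000 repetitions are all assumptions chosen for illustration; none of them come from the lecture data.

```python
import numpy as np

rng = np.random.default_rng(28)  # arbitrary fixed seed so the run is repeatable

# Hypothetical right-skewed population: 100,000 exponential values
population = rng.exponential(scale=10, size=100_000)

sample_size = 400
repetitions = 5_000

# Each row is one random sample drawn with replacement
samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
sample_means = samples.mean(axis=1)

# For a normal curve, about 68% of values lie within 1 SD of the center
# and about 95% lie within 2 SDs
z = (sample_means - np.mean(sample_means)) / np.std(sample_means)
within_1_sd = np.mean(np.abs(z) <= 1)
within_2_sd = np.mean(np.abs(z) <= 2)

print('Proportion within 1 SD:', within_1_sd)
print('Proportion within 2 SDs:', within_2_sd)
```

If the two proportions come out near 0.68 and 0.95, the sample averages are roughly normal even though the population itself is far from bell shaped.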



---

### Demo: Central Limit Theorem

* Load the November 2023 flight delay data in `delays.csv`, sourced from the [Bureau of Transportation Statistics' Reporting Carrier On-Time Performance Data](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr). The variable `ARR_DELAY` contains the difference in minutes between the scheduled and actual arrival times at the destination airport `DEST`. Early arrivals show negative numbers. The airline code is stored in the variable `OP_CARRIER`.
* Demonstrate the Central Limit Theorem by showing how the distribution of sample means changes as the sample size increases.

In [None]:
sfo = (Table.read_table('delays.csv')
            .where('ORIGIN', 'SFO'))
nan_filter = np.invert(np.isnan(sfo.column('ARR_DELAY')))
sfo = sfo.where(nan_filter)
sfo

In [None]:
sfo.hist('ARR_DELAY', bins=np.arange(-30, 150, 10))

In [None]:
delays = sfo.column('ARR_DELAY')
delay_mean = np.mean(delays)
delay_sd = np.std(delays)
print(f'The average flight delay was {delay_mean:.2f} mins '
      f'with a standard deviation of {delay_sd:.2f} mins.')

In [None]:
def one_sample_mean(sample_size):
    """ Takes a sample from the population of flights and computes its mean"""
    sampled_flights = sfo.sample(sample_size)
    sampled_delays = sampled_flights.column('ARR_DELAY')
    mean_delay = np.mean(sampled_delays)
    return mean_delay

In [None]:
one_sample_mean(100)

In [None]:
def ten_thousand_sample_means(sample_size):
    """Simulate 10,000 sample means, each from a random sample of the given size."""
    means = make_array()
    for i in np.arange(10_000):
        one_mean = one_sample_mean(sample_size)
        means = np.append(means, one_mean)
    return means

In [None]:
sample_means_100 = ten_thousand_sample_means(100)
Table().with_column('Mean of 100 flight delays', sample_means_100).hist(bins=20)
print('Population Average:', delay_mean)

In [None]:
sample_means_400 = ten_thousand_sample_means(400)
Table().with_column('Mean of 400 flight delays', sample_means_400).hist(bins=20)
print('Population Average:', delay_mean)

In [None]:
sample_means_900 = ten_thousand_sample_means(900)
Table().with_column('Mean of 900 flight delays', sample_means_900).hist(bins=20)
print('Population Average:', delay_mean)

---

## Distribution of the Sample Average

---

### Why is There a Distribution?

* You have only one random sample, and it has only one average. 
* But the sample could have come out differently.
* And then the sample average might have been different.
* So there are many possible sample averages.


In [None]:
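# Each of the 400 draws can be any row, so the count is num_rows raised to the 400th power
print(f"There are {int(sfo.num_rows) ** 400:,} possible samples of size 400 "
      f"(drawn with replacement) from this data set, each with its own average.")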

---

### Distribution of the Sample Average 

* Imagine all possible random samples of the same size as yours. There are lots of them.
* Each of these samples has an average.
* The distribution of the sample average is the distribution of the averages of all the possible samples.
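
For a tiny made-up population, every possible sample can be listed, which makes the idea concrete. The sketch below assumes a hypothetical population of four delay values and samples of size 2 drawn with replacement. For the flight data the same list would be far too long to write out, which is why the demos simulate instead.

```python
from itertools import product
from collections import Counter

population = [-5, 0, 10, 35]   # hypothetical delays in minutes
sample_size = 2

# Every ordered sample of size 2 drawn with replacement: 4 ** 2 = 16 samples
all_samples = list(product(population, repeat=sample_size))
all_averages = [sum(s) / sample_size for s in all_samples]

# The distribution of the sample average: each possible average and its chance
distribution = {avg: count / len(all_samples)
                for avg, count in sorted(Counter(all_averages).items())}

for avg, chance in distribution.items():
    print(f'average {avg:6.1f}  chance {chance:.4f}')
```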

---

### Demo: Distribution of the Sample Average

Compare the distribution of sample averages for a few different sample sizes.

In [None]:
means_tbl = Table().with_columns(
    '400', sample_means_400,
    '900', sample_means_900)

In [None]:
means_tbl.hist(bins = np.arange(-10, 15, 0.5))
plt.title('Distribution of Sample Average')
plt.show()

---

### Specifying the Distribution

* Suppose the random sample is large.
    * We have seen that the distribution of the sample average is roughly bell shaped.
* Important questions remain:
    * Where is the center of that bell curve?
    * How wide is that bell curve?

---

## Center of the Distribution

---

### The Population Average

The distribution of the sample average is roughly a bell curve centered at the population average.
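
One way to check the center without simulation is to average over every possible sample of a tiny made-up population. The four delay values and the sample size of 3 below are assumptions for illustration only.

```python
from itertools import product
from statistics import fmean

population = [-5, 0, 10, 35]   # hypothetical delays in minutes
population_mean = fmean(population)

# Average of every possible ordered sample of size 3, drawn with replacement
all_averages = [fmean(s) for s in product(population, repeat=3)]

# Center of the distribution of the sample average
center = fmean(all_averages)

print('Population mean:', population_mean)
print('Mean of all possible sample averages:', center)
```

The two printed values should agree up to floating-point rounding, which is the sense in which the distribution of the sample average is centered at the population average.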

---

## Variability of the Sample Average

---

### Why Is This Important?

* Along with the center, the spread helps identify exactly which normal curve is the distribution of the sample average.
* The variability of the sample average helps us measure how accurate the sample average is as an estimate of the population average.
* If we want a specified level of accuracy, understanding the variability of the sample average helps us work out how large our sample has to be.
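
The last point can be sketched using the relationship from the Central Limit Theorem, SD of the sample average = (population SD) / (square root of the sample size). Solving for the sample size gives sample size ≥ (population SD / target SD)². The population SD of 40 minutes and the target of 2 minutes below are assumed values for illustration, not numbers computed from `delays.csv`.

```python
import math

population_sd = 40.0   # assumed population SD, in minutes
target_sd = 2.0        # desired SD of the sample average, in minutes

# SD of sample average = population_sd / sqrt(n),
# so n must be at least (population_sd / target_sd) ** 2
required_size = math.ceil((population_sd / target_sd) ** 2)

print('Required sample size:', required_size)
print('Resulting SD of the sample average:', population_sd / math.sqrt(required_size))
```

Because the sample size appears under a square root, halving the SD of the sample average requires four times as many observations.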

---

### Two Histograms

<img src="./dist_sample_ave_update.png" width = 50%>

* The gold histogram shows the distribution of 10,000 values, each of which is an average of 900 randomly sampled flight delays.
* The blue histogram shows the distribution of 10,000 values, each of which is an average of 400 randomly sampled flight delays.
* Both are roughly bell shaped.
* The larger the sample size, the narrower the bell.

---

### Demo: The Influence of Sample Size

Re-display the population mean and standard deviation.

In [None]:
print(f'The average flight delay was {delay_mean:.2f} mins '
      f'with a standard deviation of {delay_sd:.2f} mins.')

* Notice how the sample size impacts the distribution of sample averages.
* Additionally, notice how the ratio of the population SD to the SD of the sample means is approximately equal to the square root of the sample size.

In [None]:
def plot_and_summarize_sample_means(sample_size):
    """Empirical distribution of random sample means"""
    sample_means = ten_thousand_sample_means(sample_size)
    sample_means_tbl = Table().with_column('Sample Means', sample_means)
    sample_mean = np.mean(sample_means)
    sample_sd = np.std(sample_means)
    
    # Print some information about the distribution of the sample means
    print("Sample size: ", sample_size)
    print("Population mean:", delay_mean)
    print("Average of sample means: ", sample_mean)
    print("Population SD:", delay_sd)
    print("SD of sample means:", sample_sd)
    print("Ratio of population SD to SD of sample means:", delay_sd / sample_sd)
    print("Square Root of the sample size:", np.sqrt(sample_size))

    # Plot a histogram of the sample means
    sample_means_tbl.hist(bins=20)
    plt.xlabel('Sample Means')
    plt.title('Sample Size ' + str(sample_size))

    # Overlay a curve representing the normal distribution
    # (norm is imported from scipy.stats in the setup cell)
    x = np.linspace(np.min(sample_means), np.max(sample_means), 100)
    y = norm.pdf(x, delay_mean, delay_sd/np.sqrt(sample_size))
    plt.plot(x, y, linestyle='--', lw=5, label='Normal Distribution')

    # Add a vertical dashed line showing the mean delay
    plt.axvline(x=delay_mean, color='black', linestyle='--', lw=2, label='Population Mean')
    plt.legend()

In [None]:
plot_and_summarize_sample_means(100)

In [None]:
plot_and_summarize_sample_means(400)

In [None]:
plot_and_summarize_sample_means(625)

---

### Probability of Sample Average

* The distribution of all possible sample averages of a given size is called the distribution of the sample average.
* We approximate the distribution of sample averages by an empirical distribution.

---

### The Central Limit Theorem

* If 
    * the sample is large and 
    * drawn at random with replacement, 
* Then, regardless of the distribution of the population, the probability distribution of the sample average:
    * is roughly normal
    * mean = population mean
    * SD = (population SD) / the square root of the sample size
    
_Note: The formal statement of this [theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) is about the sample average converted to standard units: as the sample size grows, its distribution approaches the standard normal curve._
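
Put together, the three conclusions let us approximate chances for a sample average. The sketch below assumes a hypothetical population mean of 10 minutes and SD of 40 minutes (not values from `delays.csv`) and uses `statistics.NormalDist` from the Python standard library as a stand-in for `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 10.0   # assumed population mean, in minutes
population_sd = 40.0     # assumed population SD, in minutes
sample_size = 400

# SD of the sample average = population SD / sqrt(sample size) = 40 / 20 = 2
sd_of_sample_mean = population_sd / sqrt(sample_size)
sample_mean_dist = NormalDist(mu=population_mean, sigma=sd_of_sample_mean)

# Approximate chance that the sample average lands within 4 minutes
# (that is, within 2 SDs) of the population mean
chance = (sample_mean_dist.cdf(population_mean + 4)
          - sample_mean_dist.cdf(population_mean - 4))

print(f'Approximate chance: {chance:.4f}')
```

The result should be close to 95%, matching the usual rule that about 95% of a normal distribution lies within 2 SDs of its mean.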

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>