<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 17: Distributions

Associated Textbook Sections: [10.1, 10.2, 10.3, and 10.4](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html)

## Outline

* [Distributions](#Distributions)
* [Law of Large Numbers](#Law-of-Large-Numbers)
* [A Statistic](#A-Statistic)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Distributions

### Probability Distribution

* Random quantity with various possible values
* Probability Distribution:
    * All the possible values of the quantity
    * The probability of each of those values
* If you can do the math, you can work out the probability distribution without ever simulating it
* But... simulation is often easier!
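
As a minimal sketch of "doing the math", the distribution of the sum of two fair six-sided dice can be worked out by listing every equally likely outcome. The two-dice example and the name `sum_distribution` are illustrative assumptions, not part of the lecture demo.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered pair of faces from two fair six-sided dice is equally likely.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many of the 36 outcomes produce each possible sum.
counts = Counter(a + b for a, b in outcomes)

# The probability distribution: each possible value and its probability.
sum_distribution = {
    total: Fraction(count, len(outcomes))
    for total, count in sorted(counts.items())
}

for total, probability in sum_distribution.items():
    print(total, probability)
```

No random sampling is involved here, so the result is exact. The same approach becomes impractical once the number of outcomes is too large to list.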


### Empirical Distribution

* Empirical: based on observations
* Observations can be from repetitions of an experiment
* Empirical Distribution:
    * All observed values
    * The proportion of times each value appears
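
The two ingredients of an empirical distribution can be sketched with plain NumPy, assuming 600 simulated die rolls stand in for the observations. The seed, the number of rolls, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(108)

# Observations: 600 simulated rolls of a fair six-sided die.
rolls = rng.integers(1, 7, size=600)

# Empirical distribution: each observed value and the proportion of times it appears.
values, counts = np.unique(rolls, return_counts=True)
proportions = counts / len(rolls)

for value, proportion in zip(values, proportions):
    print(value, round(proportion, 3))
```

The proportions add up to 1, but they generally differ a little from the theoretical 1/6 for each face because they come from observations.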

### Demo: Distributions

Create a table with the face numbers of a six-sided die. Use that table with the `sample` method to simulate randomly rolling a die. Visualize the empirical distribution.

In [None]:
die = Table().with_column('Face', np.arange(1, 7))
die

In [None]:
...

In [None]:
...

Update the bins and randomly sample for a few different sample sizes.

In [None]:
roll_bins = np.arange(0.5, 6.6, 1)
die.hist(bins=...)

In [None]:
die.sample(...).hist(bins=roll_bins)

In [None]:
die.sample(...).hist(bins=roll_bins)

In [None]:
die.sample(...).hist(bins=roll_bins)

---

## Law of Large Numbers

### Law of Averages / Law of Large Numbers

* If a chance experiment is repeated many times, independently and under the same conditions, then the proportion of times that an event occurs gets closer to the theoretical probability of the event
* As you increase the number of rolls of a six-sided die, the proportion of times you see the face with five spots gets closer to 1/6
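
The die example above can be sketched as a small simulation. The helper name `proportion_of_fives`, the seed, and the chosen numbers of rolls are assumptions made for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(17)

def proportion_of_fives(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of fives."""
    rolls = rng.integers(1, 7, size=num_rolls)
    return np.count_nonzero(rolls == 5) / num_rolls

# Compare the observed proportion with 1/6 as the number of rolls grows.
for num_rolls in (10, 100, 10_000, 1_000_000):
    print(num_rolls, proportion_of_fives(num_rolls), 1 / 6)

large_run = proportion_of_fives(1_000_000)
```

The proportion is not guaranteed to improve at every step. The law describes what is very likely to happen as the number of repetitions becomes large.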


### Empirical Distribution of a Sample

If the sample size is large, then the empirical distribution of a uniform random sample resembles the distribution of the population, with high probability.
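
One hedged way to check this claim numerically is to compare bin proportions of a sample with those of the population. The population below is made up (it is not the flight data), and `bin_proportions` and `largest_gap` are hypothetical helper names.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(2022)

# Hypothetical population: 50,000 made-up, right-skewed "delay" values in minutes.
population = rng.exponential(scale=20, size=50_000) - 10

# Values above 130 fall outside these bins and are ignored,
# as they would be in a histogram drawn with the same bins.
bins = np.arange(-10, 131, 10)

def bin_proportions(values):
    """Return the proportion of values that fall in each bin."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

population_props = bin_proportions(population)

def largest_gap(sample_size):
    """Draw a uniform random sample (with replacement) and return the largest
    difference between the sample's and the population's bin proportions."""
    sample = rng.choice(population, size=sample_size)
    return np.max(np.abs(bin_proportions(sample) - population_props))

gap_small = largest_gap(10)
gap_large = largest_gap(1000)
print(gap_small, gap_large)
```

With a sample of 1000, the largest gap is typically only a few percentage points. With a sample of 10, every bin proportion is a multiple of 0.1, so the match is usually much rougher.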


### Demo: Large Random Samples

Load the flight delay data in `delays_july_2022.csv`, sourced from the [Bureau of Transportation Statistics' Airline On-Time Performance Data](https://www.transtats.bts.gov/Tables.asp?QO_VQ=EFD&QO_anzr=Nv4yv0r%FDb0-gvzr%FDcr4s14zn0pr%FDQn6n&QO_fu146_anzr=b0-gvzr).

In [None]:
delays = Table.read_table('./data/delays_july_2022.csv')
delays

Narrow the data to flights that departed from SFO, drop the columns that are not needed, and relabel `ARR_DELAY` as `DELAY`.

In [None]:
sfo_delays = delays.where('ORIGIN', 'SFO').drop(0, 1, 3).relabeled('ARR_DELAY', 'DELAY')
sfo_delays

Sort the flights by delay in descending order to inspect the `nan` (missing) values, then remove them from the data set.

In [None]:
sfo_delays = sfo_delays.sort('DELAY', True)
sfo_delays

**You are not responsible for understanding how to make the nan filter.**

In [None]:
nan_filter = np.invert(np.isnan(sfo_delays.column('DELAY')))
nan_filter

In [None]:
sfo_delays = sfo_delays.where(...)
sfo_delays
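
For the curious, here is a small sketch of what the `nan_filter` expression does, using a made-up array in place of the `DELAY` column. Plain array indexing stands in for the table operation here; this is an illustration, not the notebook's solution.

```python
import numpy as np

# Hypothetical delay values; np.nan marks a missing arrival delay.
delays = np.array([12.0, np.nan, -5.0, 40.0, np.nan])

# np.isnan marks the missing entries with True; np.invert flips each True/False,
# so the filter is True exactly where a real number is present.
nan_filter = np.invert(np.isnan(delays))
print(nan_filter)

# Indexing with the Boolean array keeps only the entries marked True.
clean_delays = delays[nan_filter]
print(clean_delays)
```

A comparison such as `delays == np.nan` would not work as a filter, because `nan` is not equal to anything, including itself.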

Visualize the distribution of flight delays in the delay data.

In [None]:
sfo_delays.hist('DELAY', unit='Minute')

In [None]:
min(sfo_delays.column('DELAY'))

In [None]:
max(sfo_delays.column('DELAY'))

In [None]:
bins = np.arange(-60, 120, 5)
sfo_delays.hist('DELAY', bins = bins, unit='Minute')

In [None]:
np.average(sfo_delays.column('DELAY'))

In [None]:
np.median(sfo_delays.column('DELAY'))

Randomly sample 10 and 1000 flights from the delay data and visualize the distributions of the samples.

In [None]:
sfo_delays.sample(10).hist('DELAY', bins = bins, unit='Minute')

In [None]:
sfo_delays.sample(1000).hist('DELAY', bins = bins, unit='Minute')

---

## A Statistic

### Inference

* Statistical Inference: Drawing conclusions based on data in random samples
* Example: Use the data to guess the value of an unknown, fixed number by computing an estimate that depends on the random sample



### Terminology

* Parameter: A number associated with the population
* Statistic: A number calculated from the sample
* A statistic can be used as an estimate of a parameter
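
A minimal sketch of the distinction, assuming a made-up population of delay values and using the proportion of delays over 15 minutes as the quantity of interest. Both the data and the choice of quantity are illustrative assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(10)

# Hypothetical population: 20,000 made-up delay values in minutes.
population = rng.exponential(scale=20, size=20_000) - 10

# Parameter: computed from the whole population, so it is a fixed number.
parameter = np.mean(population > 15)

# Statistic: the same calculation on a random sample of 100 values.
# A second sample generally gives a different value.
first_statistic = np.mean(rng.choice(population, size=100) > 15)
second_statistic = np.mean(rng.choice(population, size=100) > 15)

print(parameter, first_statistic, second_statistic)
```

In practice the parameter is usually unknown, which is why a statistic is used to estimate it. Here the whole population is available only because it was generated for the example.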

### Demo: Simulating Statistics

Calculate the population median (a parameter) of the flight delays and compare it to the sample median (a statistic) from a random sample of 10 flights.

In [None]:
np.median(sfo_delays.column('DELAY'))

In [None]:
np.median(sfo_delays.sample(10).column('DELAY'))

Define a function that takes a sample size, randomly samples that many flights from the delay data, and returns the median delay of the sample.

In [None]:
def sample_median(size):
    ...

In [None]:
sample_median(10)

Simulate randomly sampling 10 flights 1000 times, storing the sample median delay from each iteration. Add the results to a table and visualize the resulting empirical distribution of the sample median.

In [None]:
sample_medians = ...

...

In [None]:
Table().with_column('Sample medians', ...).hist(bins = np.arange(-30,20), unit='Minute')

Repeat the simulation with a sample size of 1000 instead of 10.

In [None]:
...

In [None]:
Table().with_column('Sample medians', ...).hist(bins = np.arange(-30,20), unit='Minute')
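
The simulation pattern in this demo can be sketched with plain NumPy. This is one possible approach under stated assumptions, not the notebook's solution: a made-up array named `population` stands in for the `DELAY` column, and `simulate_medians` is a hypothetical helper.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(1000)

# Hypothetical population standing in for the 'DELAY' column (made-up values).
population = rng.exponential(scale=20, size=20_000) - 10

def sample_median(size):
    """Return the median of a random sample (with replacement) of the given size."""
    return np.median(rng.choice(population, size=size))

def simulate_medians(size, repetitions=1000):
    """Collect the sample median from each of `repetitions` independent samples."""
    medians = np.array([])
    for _ in np.arange(repetitions):
        medians = np.append(medians, sample_median(size))
    return medians

medians_10 = simulate_medians(10)
medians_1000 = simulate_medians(1000)

# Compare the spread of the simulated medians for the two sample sizes.
print(np.median(population))
print(np.std(medians_10), np.std(medians_1000))
```

In this sketch the medians from samples of 1000 cluster much more tightly around the population median than the medians from samples of 10. The two histograms in the demo make the same comparison visually.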

### Probability Distribution of a Statistic

* Values of a statistic vary because random samples vary
* Sampling distribution (or probability distribution) of the statistic:
    * All possible values of the statistic,
    * and all the corresponding probabilities
    * Can be hard to calculate
        * Either have to do the math
        * Or have to generate all possible samples and calculate the statistic based on each sample
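
For a very small case, generating all possible samples is feasible. The sketch below assumes the statistic is the median of three rolls of a fair six-sided die, an example chosen for illustration rather than taken from the lecture demo.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from statistics import median

# All 6**3 = 216 equally likely ordered samples of three rolls of a fair die.
all_samples = list(product(range(1, 7), repeat=3))

# Compute the statistic (the median) for every possible sample.
counts = Counter(median(sample) for sample in all_samples)

# Probability distribution of the statistic: each value and its probability.
median_distribution = {
    value: Fraction(count, len(all_samples))
    for value, count in sorted(counts.items())
}

for value, probability in median_distribution.items():
    print(value, probability)
```

The number of possible samples grows very quickly with the sample size and the population size, which is why this approach is rarely practical for real data.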


### Empirical Distribution of a Statistic

* Empirical distribution of the statistic:
    * Based on simulated values of the statistic
    * Consists of all the observed values of the statistic,
    * and the proportion of times each value appeared
* Good approximation to the probability distribution of the statistic if the number of repetitions in the simulation is large
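
The effect of the number of repetitions can be sketched with a statistic whose exact distribution is known. The example assumes the statistic is the larger of two fair die rolls, for which P(max = k) = (2k - 1)/36. The helper name `empirical_max_distribution` is hypothetical.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(228)

faces = np.arange(1, 7)
# Exact probabilities for the larger of two fair die rolls: (2k - 1) / 36.
exact = (2 * faces - 1) / 36

def empirical_max_distribution(repetitions):
    """Simulate the larger of two rolls `repetitions` times and return
    the proportion of times each face 1..6 was the result."""
    rolls = rng.integers(1, 7, size=(repetitions, 2))
    maxima = rolls.max(axis=1)
    # bincount counts occurrences of 0..6; drop index 0 to keep faces 1..6.
    counts = np.bincount(maxima, minlength=7)[1:]
    return counts / repetitions

# Largest difference between empirical proportions and exact probabilities.
for repetitions in (100, 100_000):
    empirical = empirical_max_distribution(repetitions)
    print(repetitions, np.max(np.abs(empirical - exact)))

gap_many = np.max(np.abs(empirical_max_distribution(100_000) - exact))
```

With many repetitions the empirical proportions are very likely to be close to the exact probabilities. With only 100 repetitions the differences are usually noticeably larger.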


---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>