<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lecture 16: Sampling

Associated Textbook Section: [10.0](https://ccsf-math-108.github.io/textbook/chapters/10/Sampling_and_Empirical_Distributions.html)

---

## Outline

* [Sampling](#Sampling)
* [Sampling with Technology](#Sampling-with-Technology)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Sampling

---

### Random Samples

<a href="https://en.wikipedia.org/wiki/Sampling_(statistics)"><img src="./simple_random_sampling.png" width=400px alt="A visual representation of selecting a simple random sample."/></a>

* Population: Set of all elements from which a sample will be drawn
* Deterministic sample: The sampling scheme doesn't involve chance
* Random (Probability) sample: 
    * Before the sample is drawn, you have to know the selection probability of every group of people in the population
    * Not all individuals/groups have to have an equal chance of being selected
    * If the chances are equal, then the sample is a simple random sample.
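
The three kinds of sample above can be sketched with NumPy. This is a minimal illustration, not part of the lecture demo: the six-person `population` and the `chances` array are hypothetical values made up for the example.

```python
import numpy as np

# Hypothetical population of six people, used only for illustration.
population = np.array(['Ana', 'Ben', 'Cam', 'Dee', 'Eli', 'Fay'])

# Deterministic sample: always the first three people; no chance is involved.
deterministic_sample = population[:3]

# Simple random sample: every group of three has the same chance of selection.
simple_random_sample = np.random.choice(population, size=3, replace=False)

# Random sample with unequal but known chances:
# each draw selects 'Ana' with chance 0.5 and each other person with chance 0.1.
chances = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
unequal_sample = np.random.choice(population, size=3, replace=True, p=chances)

print(deterministic_sample)
print(simple_random_sample)
print(unequal_sample)
```

Rerunning the cell changes the two random samples but never the deterministic one. Both random samples still count as probability samples because the chances are known before any draw is made.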

---

### Sample of Convenience

* Example: sample consists of whoever walks by
* Just because you think you're sampling "randomly" doesn't mean you have a random sample.
* If you can't determine the following ahead of time, then you don't have a random sample:
    * what the population is
    * what the chance of selection is for each group in the population

---

### With and Without Replacement

<a href="https://towardsdatascience.com/an-introduction-to-probability-sampling-methods-7a936e486b5/"><img src="./sampling.webp" width=700px alt="A visual representation of sampling with and without replacement."/></a>

* Sampling with Replacement:
    * Each selected element is returned to the population before the next draw, so one draw does not change the chances for the next draw
    * Associated with the concept of independent events.
* Sampling without Replacement:
    * Each selected element is removed from the population, so one draw may change the chances for later draws
    * Associated with the concept of dependent events.
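
As a worked example of the difference, consider the chance of drawing two aces in a row from a standard 52-card deck. The card scenario is an illustration chosen here, not part of the lecture data, and the variable names are hypothetical.

```python
from fractions import Fraction

# A standard 52-card deck contains 4 aces.
deck_size = 52
aces = 4

# With replacement: the first card goes back into the deck, so the second
# draw has the same chance as the first (independent events).
with_replacement = Fraction(aces, deck_size) * Fraction(aces, deck_size)

# Without replacement: after one ace is removed, 3 aces remain among 51 cards,
# so the chance on the second draw depends on the first (dependent events).
without_replacement = Fraction(aces, deck_size) * Fraction(aces - 1, deck_size - 1)

print(with_replacement)     # 1/169
print(without_replacement)  # 1/221
```

The second chance is smaller because removing an ace leaves proportionally fewer aces for the second draw.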

---

## Sampling with Technology

---

### Sampling from Arrays and Tables

* Sampling from a table
    * You can use `take` to systematically sample data from a table
    * `tbl.sample(k=n, with_replacement=True)`:
        * Randomly samples with replacement `n` rows from `tbl` and creates a new table
        * `k` is `tbl.num_rows` by default
        * `with_replacement` is `True` by default
* Sampling from an array
    * `np.random.choice(a=an_array, size=n, replace=True)`:
        * Randomly samples with replacement `n` elements from the elements in `an_array`
        * `size` is `None` by default, which returns a single element
        * `replace` is `True` by default
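
The array options above can be tried on a small example. This sketch uses only NumPy and a hypothetical ten-element `an_array` in place of a real data column; the starting index and step for the systematic selection are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical array of ten values standing in for a data column.
an_array = np.arange(10, 110, 10)   # 10, 20, ..., 100

# Systematic (deterministic) selection: every third element, starting at index 1.
# An index array like this could also be passed to a table's `take` method.
indices = np.arange(1, len(an_array), 3)
systematic_sample = an_array[indices]

# Random sample of 4 elements with replacement (the default): repeats are possible.
with_repl = np.random.choice(an_array, size=4)

# Random sample of 4 elements without replacement: no element appears twice.
without_repl = np.random.choice(an_array, size=4, replace=False)

# Leaving out `size` returns a single element rather than an array.
one_value = np.random.choice(an_array)

print(systematic_sample)
print(with_repl)
print(without_repl)
print(one_value)
```

The systematic selection is the same on every run, while the three calls to `np.random.choice` produce different results each time unless a random seed is set.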

---

### Demo: Sampling from Arrays and Tables

<a href="https://www.bts.gov/"><img src="./Tarmac.png" alt="An airplane landing on a tarmac."/></a>

Load the June 2025 flight delay data in `delays.csv`, sourced from the [Bureau of Transportation Statistics' Reporting Carrier On-Time Performance Data](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr). The variable `ARR_DELAY` contains the difference in minutes between the scheduled and actual arrival times at the destination airport `DEST`. Early arrivals appear as negative numbers, and the airline code is stored in the variable `OP_CARRIER`.

In [None]:
delays = Table.read_table('delays.csv')
delays

---

Demonstrate how to use the `take` method to sample the data in a few ways.

In [None]:
...

In [None]:
...

In [None]:
start = ...
systematic_sample = ...
systematic_sample.show()

---

Demonstrate how to get a simple random sample of 12 flight delays using `np.random.choice` and `sample`.

In [None]:
delays_arr = delays.column('ARR_DELAY')
...

In [None]:
random_sample = ...
random_sample.show()

--- 

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>