In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')
plt.style.use('fivethirtyeight')

# Lecture 14 – Distributions and Sampling

## DSC 10, Fall 2021

### Announcements

- Midterm **this Wednesday 10/27 during lecture**.
    - Remote, open-notes, must work alone.
    - See logistics [here](https://campuswire.com/c/G9636FFCF/feed/259) and practice exams [here](https://dsc10.com/resources/#practice-exams).
    - Have handy: [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view), calculator/computer for calculations.
- Midterm Review Session **tomorrow 5-6PM on Zoom**.
    - See Canvas Calendar for the link.
    - Will be recorded.
- Project 1 due **Tuesday 11/2 at 11:59pm**.
- Janine and Suraj will both be at the **DSC Student-Faculty Mixer at 1pm tomorrow**. Come say hi 👋!
    - Zoom link: https://ucsd.zoom.us/j/98335299546.

### Agenda

- Probability distributions and empirical distributions.
- Sampling: populations and samples.
- Sample means.
- **Remember:** today's lecture is **not** in scope for the midterm.

### How confident are you regarding the upcoming midterm?

A. Not confident at all

B. Not really confident

C. Meh

D. Confident

E. Very confident

### To answer, go to [menti.com](https://menti.com) and enter the code 8197 3501.

## Distributions

### Distributions
- A **distribution** describes the number of occurrences of each value of a variable.
- We visualize distributions of numerical variables with histograms.

### Probability distributions
- Consider a random quantity with various possible values, each of which has some associated probability.
- A **probability distribution** is a description of:
    - All the possible values of the quantity.
    - The theoretical probability of each value.
* Example, for rolling a single die:

| Value | Probability |
| ----- | ----------- |
| 1     | 1/6         |
| 2     | 1/6         |
| 3     | 1/6         |
| 4     | 1/6         |
| 5     | 1/6         |
| 6     | 1/6         |


### Example: probability distribution of a die roll

- The distribution is **uniform**, meaning that each outcome has the same probability of occurring.

In [None]:
die_faces = np.arange(1, 7, 1)
die = bpd.DataFrame().assign(face=die_faces)
die

In [None]:
bins = np.arange(0.5, 6.6, 1)

# Note that you can add titles to your visualizations, like this!
die.plot(kind='hist', y='face', bins=bins, density=True, ec='w', 
         title='Probability Distribution of a Die Roll');

### Empirical distributions

- Unlike probability distributions, which are theoretical, empirical distributions are **based on observations**.
- Commonly, these observations are of repetitions of an experiment.
- An **empirical distribution** describes:
    - All observed values.
    - The proportion of observations in which each value occurred.
- Unlike the probability distribution, it represents what actually happened in practice. 

### Example: empirical distribution of a die roll
- Let's simulate a roll by using `np.random.choice`.
- Rolling a die = sampling with replacement.
    - If you roll a 4, you can roll a 4 again.

In [None]:
num_rolls = 25
many_rolls = np.random.choice(die_faces, num_rolls)
many_rolls

In [None]:
bpd.DataFrame().assign(face=many_rolls).plot(kind='hist', y='face', bins=bins, density=True, ec='w');
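The proportions behind an empirical distribution can also be computed directly, without a histogram. The following is a minimal sketch that uses only NumPy; the seed and the names `rolls`, `counts`, and `proportions` are illustrative choices and not part of the lecture code.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed, only so that reruns give the same rolls
die_faces = np.arange(1, 7)
rolls = np.random.choice(die_faces, 25)  # 25 rolls, sampled with replacement

# np.bincount counts how often each integer 0, 1, ..., 6 appears;
# we drop the count for 0 since a die has no 0 face.
counts = np.bincount(rolls, minlength=7)[1:]
proportions = counts / len(rolls)

# Each line shows a face and the proportion of rolls that landed on it.
for face, proportion in zip(die_faces, proportions):
    print(face, proportion)
```

With only 25 rolls, these proportions will usually differ noticeably from the theoretical value of 1/6 for each face.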

**Question:** As we increase the number of die rolls, what happens to the shape of the empirical distribution? 🤔

### Law of averages

The **law of averages** states that if a chance experiment is repeated 
- many times,
- independently, and
- under the same conditions,
    
then the **proportion** of times that an event occurs gets closer to the **theoretical probability** of that event.

Example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to 1/6.
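To see the law of averages numerically, we can track the proportion of 5s for larger and larger numbers of rolls. This is a rough sketch using plain NumPy; the helper `proportion_of_fives` and the seed are made up for illustration.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed, only so that reruns give the same numbers
die_faces = np.arange(1, 7)

def proportion_of_fives(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of 5s."""
    rolls = np.random.choice(die_faces, num_rolls)
    return np.count_nonzero(rolls == 5) / num_rolls

# The printed proportions should settle near 1/6 as num_rolls grows.
for num_rolls in [10, 100, 1000, 10000, 100000]:
    print(num_rolls, proportion_of_fives(num_rolls))
```

The proportion does not move toward 1/6 on every single step, but large deviations become less and less likely as the number of rolls increases.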

In [None]:
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    die.sample(n=num_rolls, replace=True) \
        .plot(kind='hist', y='face', bins=bins, density=True, ec='w', title=f'Distribution of {num_rolls} Die Rolls');

## Sampling

### Populations and samples

- A **population** is a group of people, objects, or events that we want to learn something about.
- It's often infeasible to collect information about every member of a population.
- Instead, we can collect a **sample**, which is a subset of the population.
- **Goal**: estimate the distribution of some numerical variable in the population, using only a sample.
    - For example, say I want to know the number of credits each UCSD student is taking this quarter.
    - It's too hard to get this information for every UCSD student, so we don't know the **population distribution**.
    - Instead we collect data from only certain UCSD students to generate a **sample distribution**.
    

**Question:** How do we collect a good sample, so that the sample distribution closely approximates the population distribution?

### Convenience samples

- A **convenience sample** is when you sample whoever you can get ahold of.
    - e.g. a voluntary internet survey.
    - e.g. people in line at Panda Express at Price Center.
    - e.g. the first 50 people to submit the homework.
- **Pro:** They're easy and inexpensive (they're the most common type of sample!).
- **Cons:** The results won't generalize to the population as a whole and are likely **biased**.

### Example of bias in a convenience sample

- **Study:** Determine the average age of gamblers at a casino.
- **Methodology:** Ask casino-goers for their age for three hours on a weekday afternoon.
- **Bias:** Might overrepresent elderly people who have retired and underrepresent people of working age.

### Main takeaway: convenience samples are not good 🙅‍♀️.

### Probability sample (aka random sample)

- In order for a sample to be a probability sample, you **must be able to calculate the probability of selecting any subset of the population**.
- Not all individuals need to have an equal chance of being selected.
- What's the point?
    - There's a better chance of collecting a **representative** sample this way.
- Not all probability samples are good, though!

### Example: movies

In [None]:
top = bpd.read_csv('data/top_movies.csv')
top

### A probability sample
- **Scheme:** Start at a random row number between 0 and 9, then take every tenth row thereafter.
    - **This is a probability sample!**
- Any given row is equally likely to be picked, with probability 1/10.
- It is **not** true that every subset of rows has the same probability of being selected.
    - There are only 10 possible samples: rows (0, 10, 20, 30, ..., 190), rows (1, 11, 21, ..., 191), and so on.
- It still counts as a probability sample, because we can calculate the probability of selecting any subset of rows.

In [None]:
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
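The claims about this scheme can be checked by listing every sample it can produce. The sketch below assumes a table of 200 rows (matching the row numbers 0 through 199 used above) and works with row positions only, so it does not need the movies data; `possible_samples` and `times_included` are illustrative names.

```python
import numpy as np

num_rows = 200  # assumed table length, matching rows 0..199 in the example

# One possible sample of row positions for each of the 10 starting points.
possible_samples = [np.arange(start, num_rows, 10) for start in np.arange(10)]

# Count how many of the 10 equally likely samples contain each row.
times_included = np.zeros(num_rows)
for sample in possible_samples:
    times_included[sample] = times_included[sample] + 1

print(len(possible_samples))            # how many distinct samples exist
print(np.unique(times_included / 10))   # inclusion probability of every row

# Rows 0 and 1 never appear together, so not every pair is equally likely.
print(any(0 in s and 1 in s for s in possible_samples))
```

Every row lands in exactly one of the 10 samples, which is why each row has the same chance of selection even though most subsets of rows can never be chosen.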

### Simple random sample

- A **simple random sample (SRS)** is a sample drawn **uniformly** at random **without** replacement.
- In an SRS...
    - Every individual has the same chance of being selected.
    - Every pair has the same chance of being selected.
    - Every triplet has the same chance of being selected.
    - And so on...
- To draw an SRS of size `n` from an array `options`, we use `np.random.choice(options, n, replace=False)`.
    - If we use `replace=True`, then we're **sampling uniformly at random with replacement** – there's no simpler term for this.
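As a small illustration, here is a sketch that draws both kinds of sample from a made-up array. The array `options`, the sample size of 5, and the seed are hypothetical; the sample size is passed as the second argument to `np.random.choice`.

```python
import numpy as np

np.random.seed(23)  # arbitrary seed, only so that reruns give the same samples
options = np.arange(20)  # hypothetical population of 20 individuals

# Simple random sample: uniformly at random, without replacement.
srs = np.random.choice(options, 5, replace=False)

# Uniformly at random with replacement: repeats are possible.
with_repl = np.random.choice(options, 5, replace=True)

print(srs)
print(with_repl)
print(len(np.unique(srs)) == len(srs))  # an SRS never contains repeats
```

Note that `np.random.choice` samples with replacement by default, so `replace=False` has to be written explicitly to get an SRS.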

### Sampling rows from a DataFrame

If we instead want to sample rows from a DataFrame, we can use the `.sample` method on a DataFrame.

```py
df.sample(n)
```

returns a random subset of `n` rows of `df`, drawn **without replacement** (i.e. the default is `replace=False`, unlike `np.random.choice`).

In [None]:
# Without replacement
top.sample(5)

In [None]:
# With replacement
top.sample(5, replace=True)

### Sample size

- We saw earlier that when we repeat a random experiment more and more times, the empirical distribution looks more and more like the probability distribution.
- Similarly, if we take a larger simple random sample, then the sample distribution is more likely to closely approximate the true population distribution.

### Example: distribution of flight delays

`united_full` contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

In [None]:
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full

### Only need delays...

In [None]:
united = united_full.get(['Delay'])
united

### Population distribution of flight delays

In [None]:
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w', title='Population Distribution of Flight Delays', figsize=(10, 5));

### Sample distribution of flight delays

- The 13825 flight delays in `united` constitute our population.
- Normally, we won't have access to the entire population.
- To replicate a real-world scenario, we will sample from `united` **without replacement**.

In [None]:
# Sample distribution
N = 100
united.sample(N).plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
                      title='Sample Distribution of Flight Delays',
                      figsize=(10, 5));

Note that as we increase `N`, the sample distribution of delays looks closer and closer to the true population distribution of delays.

## Sample means

### Average flight delay

- What is the average delay of United flights out of SFO?
- We'd love to know the average delay of the **population**, but we only have a **sample**.
- How does the mean of the **sample** compare to the mean of the **population**?

### Mean of the population

In [None]:
# Calculate the mean
united_mean = united.get('Delay').mean()
united_mean

### Mean of the sample

- This is called the **sample mean**.
- Because the sample is random, the **sample mean** is too!

In [None]:
# Each time you run this, you will get a different result.
# This cell is sampling 100 flight delays from the population and averaging those results.
united.sample(100).get('Delay').mean()

### Mean of a large SRS
- As the sample gets bigger, the sample mean gets closer to the population mean.
- To illustrate this, we'll collect samples of various sizes and look at their corresponding means.

In [None]:
# Compute the sample mean for samples of many different sizes
sizes = np.arange(100, 10000, 200)
means = np.array([])
for n in sizes:
    m = united.sample(int(n)).get('Delay').mean()
    means = np.append(means, m)
    
many_samples = bpd.DataFrame().assign(sample_size=sizes, sample_mean=means)
many_samples

In [None]:
many_samples.plot(kind='line', x='sample_size', y='sample_mean', 
                  title='Relationship Between Sample Size and Sample Mean');

Note that while the sample mean can be quite inaccurate for small sample sizes, as we increase our sample size it tends to get closer and closer to the true population mean.

### What if we repeat this process many, many times?

Don't worry about how this graph is created, but at a high level:
- Each line corresponds to one repetition of the process from the previous slide (collecting samples of various sizes and determining the sample mean).
- As samples get larger and larger, the sample mean gets closer to the population mean.

In [None]:
sizes = np.arange(100, 10000, 200)

def create_many_samples():
    means = np.array([])
    for n in sizes:
        m = united.sample(int(n)).get('Delay').mean()
        means = np.append(means, m)
    return means

repeated_samples = [create_many_samples() for _ in np.arange(100)]
plt.figure(figsize=(10, 5), dpi=200)
for rep in repeated_samples:
    plt.plot(sizes, rep, alpha=0.25);

plt.title('Relationship Between Sample Size and Sample Mean');

**What do you notice?** The vertical spread of the graph decreases as sample size increases.

### How good is the _sample mean_?

- In other words, is it close to the population mean?
- If the sample is small, there's a high chance that the sample mean is far from the population mean.
- If the sample is big, there's a small chance that the sample mean is far from the population mean.
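One way to put numbers on this is to estimate how often the sample mean misses the population mean by more than some tolerance. The sketch below uses a synthetic, right-skewed population as a stand-in; it is not the flight data, and the tolerance of 2, the sample sizes, and the helper `chance_far_off` are all assumptions made for illustration.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only so that reruns give the same numbers

# Synthetic, right-skewed population of 10,000 values; NOT the real flight data.
population = np.random.exponential(scale=15, size=10000)
population_mean = population.mean()

def chance_far_off(sample_size, tolerance=2, repetitions=1000):
    """Estimate the chance that the mean of an SRS differs from the
    population mean by more than `tolerance`."""
    misses = 0
    for _ in np.arange(repetitions):
        sample = np.random.choice(population, sample_size, replace=False)
        if abs(sample.mean() - population_mean) > tolerance:
            misses = misses + 1
    return misses / repetitions

small = chance_far_off(25)    # small sample
large = chance_far_off(1000)  # large sample
print(small, large)
```

For this synthetic population, the small samples miss by more than the tolerance far more often than the large ones do.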

### Small random sample

<center><img src="data/bullseye-high.png"></center>

### Large random sample

<center><img src="data/bullseye-low.png"></center>

### Distribution of sample means

- Let's...
    - Repeatedly draw a bunch of samples.
    - Record the mean of each.
    - Draw a histogram of the resulting distribution.
- What's the point?
    - This helps us answer the question "what would the sample mean have looked like if we drew a different sample?"
- Try different sample sizes and look at the resulting histogram!

In [None]:
# Sample one thousand flights, two thousand times
sample_size = 1000
n_experiments = 2000
means = np.array([])

for i in np.arange(n_experiments):
    # Draw one sample and record its mean
    m = united.sample(sample_size).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(10, 25, 0.5), density=True, ec='w',
                                         title=f'Distribution of Sample Mean with Sample Size {sample_size}')
plt.axvline(x=united_mean, c='r');

### Discussion Question

We just sampled **one thousand** flights, two thousand times. If we now sample **one hundred** flights, two thousand times, how will the histogram change?

- A.  narrower  
- B.  wider  
- C.  shifted left  
- D.  shifted right  
- E.  unchanged

### To answer, go to [menti.com](https://menti.com) and enter the code 8197 3501.

### How we sample matters

* So far, we've taken large uniform random samples, **without replacement**, from the full population.
    * If the population is large enough, then it doesn't really matter if we sample with or without replacement.
* The sample mean, for samples like this, is a good approximation of the population mean.
* But this is not always the case if we sample differently.
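The point about replacement can be checked with a quick simulation. This is a sketch on a synthetic population of 10,000 values, not the flight data; the sample size of 100, the number of repetitions, and the helper `sample_means` are illustrative assumptions.

```python
import numpy as np

np.random.seed(7)  # arbitrary seed, only so that reruns give the same numbers

# Synthetic population; large relative to the sample size of 100 used below.
population = np.random.exponential(scale=15, size=10000)

def sample_means(replace, sample_size=100, repetitions=2000):
    """Return the means of many uniform random samples from the population."""
    means = np.array([])
    for _ in np.arange(repetitions):
        sample = np.random.choice(population, sample_size, replace=replace)
        means = np.append(means, sample.mean())
    return means

without = sample_means(replace=False)
with_repl = sample_means(replace=True)

# Compare the center and the spread of the two collections of sample means.
print(without.mean(), with_repl.mean())
print(np.std(without), np.std(with_repl))
```

When the sample is a small fraction of the population, drawing the same individual twice is rare, so the two sampling methods behave almost identically.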

### Example: uniform random sample of flights to Denver
* If we only sample flights to Denver, the sample mean is a highly biased estimate of the mean delay of all flights!

In [None]:
n_experiments = 2000
means = np.array([])

den = united_full[united_full.get('Destination') == 'DEN'].get(['Delay'])
for i in np.arange(n_experiments):
    # Draw 100 Denver flights and record their mean delay
    m = den.sample(100).get('Delay').mean()
    means = np.append(means, m)

bpd.DataFrame().assign(means=means).plot(kind='hist', bins=np.arange(0, 40, .5), density=True)
plt.axvline(x=united_mean, c='r');
# plt.axvline(x=den.get('Delay').mean(), c='y');

## Summary

### Summary, next time

- The **probability distribution** of a random quantity describes the values it takes on along with the probability of each value occurring.
- An **empirical distribution** describes the values and frequencies of the results of a random experiment.
- With more trials of an experiment, the empirical distribution gets closer to the probability distribution.
- A **population distribution** describes the values and frequencies of some characteristic of a population.
- A **sample distribution** describes the values and frequencies of some characteristic of a sample, which is a subset of a population.
- When we sample uniformly at random, as we increase our sample size, the sample distribution (in particular, the sample mean) gets closer and closer to the population distribution (in particular, the population mean).
- **Next time (Friday):** more on populations and samples.
    - None of today's lecture is on the midterm!