In [None]:
# Imports
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Lecture 14 – Distributions and Sampling

## DSC 10, Winter 2022

<center><img src='data/wordle.png' width=300></center>

### Announcements

- Homework 4 is due **tomorrow 2/5 at 11:59pm**.
- The Midterm Exam is on **Wednesday 2/9 during lecture**.
    - Remote, open-notes, must work alone.
    - See logistics [here](https://campuswire.com/c/G6950E967/feed/529) and past exams [here](https://dsc10.com/resources/).
    - Have handy the [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view) and a calculator/computer for calculations.
- Discussion section on Monday will be review for the exam; make sure to attend.
- The Midterm Project is due **Saturday 2/12 at 11:59pm**.
- Don't forget about in-person office hours! See the [Calendar](https://dsc10.com/calendar) for details.
- **Note:** you can watch recordings on [podcast.ucsd.edu](https://podcast.ucsd.edu) too.

### Agenda

- Probability distributions and empirical distributions.
- Populations and samples.
- Parameters and statistics.
- **Remember:** today's lecture is **not** in scope for the midterm (but is extremely important)!

## Distributions

### Distributions
- A **distribution** describes the frequency of each value of a variable.
- We visualize distributions of numerical variables with histograms.

### Probability distributions
- Consider a random quantity with various possible values, each of which has some associated probability.
- A **probability distribution** is a description of:
    - All possible values of the quantity.
    - The theoretical probability of each value.
- Example, for rolling a single die:

| Value     |Probability |
| ----------- | ----------- |
| 1      | $\frac{1}{6}$ |
| 2   | $\frac{1}{6}$        |
| 3      | $\frac{1}{6}$       |
| 4   | $\frac{1}{6}$       |
| 5      | $\frac{1}{6}$       |
| 6   | $\frac{1}{6}$        |


### Example: probability distribution of a die roll

The distribution is **uniform**, meaning that each outcome has the same probability of occurring.
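
As a quick sanity check, the table above can also be written as two arrays. This is a minimal sketch in plain NumPy; the names `faces` and `probabilities` are illustrative and are not used elsewhere in this lecture.

```python
import numpy as np

# Possible values of a single die roll.
faces = np.arange(1, 7)

# Uniform distribution: every face has probability 1/6.
probabilities = np.ones(6) / 6

# The probabilities in a probability distribution must add up to 1.
total = probabilities.sum()
print(total)
```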

In [None]:
die_faces = np.arange(1, 7, 1)
die = bpd.DataFrame().assign(face=die_faces)
die

In [None]:
bins = np.arange(0.5, 6.6, 1)

# Note that you can add titles to your visualizations, like this!
# figsize=(10, 5) sets the size of the plot
die.plot(kind='hist', y='face', bins=bins, density=True, ec='w', 
         title='Probability Distribution of a Die Roll',
         figsize=(10, 5));

### Empirical distributions

- Unlike probability distributions, which are theoretical, **empirical distributions are based on observations**.
- Commonly, these observations are of repetitions of an experiment.
- An **empirical distribution** describes:
    - All observed values.
    - The proportion of observations in which each value occurred.
- Unlike probability distributions, empirical distributions represent what actually happened in practice. 

### Example: Empirical distribution of a die roll
- Let's simulate a roll by using `np.random.choice`.
- Rolling a die = sampling with replacement.
    - If you roll a 4, you can roll a 4 again.

In [None]:
num_rolls = 25
many_rolls = np.random.choice(die_faces, num_rolls)
many_rolls

In [None]:
bpd.DataFrame().assign(face=many_rolls) \
               .plot(kind='hist', y='face', bins=bins, density=True, ec='w',
                     title=f'Empirical Distribution of {num_rolls} Dice Rolls',
                     figsize=(10, 5));
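
The bar heights in the histogram above are just proportions, and they can also be tabulated directly. The sketch below is one way to do that with `np.bincount`; it re-creates its own rolls, so the numbers will not match the plot exactly, and the names `counts` and `proportions` are illustrative.

```python
import numpy as np

die_faces = np.arange(1, 7)
num_rolls = 25
rolls = np.random.choice(die_faces, num_rolls)

# Count how many times each face appeared.
# np.bincount counts occurrences of 0, 1, ..., 6, so we drop the count for 0.
counts = np.bincount(rolls, minlength=7)[1:]

# Convert counts to proportions. Because each histogram bin has width 1,
# these proportions are the bar heights of the density histogram.
proportions = counts / num_rolls
print(proportions)
```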

**Question:** As we increase the number of die rolls, what happens to the shape of the empirical distribution? 🤔

### Law of averages

The **law of averages** states that if a chance experiment is repeated 
- many times,
- independently, and
- under the same conditions,
    
then the **proportion** of times that an event occurs gets closer and closer to the **theoretical probability** of that event.

**Example:** As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to $\frac{1}{6}$.
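
The example above can be checked with a short simulation. This is a sketch, not a proof: each run is random, so the printed proportions will vary, but for large numbers of rolls they should land near $\frac{1}{6} \approx 0.167$.

```python
import numpy as np

die_faces = np.arange(1, 7)

# For each number of rolls, simulate that many fresh rolls and
# compute the proportion of rolls that came up 5.
for num_rolls in [10, 100, 1000, 10000, 100000]:
    rolls = np.random.choice(die_faces, num_rolls)
    prop_fives = np.count_nonzero(rolls == 5) / num_rolls
    print(num_rolls, prop_fives)
```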

### Example: many die rolls

In [None]:
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    # Don't worry about how .sample works just yet – we'll cover it shortly
    die.sample(n=num_rolls, replace=True) \
       .plot(kind='hist', y='face', bins=bins, density=True, ec='w', 
             title=f'Distribution of {num_rolls} Die Rolls',
             figsize=(8, 3));

## Sampling

### Populations and samples

- A **population** is a group of people, objects, or events that we want to learn something about.
- It's often infeasible to collect information about every member of a population.
- Instead, we can collect a **sample**, which is a subset of the population.
- **Goal**: estimate the distribution of some numerical variable in the population, using only a sample.
    - For example, say I want to know the number of credits each UCSD student is taking this quarter.
    - It's too hard to get this information for every UCSD student, so we don't know the **population distribution**.
    - Instead we collect data from only certain UCSD students to generate a **sample distribution**.

**Question:** How do we collect a good sample, so that the sample distribution closely approximates the population distribution?

- **Bad idea ❌:** Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).
    - Such a sample is known as a convenience sample.
    - Convenience samples often contain hidden sources of **bias**.

### Probability sample (aka random sample)

- In order for a sample to be a probability sample, you **must be able to calculate the probability of selecting any subset of the population**.
- Not all individuals need to have an equal chance of being selected.
- What's the point?
    - There's a better chance of collecting a **representative** sample this way.
- Not all probability samples are good, though!

### Example: movies

In [None]:
top = bpd.read_csv('data/top_movies.csv')
top

### A probability sample
- **Scheme:** Start with a random number between 0 and 9, then take every tenth row thereafter.
    - **This is a probability sample!**
- Any given row is equally likely to be picked, with probability 1/10.
- It is **not** true that every subset of rows has the same probability of being selected.
    - There are only 10 possible samples: rows (0, 10, 20, 30, ..., 190), rows (1, 11, 21, ..., 191), and so on.

In [None]:
start = np.random.choice(np.arange(10))
top.take(np.arange(start, 200, 10))
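
To see why there are only 10 possible samples, we can enumerate them using row positions alone. This sketch assumes the table has 200 rows, as the call to `np.arange(start, 200, 10)` above suggests; the name `possible_samples` is illustrative.

```python
import numpy as np

num_rows = 200  # assumed number of rows, matching np.arange(start, 200, 10)

# One possible sample of row positions for each starting position 0, 1, ..., 9.
possible_samples = [np.arange(start, num_rows, 10) for start in np.arange(10)]

# Every row appears in exactly one of the 10 samples, so each row is
# selected with probability 1/10.
all_rows = np.sort(np.concatenate(possible_samples))
print(len(possible_samples), np.array_equal(all_rows, np.arange(num_rows)))

# Rows 0 and 1 never appear in the same sample, so not every subset is possible.
rows_0_and_1_together = any(0 in s and 1 in s for s in possible_samples)
print(rows_0_and_1_together)
```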

### Simple random sample

- A **simple random sample (SRS)** is a sample drawn **uniformly** at random **without** replacement.
- In an SRS...
    - Every individual has the same chance of being selected.
    - Every pair has the same chance of being selected.
    - Every triplet has the same chance of being selected.
    - And so on...
- To draw an SRS of size `n` from a list/array `options`, we use `np.random.choice(options, n, replace=False)`.
    - If we use `replace=True`, then we're **sampling uniformly at random with replacement** – there's no simpler term for this.
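
A small sketch of the difference, using a hypothetical population of 20 labeled individuals (the array `options` below is made up for illustration):

```python
import numpy as np

options = np.arange(1, 21)  # hypothetical population: individuals labeled 1 to 20

# Simple random sample of size 5: no individual can be chosen twice.
srs = np.random.choice(options, 5, replace=False)

# Sampling with replacement (the default for np.random.choice): repeats are possible.
with_repl = np.random.choice(options, 5, replace=True)

print(srs, with_repl)
print(len(np.unique(srs)) == 5)  # always True for an SRS
```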

### Sampling rows from a DataFrame

If we want to sample rows from a DataFrame, we can use the `.sample` method on a DataFrame.

```py
df.sample(n)
```

returns a random subset of `n` rows of `df`, drawn **without replacement** (i.e. the default is `replace=False`, unlike `np.random.choice`).

In [None]:
# Without replacement
top.sample(5)

In [None]:
# With replacement
top.sample(5, replace=True)

### The effect of sample size

- The law of averages states that when we repeat a chance experiment more and more times, the empirical distribution will look more and more like the true probability distribution.
- **Similarly, if we take a large simple random sample, then the sample distribution is likely to be a good approximation of the true population distribution.**

### Example: distribution of flight delays

`united_full` contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

In [None]:
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full

### Only need delays...

In [None]:
united = united_full.get(['Delay'])
united

### Population distribution of flight delays

In [None]:
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w', 
            title='Population Distribution of Flight Delays', figsize=(10, 5));

Note that this distribution is **fixed**.

### Sample distribution of flight delays

- The 13825 flight delays in `united` constitute our population.
- Normally, we won't have access to the entire population.
- To replicate a real-world scenario, we will sample from `united` **without replacement**.

In [None]:
# Sample distribution
N = 100
united.sample(N).plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
                      title='Sample Distribution of Flight Delays',
                      figsize=(10, 5));

Note that as we increase `N`, the sample distribution of delays looks more and more like the true population distribution of delays.
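
One way to put a number on "looks more and more like" is to compare bin proportions. The sketch below uses a synthetic stand-in population (exponentially distributed values, which is an assumption and not the real United data) and measures the total absolute difference between the sample's and the population's bin proportions. The helper `bin_proportions` is hypothetical.

```python
import numpy as np

# Synthetic stand-in for a population of delays (NOT the real United data).
population = np.random.exponential(scale=20, size=13825)
bins = np.arange(0, 310, 10)

def bin_proportions(values):
    """Proportion of values in each bin (values outside the bins are ignored)."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

pop_props = bin_proportions(population)

# Total absolute difference between sample and population bin proportions.
distances = {}
for N in [100, 1000, 10000]:
    sample = np.random.choice(population, N, replace=False)
    distances[N] = np.abs(bin_proportions(sample) - pop_props).sum()
    print(N, distances[N])
```

The printed distance is random, but it should typically shrink as `N` grows.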

## Parameters and statistics

### Terminology

- **Statistical inference** is the practice of making conclusions about a population, using data from a random sample.
- **Parameter**: A number associated with the population.
    - Example: the population mean.
- **Statistic**: A number calculated from the sample.
    - Example: the sample mean.
- A statistic can be used as an **estimate** for a parameter.

_To remember: **p**arameter and **p**opulation both start with p, **s**tatistic and **s**ample both start with s._

### Mean flight delay

**Question:** What is the average delay of United flights out of SFO? 🤔

- We'd love to know the **mean delay in the population (parameter)**, but in practice we'll only have a **sample**.

- How does the **mean delay in the sample (statistic)** compare to the **mean delay in the population (parameter)**?

### Population mean

The **population mean** is a **parameter**.

In [None]:
# Calculate the mean of the population
united_mean = united.get('Delay').mean()
united_mean

This number (like the population distribution) is fixed, and is not random.

### Sample mean

The **sample mean** is a **statistic**. Since it depends on our sample, which was drawn at random, the sample mean is **also random**.

In [None]:
# Size 100
united.sample(100).get('Delay').mean()

- Each time we run the cell above, we are:
    - Collecting a new sample of size 100 from the population, and
    - Computing the sample mean.
- We see a slightly different value on each run of the cell.
    - Sometimes, the sample mean is close to the population mean.
    - Sometimes, it's far away from the population mean.

### The effect of sample size

What if we choose a larger sample size?

In [None]:
# Size 1000
united.sample(1000).get('Delay').mean()

- Each time we run this cell, the result is still slightly different.
- However, the results seem to be much closer together – and much closer to the true population mean – than when we used a sample size of 100.
- **In general**, statistics computed on larger samples tend to be closer to the parameter they estimate than statistics computed on smaller samples.
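
To check this without re-running a cell by hand, we can repeat the sampling many times and compare how spread out the sample means are. This is a sketch on a synthetic stand-in population (an assumption, not the real flight data); the helper `sample_means` and the choice of 500 repetitions are illustrative.

```python
import numpy as np

# Synthetic stand-in population (an assumption, not the real flight data).
population = np.random.exponential(scale=20, size=13825)

def sample_means(sample_size, repetitions=500):
    """Means of `repetitions` random samples, each drawn without replacement."""
    means = np.array([])
    for i in np.arange(repetitions):
        sample = np.random.choice(population, sample_size, replace=False)
        means = np.append(means, sample.mean())
    return means

# Standard deviation of the sample means: smaller means "closer together".
spread_100 = np.std(sample_means(100))
spread_1000 = np.std(sample_means(1000))
print(spread_100, spread_1000)
```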

<center><img src='data/bullseye-high.png' width=300><img src='data/bullseye-low.png' width=300></center>

### Probability distribution of a statistic

- The value of a statistic, e.g. the sample mean, is random, because it depends on a random sample.
- Like other random quantities, we can study the "probability distribution" of the statistic (also known as its "sampling distribution").
    - This describes all possible values of the statistic and all the corresponding probabilities.
- Unfortunately, this can be hard to calculate exactly.
    - Option 1: do the math by hand.
    - Option 2: generate **all** possible samples and calculate the statistic on each sample.
    - Both approaches are hard.
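
Option 2 is feasible only for tiny populations. The sketch below enumerates every sample of size 2 from a hypothetical five-value population and computes the exact distribution of the sample mean, then counts how many samples of size 100 could be drawn from 13825 flights.

```python
import itertools
import math
import numpy as np

# A tiny hypothetical population, small enough to enumerate every sample.
population = np.array([1, 2, 3, 4, 5])
sample_size = 2

# All possible samples of size 2 drawn without replacement (order ignored).
all_samples = list(itertools.combinations(population, sample_size))

# The sample mean of each possible sample.
means = np.array([np.mean(s) for s in all_samples])

# Each sample is equally likely under an SRS, so the probability of a value
# is the fraction of samples that produce it.
values, counts = np.unique(means, return_counts=True)
probabilities = counts / len(all_samples)
print(values)         # [1.5 2.  2.5 3.  3.5 4.  4.5]
print(probabilities)  # [0.1 0.1 0.2 0.2 0.2 0.1 0.1]

# For a realistic population this is hopeless: print the number of digits in
# the count of possible samples of size 100 from 13825 flights.
print(len(str(math.comb(13825, 100))))
```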

### Empirical distribution of a statistic
- The empirical distribution of a statistic is based on simulated values of the statistic. It describes
    - all the observed values of the statistic, and
    - the proportion of times each value appeared.
- The empirical distribution of a statistic can be a good approximation to the probability distribution of the statistic, **if the number of repetitions in the simulation is large**.

### Distribution of sample means

- Let's...
    - Repeatedly draw a bunch of samples.
    - Record the mean of each.
    - Draw a histogram of the resulting distribution.
- Try different sample sizes and look at the resulting histogram!

In [None]:
# Sample one thousand flights, two thousand times
sample_size = 1000
repetitions = 2000
sample_means = np.array([])

for n in np.arange(repetitions):
    m = united.sample(sample_size).get('Delay').mean()
    sample_means = np.append(sample_means, m)

bpd.DataFrame().assign(sample_means=sample_means) \
               .plot(kind='hist', y='sample_means', bins=np.arange(10, 25, 0.5), density=True, ec='w',
                     title=f'Distribution of Sample Mean with Sample Size {sample_size}',
                     figsize=(10, 5));
    
plt.axvline(x=united_mean, c='r');

### What's the point?

- In practice, we will only be able to collect one sample and calculate one statistic.
    - Sometimes, that sample will be very representative of the population, and the statistic will be very close to the parameter we are trying to estimate.
    - Other times, that sample will not be as representative of the population, and the statistic will not be very close to the parameter we are trying to estimate.
- The empirical distribution of the sample mean helps us answer the question "**what would the sample mean have looked like if we drew a different sample?**"

### Discussion Question

We just sampled **one thousand** flights, two thousand times. If we now sample **one hundred** flights, two thousand times, how will the histogram change?

- A.  narrower  
- B.  wider  
- C.  shifted left  
- D.  shifted right  
- E.  unchanged

### To answer, go to [menti.com](https://menti.com) and enter the code 5355 3796.

### How we sample matters

* So far, we've taken large simple random samples, **without replacement**, from the full population.
    * If the population is large relative to the sample size, then it doesn't really matter if we sample with or without replacement.
* The sample mean, for samples like this, is a good approximation of the population mean.
* But this is not always the case if we sample differently.
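
As one hypothetical illustration of "sampling differently", suppose the data happened to be sorted and we simply took the first 1000 values instead of a random sample. The population below is synthetic and sorted on purpose (both are assumptions for illustration; nothing here describes the real flight data).

```python
import numpy as np

# Synthetic stand-in population of delays, sorted so that position matters
# (an assumption for illustration only).
population = np.sort(np.random.exponential(scale=20, size=13825))
population_mean = population.mean()

# A simple random sample of 1000 values.
srs_mean = np.random.choice(population, 1000, replace=False).mean()

# A non-random "sample": just take the first 1000 values.
first_rows_mean = population[:1000].mean()

print(population_mean, srs_mean, first_rows_mean)
```

Both samples have the same size, but only the random one should land near the population mean.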

## Summary

### Summary

- The **probability distribution** of a random quantity describes the values it takes on along with the probability of each value occurring.
- An **empirical distribution** describes the values and frequencies of the results of a random experiment.
    - With more trials of an experiment, the empirical distribution gets closer to the probability distribution.
- A **population distribution** describes the values and frequencies of some characteristic of a population.
- A **sample distribution** describes the values and frequencies of some characteristic of a sample, which is a subset of a population.
    - When we take a simple random sample, as we increase our sample size, the sample distribution gets closer and closer to the population distribution.
- A **parameter** is a number associated with a **population**, and a **statistic** is a number associated with a **sample**.
- We can use statistics calculated on random samples to **estimate** population parameters.
    - For example, to estimate the mean of a population, we can calculate the mean of the sample.
    - Larger samples tend to lead to better estimates.