# Discussion 5 - Sampling, Bootstrapping, and Confidence Intervals

## DSC 10, Summer 2024

### Agenda
- Review of concepts:
    - Distributions.
    - Sampling.
    - Bootstrapping. 
- Work in groups of 2-4 on practice problems covering these topics.
    - Available at [practice.dsc10.com](https://practice.dsc10.com/).
- At the end, go over the problems people had the most trouble with, all together.

## Distributions

### 1) Probability distributions

A probability distribution describes all possible values of a random quantity and the **theoretical** probability of each value.
- Roll of a fair die: takes on values 1 through 6, each with probability $ {1 \over 6} $.

### 2) Empirical distributions

An empirical distribution describes all **observed** values of a random experiment and the proportion of experiments in which each value occurred.

```py
import numpy as np

# simulate 25 rolls of a fair six-sided die
die_faces = np.arange(1, 7)
num_rolls = 25
rolls = np.random.choice(die_faces, num_rolls)
rolls
```
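
The array of rolls above is raw data; the empirical distribution is the proportion of rolls that landed on each face. A minimal sketch of that calculation follows. The names `theoretical` and `empirical` and the fixed seed are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np

np.random.seed(10)  # assumption: seed fixed only so the sketch is reproducible

die_faces = np.arange(1, 7)
num_rolls = 25
rolls = np.random.choice(die_faces, num_rolls)

# probability distribution: theoretical probability of each face of a fair die
theoretical = np.repeat(1 / 6, 6)

# empirical distribution: observed proportion of rolls showing each face
empirical = np.array([])
for face in die_faces:
    proportion = np.count_nonzero(rolls == face) / num_rolls
    empirical = np.append(empirical, proportion)

# print face, theoretical probability, observed proportion
for i in np.arange(6):
    print(die_faces[i], theoretical[i], empirical[i])
```

With only 25 rolls, the observed proportions will usually differ noticeably from $ {1 \over 6} $.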

### 3) Law of large numbers

The **law of large numbers** states that if a chance experiment is repeated 
- many times, 
- independently, and 
- under the same conditions, 

the **proportion** of times that an event occurs gets closer and closer to the **theoretical probability** of that event.

```py
# theoretical probability that a roll is 6
1 / 6
```

```py
num_rolls = 25
die_faces = np.arange(1, 7)
rolls = np.random.choice(die_faces, num_rolls)

# proportion of times roll is 6
np.count_nonzero(rolls == 6) / num_rolls
```
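
To see the law of large numbers at work, one option is to repeat the calculation above for increasing values of `num_rolls` and compare each proportion to $ {1 \over 6} $. The sketch below does this; the specific roll counts in `roll_counts` and the fixed seed are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(10)  # assumption: seed fixed only so the sketch is reproducible

die_faces = np.arange(1, 7)
roll_counts = np.array([25, 2500, 250000])  # hypothetical choices of num_rolls
proportions = np.array([])

for num_rolls in roll_counts:
    rolls = np.random.choice(die_faces, num_rolls)

    # proportion of times roll is 6
    prop_six = np.count_nonzero(rolls == 6) / num_rolls
    proportions = np.append(proportions, prop_six)

    # print the number of rolls, the proportion, and its distance from 1/6
    print(num_rolls, prop_six, abs(prop_six - 1 / 6))
```

The distance from $ {1 \over 6} $ tends to shrink as the number of rolls grows, though any single run with few rolls can land close to or far from it by chance.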

## Sampling

### 1) Populations and Samples

- A **population** is the complete group of people, objects, or events that we are interested in.
    - Often infeasible to collect information about every member of a population.
- A **sample** is a subset of the population.

We want to estimate the distribution of some numerical variable in the population using only a sample.

- A **parameter** is a number associated with the population.
- A **statistic** is a number calculated from the sample.
- A statistic can be used as an estimate for a parameter.
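
The distinction between a parameter and a statistic can be sketched with a made-up population. Everything about `population` below (its size, the normal shape, the mean of 67 and spread of 4) is a hypothetical stand-in; in practice we would not have access to the full population.

```python
import numpy as np

np.random.seed(10)  # assumption: seed fixed only so the sketch is reproducible

# hypothetical population: 10,000 made-up heights in inches
population = np.random.normal(67, 4, 10000)

# parameter: a number associated with the population
population_mean = population.mean()

# sample: a subset of the population, drawn uniformly at random without replacement
sample = np.random.choice(population, 500, replace=False)

# statistic: a number calculated from the sample, used to estimate the parameter
sample_mean = sample.mean()

print(population_mean, sample_mean)
```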

### 2) Sampling strategies

A **simple random sample (SRS)** is a sample drawn uniformly at random without replacement.

To draw an SRS from an array `arr`:
    <br>
    <br>
    <center>
    <code>np.random.choice(arr, sample_size, replace = False)</code>
    </center>

To draw an SRS from a DataFrame `df`:
    <br>
    <br>
    <center>
    <code>df.sample(sample_size)</code>
    </center>
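
The `replace = False` argument matters because `np.random.choice` samples with replacement by default. A short sketch of the difference, using a hypothetical array of 100 distinct values:

```python
import numpy as np

arr = np.arange(100)  # hypothetical population of 100 distinct values
sample_size = 10

# SRS: drawn uniformly at random without replacement, so no element repeats
srs = np.random.choice(arr, sample_size, replace=False)

# default behavior: with replacement, so the same element may appear more than once
with_replacement = np.random.choice(arr, sample_size)

# number of distinct elements in each sample
print(len(np.unique(srs)))
print(len(np.unique(with_replacement)))
```

The first count always equals `sample_size`; the second can be smaller whenever an element is drawn more than once.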

## Bootstrapping

Suppose we have a random sample of 500 UCSD students and we calculate their average height (a statistic) to estimate the average height of all UCSD students (a parameter).

Our estimate depends on the random sample. If our sample had been different, our estimate might also have been different. But **how different**?

We could answer this question if we knew what the distribution of the sample mean looked like.

### 1) Impractical approach

To get the distribution of the sample mean, we could repeatedly collect random samples of 500 UCSD students and compute their average heights.

Drawing new samples from the population each time is impractical &mdash; if we were able to do this, why not just collect more data in the first place?
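
Although this is impractical with real students, it can be simulated when the population is made up. The sketch below assumes a hypothetical `population` array of 30,000 heights (the size and the normal shape are invented for illustration) and repeatedly draws fresh samples of 500 from it.

```python
import numpy as np

np.random.seed(10)  # assumption: seed fixed only so the sketch is reproducible

# hypothetical population of 30,000 made-up heights in inches
population = np.random.normal(67, 4, 30000)

n_samples = 1000
sample_means = np.array([])

for i in np.arange(n_samples):
    # collect a brand-new random sample of 500 from the population
    sample = np.random.choice(population, 500, replace=False)

    # compute its mean and store it
    sample_means = np.append(sample_means, sample.mean())

# center and spread of the distribution of the sample mean
print(sample_means.mean(), np.std(sample_means))
```

The spread of `sample_means` is what answers "how different": it shows how much the estimate varies from one sample to the next.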

### 2) Bootstrapping

The key insight is that the original sample itself looks a lot like the population.

So, resampling from the sample (**bootstrapping**) is kind of like sampling from the population.

Let `my_sample` be a DataFrame containing a random sample of 500 UCSD students and their heights collected from the population of all UCSD students. To bootstrap, resample from the sample **with replacement**:


```py
n_resamples = 1000
boot_means = np.array([])

for i in np.arange(n_resamples):

    # resample from my_sample with replacement
    # create resamples that are of the same size as the original sample
    resample = my_sample.sample(my_sample.shape[0], replace = True)
    
    # compute mean of the resample
    mean = resample.get('height').mean()
    
    # store it in the array of means
    boot_means = np.append(boot_means, mean)
```
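
Since `my_sample` is not defined in this notebook, here is a self-contained sketch of the same loop that can be run directly. It assumes a NumPy array `heights` of 500 made-up values as a stand-in for `my_sample.get('height')`, and uses `np.random.choice` in place of `.sample`. The last few lines use `np.percentile` to take the middle 95% of the bootstrapped means, which is one common way to turn them into an interval; that step is an illustration added here, not part of the code above.

```python
import numpy as np

np.random.seed(10)  # assumption: seed fixed only so the sketch is reproducible

# hypothetical stand-in for my_sample.get('height'): 500 made-up heights in inches
heights = np.random.normal(67, 4, 500)

n_resamples = 1000
boot_means = np.array([])

for i in np.arange(n_resamples):
    # resample from the sample with replacement, same size as the original sample
    resample = np.random.choice(heights, len(heights), replace=True)

    # compute mean of the resample and store it
    boot_means = np.append(boot_means, resample.mean())

# original sample mean, and how much the bootstrapped means vary around it
print(heights.mean(), np.std(boot_means))

# endpoints of the middle 95% of the bootstrapped means
left = np.percentile(boot_means, 2.5)
right = np.percentile(boot_means, 97.5)
print(left, right)
```

Resampling with replacement is essential here: a resample of size 500 drawn without replacement would just be the original sample in a different order, so every mean would be identical.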