# HW07 notes
### [Chapter 13](https://inferentialthinking.com/chapters/13/Estimation)


`percentile` function:
- The `percentile` function takes two arguments: a rank between 0 and 100, and an array. It returns the corresponding percentile of the array.


Recall:
- The array below has 5 elements, so each element accounts for 20% of the collection, and together they add up to 100%.

```python
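from datascience import *
import numpy as np

sizes = make_array(12, 17, 6, 9, 7)
sorted_sizes = np.sort(sizes)  # renamed to avoid shadowing the built-in sorted()

print(sorted_sizes)  # [ 6  7  9 12 17]

percentile(70, sorted_sizes)  # 12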

```
The 70th percentile is the smallest value in the collection that is at least as large as 70% of the elements of `sizes`. Now 70% of 5 elements is “3.5 elements”, which rounds up to 4, so the 70th percentile is the 4th element of the sorted list. That’s 12, the same as the 80th percentile for these data.
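
The definition above can be written out directly. This is a minimal plain-Python sketch of the rule, not the `datascience` library's own implementation; `percentile_sketch` is a hypothetical name used only for illustration.

```python
import math

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p% of the values.

    Hypothetical helper for illustration; assumes 0 < p <= 100.
    """
    ordered = sorted(values)
    # Number of elements that must be at or below the answer, rounded up
    k = math.ceil(p * len(ordered) / 100)
    return ordered[k - 1]

sizes = [12, 17, 6, 9, 7]
p70 = percentile_sketch(70, sizes)  # 70% of 5 is 3.5, so take the 4th element
p80 = percentile_sketch(80, sizes)  # 80% of 5 is 4, so also the 4th element
print(p70, p80)
```

Rounding up is what makes the 70th and 80th percentiles coincide for a 5-element array.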



#### Bootstrapping review
1. Draw a random sample from the population, without replacement.
```python
our_sample = sf2019.sample(500, with_replacement=False)
# sf_bins is assumed to be an array of bin edges defined earlier in the notebook
our_sample.select('column to test for').hist(bins=sf_bins)
```
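2. Use the sample median as an estimate of the population median. A single estimate says nothing about its variability, which is why step 3 builds an empirical distribution of estimates.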
```python
est_median = percentile(50, our_sample.column('column to test for')) # returns some number
```
3. Bootstrap:
    - Treat the original sample as if it were the population.
    - Draw from the sample, at random with replacement, the same number of times as the original sample size.
```python
resample_1 = our_sample.sample()  # defaults: same size as our_sample, with replacement
resampled_median_1 = percentile(50, resample_1.column('column to test for'))
```
Code that repeats the resampling step many times:

```python
def one_bootstrap_median():
    resampled_table = our_sample.sample()
    bootstrapped_median = percentile(50, resampled_table.column('Total Compensation'))
    return bootstrapped_median

num_repetitions = 5000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())

resampled_medians = Table().with_column('Bootstrap Sample Median', bstrap_medians)
median_bins=np.arange(120000, 160000, 2000)
resampled_medians.hist(bins = median_bins)

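# Plotting parameters; you can ignore this code
# Assumes `import matplotlib.pyplot as plots`, as in the course notebooks
pop_median = percentile(50, sf2019.column('Total Compensation'))  # population median
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);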
```
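
For readers without the `datascience` library at hand, the same resampling loop can be sketched with the standard library only. The sample values and the names `sample_values` and `one_bootstrap_median_sketch` are made up for illustration; `random.choices` draws with replacement, which plays the role of `Table.sample()` with its default arguments.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up sample standing in for one column of the original sample
sample_values = [random.gauss(100, 15) for _ in range(200)]

def one_bootstrap_median_sketch(values):
    # Draw len(values) items at random WITH replacement
    resample = random.choices(values, k=len(values))
    # 50th percentile under the round-up rule used in these notes
    k = math.ceil(len(resample) / 2)
    return sorted(resample)[k - 1]

bstrap_medians = [one_bootstrap_median_sketch(sample_values) for _ in range(1000)]
print(min(bstrap_medians), max(bstrap_medians))
```

Every bootstrapped median is one of the original sample values, because a resample never contains anything that was not in the sample.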

**Middle 95%** Rule:

_Spoiler alert: The statistical theory of the bootstrap says that the number should be around 95. It may be in the low 90s or high 90s, but we don’t expect it to be much farther off 95 than that._

- We will start by writing a function `bootstrap_median` that takes two arguments: the name of the table containing the original random sample, and the number of bootstrap samples to draw. It returns an array of bootstrapped medians, one from each bootstrap sample.

```python
def bootstrap_median(original_sample, num_repetitions):
    medians = make_array()
    for i in np.arange(num_repetitions):
        new_bstrap_sample = original_sample.sample()
        new_bstrap_median = percentile(50, new_bstrap_sample.column('Total Compensation'))
        medians = np.append(medians, new_bstrap_median)
    return medians

# THE BIG SIMULATION: This one takes several minutes.

# Generate 100 intervals: each iteration draws a fresh sample of 500 and keeps
# the middle 95% of its bootstrapped medians. The endpoints go in the table `intervals`.

left_ends = make_array()
right_ends = make_array()

for i in np.arange(100):
    original_sample = sf2019.sample(500, with_replacement=False)
    medians = bootstrap_median(original_sample, 5000)
    left_ends = np.append(left_ends, percentile(2.5, medians))
    right_ends = np.append(right_ends, percentile(97.5, medians))

intervals = Table().with_columns(
    'Left', left_ends,
    'Right', right_ends
)    
```
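
The simulation above only builds the intervals; the number in the spoiler is, as far as these notes indicate, how many of the 100 intervals contain the population median. A minimal sketch of that count in plain Python follows, with made-up endpoints and a hypothetical helper name `count_covering`. In the notebook the inputs would be `left_ends`, `right_ends`, and the population median.

```python
def count_covering(left_ends, right_ends, parameter):
    """Count the intervals [left, right] that contain the parameter."""
    return sum(
        1 for left, right in zip(left_ends, right_ends)
        if left <= parameter <= right
    )

# Made-up endpoints for illustration only
example_left = [128000, 131000, 136500, 129500]
example_right = [141000, 143000, 148000, 134000]
example_parameter = 135000

covered = count_covering(example_left, example_right, example_parameter)
print(covered, "of", len(example_left), "intervals contain the parameter")
```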

#### Estimating Population Average

```python

def one_bootstrap_mean():
    resample = births.sample()
    return np.average(resample.column('Maternal Age'))

# Generate means from 5000 bootstrap samples
num_repetitions = 5000
bstrap_means = make_array()
for i in np.arange(num_repetitions):
    bstrap_means = np.append(bstrap_means, one_bootstrap_mean())

# Get the endpoints of the 95% confidence interval
left = percentile(2.5, bstrap_means)
right = percentile(97.5, bstrap_means)

make_array(left, right)
# output: array([26.90630324, 27.55962521])

```
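
The ranks 2.5 and 97.5 come from splitting the leftover 5% equally between the two tails. A small sketch of that arithmetic for any confidence level, using a hypothetical helper name `interval_percentiles`:

```python
def interval_percentiles(confidence_level):
    """Return the (lower, upper) percentile ranks for a middle-X% interval."""
    tail = (100 - confidence_level) / 2  # percent left out on each side
    return tail, 100 - tail

print(interval_percentiles(95))  # (2.5, 97.5)
print(interval_percentiles(99))  # (0.5, 99.5)
```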

Notes:
- You can construct a confidence interval at any confidence level and use it to answer a question about an unknown population parameter (estimation by a bootstrap confidence interval).
- When using the bootstrap method:
    - Start with a large random sample.
    - Repeat the resampling many times, for example 10,000.
    - The bootstrap percentile method works well for estimating the population median or mean based on a large random sample.
    - It is not expected to do well in the following situations:
        - The goal is to estimate the minimum or maximum value in the population, or a very low or very high percentile, or parameters that are greatly influenced by rare elements of the population.
        - The probability distribution of the statistic is not roughly bell shaped.
        - The original sample is very small, say less than 10 or 15.
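
The limitation on estimating a maximum can be seen directly: a resample only contains values from the sample, so no bootstrapped maximum can exceed the sample maximum, however large the true population maximum is. The sketch below uses a made-up population and the standard library only.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Made-up population and sample, for illustration only
population = list(range(1, 10001))
sample = random.sample(population, 100)  # drawn without replacement

bootstrap_maxima = [
    max(random.choices(sample, k=len(sample)))  # resample with replacement
    for _ in range(2000)
]

# The right end of any percentile interval built from these values
# can never be larger than the sample maximum.
print(max(bootstrap_maxima) <= max(sample))
```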

#### Confidence Interval
- A confidence interval has a single purpose – to estimate an unknown parameter based on data in a random sample.
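    - That estimate can then be used to test a hypothesis about the parameter: if the hypothesized value lies outside the interval, reject the null hypothesis at the matching cutoff (a 99% interval corresponds to a 1% cutoff).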

```python
# Null hypothesis: the population average of 'drop' is 0.
# Alternative hypothesis: the population average of 'drop' is not 0.
def one_bootstrap_mean():
    return np.average(table.sample().column('drop'))

# Generate 10,000 bootstrap means
num_repetitions = 10000
# np.array gives a 1-D array; make_array(a_list) would wrap the list in an extra dimension
bstrap_means = np.array([one_bootstrap_mean() for _ in range(num_repetitions)])

# Compute 99% confidence interval
left = percentile(0.5, bstrap_means)
right = percentile(99.5, bstrap_means)
print("99% Confidence Interval:", left, right)

# Plot results
resampled_means = Table().with_column('Bootstrap Sample Mean', bstrap_means)
resampled_means.hist()
plots.plot([left, right], [0, 0], color='yellow', lw=8)

```
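
Once the interval is computed, the decision itself is a single comparison. A minimal sketch, with made-up endpoints and a hypothetical helper name `null_is_rejected`:

```python
def null_is_rejected(left, right, null_value):
    """True when the hypothesized value falls outside the confidence interval."""
    return not (left <= null_value <= right)

# Made-up interval endpoints for illustration
print(null_is_rejected(0.8, 2.4, 0))   # True: 0 is outside the interval
print(null_is_rejected(-0.3, 1.9, 0))  # False: 0 is inside the interval
```

With a 99% interval this corresponds to a 1% cutoff for the two-sided test.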