## Estimating a population parameter

In the past handful of pages, we've been doing a lot of work around testing whether samples fit some existing assumptions we have about their populations. But what if we don't really have assumptions about the population? Instead, we want to figure out what a population parameter might be using only our one sample.

### The issue with just guessing the statistic

It seems very convenient to just guess that the value of the parameter is equal to the value of the sample statistic we observed -- after all, representative samples should exhibit statistics similar to the parameter.

Alas, a cautionary tale awaits you.

You, a famed data magician, are tasked with demonstrating your powers by predicting the proportion of students at your school who prefer apples over oranges. You survey a bunch of students and proclaim that it's 0.7 because that's what you saw in your sample. Soon, the school performs its annual "Apples vs Oranges" census, and it turns out the true proportion is only 0.6... you were wrong. Your powers have been denounced and now you sit in shambles wondering how you could have prevented this from happening.

Unfortunately, samples don't always turn out exactly the same. Our *best guess* for the population parameter usually *does* equal (or involve a formula based on) the statistic. But if our sample looks different each time, then our statistic will look different each time, too.

Data science is not data magic. Data science is all about recognizing and embracing this uncertainty. When we make predictions, we don't just give one number, but two:
1. What is our best guess for the parameter (a formula involving the statistic)
2. How different could our guess have been (what we're about to learn)

## How different could our statistic have been

When we collect samples from a population distribution, each sample will resemble the population distribution in shape. Due to the randomness of sampling, though, there will be a bit of variability in the sample, and thus in the statistic as well.

We can see this directly by loading in a population distribution, then taking a bunch of samples and plotting the empirical {dterm}`sampling distribution` (the distribution of sample statistics). Let's look at the salary dataset again. Because it's a very skewed population, we'll compute the median salary and overlay the true population median on each of our plots. We'll use samples of size 100 for simplicity.

In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
population = bpd.read_csv('../../data/salaries.csv', names=['Salary'])
pop_median = population.get('Salary').median()
print('Population median:', pop_median)
population.plot(kind='hist', density=True)
plt.title('Population distribution')
plt.axvline(pop_median, c='gold')

In [None]:
trials = 10_000

sample_medians = []

for i in range(trials):
    sample = population.sample(100)
    median = sample.get('Salary').median()
    sample_medians.append(median)
    
plt.hist(sample_medians, density=True)
plt.title('Distribution of sample medians')

plt.axvline(pop_median, c='gold')

With this distribution of the sample statistic computed, it's easy to see how different any one observed statistic could have been. There is a whole range of values that we could have randomly seen. Relatively often, we would find a sample with a statistic in the middle, pretty close to the true population median of 100,000. But not all the time. Most of the time, our randomly observed sample median seems to fall somewhere between 90,000 and 110,000.

## Confidence intervals

- if we can somehow generate a sampling distribution, then we can uncover the range of values that 

---

This sampling distribution is a true representation of how different our sample statistic could have been. Sadly, in the real world we can't actually generate all of these new samples to calculate the sampling distribution, because we don't have access to the population to draw them from! In most cases, we only have a single sample.

In [None]:
sample = population.sample(100)

How can we use this one sample to replicate a sampling distribution which comes from *lots* of samples?

### The bootstrap

To 

- so now comes the question: "how can we create a sampling distribution when we don't have a population to sample from?"
- true, we don't have the population distribution to take new samples from. However if we could use a distribution that looks *similar* to the population, we could conceivably take new samples from that similar distribution instead.
- turns out we do have a distribution that is similar to the population -- the sample itself!
- let's try it -- let's repeat the process of taking new samples from the population, only this time we pretend that our original sample *is* the population
    - because the entire process is the same, we should continue to take the same sample size
    - it's worth noting that we *must* take these samples with replacement -- otherwise every single sample will look exactly the same (try picking n items from a collection of n items, without replacement -- yeah.)
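
The resampling steps above can be sketched roughly as follows. This is a minimal illustration rather than the section's final code: the `sample` array here is synthetic, right-skewed stand-in data (an assumption, not the salary dataset), and the names `n_resamples`, `boot_medians`, and `no_replace` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the one sample we collected (synthetic, right-skewed values;
# NOT the salary data used elsewhere in this section).
sample = rng.lognormal(mean=11.5, sigma=0.5, size=100)

n_resamples = 5_000
boot_medians = np.empty(n_resamples)

for i in range(n_resamples):
    # Pretend the sample is the population: same size, WITH replacement.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_medians[i] = np.median(resample)

# Without replacement, a "resample" is just the original sample reshuffled,
# so its median never changes.
no_replace = rng.choice(sample, size=len(sample), replace=False)

print('Sample median:', np.median(sample))
print('Spread (std) of bootstrapped medians:', boot_medians.std())
print('Median of a no-replacement resample:', np.median(no_replace))
```

The last line is the reason replacement matters: picking n items out of n without replacement always hands back the same values, so there is no variability left to study.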


- bootstrap -- here's how we find out how different it could have been
    - problem, can't go back and get more data
    - imagine we could, then we could keep getting more samples and see how diff answer could have been (begin introducing concept of confidence interval, like it looks like stat lies in this range) -- might want to actually run that experiment, show the sampling distribution, that shows how different it could have been
    - BUT, we can't actually get new samples
    - we don't have the population to take new samples from, but we can replace it with something that looks pretty similar to the population -- the sample itself!
    - since we're pretending we're sampling from the population, we should take same sample size

- bootstrap -- 
    - we want to sample from a distribution that looks like this (true population)
    - we don't have that exact distribution, but we hopefully have something that looks *really close* (the sample distribution!). So let's sample from that!
    - lots of statistics are a property of the *distribution* -- so as long as the distribution looks similar to the right distribution, we don't care if there are duplicates!
    - some statistics work well with bootstrap, some don't (like mean vs max)
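
To make the mean vs max point concrete, here is a small sketch under assumed conditions: synthetic skewed data instead of a real sample, and illustrative variable names. A resample can never contain a value larger than the largest one in the original sample, so the bootstrapped max has nowhere to go but down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for our one sample (an assumption for illustration).
sample = rng.lognormal(mean=11.5, sigma=0.5, size=100)

# Draw 5,000 resamples at once; each row is one resample of size 100.
resamples = rng.choice(sample, size=(5_000, len(sample)), replace=True)

boot_means = resamples.mean(axis=1)
boot_maxes = resamples.max(axis=1)

# How often is the bootstrapped max exactly the original sample's max?
share_at_sample_max = np.mean(boot_maxes == sample.max())

print('Distinct bootstrapped means:', len(np.unique(boot_means)))
print('Distinct bootstrapped maxes:', len(np.unique(boot_maxes)))
print('Share of resamples whose max equals the sample max:', share_at_sample_max)
```

The bootstrapped means spread out smoothly around the sample mean, while the bootstrapped maxes pile up on a handful of values. For a sample of size n, the chance that a resample includes the largest value is 1 - (1 - 1/n)^n, which is roughly 63% for n = 100, so well over half of the resamples report the exact same max.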

- lots of people get confused between when to use permutation or bootstrap -- so we don't need to emphasize the similarities too much, maybe just a sentence and then follow it up by saying "But, not for the same thing"
- differences: a permutation test is for two samples, when we want to see whether they come from the same population (hypothesis test). The bootstrap is for when we have one sample and want to generate more samples (confidence interval).

- both work by saying we don't have assumptions about distribution, so we need to be clever about rearranging our current data
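
A side-by-side sketch may help keep the two procedures apart. Everything here is illustrative: the two groups are synthetic, and the statistics chosen (difference in means, median) are assumptions made just for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic samples (assumed data, for illustration only).
group_a = rng.normal(50, 10, size=40)
group_b = rng.normal(50, 10, size=60)

# Permutation: TWO samples. Pool them, shuffle WITHOUT replacement,
# then split back into groups of the original sizes.
pooled = np.concatenate([group_a, group_b])
shuffled = rng.permutation(pooled)
perm_a = shuffled[:len(group_a)]
perm_b = shuffled[len(group_a):]
perm_stat = perm_a.mean() - perm_b.mean()

# Bootstrap: ONE sample. Resample it WITH replacement at the same size.
boot = rng.choice(group_a, size=len(group_a), replace=True)
boot_stat = np.median(boot)

print('One permuted difference in means:', perm_stat)
print('One bootstrapped median:', boot_stat)
```

Both rearrange data we already have, but the permutation keeps every original value exactly once and only changes which group it belongs to, while the bootstrap may repeat some values and drop others.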

- one way we can do things in statistics is by operating under really rigid assumptions -- then can use math
- other way doesn't make this strict 

- do bootstrap examples for median or other stats *not* informed by central limit theorem
- we want them to be useful examples where CLT won't work

- how to introduce need for CI:
    - say "It could have been different!"
    - sure, we could say it's equal to the sample stat, but am I really confident that's right?
    - I want to know how accurate I am
    - How different could my answer have been?
    - If it could have been really different, I'm not super confident in my answer
    - if it couldn't have been really different, then I can be confident!
    - only issue with confidence -- you're not 95% sure your answer is right -- you're pretty sure that the answer is *close*
    - you're not confident that your answer is right, you're confident that your answer is *close*
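
One way to turn "how different could my answer have been?" into numbers is to take the middle 95% of the bootstrapped statistics. The sketch below assumes that percentile approach, which these notes don't commit to, and uses a synthetic sample in place of real data. The names `left` and `right` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for our one sample (an assumption for illustration).
sample = rng.lognormal(mean=11.5, sigma=0.5, size=100)

# Bootstrap: 5,000 resamples of the same size, with replacement.
resamples = rng.choice(sample, size=(5_000, len(sample)), replace=True)
boot_medians = np.median(resamples, axis=1)

# Middle 95% of the bootstrapped medians.
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)

print('Best guess (sample median):', np.median(sample))
print('Approximate 95% interval:', [left, right])
```

A narrow interval says the answer couldn't have been very different, so we can be fairly confident our guess is close. A wide interval says the opposite.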
    
    
- order:
    - want to estimate
    - sure we can guess our sample statistic
    - but our sample could have been really different!
    - bootstrap -- here's how we find out how different it could have been
    - then CI
    - then CI/HT duality
    
- can use error probabilities to link the HT and CI
- let's get a bunch of samples, run bootstrap on each, build CI on each, and see how often we're wrong

- is important to run experiment and explain that you can (and will) be wrong x% of time
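
The "see how often we're wrong" experiment could look something like this. It is a hedged sketch: the population is synthetic (so that we know its true median), the interval is built with the percentile approach as an assumption, and the counts `n_samples` and `n_resamples` are kept small so it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic, right-skewed "population" (an assumption for illustration only).
population = rng.lognormal(mean=11.5, sigma=0.5, size=100_000)
pop_median = np.median(population)

n_samples = 300      # independent original samples to draw
n_resamples = 1_000  # bootstrap resamples per original sample
sample_size = 100

hits = 0
for _ in range(n_samples):
    # A fresh original sample from the population.
    sample = rng.choice(population, size=sample_size, replace=False)

    # Bootstrap that one sample and build a 95% percentile interval.
    resamples = rng.choice(sample, size=(n_resamples, sample_size), replace=True)
    boot_medians = np.median(resamples, axis=1)
    left, right = np.percentile(boot_medians, [2.5, 97.5])

    # Did this interval capture the true population median?
    if left <= pop_median <= right:
        hits += 1

coverage = hits / n_samples
print('Share of intervals containing the population median:', coverage)
print('Share of intervals that missed:', 1 - coverage)
```

The second number is the one to stare at: some intervals miss, and that is expected. A 95% procedure is one that is wrong roughly 5% of the time, and we never know which samples are the unlucky ones.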





- **a statistic is a property of a distribution**
- we can have two different distributions with the same statistic, but if the statistic is different then the distributions *must* be different
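
A tiny made-up example of that last point, using the median as the statistic (the three arrays are invented for illustration):

```python
import numpy as np

spread_out = np.array([10, 20, 30, 40, 50])
bunched_up = np.array([29, 30, 30, 30, 31])
shifted = np.array([20, 30, 40, 50, 60])

# Different distributions, same median.
print(np.median(spread_out), np.median(bunched_up))

# A different median can only come from a different distribution.
print(np.median(shifted))
```

The first two arrays look nothing alike but share a median of 30, so matching statistics don't prove matching distributions. The third has a median of 40, which is enough on its own to tell us its distribution differs from the other two.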