
In the previous chapter we ended off by talking about the "birthday problem" and used simulation to stumble upon a surprising result. But when we solved the problem we relied on the assumption that *every single birthday is equally likely*. Is this really a valid assumption?

What if, in actuality, fewer people are born over the summer, or perhaps more people are born nine months after Valentine's Day? In that case, certain birthdays would be more likely than others. But how can we actually find out whether or not this is true?

One way is to ask everyone in the world what their birthday is... good luck with that! As it turns out, it's not feasible to survey billions of people. But, we can get a pretty good idea of how birthdays are distributed by collecting a *sample* of the population.

This is precisely the concept we'll introduce in this chapter.

Many interesting data-driven questions revolve around understanding some aspect of a 'population' -- be it all users of a specific app, all houses in the South of France, or even something like all coin flips. In practice, we usually can't collect data on the entire population, so instead we rely on samples to give us a representative idea of what the population actually looks like, or how it actually behaves.

## Populations

A {dterm}`population` in data science and statistics is an entire group of people, objects, or events on which we're capable of collecting data to answer a specific question. The answers to questions like "do people prefer to watch content A or content B" or "what is a typical price for a house" are dependent on what *population* of people or houses they're referring to. Interestingly, a population can also refer to a group of events, such as "all fair dice rolls".

### Specifying a population

There are two steps we follow to narrow in on a population.
1. What group is our question about?

    Maybe we're only interested in a particular group -- like house prices specifically in the South of France. Therefore, our population would be narrowed down from all houses in the world to just houses in the South of France.
    
2. What group can we collect data from?

    Our data collection is often a limiting factor -- for example, if we're running a survey hosted on our website then we can only collect data on the people who visit our site. Therefore, our population is narrowed down from all users of the internet to just users who visit our site. Even if we wanted a larger population, we must remember that it is always limited by the data we collect.

In the birthday problem above, we're probably interested in the population being the entire world -- are *all* birthdays actually equally likely across the globe? However, in later exploration we'll be using [a data set](https://github.com/fivethirtyeight/data/tree/master/births) that was sampled from the United States. Therefore, we must narrow our population to only birthdays in the U.S., since that's the population our data comes from.

### Population distributions

Recall that knowing the *distribution* of some data is imperative to understanding and gaining insight from the data. The distribution is referred to as the 'shape' of the data, or simply 'what the data *looks* like'.

The {dterm}`population distribution` would arise if we were able to measure a feature across the entire specified population, and in most settings the population distribution is considered *fixed* -- unchanging -- for the duration of a study. Notice that even something like the distribution of birthdays in the U.S., which technically changes every time someone is born or dies, is unlikely to change by a significant amount over the course of a study.

In most cases we don't know and will never know the true population distribution. However, in the birthday scenario above we operated under the *assumption* that every birthday is equally likely. Under this assumption, our population distribution is assumed to be *uniform*. We're ignoring leap years for simplicity.

In [None]:
#: Replace this cell with a drawing of a uniform population distribution

import matplotlib.pyplot as plt
import numpy as np

N = 330_221_340

b = np.repeat(1/365, 365)

plt.bar(range(1,366), b, width=1, edgecolor='k', lw=0.1)
plt.xlabel("day of year")
plt.ylabel("density")

**Question**:
 At the time of this writing, the United States populace is approximately $330,221,340$ people. Let's call the population size $N$. Each of the 365 days in a year is assumed to be equally likely, so $P(\text{day})=\frac{1}{365}$. Since we're using a frequentist approach to probability, we also have $P(\text{day})=\frac{\text{# people with this birthday}}{N}$. Using these two facts, how many people in the U.S. are we assuming have their birthday on any given day?

<details><summary><b>Answer</b>:</summary>$$
\begin{aligned}
N &= 330,221,340 \\
\\
P(\text{day}) &= \frac{1}{365} = \frac{\text{# people with this birthday}}{N} \\
\\
\text{# people with this birthday} &= \frac{N}{365} \\
&= 904,716 \\
\end{aligned}
$$</details>
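The arithmetic for this question can be checked with a few lines of Python. This is only a sketch; the variable names `days_in_year` and `people_per_day` are ours, and the population figure is the one quoted in the question.

```python
# Population size quoted in the question (approximate U.S. populace)
N = 330_221_340
days_in_year = 365  # leap years are ignored, as in the text

# Under the uniform assumption, every day gets an equal share of the population
people_per_day = N / days_in_year
print(people_per_day)
```

Because the uniform assumption gives each day the same probability, the count per day is simply the population size divided by the number of days.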

Buuuuut, the distribution above is still only an assumption. An assumption that we still don't know the validity of. As far as we know, the actual population distribution of birthdays in the U.S. could look quite a bit different from uniform!

![other possible population distributions]

The truth is: we don't know, and we'll never know what the *true* population distribution is for most measurements since it's infeasible to collect data on every individual in the population!

#### Probability distributions

Fortunately, a notable exception to this dilemma comes to mind: we *do* know the true population distribution for measurements that arise from *probability distributions*.

Suppose we're working with a different population, such as "the outcomes of all fair six-sided dice rolls". In this case, we know *by the definition of 'fair'* that the true population distribution is the uniform distribution.

![uniform dice roll distribution]

When we're working with a probability distribution, we already know the probability of each outcome -- so we already know the height of each bar/bin in the corresponding population distribution! The uniform distribution is where all outcomes have equal probability, but there are *loads* of other probability distributions out there. For example, you may have seen the 'bell curve' before, officially known as the Normal or Gaussian distribution.

![probability distributions, normal and some others, labeled]

So why are probability distributions helpful?

Well, if we can convince ourselves that a particular population is the result of some random process that we already have a probability distribution for -- such as fair dice rolls, fair card drawings, [more examples... maybe with some combinatorics or other distributions] -- then we know that the true population distribution will pretty much match that probability distribution!

Other times, even if we're not sure whether or not a population arises from a probabilistic process, we can still hypothesize that the population distribution looks *similar* to a probability distribution -- such is the case when we are comparing the distribution of U.S. birthdays to a uniform distribution.

## Samples

No matter what our hypotheses may be, if we want to understand what our population looks like based on data, we must turn to the power of sampling.

In contrast to a population, a {dterm}`sample` is a subset of individuals randomly taken from a population. While it's infeasible to collect data from every member of a population, we can easily collect data from some of them.

### Sampling schemes

The process of choosing individuals from a population is called *sampling*, and while there are many different ways you can sample from a population, some approaches are better than others!

For example, if we were tasked with collecting a sample of people in the U.S., in an attempt to test the birthday assumption, we might be tempted to just ask the people around us -- you might choose to ask your classmates, or simply the first hundred people you encounter on your campus. This is called a *convenience sample* because, well, it's convenient. Unfortunately, it's bad practice.

Poor sampling schemes, like convenience sampling, produce samples that don't accurately represent the population. Remember that our population is determined in part by the data we're able to collect -- so when conducting convenience sampling we're essentially limiting our population to individuals in our geographical area at best.

A much better sampling scheme should produce samples that are *representative* of the population -- and therefore this sample must be diverse in nature.

By far the most common, simple, and yet extremely powerful method of sampling is called the "*simple random sample*". NumPy has a name for it too: `np.random.choice`.

As you might already suspect, this representative sampling scheme boils down to one simple step: *pick individuals from the total population, completely at random*.

### Collecting a random sample

NumPy's `random.choice` function should be somewhat familiar to us by now since we used it to simulate coin flips. In general, the function works by randomly picking elements from a sequence, like a list or an array.

Let's test it out by creating an example population of 100 individuals that we can sample from.

In [None]:
import numpy as np

The following code creates our population containing 100 individuals -- an unknown amount of which have the value 'kiwi', 'lemon', 'mango', or 'nectarine'. Perhaps these are the contents of someone's (rather large) grocery trip, thus our population might be "all fruits purchased on this grocery trip".

In [None]:
# Set the random 'seed' so that this gives us the same result each time
np.random.seed(1_000_001)

# Get four random floats between 0 and 1, then normalize the array so they sum
# to 100, then finally round to whole numbers.
counts = np.random.rand(4) 
counts = (counts / sum(counts) * 100).round()

population = np.repeat(['kiwi', 'lemon', 'mango', 'nectarine'], counts.astype(int))
population.size

Much like in real life, the distribution of this population is unknown to us!

Although we'd like to figure out how much of each fruit this mysterious shopper bought, we don't want to spend time counting every single fruit in their basket -- nor do we need to! As long as we collect a strong sample, we can avoid a lot of work and still understand roughly what distribution of fruits this customer bought.

Excited to see what fruit this person likes most, we reach into the basket and pull out the first fruits we see.

In [None]:
population[0]

Then the next one.

In [None]:
population[1]

And the next after that.

In [None]:
population[2]

Surely the shopper didn't buy 100% kiwi, right?

We just fell victim to convenience sampling. Chances are this shopper bought all of each fruit at the same time, so of course every fruit at the top of the basket is the same.

Let's make use of a simple random sample instead. We can use `np.random.choice` to reach into the basket, pull out a random fruit, then put it back.

In [None]:
np.random.choice(population)

#### Sample size and replacement

If you remember from the last chapter, `random.choice` also accepts a `size=` argument. That argument may have seemed strange in the context of simulating coin flips, but in the context of collecting a sample it fits right in. We can set the `size=` argument to our desired sample size!

Careful though. By default, the `random.choice` function *puts the fruit back* each time it samples an individual, which poses a problem if we try to sample multiple individuals. If we're unlucky, we might pick out a fruit, put it back, and then randomly grab the same fruit again -- this potential problem is only worsened by the fact that our population size is so tiny.

Instead, we want to take our sample all at the same time, *without* replacing the individual after each pull. Therefore, we must remember to specify `replace=False`.

Let's try it now with a sample size of ten.

In [None]:
np.random.choice(population, size=10, replace=False)

- what happens if we try grabbing more than 100? can't. exactly 100? we get exactly the population.
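The two edge cases in the note above can be tried out directly. The sketch below uses a stand-in basket with made-up fruit counts (40/30/20/10), since the real counts are meant to stay unknown. It relies on `np.random.choice` raising a `ValueError` when asked for more items than exist while `replace=False`.

```python
import numpy as np

# Hypothetical stand-in for the 100-fruit basket; these counts are made up
population = np.repeat(['kiwi', 'lemon', 'mango', 'nectarine'], [40, 30, 20, 10])

# Sampling exactly 100 without replacement returns every fruit once, in shuffled order
full_sample = np.random.choice(population, size=100, replace=False)
print(np.array_equal(np.sort(full_sample), np.sort(population)))

# Asking for more fruits than the basket holds is impossible without replacement
try:
    np.random.choice(population, size=101, replace=False)
except ValueError as error:
    print("ValueError:", error)
```

A sample the same size as the population is just the population in a different order, so it tells us everything, but it also costs as much as counting every fruit.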

#### Sampling from probability distributions

- you may notice similarities with np.random.choice for sampling and np.random.choice for simulating probabilities
- that's because when we simulate a probabilistic event, we're essentially sampling from the probability distribution!
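To make the connection concrete, here is a small sketch of sampling from a probability distribution with `np.random.choice`. The seed and the loaded-die probabilities are arbitrary choices for illustration. The `p=` argument, which this chapter has not used so far, assigns a probability to each outcome.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only so the result is repeatable

faces = np.arange(1, 7)

# Ten rolls of a fair die: every face is equally likely, and faces can repeat
# because sampling is done with replacement by default
fair_rolls = np.random.choice(faces, size=10)

# A hypothetical loaded die that lands on six half of the time
loaded_probabilities = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
loaded_rolls = np.random.choice(faces, size=10, p=loaded_probabilities)

print(fair_rolls)
print(loaded_rolls)
```

Here replacement is exactly what we want: rolling a three does not remove "three" from the die, so each roll is a fresh draw from the same distribution.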

#### Sampling from tables

- to create a sample of rows from a DataFrame, we use the DataFrame method `.sample()`
- it's pretty much like the random choice function, but it is replace=False by default to make things easier!
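As a rough illustration of sampling rows, the sketch below uses pandas, whose `DataFrame.sample` method defaults to `replace=False`. We are assuming the babypandas method behaves the same way for these arguments. The table contents are made up.

```python
import pandas as pd

# Hypothetical table of fruits; the names and prices are made up
fruits = pd.DataFrame({
    'fruit': ['kiwi', 'lemon', 'mango', 'nectarine', 'kiwi', 'mango'],
    'price': [0.50, 0.75, 1.25, 1.00, 0.55, 1.30],
})

# Draw three distinct rows at random (no row can appear twice)
row_sample = fruits.sample(3)

# Sampling rows with replacement has to be requested explicitly
row_sample_with_replacement = fruits.sample(3, replace=True)

print(row_sample)
```

Each sampled row keeps all of its columns, so a sample of a table is itself a smaller table.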

### Sample distributions

- we need a way to quickly see if our sample *looks like* our hypothesized population -- what better way than to create a visualization with a chart! (bar or histogram)

- we know [this] to be the true population distribution for fair dice rolls, what does the distribution of the sample look like?
- as we increase sample size, the sample distribution looks closer to the population

- [in population above, make a drawing of the population with some arrows pointing to the sample distribution]
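The claim that larger samples look more like the population can be explored with fair dice rolls, where the true proportion of each face is known to be 1/6. This is a sketch; the helper name `sample_proportions`, the seed, and the sample sizes are our own choices.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the result is repeatable

faces = np.arange(1, 7)

def sample_proportions(sample_size):
    """Roll a fair die sample_size times and return the proportion of each face."""
    rolls = np.random.choice(faces, size=sample_size)
    # bincount counts the values 0..6; drop the count for 0 to keep faces 1..6
    counts = np.bincount(rolls, minlength=7)[1:]
    return counts / sample_size

# The true population proportion of every face is 1/6, roughly 0.167
for sample_size in [10, 100, 10_000]:
    print(sample_size, sample_proportions(sample_size).round(3))
```

Plotting these proportions as a bar chart for each sample size would give the sample distributions to compare against the flat population distribution.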

#### Why do sample distributions resemble their population?

- if we're making a representative sample -- assume random for now -- then areas of our population distribution that are more dense (higher bar) are more likely to be randomly picked from, and less likely to pick a point from lower areas
- therefore, when we plot out the distribution of our randomly chosen points, we'll expect that more of them will be from more dense areas and thus result in a higher bar in our chart, less from less dense areas thus lower bar
- obviously, there is some random chance involved meaning we most likely won't get something that looks exactly like the population -- and each sample will look a bit different too!
- the more different from our population, the less likely -- it's really rare that we'll get a large sample distribution that looks super different from our population distribution
- we can figure out how likely it is that a sample comes from a hypothesized population!
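One way to see how rarely a large sample strays from its population is to draw many samples and measure how far each one lands from the truth. The sketch below again uses fair dice rolls. The helper `largest_gap`, the threshold of 0.1, and the choice of 1,000 repetitions are all illustrative assumptions rather than a standard procedure.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only so the result is repeatable

faces = np.arange(1, 7)

def largest_gap(sample_size):
    """Largest difference between a sample proportion and the true value 1/6."""
    rolls = np.random.choice(faces, size=sample_size)
    proportions = np.bincount(rolls, minlength=7)[1:] / sample_size
    return np.abs(proportions - 1/6).max()

# For each sample size, record how often a sample is "far" from the population
results = {}
for sample_size in [20, 2000]:
    gaps = np.array([largest_gap(sample_size) for _ in range(1000)])
    results[sample_size] = (gaps > 0.1).mean()

print(results)
```

Small samples miss by more than 0.1 fairly often, while samples of 2,000 rolls essentially never do, which is the intuition behind judging whether a sample plausibly came from a hypothesized population.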

---
# Old work / Staging area

- we have a sample (the data)
- we can create a sample from our assumed population (equally likely) by using np.random.choice
- creating samples:
    - from probabilities, arrays, or directly from tables!
    - with/without replacement
    - sample size
    - as size increases, samples look closer to the population
        - this is like 'law of averages' that old book uses dice rolls to show -- each time we roll a die we're essentially sampling from the uniform probability distribution (could be thought of as sampling with replacement...). the more that we sample (i.e. the more rolls we make) the closer our observed sample distribution will look like the underlying, true probability distribution.

---

- we need pictures for this chapter!!!
- we take a sample from that distribution -- more likely to pick a point from higher (more dense) areas, less likely to pick a point from lower areas
- therefore the shape of our sample (if it's representative, large enough) *should* look pretty much the same as the general shape of our population
- *but*, there's random chance involved! e.g., we might randomly pick more people from the low end
- intuitively, if sometimes in each bin we randomly select more or less people, we'd expect that on average our sample should look pretty much like the population


- just ask what the birthdays are of all your teammates on your local hockey team -- turns out more hockey players born in Q1

- easter egg about Korean birthdays?

- https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/
- https://github.com/fivethirtyeight/data/tree/master/births

Took 2000-2014 data, removed leap years (df.year % 4 != 0), sum grouped by month,date_of_month, dropped year and day_of_week, added day_of_year index.

Should probably include earlier data too.

In [None]:
import babypandas as bpd

In [None]:
births = bpd.read_csv('../../data/us_total_births_2000-2014_no_leaps.csv')
births.sort_values('births')

In [None]:
births.plot(kind='bar', x='day_of_year', y='births', width=1)

In [None]:
births[(births.get('day_of_year') < 100) & (births.get('births') > 130000)]

- hmmmmm... we should probably find something that has a hist associated with it that we can do sampling with. That way we have categorical and numerical distributions!
- actually let's do this first. We can find a numerical distribution for another section -- like the predicting parameters and CLT chapter

- add Continuous Distributions and Samples section
- this could be like a "why do samples look like the population" thing
- kinda just want to explain the whole "if the histogram is higher then you're more likely to randomly pick an individual from that bin, vice-versa with lower"-spiel

## Example: a fake birthday population

- let's create the true population distribution -- make it not uniform
- then we sample from that (those are the people we collect data on)
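A possible starting point for this example is sketched below. The weights are entirely made up (a hypothetical population where birthdays in the second half of the year are 20% more likely), and the sample size of 1,000 is arbitrary.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed, only so the result is repeatable

days = np.arange(1, 366)

# Made-up, non-uniform population: days after day 182 are 20% more likely
weights = np.where(days > 182, 1.2, 1.0)
probabilities = weights / weights.sum()

# "Collect data" on 1,000 people by sampling their birthdays from this population
sample_birthdays = np.random.choice(days, size=1000, p=probabilities)

# Share of sampled birthdays that fall in the second half of the year
second_half_share = (sample_birthdays > 182).mean()
print(second_half_share)
```

Under a truly uniform population this share would sit near one half, so a sample share noticeably above that hints at the non-uniform shape we built in.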

## Example: products on a manufacturing line

- continuous numerical measurement