
In the previous chapter we ended by talking about the "birthday problem" and used simulation to stumble upon a surprising result. But when we solved the problem we relied on the assumption that *every single birthday is equally likely*. Is this really a valid assumption?

What if in actuality fewer people are born over the summer, or perhaps more people are born nine months after Valentine's Day?

One way to check whether birthdays are all actually equally likely is to ask everyone in the world what their birthday is and see if the distribution of birthdays is uniform... good luck with that! As it turns out, it's not feasible to ask billions of people when their birthdays are. But we can get a pretty good idea of how birthdays are distributed by collecting a *sample* of the population.

This is precisely the issue we'll discuss in this chapter.

Many interesting data-driven questions revolve around understanding some aspect of a 'population' -- be it all people in the world, all users of a specific app, or even something like all houses in the South of France. In practice, we often can't collect data on the entire population, so instead we rely on samples to give us a representative idea of what the population actually looks like, or how it actually behaves.

## Populations

A {dterm}`population` in data science is an entire group of people, objects, or events on which we're capable of collecting data to answer a specific question. The answers to questions like "do people prefer to watch content with a five-star rating or with a percent rating" or "what is a typical price for a house" are dependent on what *population* they're referring to.

### Specifying a population

There are two ways we can narrow in on a population, and we usually rely on both methods.
1. What population is our question about?

    Maybe we're only interested in house prices in the South of France. Therefore, our population would be narrowed down from all houses in the world to just houses in the South of France.
    
2. What population can we collect data from?

    If we're running an A/B test to figure out which website layout people click on more often, we can only collect data on the people who come to our site. Therefore, our population is narrowed down from all users of the internet to just users who visit our site -- no matter what our original population was.

In the birthday problem above, we're probably interested in the population being the entire world -- are *all* birthdays actually equally likely across the globe? However, we'll be using a data set that was sampled from the United States, so we must narrow our population to only birthdays in the U.S., since that's the population our data is coming from.

### Population distributions

Recall that knowing the *distribution* of some feature is imperative to understanding and gaining insight from the data. The distribution is referred to as the 'shape' of the data, and even in your data science career most people will simply ask "what does the data *look like*" and usually mean "what is its distribution".

The {dterm}`population distribution` is the distribution we would see if we were somehow able to measure a feature across the entire specified population. In practice, a population distribution is usually considered fixed for the duration of a study -- even something like the distribution of birthdays in the U.S. is unlikely to change significantly over the course of a study.

In most cases we won't know the true population distribution, so its shape is unknown to us. However, in the birthday scenario above, we operated under the *assumption* that all people from the population are equally likely to be born on any given day. Under this assumption, our population distribution should be uniform. We're ignoring leap years for simplicity.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

b = np.repeat(1/365, 365)

plt.bar(range(1,366), b, width=1, edgecolor='k', lw=0.1)

# Replace this with a drawing!

At the time of this writing, the United States population is approximately 330,221,340 people. We're assuming each of the 365 days in a year is equally likely, so $P(\text{day})=\frac{1}{365}$. Under a frequentist approach to probability, $P=\frac{n}{N}$, where $N$ is the size of the population and $n$ is the number of people born on a given day. That lets us calculate the number of people in the U.S. that we're assuming have their birthday on a given day.

$$
\begin{aligned}
N &= 330{,}221{,}340 \\
\\
P(\text{day}) &= \frac{1}{365} = \frac{n}{N} \\
\\
n &= \frac{N}{365} \\
&= 904{,}716 \\
\end{aligned}
$$
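
As a quick check of the arithmetic above, here is a minimal sketch that rearranges $\frac{1}{365} = \frac{n}{N}$ for $n$. The variable names `N`, `days`, and `n` are chosen for this illustration only, and the population figure is simply the one quoted in the text.

```python
# Population figure quoted in the text, treated here as a fixed constant.
N = 330_221_340
days = 365  # leap years are ignored, as in the text

# Under the equal-likelihood assumption, P(day) = 1/365 = n/N, so n = N/365.
n = N / days
print(f"{n:,.0f} people per day")  # prints "904,716 people per day"
```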

- but, it could actually look like this, or this, or this (more pictures). Maybe our perception of time is way off and it actually looks like this wacky shape

- truth is: we don't know because we can't ask everyone when their birthday is!
- there is some underlying universal truth to the population distribution, but we don't know it -- we can only make educated guesses using the power of probabilities.

## Samples

- definition of sample

- we have a sample (the data)
- we can create a sample from our assumed population (equally likely) by using np.random.choice
- creating samples:
    - from probabilities, arrays, or directly from tables!
    - with/without replacement
    - sample size
    - as size increases, samples look closer to the population
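
The sampling options listed above can be sketched with `np.random.choice`. This is only an illustration under the equally-likely assumption: the sample size of 23 and the seed are arbitrary choices made for this example, not values taken from the data.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, used only so the sketch is repeatable

days = np.arange(1, 366)          # day-of-year labels 1..365
probs = np.repeat(1 / 365, 365)   # assumed uniform population distribution

# Sample from probabilities, with replacement: two people can share a birthday.
with_repl = np.random.choice(days, size=23, replace=True, p=probs)

# Sample from an array without replacement: every drawn day is distinct.
without_repl = np.random.choice(days, size=23, replace=False)

# Number of distinct days in each sample.
print(len(np.unique(with_repl)), len(np.unique(without_repl)))
```

Sampling with replacement matches the birthday setting, since one person's birthday does not prevent someone else from having the same one.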

---

- if all equally likely, expect roughly a *uniform distribution* in the population

- we need pictures for this chapter!!!
- remember from Exploring Data that distribution is the *shape* of the data, we say "what the data looks like" and usually mean "what is its distribution"
- the population distribution is the true shape of all of our underlying data -- in practice this is fixed but unknown to us!
- we take a sample from that distribution -- more likely to pick a point from higher (more dense) areas, less likely to pick a point from lower areas
- therefore the shape of our sample (if it's representative, large enough) *should* look pretty much the same as the general shape of our population
- *but*, there's random chance involved! e.g., we might randomly pick more people from the low end
- intuitively, if sometimes in each bin we randomly select more or less people, we'd expect that on average our sample should look pretty much like the population
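
The intuition in these notes can be checked with a small simulation. The sketch below assumes a uniform population and measures how far the sample proportions stray from it at a few arbitrarily chosen sample sizes. The helper `max_gap` is a hypothetical name introduced for this example.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for repeatability
days = np.arange(1, 366)
population = np.repeat(1 / 365, 365)  # assumed uniform population distribution

def max_gap(sample_size):
    """Largest absolute gap between sample and population proportions."""
    sample = np.random.choice(days, size=sample_size, p=population)
    counts = np.bincount(sample, minlength=366)[1:]  # counts for days 1..365
    return np.abs(counts / sample_size - population).max()

# Larger samples should track the population more closely.
gaps = {size: max_gap(size) for size in (100, 10_000, 1_000_000)}
for size, gap in gaps.items():
    print(size, round(gap, 5))
```

Random chance still moves individual days up or down, but the largest gap shrinks as the sample size grows.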


- just ask what the birthdays are of all your teammates on your local hockey team -- turns out more hockey players are born in Q1

- easter egg about Korean birthdays?

- https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/
- https://github.com/fivethirtyeight/data/tree/master/births

Took the 2000-2014 data, removed leap years (`df.year % 4 != 0`), summed births grouped by `month` and `date_of_month`, dropped `year` and `day_of_week`, and added a `day_of_year` index.

In [None]:
import babypandas as bpd

In [None]:
births = bpd.read_csv('../../data/us_total_births_2000-2014_no_leaps.csv')
births.sort_values('births')

In [None]:
births.plot(kind='bar', x='day_of_year', y='births', width=1)

In [None]:
births[(births.get('day_of_year') < 100) & (births.get('births') > 130000)]

- hmmmmm... we should probably find something that has a hist associated with it that we can do sampling with. That way we have categorical and numerical distributions!
- actually let's do this first. We can find a numerical distribution for another section -- like the predicting parameters and CLT chapter