
In the previous chapter we ended by discussing the "birthday problem" and used simulation to arrive at a surprising result. But our solution relied on the assumption that *every single birthday is equally likely*. Is this really a valid assumption?

What if, in reality, fewer people are born over the summer, or more people are born nine months after Valentine's Day?

One way to check whether all birthdays are actually equally likely is to ask everyone in the world what their birthday is... good luck with that! It's simply not feasible to ask billions of people when their birthdays are. But we can get a pretty good idea of how birthdays are distributed by collecting a *sample* of the population.

This is precisely the issue we'll discuss in this chapter.

Many interesting data-driven questions revolve around understanding some aspect of a 'population' -- be it all people in the world, all users of a specific app, or even something like all houses in the South of France. In practice, we often can't collect data on the entire population, so instead we rely on samples to give us a representative idea of what the population actually looks like, or how it actually behaves.

## Populations

A {dterm}`population` in data science is the entire group of people, objects, or events on which we're capable of collecting data to answer a specific question. The answers to questions like "do people prefer to watch content with a five-star rating or with a percent rating" or "what is a typical price for a house" are dependent on what *population* they're referring to.

### Specifying a population

There are two ways we can narrow in on a population:
1. If our question is about a specific population  

    Maybe we're only interested in house prices in the South of France. Therefore, our population would be narrowed down from all houses in the world to just houses in the South of France.
    
2. If we can only collect certain data

    If we're running an A/B test to figure out which website layout people click on more often, we can only collect data on the people who come to our site. Therefore, our population is narrowed down, whether we like it or not, from all users of the internet to just users who visit our site.

In the birthday problem above, we're probably interested in the population being the entire world -- are *all* birthdays actually equally likely across the globe? However, we'll be using a data set that was sampled from the United States, so we must narrow our population to only birthdays in the U.S., since that's the population our data comes from.

### Population distributions

Recall that knowing the *distribution* of some feature is imperative to understanding that data. The distribution is often referred to as the 'shape' of the data, and throughout your data science career most people will simply ask "what does the data *look like*" and usually mean "what is its distribution".

The {dterm}`population distribution` is what would arise if we were somehow able to measure a feature across the entire specified population. However, while the population distribution has some true, underlying, and unchanging shape, it's almost always unknown to us!

In the birthday scenario above, we start by operating under the *assumption* that all birthdays are equally likely, and we've narrowed down the population to just birthdays in the United States (ignoring leap years).

- under this assumption, we would expect a uniform population distribution

- example sketches
    - perhaps all birthdays in the U.S. truly are equally likely, in which case we'd expect to see a population distribution like this: ![uniform birthdays]
    - but it could very well look like this

In practice, you're unlikely to find someone with enough time, patience, or grant money to conduct such a large-scale measurement, so the population distribution remains unknown to us. There is some underlying *universal truth*, but we don't know it -- we can only make educated guesses using the power of probabilities.

- suppose that in the birthday problem we define our population to be the United States, which has 330,221,340 people at the time of this writing
- if all birthdays are equally likely when we select 23 people from the U.S., then recall from Probability that P = 1/365 = n/N, which gives n = N/365; our population should therefore have 904,716 people with a birthday on each of the 365 days of the year (ignoring leap years for simplicity)

In [None]:
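# U.S. population at the time of writing
N = 330221340
# expected number of people per birthday if all 365 days are equally likely
N/365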

In [None]:
import numpy as np

In [None]:
# one entry per day of the year, each holding the expected number of people
population = np.repeat(N/365, 365)

- if all equally likely, expect roughly a *uniform distribution* in the population

- we need pictures for this chapter!!!
- remember from Exploring Data that distribution is the *shape* of the data, we say "what the data looks like" and usually mean "what is its distribution"
- the population distribution is the true shape of all of our underlying data -- in practice this is fixed but unknown to us!
- we take a sample from that distribution -- more likely to pick a point from higher (more dense) areas, less likely to pick a point from lower areas
- therefore the shape of our sample (if it's representative, large enough) *should* look pretty much the same as the general shape of our population
- *but*, there's random chance involved! e.g., we might randomly pick more people from the low end
- intuitively, if in each bin we sometimes randomly select more people and sometimes fewer, we'd expect that on average our sample should look pretty much like the population
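
The notes above can be sketched in code. This is only a minimal illustration under the uniform-birthday assumption: it reuses `N` and `population` from earlier, while the helper name `sample_birthday_proportions`, the seed, and the two sample sizes are choices made here purely for demonstration. It draws birthdays with probability proportional to each day's share of the population, then measures how far the sample's shape strays from the population's shape.

```python
import numpy as np

# Seeded generator so the sketch gives the same result on every run (arbitrary seed).
rng = np.random.default_rng(seed=42)

N = 330221340                                   # U.S. population figure quoted above
population = np.repeat(N / 365, 365)            # people per birthday if all days are equally likely
probabilities = population / population.sum()   # chance of drawing each day of the year


def sample_birthday_proportions(sample_size):
    """Draw `sample_size` birthdays (days 1-365) and return the share landing on each day."""
    days = rng.choice(np.arange(1, 366), size=sample_size, p=probabilities)
    counts = np.bincount(days, minlength=366)[1:]   # index 0 is unused, so drop it
    return counts / sample_size


small = sample_birthday_proportions(365)        # small sample: roughly one person per day
large = sample_birthday_proportions(365_000)    # much larger sample

# Largest gap between any day's sample share and its population share.
small_gap = np.abs(small - probabilities).max()
large_gap = np.abs(large - probabilities).max()
print(small_gap, large_gap)
```

Because random chance is involved, neither sample matches the population exactly, but the larger sample's shape should sit much closer to the flat population distribution than the smaller one's.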


- just ask all your teammates on your local hockey team what their birthdays are -- turns out more hockey players are born in Q1

- easter egg about Korean birthdays?

- two ways to choose population:
    1. What you want the result to encompass (population informs sampling)
    2. What you ultimately sampled from (sampling informs population)
    - number (2) *always* applies -- so even if you *wanted* your population to be the entire world, if you only sampled people from the U.S. then your population is only allowed to be the U.S.

- https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/
- https://github.com/fivethirtyeight/data/tree/master/births

Summed the 2000-2014 data by day of year, removed Feb 29, added counter for day of year.

In [None]:
import babypandas as bpd

In [None]:
births = bpd.read_csv('../../data/us_total_births_2000-2014.csv')
births

In [None]:
births.plot(kind='bar', x='day_of_year', y='births', width=1)
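
One way to read a bar chart like this against the equally-likely assumption is to express each day's share of births relative to the uniform share. The sketch below is only an illustration: the helper `relative_to_uniform` is a hypothetical name introduced here, and the toy counts are invented for a five-"day" year rather than taken from the births data.

```python
import numpy as np


def relative_to_uniform(counts):
    """Return each day's share of the total, divided by the uniform share 1/len(counts).

    A value of 1.0 means the day has exactly the share a uniform distribution predicts;
    values above or below 1.0 mean the day is over- or under-represented.
    """
    counts = np.asarray(counts, dtype=float)
    shares = counts / counts.sum()
    return shares * len(counts)


# Invented toy counts for a 5-"day" year, NOT real birth data.
toy_counts = [90, 110, 100, 120, 80]
ratios = relative_to_uniform(toy_counts)
print(ratios)
```

With the real table, the values in the `births` column (once extracted as an array) could be passed in the same way; days with ratios well away from 1 would be the ones that cast doubt on the uniform assumption.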

- hmmmmm... we should probably find something that has a hist associated with it that we can do sampling with. That way we have categorical and numerical distributions!
- let's do this first, then we can find a numerical distribution for another section