![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Finteresting-problems&branch=main&subPath=notebooks/birthday-problem.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# The Birthday Problem

The [birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) is a classic question about the probability of shared birthdays in a group. It is often stated as "How large does a group of people need to be for the probability of a shared birthday to be greater than 50%?"

We'll explore the theoretical and experimental probability.

## Theoretical Probability

To find the theoretical probability of a shared birthday, it's easier to first find the probability of every group member having a unique birthday. We'll assume that each of the 366 possible dates is equally likely, which isn't true (February 29 only occurs in leap years, and births are not spread evenly through the year) but is close enough.

If there is one person, the probability of a unique birthday is 100%. A second person's birthday is unique if it falls on any of the remaining 365 days, so the probability of that is $\frac{365}{366}$. Each new person has one fewer date available, and the probability that everyone's birthday is unique is the product of all of those fractions. The probability of only unique birthdays in a group of five people would be $\frac{366}{366} \times \frac{365}{366} \times \frac{364}{366} \times \frac{363}{366} \times \frac{362}{366}$. Let's run that calculation.

In [None]:
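n = 5  # group size
p = 100  # start at 100 so the result is a percentage
for x in range(n):
    # person x+1 must avoid the x birthdays already taken
    p = p * (366-x)/366
print(p)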

So a group of five people has about a 97% chance of having unique birthdays. We could also say that there is about a 3% chance of at least two people in the group sharing a birthday.
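
The same pattern extends to any group size. As a sketch under the same equal-likelihood assumption, writing $n$ for the number of people in the group, the probability that all $n$ birthdays are unique is

$$P(\text{unique}) = \prod_{k=0}^{n-1} \frac{366-k}{366} = \frac{366!}{(366-n)!\,366^{n}}$$

and the probability of at least one shared birthday is $1 - P(\text{unique})$. For $n = 5$ this is exactly the five-fraction product above.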

Let's calculate and plot the probabilities of unique birthdays, up to a group size of 366.

In [None]:
import pandas as pd
import plotly.express as px

unique_probabilities = {}
p = 100
for x in range(366):
    p = p * (366-x)/366
    unique_probabilities[x+1] = p
df = pd.DataFrame.from_dict(unique_probabilities, orient='index').reset_index()
df.columns = ['Group Size', 'Probability of Unique Birthdays']
px.line(df, x='Group Size', y='Probability of Unique Birthdays', title='Probability of Unique Birthdays in a Group')

We can see from the plot that a group size of about 23 is where the theoretical probability of unique birthdays drops below 50%.

We can also plot the theoretical probability of at least one shared birthday.

In [None]:
df['Probability of Shared Birthdays'] = 100-df['Probability of Unique Birthdays']
px.line(df, x='Group Size', y='Probability of Shared Birthdays', title='Probability of Shared Birthdays in a Group')

Once a group has more than 40 people, there is a greater than 90% chance of at least one shared birthday.
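
Reading thresholds off a plot is approximate, so it can help to compute them directly. The following is a minimal sketch that assumes the same 366 equally likely dates; the helper names `shared_probability` and `smallest_group` are made up for this example and are not part of the notebook above.

```python
def shared_probability(group_size, days=366):
    """Theoretical probability (0 to 1) of at least one shared birthday."""
    p_unique = 1.0
    for k in range(group_size):
        # person k+1 must avoid the k birthdays already taken
        p_unique *= (days - k) / days
    return 1 - p_unique


def smallest_group(threshold, days=366):
    """Smallest group size whose shared-birthday probability exceeds threshold."""
    group_size = 1
    # With days + 1 people a repeat is certain, so this loop always ends
    # for any threshold below 1.
    while shared_probability(group_size, days) <= threshold:
        group_size += 1
    return group_size


print(smallest_group(0.50))
print(smallest_group(0.90))
```

The two printed values should be 23 and 41, matching what the plots suggest. Note that this sketch works with probabilities between 0 and 1 rather than percentages.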

## Experimental Probability

To run a simulated experiment about shared birthdays, we can use the Python `random` module. Every time you run the next cell, it will generate random birthdays until a birthday is repeated.

In [None]:
import random
import numpy as np

birthdays = []
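for x in range(367):  # 367 people guarantee a repeat among 366 dates
    birthday = random.randint(1, 366)
    birthdays.append(birthday)
    number_of_birthdays = len(birthdays)
    unique_birthdays = len(np.unique(birthdays))
    if unique_birthdays != number_of_birthdays:
        # x counts from 0, so report the actual number of people in the group
        print('In this trial the first repeat birthday occurred in a group of', number_of_birthdays, 'people.')
        break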
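
A single run like this varies a lot from one execution to the next. One way to tie the simulation back to the 50% question is to fix the group size and count how often a simulated group contains a shared birthday. The sketch below is one possible approach, not part of the original notebook; the function name `estimate_shared_probability` and its parameters are hypothetical.

```python
import random


def estimate_shared_probability(group_size, trials=10_000, days=366, seed=None):
    """Fraction of simulated groups that contain at least one shared birthday."""
    rng = random.Random(seed)  # separate generator so a seed makes runs repeatable
    shared = 0
    for _ in range(trials):
        birthdays = [rng.randint(1, days) for _ in range(group_size)]
        # A set drops duplicates, so a shorter set means a repeated birthday
        if len(set(birthdays)) < group_size:
            shared += 1
    return shared / trials


for size in (10, 23, 41):
    print(size, estimate_shared_probability(size, seed=1))
```

With enough trials, the estimate for a group of 23 should land close to one half, and increasing `trials` should narrow the run-to-run scatter.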

We can also run a simulation to see how the number of unique birthdays grows as the group gets larger.

In [None]:
birthdays = []
experimental_probabilities = {}

for x in range(366):
    birthday = random.randint(1, 366)
    birthdays.append(birthday)
    number_of_birthdays = len(birthdays)
    unique_birthdays = len(np.unique(birthdays))
    experimental_probabilities[number_of_birthdays] = unique_birthdays
edf = pd.DataFrame.from_dict(experimental_probabilities, orient='index').reset_index()
edf.columns = ['Group Size', 'Unique Birthdays']
px.line(edf, x='Group Size', y='Unique Birthdays', title='Unique Birthdays versus Total Number of Birthdays')
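
The line typically bends away from a straight diagonal because later birthdays are increasingly likely to land on a date that is already taken. Under the same equal-likelihood assumption, each date is missed by all $n$ people with probability $\left(\frac{365}{366}\right)^n$, so the expected number of unique birthdays is $366\left(1 - \left(\frac{365}{366}\right)^n\right)$. A minimal sketch for comparison with the simulated curve follows; the function name `expected_unique_birthdays` is hypothetical.

```python
def expected_unique_birthdays(group_size, days=366):
    """Expected count of distinct birthdays among group_size people."""
    # Each date is missed by every person with probability
    # ((days - 1) / days) ** group_size
    return days * (1 - ((days - 1) / days) ** group_size)


for size in (23, 100, 366):
    print(size, round(expected_unique_birthdays(size), 1))
```

A single simulated run will wander around these values rather than match them exactly.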

### More Trials

Let's try running many trials to see where the average group size at the first repeat stabilizes. Because this average is a mean rather than the 50% point, expect it to settle a little above 23 people. Feel free to change `number_of_trials = 1000` to a larger or smaller number.

In [None]:
number_of_trials = 1000

def find_duplicate_birthdays():
    birthdays = []
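    for group_size in range(367):  # 367 people guarantee a repeat among 366 dates
        birthday = random.randint(1, 366)
        birthdays.append(birthday)
        number_of_birthdays = len(birthdays)
        unique_birthdays = len(np.unique(birthdays))
        if unique_birthdays != number_of_birthdays:
            # group_size counts from 0, so return the actual number of people
            return number_of_birthdays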
group_sizes = []
for trial in range(number_of_trials):
    group_sizes.append(find_duplicate_birthdays())
gdf = pd.DataFrame(group_sizes)
agdf = gdf.expanding().mean().reset_index()
agdf.columns = ['Trials', 'Average Group Size']
px.line(agdf, x='Trials', y='Average Group Size', title='Running Average of the Group Size at the First Duplicate Birthday')
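
The running average in this plot is a mean, while the 50% question is about a median, so the two need not agree exactly. As a rough check under the same equal-likelihood assumption, the theoretical mean number of people needed for the first repeat (counting the person whose birthday causes the repeat) can be found by adding up, for every group size, the probability that the group still has only unique birthdays. The function name `mean_people_until_repeat` is made up for this sketch.

```python
def mean_people_until_repeat(days=366):
    """Theoretical mean number of people needed to see the first repeated birthday."""
    mean = 0.0
    p_unique = 1.0  # probability that the first n people are all unique, starting at n = 0
    for n in range(days + 1):
        mean += p_unique  # adds the probability that more than n people are needed
        p_unique *= (days - n) / days  # person n+1 must avoid n taken dates
    return mean


print(round(mean_people_until_repeat(), 2))
```

With 366 dates the result is a little under 25, so a running average that settles somewhat above 23 is expected rather than a sign that the simulation is off.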

## Conclusion

Using both theoretical calculations and experimental simulations, we've found that we need a group size of about **23** in order for the likelihood of a shared birthday to be over 50%.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)