# Lab 1 Birthday Probability

## Intro

The purpose of this notebook is to use Monte Carlo simulation to answer questions about the probability that a person shares a birthday with another person (a birthday collision) in a random group of a given size, such as a group larger than 23. A Monte Carlo simulation runs a large number of experiments and determines the ratio of successes to the total number of experiments. For instance, if you run **N** experiments and a certain desired outcome occurs in **M** of them, the probability of that outcome can be estimated as M/N.

### Written By: Cecelia Henson

##### Importing Libraries

In [1]:
import random
import time 

The **birthdayGen** function generates a list of **n** random birthdays, each an integer from 1 to 365

In [2]:
def birthdayGen(n = 25):
    """
    Generates list of n number of random birthdays integers between 1 and 365
    
    param n: the number of birthday that will be generated; default argument is 25.
    return: a list of n number of random birthdays
    """
    birthdays = list()
    for x in range(n):
        birthdays.append(random.randint(1,365))
    return birthdays

The **sameBirthday** function checks whether the list returned by **birthdayGen** contains any repeated birthdays. It returns 1 if there is a repeat and 0 otherwise.

In [3]:
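def sameBirthday(birthdays):
    """
    Checks whether a list of birthdays contains at least one repeat

    param birthdays: a list of birthday integers
    return: 1 if any birthday appears more than once, otherwise 0
    """
    # A set drops duplicates, so a set smaller than the list means a repeat exists
    birthdaySet = set(birthdays)
    if len(birthdaySet) != len(birthdays):
        return 1
    else:
        return 0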
    

The **birthdayProb** function uses Monte Carlo simulation to estimate the probability that a group of size (G) contains at least two people with the same birthday. It runs the desired number of simulations (N) and returns the result as a percentage.

In [4]:
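def birthdayProb(G, N):
    """
    Estimates the probability, as a percentage, of a shared birthday in a group

    param G: the number of people in the group
    param N: the number of simulated groups (experiments) to run
    return: the percentage of the N experiments that had a repeated birthday
    """
    counter = 0
    for x in range(N):
        birthdays = birthdayGen(G)
        counter += sameBirthday(birthdays)
    return counter / N * 100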
    

This cell tests the **birthdayProb** function using a group size (G) of 30 and 70 simulations (N). The output is a percentage.

In [5]:
birthdayProb(30, 70)

71.42857142857143

This cell demonstrates the use of Python's built-in help function, which shows the docstring of a module, function, class, etc.

In [6]:
help(birthdayGen)

Help on function birthdayGen in module __main__:

birthdayGen(n=25)
    Generates list of n number of random birthdays integers between 1 and 365
    
    param n: the number of birthday that will be generated; default argument is 25.
    return: a list of n number of random birthdays



#### Calculation Cell:

This cell prompts the user to enter the group size **G** and the number of simulations **N**, then uses them to estimate the probability that at least two people in a group of size **G** share a birthday

In [7]:
G = input("Enter the number of people you want in your group: ")
N = input("Enter how many times you want to run an experiment: ")

G = int(G)
N = int(N)

probability = birthdayProb(G, N)
print("The probability of a group size " + str(G) + " having a common birthday is " + str(probability) + "%")

Enter the number of people you want in your group: 25
Enter how many times you want to run an experiment: 10000
The probability of a group size 25 having a common birthday is 56.07%


The **calculateSmallestGroup** function finds the smallest group size with a greater than 50% probability of a repeated birthday, using **N** simulations for each group size it tests

In [13]:
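def calculateSmallestGroup(N):
    """
    Finds the smallest group size whose estimated probability reaches 50%

    param N: the number of experiments to run for each group size
    return: the first group size whose estimate is not below 50
    """
    group = 1
    while birthdayProb(group, N) < 50:
        group += 1
    return group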

23
Wall time: 5.18 s


#### Benchmarking
This cell benchmarks **calculateSmallestGroup** by timing how long it takes to find the smallest group size with a greater than 50% probability of repeated birthdays.

In [14]:
%%time
print(calculateSmallestGroup(20000))

23
Wall time: 5.06 s
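The `%%time` magic only works inside Jupyter. As a rough alternative, the `time` module imported at the top of the notebook can measure wall-clock time directly. The sketch below is only an illustration: `timeCall` is a hypothetical helper that is not part of the original lab, and it times a built-in function so that it runs on its own.

```python
import time


def timeCall(func, *args):
    """
    Runs func(*args) once and measures the elapsed wall-clock time

    timeCall is a hypothetical helper, not part of the original lab.
    return: a tuple of (result of the call, elapsed seconds)
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with a built-in function so the sketch is self-contained
result, seconds = timeCall(sum, range(1000000))
print(result, "computed in", round(seconds, 4), "s")
```

In the notebook, the same helper could presumably be called as `timeCall(calculateSmallestGroup, 20000)` to time the benchmark without `%%time`.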


###### Benchmark Results: 

Running on personal computer -  
Number of Simulations - 20,000  
Group Size - 25  
Result - 6.78s

Running on cluster -  
Number of Simulations - 20,000  
Group Size - 25  
Result - 5.17s  

Running this cell on a cluster with Rosie was about 1.6 seconds faster than running it on my personal computer (5.17 s versus 6.78 s). However, my personal computer was plugged in with a full battery during the test. My hypothesis is that the difference would have been more drastic if the computer had been unplugged with a drained battery.

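The longest-running cell is the benchmarking cell that calculates the smallest group with an over 50% probability, because it relies on almost every function in the notebook: **calculateSmallestGroup** calls **birthdayProb** for each group size it tries, and each of those calls runs **N** experiments using **birthdayGen** and **sameBirthday**.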

## Conclusion 

For a group of 20 people, the probability that at least one pair shares a birthday was found to be **41.01%**, based on a simulation of 10,000 experiments.

In this lab, the smallest group size with a greater than 50% probability of two people sharing a birthday was 23. I tested group sizes from 10 to 30 with 10,000 experiments for each size, and the probability consistently crossed 50% at a group size of 23. Even when I increased the group size to 30 as a test, 23 remained the smallest group size with a greater than 50% probability.
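As a cross-check on the simulated results, the exact probability can be computed directly: the chance that all G birthdays differ is the product of (365 - k) / 365 for k from 0 to G - 1, and the collision probability is one minus that product. The sketch below assumes every day is equally likely, as in **birthdayGen**. The name `exactBirthdayProb` is a hypothetical helper and not part of the original lab.

```python
def exactBirthdayProb(G, days=365):
    """
    Computes the exact probability, as a percentage, of a shared birthday

    Assumes every day is equally likely. exactBirthdayProb is a
    hypothetical helper, not part of the original lab.
    """
    if G > days:
        # More people than days guarantees a repeat
        return 100.0
    noCollision = 1.0
    for k in range(G):
        # Person k must avoid the k birthdays already taken
        noCollision *= (days - k) / days
    return (1 - noCollision) * 100


for G in (20, 22, 23, 25, 30):
    print(G, round(exactBirthdayProb(G), 2))
```

Under this assumption the exact value is about 47.6% for a group of 22 and about 50.7% for a group of 23, which agrees with the simulated threshold of 23.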

I found that the **N** value had to be 10,000 or greater for an accurate result. When I first did the lab, I used 100 and 1,000 for group sizes of 20 to 25 and kept getting a 100% probability, which I knew wasn't correct. Because the number of experiments was so low, I had no confidence in those results, so I added another zero to **N**, making it 10,000, which gave a more precise answer. I then tested with 20,000 and got a similar result, with a difference of about +/- 1.
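One way to reason about how large **N** needs to be is the standard error of a proportion, sqrt(p(1 - p) / N), which describes the typical random spread of an estimate built from N independent experiments. The sketch below is a rough guide rather than part of the original lab: `standardErrorPercent` is a hypothetical helper, and p = 0.5 is assumed because it gives the largest spread.

```python
import math


def standardErrorPercent(p, N):
    """
    Approximate standard error, in percentage points, of a Monte Carlo
    estimate built from N independent experiments with true probability p

    standardErrorPercent is a hypothetical helper, not part of the lab.
    """
    return math.sqrt(p * (1 - p) / N) * 100


# p = 0.5 is an assumed worst case; the spread shrinks as N grows
for N in (100, 1000, 10000, 20000):
    print(N, round(standardErrorPercent(0.5, N), 2))
```

With p = 0.5 the spread is about 5 percentage points at N = 100 but only about 0.5 at N = 10,000, which is consistent with runs at 10,000 and 20,000 agreeing to within roughly one percentage point.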

A hash collision occurs when two pieces of data in a hash table share the same hash value. The birthday problem has the same structure: each person is a piece of data and each of the 365 possible birthdays is a hash value. The problem also shows why a single simulation is not enough: running each group size just once gives a result of either 0% or 100%, regardless of the group size. A group of 20 tested with one simulation can give 100%, and a group of 100 tested with one simulation will almost certainly give 100% as well. Increasing the number of simulations gives a more accurate answer because many random groups are checked instead of one, which reduces the influence of outliers.
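The link to hash tables can be sketched by simulating a toy hash table with 365 buckets. The code below is an illustration under stated assumptions, not a real hash table: the names `bucketCollision` and `collisionRate` are hypothetical, random 64-bit integers stand in for pieces of data, and the remainder after dividing by the number of buckets stands in for a hash function.

```python
import random


def bucketCollision(numKeys, numBuckets=365):
    """
    Places numKeys random keys into numBuckets and reports whether any
    two keys land in the same bucket (1) or not (0)
    """
    seen = set()
    for _ in range(numKeys):
        key = random.getrandbits(64)   # stand-in for a piece of data
        bucket = key % numBuckets      # toy hash function (assumption)
        if bucket in seen:
            return 1
        seen.add(bucket)
    return 0


def collisionRate(numKeys, trials, numBuckets=365):
    """Percentage of trials in which at least one collision occurred"""
    hits = sum(bucketCollision(numKeys, numBuckets) for _ in range(trials))
    return hits / trials * 100


# 23 keys in 365 buckets mirrors 23 people and 365 possible birthdays
print(collisionRate(23, 10000))
```

With 23 keys and 365 buckets, the collision rate should land close to the birthday result for a group of 23, since the two experiments are the same apart from the names.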