In [1]:
# SETUP

# These lines import the NumPy, datascience, and prob140 modules.
import numpy as np
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from client.api.assignment import load_assignment
autograder = load_assignment('main.ok')

# Probability 140 -- Lab 1: The Birthday Problem, Remastered 

## Introduction

In class, we discussed the classical [birthday problem](https://textbook.prob140.org/ch1/Birthday_Problem.html). 

This lab is concerned with finding the probability distribution of the number of people until the first and second matching birthdays in a group of N people. The coding exercises in this lab are not hard. The focus is on writing probability functions by inspecting sequences of events. We'll be using the `Prob140` library, which is an extension of the datascience library. You can find tutorials and [documentation here](http://probability.gitlab.io/prob140/html/).

We start with a relevant code review from the datascience library (optional). We will then move on to a small example, and then to a larger outcome space.

Read through the following code and make sure you understand what each cell does. This will be review: consider it a memory jogger. If you have trouble, please consult the [datascience documentation](http://data8.org/datascience/tables.html) or ask the staff or lab instructor.

Since we will be manipulating numpy arrays throughout this lab, you may want to refer to [Section 4.4 of the Data8 textbook](https://www.inferentialthinking.com/chapters/04/4/arrays.html) to review functions involving arrays.

You may also find `np.append()` useful:

```python
>>> np.append(1, np.array([2, 3, 4]))
array([1, 2, 3, 4])
```
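Two more NumPy tools appear in the cells later in this lab: `np.arange` with a negative step, and `np.prod`. The sketch below uses illustrative numbers that are not part of the lab, just to show how the calls behave.

```python
import numpy as np

# Count down from 6 to 4; the stop value (3) is excluded.
countdown = np.arange(6, 3, -1)

# Dividing an array by a number divides every element.
fractions = countdown / 6

# np.prod multiplies all the elements together: (6/6) * (5/6) * (4/6).
product = np.prod(fractions)

print(countdown)
print(product)
```

The product here is $\frac{6 \cdot 5 \cdot 4}{6^3} = \frac{5}{9}$.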

## A Die Example 

We will walk through a die example that is closely related to the birthday problem. Keep that in mind when reading this section. Elements of this example will help you with the calculations in Part 1 and Part 2.

In this section, we will also familiarize ourselves with the [prob140 library](https://probability.gitlab.io/prob140/html/).

Imagine we have a six-sided die, and we roll until we witness our first repeat; then we stop. We want to know how long we will have to wait until we observe the first repeated side.

Let $X = \text{number of rolls until the first repeated side appears} $. 

For example, if the sequence is 24134, then $X = 5$.
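To make the definition of $X$ concrete, here is a minimal sketch that reads a sequence of rolls and reports the roll on which the first repeat appears. The function name `first_repeat_roll` is made up for this illustration and is not part of the datascience or prob140 libraries.

```python
def first_repeat_roll(rolls):
    """Return the 1-based position of the first repeated face, or None if no face repeats."""
    seen = set()
    for position, face in enumerate(rolls, start=1):
        if face in seen:
            return position
        seen.add(face)
    return None

# The sequence 2, 4, 1, 3, 4 from the example: the 4 repeats on the fifth roll.
print(first_repeat_roll([2, 4, 1, 3, 4]))
```

The function only evaluates $X$ for one observed sequence; finding the distribution of $X$ is the work of the questions that follow.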

We will find the probability distribution of $X$, as follows.

&zwnj;
### Question I.1

When working with a random variable, it's best practice to be acquainted with its possible values (its range). What are the possible values for $X$ in terms of $N$? Instantiate the array `my_values` using `np.arange`.

In [3]:
N = 6
my_values = ...

In [5]:
_ = autograder.grade('q1')

Below we start a new probability distribution, ``dist_X``, to begin inspection of the probability distribution of the number of times we roll until we stop.

In [5]:
dist_X = Table().with_column('n', my_values)


### Question I.2


With the possible values of $X$ in mind, we need a way to calculate the distribution of $X$, the roll on which we first see a repeated side.

One way to do this, as seen in discussion, is to subtract "tail probabilities" to find the desired probability distribution.  

We can do this with the following formula:

$P(X = k) = P(X > k-1) - P(X > k)$

That is, the probability of the first repeated side on the $k$th roll is the probability that the first repeated side occurs after the $(k-1)$th roll minus the probability of the first repeat occurring after the $k$th roll.
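The tail-probability identity holds for any integer-valued random variable, so it can be checked on a simpler example than the die. The sketch below uses a hypothetical variable $T$, the number of fair coin tosses until the first head, for which $P(T > j) = (1/2)^j$. Neither $T$ nor the helper `tail` is part of this lab.

```python
import numpy as np

def tail(j):
    """P(T > j): the first j tosses of a fair coin are all tails."""
    return 0.5 ** j

k = np.arange(1, 6)                    # k = 1, 2, 3, 4, 5

# P(T = k) from the identity P(T = k) = P(T > k-1) - P(T > k)
point_probs = tail(k - 1) - tail(k)

# P(T = k) computed directly: k-1 tails followed by one head
direct = 0.5 ** k

print(np.allclose(point_probs, direct))
```

Both routes give the same probabilities, which is the pattern used for the die below.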

The following function ``no_matches`` returns the probability that there are no repeated faces in $k$ rolls of the die. The value returned by ``no_matches`` is equal to $P(X > k)$, since no matches in $k$ rolls means that the first match must occur after the $k$th roll. 

In [7]:
lower_bound = ...
upper_bound = ...

def no_matches(k):
    if k >= lower_bound and k < upper_bound:
        return np.prod(np.arange(N, N - k, -1) / N)
    else:
        return 0

In [9]:
_ = autograder.grade('q2')


### Question I.3
In order to find $P(X = k)$, we need an array of the probabilities $P(X > k)$ for $k = 2, \ldots, 7$.

Let's calculate $P(X = 2)$ and $P(X = 3)$ without using our defined function.

In [10]:
p_x_2 = (N / N) * (1 / N)
p_x_3 = ...
print("Probability of first repeat on 2nd and 3rd rolls, respectively is:", p_x_2, p_x_3)


We have the table ``dist_X`` with the possible values of $X$. To apply ``no_matches`` across the column, we use the `apply` function, which returns an array of the values of ``no_matches`` across all possible values of $X$. Verify that the first couple of values of ``first_repeat`` correspond to those found in I.3.
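The helper `p_first_match` in the next cell combines `np.append` and `np.diff` to turn tail probabilities into point probabilities. The sketch below traces those two calls on a small made-up array of tails for a hypothetical variable $Y$; the numbers are illustrative and are not the die probabilities.

```python
import numpy as np

# Hypothetical tail probabilities P(Y > 1), P(Y > 2), P(Y > 3).
tails = np.array([0.75, 0.5, 0.0])

# Prepend P(Y > 0) = 1 so that every value has a previous tail to subtract from.
padded = np.append(1, tails)

# np.diff returns padded[i+1] - padded[i]; negating gives P(Y > k-1) - P(Y > k).
point_probs = -1 * np.diff(padded)

print(point_probs)
```

The result is $P(Y=1) = 0.25$, $P(Y=2) = 0.25$, $P(Y=3) = 0.5$, which sums to 1 as a distribution should.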

In [12]:
def p_first_match(p_different):
    return -1 * np.diff(np.append(1, p_different))

In [13]:
different = dist_X.apply(no_matches, 'n')

first_repeat = p_first_match(different)

dist_X = dist_X.with_column('P(X = n)', first_repeat)
dist_X

### Using The Prob140 library 

Using the prob140 library we could have done this in one line of code:

In [14]:
dist = Table().values(my_values).probability(first_repeat)

Let's break this down:

1. Table() -- creates a new Table object.
2. values(my_values) -- populates the Table with a column ``Value`` containing my_values.
3. probability(first_repeat) -- adds a column ``Probability`` containing the array first_repeat.

Run the cell below and see for yourself that ``dist`` and ``dist_X`` are equivalent.

In [15]:
dist

Run the cell below, which uses our ``Plot`` method. ``Plot`` formats the bins automatically! No need for the ``hist`` method here.

In [16]:
Plot(dist_X)

## A Host's Guide to the Birthday Problem

Imagine you are hosting a large birthday party for your friend. You expect at least 400 people to attend because you're really popular and made a lot of friends in Prob140. 

Suppose your job is to greet the guests at the door and record their birthdays. Unlike before, we will approach the Birthday Problem as if people and their birthdays arrive to us in sequence, like faces of a die in a sequence of rolls.

Let $M_1$ denote the ``hitting time`` of the first matching birthday among the guests who have arrived, ignoring your friend's birthday and your own. That is, $M_1$ is how many people you greet at the door until (and including) the first person with a birthday you recorded earlier. 

Note the similarities between this and the die example.

&zwnj;

## Part 1

We will investigate the distribution of $M_1$, the trial on which we have the first match.



### Question 1.1

Using `np.arange`, create an array `m1_values` that contains the values of $M_1$.

In [17]:
N = 365
m1_values = ...

In [19]:
_ = autograder.grade('q3')

In the following cell, start a new probability distribution with the possible values defined above. Make sure the table looks how you expected.

Note that the `prob140` functions `.values()` and `.probability()` don't have to be called at the same time.

In [9]:
match1 = ...
match1

&zwnj;

### Question 1.2

Compute $P(M_1=2)$ and $P(M_1=3)$

In [22]:
p_m1_2 = ...
p_m1_3 = ...
print("Probability of first match is at second person", p_m1_2)
print("Probability of first match is at third person", p_m1_3)

In [24]:
_ = autograder.grade('q4')


### Question 1.3

Create an array `first_match` that contains the probabilities $P(M_1=k)$ for all $k$ over the possible values specified in 1.1. When you are finished, you should see the values you calculated in 1.2.

You should follow the steps you did in I.2 and I.3. Don't forget about `p_first_match`!

In [25]:
def no_matches(k):
    return ...
different = ...

first_match = ...

In [27]:
_ = autograder.grade('q5')

Now run the cell below to see the first 10 elements of `first_match`. Make sure you see the values of `p_m1_2` and `p_m1_3` that you calculated in Question 1.2

In [28]:
first_match[:10]

&zwnj;

### Question 1.4

Using the prob140 library, add the probabilities of ``first_match`` to the probability distribution ``match1``. Make sure ``first_match`` and ``match1`` have the same number of values. Check that the first few values correspond to your answers in 1.2.

In [29]:
match1 = ...
match1

Finally, plot ``match1`` in the cell below.

In [31]:
####
# Your code here
####

# This sets the limits of the x-axis
# Feel free to play around with the values
plt.xlim(0, 100);

&zwnj;

## Part 2

After your calculations in Part 1, you become curious about the second match.

Let $M_2$ be the number of guests you greet until your second match.
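To make $M_1$ and $M_2$ concrete, the sketch below reads a sequence of recorded birthdays (as day numbers) and reports when the first and second matches occur. It assumes that a guest counts as a match whenever their birthday equals any birthday already recorded, as in the description of $M_1$ above. The function name `match_times` and the sample birthdays are made up for this illustration.

```python
def match_times(birthdays):
    """Return (m1, m2): 1-based arrival positions of the first and second matches.

    Assumption: a guest is a match when their birthday equals any birthday
    already recorded. An entry is None if that match never happens.
    """
    seen = set()
    matches = []
    for position, day in enumerate(birthdays, start=1):
        if day in seen:
            matches.append(position)
            if len(matches) == 2:
                break
        seen.add(day)
    m1 = matches[0] if len(matches) > 0 else None
    m2 = matches[1] if len(matches) > 1 else None
    return m1, m2

# Guest 3 repeats day 14 and guest 5 repeats day 200.
print(match_times([14, 200, 14, 31, 200]))
```

For this sequence the first match is at guest 3 and the second at guest 5.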


### Question 2.1

Analogously to what you did before, let `m2_values` represent the possible values of $M_2$. Use `np.arange` in terms of $N$

In [33]:
m2_values = ...

In [35]:
_ = autograder.grade('q6')

In lecture yesterday, we learned about joint distributions. For every pair of values of two random variables X and Y, the joint distribution of X and Y contains the probability of X and Y taking on those values.

If we can find the joint distribution of the first match and the second match, we can calculate the distribution of the second match by taking the marginal distribution of the second match.
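As a small illustration of going from a joint distribution to marginals, the sketch below stores a made-up joint distribution of two hypothetical variables $A$ and $B$ in a plain NumPy array and sums over one variable at a time. It does not use the `prob140` joint distribution tools introduced later, and the probabilities are invented for the example.

```python
import numpy as np

# Rows are the values of A (1, 2); columns are the values of B (1, 2, 3).
# Entry [i, j] is P(A = i+1, B = j+1). The numbers are made up and sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.10, 0.20]])

# Marginal of A: for each row, add up the probabilities over all values of B.
marginal_A = joint.sum(axis=1)

# Marginal of B: for each column, add up the probabilities over all values of A.
marginal_B = joint.sum(axis=0)

print(marginal_A)
print(marginal_B)
```

Each marginal sums to 1, and each entry is the total probability of one value of that variable regardless of the other.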




### Question 2.2a

Explicitly calculate the probability that $M_1 = 2$ and $M_2 = 4$


*Provide your answer and reasoning in this Markdown cell.*


#QUESTION

### Question 2.2b
Define a function `joint_func` that takes in `m1` and `m2` and returns the probability of the first match being at `m1` and the second match being at `m2`. Think about what values are impossible and thus have a probability of zero. There's some starter code, but you don't have to use it.

Don't try writing the code for this right away. Use a notebook and pencil first!

Tip: Test your function by passing in values of `m1` and `m2` for which you already know the answer.

In [36]:

def joint_func(m1, m2):
    
    if ...:
        return 0
    
    p_so_far = match1.prob_event(m1) # P(M1=m1)
    
    unique_values_seen = ...
    
    for i in np.arange(..., ...):
        if ...:
            p_so_far *= ...
        else:
            ...    
    
    return ...

In [38]:
_ = autograder.grade('q7')

&zwnj;

### Question 2.3

We can construct a joint probability table of 2 random variables using the `prob140` library by calling `Table().values("Name_of_RV1", values_of_RV1, "Name_of_RV2", values_of_RV2).probability_function(joint_func)`. Construct a joint distribution table called `joint_table` that contains the joint distribution table of $M_1$ and $M_2$

Note that this may take a bit of time to run. How many possible values are there in the domain of this table?

In [39]:
joint_table = ...
joint_table


### Question 2.4

To convert our table into a JointDistribution object, we can call `joint_table.toJoint(reverse=False)`. Once we have the JointDistribution object, call `.both_marginals()` on it to view the marginal distribution of $M_1$ and $M_2$. (You might have to scroll down a lot.) Make sure the marginal distribution of $M_1$ is what you saw from before

In [41]:
joint_dist = ...

In [None]:
joint_dist.both_marginals()

&zwnj;

### Question 2.5

We can calculate the marginal distributions by calling `joint_distribution.marginal_dist("name_of_RV")`. Set `match1_marginal` to be the marginal distribution of $M_1$ and `match2_marginal` to be the marginal distribution of $M_2$. Make sure `match1_marginal` contains the values you calculated in Part 1

In [43]:
match1_marginal = ...
match1_marginal

In [45]:
match2_marginal = ...
match2_marginal

&zwnj;

### Question 2.6

Plot both $M_1$ and $M_2$ together using `Plots`. The syntax is `Plots("Name of RV1", RV1, "Name of RV2", RV2, ...)`

In [47]:

# Again, feel free to play around with the limits of the x-axis
plt.xlim(0,100);

In [49]:
_ = autograder.grade('q8')


### Question 2.7

Which distribution has a lower peak? Why?

*your answer here*

You are now finished with the required portion of the lab! See the lab instructor to get checked off.

&zwnj;

## Part 3 (OPTIONAL, EXTRA CREDIT!)

As before, let $M_2$ be the number of guests you greet until the second match. In this part, you will calculate the distribution of $M_2$ in a different way. 

You can then code up your answer and check that it agrees with the distribution you found above, but you don't need to do the coding to get the extra credit.

Let $D_n$ be the event that $n$ people all have different birthdays. In Part 1 you created the array `different` that contains $P(D_n)$ for all $n$ for which $D_n$ is possible.


### Question 3.1
Find the $x$ that makes the following statement true. Your answer will be in terms of $k$, which you can assume is a possible value of $M_2$.

$$
P(M_2 = k) = P(\text{exactly 1 match through k-1 trials}) \cdot \frac{x}{365}
$$

*Provide your answer and reasoning in this Markdown cell.*


### Question 3.2
Suppose there are $n \le 365$ people. Let $j$ be an integer such that $1 < j \le n$. Let $E_{j,n}$ be the event "matching birthday at person $j$, no other matches".

True or false: For fixed $n$, the $n-1$ events $E_{j,n}$ are mutually exclusive.



*Provide your answer and reasoning in this Markdown cell.*


### Question 3.3
Show that $P(E_{j,n}) = P(D_{n-1}) \cdot \frac{j-1}{365}$

[Hint: If you have trouble thinking about the event, find $P(E_{j,5})$ for each $j = 2, 3, 4, 5$. Don't simplify the fractions or products. Look at all four products. What similarities do you notice?]

*Provide your answer and reasoning in this Markdown cell.*


### Question 3.4
Use 3.2 and 3.3 to show that
$$
P(\text{exactly 1 match through n trials}) = 
P(D_{n-1}) \cdot \frac{(n-1)n}{2 \times 365}
$$

Hint: You will need to use the arithmetic series formula: $$\sum_{i=1}^na_i=\frac{n(a_1+a_n)}{2}$$
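The arithmetic series formula can be sanity-checked numerically. The sketch below uses an arbitrary arithmetic sequence ($a_1 = 3$, common difference 2, $n = 8$) chosen only for illustration; it is not the sum that appears in this question.

```python
import numpy as np

n = 8
a = 3 + 2 * np.arange(n)        # a_1, ..., a_n = 3, 5, 7, ..., 17

term_by_term = a.sum()          # add the terms one by one
formula = n * (a[0] + a[-1]) / 2  # n * (a_1 + a_n) / 2

print(term_by_term, formula)
```

Both computations give 80.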


*Provide your answer and reasoning in this Markdown cell.*


### Question 3.5

Put 3.4 and 3.1 together to find the distribution of $M_2$. It's fine to leave your answers in terms of $P(D_n)$ for $2 \le n \le 365$.

*Provide your answer and reasoning in this Markdown cell.*

&zwnj;

### Question 3.6 (OPTIONAL, UNGRADED)

Use 3.1 - 3.5 to finish this code block to see if the distribution of $M_{2}$ visually agrees with what you found in 2.6.

In [50]:

match2 = Table().values(np.arange(1, 368))

def p_all_different(n):
    if (n < ...):
        return ...
    individuals_array = np.arange(n)
    return np.prod((N - individuals_array)/N)

different = match2.apply(p_all_different, 'Value')

def p_exactly_one_match(j):
    return different.item(...) * ...
def prob_func_M2(i):
    return ...
    
match2 = Table().values(...).probability_function(prob_func_M2)
Plot(match2)
plt.xlim(0, 100);

In [52]:
match2

## Submitting to Gradescope

This semester, we will be grading Jupyter assignments using Gradescope and an export tool called `gsExport`. If you aren't enrolled in Prob140 on Gradescope, use entry code `9DRY49`. In order to submit your HW, 

1. Save the notebook using File->"Save and Checkpoint" (a keyboard shortcut is Control-S / Command-S )
2. Run the 2 cells below
3. If all went well, there should be a link called **Download this and submit to gradescope!**
4. Download that PDF and submit to the Gradescope assignment: HW1

If you are getting errors with this process, first double check that your LaTeX is correct, and if problems persist, please post on Piazza.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()