# Lab2: Poisson Approximations #

In this lab we will study Poisson approximations to a variety of distributions. To measure the distance between a distribution and its Poisson approximation, we will use the *total variation distance* (TVD) familiar to you from Data 8.

We will start with the binomial $(n, p)$  distributions for a fixed $p$ and varying $n$. As you have seen in lecture, for fixed $p$ the shape of the binomial distribution depends on $n$, and sometimes can be approximated by an appropriate Poisson distribution. In Part 1 we will explore an example about roulette.

## Part 1: Binomial 

Roulette is a classic betting game in which a ball is spun at random around a wheel with 38 numbered pockets. Each pocket is colored red, black, or green; 2 are green, and there are 18 each of red and black. Before any spin, players can bet on a set of pockets in which they think the ball will land.

**Assumptions of Randomness:**
The spins are independent of each other, and on each spin the ball is equally likely to land in any of the 38 pockets.

### Question 1.1: 35 Independent Bets on Green
Suppose each person in a lab of 35 people (roughly the enrollment of this lab) has a roulette wheel and decides to bet on green.

First, fill in the blanks:

Let $G_{35}$ be the number of people who win. The distribution of $G_{35}$ is binomial $(n, p)$ for some $n$ and $p$. Why is it binomial? 

Please specify $n$ and $p$ in the cell below.

In [92]:
n = ...
p = ...

In [94]:
_ = autograder.grade('q1')

As you saw in lecture, in Python we can calculate the probability mass function of a binomial using ``stats.binom.pmf``. 

For example, the chance of 3 sixes in 12 rolls of a die is:

In [95]:
stats.binom.pmf(3, 12, 1/6)

If you specify an array of possible values, you get back an array of probabilities. The probabilities of 4 sixes, 5 sixes, and so on through 12 sixes:

In [96]:
stats.binom.pmf(np.arange(4, 13, 1), 12, 1/6)
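
As a sanity check on what `stats.binom.pmf` returns, the sketch below recomputes the chance of 3 sixes in 12 rolls directly from the binomial formula $\binom{n}{k} p^k (1-p)^{n-k}$ and compares the two answers. The helper name `binom_pmf_by_hand` is hypothetical and is not part of the lab's code.

```python
import math

import numpy as np
from scipy import stats


def binom_pmf_by_hand(k, n, p):
    """Binomial (n, p) chance of exactly k successes, from the formula.

    Hypothetical helper for illustration only.
    """
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


by_hand = binom_pmf_by_hand(3, 12, 1/6)
from_scipy = stats.binom.pmf(3, 12, 1/6)
print(by_hand, from_scipy)   # the two numbers should agree

# The probabilities of all possible values 0, 1, ..., 12 add up to 1.
all_probs = stats.binom.pmf(np.arange(0, 13, 1), 12, 1/6)
print(all_probs.sum())
```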

That's it for review of lecture material.

Now plot the distribution of $G_{35}$ by filling in the blanks in the code below. Use the variable names $n$ and $p$ as defined above.

In [97]:
k = np.arange(...)
binom_35 = stats.binom.pmf(...)
green_35 = Table().values(...).probability(...)

Plot(...)

# The line below specifies the limits on the horizontal axis.
# Why 70? Wait and see.
# The -1 is an annoyance, but it's there because
# the 0 bar starts at -0.5. 
plt.xlim(-1, 70)

### Question 1.2: 70 Independent Bets on Green ###

Now, suppose there are 70 people in Prob140 and all 70 people bet independently on green. Let $G_{70}$ be the number of people who win. Plot the distribution of $G_{70}$ in the cell below. 

In [99]:
n = 70



green_70 = ...

Plot(...)

plt.xlim(-1, 70)
# The line below keeps the vertical scale
# the same as in the previous histogram,
# for comparability.
plt.ylim(0, 30);

Notice how the distribution of $G_{70}$ is wider and hence lower; also notice that its mass has shifted to the right compared to the distribution of $G_{35}$.

### Question 1.3:  700 Independent Bets on Green ###

Data 8 has about 700 students this term. Suppose each of 700 people bets independently on green, and let $G_{700}$ be the number of people who win. Plot the distribution of $G_{700}$, keeping the horizontal and vertical scales exactly the same as in the previous two histograms for comparability.

In [101]:





Plot(..., edges=True)
plt.ylim(0, 30)
plt.xlim(-1, 70);

Your histogram should be considerably wider and lower than the previous two, and should have an oh-so-familiar shape.

### Question 1.4: A Poisson Approximation

As seen in class, an appropriate Poisson distribution is a good approximation to the binomial when $p$ is small and $n$ is large or even moderate. 

As you know, we can calculate the probability mass function of the Poisson in Python using ``stats.poisson.pmf``. The Poisson distribution has just one positive parameter, usually called $\mu$. So for example to find all the chances that a Poisson (4.5) distribution has the values 3, 4, 5, you can do:

In [104]:
stats.poisson.pmf(np.arange(3, 6, 1), 4.5)
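
To see what `stats.poisson.pmf` is computing, here is a minimal sketch that evaluates the Poisson formula $e^{-\mu}\mu^k/k!$ by hand for the same three values and compares it with the library call. The variable names `by_hand` and `from_scipy` are chosen for illustration only.

```python
import math

import numpy as np
from scipy import stats

mu = 4.5
k_values = np.arange(3, 6, 1)   # the values 3, 4, 5

# Poisson (mu) chance of k, written out from the formula e^(-mu) * mu^k / k!
by_hand = np.array([
    math.exp(-mu) * mu**int(k) / math.factorial(int(k))
    for k in k_values
])
from_scipy = stats.poisson.pmf(k_values, mu)

print(by_hand)
print(from_scipy)   # should match the line above
```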

You're now ready to compare a binomial distribution with a Poisson approximation. Replot the distribution of $G_{35}$ so that the only bars in the plot are at 0, 1, 2, $\ldots$, 10. Don't worry about the limits on the vertical axis.

In [105]:

Plot(...)

Now try the Poisson (2) distribution as an approximation to the distribution of $G_{35}$. To do this, let $k$ be all the possible values of $G_{35}$, but plot only the bars at 0, 1, 2, $\ldots$, 10 for comparability with the binomial above.

In [None]:

n = 35
k = ...
probs_poi2 = ...
poi_2 = Table().values(...).probability(...)

Plot(poi_2)

Now use the ``Plots`` function to overlay the two histograms. We will use the syntax shown in lecture:

``Plots``(String name of distribution 1, dist1, String name of distribution 2, dist2) 

In [None]:
Plots("Poisson(2)", poi_2, "Binomial(35, 2/38)", green_35)
plt.xlim(-1, 10);

Not bad. But based on what you learned in lecture, can you think of a better approximation than Poisson (2)?

### Question 1.5: A Better Poisson Approximation ###
Overlay the binomial distribution of $G_{35}$ with the better Poisson approximation you have in mind.

In [None]:
k = range(...)
poi_aprx_better = stats.poisson.pmf(k, ...)
poi_better = Table().values(k).probability(poi_aprx_better)
Plots(..., ..., ..., ...)
plt.xlim(-1, 10);

### Question 1.6: Total Variation Distance

The second approximation looks better, but can we quantify which pair of distributions is closer?

In Data 8 you saw a metric called [Total Variation Distance](https://www.inferentialthinking.com/chapters/16/1/two-categorical-distributions.html).

Suppose you have two distributions on the same set of possible values $x_1, x_2, \ldots , x_n$. Let the two distributions be $p_1, p_2, \ldots, p_n$ and $r_1, r_2, \ldots, r_n$, where for each $i$, the $p$-distribution places mass $p_i$ on $x_i$ and the $r$-distribution places mass $r_i$ on $x_i$.

Then the total variation distance between the two distributions is defined by

$$
TVD(p, r) = 
\frac{1}{2} \sum_{i=1}^n |p_i - r_i| 
$$
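
As a small worked example (the numbers are made up for illustration and do not come from the lab), take three possible values with $p = (0.5, 0.3, 0.2)$ and $r = (0.4, 0.4, 0.2)$. Then

$$
TVD(p, r) =
\frac{1}{2} \big( |0.5 - 0.4| + |0.3 - 0.4| + |0.2 - 0.2| \big)
= \frac{1}{2} (0.1 + 0.1 + 0) = 0.1
$$

The mass that $p$ has in excess of $r$ (here 0.1, at the first value) always equals the mass that $r$ has in excess of $p$ (here 0.1, at the second value), because both distributions sum to 1. The factor of $\frac{1}{2}$ keeps that shared amount from being counted twice.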

Define a function `tvd` that takes an array `p` and an array `r` as its arguments and returns the total variation distance, assuming that each array is a probability distribution as described above.

In [106]:
def tvd(p, r):
    return ...

In your code above, you defined `binom_35` to be the complete array of binomial (35, 2/38) probabilities, and `probs_poi2` to be the Poisson (2) probabilities of those same values.

Use `tvd` to find the total variation distance between the two distributions. The Poisson (2) probabilities don't quite add up to 1, as you saw in lecture, but the sum is close enough to 1 that our computing system doesn't care.

In [108]:
tvd(binom_35, probs_poi2)

In [109]:
_ = autograder.grade('q2')

In [110]:
tvd(binom_35, poi_aprx_better)

In [111]:
_ = autograder.grade('q3')

#### Comprehension Check

Are the two TVDs consistent with your sense that the second approximation is better? Explain how.

*Provide your answer and reasoning in this Markdown cell.*

## Part 2. Poisson (1) 

### Question 2.1 
Plot the Poisson (1) distribution on the values 0, 1, $\ldots$, 10.

In [112]:
poi1_dist = ...

In [114]:
Plot(poi1_dist)

### Question 2.2 
In the cell below, overlay the Binomial (50, 1/50) distribution and the Poisson (1) distribution, on the same range of values as in the histogram above.

In [117]:
_ = autograder.grade('q4')

### Question 2.3 Total Variation Distance

As you saw in lecture, each term in the Binomial (n, 1/n) distribution converges to the corresponding term in the Poisson (1) distribution as $n$ gets large. In this question you will look at the entire Binomial (n, 1/n) distribution and compare it with the Poisson (1) distribution, for different values of $n$.

As a preliminary, define a function `bin_poi_tvd` that takes $n$, $p$, and $\mu$ as arguments and returns the total variation distance between the Binomial (n, p) and Poisson ($\mu$) distributions. Use the function `tvd` that you defined above. 

In [118]:
def bin_poi_tvd(n, p, mu):
    return ...

Construct an array `tvds` that contains the total variation distance of Binomial (n, 1/n) and Poisson (1) for $n$ = 2 through 10000.

In [22]:
tvds = ...
n_values = ...

for n in n_values:
    ...

tvds

Run the code below to plot the TVDs as a function of $n$. We are not plotting the TVDs beyond $n=100$ for reasons that will become clear from the graph.

In [31]:
tvd_table = Table().with_columns(
    "n", n_values,
    "TVD", tvds
)
tvd_table.plot(0)

plt.xlim(0, 100);

#### Comprehension Check

In class we showed that for each $k$, the Binomial (n, 1/n) probability of $k$ successes converges to the corresponding Poisson (1) probability. The graph above tells you something much stronger. What does it tell you?

## Part 3: Approximate Distribution of the Number of Matches

In this part, you will simulate the distribution of the number of matches in the classical letter/envelope matching problem you studied in class, and compare it to a Poisson distribution.

This is your first simulation in Prob140, so here are some reminders from Data 8.

### Question 3.1 Code Review

You will need:

- `np.random.choice`, which appears in the [Randomness section of the Data 8 textbook](https://www.inferentialthinking.com/chapters/08/randomness.html).
- `np.count_nonzero`, which also appears in that section.
- `np.diff`, which you used in Lab 1.

`np.random.choice` samples uniformly at random from a given array. The sample size is specified by the `size=` option. If you don't specify a size, a single element is returned; to draw as many elements as the array contains, set `size` to the length of the array. By default, the sampling is done with replacement. To sample without replacement, use the option `replace=False`.
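
The options above can be tried out on a small example. This is only a sketch: the array `colors` and the sample sizes are made up for illustration and are unrelated to the exercise that follows.

```python
import numpy as np

colors = np.array(['red', 'black', 'green', 'blue'])  # made-up array for illustration

# Six draws with replacement (the default): repeats are possible,
# and the sample can be larger than the array.
with_repl = np.random.choice(colors, size=6)

# Three draws without replacement: every drawn element is distinct.
without_repl = np.random.choice(colors, size=3, replace=False)

print(with_repl)
print(without_repl)
```

Because the draws are random, the printed arrays change from run to run; only the sizes, and the absence of repeats in the second sample, are guaranteed.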

Create an array `shuffle` that is a random permutation of letters labeled 1, 2, $\ldots$, 10.

In [60]:
shuffle = ...
shuffle

`np.count_nonzero` takes an array as its argument and returns the number of nonzero values. Run the following cells and check that you understand the output, especially of the third cell. Remember that `False` is 0 and `True` is 1.

In [62]:
np.count_nonzero(make_array(5, -1, 0, 118.4))

In [63]:
x = make_array(1, 2, 3, 4)
np.count_nonzero(x == 3)

In [64]:
x = make_array(1, 0, 0, 4, 0)
np.count_nonzero(x == 0)
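
The third tool in the list, `np.diff`, returns the differences between consecutive elements of an array, so the result is one element shorter than the input. A minimal sketch on a made-up array (`values` and `gaps` are illustrative names, not part of the lab), combined with `np.count_nonzero`:

```python
import numpy as np

values = np.array([2, 5, 4, 4, 10])  # made-up array for illustration

# Each entry is values[i+1] - values[i]: here 3, -1, 0, 6.
gaps = np.diff(values)
print(gaps)

# Count how many adjacent pairs are equal, that is, have difference 0.
num_equal_neighbors = np.count_nonzero(gaps == 0)
print(num_equal_neighbors)   # 1
```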

**Simulation Review.** In Data 8 you frequently simulated a variety of random variables, such as the sample median or predicted values at a given value of one variable. All of the simulations had the same main steps.

- Start with an empty array in which you will collect all the simulated values. The expression `make_array()` is one way of doing this.
- Write a `for` loop that runs through the number of repetitions that you want.
- The body of the loop contains the simulation of *one* value of the random quantity, and appends that value to your array of simulated values.
- At the end you should have an array that contains the same number of simulated values that you specified as the number of repetitions.
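
The steps above can be sketched on a different random quantity: the number of sixes in 12 rolls of a die. This is an illustrative sketch only. It uses `np.array([])` as a stand-in for `make_array()` so that it runs without the `datascience` library, and the names `faces` and `sixes` are made up for this example.

```python
import numpy as np

repetitions = 1000                 # number of simulated values we want
faces = np.arange(1, 7)            # the six faces of a die
sixes = np.array([])               # empty collection array (stand-in for make_array())

for i in range(repetitions):
    rolls = np.random.choice(faces, size=12)      # one repetition: 12 rolls
    num_sixes = np.count_nonzero(rolls == 6)      # one simulated value
    sixes = np.append(sixes, num_sixes)           # append it to the collection

print(len(sixes))          # equals repetitions
print(np.mean(sixes))      # should be near 12 * (1/6) = 2
```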


### Question 3.2: Simulate the Number of Matches

Simulate the number of matches in the matching problem with $n=100$ letters/envelope pairs. Specifically, let $M$ be the number of matches.
- Simulate $M$ 10000 independent times.
- Construct an array `num_matches` that contains the results of the 10000 simulations. In the end, `num_matches` should contain 10000 elements, the $k$th element being the number of matches you got in your $k$th replication of the experiment.

It will help to remember that for two numbers x and y, the statement "x == y" is equivalent to "x - y == 0".

In [65]:
repetitions = 10000
n = 100
num_matches = ...
for i in range(repetitions):
    ...


The following code will take your array `num_matches` and convert it into a distribution called `emp_dist` by grouping the values to find the number of occurrences of each match value observed. **Please run this cell, and don't change its contents.**

In [1]:
match_counts = Table().with_column('matches', num_matches).group(0)
emp_dist = Table().values(match_counts.column(0))
emp_dist = emp_dist.probability(match_counts.column(1)/repetitions)

### Question 3.3: The Empirical Distribution of $M$

Now let's plot a histogram of `emp_dist`.


In [68]:

plt.xlim(-1, 10)

### Question 3.4 A Poisson (1) Approximation

Now, overlay `emp_dist` and the Poisson(1) distribution. 

In [70]:
#STUDENTSONLY


### Question 3.5

You can see that the empirical distribution of $M$ that you got by simulation is very close to Poisson (1). Give a qualitative explanation of why it is close.

For extra credit, you will derive the exact distribution of $M$ by counting. See Part 5 below.

*Provide your answer and reasoning in this Markdown cell.*

## Part 4. The Number of "Unseparated Pairs" ##

In a permutation, an "unseparated pair" is a pair $k$, $k+1$ that appears without any other element in between. For example, suppose you are shuffling the values [1, 2, 3, 4, 5] and get the permutation [5, 3, 4, 1, 2]. This permutation has two unseparated pairs: (3, 4) and (1, 2). The permutation [5, 4, 3, 2, 1] has no unseparated pairs.

Set $n=100$ and let $U$ be the number of unseparated pairs in a random permutation of 1, 2, $\ldots, n$. Simulate the distribution of $U$ and plot the empirical histogram.

Note: Use 10000 repetitions as you did in Part 3, and follow the general scheme of simulation outlined in that part. The only difference will be in the code needed to count the number of unseparated pairs in one shuffled deck. The Code Review section in Part 3 will help.

In [37]:

num_pairs = ...

Run the cell below to generate the empirical distribution.

In [36]:
pair_counts = Table().with_column('pairs', num_pairs).group(0)
emp_dist = Table().values(pair_counts.column(0))
emp_dist = emp_dist.probability(pair_counts.column(1)/repetitions)

Plot(emp_dist)
plt.xlim(-1, 10);

This distribution should look very familiar by now. Which distribution does it resemble?

**Note.** It's not very easy to see why this distribution is roughly Poisson (1), but nor is it completely out of your reach. The details are in [a paper](https://arxiv.org/pdf/1308.5459v2.pdf) by Diaconis, Evans (of our department), and Graham. Take a look at the paragraph above Theorem 1.1 on page 2; you'll recognize a name there. The proof that I think is within your reach is on page 4.

In the notation of the paper, $S_n$ is the number of unseparated pairs and $T_n$ is the number of fixed points. The authors show in several different ways that $S_n$ and $T_n$ have the same distribution. Since you have already observed that $T_n$ is roughly Poisson (1), and will prove it if you do the Extra Credit part below, you will also have shown that $S_n$ is roughly Poisson (1).

## Part 5. Extra Credit ##
In this part, you will derive the exact distribution of the number of matches and show that it is roughly Poisson (1) when $n$ (the number of letter/envelope pairs) is large.

Let $M_n$ be the number of matches. To find the distribution of $M_n$, you need $P(M_n = k)$ for all $k = 0, 1, 2, \ldots , n$. So fix a $k$ in this range.

For there to be exactly $k$ matches (no more and no fewer), there must be a set of $k$ envelopes each of which contains the correct letter, and there must be no matches in the remaining $n-k$ places.



### Question 5.1 ###
In how many ways can you choose $k$ places for the $k$ matches?


*Provide your answer and reasoning in this Markdown cell.*

### Question 5.2 ###
Fix any one of the choices in 5.1. What is the chance that there are matches in all those places (with no restrictions or conditions on the other $n-k$ places)?


*Provide your answer and reasoning in this Markdown cell.*


### Question 5.3 ###
Given that there are matches in a specified set of $k$ places, there remain $n-k$ envelopes and $n-k$ letters *that have the same labels*. In other words, the remaining $n-k$ places are their own little matching setup.

If you have $n-k$ letter/envelope pairs, what is the chance that there are no matches? These are the *derangements* that we studied in class; look up the Course Notes.


*Provide your answer and reasoning in this Markdown cell.*

### Question 5.4 ###
Now put your answers to 5.1 through 5.3 together to find an exact formula for $P(M_n = k)$.

*Provide your answer and reasoning in this Markdown cell.*

### Question 5.5 ###
Explain why the distribution of the number of matches is approximately Poisson (1) for large $n$.

*Provide your answer and reasoning in this Markdown cell.*

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()