# Symbulate Lab 3 - Discrete Distributions

This Jupyter notebook provides a template for you to fill in.  Read the notebook from start to finish, completing the parts as indicated.  To run a cell, make sure the cell is highlighted by clicking on it, then press SHIFT + ENTER on your keyboard.  (Alternatively, you can click the "play" button in the toolbar above.)

In this lab you will use the Symbulate package.  You should have completed [Section 2](https://github.com/dlsun/symbulate/blob/master/tutorial/gs_rv.ipynb) of the "Getting Started Tutorial" and read Sections 1-4 and parts of Section 5 of the [documentation](https://dlsun.github.io/symbulate/index.html) (you can ignore parts about continuous random variables for now).  A few specific links to the documentation are provided below, but it will probably make more sense if you read the documentation from start to finish.  **You should use Symbulate commands whenever possible.**  If you find yourself writing long blocks of Python code, you are probably doing something wrong.  For example, you should not need to write any `for` loops.

Remember to run the next cell first.

In [1]:
from symbulate import *
%matplotlib inline

## Part I: Binomial and Hypergeometric distributions

Shuffle a standard deck of 52 cards (13 hearts and 39 other cards) and draw 5. Consider the number of hearts drawn.

## Problem 1

First suppose the draws are made **with replacement**, and let $X$ represent the number of hearts among the 5 cards drawn.

### a)

Define a probability space `P` in which an outcome corresponds to an ordered sequence of draws **with replacement.**   (Hint: you only need to consider whether a card is a heart or not.  Let 1 represent a heart, and 0 not a heart.  See the examples for [BoxModel](https://dlsun.github.io/symbulate/probspace.html#boxmodel); use the `probs` argument as in `In [6]:`, or a dictionary-like input as in `In [7]:`.)

In [2]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

Define a `RV` $X$ on the probability space `P` which counts the number of hearts.  (Hint: what simple function will count the number of 1s in a sequence of 0/1s?)

In [3]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### c)

Simulate 10000 values of $X$, store the values in a variable `x`, and summarize its approximate distribution in a table.

In [4]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### d)

Display the approximate distribution of $X$ in a plot.  Overlay the true probability mass function on the plot.  ([Hint](https://dlsun.github.io/symbulate/common_discrete.html#binomial).)

In [5]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Use the simulation results to estimate $P(X=3)$.  Enter the appropriate Symbulate commands below; don't just use the above table.

In [6]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Use the [.pdf() method](https://dlsun.github.io/symbulate/common_general_comments.html#pdf) to calculate the exact value of $P(X=3)$.  (Hint: what is the name of the distribution of $X$ in this case?)  Compare the approximation from the previous part with the exact value; recall that a relative frequency based on $N$ repetitions of a simulation is likely to be within $1/\sqrt{N}$ of the true probability.
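
As a rough illustration of the $1/\sqrt{N}$ rule of thumb, the plain-Python sketch below computes the margin for a few values of $N$ and checks whether an estimate falls within it of an exact value.  The helper names `proportion_margin` and `within_margin` are hypothetical and are not part of Symbulate; you would still obtain the estimate and the exact value with Symbulate commands.

```python
import math

def proportion_margin(n_reps):
    """Rule-of-thumb margin of error for a simulated relative frequency."""
    return 1 / math.sqrt(n_reps)

def within_margin(estimate, exact, n_reps):
    """True if the estimate is within 1/sqrt(n_reps) of the exact value."""
    return abs(estimate - exact) <= proportion_margin(n_reps)

# The margin shrinks slowly: 100 times more repetitions gives a margin 10 times smaller.
for n_reps in (100, 10000, 1000000):
    print(n_reps, proportion_margin(n_reps))
```

With $N = 10000$ repetitions the margin is 0.01, so an estimate and an exact probability that differ by less than 0.01 are consistent with each other.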

In [7]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### g)

Use the simulation results to estimate $E(X)$.  Compare the approximate expected value with the theoretical expected value.  (A mean based on $N$ repetitions of a simulation is likely to be within $2SD(X)/\sqrt{N}$ of the true expected value.)
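
The margin for a mean can be sketched the same way.  The plain-Python example below assumes the standard deviation is estimated from the simulated values themselves (the rule of thumb above uses the true $SD(X)$); the function name `mean_margin` and the short list of values are made up for illustration and are not Symbulate output.

```python
import math
import statistics

def mean_margin(values):
    """Rule-of-thumb margin 2 * SD / sqrt(N), with SD estimated from the sample."""
    return 2 * statistics.pstdev(values) / math.sqrt(len(values))

# Made-up values standing in for simulated counts (not real simulation output).
sample = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1]
print(statistics.mean(sample), mean_margin(sample))
```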

In [8]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

## Problem 2

Now suppose the draws are made **without replacement**, and let $Y$ represent the number of hearts among the 5 cards drawn.

### a)

Define a probability space `Q` in which an outcome corresponds to an ordered sequence of draws **without replacement.**   (Hint: As in Problem 1, you only need to consider whether a card is a heart or not, but now it is necessary to specify the actual number of cards of each type.)

In [9]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

Define a `RV` $Y$ on the probability space `Q` which counts the number of hearts.

In [10]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### c)

Simulate 10000 values of $Y$, store the values in a variable `y`, and summarize its approximate distribution in a table.

In [11]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### d)

Display the approximate distribution of $Y$ in a plot.  Overlay the true probability mass function on the plot.  ([Hint](https://dlsun.github.io/symbulate/common_discrete.html#hyper).  Also, see Handout 11.)

In [12]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Use the simulation results to estimate $P(Y=3)$.  Enter the appropriate Symbulate commands below; don't just use the above table.

In [13]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Use the [.pdf() method](https://dlsun.github.io/symbulate/common_general_comments.html#pdf) to calculate the exact value of $P(Y=3)$.  (Hint: See Handout 11.  What is the name of the distribution of $Y$ in this case?)  Compare the approximation from the previous part with the exact value; recall that a relative frequency based on $N$ repetitions of a simulation is likely to be within $1/\sqrt{N}$ of the true probability.

In [14]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### g)

Use the simulation results to estimate $E(Y)$.  Compare the approximate expected value with the theoretical expected value.  (A mean based on $N$ repetitions of a simulation is likely to be within $2SD(Y)/\sqrt{N}$ of the true expected value.)  Also compare the expected value of $Y$ (without replacement) and $X$ (with replacement); are these values within the margin of error of each other?

In [15]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### h)

Compare your results from Problems 1 and 2.  How does the distribution of the number of hearts drawn change between with and without replacement?  Are the expected values the same?  (No written response is required; just think about it.)

## Part II: Poisson approximation of the Binomial

When $n$ is "large" and $p$ is "small", a Binomial($n$, $p$) distribution is well approximated by a Poisson($np$) distribution.  This part illustrates this fact.

Let $X$ have a Binomial distribution with $n$ trials and probability of success on each trial $p=\lambda /n$, where $\lambda$ is a constant.  When $n$ is large, the number of trials is large but the probability of success on any single trial is small.  Note that the expected value of $X$ is $n(\lambda/n) = \lambda$, which does not depend on $n$.

We will assume $\lambda = 3$.
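
Before simulating, it may help to see the approximation numerically.  The sketch below uses only the Python standard library (not Symbulate) to compute the exact Binomial($n$, $3/n$) and Poisson(3) probability mass functions and report the largest gap between them.  The function names `binomial_pmf`, `poisson_pmf`, and `max_gap` are hypothetical helpers introduced for this illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_gap(n, lam):
    """Largest absolute difference between the two pmfs over k = 0, ..., n."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(n + 1))

lam = 3
for n in (10, 100):
    print(n, max_gap(n, lam))
```

The gap is smaller for $n=100$ than for $n=10$, which is the pattern the simulations in this part should show graphically.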

### a)

Let $n=10$.

- Define a `RV` $X$ which has a Binomial($n$, $3/n$) distribution.  ([Hint](https://dlsun.github.io/symbulate/rv.html#distribution), also refer to Example 2.7 in the Symbulate tutorial.)
- Simulate 10000 values of $X$ and plot the approximate distribution.
- Overlay the Poisson(3) probability mass function.

Does a Poisson(3) distribution seem like a good approximation of a Binomial(10, 3/10) distribution?

In [16]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

Let $n=100$.

- Define a `RV` $X$ which has a Binomial($n$, $3/n$) distribution.  ([Hint](https://dlsun.github.io/symbulate/rv.html#distribution), also refer to Example 2.7 in the Symbulate tutorial.)
- Simulate 10000 values of $X$ and plot the approximate distribution.
- Overlay the Poisson(3) probability mass function.

Does a Poisson(3) distribution seem like a good approximation of a Binomial(100, 3/100) distribution?

In [17]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

## Part III: Poisson approximation in the matching problem

Consider the matching babies problem again. (Last time, I promise!)  There are $n$ mothers and $n$ babies, and one baby is returned to each mother completely at random.  Let $X$ represent the number of babies that are returned to the correct mother.

Recall that in HW1 you used an [applet](http://www.rossmanchance.com/applets/randomBabies/RandomBabies.html) to run simulations for different values of $n$.  You should have observed

- Regardless of the value of $n$, the expected value of $X$ is 1.
- Aside from the smallest values of $n$, the probability of at least one match was about 0.63.
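
For small $n$, these two observations can be checked exactly by listing every possible assignment of babies to mothers.  The sketch below is plain Python rather than Symbulate, and the function name `matching_summary` is a hypothetical helper; it is meant as a sanity check on the applet results, not as a substitute for the Symbulate simulations requested later in this part.

```python
from fractions import Fraction
from itertools import permutations

def matching_summary(n):
    """Exact E(X) and P(X >= 1) by listing all n! equally likely assignments."""
    total_matches = 0
    at_least_one = 0
    n_perms = 0
    for perm in permutations(range(n)):
        # A match occurs when baby i is returned to mother i.
        matches = sum(perm[i] == i for i in range(n))
        total_matches += matches
        at_least_one += matches >= 1
        n_perms += 1
    return Fraction(total_matches, n_perms), Fraction(at_least_one, n_perms)

for n in range(1, 8):
    mean, prob = matching_summary(n)
    print(n, mean, float(prob))
```

Exhaustive listing is only practical for small $n$ because the number of assignments grows as $n!$.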

See HW1 solutions in PL for a refresher.

We will investigate these two properties further in this part.  To put this problem in the context of what we have been discussing this week:

- Each return of a baby to a mother can be considered a trial.
- Each trial results in success (the baby is returned to the correct mother) or failure (not).
- There are a fixed number of trials, $n$.

So far, the conditions for the Binomial situation are satisfied.  But does $X$ have a Binomial distribution?

### a)

What is the probability that any particular mother receives the correct baby?  Is the probability of success the same for each trial?

**TYPE YOUR EXPLANATION HERE.**

### b)

Are the trials independent?  Does $X$ have a Binomial distribution?

**TYPE YOUR EXPLANATION HERE.**

### c)

In Part II you saw how Poisson distributions can sometimes approximate Binomial distributions.  But Poisson approximations are valid much more generally.  In particular, unless $n$ is really small, the number of matches $X$ in the matching problem has an approximate Poisson distribution with mean 1.

Explain why $E(X)=1$ regardless of $n$.  You don't need to give a proof, but do think of a reasonable explanation.  (Hint: consider part a) of Part III.  Also consider your comparison of Binomial and Hypergeometric from Part I, and the means in particular; what happens here is similar.)

**TYPE YOUR EXPLANATION HERE.**

### d)

Now you will use simulation to approximate the distribution of $X$ when $n=6$.

- Label the babies $0, 1, \ldots, n-1$ (the code `labels = list(range(n))` below does this).
- Define an appropriate probability space `P` in which an outcome corresponds to the ordered shuffling of the babies.
- Define a `RV` $X$ on the probability space `P` through an appropriate function.  You can use the `number_matches` function below.
- Simulate 10000 values of $X$ and display the approximate distribution in a plot.
- Overlay the Poisson(1) probability mass function.
- Optional: use the simulation results to approximate $P(X\ge 1)$ and $E(X)$.  This is optional because you already did it in HW1 using the applet, but make sure you know how to do it in Symbulate.

Does a Poisson(1) distribution seem like a good approximation to the distribution of $X$ when $n=6$?

In [18]:
n = 6

labels = list(range(n))

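def number_matches(x):
    # Count the positions i where the baby in position i of the shuffle x
    # has the same label as mother i. Uses the global n and labels above.
    count = 0
    for i in range(n):
        if x[i] == labels[i]:
            count += 1
    return count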

In [19]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Pick another value of $n\ge 6$ and repeat part d).
Does a Poisson(1) distribution seem like a good approximation to the distribution of $X$ for this value of $n$?

In [20]:
n = 6 # BE SURE TO CHANGE THIS VALUE

labels = list(range(n))

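def number_matches(x):
    # Count the positions i where the baby in position i of the shuffle x
    # has the same label as mother i. Uses the global n and labels above.
    count = 0
    for i in range(n):
        if x[i] == labels[i]:
            count += 1
    return count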

In [21]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Given that the number of matches $X$ has an approximate Poisson(1) distribution, explain why the probability of at least one match is approximately 0.63 (for all but the smallest values of $n$).

**TYPE YOUR EXPLANATION HERE.**

## Submission Instructions

Before you submit this notebook, click the "Kernel" drop-down menu at the top of this page and select "Restart & Run All". This will ensure that all of the code in your notebook executes properly. Please fix any errors, and repeat the process until the entire notebook executes without any errors.