# Symbulate Lab 1 - Probability Spaces

This Jupyter notebook provides a template for you to fill in.  Complete the parts as indicated.  To run a cell, hold down SHIFT and hit ENTER.

In this lab you will use the Python package [Symbulate](https://github.com/dlsun/symbulate).  You should have already completed Section 1 of the "Getting Started with Symbulate Tutorial" **ADD LINK** and read sections 1 through 3 of the [Symbulate documentation](https://dlsun.github.io/symbulate/index.html).  A few specific links to the documentation are provided below, but it will probably make more sense if you read the documentation from start to finish.  **Whenever possible, you should use Symbulate commands, not general Python code.**

To use Symbulate, you must first run (SHIFT-ENTER) the following commands.

In [1]:
from symbulate import *
%matplotlib inline

## Part I. Introduction to Symbulate, and conditional versus unconditional probability

A deck of 16 cards contains 4 cards in each of four suits ['club', 'diamond', 'heart', 'spade'].  The deck is shuffled and two cards are drawn in sequence.  We are interested in the following questions.

1. What is the probability that the first card drawn is a heart?
1. What is the probability that the second card drawn is a heart?
1. If the first card drawn is a heart, what is the probability that the second card drawn is a heart?

Before proceeding, give your best guess of each of these probabilities.

We'll use simulation to obtain approximations to the probabilities in the questions above.  First we define the deck of cards (we only care about the suits for this exercise).

In [2]:
cards = ['club', 'diamond', 'heart', 'spade'] * 4  # 4 cards of each suit
len(cards)

16

Now we define a [`BoxModel`](https://dlsun.github.io/symbulate/probspace.html#boxmodel) probability space corresponding to drawing two cards (`size=2`) from the deck at random.  We'll assume that the cards are drawn without replacement (`replace=False`).  We also want to keep track of which card was drawn first and which second (`order_matters=True`).  

In [3]:
P = BoxModel(cards, size=2, replace=False, order_matters=True)

The `.draw()` method simulates a single outcome from the probability space.  Note that each outcome is an ordered pair of cards.

In [4]:
P.draw()

(spade, spade)

Many outcomes can be simulated using `.sim()`. The following simulates 10000 draws and stores the results in the variable `sims`.

In [5]:
sims = P.sim(10000)
sims

Index,Result
0,"(spade, heart)"
1,"(club, club)"
2,"(heart, heart)"
3,"(heart, diamond)"
4,"(heart, club)"
5,"(club, diamond)"
6,"(spade, heart)"
7,"(heart, diamond)"
8,"(diamond, spade)"
...,...


We can summarize the simulation results with `.tabulate()`.  Note that `('heart', 'club')` is counted as a different outcome from `('club', 'heart')` because the order matters.

In [6]:
sims = P.sim(10000)
sims.tabulate()

Outcome,Value
"(club, club)",495
"(club, diamond)",672
"(club, heart)",631
"(club, spade)",650
"(diamond, club)",656
"(diamond, diamond)",465
"(diamond, heart)",661
"(diamond, spade)",696
"(heart, club)",702
"(heart, diamond)",697


The above table could be used to estimate the probabilities in question.  Instead, we will illustrate several other tools available in Symbulate to summarize simulation output.

First, we use a `filter` function to create a subset of the simulated outcomes for which the first card is a heart.  We define a function `first_is_heart` that takes as an input a pair of values (`x`) and returns `True` if the first value in the pair (`x[0]`) is equal to `'heart'`, and `False` otherwise. (Python indexing starts at 0: 0 is the first entry, 1 is the second, and so on.)

In [7]:
def first_is_heart(x):
    return (x[0] == 'heart')

first_is_heart(('heart', 'club'))

True

In [8]:
first_is_heart(('club', 'heart'))

False

Now we `filter` the simulated outcomes to create the subset of outcomes for which `first_is_heart` returns `True`.

In [9]:
sims_first_is_heart = sims.filter(first_is_heart)
sims_first_is_heart.tabulate()

Outcome,Value
"(heart, club)",702
"(heart, diamond)",697
"(heart, heart)",530
"(heart, spade)",643
Total,2572


Returning to question 1, we can estimate the probability that the first card is a heart by dividing the number of simulated draws for which the first card is a heart by the total number of simulated draws (using the length function `len` to count).

In [10]:
len(sims_first_is_heart) / len(sims)

0.2572

The true probability is 4/16 = 0.25.  Your simulated probability should be close to 0.25, but there will be some natural variability due to the randomness in the simulation.  Very roughly, the margin of error of a probability estimate based on $N$ simulated repetitions is about $1/\sqrt{N}$, so about 0.01 for 10000 repetitions. The interval constructed by adding $\pm 0.01$ to your estimate will likely contain 0.25.
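As a rough illustration of the $1/\sqrt{N}$ rule of thumb, the plain-Python sketch below prints the approximate margin of error for a few simulation sizes. The helper name `margin_of_error` is hypothetical (it is not a Symbulate command), and the rule itself is only the rough approximation described above.

```python
import math

def margin_of_error(n_reps):
    """Rough margin of error, 1/sqrt(N), for a probability estimated from n_reps repetitions."""
    return 1 / math.sqrt(n_reps)

# Compare a few simulation sizes (these particular sizes are just examples).
for n_reps in [100, 10000, 1000000]:
    print(n_reps, margin_of_error(n_reps))
```

Because the margin shrinks like $1/\sqrt{N}$, cutting it by a factor of 10 takes roughly 100 times as many repetitions.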

## a)

Recall question 2: What is the probability that the second card drawn is a heart? Use an analysis similar to the above &mdash; including defining an appropriate function to use with `filter` &mdash; to estimate the probability.  (Is your simulated value close to your initial guess?)

Type your commands in the following code cell.  Aside from defining a `second_is_heart` function and using `len`, you should use Symbulate commands exclusively.

In [11]:
# Type your Symbulate commands in this cell.

## b)

Many people confuse the probabilities in (2) and (3).  The probability in (2) is an *unconditional* probability: we do not know whether or not the first card is a heart so we need to account for both possibilities.  All we know is that each of the 16 cards in the deck is equally likely to be shuffled into the second position, so the probability that the second card is a heart (without knowing what the first card is) is 4/16 = 0.25.

In contrast, the probability in question 3 is a *conditional* probability: *given that the first card drawn is a heart*, what is the probability that the second card drawn is a heart?  Again, aside from maybe defining a new `is_heart` function and using `len`, you should use Symbulate commands exclusively.

In [12]:
# Type your Symbulate commands in this cell.

## c)

Given that the first card is a heart, there are 15 cards left in the deck, each of which is equally likely to be the second card, of which 3 are hearts.  So the conditional probability that the second card is a heart given that the first card is a heart is 3/15 = 0.20.  Verify that your simulated value is consistent with the true value.
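The two true values quoted in this part (4/16 for the unconditional probability and 3/15 for the conditional probability) can be double-checked by listing every equally likely ordered pair of cards. The sketch below uses only the Python standard library, not Symbulate, so it is meant as a cross-check of the arithmetic rather than as a model answer for the cells above; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Same 16-card deck as above: 4 cards of each suit.
cards = ['club', 'diamond', 'heart', 'spade'] * 4

# All ordered pairs of distinct card positions (drawing without replacement).
pairs = list(permutations(range(len(cards)), 2))

second_heart = [(i, j) for (i, j) in pairs if cards[j] == 'heart']
first_heart = [(i, j) for (i, j) in pairs if cards[i] == 'heart']
both_hearts = [(i, j) for (i, j) in first_heart if cards[j] == 'heart']

# Unconditional: count over all pairs.
p_second_heart = Fraction(len(second_heart), len(pairs))
# Conditional: count only among pairs whose first card is a heart.
p_second_given_first = Fraction(len(both_hearts), len(first_heart))

print(p_second_heart)
print(p_second_given_first)
```

The only difference between the two calculations is the denominator: all pairs for the unconditional probability, and only the pairs whose first card is a heart for the conditional one.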

Now you will do a few calculations by hand.

1. Compute, by hand, the conditional probability that the second card is a heart given that the first card is NOT a heart.
1. Construct, by hand, a hypothetical two-way table representing the results of 10000 draws.
1. Use the hypothetical table to compute the probability that the second card is a heart.
1. What is the relationship between the probability that the second card is a heart and the two conditional probabilities?

(Nothing to respond here, just make sure you understand the answers.)

## d)

How would the answers to the previous questions change if the draws were made with replacement (so that the first card is replaced and the deck reshuffled before the second card is drawn)?  In this case, what can we say about the events "the first card is a heart" and "the second card is a heart"?

**Type your response here.**

## Part II.  Collector's problem

Each box of a certain type of cereal contains one of $n$ distinct prizes and you want to obtain a complete set. Suppose that each box of cereal is equally likely to contain any one of the $n$ prizes, and the particular prize that appears in one box has no bearing on the prize that appears in another box. You purchase cereal boxes one box at a time until you have the complete set of $n$ prizes.  What is the probability that you buy more than $k$ boxes?  In this problem you will use simulation to estimate this probability for different values of $n$ and $k$.

Here is a little Python code you can use to label the $n$ prizes from 0 to $n-1$.  (Remember: Python starts indexing at 0.)

In [13]:
n = 10
prizes = list(range(n))
prizes

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

And here is a function that returns the number of distinct prizes collected among a set of prizes.

In [14]:
def number_collected(x):
    return len(set(x))

# For example
number_collected([2, 1, 2, 0, 2, 2, 0])

3

**Aside from the above functions, you should use Symbulate commands exclusively for Part II.**

## Problem 1.

We'll assume that there are 3 prizes, $n=3$, a situation in which exact probabilities can easily be computed by enumerating the possible outcomes.

In [15]:
n = 3
prizes = list(range(n))
prizes

[0, 1, 2]
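For small $n$ and $k$, the enumeration mentioned above can be sketched in plain Python (standard library only, not Symbulate). The sketch relies on the observation that more than $k$ boxes are needed exactly when the first $k$ boxes contain fewer than $n$ distinct prizes. The function name `prob_more_than` and the small case $n=2$, $k=3$ are hypothetical choices for illustration, and the approach is only practical when $n^k$ is small.

```python
from fractions import Fraction
from itertools import product

def prob_more_than(n, k):
    """Exact P(more than k boxes needed) for n equally likely prizes, by enumeration."""
    # All n**k equally likely sequences of prizes in the first k boxes.
    outcomes = list(product(range(n), repeat=k))
    # Sequences in which the set is still incomplete after k boxes.
    incomplete = [x for x in outcomes if len(set(x)) < n]
    return Fraction(len(incomplete), len(outcomes))

# Hypothetical small case: n = 2 prizes, k = 3 boxes.
print(prob_more_than(2, 3))
```

An exact value computed this way gives you something to compare your simulated estimate against, in the same way the true values 0.25 and 0.20 were used in Part I.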

### a)

Define a probability space for the sequence of prizes obtained after buying $3$ boxes (first box, second box, third box), and simulate a single outcome.  (Hint: try [BoxModel](https://dlsun.github.io/symbulate/probspace.html#boxmodel).)

In [16]:
# Type your Symbulate commands in this cell.

### b)

Now simulate many outcomes and summarize the results.  Does it appear that each sequence of prizes is equally likely?  (Hint: try the various [Simulation tools](https://dlsun.github.io/symbulate/sim.html#sim) like `.sim()` and `.tabulate()`.)

In [17]:
# Type your Symbulate commands in this cell.

### c)

Count the number of distinct prizes collected for each of the simulated outcomes using the `number_collected` function.  (Hint: try [`.apply()`](https://dlsun.github.io/symbulate/sim.html#apply).)

In [18]:
# Type your Symbulate commands in this cell.

### d)

Use the simulation results to estimate the probability that more than $k=3$ boxes are needed to complete a set of $n=3$ prizes.  (Hint: see the [summary of the simulation tools](https://dlsun.github.io/symbulate/sim.html#summary) for a few suggestions.)

In [19]:
# Type your Symbulate commands in this cell.

## Problem 2.

Use simulation to estimate the probability that more than $k=100$ boxes are needed to complete a set of $n=20$ prizes, a situation for which it is extremely difficult to compute the probability analytically.

In [20]:
# Type your Symbulate commands in this cell.

## Problem 3.

How large a group of people is needed to have a probability greater than 0.5 that on every day of the year someone in the group has a birthday?  Greater than 0.9?  Greater than 0.99?  (Assume 365 equally likely birthdays, no multiples, etc.)  Before coding, I encourage you to make some guesses for the answers first.

Formulate this scenario as a collector's problem and experiment with values of $n$ or $k$ until you are satisfied.  (You don't have to get any fancier than guess-and-check, but you can if you want.)

In [21]:
# Type your relevant code in this cell for 0.5

In [22]:
# Type your relevant code in this cell for 0.9

In [23]:
# Type your relevant code in this cell for 0.99

## Problem 4.

Now suppose that some prizes are harder to find than others.  In particular, suppose that the prizes are labeled 1, 2, 3, 4, 5.  Assume that prize 2 is twice as likely as prize 1, prize 3 is three times as likely as prize 1, prize 4 is four times as likely as prize 1, and prize 5 is five times as likely as prize 1.

Estimate the probability that you'll need to buy more than 20 boxes to obtain a complete set.  How does this probability compare to the probability in the equally likely situation?
Hint: define a [BoxModel with a dictionary-like input](https://dlsun.github.io/symbulate/probspace.html#dictionary).
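
The relative likelihoods in the problem statement determine the probability of each prize once they are normalized to sum to 1. The plain-Python sketch below does that arithmetic; the names `weights` and `probs` are illustrative, and you should check the linked documentation for exactly what a dictionary-like `BoxModel` input expects.

```python
from fractions import Fraction

# Relative likelihoods from the problem statement:
# prize i is i times as likely as prize 1.
weights = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

# Normalize so that the probabilities sum to 1.
total = sum(weights.values())
probs = {prize: Fraction(w, total) for prize, w in weights.items()}

for prize, p in probs.items():
    print(prize, p)
```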

In [24]:
# Type your Symbulate commands in this cell.

## Submission Instructions

Before you submit this notebook, click the "Kernel" drop-down menu at the top of this page and select "Restart & Run All". This will ensure that all of the code in your notebook executes properly.