# Symbulate Lab 2 - Random Variables

This Jupyter notebook provides a template for you to fill in.  Read the notebook from start to finish, completing the parts as indicated.  To run a cell, make sure the cell is highlighted by clicking on it, then press SHIFT + ENTER on your keyboard.  (Alternatively, you can click the "play" button in the toolbar above.)

In this lab you will use the Symbulate package.  You should have completed [Section 2](https://github.com/dlsun/symbulate/blob/master/tutorial/gs_rv.ipynb) of the "Getting Started Tutorial" and read Sections 1-4 of the [documentation](https://dlsun.github.io/symbulate/index.html) (you can ignore parts about continuous random variables for now).  A few specific links to the documentation are provided below, but it will probably make more sense if you read the documentation from start to finish.  **You should use Symbulate commands whenever possible.**  If you find yourself writing long blocks of Python code, you are probably doing something wrong.  For example, you should not need to write any `for` loops.

**Warning:** You may notice that many of the cells in this notebook are not editable. This is intentional and for your own safety. We have made these cells read-only so that you don't accidentally modify or delete them. However, you should still be able to execute the code in these cells.

In [1]:
from symbulate import *
%matplotlib inline

## Part I

Cards labeled $1, 2, \ldots, n$ are shuffled and $k$ are drawn out.  Let $X$ represent the smallest number drawn.  For example, if the cards drawn are (4, 10, 4, 7, 6) then $X=4$.  In this problem you will investigate the distribution of the random variable $X$ and how it changes depending on whether the cards are drawn with or without replacement.

We will assume $n=10$ and $k=5$.

In [2]:
k = 5
n = 10
cards = list(range(1, n + 1))
cards

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

### Problem 1:  Assume the cards are drawn *with* replacement.

Before proceeding, make some guesses for how you would expect $X$ to behave.  What are the possible values?  What values would be more/less likely?  What would you guess for the expected value?  (Nothing to write up, just think about it.)

### a)

Define a probability space $P$ in which an outcome corresponds to the sequence of numbers drawn *with* replacement. (Hint: use [BoxModel](https://dlsun.github.io/symbulate/probspace.html#boxmodel).)  After defining $P$, display a few simulated outcomes.

In [3]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

Now let $X$ denote the minimum (`min`) of the numbers drawn.  Define a Symbulate [random variable](https://dlsun.github.io/symbulate/rv.html#RV) $X$ on the probability space $P$ via an appropriate function.  Evaluate $X$ for the outcome (4, 10, 4, 7, 6).

In [4]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### c)

Simulate 10000 values of $X$, store the values in a variable `x`, and summarize its approximate distribution in a table.  ([Hints.](https://dlsun.github.io/symbulate/rv.html#RV))  Note: Your table should include all *possible* values of $X$.  However, by default `tabulate` only tabulates those values which are among the simulated values, rather than all possible values. An argument can be passed to `.tabulate()` to tabulate all outcomes in a given list. [Compare the examples in `[14]` and `[15]` here](https://dlsun.github.io/symbulate/sim.html#tabulate).

In [5]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### d) 

Display the approximate distribution of $X$ in a plot.  ([Hint.](https://dlsun.github.io/symbulate/rv.html#plot))

In [6]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Use the simulation results to estimate $P(X \le 3)$.  Enter the appropriate Symbulate commands below; don't just use the above table and a calculator.  ([Hints](https://dlsun.github.io/symbulate/sim.html#recap).)  (Not to hand in, but good to think about: how would you calculate this probability analytically?)
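
For the analytic question, one possible approach is the complement rule: $X > 3$ exactly when every draw is one of the cards 4 through $n$. The sketch below is plain Python (not Symbulate, and not a substitute for the simulation asked for in this part); the names `p_x_gt_3` and `p_x_le_3` are made up for illustration. It can be used as a cross-check on your simulated estimate.

```python
from fractions import Fraction

n, k = 10, 5  # same values as in the lab

# With replacement the k draws are independent, so
# P(X > 3) = P(every draw is one of 4, ..., n) = ((n - 3) / n) ** k.
p_x_gt_3 = Fraction(n - 3, n) ** k
p_x_le_3 = 1 - p_x_gt_3

print(p_x_le_3, float(p_x_le_3))
```

Your simulation-based estimate should be close to this value, but it will not match exactly.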

In [7]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Approximate $E(X)$.
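
As a cross-check on the simulated approximation, the exact distribution of $X$ can be worked out from $P(X \ge m) = ((n - m + 1)/n)^k$, which holds because the draws are independent when sampling with replacement. The following is a plain-Python sketch under that assumption, not the Symbulate answer; `p_min_at_least`, `pmf_x`, and `mean_x` are illustrative names.

```python
from fractions import Fraction

n, k = 10, 5  # same values as in the lab


def p_min_at_least(m):
    """P(X >= m): each of the k independent draws is m or larger."""
    return Fraction(n - m + 1, n) ** k


# P(X = m) = P(X >= m) - P(X >= m + 1) for m = 1, ..., n
pmf_x = {m: p_min_at_least(m) - p_min_at_least(m + 1) for m in range(1, n + 1)}

# E(X) is the probability-weighted average of the possible values
mean_x = sum(m * p for m, p in pmf_x.items())

print(float(mean_x))
```

The exact value is $220825/100000 = 2.20825$; a simulation of 10000 values should land near it.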

In [8]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### g)

Approximate $\text{Var}(X)$ and $\text{SD}(X)$.

In [9]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### Problem 2:  Assume the cards are drawn *without* replacement.

Let $Y$ denote the smallest number when the cards are drawn without replacement.  Before proceeding, make some guesses for how you would expect $Y$ to behave.  In particular, how would you expect the distribution of $Y$ to compare to the distribution of $X$?

### a)

Define a probability space $Q$ in which an outcome corresponds to the sequence of numbers drawn *without* replacement. (Hint: use [BoxModel](https://dlsun.github.io/symbulate/probspace.html#boxmodel).)  After defining $Q$, display a few simulated outcomes.

In [10]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

Now let $Y$ denote the minimum (`min`) of the numbers drawn.  Define a Symbulate [random variable](https://dlsun.github.io/symbulate/rv.html#RV) $Y$ on the probability space $Q$ via an appropriate function.

In [11]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### c)

Simulate 10000 values of $Y$, store the values in a variable `y`, and summarize its approximate distribution in a table.  ([Hints.](https://dlsun.github.io/symbulate/rv.html#RV))  Note: Your table should include all *possible* values of $Y$.  However, by default `tabulate` only tabulates those values which are among the simulated values, rather than all possible values. An argument can be passed to `.tabulate()` to tabulate all outcomes in a given list. [Compare the examples in `[14]` and `[15]` here](https://dlsun.github.io/symbulate/sim.html#tabulate).

In [12]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### d) 

Display the approximate distribution of $Y$ in a plot.  ([Hint.](https://dlsun.github.io/symbulate/rv.html#plot))  Also, plot the distribution of $X$ from Problem 1 in the same plot.  (Hint: use `jitter=True` [like in `[29]` here](https://dlsun.github.io/symbulate/graphics.html#custom).)  Note: the amount of offset produced by jitter is random; if the bars are still too close together, just run the cell again.

In [13]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Use the simulation results to estimate $P(Y \le 3)$.  Enter the appropriate Symbulate commands below; don't just use the above table and a calculator.  ([Hints](https://dlsun.github.io/symbulate/sim.html#recap).)  (Not to hand in, but good to think about: how would you calculate this probability analytically?)
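
For the analytic question, one possible approach is to count: $Y > 3$ exactly when all $k$ cards come from the $n - 3$ cards labeled 4 through $n$, and the order of the draws does not affect the minimum. The sketch below is plain Python rather than Symbulate, and the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, k = 10, 5  # same values as in the lab

# Without replacement, every set of k distinct cards is equally likely.
# P(Y > 3) = (# sets chosen from cards 4..n) / (# sets chosen from all cards)
p_y_gt_3 = Fraction(comb(n - 3, k), comb(n, k))
p_y_le_3 = 1 - p_y_gt_3

print(p_y_le_3, float(p_y_le_3))
```

This gives $11/12$, which can be compared with both the simulated estimate and the corresponding with-replacement probability.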

In [14]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Approximate $E(Y)$.
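
Because there are only $\binom{10}{5} = 252$ possible sets of cards, the exact value of $E(Y)$ can be found by brute-force enumeration and compared with the simulated approximation. This is a plain-Python cross-check (it does not use Symbulate and is not the requested answer); `hands` and `mean_y` are illustrative names.

```python
from fractions import Fraction
from itertools import combinations

n, k = 10, 5  # same values as in the lab
cards = range(1, n + 1)

# The minimum does not depend on draw order, so it is enough to
# enumerate the equally likely unordered sets of k distinct cards.
hands = list(combinations(cards, k))

# E(Y) = average of the minimum over all equally likely hands
mean_y = Fraction(sum(min(hand) for hand in hands), len(hands))

print(len(hands), mean_y, float(mean_y))
```

The enumeration gives $E(Y) = 11/6 \approx 1.83$.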

In [15]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### g)

Approximate $\text{Var}(Y)$ and $\text{SD}(Y)$.

In [16]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### h)

Write a few sentences describing and comparing the distributions of $X$ and $Y$.  What are the main effects of changing from with to without replacement?

**ENTER YOUR WRITTEN EXPLANATION HERE.**

## Part II

Recall the collector problem from Lab 1.  Each box of a certain type of cereal contains one of $n$ distinct prizes and you want to obtain a complete set. Suppose that each box of cereal is equally likely to contain any one of the $n$ prizes, and the particular prize that appears in one box has no bearing on the prize that appears in another box. You purchase cereal boxes one box at a time until you have the complete set of $n$ prizes.

In Lab 1, you investigated the probability that you would need to buy more than a certain number of boxes to complete a set.  Now you will investigate the distribution of the number of boxes you purchase until you complete a set.

Let $X$ be the total number of boxes purchased, assuming you stop once you have the complete set of $n$ prizes.

We will assume $n=10$ (with prizes labeled 0, 1, ..., 9).

In [17]:
n = 10
prizes = list(range(n))
prizes

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Before proceeding, make some guesses for how you would expect $X$ to behave.  What is the smallest possible value?  The largest?  What values would be more/less likely?  What would you guess for the expected value?  (Nothing to write up, just think about it.)

### a)

The probability space could be represented as the sequence of prizes obtained.  (First I got prize 3, second I got prize 1, third I got prize 3 (again), etc.)  While technically you would stop buying boxes when you get a complete set, it is convenient to imagine that you keep buying boxes forever.  This way, all outcomes in the probability space would have the same "length".  Also, you could use such a probability space to investigate other problems too (e.g. number of boxes purchased until $r$ complete sets are obtained).

Define a probability space $P$ in which an outcome corresponds to an infinite sequence of prizes.  (Hint: use `BoxModel` with `size=inf`.)  After defining $P$, display a few simulated outcomes.

In [18]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### b)

The `number_prizes_until_complete_set` function below takes as input a sequence of prizes and returns how many boxes were purchased until a complete set was obtained.

In [19]:
def number_prizes_until_complete_set(outcome):
    unique_prizes = set()
    for i, prize in enumerate(outcome):
        unique_prizes.add(prize)
        if len(unique_prizes) == n:
            return i + 1

# for the outcome below, the set is completed when you get prize 7
outcome = (3, 4, 3, 0, 1, 6, 5, 3, 2, 4, 5, 6, 9, 8, 3, 4, 5, 6, 7)  
number_prizes_until_complete_set(outcome)

19
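
A couple of extra checks can make the function's behavior clearer. The sketch below repeats the definition so that it runs on its own; the test outcomes `shortest` and `incomplete` are made up for illustration. Note that for a finite sequence that never completes the set, the loop ends without reaching `return`, so the function returns `None`. This does not arise with the infinite sequences from part a).

```python
n = 10  # number of distinct prizes, labeled 0, ..., n - 1


def number_prizes_until_complete_set(outcome):
    unique_prizes = set()
    for i, prize in enumerate(outcome):
        unique_prizes.add(prize)
        if len(unique_prizes) == n:
            return i + 1  # i is 0-based, so the count of boxes is i + 1


# Best case: the first n boxes all contain different prizes
shortest = tuple(range(n))
result_shortest = number_prizes_until_complete_set(shortest)

# A finite sequence that never completes the set falls through the loop
incomplete = (0, 1, 2)
result_incomplete = number_prizes_until_complete_set(incomplete)

print(result_shortest, result_incomplete)
```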

Use the above function to define a `RV` $X$ on the probability space $P$ from part a).

In [20]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### c)

Simulate 10000 values of $X$, store the values in a variable `x`, and summarize its approximate distribution in a table.

In [21]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### d)

Display the approximate distribution of $X$ in a plot.

In [22]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### e)

Use the simulation results to estimate $P(X > 40)$.  Enter the appropriate Symbulate commands below; don't just use the above table and a calculator.  ([Hints](https://dlsun.github.io/symbulate/sim.html#recap).)  (Recall that you estimated probabilities like this in Lab 1.)
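
If you want an exact value to compare against, one standard approach (not required by the lab) is inclusion-exclusion: $X > m$ means at least one prize is still missing after $m$ boxes, and any particular set of $j$ prizes is all missing with probability $(1 - j/n)^m$. The sketch below is plain Python, and `prob_more_than` is a hypothetical helper name.

```python
from math import comb


def prob_more_than(m, n=10):
    """P(X > m): at least one of the n prizes is missing after m boxes."""
    # Inclusion-exclusion over the number j of prizes that are all missing
    return sum(
        (-1) ** (j + 1) * comb(n, j) * (1 - j / n) ** m
        for j in range(1, n + 1)
    )


p_x_gt_40 = prob_more_than(40)
print(p_x_gt_40)
```

The result is roughly 0.14, so a simulated relative frequency far from that suggests a mistake somewhere.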

In [23]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### f)

Approximate $E(X)$.
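
For comparison with the simulated value, the exact expected value can be obtained by splitting the purchases into stages. Once you hold $i$ distinct prizes, each new box contains a new prize with probability $(n - i)/n$, so the number of boxes needed to get the next new prize has mean $n/(n - i)$. Summing over the stages gives $E(X)$. This is a plain-Python sketch of that argument, with `mean_boxes` as an illustrative name.

```python
from fractions import Fraction

n = 10  # number of distinct prizes

# Stage i (i = 0, ..., n - 1): waiting for a new prize while holding i
# distinct prizes takes n / (n - i) boxes on average.
mean_boxes = sum(Fraction(n, n - i) for i in range(n))

print(mean_boxes, float(mean_boxes))
```

The sum is $7381/252 \approx 29.29$ boxes.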

In [24]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### g)

Write a sentence interpreting the value from the previous part as a "long run average" in this context.

**ENTER YOUR WRITTEN EXPLANATION HERE.**

### h)

Approximate $\text{Var}(X)$ and $\text{SD}(X)$.

In [25]:
# Type all of your code for this problem in this cell.
# Feel free to add additional cells for scratch work, but they will not be graded.

### i) 

Write a few sentences describing the distribution of $X$.

**ENTER YOUR WRITTEN EXPLANATION HERE.**

## Submission Instructions

Before you submit this notebook, click the "Kernel" drop-down menu at the top of this page and select "Restart & Run All". This will ensure that all of the code in your notebook executes properly. Please fix any errors, and repeat the process until the entire notebook executes without any errors.
