In [5]:
# SETUP

# These lines import the NumPy, datascience, and prob140 modules.
import numpy as np
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from scipy import stats
from client.api.assignment import load_assignment
autograder = load_assignment('main.ok')

# Lab 4: Minimum, Median, and Maximum
In this lab you will compute expectations of random variables that can't easily be written as sums of simpler random variables. This involves developing some mathematical techniques and then using them in computation.

### Part 1: Expectation Fundamentals ###
All variables in this lab have non-negative integer values. We are also going to assume that they have finitely many values, though many of the results are also true if the possible values are $0, 1, 2, 3, \ldots$. 

This Part consists of practice in using basic tools to find expectation. It should go very fast.

### 1.1 ###
Run the cell below to create the possible values of a random variable $X$, and all the corresponding probabilities. 

In [3]:
x = np.arange(11)
probs = make_array(0.1, 0.15, 0.2, 0.2, 0.15, 0.07, 0.05, 0.03, 0.02, 0.02, 0.01)

Now use `x`, `probs`, and the formula $E(X) = \sum_{\text{all }x} xP(X=x)$ to calculate `ev_X`, the expected value of $X$.

In [None]:
ev_X = sum(...)
ev_X
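
As a reminder of how the formula translates into array arithmetic, here is a minimal sketch in plain NumPy. It uses a small hypothetical distribution (`toy_values`, `toy_probs`), not the lab's $X$, so the names and numbers are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical toy distribution (NOT the lab's X): values 0, 1, 2, 3
toy_values = np.arange(4)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])

# E(X) = sum over all x of x * P(X = x):
# multiply elementwise, then add up the products
toy_ev = np.sum(toy_values * toy_probs)
print(toy_ev)
```

The elementwise product pairs each value with its probability, which is exactly the term $xP(X=x)$ in the sum.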

### 1.2 ### 
Now create a distribution object `dist` that contains the distribution of $X$, and use the `expected_value` method to confirm your value of `ev_X`.

In [6]:
dist = ...
dist.expected_value()

Run the cell below to visualize $E(X)$.

In [8]:
Plot(dist, show_ev=True)

### 1.3 ###
Now fill in the cell below and run it to compare $E(X)$ with a "long run" average of simulated values of $X$. Base your average on 10,000 simulated values of $X$.

In [10]:
sample_10000 = ....sample(10000)
sample_mean = np.mean(...)

In [13]:
emp_dist_10000 = emp_dist(...)
Plot(emp_dist_10000, show_ave=True)
print('Sample Mean =', sample_mean)
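
The idea of a "long run" average can also be sketched without the `prob140` tools. The example below is only an illustration under assumptions: it uses NumPy's random generator, an arbitrary seed, and a hypothetical toy distribution rather than `dist`.

```python
import numpy as np

rng = np.random.default_rng(140)  # arbitrary seed, chosen only for reproducibility

# Hypothetical toy distribution (NOT the lab's X)
toy_values = np.arange(4)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])

# Theoretical expectation from the distribution
toy_ev = np.sum(toy_values * toy_probs)

# Empirical average of 10,000 simulated values
toy_draws = rng.choice(toy_values, size=10_000, p=toy_probs)
toy_sample_mean = np.mean(toy_draws)

print('Expectation =', toy_ev)
print('Sample Mean =', toy_sample_mean)
```

The two printed numbers should be close but will generally not be equal, since one comes from the distribution and the other from random draws.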

Why is it appropriate to use `show_ave=True` as the option to `Plot` here, instead of `show_ev=True` as in 1.2?

*Provide your answer and reasoning in this Markdown cell.*

### Part 2: Tail Sum Formula ###
In this Part you will discover a formula for calculating expected values of non-negative integer valued random variables, using tail probabilities.

### 2.1 ###
Plot the histogram of the distribution `dist` from Part 1, this time showing the area corresponding to $P(X > 1)$.

In [64]:
# Event {X > 1}
Plot(dist, event=...)

The function `cdf` applied to a distribution at the argument `x` returns the value of the cumulative distribution function (cdf) $F(x) = P(X \le x)$. Use `cdf` appropriately to find the numerical value of $P(X > 1)$ shaded above. This is called a *tail probability*.

In [None]:
......cdf(...)
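
The relationship between the cdf and a tail probability can be sketched in plain NumPy as well. This is an illustration on a hypothetical toy distribution, not on `dist`, and it builds the cdf with `np.cumsum` rather than the `cdf` method you are asked to use above.

```python
import numpy as np

# Hypothetical P(X = 0), ..., P(X = 3) for a toy distribution
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])

# F(x) = P(X <= x) is the running total of the probabilities
toy_cdf = np.cumsum(toy_probs)

# Complement rule: P(X > 1) = 1 - P(X <= 1) = 1 - F(1)
toy_tail_at_1 = 1 - toy_cdf[1]
print(toy_tail_at_1)
```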

### 2.2 ###
Fill in the cell below to define a function `all_tail_probs` that takes a distribution object as its argument and returns an array consisting of all the tail probabilities $P(X > x)$.

In [13]:
def all_tail_probs(distribution):
    """Returns an array of all tail probabilities
    of distribution."""
    def tail(k):
        return ....cdf(...)
    return ....apply(tail, ...)

### 2.3 ###
Use your function `all_tail_probs` to create an array `tails` containing all the tail probabilities of `dist`. What is the value of the last item of the array, and why?

In [45]:
tails = ...
tails

Now run the cell below. Do you recognize the answer as a quantity that you computed earlier in this lab?

In [46]:
sum(tails)

### 2.4 ###
Explain what you saw above in the following steps. It's not a formal proof. You can do that for Extra Credit.

a) Plot $\{X > k\}$ for $k = 0, 1, 2, 3, \ldots, 10$. Look at them, but just turn in the plot that shows $P(X > 5)$.

b) How many of the plots include $\{X=1\}$? $\{X=2\}$? $\{X=3\}$? Explain.

*Provide your answer and reasoning in this Markdown cell.*

c) Now explain why $\sum_{k=0}^{10} P(X > k) = E(X)$. For a more formal proof, see Extra Credit.

*Provide your answer and reasoning in this Markdown cell.*
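
If you want to convince yourself that the identity is not special to `dist`, a quick numerical check on another distribution may help. The sketch below uses a hypothetical toy distribution and plain NumPy; it is a sanity check, not a proof.

```python
import numpy as np

# Hypothetical toy distribution (NOT the lab's X): values 0, 1, 2, 3
toy_values = np.arange(4)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])

# Direct formula: sum of x * P(X = x)
direct_ev = np.sum(toy_values * toy_probs)

# Tail probabilities P(X > k) for k = 0, 1, 2, 3, then their sum
toy_tails = 1 - np.cumsum(toy_probs)
tail_sum = np.sum(toy_tails)

print(direct_ev, tail_sum)
```

Up to floating-point rounding, the two printed values should agree.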

### Part 3: Sample Minimum, Maximum, and Range ###
In this part you will develop formulas for the expected values of the extrema of a random sample, as well as the range (largest value $-$ smallest value) of the sample.

### 3.1 ###
Let $X_1, X_2, \ldots , X_n$ be i.i.d. with a distribution that has c.d.f. $F$, and suppose each $X_i$ has possible values $\{0, 1, 2, \ldots, N\}$ for some fixed positive integer $N$. Define:

- $X_{min} = \min(X_1, X_2, \ldots , X_n)$
- $X_{max} = \max(X_1, X_2, \ldots , X_n)$
- $R = X_{max} - X_{min}$

Use Part 2, $F$, and $n$ to write formulas for $E(X_{min})$, $E(X_{max})$, and $E(R)$.

$$
E(X_{min}) = 
$$

$$
E(X_{max}) =
$$

$$
E(R) = 
$$

*Provide your answer and reasoning in this Markdown cell.*

### 3.2 ### 
Now test out your answer to 3.1 numerically. Set $n = 20$ and let $X_1, X_2, \ldots, X_{20}$ be i.i.d. draws from `dist`. Find the numerical value of $E(X_{min})$ by filling in the parentheses in the cell below. Remember that you already placed all the tail probabilities of `dist` in the array `tails`.

In [33]:
# Expected minimum of 20 iid draws
n = 20
sum(tails...)

If your numerical answer above is correct, it should agree (at least roughly) with the result of simulating the sample minimum 50,000 times and computing the average of those 50,000 values. Complete the cell below. The last line should display the empirical histogram of the 50,000 simulated minima.

In [25]:

# Simulated minimum of n iid draws
# based on 50000 repetitions
n = 20
repetitions = 50000
mins = make_array()
for i in range(repetitions):
    sample = ....sample(n)
    mins = np.append(..., ...)
print('IID Sample Size ', str(n))
print('Average of ', str(repetitions), 'Sample Minima = ', str(np.mean(mins)))
Plot(...)
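
For comparison, the same kind of simulation can be written without a loop. The sketch below is one possible vectorized approach in plain NumPy, under stated assumptions: a hypothetical toy distribution instead of `dist`, a small sample size and repetition count, and an arbitrary seed. It does not replace the cell above.

```python
import numpy as np

rng = np.random.default_rng(140)  # arbitrary seed, for reproducibility only

# Hypothetical toy distribution (NOT the lab's X)
toy_values = np.arange(4)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])

toy_n = 5          # size of each i.i.d. sample
toy_reps = 2_000   # number of simulated samples

# Each row is one i.i.d. sample of size toy_n
toy_samples = rng.choice(toy_values, size=(toy_reps, toy_n), p=toy_probs)

# Minimum of each row gives one simulated sample minimum
toy_mins = toy_samples.min(axis=1)
print('Average of', toy_reps, 'sample minima =', np.mean(toy_mins))
```

Drawing all the samples at once as a two-dimensional array avoids repeated calls to `np.append`, which copies the whole array on every call.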

Is the average of your 50,000 simulated minima fairly close to the theoretical value of $E(X_{min})$ that you computed earlier? If not, give a plausible reason for the inconsistency.

*Provide your answer and reasoning in this Markdown cell.*

### 3.3 ###
Repeat 3.2 for $X_{max}$ instead of $X_{min}$. That is, first write one line of code to compute $E(X_{max})$ for a sample size of 20.

In [None]:

# Expected maximum of 20 iid draws
n = 20
...

Then simulate the sample maximum 50,000 times by filling in the cell below, and find the average of your 50,000 simulated maxima.

In [None]:
# Simulated maximum of n iid draws
# based on 50000 repetitions
n = 20
repetitions = 50000
maxes = make_array()
for i in range(repetitions):
    sample = ...
    maxes = ...
print('IID Sample Size ', str(n))
print('Average of ', str(repetitions), ' Sample Maxima = ', str(...))
Plot(...)

### 3.4 ###
Explain what happens to $E(X_{min})$, $E(X_{max})$, and $E(R)$ as $n$ gets larger. You should have an intuitive sense of what will happen, which you should be able to back up based on your calculations of $E(X_{min})$ and $E(X_{max})$.

*Provide your answer and reasoning in this Markdown cell.*

### Part 4: Expected Sample Median: The Math ###
Medians and other percentiles are a bit annoying to analyze in the discrete case because of "ties": draws resulting in the same outcome. So we will analyze the sample median in the straightforward case where the sample size $2n+1$ is odd and "the median" is defined as the $(n+1)$st value when the sample is sorted in ascending order.

The example below will help clarify the definition.

In [14]:
# 7 = 2*3 + 1

sample_7 = make_array(23, 81, 34, 13, 56, 27, 26)
sample_7_median = np.sort(sample_7).item(3)
sample_7_median
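
The definition still works when the sample contains ties. Here is a small sketch with a hypothetical sample of size $5 = 2 \cdot 2 + 1$ in which one value is repeated; the names `sample_5` and `toy_n` are made up for this illustration.

```python
import numpy as np

# Hypothetical sample of size 5 = 2*2 + 1 with a repeated value
sample_5 = np.array([2, 5, 2, 7, 2])
toy_n = 2

# The (n+1)st smallest value sits at 0-indexed position n after sorting
sample_5_median = np.sort(sample_5).item(toy_n)
print(sample_5_median)
```

Sorting gives `2, 2, 2, 5, 7`, so the third value, and hence the median under this definition, is 2.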

### 4.1 ###
Let $X_1, X_2, \ldots , X_{2n+1}$ be i.i.d. with c.d.f. $F$ and possible values $0, 1, 2, \ldots, N$.

Let $M$ be the median of $X_1, X_2, \ldots, X_{2n+1}$, as defined above.

For each $m \ge 0$, find $P(M > m)$.

[Hint: For the median to be greater than $m$, what do the $X$'s have to do? See if you can define some of them to be "successes" and the others "failures".]

*Provide your answer and reasoning in this Markdown cell.*
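
Once you have phrased an event in terms of successes and failures, `scipy.stats.binom` can compute the resulting probabilities. The sketch below shows only the general mechanics of its `cdf` and `sf` functions, using hypothetical numbers (5 trials, success chance 0.5) that are not tied to the answer to 4.1.

```python
from scipy import stats

# Hypothetical setting: 5 independent trials, each a success with chance 0.5
trials = 5
p_success = 0.5

# cdf(k, n, p) = P(number of successes <= k)
at_most_2 = stats.binom.cdf(2, trials, p_success)

# sf(k, n, p) = P(number of successes > k), the "survival function"
more_than_2 = stats.binom.sf(2, trials, p_success)

print(at_most_2, more_than_2)
```

Note that `sf` uses a strict inequality, so "at least $k$ successes" corresponds to `sf(k - 1, ...)`.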

### 4.2 ###
Use Part 2 and your answer to 4.1 to write a formula for $E(M)$.

*Provide your answer and reasoning in this Markdown cell.*

### Part 5: Expected Sample Median: Computation ###
Finally, the code. As before, use `dist` as the common distribution of the $X_i$'s. 

### 5.1 ###
Define a function `ev_median` that takes as its arguments a distribution object and a positive integer $n$, and returns the expected median of an i.i.d. sample of size $s = 2n+1$ drawn from the distribution. You can assume that the possible values of the distribution are $0, 1, 2, \ldots , N$ for some fixed positive integer $N$.

In [18]:

def ev_median(distribution, n):
    """Returns the expected sample median
    of an i.i.d. sample of size 2n+1 from distribution.
    Uses the function all_tail_probs defined earlier."""
    tail_probs = all_tail_probs(distribution)
    s = 2*n+1
    ev_med = 0
    for m in np.arange(distribution.num_rows):
        ev_med = ev_med + ...
    return ev_med

### 5.2 ###
Let $M_n$ be the median of an i.i.d. sample of size $n$ drawn from `dist`. Use your function `ev_median` to find $E(M_{51})$.

In [19]:
ev_median(...)

### 5.3 ###
What's a reasonable definition of the "population median" for the "population distribution" `dist` from which the sample is being drawn? Explain. Your answer should be specific to `dist`.

*Provide your answer and reasoning in this Markdown cell.*

### 5.4 ###
Does your numerical value of $E(M_{51})$ make sense compared to this "population median"?

*Provide your answer and reasoning in this Markdown cell.*

### 5.5 ###
What happens to $E(M_n)$ as $n$ gets larger? Answer based on your intuition.

*Provide your answer and reasoning in this Markdown cell.*

Now confirm your intuition by plotting $E(M_n)$ against $n$. Fill in the cell below to draw the plot.

In [43]:
ns = np.arange(25, 1001, 25)
evs = make_array()
for n in ns:
    evs = np.append(...)

plt.scatter(ns, evs)
plt.xlabel('n')
plt.ylabel('$E(M_n)$');  

Is the plot consistent with your intuitive answer above?

*Provide your answer and reasoning in this Markdown cell.*

### Part 6: Extra Credit ###
Let $X$ have values $0, 1, 2, \ldots, N$ for some fixed positive integer $N$. Show that 
$E(X) = \sum_{k=0}^N P(X > k)$ by using the following observations:

- In the usual formula $E(X) = \sum_{k=0}^N kP(X=k)$, the term corresponding to $k=0$ doesn't matter.
- For $k \ge 1$, $k = \sum_{i=1}^k 1$.

That's right; that's the hint. No typo.

*Provide your answer and reasoning in this Markdown cell.*

In [10]:
prob140_is_my_favorite_class = ...
I_saved_my_file = ...

In [12]:
_ = autograder.grade('q1')

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()