In [None]:
from datascience import *
from prob140 import *
import numpy as np
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import pylab
import math
from scipy import stats
from scipy import misc
from client.api.assignment import load_assignment
autograder = load_assignment('main.ok')

In [None]:
def Plot_binom(interval, n, p):
    """Interval is [a, b] for integers a and b."""
    left = math.ceil(interval[0])
    right = math.floor(interval[1])
    k = np.arange(left, right+1)
    pp = stats.binom.pmf(k, n, p)
    d = Table().with_columns(
    'Value', k,
    'Probability', pp)
    Plot(d, edges=True)
    plt.xlabel("number of successes")
    plt.title('Binomial ('+str(n)+', '+str(p)+')');

In [None]:
def Plot_binom_norm(interval, n, p):
    mu = n*p
    sigma = (n*p*(1-p))**0.5
    left = interval[0]
    right = interval[1]
    d = (right-left)/100
    x = np.arange(left, right+d, d)
    y = 100*stats.norm.pdf(x, mu, sigma)
    plt.ylim(0, np.max(y)+0.1)
    Plot_binom(interval, n, p)
    plt.plot(x, y, color='red', lw=2)
    plt.fill_between(x, y, color='gold', alpha=0.6, zorder=2)

In [None]:
def Plot_binom_poisson(interval, n, p):
    """Interval is [a, b] for integers a and b."""
    left = math.ceil(interval[0])
    right = math.floor(interval[1])
    k = np.arange(left, right+1)
    mu = n*p
    p_binom = stats.binom.pmf(k, n, p)
    p_poi = stats.poisson.pmf(k, mu)
    d_binom = Table().with_columns(
    'Value', k,
    'Probability', p_binom)
    d_poi = Table().with_columns(
    'Value', k,
    'Probability', p_poi)
    Plots('Binomial', d_binom, 'Poisson', d_poi, edges=True)
    plt.xlabel("number of successes")
    plt.title('Binomial ('+str(n)+', '+str(p)+')');

# Lab 8: Binomial, Poisson, and Normal #
You have two approximations for the binomial $(n, p)$ distribution.
- For "large" $n$ and "small" $p$, the distribution is approximately Poisson.
- For "large" $n$, the distribution is approximately normal.

Those seem like contradictory statements, but they aren't. In this lab you will see why.

### Part 1. The Central Limit Theorem: Fixed $p$ ###

Start with a distribution you approximated in Lab 2. Let $X_{100}$ have the binomial $(100, 0.01)$ distribution, and recall that $X_{100}$ is the sum of 100 i.i.d. indicators.
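
If "sum of i.i.d. indicators" feels abstract, the sketch below simulates it. The parameters ($n = 50$, $p = 0.2$) are illustrative and are not the ones used in this lab, and the names `rng`, `reps`, and `empirical` are chosen here only for the example. It adds up rows of independent indicators and compares the empirical distribution of the sums with the binomial pmf.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # seeded so the run is repeatable
n, p, reps = 50, 0.2, 20000         # illustrative values, not the lab's

# Each row holds n independent indicators, each equal to 1 with chance p
indicators = rng.random((reps, n)) < p
sums = indicators.sum(axis=1)       # one simulated binomial value per row

k = np.arange(n + 1)
empirical = np.bincount(sums, minlength=n + 1) / reps
exact = stats.binom.pmf(k, n, p)

# Largest gap between simulated proportions and binomial probabilities
max_gap = np.max(np.abs(empirical - exact))
print(max_gap)
```

The gap should be small and should shrink as `reps` grows, which is all the phrase "sum of i.i.d. indicators" is claiming.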

#### a) ####
Let $X_{100}$ have the binomial $(100, 0.01)$ distribution. As a warmup, find $E(X_{100})$ and $SD(X_{100})$.

In [None]:
ev = ...
sd = ...
ev, sd

#### b) ####
The function `Plot_binom` takes three arguments:
- The interval over which to plot the binomial probabilities. The interval is entered as `[left_end, right_end]` and includes both endpoints.
- n
- p

Plot the distribution of $X_{100}$ over all of its possible values.

Which distribution should be used to approximate the distribution of $X_{100}$: the normal or the Poisson? Provide the parameter or parameters of the approximating distribution.

#### c) ####
Let $X_n$ have the binomial $(n, 0.01)$ distribution. Then $X_n$ is the sum of $n$ i.i.d. indicators, each equal to 1 with probability 0.01. According to the Central Limit Theorem, for large enough $n$ the distribution of $X_n$ should be roughly normal.

How big do you think $n$ should be? Answer this by plotting the distribution of $X_n$ for a selection of values of $n > 100$ (just experiment; you don't have to turn them all in). 

There is no single "right" answer to this question. Just make a judgment based on what you see in the histograms. Look for a bell shape but keep in mind that the tails will give trouble in any approximation; don't be too picky about the tails.

Your answer should consist of just the plot of the distribution of $X_n$ for your choice of $n$. You can see from part (b) that it is not necessary to plot the histogram over the entire range of possible values, $[0, n]$. Use an interval that makes it visually clear that your choice of $n$ makes sense.

For your choice of $n$, what is $SD(X_n)$?

In [None]:
sd = ...
sd

### Part 2: Normal Approximation to the Binomial ###
In this part you will develop a careful normal approximation to the binomial distribution.

#### a) ####
Plot the binomial $(100, 0.5)$ distribution over all its possible values.

In [None]:
Plot_binom(...)

Find $\mu$ and $\sigma$ for the approximating normal curve.

In [None]:
mu = ...
sigma = ...
mu, sigma

#### b) ####
The function `Plot_binom_norm` takes the same arguments as `Plot_binom` and overlays a normal curve over the binomial histogram. The parameters of the normal curve are the mean and SD of the binomial.

Use `Plot_binom_norm` to overlay the normal curve over the distribution you displayed in part (a).

In [None]:
Plot_binom_norm(...)

#### c) ####
Now zoom in. Plot the binomial $(100, 0.5)$ histogram over the interval [45, 55].

In [None]:
Plot_binom(...)

Find the probability of the interval. You can use either the pmf or the cdf.
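
The two routes mentioned above can be sketched on a different binomial so that the lab's own calculation is left to you. The values `n = 20`, `p = 0.3`, and the interval $[4, 8]$ are illustrative assumptions. The point to notice is that with the cdf you subtract the cdf at `a - 1`, not at `a`, because the left endpoint is included.

```python
import numpy as np
from scipy import stats

n, p = 20, 0.3      # illustrative binomial, not the one in this part
a, b = 4, 8         # integer interval [a, b], both endpoints included

# Route 1: add up the pmf over every integer in the interval
k = np.arange(a, b + 1)
by_pmf = stats.binom.pmf(k, n, p).sum()

# Route 2: P(a <= X <= b) = P(X <= b) - P(X <= a - 1)
by_cdf = stats.binom.cdf(b, n, p) - stats.binom.cdf(a - 1, n, p)

print(by_pmf, by_cdf)
```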

#### d) ####
The code in the cell below looks like a reasonable attempt to approximate the binomial probability above by the area under the corresponding normal curve. Run the cell.

In [None]:
Plot_binom_norm([45, 55], 100, 0.5)

You can see that the approximation will be off. In fact, it will be off by a value very close to a binomial probability. The probability of what? Explain your answer and then calculate it.

In [None]:
stats.binom.pmf(...)

#### e) Continuity Correction ####
The cell below shows a better approximation known as the *normal approximation with continuity correction*. The method corrects for the fact that a discrete histogram consisting of bars is being approximated by a continuous curve.

In [None]:
Plot_binom_norm([44.5, 55.5], 100, 0.5)

Use `stats.norm.cdf` to find the gold area above. Remember that you have already calculated `mu` and `sigma`.

Is this a good approximation to the exact probability you found in (c)?
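
To see the effect of the correction without doing this exercise for you, here is a sketch on a different binomial. The values `n = 36`, `p = 0.5`, and the interval $[16, 20]$ are illustrative assumptions, as are the names `uncorrected` and `corrected`. The corrected version widens the interval by half a unit on each side so that the area under the curve covers the full width of the end bars.

```python
import numpy as np
from scipy import stats

n, p = 36, 0.5          # illustrative binomial, not the one in this part
a, b = 16, 20           # integer interval [a, b]

mu = n * p
sigma = np.sqrt(n * p * (1 - p))

# Exact binomial probability of the interval
exact = stats.binom.pmf(np.arange(a, b + 1), n, p).sum()

# Normal area over [a, b]: misses half of each end bar
uncorrected = stats.norm.cdf(b, mu, sigma) - stats.norm.cdf(a, mu, sigma)

# Normal area over [a - 0.5, b + 0.5]: the continuity correction
corrected = stats.norm.cdf(b + 0.5, mu, sigma) - stats.norm.cdf(a - 0.5, mu, sigma)

print(exact, uncorrected, corrected)
```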

#### f) ####
Find the exact probability of 50 heads in 100 tosses of a coin, as well as its normal approximation.

In [None]:
exact = ...
normal_approx = ...
exact, normal_approx

#### g) ####
Find the exact probability of 45 heads in 100 tosses of a coin, as well as its normal approximation.

In [None]:
exact = ...
normal_approx = ...
exact, normal_approx

### Part 3. Normal Versus Poisson: Fixed $n$ and $p$ ###
The function `Plot_binom_poisson` overlays the approximating Poisson histogram on a binomial histogram. As you have known for some time, the following is a good idea.

In [None]:
Plot_binom_poisson([0, 10], 100, 0.01)

Based on Part 1 you will agree that the following is a bad idea.

In [None]:
Plot_binom_norm([-0.5, 10], 100, 0.01)

In this part you will use total variation distance to quantify the distance between the binomial and each of its approximations.
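
As a reminder of the familiar calculation, here is a minimal sketch of the TVD between two distributions listed on the same set of values. The helper name `tvd` and the die example are assumptions made for illustration. The sketch does not address how to line up a binomial with a Poisson or a normal, which is the part you will work out below.

```python
import numpy as np

def tvd(probs_1, probs_2):
    """Half the sum of absolute differences between two
    probability arrays defined on the same values (hypothetical helper)."""
    probs_1 = np.asarray(probs_1, dtype=float)
    probs_2 = np.asarray(probs_2, dtype=float)
    return 0.5 * np.sum(np.abs(probs_1 - probs_2))

# Illustrative example: a fair die versus a die loaded toward six
fair = np.ones(6) / 6
loaded = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

die_tvd = tvd(fair, loaded)
print(die_tvd)
```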

#### a) ####
Define a function `tvd_binom_poisson` that takes `n` and `p` as its arguments and returns the total variation distance between the binomial $(n, p)$ distribution and its approximating Poisson distribution.

In [None]:
def tvd_binom_poisson(n, p):
    return ...

Use your function to find the total variation distance between the binomial $(100, 0.01)$ distribution and its approximating Poisson distribution.

#### b) ####
Define a function `tvd_binom_norm` that takes `n` and `p` as its arguments and returns the total variation distance between the binomial $(n, p)$ distribution and its normal approximation.

In [None]:
def tvd_binom_norm(n, p):
    return ...

Use your function to find the total variation distance between the binomial $(100, 0.01)$ distribution and its normal approximation.

#### c) ####
Based on the TVDs you calculated in (a) and (b), which of the two approximations should be used for binomial $(100, 0.01)$?

*Provide your answer and reasoning in this Markdown cell.*

### Part 4. Normal Versus Poisson: Fixed $n$ ###

#### a) ####
When $n = 100$, for which $p$ should the normal approximation be used in preference to the Poisson? To answer this question, start by defining a function `tvd_100_binom_poisson` that takes `p` as its argument and returns the total variation distance between binomial $(100, p)$ and its Poisson approximation. Use the function `tvd_binom_poisson` you wrote in Part 3.

In [None]:
def tvd_100_binom_poisson(p):
    return ...

Analogously, define `tvd_100_binom_norm` that takes `p` as its argument and returns the total variation distance between binomial $(100, p)$ and its normal approximation.

In [None]:
def tvd_100_binom_norm(p):
    return ...

#### b) ####
Run the cell below to create an array of values of $p$ and place it in the table `tvds`.

In [None]:
p = np.arange(0.01, 0.251, 0.01)
tvds = Table().with_column('p (n=100)', p)

Use `apply` to augment `tvds` with a column `TVD Binomial Poisson` containing the total variation distance between binomial $(100, p)$ and its Poisson approximation, as well as a column `TVD Binomial Normal` with the TVD between binomial $(100, p)$ and its normal approximation. Display the table `tvds`.

In [None]:
tvds = ...
tvds

#### c) ####
Recall that the Table method `plot` when called by `tbl.plot(column_label)` displays a line plot of all the other columns of `tbl` versus `column_label`. Use `plot` to overlay line plots of the two columns of TVDs versus the column of values of $p$.

In [None]:
tvds.plot(0)

The rule of thumb "$\sigma > 3$" is sometimes given as a criterion to choose the normal over the Poisson approximation to the binomial. Does it seem reasonable based on your graphs? Calculate some relevant SDs and then say whether the rule of thumb seems OK for $n = 100$.

*Provide your answer and reasoning in this Markdown cell.*

#### d) ####
Another rule of thumb that is sometimes given to choose the normal approximation over the Poisson is "$np$ and $n(1-p)$ both bigger than 10". Compare the two rules of thumb: are they very different? Keep in mind that the only situation in which you will be wondering which approximation to use is when the Poisson approximation might be valid.

*Provide your answer and reasoning in this Markdown cell.*

#### e) ####
Repeat the plot above for $n = 200$ and comment on the rule of thumb. Keep in mind that where the two graphs are close, the approximations are about equally good or equally bad.

#### f) #### 

Now examine how the comparison depends on both ``n`` and ``p``. Run the cell below, which draws a contour plot of ``tvd_binom_poisson(n, p) - tvd_binom_norm(n, p)`` for ``n`` from 30 to 200 and ``p`` from 0.001 to about 0.50. Where the difference is positive, the normal approximation has the smaller TVD and is therefore the better of the two approximations to the binomial.


We overlay the curve where $\sigma = \sqrt{npq} = 3$, with $q = 1 - p$, in bold black.
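
The curve $\sigma = 3$ can also be written in closed form, which may help you read the plot. Setting $np(1-p) = 9$ and solving the quadratic for the root with $p \le 0.5$ gives $p = \big(1 - \sqrt{1 - 36/n}\big)/2$, which exists only when $n \ge 36$. The sketch below evaluates this for a few illustrative values of $n$; the function name `p_on_sigma3_curve` is an assumption made for the example.

```python
import numpy as np

def p_on_sigma3_curve(n):
    """Smaller root p of n*p*(1-p) = 9, or nan when n < 36
    (hypothetical helper for reading the sigma = 3 contour)."""
    n = np.asarray(n, dtype=float)
    disc = 1 - 36 / n
    root = (1 - np.sqrt(np.maximum(disc, 0))) / 2
    return np.where(disc >= 0, root, np.nan)

n_values = np.array([30, 36, 400, 900])   # illustrative values of n
p_values = p_on_sigma3_curve(n_values)
print(p_values)
```

For $n < 36$ the SD never reaches 3, whatever the value of $p$, so the black curve has no point there.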

In [None]:
delta1 = 10
delta2 = 0.01
n = np.arange(30, 210, delta1)
p = np.arange(0.001, 0.501, delta2)
P, N = np.meshgrid(p, n)
Z = [[tvd_binom_poisson(x, y) - tvd_binom_norm(x,y) for x,y in zip(N[i], P[i])] for i in np.arange(0, len(N))]
sigma = [[np.sqrt(x*y*(1-y)) for x,y in zip(N[i], P[i])] for i in np.arange(0, len(N))]
#PLOTS CONTOUR
plt.figure()
levels = np.array([-0.06, -0.015, -0.01, 0, 0.01, 0.02, 0.05,  0.08, 0.10, 0.15, 0.20])
TVD = plt.contour(N, P, Z, levels, linewidths=0.5, colors = 'blue')
sigma = plt.contour(N, P , sigma, make_array(3), linewidths=1, colors = 'black')
plt.clabel(TVD, inline=1, fontsize=10, )
plt.clabel(sigma, inline=1, fontsize=10)
plt.title('Difference between TVD as function of N, P ')

Based on this graph, does the rule of thumb "$\sigma > 3$" for choosing the normal hold for $n$ and $p$ in general? For what values of $n$ and $p$ does it seem to hold, and where does it seem to fail?

*Provide your answer and reasoning in this Markdown cell.*

In [None]:
_ = autograder.grade('q1')

### Extra Credit : Understanding Total Variation Distance ###
We are going to compare two probability distributions $P_{blue}$ and $P_{gold}$ on a finite set of values $S$. Suppose the values are labeled $1, 2, \ldots, n$. 

The *total variation distance between $P_{blue}$ and $P_{gold}$* is defined as

$$
\| P_{blue} - P_{gold}\|_{TVD} ~ = ~
\max\{ \lvert P_{blue}(A) - P_{gold}(A) \rvert : A \subseteq S\}
$$

The definition says: For every event $A$, compute how far off $P_{blue}(A)$ is from $P_{gold}(A)$. The TVD is the biggest among all these differences.

That doesn't look at all like what we have been calculating as the TVD starting way back in Data 8. But in fact it's the same thing. In these exercises you will show how. 

Before you get started, confirm your understanding of the definition. Suppose you calculate the TVD between two distributions and get 0.003. That says that if you list all possible events and compare their probabilities under the two distributions, the biggest difference you will get is 3/1000. The two distributions are pretty close. 

The goal of these exercises is to show that this new definition of TVD is equivalent to the calculation we have been doing all along. Let's start by setting up some notation. For each $i$ in $S$, let $P_{blue}(i) = b_i$ and let $P_{gold}(i) = g_i$. If you imagine a bar graph or histogram of each distribution, then $b_i$ is the size of the blue bar at $i$, and $g_i$ is the size of the gold bar at $i$.

In this notation, our familiar calculation of the TVD is

$$
\frac{1}{2} \sum_{i \in S} \lvert b_i - g_i \rvert
$$

In these exercises you will show that 

$$
\max\{ \lvert P_{blue}(A) - P_{gold}(A) \rvert : A \subseteq S\} ~ = ~ 
\frac{1}{2} \sum_{i \in S} \lvert b_i - g_i \rvert
$$
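
Before you prove the identity, it can help to check it numerically on a small case. The sketch below uses two made-up distributions on four values (the arrays `blue` and `gold` are illustrative assumptions), lists every one of the $2^4$ events, and compares the largest gap with the familiar half-sum. A numerical check like this is not a proof, but it tells you what the proof has to deliver.

```python
from itertools import combinations
import numpy as np

# Two illustrative distributions on the values 0, 1, 2, 3
blue = np.array([0.1, 0.2, 0.3, 0.4])
gold = np.array([0.25, 0.25, 0.25, 0.25])

# Familiar calculation: half the sum of absolute differences
half_sum = 0.5 * np.sum(np.abs(blue - gold))

# New definition: largest |P_blue(A) - P_gold(A)| over all events A
max_gap = 0.0
for size in range(len(blue) + 1):
    for event in combinations(range(len(blue)), size):
        idx = np.array(event, dtype=int)     # empty event gives sums of 0
        gap = abs(blue[idx].sum() - gold[idx].sum())
        max_gap = max(max_gap, gap)

print(half_sum, max_gap)
```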

Three events will be important in the calculations.

The set of values for which the blue bars exceed the gold:
$$
B = \{i: b_i > g_i\}
$$

The set of values for which the gold bars exceed the blue:
$$
G = \{i: g_i > b_i\}
$$

The set of values for which the blue bars and gold bars are equal:
$$
E = \{i: b_i = g_i\}
$$

Keep in mind that for any event $A$,
$$
P_{blue}(A) = \sum_{i \in A} b_i ~~~~~~~ \text{and} ~~~~~~~
P_{gold}(A) = \sum_{i \in A} g_i
$$

#### a) ####
Find the value of
$$
\sum_{i \in B} b_i ~ + ~ 
\sum_{i \in G} b_i ~ + ~ 
\sum_{i \in E} b_i 
$$

Repeat the calculation after replacing $b_i$ by $g_i$ in all three sums above.

Hence show that
$$
\sum_{i \in B} (b_i - g_i) ~ = ~ \sum_{i \in G} (g_i - b_i)
$$

This proves a statement we have made in Data 8 and Prob140: "The amount by which the blue bars exceed the gold is the same as the amount by which the gold bars exceed the blue." 

*Provide your answer and reasoning in this Markdown cell.*

#### b) ####
Our usual calculation of the TVD is
$$
\frac{1}{2} \sum_{i \in S} \lvert b_i - g_i \rvert
$$

Partition the sum into two pieces to show that

$$
\frac{1}{2} \sum_{i \in S} \lvert b_i - g_i \rvert ~ = ~
\sum_{i \in B} (b_i - g_i) ~ = ~ \sum_{i \in G} (g_i - b_i)
$$

This proves another statement we made in Data 8: "The TVD is the amount by which the blue bars exceed the gold."

*Provide your answer and reasoning in this Markdown cell.*

#### c) ####
Now let $A$ be any event. Show that 
$$
P_{blue}(A) - P_{gold}(A) ~ = ~ 
\sum_{i \in AB} (b_i - g_i) ~ - ~ \sum_{i \in AG} (g_i - b_i)
$$

Hence show that
$$
P_{blue}(A) - P_{gold}(A) ~ \le ~ 
\sum_{i \in AB} (b_i - g_i) ~~~~~~ \text{and} ~~~~~~
P_{gold}(A) - P_{blue}(A) ~ \le ~ 
\sum_{i \in AG} (g_i - b_i)
$$

*Provide your answer and reasoning in this Markdown cell.*

#### d) ####
Use the first of the two inequalities in (c) to show that if $P_{blue}(A) - P_{gold}(A) > 0$ then

$$
\lvert P_{blue}(A) - P_{gold}(A) \rvert ~ \le ~ \sum_{i \in B} (b_i - g_i)
$$

Use the second of the two inequalities in (c) to show that if $P_{blue}(A) - P_{gold}(A) < 0$ then

$$
\lvert P_{blue}(A) - P_{gold}(A) \rvert ~ \le ~ \sum_{i \in G} (g_i - b_i)
$$

*Provide your answer and reasoning in this Markdown cell.*

#### e) ####
Identify an event for which one of the inequalities in (d) is an equality.

Explain why you now have a complete proof of

$$
\max\{ \lvert P_{blue}(A) - P_{gold}(A) \rvert : A \subseteq S\} ~ = ~ 
\frac{1}{2} \sum_{i \in S} \lvert b_i - g_i \rvert
$$

That is, our usual calculation of the TVD is equivalent to finding the biggest difference between probabilities assigned by the two distributions to any event.

*Provide your answer and reasoning in this Markdown cell.*

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()