# Lec12: Discrete Random Variables and Distributions 
***

In this notebook we'll get some practice with discrete random variables and see how we can play with binomial distributions using Python.   

We'll need NumPy and Matplotlib for this notebook, so let's load them.  Later on we'll also need SciPy's binom function for computing binomial coefficients.  

In [None]:
import numpy as np 
import matplotlib.pyplot as plt 

%matplotlib inline

## Plotting Discrete Probability Distributions

In lecture we examined the following probability mass function (i.e. probability distribution for the discrete random variable X):

| k | P(X=k) |
|---|--------|
| 3 | 0.1 |
| 4 | 0.2 |
| 6 | 0.4 |
| 8 | 0.3 |
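
Before plotting, it can help to confirm that the table really is a valid pmf: no probability is negative and the probabilities sum to 1. The sketch below is one way to check this, assuming the values are typed in by hand from the table; the names `is_nonnegative` and `sums_to_one` are illustrative only.

```python
import numpy as np

# Values of X and their probabilities, copied from the table above
k = np.array([3, 4, 6, 8])
prob_k = np.array([0.1, 0.2, 0.4, 0.3])

# A valid pmf has no negative entries and sums to 1 (up to floating-point error)
is_nonnegative = bool(np.all(prob_k >= 0))
sums_to_one = bool(np.isclose(prob_k.sum(), 1.0))

print(is_nonnegative, sums_to_one)
```

`np.isclose` is used instead of `==` because sums of decimal fractions are not always exact in floating point.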

Let's look at two options for plotting the histogram for this pmf.

**OPTION 1:**

Since we know the exact probabilities for each value of the random variable $X$, we can plot bars with height equal to the probabilities, and width = 1.  
Matplotlib has a built-in `bar` method on its axes objects that can do this for us:

In [None]:
k=[3, 4, 6, 8]
prob_k = [0.1, 0.2, 0.4, 0.3]

In [None]:
fig, ax = plt.subplots(1, 1)

ax.bar(k, prob_k, width=1, ec='white');

#Always include a title
plt.title("Distribution of X");

#Label what the x and y axes represent:
plt.ylabel("P(X=k)");
plt.xlabel("k")

**OPTION 2:**

Alternatively, we could create an array of data with values of $k$ proportional to the probability that $X=k$, and plot a **density** histogram of this data. 

That is, we could create an array of 100 values such that:
 - 10 of the values are equal to 3
 - 20 of the values are equal to 4
 - 40 of the values are equal to 6
 - 30 of the values are equal to 8
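
The same array can probably be built in a single call with `np.repeat`. This is only a sketch of an alternative construction; `values`, `counts`, `data_alt` and `freqs` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np

values = np.array([3, 4, 6, 8])
counts = np.array([10, 20, 40, 30])   # counts out of 100, proportional to P(X=k)

# np.repeat repeats each value the given number of times
data_alt = np.repeat(values, counts)

# The relative frequency of each value recovers the pmf
freqs = np.array([np.mean(data_alt == v) for v in values])
print(len(data_alt), freqs)
```

Because the counts are proportional to the probabilities, a density histogram of this array with unit-width bins has bar heights equal to $P(X=k)$.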
 

In [None]:
ar3 = np.ones(10) * 3
ar4 = np.ones(20) * 4
ar6 = np.ones(40) * 6
ar8 = np.ones(30) * 8

data = np.concatenate((ar3, ar4, ar6, ar8))

data

In [None]:
#Plotting a histogram of values in data

fig, ax = plt.subplots(1, 1)

ax.hist(data, width=1, ec='white', density=True, bins=[2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]);

#Always include a title
plt.title("Distribution of X");

#Label what the x and y axes represent:
plt.ylabel("P(X=k)");
plt.xlabel("k");

### Exercise 1 - Implementing and Sampling the Binomial Distribution 
***

**Part A**: Write a function that takes in the parameters of the binomial distribution, $n$ and $p$, and returns the probability distribution as a NumPy array, where entry $k$ holds $P(X=k)$ for $k = 0, 1, \ldots, n$. In order to get the coefficient in the binomial distribution, you'll need a way to compute the combination ${n \choose k}$.  You can do this from scratch using Python's factorial function, or you can get the value directly using SciPy's built-in function [binom](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.special.binom.html).  
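
For the "from scratch" route, a minimal sketch might look like the following. The helper name `n_choose_k` is hypothetical, and the comparison against `math.comb` assumes Python 3.8 or newer.

```python
from math import comb, factorial


def n_choose_k(n, k):
    """Binomial coefficient n! / (k! (n-k)!) computed from factorials."""
    # Integer division is exact here because the quotient is always an integer
    return factorial(n) // (factorial(k) * factorial(n - k))


print(n_choose_k(5, 2))                      # 10
print(n_choose_k(100, 8) == comb(100, 8))    # True
```

Working with Python integers keeps the result exact even for large $n$, whereas a floating-point routine returns an approximation.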

In [None]:
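# Imported under an alias: a later cell imports scipy.stats.binom under the
# name binom, which would otherwise shadow this function and break binomial_dist.
from scipy.special import binom as binom_coeff


def binomial_dist(n, p):
    """Return the pmf of Bin(n, p) as an array; entry k is P(X=k) for k = 0..n."""
    # List comprehension over k: C(n,k) * p^k * (1-p)^(n-k)
    pmf = np.array([binom_coeff(n, k) * (p**k) * ((1 - p)**(n - k)) for k in range(n + 1)])
    return pmf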


**Part B**:
    
   i).  Create an array with the probability distribution for a binomial with n=100 and p=.40
   
   ii).  Print out the first 10 entries of the array
   
   iii). Then plot a histogram

In [None]:
prob = binomial_dist(100, 0.40)

# Here is a check that your binomial_dist function at least sums to 1
np.sum(prob)

In [None]:
binomial_dist(100,0.40)[:10]

In [None]:
# Last 10 entries of the array
binomial_dist(100, 0.40)[-10:]

In [None]:
n=100
p=.40

X = np.array(range(n+1))
pmf = binomial_dist(n, p)

fig, ax = plt.subplots()
#Note we are NOT using .hist, because we have a list of probabilities 
#that each represent the height of separate bars.

ax.bar(X, pmf);
ax.grid(alpha=0.25)
plt.xlim(0,100)
plt.title("Binomial Distribution, n=100, p=0.4");



**Part C**:  Let $X \sim Bin(100, 0.40)$.  Use your function above to calculate the following:

   i).  What is $P(X=8)$?
    
   ii).  What is $P(X \leq 8)$?
   
   iii).  What is $P(X \leq 15)$?

In [None]:
prob_x_equals_8 = binomial_dist(100, 0.40)[8]
prob_x_equals_8

In [None]:
prob_x_leq_8=np.sum(binomial_dist(100, 0.40)[:9])
prob_x_leq_8

In [None]:
prob_x_leq_15 = np.sum(binomial_dist(100, 0.40)[:16])
prob_x_leq_15

## Built-In Python Functions

Python's scientific computing library `scipy` has built-in functions to calculate the probability mass functions (PMFs) of discrete random variables:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html



In [None]:
from scipy.stats import binom


# Recalculate Part C(i) above using built-in functions
# X ~ Bin(100, 0.40): what is P(X=8)?
binom.pmf(k=8, n=100, p=0.40)


In [None]:

X = np.array(range(9))
print(X)

binom.pmf(X,n=100, p=0.40 )

In [None]:
sum(binom.pmf(np.array(range(9)),n=100, p=0.40))

There's a built-in way to calculate the sum in the previous cell.

The Cumulative Distribution Function (CDF), $F(a)$ of a Random Variable $X$ is:

$F(a) = P(X \leq a)$

`binom.cdf(a, n, p)` will calculate this for you.
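
One way to see what the CDF is doing is to compare it with a running sum of the pmf. The sketch below assumes the same $n=100$, $p=0.40$ used earlier; `cdf_from_pmf` and `cdf_builtin` are illustrative names.

```python
import numpy as np
from scipy.stats import binom

n, p = 100, 0.40
k = np.arange(n + 1)

# Running sum of the pmf: entry a is P(X=0) + ... + P(X=a)
cdf_from_pmf = np.cumsum(binom.pmf(k, n, p))

# Built-in CDF evaluated at every a = 0..n
cdf_builtin = binom.cdf(k, n, p)

print(np.allclose(cdf_from_pmf, cdf_builtin))
```

The two arrays should agree up to floating-point error, which is just the definition $F(a) = \sum_{k \leq a} P(X=k)$ written in code.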



In [None]:
binom.cdf(8, n=100, p=0.4)

In [None]:
# Recreate Binomial Distribution Histogram using built-in binom.pmf function:


n=100
p=.26

X = np.array(range(n+1))
pmf = binom.pmf(X,n, p)

fig, ax = plt.subplots()
#Note we are NOT using .hist, because we don't have an array of data, we have a list of probabilities 


ax.bar(X, pmf);
plt.xlim(0,100)
plt.title("Binomial Distribution,n=100, p=0.26");



### Part D - SIMULATION

So now you know the exact distribution of a binomial random variable, but very frequently we'll want to generate samples from that distribution.  



**Useful Built-In Functions:** NumPy has built-in functions that can generate samples from a wide variety of distributions:

https://numpy.org/doc/stable/reference/random/generator.html



To generate random samples from a binomial distribution we can use NumPy's [random.binomial](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.binomial.html) function. 

Read the documentation, and then try drawing 1000 samples from _Bin(n,p)_ for $n=100$ and $p=0.40$.  
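
The generator documentation linked above describes a newer interface, `np.random.default_rng`, which exposes the same `binomial` method on a generator object. A minimal sketch, assuming an arbitrary seed of 12 chosen only so the draws are reproducible:

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=12)

samples = rng.binomial(n=100, p=0.40, size=1000)

# Each entry is a count of successes out of 100 trials
print(samples[:10])
print(samples.mean())   # should land near n*p = 40
```

Either interface draws from the same distribution; the generator form mainly makes seeding explicit.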

In [None]:
x = np.random.binomial(n=100, p=0.40, size=1000)

x

**Part E**: We can approximate the theoretical distribution of _Bin(n,p)_ by drawing a large number of samples from the distribution and plotting a **density** histogram.  Do this now.  Additionally, use the function you wrote in **Part A** to get the exact pmf, and plot the distribution directly below the histogram of your sampled distribution.  How do they compare?  What happens if you use more or fewer samples in the histogram? 
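
To put a number on "how do they compare", one option is to measure the largest gap between the empirical frequencies and the exact pmf for several sample sizes. This is a sketch only: the seed, the three sample sizes, and the name `max_errors` are arbitrary choices, and it uses `scipy.stats.binom.pmf` for the exact values.

```python
import numpy as np
from scipy.stats import binom

n, p = 100, 0.40
rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility
exact = binom.pmf(np.arange(n + 1), n, p)

max_errors = {}
for size in (100, 10_000, 1_000_000):
    draws = rng.binomial(n, p, size=size)
    # Relative frequency of each outcome 0..n
    empirical = np.bincount(draws, minlength=n + 1) / size
    max_errors[size] = np.max(np.abs(empirical - exact))

print(max_errors)
```

The largest gap should shrink as the sample size grows, which is what the histogram comparison shows visually.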

In [None]:
n = 100
p = 0.40

#Simulated Values
sample = np.random.binomial(n=n, p=p, size=100)
# Unit-width bins centered on each integer 0..n (edges from -0.5 to n+0.5)
bins = np.arange(-0.5, n + 1.5, 1)

#Actual theoretical probabilities
X = np.array(range(n+1))
pmf = binomial_dist(n,p)

fig, ax = plt.subplots(nrows=2, ncols=1)
plt.subplots_adjust(hspace=0.5)

# Use histogram because we are binning a list of simulated numbers to estimate the probabilities
ax[0].hist(sample, edgecolor='white', density=True, bins=bins)
ax[0].set_axisbelow(True)
ax[0].grid(alpha=0.25)
ax[0].set_xlim([0, 50]);
ax[0].set_title("Simulated Binomial Distribution: n=100, p=0.40")


# Use bar because we know the probabilities
ax[1].bar(X, pmf)
ax[1].set_axisbelow(True)
ax[1].grid(alpha=0.25);
ax[1].set_xlim([0, 50]);
ax[1].set_title("Theoretical Binomial Distribution, n=100, p=0.40")

**Special Case:** In the case of the binomial distribution, we can also simulate it using `np.random.choice` if we think of it as repeated coin tosses
(P(heads) = 0.26, P(tails) = 0.74).  
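
The coin-toss view can also be sketched without labels, by comparing uniform random numbers against the probability of heads. This is a swapped-in technique rather than the `np.random.choice` approach named above; the seed and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=3)   # arbitrary seed
n_tosses, p_heads, n_experiments = 100, 0.26, 10_000

# One row per experiment, one column per toss; True marks a head
tosses = rng.random((n_experiments, n_tosses)) < p_heads

# Counting heads in each row gives one Bin(100, 0.26) draw per experiment
heads = tosses.sum(axis=1)

print(heads[:10], heads.mean())
```

Each toss is a head with probability 0.26, so the row sums behave like draws from $Bin(100, 0.26)$ and their average should sit near $100 \times 0.26 = 26$.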

In [None]:
#Simulate one experiment (which consists of 100 trials) 
def heads_in_n_tosses(n=100):
    simulate = np.random.choice(["H","T"],size=n,p=[.26, .74])
    return sum(simulate == 'H')

In [None]:
# Repeat the experiment num_simulations times:
num_simulations = 10000

# A list comprehension avoids re-copying the whole array on every np.append call
outcomes = np.array([heads_in_n_tosses() for _ in range(num_simulations)])

plt.hist(outcomes, bins=np.arange(0, 50), density=True);

**Practice:  Use `scipy.stats.poisson` to plot a histogram of the PMF of a Poisson distribution for lambda = 3**
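
As a sanity check on what `poisson.pmf` returns, the values can be compared against the Poisson formula $P(X=k) = e^{-\lambda}\lambda^k / k!$ written out by hand. A minimal sketch, with `lam` and `by_hand` as illustrative names:

```python
import math
import numpy as np
from scipy.stats import poisson

lam = 3
k = np.arange(0, 21)

# pmf written out from the formula e^(-lambda) * lambda^k / k!
by_hand = np.array([math.exp(-lam) * lam**int(i) / math.factorial(int(i)) for i in k])

print(np.allclose(by_hand, poisson.pmf(k, mu=lam)))
```

Note that SciPy calls the rate parameter `mu` rather than lambda.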

In [None]:
#Use scipy.stats.poisson to sketch the pmf of a poisson distribution with lambda = 3 for values of X between 0 and 20:

from scipy.stats import poisson

param= 3
n=20

X = np.array(range(n+1))
poisson_pmf = poisson.pmf(X, mu=param)

fig, ax = plt.subplots()
#Note we are NOT using .hist, because we have a list of probabilities 
#that each represent the height of separate bars.

ax.bar(X, poisson_pmf);
ax.grid(alpha=0.25)
plt.xlim(0,n)
plt.title("Poisson Distribution, lambda = 3");
