## Distributions, PDFs/PMFs, CDFs, and sampling from distributions

Today, we are going to explore statistical distributions, create figures of their PDFs and PMFs, calculate their CDFs, and write some code to sample from the distributions.

But first, imports:

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import scipy.interpolate

%matplotlib inline

### Let's start with the binomial distribution

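Use LaTeX to write the equation for the PMF in the cell below. Use the form with *n*, *k*, and *p*, where *n* is the number of trials, *k* is the number of successes, and *p* is the probability of success on each trial. [Here](https://www.overleaf.com/learn/latex/Mathematical_expressions) is a guide to writing mathematical expressions with LaTeX.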

$
    P(k) = \binom{n}{k} p^k (1-p)^{n-k}
$

Note: $\binom{n}{k}$ and $C_n^k$ are two notations for the same binomial coefficient:

$$ C_n^k = \frac{n!}{k! (n-k)!} $$
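
As a quick sanity check, the factorial formula can be compared against Python's built-in `math.comb` (available in Python 3.8 and later). This is only a sketch; the helper name `n_choose_k` is hypothetical and is not part of the exercise.

```python
import math

def n_choose_k(n, k):
    """Binomial coefficient from the factorial formula (illustrative helper)."""
    # Integer division is exact here because k! * (n-k)! always divides n!
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Compare against the standard-library implementation for a few values
for n, k in [(12, 6), (20, 0), (20, 20), (40, 17)]:
    assert n_choose_k(n, k) == math.comb(n, k)

print(n_choose_k(12, 6))  # 924
```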

### Create a function that can be used to compute *P(k)*

Your function will need three arguments. Hint: look at `math.factorial` for computing factorials and `math.pow` for exponents.

In [None]:
def binomial_pmf(p, n, k):
    
    part1 = math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
    probability = part1 * math.pow(p, k) * math.pow(1-p, n-k)
    
    return probability

Your answer should be 0.2255859375

In [None]:
binomial_pmf(0.5, 12, 6)

### Call your function on the integers from 0 to N

Why can't *k* be bigger than *N*? Is the PMF symmetric around the center?

In [None]:
[binomial_pmf(0.5, 20, k) for k in (0,1,2,3,4)]

Hint: look at `np.arange`

In [None]:
[binomial_pmf(0.5, 20, k) for k in np.arange(0,21)]
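
Two properties of the output are worth checking numerically: the probabilities should sum to 1 over k = 0..N, and the list should read the same forwards and backwards only when p = 0.5. The sketch below assumes a `binomial_pmf(p, n, k)` function with the same argument order as above; it is redefined here with `math.comb` (an assumption, not the notebook's implementation) so the block runs on its own.

```python
import math
import numpy as np

def binomial_pmf(p, n, k):
    # Same argument order as the notebook's function; math.comb is a stand-in
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n = 20
pmf_half = np.array([binomial_pmf(0.5, n, k) for k in range(n + 1)])
pmf_skew = np.array([binomial_pmf(0.7, n, k) for k in range(n + 1)])

# A PMF must sum to 1 over its whole support, k = 0..n
total_half = pmf_half.sum()
total_skew = pmf_skew.sum()

# Symmetry check: compare the PMF with its own reversal
is_symmetric_half = np.allclose(pmf_half, pmf_half[::-1])
is_symmetric_skew = np.allclose(pmf_skew, pmf_skew[::-1])

print(total_half, total_skew)
print(is_symmetric_half, is_symmetric_skew)
```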

### Now, make a figure of *k* vs *P(k)*

Plot binomial_pmf(0.5, 20, k), binomial_pmf(0.7, 20, k), and binomial_pmf(0.5, 40, k) from 0 to N. Show their PMF as a function of k.

In [None]:
y1 = [binomial_pmf(0.5, 20, k) for k in np.arange(0, 21)]
y2 = [binomial_pmf(0.7, 20, k) for k in np.arange(0, 21)]
y3 = [binomial_pmf(0.5, 40, k) for k in np.arange(0, 41)]

ax1 = plt.scatter(np.arange(0, 21), y1, c='b', label='p=0.5 and n=20')
ax2 = plt.scatter(np.arange(0, 21), y2, c='g', label='p=0.7 and n=20')
ax3 = plt.scatter(np.arange(0, 41), y3, c='r', label='p=0.5 and n=40')
plt.legend()
plt.yticks([0, 0.1, 0.15, 0.2])
plt.ylim([-0.02, 0.27])
plt.show()

### Moving on, do the same thing for the Poisson distribution

Poisson distribution:

$
  P(k)=\frac{\lambda^k e^{-\lambda}}{k!}
$

$\lambda$ is the expected number of occurrences

### Create a function for *P(k)*

Include arguments for $\lambda$ and *k*

In [None]:
def poisson_p(lamb, k):
    probability = math.pow(lamb, k)*math.exp(-lamb)/math.factorial(k)
    
    return probability

### Call your function on integers linearly spaced, with $\lambda = 1$ 

In [None]:
poisson_p(1,1)

The result with $k=1$ should be 0.36787944117144233.
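
To see what "expected number of occurrences" means in practice, the sketch below sums the PMF and its probability-weighted mean over a range of k wide enough that the remaining tail is negligible. The cutoff of 60 and the choice of $\lambda = 4$ are assumptions made for illustration.

```python
import math

def poisson_p(lamb, k):
    # Same formula as the notebook's function
    return lamb**k * math.exp(-lamb) / math.factorial(k)

lamb = 4
ks = range(0, 60)  # assumed cutoff; the tail beyond it is negligible for lambda = 4
probs = [poisson_p(lamb, k) for k in ks]

total = sum(probs)                            # should be very close to 1
mean = sum(k * p for k, p in zip(ks, probs))  # should be very close to lambda

print(total, mean)
```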

### Make a figure

Plot poisson_p(1, k), poisson_p(4, k), and poisson_p(10, k) with k from 0 to 20.

In [None]:
y1 = [poisson_p(1, k) for k in np.arange(0, 21)]
y2 = [poisson_p(4, k) for k in np.arange(0, 21)]
y3 = [poisson_p(10, k) for k in np.arange(0, 21)]

fig = plt.figure(figsize=(7, 5))

ax1 = plt.plot(np.arange(0, 21), y1, c='k', linewidth=0.75, zorder=0)
ax2 = plt.plot(np.arange(0, 21), y2, c='k', linewidth=0.75, zorder=1)
ax3 = plt.plot(np.arange(0, 21), y3, c='k', linewidth=0.75, zorder=2)

# Raw strings (r'...') keep Python from treating '\l' as an escape sequence
ax1 = plt.scatter(np.arange(0, 21), y1, s=70, c='gold', edgecolors='k', label=r'$\lambda =1$', zorder=3)
ax2 = plt.scatter(np.arange(0, 21), y2, s=70, c='purple', edgecolors='k', label=r'$\lambda =4$', zorder=4)
ax3 = plt.scatter(np.arange(0, 21), y3, s=70, c='cyan', edgecolors='k', label=r'$\lambda =10$', zorder=5)
plt.legend()
plt.yticks(np.arange(0, 0.50, 0.05))
plt.ylim([0, 0.4])
plt.show()

### Enough discrete distributions. Let's try a continuous one, the normal distribution

Begin by writing the *P(x)* equation

$
  P(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
$

### Write a function that implements *P(x)*

In [None]:
def normal_p(x, mu, sigma):
    coef = math.pow(2*math.pi*math.pow(sigma, 2), -0.5)
    probability = coef*math.exp(-math.pow(x-mu, 2)/(2*math.pow(sigma, 2)))
    return probability

In [None]:
normal_p(5, 0, 1)

Your answer should be 1.4867195147342977e-06
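
One way to double-check a hand-written density is to compare it with `scipy.stats.norm.pdf` and to confirm that the area under the curve is close to 1. This is a minimal sketch: the integration range of -10 to 10 and the simple Riemann sum are assumptions chosen for illustration, not part of the exercise.

```python
import math
import numpy as np
from scipy import stats

def normal_p(x, mu, sigma):
    # Same formula as the notebook's function
    coef = 1.0 / math.sqrt(2 * math.pi * sigma**2)
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma**2))

# Cross-check a single value against SciPy's reference implementation
ours = normal_p(5, 0, 1)
reference = stats.norm.pdf(5, loc=0, scale=1)

# Approximate the area under the curve with a simple Riemann sum
xs = np.linspace(-10, 10, 20001)
dx = xs[1] - xs[0]
area = sum(normal_p(x, 0, 1) for x in xs) * dx

print(ours, reference, area)
```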


### Call your function on a list of numbers

Hint: look at `np.linspace`

In [None]:
np.linspace(-10, 10, 11)

Plot normal_p(x, 0, 0.2), normal_p(x, 0, 1.0), normal_p(x, 0, 5.0), and normal_p(x, -2, 1.0), with x from -5 to 5.

In [None]:
y1 = [normal_p(x, 0, 0.2) for x in np.linspace(-5, 5, 100)]
y2 = [normal_p(x, 0, 1.0) for x in np.linspace(-5, 5, 100)]
y3 = [normal_p(x, 0, 5.0) for x in np.linspace(-5, 5, 100)]
y4 = [normal_p(x, -2, 1.0) for x in np.linspace(-5, 5, 100)]

fig = plt.figure(figsize=(7, 5))

# Raw strings keep '\m' and '\s' from being read as escape sequences
ax1 = plt.plot(np.linspace(-5, 5, 100), y1, c='b', linewidth=3, label=r'$\mu$=0, $\sigma$=0.2')
ax2 = plt.plot(np.linspace(-5, 5, 100), y2, c='r', linewidth=3, label=r'$\mu$=0, $\sigma$=1.0')
ax3 = plt.plot(np.linspace(-5, 5, 100), y3, c='y', linewidth=3, label=r'$\mu$=0, $\sigma$=5.0')
ax4 = plt.plot(np.linspace(-5, 5, 100), y4, c='g', linewidth=3, label=r'$\mu$=-2, $\sigma$=1.0')

plt.legend()
plt.yticks(np.linspace(0, 1.0, 6))
plt.ylim([-0.05, 1.05])
plt.xticks(np.linspace(-5, 5, 11))
plt.xlim([-5.2, 5.2])
plt.xlabel('$x$')
plt.ylabel(r'$\psi_{\mu,\sigma}$')
plt.grid(True)
plt.minorticks_on()
plt.tick_params(direction='in', length=10)
plt.tick_params(which='minor', direction='in', length=5)
plt.show()

### OK, now to CDFs

Let's look at the CDF for an empirical distribution. Recall this from our descriptive statistics notebook.

In [None]:
d1 = np.random.normal(loc=-6.4, scale=1.2, size=40000)
d2 = np.random.normal(loc=4, scale=10, size=16000)
d3 = np.random.normal(loc=22, scale=8, size=72000)
population = np.concatenate([d1, d2, d3])
pop = pd.DataFrame(data=population, columns=['population'])
pop.head()

### First, plot a histogram of the data

In [None]:
pop.hist(bins=100)

### Now let's make a function to interpolate our histogram

Call your function `inverse_cdf` and have it take two arguments:

* `data` which is a list of values that are the empirical distribution you want to construct the CDF from.
* `bins` which is the number of bins to use in the histogram used to interpolate the CDF.

The function should use np.histogram to create data objects (not plots) that contain binned data. Hint: your call will look something like:

`hist_data, bin_edges = np.histogram(data, bins=bins, density=True)`

Why are we using the `density=True` parameter?
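
A small experiment can help answer that question. The sketch below, run on synthetic normal data (an assumption made for illustration), compares the raw counts from `np.histogram` with the normalized heights returned when `density=True`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

counts, edges = np.histogram(data, bins=50)             # raw counts per bin
density, _ = np.histogram(data, bins=50, density=True)  # normalized heights

widths = np.diff(edges)

# Raw counts add up to the number of samples...
total_count = counts.sum()
# ...while density * width adds up to 1, like a probability
total_area = np.sum(density * widths)

print(total_count, total_area)
```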

Remember that the CDF is the running integral of the probability density function. For a histogram, that means multiplying each bin's density by its width to get the probability in that bin, then using np.cumsum to accumulate those probabilities across the bins. Here is what I came up with:

`cdf_bins = np.cumsum(hist_data * np.diff(bin_edges))`

`cdf_bins = np.insert(cdf_bins, 0, 0)`

Two questions for you:
* `np.diff` computes the bin width. Why?
* Why do I have the `np.insert`?
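
Working through a tiny example by hand can help with both questions. The bin edges and densities below are made-up toy values, chosen so that the total area is 1.

```python
import numpy as np

# Hypothetical toy histogram: 4 bins are described by 5 edges
bin_edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
hist_data = np.array([0.1, 0.3, 0.2, 0.05])  # densities, one per bin

# Differences between consecutive edges: [1, 1, 2, 4]
widths = np.diff(bin_edges)

# Probability accumulated by the right edge of each bin: [0.1, 0.4, 0.8, 1.0]
cdf_bins = np.cumsum(hist_data * widths)

# Prepending 0 gives one CDF value per edge, starting at the leftmost edge
cdf_with_zero = np.insert(cdf_bins, 0, 0)

print(widths)
print(len(bin_edges), len(cdf_bins), len(cdf_with_zero))
```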

`scipy` has a nice interpolation family of functions.

`import scipy.interpolate`

`inv_cdf = scipy.interpolate.interp1d(cdf_bins, bin_edges)`

Google that (or use the documentation) for more information. Hint: make sure your function returns both the CDF and the inverse CDF.

Note that `scipy.interpolate.interp1d` returns a callable object: you can call it like a function and pass it values, e.g. `inv_cdf([0.1, 0.2, 0.3])`.
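
Here is a minimal sketch of that calling pattern on made-up data points. It also shows that, with the default settings, asking for a value outside the known range raises a `ValueError`, which matters when choosing the inputs you pass to an interpolated CDF.

```python
import numpy as np
import scipy.interpolate

# Hypothetical known points lying on a straight line
x_known = np.array([0.0, 0.5, 1.0])
y_known = np.array([0.0, 10.0, 20.0])

# interp1d returns a callable that interpolates linearly by default
f = scipy.interpolate.interp1d(x_known, y_known)
values = f([0.25, 0.75])

# Values outside [0, 1] are rejected unless extrapolation is requested
try:
    f(1.5)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

print(values, out_of_range_ok)
```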

In [None]:
def inverse_cdf(data, bins=100):
    # Use the caller's bin count rather than a hard-coded 100
    hist_data, bin_edges = np.histogram(data, bins=bins, density=True)
    cdf_bins = np.cumsum(hist_data * np.diff(bin_edges))
    cdf_bins = np.insert(cdf_bins, 0, 0)
    
    inv_cdf = scipy.interpolate.interp1d(cdf_bins, bin_edges)
    cdf = scipy.interpolate.interp1d(bin_edges, cdf_bins)
    
    return [cdf, inv_cdf]

If you did it right, the cell below should use your function to create interpolations of the CDF and the inverse CDF:

In [None]:
[cdf, inv_cdf] = inverse_cdf(pop, 100)

### Plot the CDF

In [None]:
# Take scalar bounds from the column so x is a plain 1-D array
x = np.linspace(pop['population'].min(), pop['population'].max(), num=100)

In [None]:
plt.figure(figsize=(5, 5))
plt.plot(x, cdf(x))

In [None]:
zero2one = np.linspace(0, .9999, num=100)

In [None]:
plt.figure(figsize=(5, 5))
plt.plot(zero2one, inv_cdf(zero2one))

### Now you can sample from your inverse CDF to generate values from our empirical distribution

Let's do that with `np.random.rand`. Why are we using this random function? What is special about how it works that makes it useful with an inverse CDF?
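
The same idea can be tried on a distribution whose inverse CDF is known in closed form. The sketch below uses the exponential distribution, which is not part of this notebook and is chosen only because its CDF, $F(x) = 1 - e^{-\lambda x}$, is easy to invert. It uses NumPy's `default_rng` generator, whose `random` method draws uniform values on [0, 1) just like `np.random.rand`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
rate = 2.0                      # assumed rate parameter for illustration

def exp_inv_cdf(u, rate):
    # Inverse of F(x) = 1 - exp(-rate * x)
    return -np.log(1.0 - u) / rate

# Uniform draws on [0, 1) play the role of random CDF values
u = rng.random(100_000)
samples = exp_inv_cdf(u, rate)

# The mean of an exponential distribution is 1 / rate
sample_mean = samples.mean()
print(sample_mean, 1.0 / rate)
```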

OK, let's put this together with something like:

`sample = inv_cdf(np.random.rand(1000))`

In [None]:
sample = inv_cdf(np.random.rand(1000))

Now make a histogram of sample and set the bins to 100. How does it look? What happens when you increase the argument to `np.random.rand`?

In [None]:
sample = inv_cdf(np.random.rand(1000))
cdf_data = pd.DataFrame(data=sample, columns=['inv_cdf'])
cdf_data.hist(bins=100, density=True)
plt.xlim([-40, 60])
plt.ylim([0, 0.15])

In [None]:
sample = inv_cdf(np.random.rand(10000))
cdf_data = pd.DataFrame(data=sample, columns=['inv_cdf'])
cdf_data.hist(bins=100, density=True)
plt.xlim([-40, 60])
plt.ylim([0, 0.15])

In [None]:
sample = inv_cdf(np.random.rand(100000))
cdf_data = pd.DataFrame(data=sample, columns=['inv_cdf'])
cdf_data.hist(bins=100, density=True)
plt.xlim([-40, 60])
plt.ylim([0, 0.15])
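
Beyond comparing histograms by eye, one rough way to judge the resampling is to compare percentiles of the original data with percentiles of the generated sample. The sketch below rebuilds a smaller two-component mixture so that it runs on its own; the mixture, the seed, and the chosen percentiles are all assumptions made for illustration.

```python
import numpy as np
import scipy.interpolate

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Stand-in for the population: a mixture of two normal distributions
data = np.concatenate([
    rng.normal(loc=-6.4, scale=1.2, size=40_000),
    rng.normal(loc=22, scale=8, size=72_000),
])

# Build the interpolated inverse CDF with the same steps as in this notebook
hist_data, bin_edges = np.histogram(data, bins=100, density=True)
cdf_bins = np.insert(np.cumsum(hist_data * np.diff(bin_edges)), 0, 0)
inv_cdf = scipy.interpolate.interp1d(cdf_bins, bin_edges)

# Draw a new sample and compare a few percentiles with the original data
sample = inv_cdf(rng.random(100_000))
q = [10, 25, 50, 75, 90]
data_q = np.percentile(data, q)
sample_q = np.percentile(sample, q)
max_gap = np.max(np.abs(data_q - sample_q))

print(data_q)
print(sample_q)
print(max_gap)
```

Small gaps relative to the spread of the data suggest the interpolated inverse CDF reproduces the bulk of the distribution; the bin width limits how closely it can match.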