In [7]:
%matplotlib inline
import os
from os import path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats


# Binomial and Bernoulli Distributions
The Binomial and Bernoulli distributions both deal with the probability of events with binary outcomes. The Bernoulli distribution is the special case of the Binomial distribution in which the trial is performed only once ($n=1$).

### The Bernoulli Distribution
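Say you toss a coin a single ($n=1$) time. The outcome $X$ has exactly two possible values: $x=1$ for heads and $x=0$ for tails. If $\theta$ is the probability of the coin landing on heads, then $X$ follows a Bernoulli distribution $Ber(x|\theta)$, shown below, where $I(\cdot)$ is the indicator function (1 if its argument is true, 0 otherwise).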

$$
\begin{align}
Ber(x\ |\ \theta)   &= \theta^{I(x=1)}\ (1-\theta)^{I(x=0)} \\
                &= \left \lbrace
                       \begin{array}{ll}
                       \theta    & if\ x=1 \\
                       1 - \theta & if\ x=0
                       \end{array}
                   \right.
                \end{align}
$$
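
As a minimal sketch, the indicator-exponent form above can be written directly in Python. The function name `bernoulli_pmf` and the value $\theta = 0.65$ are illustrative assumptions, not part of the original notebook.

```python
def bernoulli_pmf(x, theta):
    """Return Ber(x | theta) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    # The indicator exponents select theta when x = 1 and (1 - theta) when x = 0.
    return theta ** int(x == 1) * (1 - theta) ** int(x == 0)


theta = 0.65  # illustrative probability of heads
p_heads = bernoulli_pmf(1, theta)
p_tails = bernoulli_pmf(0, theta)
print(p_heads, p_tails)
```

The two probabilities sum to 1, as they must for a distribution over two outcomes.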

### The Binomial Distribution
Say you toss a coin several ($n$) times. Each toss has exactly two possible outcomes. If $\theta$ is the probability of the coin landing on heads on any single toss, and the tosses are independent, then the number of heads $X$ follows a Binomial distribution, $X \sim Bin(n,\theta)$. The probability mass function for the Binomial distribution, which gives the probability of exactly $k$ heads, is

$$
Bin(k\ |\ n,\ \theta) = \binom nk\ \theta^k\ (1-\theta)^{n-k}
$$
The expression $\binom n k = \frac {n!} {(n-k)!k!}$ (referred to as "n choose k") is the number of different ways you can choose k items from n items.

The Binomial distribution has the following mean and variance:
$$
\begin{align}
    \mu &= n \theta \\
    \sigma^2 &= n \theta (1 - \theta)
\end{align}
$$
This makes intuitive sense: each trial produces the positive outcome with probability $\theta$, regardless of how many trials came before it, so the expected number of positive outcomes grows linearly with $n$. Likewise, each trial contributes the intrinsic per-trial variance $\theta (1 - \theta)$, and because the trials are independent these variances add, so the variance of the count also grows linearly with the number of trials.
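
One way to check the moments of the Binomial distribution is to compute them directly from the probability mass function. The sketch below uses only the standard library; the helper name `binomial_pmf` and the values $n = 10$, $\theta = 0.3$ are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Return Bin(k | n, theta): probability of exactly k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


n, theta = 10, 0.3  # illustrative values, not taken from the text
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

# Moments computed by brute-force summation over all possible counts k.
total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(f"total    = {total:.6f}")
print(f"mean     = {mean:.6f}  vs n*theta           = {n * theta:.6f}")
print(f"variance = {variance:.6f}  vs n*theta*(1-theta) = {n * theta * (1 - theta):.6f}")
```

The summed mean and variance should agree with $n\theta$ and $n\theta(1-\theta)$ up to floating-point error.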

#### Example
You start tossing a biased coin, and the coin is biased to land on heads 65% of the time. Therefore, $\theta = 0.65$. Since the coin has a binary outcome (not considering it landing standing up), <strong>each</strong> toss of the coin follows a Bernoulli distribution with $\theta = 0.65$. However, say you toss the coin $N$ times. The number of heads across those <strong>multiple</strong> tosses then follows a Binomial distribution with $\theta = 0.65$, $n = N$.

Say $N = 7$, and we want to know the probability of the coin landing on heads exactly 4 times. Using the Binomial distribution,
$$
\begin{align}
n = 7,\ \theta = 0.65,\ k = 4 \\
Bin(k\ |\ n,\ \theta) &= \binom n k\ \theta^k\ (1-\theta)^{n-k} \\
                      &= \binom 7 4\ (0.65)^4\ (0.35)^3 \\
                      &= (35)\ (0.1785)\ (0.0429) \\
                      &= 0.268
\end{align}
$$

Breaking line 3 down, we see that under the Binomial distribution, the probability of exactly $k$ positive outcomes (coin on heads) is the product of 3 terms:
- How many different ways the $k$ positive outcomes can be arranged across the $n$ trials
- The probability that the positive outcome happens on each of $k$ specific trials
- The probability that the negative outcome happens on each of the remaining $n-k$ trials
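
The three factors can be computed separately to reproduce the worked example. This is a small sketch using `math.comb` from the standard library; the variable names are illustrative.

```python
from math import comb

n, theta, k = 7, 0.65, 4  # values from the example above

ways = comb(n, k)                 # number of ways to place k heads among n tosses
p_heads = theta ** k              # probability of heads on k specific tosses
p_tails = (1 - theta) ** (n - k)  # probability of tails on the remaining n - k tosses

probability = ways * p_heads * p_tails
print(ways, round(p_heads, 4), round(p_tails, 4), round(probability, 3))
```

Rounded to three decimals, the product matches the 0.268 obtained by hand.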



# Multinomial and Multinoulli Distributions
The Multinomial and Multinoulli distributions both deal with the probability of events with $K$ outcomes. The Multinoulli distribution is the special case of the Multinomial distribution in which the trial is performed only once. They are natural generalizations of the Binomial and Bernoulli distributions, respectively, to the case of non-binary outcomes.

Let $\mathbf{x} = (x_1...x_K)$ be the random vector associated with $n$ independent trials, each with $K$ discrete outcomes. Here, $x_j$ holds the number of times the $j^{th}$ outcome occurs, so $\sum_{j=1}^K x_j = n$. The distribution of $\mathbf{x}$ is as follows:

$$
\begin{align}
Mu(\mathbf{x}\ |\ n,\mathbf{\theta}) &= \binom n {x_1...x_K} \prod_{j=1}^K \theta_j^{x_j} \\
\theta_j &:\ P(\text{outcome } j \text{ occurs}) \\
\binom n {x_1...x_K} &= \frac {n!} {x_1 ! x_2 ! ... x_K !}:\ \text{multinomial coefficient}
\end{align}
$$

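The multinomial coefficient is the number of ways that we can divide a set of size $n$ into subsets with sizes $x_1$ up to $x_K$. Each such arrangement corresponds to one specific sequence of outcomes, which has probability $\prod_{j=1}^K \theta_j^{x_j}$, so multiplying the two gives the total probability of observing the counts $\mathbf{x}$.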
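
A minimal standard-library sketch of the Multinomial probability mass function follows. The function names and the three-outcome example (probabilities 0.5, 0.3, 0.2 over 6 trials) are hypothetical and chosen only for illustration.

```python
from math import factorial, prod


def multinomial_coefficient(counts):
    """Return n! / (x_1! x_2! ... x_K!) where n = sum(counts)."""
    coeff = factorial(sum(counts))
    for x in counts:
        coeff //= factorial(x)  # each intermediate quotient is an exact integer
    return coeff


def multinomial_pmf(counts, thetas):
    """Return Mu(x | n, theta) for count vector `counts` and probabilities `thetas`."""
    if len(counts) != len(thetas):
        raise ValueError("counts and thetas must have the same length")
    sequence_probability = prod(t**x for t, x in zip(thetas, counts))
    return multinomial_coefficient(counts) * sequence_probability


# Hypothetical example: K = 3 outcomes observed over n = 6 trials.
thetas = [0.5, 0.3, 0.2]
counts = [3, 2, 1]
coeff = multinomial_coefficient(counts)
p = multinomial_pmf(counts, thetas)
print(coeff, p)

# With K = 2 the same function gives the Binomial probability of 4 heads in 7 tosses.
p_binomial = multinomial_pmf([4, 3], [0.65, 0.35])
print(round(p_binomial, 3))
```

The second call illustrates the reduction to the Binomial case when there are only two outcomes.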

In the Multinoulli case ($n=1$ trial), the $\mathbf{x}$ vector reduces to a one-hot encoding of the $K$ outcomes, since exactly one entry is 1 and the rest are 0. The multinomial coefficient reduces to 1, and the exponents in the product become a simple indicator function evaluation. This distribution is also commonly referred to as a "discrete" or "categorical" distribution.

$$
Mu(\mathbf{x}|1,\mathbf{\theta}) = Cat(\mathbf{x}|\mathbf{\theta}) = \prod_{j=1}^K \theta_j^{I(x_j=1)}
$$
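
For a one-hot vector, the product of indicator powers simply picks out the probability of the outcome that occurred. The sketch below assumes a hypothetical three-outcome distribution; `categorical_pmf` is an illustrative name.

```python
def categorical_pmf(one_hot, thetas):
    """Return the probability of the single outcome marked in a one-hot vector."""
    if sum(one_hot) != 1 or any(x not in (0, 1) for x in one_hot):
        raise ValueError("one_hot must contain exactly one 1 and zeros elsewhere")
    p = 1.0
    for x_j, theta_j in zip(one_hot, thetas):
        # theta_j ** 1 when outcome j occurred, theta_j ** 0 == 1 otherwise.
        p *= theta_j ** int(x_j == 1)
    return p


thetas = [0.5, 0.3, 0.2]  # hypothetical outcome probabilities
p_second = categorical_pmf([0, 1, 0], thetas)

# Summing over all K one-hot vectors recovers the total probability mass.
all_one_hot = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
total = sum(categorical_pmf(v, thetas) for v in all_one_hot)
print(p_second, total)
```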

# The Poisson Distribution
The Poisson distribution is a discrete distribution which models the number of times an event occurs within a given interval under the following conditions: <br/>
1) The event occurs with a known, constant rate. <br/>
2) Each occurrence of the event is independent of the other occurrences in the interval. <br/>

Specifically, $X \in \{0,1,2,...\}$ follows a Poisson distribution with rate parameter $\lambda > 0$ if its probability mass function is

$$
Poi(x|\lambda) = e^{-\lambda} \frac {\lambda^x} {x!}
$$

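The first term, $e^{-\lambda}$, is a normalization constant: since $\sum_{x=0}^{\infty} \frac {\lambda^x} {x!} = e^{\lambda}$, multiplying by $e^{-\lambda}$ makes the probabilities sum to 1.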
TODO more info.
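
The role of the $e^{-\lambda}$ factor can be checked numerically. This is a rough sketch that truncates the infinite sum at 60 terms, which is an assumption that works for small $\lambda$; the value $\lambda = 3$ and the name `poisson_pmf` are illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Return Poi(x | lam) for a non-negative integer x."""
    return exp(-lam) * lam**x / factorial(x)


lam = 3.0      # illustrative rate
n_terms = 60   # truncation point; the tail beyond it is negligible for this rate

# Without the exp(-lam) factor, the terms sum to approximately exp(lam).
unnormalized = sum(lam**x / factorial(x) for x in range(n_terms))

# With the factor, the terms sum to approximately 1.
total = sum(poisson_pmf(x, lam) for x in range(n_terms))

print(unnormalized, exp(lam), total)
```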


# The Empirical Distribution
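The Empirical distribution is a glorified counter: given a data set $\{x_1,...,x_N\}$, it assigns to each set $A$ the fraction of observations that fall in $A$, and has zero probability mass for values that never appear in the data. Here $\delta_x(A)$ is the Dirac measure, which indicates whether $x$ belongs to $A$.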

$$
\begin{align}
    p_{emp}(A) &= \frac 1 N \sum_{i=1}^N \delta_{x_i}(A) \\
    \delta_x(A) &= \left \lbrace
                       \begin{array}{ll}
                       1    & if\ x\ \in A \\
                       0    & if\ x\ \notin A
                       \end{array}
                   \right.
\end{align}
$$
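
The "glorified counter" description can be sketched with `collections.Counter`. The sample data and function names below are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter


def empirical_distribution(samples):
    """Map each observed value to its relative frequency in `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in counts.items()}


def empirical_probability(samples, event):
    """Return p_emp(A): the fraction of samples that fall in the set `event`."""
    return sum(1 for x in samples if x in event) / len(samples)


samples = ["a", "b", "a", "c", "a", "b"]  # hypothetical observations
p_emp = empirical_distribution(samples)
p_a_or_c = empirical_probability(samples, {"a", "c"})
p_unseen = p_emp.get("d", 0.0)  # values never observed carry zero mass

print(p_emp, p_a_or_c, p_unseen)
```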