# The Multinomial Distribution: A Quick Review

The multinomial distribution is most easily illustrated with a hypothetical (and possibly unfair) $k$-sided die, which you roll $n$ times.  Letting $N_j \in \{0, \ldots, n\}$ denote the number of times that the $j$th side appears, the multinomial distribution is the probability distribution on the space of $k$-dimensional count vectors $\vec{N} = (N_1, \ldots, N_k)$.

**If you're comfortable with the multinomial distribution and don't want to get bogged down with derivation and more detailed notation, please feel free to skip to the next section.**

More generally, let $X$ denote a discrete random variable that takes on one of $k$ values $\{1, \ldots, k\}$.  Let $\vec{p} = (p_1, \ldots, p_k)$ denote the vector whose $i$th entry encodes the probability mass on the $i$th outcome, so 
$$
\mathbb{P}(X = i | \vec{p}) = \prod_{j=1}^k p_j^{\delta(i=j)}.
$$
Here, $\delta$ is the indicator function such that
$$
\delta(i=j)=
 \begin{cases} 
      1 & i=j \\
      0 & i \neq j  
   \end{cases}.
$$

Let $\vec{X} = (X_1, \ldots, X_n)$ denote the random sample vector obtained from taking $n$ i.i.d. samples from $X$ (sampled values will be denoted $\vec{x} = (x_1, \ldots, x_n)$).  Then 
\begin{align*}
\mathbb{P}(\vec{X} = (x_1, \ldots, x_n) | \vec{p}) &= \prod_{i=1}^n \bigg( \prod_{j=1}^k p_j^{\delta(x_i=j)} \bigg) \\
&= \prod_{j=1}^k  \bigg( \prod_{i=1}^n p_j^{\delta(x_i=j)} \bigg)\\
&= \prod_{j=1}^k  p_j^{n_j} ,
\end{align*}
where $n_j = \sum_{i=1}^n\delta(x_i=j)$ is the number of occurrences of outcome $j$.
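As a quick sanity check on the derivation above, here is a minimal sketch that computes the probability of one ordered sample both ways: as a product over the $n$ draws, and as a product over the $k$ outcomes using the counts $n_j$. The probability vector `p` and the sample `xs` are made-up values for illustration, and the function names are my own.

```python
from collections import Counter
from math import isclose, prod


def prob_by_draws(xs, p):
    """Multiply the probability of each draw in order (outcomes are labelled 1..k)."""
    return prod(p[x - 1] for x in xs)


def prob_by_counts(xs, p):
    """Multiply p_j ** n_j over outcomes j, where n_j counts occurrences of j in xs."""
    counts = Counter(xs)
    return prod(p_j ** counts.get(j, 0) for j, p_j in enumerate(p, start=1))


# Illustrative (assumed) values: a three-sided die and six rolls.
p = [0.2, 0.3, 0.5]
xs = [1, 3, 3, 2, 3, 1]

by_draws = prob_by_draws(xs, p)
by_counts = prob_by_counts(xs, p)
print(by_draws, by_counts)
```

The two numbers agree up to floating-point rounding, since the second form only regroups the factors of the first.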

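Let $\vec{N} = (N_1, \ldots, N_k)$ denote the random count vector obtained from taking $n$ i.i.d. samples from $X$ (sampled values of this count vector will be denoted $\vec{n} = (n_1, \ldots, n_k)$).  Many distinct sample sequences $\vec{x}$ produce the same count vector $\vec{n}$, and each of them has the same probability $\prod_{j=1}^k p_j^{n_j}$.  To obtain the probability distribution on this random vector, we need only multiply by the number of such sequences, which is the multinomial coefficient ${n\choose n_1, \ldots, n_k} = \frac{n!}{n_1! \cdots n_k!}$.  This yields
$$
\mathbb{P}(\vec{N} = (n_1, \ldots, n_k) | \vec{p}) = {n\choose n_1, \ldots, n_k} \prod_{j=1}^k  p_j^{n_j},
$$
where $n_1 + \cdots + n_k = n$.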
We have now arrived at the familiar multinomial distribution :).
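To make the formula concrete, the following is a small sketch of the multinomial probability mass function using only the standard library. It also enumerates every count vector for a small $n$ and $k$ to check that the probabilities sum to one. The helper names and the probability vector `p` are assumptions chosen for illustration.

```python
from math import factorial, isclose, prod


def multinomial_pmf(counts, p):
    """P(N = counts | p) for a count vector whose entries sum to n."""
    coeff = factorial(sum(counts))
    for n_j in counts:
        coeff //= factorial(n_j)  # exact integer division at every step
    return coeff * prod(p_j ** n_j for p_j, n_j in zip(p, counts))


def count_vectors(n, k):
    """Yield every k-tuple of nonnegative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest


# Illustrative (assumed) values: a three-sided die rolled four times.
p = [0.2, 0.3, 0.5]
example = multinomial_pmf((1, 1, 2), p)
total = sum(multinomial_pmf(c, p) for c in count_vectors(4, 3))
print(example, total)
```

For anything beyond toy sizes, a library routine working in log space would be a safer choice than raw factorials; this version is only meant to mirror the formula term by term.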

Recall some special cases:
* Setting $n=1$ yields the **categorical distribution**.
* Setting $k=2$ yields the **binomial distribution**.
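Both special cases can be checked numerically. The sketch below redefines a small multinomial pmf helper so that it runs on its own, then compares it against the binomial formula for $k=2$ and against the raw probabilities $p_i$ for $n=1$. All numeric values are arbitrary choices for illustration.

```python
from math import comb, factorial, isclose, prod


def multinomial_pmf(counts, p):
    """P(N = counts | p), written directly from the multinomial formula."""
    coeff = factorial(sum(counts))
    for n_j in counts:
        coeff //= factorial(n_j)
    return coeff * prod(p_j ** n_j for p_j, n_j in zip(p, counts))


# k = 2: compare with the binomial pmf for m successes in n trials.
n, m, q = 10, 3, 0.4
binomial = comb(n, m) * q**m * (1 - q) ** (n - m)
two_sided = multinomial_pmf((m, n - m), (q, 1 - q))

# n = 1: a single roll, so each one-hot count vector has probability p_i.
p = (0.2, 0.3, 0.5)
one_roll = [multinomial_pmf(e, p) for e in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]]

print(binomial, two_sided, one_roll)
```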

# Some Motivating Data

The Dirichlet distribution is best viewed as a probability distribution on the space of multinomial distributions.  To understand this, let's consider some simple data.

Say you're given a three-sided die, which you have rolled 30 times.  The corresponding count vector is given by $(n_1, n_2, n_3) = (5, 10, 15)$, so you have rolled a "1" precisely five times, a "2" ten times, and a "3" fifteen times.  Using this information, you would like to provide an estimate for the true probabilities of rolling each side.  

Denote your estimate for the probability of rolling the $i$th side by $\hat{p}_i$.  Intuition perhaps tells us that one potential estimate is given by 
$$
\hat{p}_i = \frac{n_i}{n_1 + n_2 + n_3}.
$$
That is, one potential estimate is $\hat{p}_1 = \frac{5}{30}$, $\hat{p}_2 = \frac{10}{30}$, $\hat{p}_3 = \frac{15}{30}$.  This corresponds to the maximum likelihood estimate for the probabilities.  (_You should verify this, if it is not immediately obvious.  I have calculated a maximum likelihood estimate [in this blog post](https://alexisbcook.github.io/mle-map-and-coin-flips.html) for a similar case_.)
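Here is a minimal sketch of that estimate, computed from the count vector $(5, 10, 15)$ with exact fractions. As a rough (not rigorous) check on the maximum likelihood claim, it also compares the log-likelihood at $\hat{p}$ with the log-likelihood under a fair die; the variable and function names are my own.

```python
from fractions import Fraction
from math import log

counts = [5, 10, 15]
total = sum(counts)

# Relative frequencies n_i / (n_1 + n_2 + n_3), kept exact.
p_hat = [Fraction(n_i, total) for n_i in counts]


def log_likelihood(p, counts):
    """Sum of n_j * log(p_j); the multinomial coefficient is dropped
    because it does not depend on p."""
    return sum(n_j * log(p_j) for p_j, n_j in zip(p, counts))


ll_hat = log_likelihood([float(x) for x in p_hat], counts)
ll_fair = log_likelihood([1 / 3, 1 / 3, 1 / 3], counts)

print(p_hat, ll_hat, ll_fair)
```

The estimate sums to one, and its log-likelihood is higher than that of the fair-die probabilities, which is consistent with (though not a proof of) it being the maximizer.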

What if, independent of the data, we have some well-informed beliefs about the underlying probabilities?  That is, what if a friend has told us that she believes the die is fair (or that the probability of rolling each side is $1/3$), and we mostly believe her, but want to also consider what we have learned from the data?

Our first step is to figure out how to encode the beliefs that we had about the die prior to viewing the data, or to specify a **prior distribution** on the die-roll probabilities.  This prior distribution will be a prior distribution on the space  