In [1]:
#!/usr/bin/env python

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
from numpy import mean
from scipy.stats import norm

# Parameter Estimation for Categorical Distribution<br>
In this example, we are going to estimate the parameters of a Categorical Distribution with 3 categories<br>
	where the data $\mathcal{D}$ consists of 1 red, 2 blue, and 4 green.

## Basic counting and recognizing the distribution<br>
Using this approach, we simply need to recognize that this situation has 3 outcomes.<br>
The Categorical Distribution is the usual default structure for the probability mass function in this case.<br>
In general, if we model p(x) as Categorical, then we assume the structure <br>
$$p(x) = \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} \quad \text{where} \quad x_1,x_2,x_3 \in \{0,1\} \; \text{and} \; x_1 + x_2 + x_3 = 1.$$

According to MLE, the estimate of $\theta_i$ is the relative frequency of category $i$, that is, its count divided by the total number of samples.<br>
Given 1 red, 2 blue, and 4 green (7 samples in total), the p(x) would be:

$$p(x) = (1/7)^{x_1} (2/7)^{x_2} (4/7)^{x_3}$$

Note that $x_1, x_2, x_3$ here denote the categories and __not the sample id__.
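
The counting argument above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original notebook: the names `counts`, `theta_mle`, and `categorical_pmf` are our own, and the category order (red, blue, green) is an assumption.

```python
import numpy as np

# Counts from the dataset D, in the assumed order (red, blue, green)
counts = np.array([1, 2, 4])

# MLE for a categorical distribution: relative frequency of each category
theta_mle = counts / counts.sum()

def categorical_pmf(x, theta):
    """Evaluate p(x) = prod_j theta_j^{x_j} for a one-hot vector x."""
    x = np.asarray(x)
    return float(np.prod(theta ** x))

print(theta_mle)                               # 1/7, 2/7, 4/7
print(categorical_pmf([0, 0, 1], theta_mle))   # 4/7, the probability of green
```

Because $x$ is one-hot, the product simply picks out the $\theta_j$ of the category that occurred.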

## Maximum A Posteriori Estimation (MAP)<br>
The MAP estimation is more complicated. Here, instead of maximizing $p(X=\mathcal{D}|\theta)$, we want to find $\max \; p(\theta|X=\mathcal{D})$. <br>
In other words, we want to find the most likely $\theta$ given the entire dataset $X=\mathcal{D}$.<br>
Take a moment to distinguish MLE from MAP:<br>
- MLE : <br>
$$ \max_{\theta} \; p(X=\mathcal{D}|\theta) $$<br>
- MAP : <br>
$$ \max_{\theta} \; p(\theta|X=\mathcal{D}) $$<br>
With this method, we use Bayes' theorem <br>
$$ p(\theta | X=\mathcal{D}) = \frac{p(X=\mathcal{D}|\theta) p(\theta)}{p(X=\mathcal{D})} $$<br>
From MLE, we know that <br>
$$p(X=\mathcal{D}|\theta) = \mathcal{L} = \prod_{i=1}^n \; \theta_1^{x_{i1}} \theta_2^{x_{i2}} \theta_3^{x_{i3}}$$<br>
Note that for $x_{ij}$, $i$ represents the sample id, and $j$ represents the category id. 

- With MLE, the likelihood function is sufficient. <br>
- MAP allows us to use prior knowledge about the distribution of $\theta$. The MAP estimate consequently combines our prior knowledge with the data to produce the estimate. <br>
- In this particular example, we use a Dirichlet prior with $\alpha_1 = 2, \alpha_2 = 2, \alpha_3 = 2$. <br>
$$ p(\theta) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1} \quad \text{where if $n$ is an integer then} \quad \Gamma(n) = (n-1)!$$<br>
Note that the $\Gamma(z)$ function is more complicated if $z$ is not an integer. In general (for $z > 0$), it is defined by the integral <br>
$$ \Gamma(z) = \int_0^{\infty} \; t^{z - 1} e^{-t} \; dt $$
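
To make the prior concrete, here is a small sketch that evaluates the Dirichlet density with $\alpha = (2, 2, 2)$ directly from the formula and compares it with `scipy.stats.dirichlet`. The helper `dirichlet_pdf` and the test point `theta` are illustrative choices of ours, not something from the original notebook.

```python
import numpy as np
from math import factorial
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])

def dirichlet_pdf(theta, alpha):
    """Dirichlet density written exactly as in the formula above."""
    theta = np.asarray(theta, dtype=float)
    # Normalizing constant: Gamma(sum alpha) / prod Gamma(alpha_j)
    norm_const = gamma(alpha.sum()) / np.prod(gamma(alpha))
    return float(norm_const * np.prod(theta ** (alpha - 1)))

theta = np.array([0.2, 0.3, 0.5])        # an arbitrary point on the simplex
manual = dirichlet_pdf(theta, alpha)
reference = dirichlet.pdf(theta, alpha)  # SciPy's implementation
print(manual, reference)

# For integer arguments, Gamma(n) = (n - 1)!
print(gamma(5), factorial(4))
```

With $\alpha_j = 2$ the density is proportional to $\theta_1 \theta_2 \theta_3$, so this prior gently favors balanced probabilities over extreme ones.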

###	Applying the Conjugate Priors<br>
To obtain the posterior, we apply the conjugate prior of the categorical distribution, which is a Dirichlet distribution. Here $B(\alpha)$ is shorthand for the normalizing constant of the prior.<br>
$$ p(\theta) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1} = B(\alpha) \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1} \quad \text{where if $n$ is an integer then} \quad \Gamma(n) = (n-1)!$$<br>
Therefore, the posterior is <br>
$$p(\theta|X=\mathcal{D}) = \frac{p(X=\mathcal{D}|\theta) p(\theta)}{p(X=\mathcal{D})}$$<br>
implying that<br>
$$p(\theta|X=\mathcal{D}) \propto \left( \prod_{i=1}^n \; \theta_1^{x_{i1}} \theta_2^{x_{i2}} \theta_3^{x_{i3}} \right) \left( B(\alpha) \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1}  \right)$$<br>
$$p(\theta|X=\mathcal{D}) \propto \left( \theta_1^{\sum_i x_{i1}} \theta_2^{\sum_i x_{i2}} \theta_3^{\sum_i x_{i3}} \right) \left( B(\alpha) \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1}  \right)$$<br>
$$p(\theta|X=\mathcal{D}) \propto B(\alpha) \left( \theta_1^{\sum_i x_{i1} + \alpha_1 - 1} \theta_2^{\sum_i x_{i2} + \alpha_2 - 1} \theta_3^{\sum_i x_{i3} + \alpha_3 - 1} \right).$$<br>
This tells us that the posterior is also a Dirichlet distribution where we let<br>
$$ \hat{\alpha_1} = \sum_i x_{i1} + \alpha_1$$<br>
$$ \hat{\alpha_2} = \sum_i x_{i2} + \alpha_2$$<br>
$$ \hat{\alpha_3} = \sum_i x_{i3} + \alpha_3$$<br>
Then<br>
$$p(\theta|X=\mathcal{D}) = \frac{\Gamma(\hat{\alpha_1} + \hat{\alpha_2} + \hat{\alpha_3} )}{\Gamma(\hat{\alpha_1})\Gamma(\hat{\alpha_2})\Gamma(\hat{\alpha_3})} \left( \theta_1^{\hat{\alpha_1} - 1} \theta_2^{\hat{\alpha_2} - 1} \theta_3^{\hat{\alpha_3} - 1} \right).$$
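
The conjugate update amounts to adding the observed counts to the prior parameters. The sketch below computes $\hat{\alpha}$ for our data and then the MAP estimate. Note that the mode formula $(\hat{\alpha}_j - 1) / (\sum_k \hat{\alpha}_k - K)$ is the standard mode of a Dirichlet distribution when every $\hat{\alpha}_j > 1$; it is not derived in the text above, and the variable names are our own.

```python
import numpy as np

counts = np.array([1, 2, 4])   # sum_i x_ij for (red, blue, green)
alpha = np.array([2, 2, 2])    # Dirichlet prior parameters

# Posterior parameters: alpha_hat_j = sum_i x_ij + alpha_j
alpha_hat = counts + alpha
print(alpha_hat)               # [3 4 6]

# MAP estimate = mode of the Dirichlet posterior (valid when all alpha_hat_j > 1)
K = len(alpha_hat)
theta_map = (alpha_hat - 1) / (alpha_hat.sum() - K)
print(theta_map)               # 2/10, 3/10, 5/10

# For comparison, the MLE from basic counting
theta_mle = counts / counts.sum()
print(theta_mle)               # 1/7, 2/7, 4/7
```

Compared with the MLE, the MAP estimate is pulled toward the uniform distribution, which is the effect of the symmetric prior.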

###	The Predictive Posterior<br>
$$p(x|X=\mathcal{D}) = \int \; p(x|\theta) p(\theta|X = \mathcal{D}) \; d\theta $$<br>
$$p(x|X=\mathcal{D}) = \int \; \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} B(\hat{\alpha})  \theta_1^{\hat{\alpha_1} - 1} \theta_2^{\hat{\alpha_2} - 1} \theta_3^{\hat{\alpha_3} - 1} \; d\theta$$<br>
If we let <br>
$$\bar{\alpha}_1 = x_1 + \hat{\alpha_1}$$<br>
$$\bar{\alpha}_2 = x_2 + \hat{\alpha_2}$$<br>
$$\bar{\alpha}_3 = x_3 + \hat{\alpha_3}$$<br>
then we can rewrite the integral as <br>
$$\frac{B(\hat{\alpha})}{B(\bar{\alpha})} \; \int \; B(\bar{\alpha}) \theta_1^{\bar{\alpha_1} - 1} \theta_2^{\bar{\alpha_2} - 1} \theta_3^{\bar{\alpha_3} - 1} \; d\theta= \frac{B(\hat{\alpha})}{B(\bar{\alpha})}$$<br>
We can further simplify this into<br>
$$\frac{\Gamma(\hat{\alpha_1} + \hat{\alpha_2} + \hat{\alpha_3})}{\Gamma(\hat{\alpha_1})\Gamma(\hat{\alpha_2})\Gamma(\hat{\alpha_3})} \frac{\Gamma(\bar{\alpha_1})\Gamma(\bar{\alpha_2})\Gamma(\bar{\alpha_3})}{\Gamma(\bar{\alpha_1} + \bar{\alpha_2} + \bar{\alpha_3})} $$<br>
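Remember that $x = [x_1\quad x_2\quad x_3]^{\top}$ with $x_i \in \{0,1\}$ and $x_1 + x_2 + x_3 = 1$. Exactly one $x_j$ equals 1, so $\bar{\alpha}_j = \hat{\alpha_j} + 1$ for that category only, and the sum of the $\bar{\alpha}$ is one more than the sum of the $\hat{\alpha}$. Using $\Gamma(z+1) = z\,\Gamma(z)$, all other Gamma factors cancel, which allows us to further simplify this into<br>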
$$p(x|X=\mathcal{D}) = \frac{ \hat{\alpha_1}^{x_1} \hat{\alpha_2}^{x_2} \hat{\alpha_3}^{x_3}}{ \hat{\alpha_1} + \hat{\alpha_2} + \hat{\alpha_3} }$$
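
As a sanity check, the sketch below evaluates the predictive posterior in two ways: through the ratio of Gamma functions and through the simplified closed form. It assumes the posterior parameters $\hat{\alpha} = (3, 4, 6)$ obtained from the counts (1, 2, 4) and the prior $\alpha = (2, 2, 2)$. The function names are illustrative, and `log_B` follows this notebook's convention that $B(\alpha)$ is the normalizing constant.

```python
import numpy as np
from scipy.special import gammaln

alpha_hat = np.array([3.0, 4.0, 6.0])   # posterior Dirichlet parameters

def log_B(a):
    """log of B(a) = Gamma(sum a) / prod Gamma(a_j), as defined in this notebook."""
    return gammaln(a.sum()) - gammaln(a).sum()

def predictive_via_gamma(x, alpha_hat):
    """p(x | D) = B(alpha_hat) / B(alpha_bar), with alpha_bar = x + alpha_hat."""
    x = np.asarray(x, dtype=float)
    alpha_bar = alpha_hat + x
    return float(np.exp(log_B(alpha_hat) - log_B(alpha_bar)))

def predictive_closed_form(x, alpha_hat):
    """p(x | D) = prod_j alpha_hat_j^{x_j} / sum_j alpha_hat_j."""
    x = np.asarray(x, dtype=float)
    return float(np.prod(alpha_hat ** x) / alpha_hat.sum())

for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    print(x, predictive_via_gamma(x, alpha_hat), predictive_closed_form(x, alpha_hat))
# Both columns agree: 3/13, 4/13, 6/13
```

Working with `gammaln` instead of `gamma` avoids overflow when the counts are large.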