# What is probability?
1. Frequentist: Probabilities represent long-run frequencies of events
2. Bayesian: 
  - Probability is used to quantify our uncertainty about something
  - It can be used to model our uncertainty about events that do not have long-run frequencies

## Discrete random variables

For an event $A \in \mathcal{X}$, such as "it will rain tomorrow":
 - **Probability mass function (PMF)**: $p(A)$ denotes the probability that event $A$ is true
 - $0 \leq p(A) \leq 1$
 - $\sum_{a \in \mathcal{X}}p(a) = 1$


### Joint Probability
Given two events $A, B \in \mathcal{X}$, the probability of the joint event A and B:
\begin{align}
p(A, B) = p(A \cap B) = p(A | B)p(B) = p(B | A)p(A)
\end{align}

### Marginal distribution

The marginal distribution of one variable is what remains after summing the joint probabilities over all values of the other variables.

Given a joint distribution, the marginal distribution is defined as:
\begin{equation}
p(A) = \sum_{b \in \mathcal{X}}p(A, B=b) = \sum_{b \in \mathcal{X}}p(A|B=b)p(B=b)
\end{equation}

For example, suppose $B$ and $C$ are mutually exclusive and exhaustive events, and $p(B), p(C), p(A|B), p(A|C)$ are known. Then $p(A)$ is:
\begin{equation}
p(A) = p(A, B) + p(A, C) = p(A|B)p(B) + p(A|C)p(C)
\end{equation}

### Conditional Probability

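Conditional probability is the probability that event A occurs given that another event B has already occurred. It is written $p(A|B)$ and read "the probability of A given B".

We define the conditional probability of an event A, given that event B is true: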
\begin{equation}
p(A|B) = \frac{p(A, B)}{p(B)} \; \textrm{if} \; p(B) \; \gt 0 
\end{equation}
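
As a minimal sketch of the joint, marginal, and conditional definitions above, the following Python snippet derives a marginal and a conditional from a small joint table. The table values and the names `joint`, `marginal_a`, `marginal_b`, and `conditional_a_given_b` are made up for illustration.

```python
# Hypothetical joint distribution p(A, B); keys are (a, b) and values sum to 1.
joint = {
    ("rain", "jam"): 0.15,
    ("rain", "clear"): 0.05,
    ("dry", "jam"): 0.20,
    ("dry", "clear"): 0.60,
}


def marginal_a(joint, a):
    """p(A=a) = sum over b of p(A=a, B=b)."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)


def marginal_b(joint, b):
    """p(B=b) = sum over a of p(A=a, B=b)."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)


def conditional_a_given_b(joint, a, b):
    """p(A=a | B=b) = p(A=a, B=b) / p(B=b), defined only when p(B=b) > 0."""
    p_b = marginal_b(joint, b)
    if p_b <= 0:
        raise ValueError("p(B=b) must be positive")
    return joint[(a, b)] / p_b


p_rain = marginal_a(joint, "rain")                              # 0.15 + 0.05
p_rain_given_jam = conditional_a_given_b(joint, "rain", "jam")  # 0.15 / 0.35
```

Multiplying `p_rain_given_jam` by `marginal_b(joint, "jam")` recovers the joint entry, which is the product rule $p(A, B) = p(A|B)p(B)$.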

### Bayes Rule

Combining the definitions of conditional and marginal probability gives Bayes' rule:
\begin{equation}
\underbrace{p(X=x|Y=y) = \frac{p(X=x, Y=y)}{p(Y=y)}}_{\textrm{conditional probability}} \; \textrm{and} \; \underbrace{p(Y=y) = \sum_{x'}p(X=x')p(Y=y | X=x')}_{\textrm{marginal probability}}  \\ \Downarrow \\
p(X=x|Y=y) = \frac{p(X=x)p(Y=y | X=x)}{\sum_{x'}p(X=x')p(Y=y | X=x')}
\end{equation}

### A medical diagnosis problem using Bayes Rule

**Example**: Suppose you are a woman in your 40s and you decide to have a mammogram, a medical test for breast cancer. Let $X \in \{0, 1\}$ denote the test result ($X = 1$ is positive) and $Y \in \{0, 1\}$ whether you have cancer ($Y = 1$ means you do). If the test is positive, what is the probability $p(Y=1 | X=1)$ that you have cancer?

\begin{proof}
Assume that $p(X=1|Y=1) = 0.8, p(X=1|Y=0) = 0.1, p(Y=1) = 0.004$
\begin{equation}
\begin{aligned}
p(Y=1|X=1) &= \frac{p(X=1, Y=1)}{p(X=1)} \\
&= \frac{p(X=1|Y=1)p(Y=1)}{p(X=1, Y=1) + p(X=1, Y=0)} \\
&= \frac{p(X=1|Y=1)p(Y=1)}{p(X=1|Y=1)p(Y=1) + p(X=1|Y=0)p(Y=0)} \\
&= \frac{0.8 \times 0.004}{0.8 \times 0.004 + 0.1 \times 0.996} \approx 0.031 
\end{aligned}
\end{equation}
\end{proof}
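
The calculation above can be checked with a few lines of Python. This is only a sketch: the function name `posterior_positive` and its parameter names are hypothetical, and the inputs are the three assumed probabilities from the example.

```python
def posterior_positive(sensitivity, false_positive_rate, prior):
    """p(Y=1 | X=1) for a binary test, via Bayes' rule.

    sensitivity         = p(X=1 | Y=1)
    false_positive_rate = p(X=1 | Y=0)
    prior               = p(Y=1)
    """
    # Marginal p(X=1), summing over both values of Y.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


p_cancer = posterior_positive(sensitivity=0.8, false_positive_rate=0.1, prior=0.004)
print(round(p_cancer, 3))  # 0.031
```

Even with a positive test the posterior stays near 3%, because the prior $p(Y=1) = 0.004$ is so small that false positives dominate the evidence term.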

### Conditional Independence

- Random variables $X$ and $Y$ are independent, denoted $X \perp Y$, if $p(X, Y) = p(X)p(Y)$; equivalently, $p(X|Y) = p(X)$ or $p(Y|X) = p(Y)$
- Conditional independence: $X \perp Y | Z \Leftrightarrow p(X, Y | Z) = p(X|Z)p(Y|Z)$ 

## Continuous Random Variable

- Probability density function (PDF): $p(x) \geq 0$
- The larger $p(x)$ is at a point $x$, the more likely the distribution is to generate a value near $x$. Note that $p(x)$ is a density, not a probability, so it may exceed 1.
- The probability that $x$ falls into an interval $[a, b]$ is $\int_{a}^{b} p(x) \,dx$
- The density integrates to one, because any value drawn from $p(x)$ must fall between negative and positive infinity: $\int_{-\infty}^{\infty}p(x) \,dx = 1$
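
To make the two integrals above concrete, here is a small numerical sketch. It assumes a standard normal density as the example $p(x)$ and uses a plain trapezoidal rule; the helper names `normal_pdf` and `integrate` are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution (chosen here only as an example p(x))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h


# Probability of falling into [-1, 1]: roughly 0.683 for a standard normal.
prob_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)

# Total mass; [-10, 10] stands in for the whole real line.
total_mass = integrate(normal_pdf, -10.0, 10.0)
```

The interval $[-10, 10]$ is a practical stand-in for $(-\infty, \infty)$ because the standard normal density is negligible outside it.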

## Descriptive Statistical Properties

### Mean (a.k.a, expected value)

- Discrete Random Variable: $\mu = \mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \, p(x)$
- Continuous Random Variable: $\mu = \mathbb{E}[X] = \int_{\mathcal{X}} x \, p(x) \,dx$ 

### Variance ("spread" of a distribution)

\begin{equation}
\begin{aligned}
\textrm{var}[X] &= \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] \\
&= \mathbb{E}[X^2] - \mathbb{E}[2\mu X] + \mathbb{E}[\mu^2] \\
&= \mathbb{E}[X^2] - 2\mu \mathbb{E}[X] + \mathbb{E}[\mu^2] \\
&= \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2
\end{aligned}
\end{equation}

Alternatively, the same result follows from the integral definition:
\begin{proof}
\begin{equation}
\begin{aligned}
\textrm{var}[X] &= \mathbb{E}[(X-\mu)^2] = \int (x-\mu)^2p(x)dx \\
&= \int x^2p(x) dx - \int 2\mu x p(x) dx + \int \mu^2 p(x)dx \\
&= \int x^2p(x) dx - 2\mu \int x p(x) dx + \mu^2 \int p(x)dx \\
&= \mathbb{E}[X^2] - 2\mu \mathbb{E}[X] + \mu^2 * 1 \\
&= \mathbb{E}[X^2] - \mu^2
\end{aligned}
\end{equation}
\end{proof}
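
The identity $\textrm{var}[X] = \mathbb{E}[X^2] - \mu^2$ can be verified exactly on a small discrete distribution. The sketch below assumes a fair six-sided die as the example and uses `fractions.Fraction` to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Fair six-sided die: p(x) = 1/6 for x = 1..6 (an illustrative choice).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Mean: E[X] = sum of x * p(x).
mean = sum(x * p for x, p in pmf.items())

# Variance from the definition: E[(X - mu)^2].
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Variance from the shortcut: E[X^2] - mu^2.
var_shortcut = sum(x * x * p for x, p in pmf.items()) - mean ** 2

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```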

### Standard deviation

- $\textrm{std}[X] = \sqrt{\textrm{var}[X]} $

### Covariance

For two jointly distributed real-valued random variables X and Y with finite second moments, the covariance is defined as the expected value (or mean) of the product of their deviations from their individual expected values:
\begin{equation}
\begin{aligned}
\textrm{cov}(X, Y) &= \mathbb{E}[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))] \\
&= \mathbb{E}[(X-\mu_X)(Y- \mu_Y)] \\
&= \mathbb{E}[XY-X*\mu_Y - Y*\mu_X + \mu_X*\mu_Y] \\
&= \mathbb{E}[XY]-\mathbb{E}[X*\mu_Y] - \mathbb{E}[Y*\mu_X] + \mathbb{E}[\mu_X*\mu_Y] \\
&= \mathbb{E}[XY]-\mathbb{E}[X]*\mu_Y - \mathbb{E}[Y]*\mu_X + \mu_X*\mu_Y \\
&= \mathbb{E}[XY]-\mu_X*\mu_Y - \mu_Y*\mu_X + \mu_X*\mu_Y \\
&= \mathbb{E}[XY]-\mu_X*\mu_Y \\
&= \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]
\end{aligned}\end{equation}

If X is a d-dimensional random vector, its covariance matrix is:
\begin{equation}
\begin{aligned}
\textrm{cov}[X] &= \mathbb{E}[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^T] \\
&= \begin{bmatrix}
\color{red}{\textrm{cov}[X_1, X_1]}  & \textrm{cov}[X_1, X_2] & \cdots &\textrm{cov}[X_1, X_d]\\
\textrm{cov}[X_2, X_1] & \color{red}{\textrm{cov}[X_2, X_2]} & \cdots &\textrm{cov}[X_2, X_d]\\
\vdots & \vdots & \ddots & \vdots \\
\textrm{cov}[X_d, X_1]& \textrm{cov}[X_d, X_2] & \cdots &\color{red}{\textrm{cov}[X_d, X_d]}\\
\end{bmatrix} = \begin{bmatrix}
\color{red}{\textrm{var}[X_1]}  & \textrm{cov}[X_1, X_2] & \cdots &\textrm{cov}[X_1, X_d]\\
\textrm{cov}[X_2, X_1] & \color{red}{\textrm{var}[X_2]} & \cdots &\textrm{cov}[X_2, X_d]\\
\vdots & \vdots & \ddots & \vdots \\
\textrm{cov}[X_d, X_1]& \textrm{cov}[X_d, X_2] & \cdots &\color{red}{\textrm{var}[X_d]}\\
\end{bmatrix}
\end{aligned}
\end{equation}

### Correlation Coefficient

- The (Pearson) correlation coefficient between X and Y measures the linear relationship, which is defined by: 
\begin{equation}
-1 \leq \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]}{\sqrt{\textrm{var}[X]\textrm{var}[Y]}} \leq 1
\end{equation}
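
A short sketch of the covariance and correlation formulas, applied to paired samples. It assumes the expectation is taken under the empirical distribution (so sums are divided by $n$, not $n-1$); the function names and the toy data are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """cov(X, Y) = E[XY] - E[X]E[Y] under the empirical distribution."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def corr(xs, ys):
    """Pearson correlation: cov(X, Y) / sqrt(var(X) var(Y))."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


def cov_matrix(columns):
    """Covariance matrix of a list of equally long sample columns."""
    return [[cov(a, b) for b in columns] for a in columns]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # increases linearly with xs
zs = [8.0, 6.0, 4.0, 2.0]  # decreases linearly with xs

sigma = cov_matrix([xs, ys, zs])  # diagonal entries are the variances
```

Here `corr(xs, ys)` is $1$ and `corr(xs, zs)` is $-1$ up to rounding, the two extremes of the bound above, and `sigma` is symmetric because $\textrm{cov}[X_i, X_j] = \textrm{cov}[X_j, X_i]$.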

## Common distributions (TODO)

### Bernoulli Distribution (Discrete)
- Toss a coin only once
- Let $X$ be a binary random variable, $X \in \{0, 1\}$

- With $\theta$ the probability that the toss shows heads ($x = 1$), the distribution of $X$ is:
\begin{equation}
\textrm{Ber}(x|\theta) =  \left\{
	\begin{array}{ll}
		\theta  & \mbox{if } x = 1 \\
		1 - \theta & \mbox{if } x = 0
	\end{array}
\right.
\end{equation}
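
As a small sketch, the Bernoulli PMF can be written in one line using the compact form $\theta^x(1-\theta)^{1-x}$. It assumes the usual convention that $\theta$ is the probability of heads ($x = 1$); the function name `bernoulli_pmf` is hypothetical.

```python
def bernoulli_pmf(x, theta):
    """Ber(x | theta): theta if x == 1, and 1 - theta if x == 0."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return theta ** x * (1.0 - theta) ** (1 - x)


p_head = bernoulli_pmf(1, 0.3)  # theta
p_tail = bernoulli_pmf(0, 0.3)  # 1 - theta
```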

### Binomial Distribution (Discrete)


### Multinomial Distribution (Discrete)

### Gaussian Distribution (Continuous)

### Laplace Distribution  (Continuous)

### Beta Distribution  (Continuous)

# Bayesian Theory

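The Bayesian method originates from an essay that Thomas Bayes wrote to solve an "inverse probability" problem, which was published by a friend only after his death. Before Bayes wrote it, people already knew how to compute "forward probabilities", for example: "given a bag containing N white balls and M black balls, what is the probability of drawing a black ball?" A natural question runs in the opposite direction: "if we do not know the proportion of black and white balls in the bag in advance, and we draw one (or several) balls with our eyes closed and observe their colors, what can we infer about the proportion of black and white balls in the bag?" This is the inverse probability problem.


Bayesian inference is one of the core methods of machine learning. The deeper reason is that the real world is inherently uncertain and human observation is limited: what we observe day to day is only the surface outcome of things. Continuing the bag analogy, we usually only see the colors of the balls we draw and cannot look directly inside the bag. We therefore have to propose a hypothesis. Hypotheses are by nature uncertain (there may be finitely many or infinitely many of them), so to decide which one is correct we need to do two things: 1. compute how plausible each hypothesis is; 2. find the most plausible hypothesis. The first amounts to computing the posterior probability of a specific hypothesis (for a continuous hypothesis space, the posterior probability density). The second is model comparison; if the prior is ignored, model comparison reduces to the maximum likelihood method.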

## Generative Classifier

Given a dataset $X$, how do we decide which class it belongs to?

The general approach is to compute, given $X$, the probability of each class; the class with the highest probability is the most likely one. Here $\theta$ denotes the parameters of the model.

The probability of class $Y=c$ given $X$ is defined using the class-conditional density $p(X|Y=c)$ and the class prior $p(Y=c)$:
\begin{equation}
\label{eq:baysian_rule}
\begin{aligned}
p(Y=c|X, \theta) &= \frac{p(X, Y=c, \theta)}{p(X, \theta)} \\
& = \frac{p(X|Y=c, \theta)p(Y=c, \theta)}{\sum_{c'}p(Y=c', \theta)p(X|Y=c', \theta)} \\
&\propto p(X|Y=c, \theta)p(Y=c, \theta)
\end{aligned}
\end{equation}
- $p(Y=c | X, \theta)$: posterior probability
- $f(\theta) = p(X|Y=c, \theta)$: likelihood; viewed as a function of $\theta$, it is called the likelihood function
- $p(Y=c, \theta)$: prior probability
- $p(X, \theta)$: normalizing factor (evidence)

**Solution:** the posterior equals the likelihood times the prior, up to a normalizing constant. 
\begin{equation}
\begin{aligned}
\textrm{posterior} &= \frac{\textrm{prior} \times \textrm{likelihood}}{\textrm{evidence}} \\
&\propto \textrm{prior} \times \textrm{likelihood}
\end{aligned}
\end{equation}
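
The "prior times likelihood, then normalize" recipe can be sketched as follows. This is an illustration under stated assumptions, not a prescribed model: the class-conditional densities are taken to be one-dimensional Gaussians, and the class names, priors, and parameters are invented.

```python
import math


def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def class_posterior(x, priors, params):
    """Return p(Y=c | x) for every class c.

    priors: dict mapping class -> p(Y=c)
    params: dict mapping class -> (mu, sigma) of the assumed Gaussian p(x | Y=c)
    """
    # Numerator of Bayes' rule: prior times likelihood, per class.
    unnormalized = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    # Evidence: the same quantity summed over all classes.
    evidence = sum(unnormalized.values())
    return {c: u / evidence for c, u in unnormalized.items()}


priors = {"a": 0.5, "b": 0.5}
params = {"a": (0.0, 1.0), "b": (4.0, 1.0)}

post_mid = class_posterior(2.0, priors, params)   # halfway between the two means
post_near_a = class_posterior(0.0, priors, params)
predicted = max(post_near_a, key=post_near_a.get)  # class with the largest posterior
```

For classification only the argmax matters, so the division by `evidence` could be skipped; it is kept here so that the returned values are proper probabilities.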

## Example: Number Game (from Josh Tenenbaum's PhD thesis)

Inferring abstract patterns from a sequence of integers.

**Problem:** There is a set of hypotheses $\mathcal{H}$: {'prime numbers', 'numbers between 1 and 10', $\ldots$}. A set of positive examples (the training data) $D = \{x_1, x_2, \ldots, x_n\}$ is drawn from some arithmetical concept $h \in \mathcal{H}$. The task is to decide whether a new test case $x$ belongs to $h$ (that is, to classify $x$).
- **Posterior predictive distribution**: the probability that $x \in h$ given the dataset $D$: $p(x \in h |D, \theta)$.
\begin{equation}
\begin{aligned}
p(x \in h | D, \theta) & = \frac{p(x \in h, D, \theta)}{p(D, \theta)} \\
&= \frac{p(D|x \in h, \theta)p(x \in h, \theta)}{p(D, \theta)} \\
& = \frac{p(D|x \in h, \theta)p(x \in h, \theta)}{\sum_{h' \in \mathcal{H}}p(D|x \in h', \theta)p(x \in h', \theta)} \\
&\propto p(D|x \in h, \theta)p(x \in h, \theta)
\end{aligned}
\end{equation}

**Example**
- For simplicity, assume all numbers are integers between 1 and 100
- $D = \{2, 8, 16, 64\}$, whose elements are sampled IID (independent and identically distributed) from the concept

**Solution**

1. **Hypothesis space of concepts $\mathcal{H}$**

Usually, a model favors the simplest hypothesis consistent with the data (Occam's razor). In this case, we choose $\mathcal{H} = \{h_1, h_2, h_3\}$:
- "powers of two", that's $h_1 = h_{two} = \{2, 4, 8, 16, 32, 64\}$, where $|h_{1}| = 6$ 
- "even numbers", that's $h_2 = h_{even} = \{2, 4, 6, \ldots, 100\}$, where $|h_{2}|=50$
- "powers of two except 32", that's $h_3 = h_{two-32} = \{2, 4, 8, 16, 64\}$, where $|h_{3}| = 5$ 

2. **Prior p(h)**

Usually, the prior probability captures background knowledge, domain knowledge, and pre-existing biases. In our case, $h_{two-32}$ looks like a better fit to the data than $h_{two}$; however, it seems "conceptually unnatural". We can capture this intuition by assigning a lower prior probability to unnatural concepts. 
\[ p(\mathcal{H}) = \begin{bmatrix} p(h_1) = 0.19, \\ p(h_2) = 0.8, \\ p(h_3) = 0.01 \end{bmatrix}\]


3. **Likelihood p(D|h)**

In this case, we define the likelihood function (the statistical information in the examples) as follows:
\[ p(D | h) = \left[\frac{1}{\textrm{size}(h)}\right]^n = \left[\frac{1}{|h|}\right]^n \; \textrm{if } x_1, \ldots, x_n \in h, \; \textrm{and } 0 \textrm{ otherwise}\] 
where $n$ is the number of elements in $D$. Smaller hypotheses receive greater likelihood, and exponentially more so as $n$ increases.

Therefore, the likelihood for $h_1$ is:  
\begin{equation}
\begin{aligned}
p(D|h_1) &= p(x_1|h_1) \times p(x_2|h_1) \times p(x_3|h_1) \times p(x_4|h_1) \;\;\; (\textrm{the data are IID}) \\
&= \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \\
&= [\frac{1}{|h_1|}]^4  \;\;\; (|h_1| = 6)  \\
&= [\frac{1}{6}]^4
\end{aligned}
\end{equation}
Similarly, we can compute the likelihoods for $h_2$ and $h_3$:
\[p(D|h_2) = [\frac{1}{|h_2|}]^4 = [\frac{1}{50}]^4\]
\[p(D|h_3) = [\frac{1}{|h_3|}]^4 = [\frac{1}{5}]^4\]

4. **Posterior p(h|D)**

\begin{equation}
\begin{aligned}
p(x \in h_1 | D) & = \frac{p(x \in h_1, D)}{p(D)} \\
&= \frac{p(D|x \in h_1)p(x \in h_1)}{p(D)} \\
& = \frac{p(D|x \in h_1)p(x \in h_1)}{\sum_{h' \in \mathcal{H}}p(D|x \in h')p(x \in h')} \\
&\propto p(D|x \in h_1)p(x \in h_1) \\
&= p(D|h_1)p(h_1)
\end{aligned}
\end{equation}
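
The four steps above can be put together in a short sketch that computes the normalized posterior for all three hypotheses. The hypothesis sets, priors, and data come from the example; the function names `likelihood` and `posterior` are hypothetical.

```python
# Hypothesis space from the example: each concept is a set of integers in 1..100.
hypotheses = {
    "powers of two": {2, 4, 8, 16, 32, 64},
    "even numbers": set(range(2, 101, 2)),
    "powers of two except 32": {2, 4, 8, 16, 64},
}
priors = {
    "powers of two": 0.19,
    "even numbers": 0.8,
    "powers of two except 32": 0.01,
}
data = [2, 8, 16, 64]


def likelihood(data, h):
    """p(D | h) = (1 / |h|)^n if every example lies in h, otherwise 0."""
    if not all(x in h for x in data):
        return 0.0
    return (1.0 / len(h)) ** len(data)


def posterior(data, hypotheses, priors):
    """p(h | D) for every h: prior times likelihood, normalized over all h."""
    unnormalized = {
        name: priors[name] * likelihood(data, h) for name, h in hypotheses.items()
    }
    evidence = sum(unnormalized.values())
    if evidence == 0.0:
        raise ValueError("no hypothesis is consistent with the data")
    return {name: u / evidence for name, u in unnormalized.items()}


post = posterior(data, hypotheses, priors)
best = max(post, key=post.get)  # "powers of two"
```

With these numbers "powers of two" wins with a posterior of about 0.90: the smaller set $h_3$ has a higher likelihood, but its low prior outweighs that advantage, while "even numbers" is penalized by its size.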


## Approaches to Estimate Parameters from IID data

Usually, the posterior is simply the likelihood times the prior, normalized:
\begin{equation}
\begin{aligned}
p(h|D, \theta) &= \frac{p(D, h, \theta)}{p(D, \theta)} \\
& = \frac{p(D|h, \theta)p(h, \theta)}{\sum_{h'}p(h', \theta)p(D|h', \theta)} \\
&\propto p(D|h, \theta)p(h, \theta)
\end{aligned}
\end{equation}

Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are both methods for estimating an unknown quantity in the setting of probability distributions or graphical models. They are similar in that each computes a single point estimate instead of a full distribution.

### Maximum Likelihood Estimation (MLE)
Anyone who has worked in machine learning has likely used MLE already, sometimes without knowing it. For example, when fitting a Gaussian to a dataset, we immediately take the sample mean and sample variance and use them as the parameters of the Gaussian. This is MLE: if we take the derivatives of the Gaussian log-likelihood with respect to the mean and the variance and set them to zero, the solutions are exactly the sample mean and the sample variance. As another example, much of the optimization in machine learning and deep learning (neural networks, etc.) can be interpreted as MLE.

\begin{equation}
\begin{aligned}
\hat{h}^{MLE} &= \underset{h \in \mathcal{H}}{\textrm{argmax}} \, p(D|h, \theta) \\
&= \underset{h \in \mathcal{H}}{\textrm{argmax}} \, \log p(D|h, \theta) \;\;\;(\text{the posterior } p(h|D, \theta) \text{ with the prior ignored})
\end{aligned}
\end{equation}

Speaking more abstractly, suppose we have a likelihood function $P(X|\theta)$. Then the MLE for $\theta$, the parameter we want to infer, is:
\begin{equation}
\begin{aligned}
\theta_{MLE} &= \underset{\theta}{\textrm{argmax}} P(X|\theta) \\
&= \underset{\theta}{\textrm{argmax}}\prod_{i} P(x_i|\theta) \;\;\; (x_i \in X \;\text{sampled IID})
\end{aligned}
\end{equation}

Because each factor satisfies $0 \leq P(x_i|\theta) \leq 1$, the product approaches 0 as the number of factors grows, which causes numerical underflow in practice. We therefore work in **log** space: the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its logarithm.

\begin{equation}
\begin{aligned}
\theta_{MLE} &= \underset{\theta}{\textrm{argmax}} \log P(X|\theta) \\
&= \underset{\theta}{\textrm{argmax}} \log  \prod_{i} P(x_i|\theta) \;\;\; (x_i \in X \;\text{sampled IID}) \\
&= \underset{\theta}{\textrm{argmax}} \sum_{i} \log P(x_i|\theta)
\end{aligned}
\end{equation}
To use this framework, we just need to derive the log-likelihood of our model and then maximize it with respect to $\theta$ using our favorite optimization algorithm, such as gradient ascent (or, equivalently, gradient descent on the negative log-likelihood).
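
The underflow argument is easy to reproduce. The sketch below assumes 400 IID observations that each have likelihood 0.01 (an arbitrary illustrative value) and compares the raw product with the sum of logs.

```python
import math

# 400 hypothetical IID observations, each with likelihood 0.01 under the model.
probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below the smallest positive
# double (about 5e-324), so the result underflows to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Log space: the same quantity as a sum of logs, which is perfectly representable.
log_likelihood = sum(math.log(p) for p in probs)

print(product)         # 0.0
print(log_likelihood)  # about -1842.07
```

Once the product has collapsed to `0.0`, two parameter settings can no longer be compared, whereas their log-likelihoods still can.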

### Maximum A Posteriori (MAP)

MAP usually comes up in a **Bayesian** setting because, as the name suggests, it works on the posterior distribution, not only the likelihood. When we have enough data, the posterior becomes peaked on a single concept.

\begin{equation}
\begin{aligned}
\hat{h}^{MAP} &= \underset{h \in \mathcal{H}}{\textrm{argmax}} [p(h|D, \theta)] \\
&= \underset{h \in \mathcal{H}}{\textrm{argmax}} \left[\frac{p(D|h, \theta)p(h, \theta)}{\sum_{h'}p(h', \theta)p(D|h', \theta)}\right] \\
&= \underset{h \in \mathcal{H}}{\textrm{argmax}}[\log(p(D|h, \theta)p(h, \theta))] \\
&= \underset{h \in \mathcal{H}}{\textrm{argmax}}[\log p(D|h, \theta)  + \log p(h, \theta)]
\end{aligned}
\end{equation}

Among all hypotheses in $\mathcal{H}$, the one with the largest posterior probability is the $h$ that the current dataset most likely belongs to.


Recall that, by Bayes' rule, the posterior is proportional to the product of the likelihood and the prior:
\begin{equation}
\begin{aligned}
P(\theta|X) &= \frac{P(X|\theta)P(\theta)}{P(X)} \\
&\propto P(X|\theta)P(\theta)
\end{aligned}
\end{equation}
We ignore the **normalizing constant** because we are only concerned with optimization here, so proportionality is sufficient. If we replace the likelihood in the MLE formula above with the posterior, we get:

\begin{equation}
\begin{aligned}
\theta_{MAP} &= \underset{\theta}{\textrm{argmax}}  \log [P(X|\theta)P(\theta)] \\
&= \underset{\theta}{\textrm{argmax}} [\log P(X|\theta) + \log P(\theta)] \\
&= \underset{\theta}{\textrm{argmax}} [\log \prod_{i}{P(x_i | \theta)} + \log P(\theta)]  \;\;\; (x_i \in X \;\text{sampled IID})\\
&= \underset{\theta}{\textrm{argmax}} \left[\sum_{i} \log P(x_i | \theta) + \log P(\theta)\right]
\end{aligned}
\end{equation}

### Comparison between MLE and MAP
Comparing the MLE and MAP equations, the only difference is the inclusion of the prior $P(\theta)$ in MAP; otherwise they are identical. In other words, the likelihood is now weighted by the prior.
\begin{equation}
\begin{aligned}
\theta_{MLE} &= \underset{\theta}{\textrm{argmax}} \sum_{i} \log P(x_i | \theta) \\
\theta_{MAP} &= \underset{\theta}{\textrm{argmax}} \left[\sum_{i} \log P(x_i | \theta) + \log P(\theta)\right]
\end{aligned}
\end{equation}

There are several options for the prior distribution:

- **Uniform prior:** We assign equal weight to all possible values of $\theta$, so the likelihood is multiplied by a constant. A constant does not affect the maximization and can be dropped from the MAP equation, hence $\theta_{MLE} = \theta_{MAP}$.
- **Gaussian prior:** The prior density is high near its mean and low far from it, so it is not constant and pulls the estimate toward the prior mean.
- Others, such as the Beta distribution.

We can conclude that **MLE is a special case of MAP in which the prior is uniform!**

### Example 1: Bernoulli model

We perform the following steps to estimate the parameter of the model: 

1. We observe $N$ IID coin tosses: $D = \{x_1, x_2,  \ldots, x_N\}$, where $x_i \in \{0, 1\}$ ($1$ for heads, $0$ for tails). For example, $D = \{1, 0, 1, \ldots, 0\}$.
2. The model for an instance $x \in D$ is defined as follows:
\begin{equation}
P(x|\theta) = \textrm{Ber}(x|\theta) =  \left\{
	\begin{array}{ll}
		\theta  & \mbox{if } x = 1 \\
		1 - \theta & \mbox{if } x = 0
	\end{array}
\right. \, \Rightarrow \, P(x|\theta) = \theta^x(1-\theta)^{1-x}
\end{equation}
3. **MLE:** We maximize the objective function, the log-likelihood of dataset $D$:
\begin{equation}
\begin{aligned}
L(\theta; D) &= \log P(D|\theta) \\
&= \log \prod_{i}P(x_i|\theta) = \log \prod_{i} \theta^{x_i}(1-\theta)^{1-x_i} \\
&= \log \theta^{\sum_{i}x_i}(1-\theta)^{\sum_i(1-x_i)} \\
&= \log \theta^{n_h}(1-\theta)^{n_t} \\
&= n_h \log \theta + n_t \log (1 - \theta) \\
&= n_h \log \theta + (N - n_h) \log (1 - \theta)
\end{aligned}
\,\Rightarrow\, \hat{\theta}_{MLE} = \underset{\theta}{\textrm{argmax}} \, L(\theta; D)
\end{equation}

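4. To maximize this objective function, we take its derivative with respect to $\theta$ and set it to zero, where $n_h = \sum_i x_i$ is the number of heads and $n_t = N - n_h$ is the number of tails: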
\begin{equation}
\frac{\partial L(\theta; D)}{\partial \theta} = \frac{n_h}{\theta} - \frac{N-n_h}{1 - \theta} = 0 \\ \Downarrow \\ \hat{\theta}_{MLE} = \frac{n_h}{N} = \frac{n_h}{n_h + n_t} \; \text{or} \; \hat{\theta}_{MLE} = \frac{1}{N}\sum_{i} x_i 
\end{equation}
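
As a sanity check on the closed form $\hat{\theta}_{MLE} = n_h / N$, the sketch below compares it with a brute-force grid search over the log-likelihood. The data and the function names are made up for illustration, and the grid search is only a check, not a recommended optimizer.

```python
import math


def bernoulli_log_likelihood(theta, data):
    """L(theta; D) = n_h log(theta) + n_t log(1 - theta)."""
    n_h = sum(data)
    n_t = len(data) - n_h
    return n_h * math.log(theta) + n_t * math.log(1.0 - theta)


def bernoulli_mle(data):
    """Closed-form MLE: the fraction of heads."""
    return sum(data) / len(data)


data = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical tosses: 5 heads, 3 tails
theta_hat = bernoulli_mle(data)  # 5 / 8 = 0.625

# Brute-force check on a grid over the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))
```

Both approaches agree on $0.625$ because the log-likelihood is concave in $\theta$ and has a single maximum.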

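**Overfitting problem:** What if we toss too few times and see no heads at all? Then $\hat{\theta}_{MLE} = 0$, which claims that heads are impossible. A simple remedy, made more formal by the MAP estimate in step 5, is to add imaginary observations to the counts: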
\begin{equation}
\hat{\theta}_{MLE} = \frac{n_h + n'}{n_h + n_t + n'} \; \text{where} \; n' \text{ is known as the pseudo (imaginary) count.}  
\end{equation}

5. **MAP:** Take a Beta distribution as the prior of the model parameter $\theta$:
\begin{equation}
P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta - 1} = \frac{1}{B(\alpha, \beta)}\theta^{\alpha - 1}(1-\theta)^{\beta - 1}
\end{equation}
Then, according to Bayes' rule:
\begin{equation}
\begin{aligned}
P(\theta|D) &\propto P(D|\theta)P(\theta) \\
&= \underbrace{\theta^{n_h}(1-\theta)^{n_t}}_{P(D|\theta)} \times \underbrace{\frac{1}{B(\alpha, \beta)}\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}_{P(\theta)} \\
&= \frac{1}{B(\alpha, \beta)}\theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}
\end{aligned}
\end{equation}
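Maximum a posteriori (MAP) estimation:
\begin{equation}
\theta_{MAP} = \underset{\theta}{\textrm{argmax}}  \log P(\theta|D) = \underset{\theta}{\textrm{argmax}}  \log [P(D|\theta)P(\theta)]
\end{equation}
The posterior is proportional to a $\textrm{Beta}(n_h + \alpha, n_t + \beta)$ density. Setting the derivative of its logarithm to zero, as in step 4, gives its mode (for $n_h + \alpha > 1$ and $n_t + \beta > 1$):
\begin{equation}
\hat{\theta}_{MAP} = \frac{n_h + \alpha - 1}{N + \alpha + \beta - 2}
\end{equation}
The posterior mean, also called the Bayes estimate, is slightly different:
\begin{equation}
\hat{\theta}_{Bayes} = \int \theta p(\theta|D)d\theta = C\int \theta \times \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}
\end{equation}
where $C$ is the normalizing constant of the posterior and $A = \alpha + \beta$ is the prior strength, which can be interpreted as the size of an imaginary data set from which we obtain the **pseudo-counts**.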
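
A brief sketch of how the Beta prior changes the estimate. Note that the mode and the mean of a Beta distribution differ: for $\textrm{Beta}(a, b)$ with $a, b > 1$ the mode is $(a-1)/(a+b-2)$ and the mean is $a/(a+b)$. The function names and the example counts below are hypothetical.

```python
def beta_bernoulli_map(n_h, n_t, alpha, beta):
    """Mode of the Beta(n_h + alpha, n_t + beta) posterior (the MAP estimate)."""
    a, b = n_h + alpha, n_t + beta
    if a <= 1 or b <= 1:
        raise ValueError("the interior mode requires both posterior parameters > 1")
    return (a - 1) / (a + b - 2)


def beta_bernoulli_posterior_mean(n_h, n_t, alpha, beta):
    """Mean of the Beta(n_h + alpha, n_t + beta) posterior."""
    return (n_h + alpha) / (n_h + n_t + alpha + beta)


# Three tosses, no heads: the MLE would be 0, but a Beta(2, 2) prior keeps
# the estimate away from the boundary.
theta_map = beta_bernoulli_map(n_h=0, n_t=3, alpha=2, beta=2)              # 1/5
theta_mean = beta_bernoulli_posterior_mean(n_h=0, n_t=3, alpha=2, beta=2)  # 2/7

# With a uniform Beta(1, 1) prior the MAP estimate equals the MLE n_h / N.
theta_uniform = beta_bernoulli_map(n_h=5, n_t=3, alpha=1, beta=1)          # 5/8
```

The last line illustrates the earlier remark that MLE is the special case of MAP with a uniform prior, since $\textrm{Beta}(1, 1)$ is the uniform distribution on $[0, 1]$.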

### Example 2: Univariate Normal (Gaussian)
We perform the following steps to estimate the parameters of the model: 

1. We observe $N$ IID real-valued samples: $D = \{x_1, x_2,  \ldots, x_N\}$. For example, $D = \{-0.1, 10, 1, \ldots, 3\}$.
2. The model with parameters $\mu, \delta$ for an instance $x \in D$ is defined as follows:
\begin{equation}
P(x|\mu, \delta) = \frac{1}{\sqrt{2\pi \delta^2}}e^{-\frac{(x-\mu)^2}{2\delta^2}}
\end{equation}
3. **MLE:** We need to maximize the objective function - log likelihood of dataset $D$:
 \begin{equation}
\begin{aligned}
L(\mu, \delta; D) &= \log P(D|\mu, \delta) \\
&= \log \prod_{i}P(x_i|\theta) \\
&= \sum_{i} \log P(x_i | \theta) \\
&= \sum_{i} \log \frac{1}{\sqrt{2\pi \delta^2}}e^{-\frac{(x_i-\mu)^2}{2\delta^2}} \\
&= \sum_{i}(-\frac{1}{2} \log (2\pi \delta^2) - \frac{(x_i-\mu)^2}{2\delta^2}) \\
&= -\frac{N}{2} \log (2\pi \delta^2) - \sum_{i}(\frac{(x_i-\mu)^2}{2\delta^2})
\end{aligned}
\,\Rightarrow\, \hat{\theta}_{MLE} = \underset{\theta}{\textrm{argmax}} \, L(\theta; D), \; \theta = (\mu, \delta)
\end{equation}

4. To maximize this objective function, we take its derivatives with respect to $\mu$ and $\delta^2$ and set them to zero:
\begin{equation}
\frac{\partial L}{\partial \mu} = \frac{\sum_{i}(x_i - \mu)}{\delta^2} = 0 \Rightarrow \mu_{MLE} = \frac{\sum_{i}{x_i}}{N} \\
\frac{\partial L}{\partial \delta^2} = -\frac{N}{2\delta^2} + \frac{1}{2\delta^4}\sum_{i}(x_i - \mu)^2 = 0 \Rightarrow \delta^2_{MLE} = \frac{1}{N}\sum_{i}(x_i - \mu_{MLE})^2
\end{equation}
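
The two closed-form estimates translate directly into code. The sketch below uses a small made-up dataset; `gaussian_mle` is a hypothetical helper name. Note that the MLE of the variance divides by $N$, not $N - 1$.

```python
def gaussian_mle(data):
    """Return (mu_MLE, variance_MLE) for IID samples from a univariate Gaussian."""
    n = len(data)
    mu = sum(data) / n
    # The MLE of the variance divides by n (the biased estimator), not n - 1.
    variance = sum((x - mu) ** 2 for x in data) / n
    return mu, variance


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative samples
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)  # 5.0 4.0
```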

5. **MAP:** Similarly, we can assume a normal prior for the model parameter $\mu$, with prior mean $\mu_0$ and prior variance $\tau^2$: 
\[ P(\mu) = \frac{1}{\sqrt{2\pi \tau^2}}e^{-\frac{(\mu-\mu_0)^2}{2\tau^2}} \]
Maximum a posteriori (MAP) estimation:
\begin{equation}
\theta_{MAP} = \underset{\theta}{\textrm{argmax}}  \log P(\mu, \delta |D) = \underset{\theta}{\textrm{argmax}}  \log [P(D|\mu, \delta)P(\mu)]
\end{equation}
As in step 4, we then take the derivatives with respect to $\mu$ and $\delta$:
TODO
