# Density Estimation

$\DeclareMathOperator{\trace}{\mathrm{Tr}}$
$\newcommand{\d}[1]{{\,\mathrm{d}#1}}$

### Why Density Estimation?

Why are we interested in building a density model $p(x|\theta)$ from data observations $D=\{x_1,\dotsc,x_N\}$? Some examples:

- **Outlier detection**. Suppose $D=\{x_n\}$ are benign mammogram images. Build $p(x | \theta)$ from $D$. Then a low value for $p(x^\prime | \theta)$ indicates that $x^\prime$ is a risky mammogram.
- **Compression**. Code a new data item based on **entropy**, which is a functional of $p(x|\theta)$: 
$$
H[p] = -\sum_x p(x | \theta)\log p(x |\theta)
$$
- **Classification**. Let $p(x | \theta_1)$ be a model of attributes $x$ for credit-card holders that paid on time and $p(x | \theta_2)$ for clients that defaulted on payments. Then, assign a potential new client $x^\prime$ to either class based on the relative probability of $p(x^\prime | \theta_1)$ vs. $p(x^\prime|\theta_2)$.
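As a small illustration of the entropy functional used in the compression example, here is a minimal sketch for a discrete distribution given as a list of probabilities. The function name `entropy` and the two example distributions are hypothetical choices made for this illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p.

    Terms with zero probability are skipped, using the convention 0*log(0) = 0.
    """
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

# Hypothetical example distributions over six outcomes.
fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

h_fair = entropy(fair_die)      # equals log(6) for the uniform distribution
h_loaded = entropy(loaded_die)  # smaller, since outcomes are more predictable
```

A lower entropy means that, on average, fewer nats (or bits, with a base-2 logarithm) are needed to code an outcome.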


### Some Useful Matrix Calculus

We define the **gradient** of a scalar function $f(A)$ w.r.t. an $n \times k$ matrix $A$ as

$$
\nabla_A f \triangleq
    \begin{bmatrix}
\frac{\partial{f}}{\partial a_{11}} & \frac{\partial{f}}{\partial a_{12}} & \cdots & \frac{\partial{f}}{\partial a_{1k}}\\
\frac{\partial{f}}{\partial a_{21}} & \frac{\partial{f}}{\partial a_{22}} & \cdots & \frac{\partial{f}}{\partial a_{2k}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial{f}}{\partial a_{n1}} & \frac{\partial{f}}{\partial a_{n2}} & \cdots & \frac{\partial{f}}{\partial a_{nk}}
    \end{bmatrix}
$$
    
The following formulas are useful (see Bishop, Appendix C):

\begin{align}
|A^{-1}|&=|A|^{-1} \tag{B-C.4} \\
\nabla_A \log |A| &= (A^{T})^{-1} = (A^{-1})^T \tag{B-C.28} \\
\trace[ABC]&= \trace[CAB] = \trace[BCA] \tag{B-C.9} \\
\nabla_A \trace[AB] &=\nabla_A \trace[BA]= B^T \tag{B-C.25} \\
\nabla_A \trace[ABA^T] &= A(B+B^T)  \tag{B-C.27}\\
 \nabla_x x^TAx &= (A+A^T)x \tag{from B-C.27}\\
\nabla_X a^TXb &= \nabla_X \trace[ba^TX] = ab^T \notag
\end{align}
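Identities like these are easy to sanity-check numerically. The sketch below compares the analytic gradient $\nabla_x x^TAx = (A+A^T)x$ with a central finite-difference approximation. The helper names and the test matrix are assumptions made for this illustration only.

```python
def quad_form(A, x):
    """Evaluate the scalar x^T A x for a square matrix A (list of rows)."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_grad(A, x):
    """Gradient (A + A^T) x of the quadratic form x^T A x."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

# Hypothetical, deliberately non-symmetric test matrix and test point.
A = [[2.0, 1.0], [-3.0, 0.5]]
x = [1.0, -2.0]

g_analytic = analytic_grad(A, x)
g_numeric = numeric_grad(lambda v: quad_form(A, v), x)
```

Using a non-symmetric $A$ matters here: for symmetric $A$ the formula reduces to $2Ax$, which would hide a mistake in the $A^T$ term.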


### Log-Likelihood for a Multivariate Gaussian (MVG)

Assume we are given a set of IID data points $D=\{x_1,\ldots,x_N\}$, where $x_n \in \Re^D$. We want to build a model for these data.

1. **Model specification**. Let's assume an MVG model $x_n=\mu+\epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0,\Sigma)$, or equivalently,

$$
p(x_n|\mu,\Sigma) = |2 \pi \Sigma|^{-\frac{1}{2}} \exp \left\{-\frac{1}{2}(x_n-\mu)^T
\Sigma^{-1} (x_n-\mu) \right\}
$$

Since the data are IID, $p(D|\theta)$ factorizes as
$$  
p(D|\theta) = p(x_1,\ldots,x_N|\theta) \stackrel{\text{IID}}{=} \prod_n p(x_n|\theta)
$$

This choice of model yields the following log-likelihood (use (B-C.9) and (B-C.4)),

\begin{align}
 \log p(D|\theta) &= \log \prod_n p(x_n|\theta) = \sum_n \log p(x_n|\theta) \\
     &= N \cdot \log | 2\pi\Sigma |^{-1/2} - \frac{1}{2} \sum\nolimits_{n} (x_n-\mu)^T \Sigma^{-1} (x_n-\mu)
 \end{align}
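To make the log-likelihood expression concrete, the following sketch evaluates it for two-dimensional data, where the inverse and determinant of $\Sigma$ have simple closed forms. The function name `log_likelihood_2d` and the data values are hypothetical; the restriction to two dimensions is an assumption that keeps the example free of linear-algebra libraries.

```python
import math

def log_likelihood_2d(data, mu, Sigma):
    """Gaussian log-likelihood N*log|2*pi*Sigma|^(-1/2) - 0.5*sum of quadratic forms.

    Only valid for 2-dimensional data with a non-singular 2x2 covariance Sigma.
    """
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    N = len(data)
    # For dimension 2: |2*pi*Sigma| = (2*pi)^2 * |Sigma|.
    const = -0.5 * N * math.log((2 * math.pi) ** 2 * det)
    quad = 0.0
    for x in data:
        e0, e1 = x[0] - mu[0], x[1] - mu[1]
        quad += e0 * (inv[0][0] * e0 + inv[0][1] * e1) \
              + e1 * (inv[1][0] * e0 + inv[1][1] * e1)
    return const - 0.5 * quad

# Hypothetical data set and two candidate means, with identity covariance.
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
ll_centered = log_likelihood_2d(data, [2.0, 2.0], identity)
ll_origin = log_likelihood_2d(data, [0.0, 0.0], identity)
```

Here `ll_centered` is larger than `ll_origin`, because the first candidate mean lies in the middle of the data.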


### Maximum Likelihood estimation of mean of MVG

We want to maximize $\log p(D|\theta)$ with respect to the parameters $\theta=\{\mu,\Sigma\}$. Let's take derivatives, first with respect to the mean $\mu$ (making use of (B-C.25) and (B-C.27)),

\begin{align}
\nabla_\mu \log p(D|\theta) &= -\frac{1}{2}\sum_n \nabla_\mu \left[ (x_n-\mu)^T \Sigma^{-1} (x_n-\mu) \right] \notag \\
&= -\frac{1}{2}\sum_n \nabla_\mu \trace \left[ -2\mu^T\Sigma^{-1}x_n + \mu^T\Sigma^{-1}\mu \right] \notag \\
&= -\frac{1}{2}\sum_n \left( -2\Sigma^{-1}x_n + 2\Sigma^{-1}\mu \right) \notag \\
&= \Sigma^{-1}\,\sum_n \left( x_n-\mu \right)
\end{align}

Setting the gradient to zero yields the **sample mean**
$$
\boxed{
\hat \mu = \frac{1}{N} \sum_n x_n
}
$$
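A minimal sketch of this estimator, with the data stored as a list of equal-length lists. The name `sample_mean` and the data values are hypothetical. The second computation checks the stationarity condition $\sum_n (x_n - \hat\mu) = 0$ that follows from setting the gradient to zero.

```python
def sample_mean(data):
    """ML estimate of the mean: the average of the observations, per dimension."""
    N = len(data)
    dim = len(data[0])
    return [sum(x[i] for x in data) / N for i in range(dim)]

# Hypothetical data set with N=3 observations in 2 dimensions.
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
mu_hat = sample_mean(data)

# At the optimum, the residuals sum to zero in every dimension.
residual_sum = [sum(x[i] - mu_hat[i] for x in data) for i in range(len(mu_hat))]
```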



### Maximum Likelihood estimation of covariance of MVG

Now we take the gradient of the log-likelihood **wrt the precision matrix** $\Sigma^{-1}$ (making use of B-C.28 and B-C.24)

\begin{align}
\nabla_{\Sigma^{-1}}  \log p(D|\theta) &= \nabla_{\Sigma^{-1}} \left[ \frac{N}{2} \log |2\pi\Sigma|^{-1} - \frac{1}{2} \sum_{n=1}^N (x_n-\mu)^T \Sigma^{-1} (x_n-\mu)\right]\notag \\
&= \nabla_{\Sigma^{-1}} \left[ \frac{N}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{n=1}^N \trace \left[ (x_n-\mu) (x_n-\mu)^T \Sigma^{-1}\right] \right]\\
&= \frac{N}{2}\Sigma -\frac{1}{2}\sum_n (x_n-\mu)(x_n-\mu)^T
\end{align}

Setting the gradient to zero and solving for $\Sigma$, with $\mu$ replaced by its estimate $\hat\mu$, gives the optimum
$$
\boxed{
\hat \Sigma = \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat\mu)^T}
$$
which is the **sample covariance** matrix.
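The estimator can be sketched in a few lines of plain Python. The name `ml_covariance` and the data values are hypothetical. Note that the ML estimate divides by $N$; many library routines default to dividing by $N-1$ instead, so results may differ slightly for small data sets.

```python
def ml_covariance(data):
    """ML covariance estimate: (1/N) * sum_n (x_n - mu_hat)(x_n - mu_hat)^T."""
    N = len(data)
    dim = len(data[0])
    mu_hat = [sum(x[i] for x in data) / N for i in range(dim)]
    return [
        [sum((x[i] - mu_hat[i]) * (x[j] - mu_hat[j]) for x in data) / N
         for j in range(dim)]
        for i in range(dim)
    ]

# Hypothetical data set with N=3 observations in 2 dimensions.
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
Sigma_hat = ml_covariance(data)
```

For these data the deviations from the mean are $(-1,0)$, $(1,-2)$ and $(0,2)$, so the estimate has entries $2/3$, $-2/3$ and $8/3$.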


### Sufficient Statistics

Note that the ML estimates can also be written as
$$
\hat \Sigma = \frac{1}{N}\sum_n x_n x_n^T - \frac{1}{N^2}\left( \sum_n x_n\right)\left( \sum_n x_n\right)^T, \quad \hat \mu = \frac{1}{N} \sum_n x_n
$$

In other words, the statistics $\sum_n x_n$ and $\sum_n x_n x_n^T$ (together with $N$) are sufficient to estimate the parameters $\mu$ and $\Sigma$ from observations. In the literature, they are called **sufficient statistics**.
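One practical consequence is that the estimates can be computed in a single pass over the data, keeping only running sums. The sketch below assumes the relations $\hat\mu = \frac{1}{N}\sum_n x_n$ and $\hat\Sigma = \frac{1}{N}\sum_n x_n x_n^T - \hat\mu\hat\mu^T$. The function names and data values are hypothetical.

```python
def accumulate(data):
    """Return the statistics (N, sum_n x_n, sum_n x_n x_n^T) in one pass."""
    dim = len(data[0])
    N = 0
    s = [0.0] * dim
    S = [[0.0] * dim for _ in range(dim)]
    for x in data:
        N += 1
        for i in range(dim):
            s[i] += x[i]
            for j in range(dim):
                S[i][j] += x[i] * x[j]
    return N, s, S

def estimates_from_stats(N, s, S):
    """Recover the ML estimates of mean and covariance from the statistics alone."""
    dim = len(s)
    mu_hat = [si / N for si in s]
    Sigma_hat = [[S[i][j] / N - mu_hat[i] * mu_hat[j] for j in range(dim)]
                 for i in range(dim)]
    return mu_hat, Sigma_hat

# Hypothetical data set with N=3 observations in 2 dimensions.
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
N, s, S = accumulate(data)
mu_hat, Sigma_hat = estimates_from_stats(N, s, S)
```

After `accumulate` has run, the raw observations are no longer needed. The subtraction in `estimates_from_stats` can lose precision when the mean is large relative to the spread, so this form is a sketch rather than a numerically robust routine.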

### Discrete Data: the 1-of-K Coding Scheme

Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ (corresponding to tail and head, resp.) and $0\leq \mu \leq 1$ the probability of heads. This model can be written as a **Bernoulli distribution**:

$$ 
p(x|\mu) = \mu^{x}(1-\mu)^{1-x}
$$

- Note that in expression $\mu^{x}(1-\mu)^{1-x}$, the variable $x$ acts as a (binary) **selector** for the tail or head probabilities.

- **1-of-K scheme**. Now consider a $K$-sided coin, a.k.a. a _die_ (pl.: dice). It will be very convenient to code the outcomes by a vector $x=(x_1,\ldots,x_K)^T$ with **binary selection variables**

$$
x_k = \begin{cases} 1 & \text{if die landed on $k$th face}\\
0 & \text{otherwise} \end{cases}
$$

- E.g., For $K=6$, if the die lands on the 3rd face, we encode that as $x=(0,0,1,0,0,0)^T$.

- Assume the probabilities $p(x_k=1) = \mu_k$ with  $\sum_k \mu_k  = 1$. The data generating distribution is then (note the similarity to the Bernoulli distribution)

$$
p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_k \mu_k^{x_k}
$$

- This distribution is sometimes (but not consistently) called the 'multi-noulli' distribution; it is also known as the categorical distribution.
- Note that $\sum_k x_k = 1$ and verify for yourself that $\mathrm{E}[x|\mu] = \mu$.

- In these notes, we use the superscript to indicate that we are working with a **binary selection variable** in a 1-of-$K$ scheme.
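The coding scheme and the selector role of $x_k$ can be sketched as follows. The function names `one_hot` and `prob` and the probability vector are hypothetical choices for this illustration.

```python
def one_hot(k, K):
    """Encode face k (1-based) of a K-sided die as a 1-of-K vector."""
    if not 1 <= k <= K:
        raise ValueError("face index k must satisfy 1 <= k <= K")
    return [1 if j == k else 0 for j in range(1, K + 1)]

def prob(x, mu):
    """Evaluate p(x|mu) = prod_k mu_k^{x_k} for a 1-of-K vector x."""
    p = 1.0
    for xk, muk in zip(x, mu):
        p *= muk ** xk  # factors with x_k = 0 contribute 1
    return p

# Hypothetical face probabilities for a six-sided die (they sum to 1).
mu = [0.1, 0.1, 0.4, 0.2, 0.1, 0.1]
x = one_hot(3, 6)   # die landed on the 3rd face
p_x = prob(x, mu)   # selects mu_3
```

Because exactly one $x_k$ equals 1, the product simply picks out the probability of the observed face.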



### Relation to The Multinomial Distribution

Observe a data set $D=\{x_1,\ldots,x_N\}$  of $N$ IID rolls of a $K$-sided die, with generating distribution
$$
p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k}
$$
where $m_k= \sum_n x_{nk}$ is the number of times that we 'threw' $k$ eyes.

- This distribution depends on the observations **only** through the quantities $\{m_k\}$, with generally $K \ll N$. 

- A related distribution is the distribution over $D_m=(m_1,\ldots,m_K)^T$, which is called the **multinomial distribution**,
$$
p(D_m|\mu) =\frac{N!}{m_1! m_2!\ldots m_K!} \,\prod_k \mu_k^{m_k}\,.
$$

- Note that $p(D|\mu)$ and $p(D_m|\mu)$ differ only in the normalization factor. Relate this to the fact that $D$ has $N$ components and $D_m$ has $K$ components.
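The difference between the two distributions can be made concrete with a short sketch: $p(D|\mu)$ is the probability of one particular ordered sequence of rolls, while $p(D_m|\mu)$ sums over all orderings with the same counts. The function names and the example values are hypothetical.

```python
import math

def face_counts(rolls):
    """Counts m_k = sum_n x_nk from a list of 1-of-K coded rolls."""
    K = len(rolls[0])
    return [sum(x[k] for x in rolls) for k in range(K)]

def sequence_likelihood(m, mu):
    """p(D|mu) = prod_k mu_k^{m_k}: probability of one specific ordered sequence."""
    p = 1.0
    for mk, muk in zip(m, mu):
        p *= muk ** mk
    return p

def multinomial_pmf(m, mu):
    """p(D_m|mu): the sequence likelihood times the number of orderings."""
    coef = math.factorial(sum(m))
    for mk in m:
        coef //= math.factorial(mk)  # each intermediate quotient is an integer
    return coef * sequence_likelihood(m, mu)

# Hypothetical example: three rolls of a 3-sided die.
rolls = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
mu = [0.5, 0.3, 0.2]
m = face_counts(rolls)
p_D = sequence_likelihood(m, mu)
p_m = multinomial_pmf(m, mu)
```

In this example there are $3!/(2!\,1!\,0!) = 3$ orderings with counts $(2,1,0)$, so `p_m` is three times `p_D`.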



### Maximum Likelihood Estimation for the Multinomial

Now let's find the ML estimate for $\mu$, based on $N$ throws of a $K$-sided die. 

- The log-likelihood with Lagrange multiplier to include the constraint is

\begin{align}
\mathrm{L}^\prime &\triangleq \log \prod_k \mu_k^{m_k}  + \lambda \cdot(1 - \sum_k \mu_k ) \\
	&= \sum_k m_k \log \mu_k  + \lambda \cdot (1 - \sum_k \mu_k )\,.
\end{align}

Setting the derivative to zero yields the **sample proportion** for $\mu_k$:

$$
\nabla_{\mu_k}   \mathrm{L}^\prime = \frac{m_k }
{\hat\mu_k } - \lambda  = 0 \; \Rightarrow \; \boxed{\hat\mu_k  = \frac{m_k }
{\lambda } = \frac{m_k }{N}}
$$

where we get $\lambda = N$ from the constraint
$$
\sum_k \hat \mu_k = \sum_k \frac{m_k}{\lambda} = \frac{N}{\lambda} = 1\,.
$$

Compare this answer to Laplace's rule for predicting the next coin toss ($p(h|D)=(N_h+1)/(N+2)$)
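A minimal sketch of the sample-proportion estimate, together with a check that it gives a higher log-likelihood $\sum_k m_k \log \mu_k$ than a uniform guess. The function names and the counts are hypothetical, and the counts are chosen to be strictly positive so that every logarithm is defined.

```python
import math

def ml_proportions(m):
    """ML estimate mu_k = m_k / N from the face counts m."""
    N = sum(m)
    return [mk / N for mk in m]

def log_likelihood(m, mu):
    """Log-likelihood sum_k m_k * log(mu_k); assumes mu_k > 0 wherever m_k > 0."""
    return sum(mk * math.log(muk) for mk, muk in zip(m, mu) if mk > 0)

# Hypothetical counts from N=20 rolls of a six-sided die.
m = [3, 1, 4, 1, 5, 6]
mu_hat = ml_proportions(m)
uniform = [1 / 6] * 6

ll_ml = log_likelihood(m, mu_hat)
ll_uniform = log_likelihood(m, uniform)
```

The estimate automatically satisfies the constraint $\sum_k \hat\mu_k = 1$, since the counts sum to $N$.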

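- Interesting special case: the **Binomial** ($N$ coin tosses, i.e., $K=2$). With $\theta_h = \theta$ and $\theta_t = 1-\theta$,
$$
p(x_n|\theta)= \theta^{x_n^h}(1-\theta)^{1-x_n^h}=\theta_h^{x_n^h} \theta_t^{x_n^t}
$$
yields
$$
\hat \theta = \frac{N_h}{N_h +N_t}\,,
$$
where $N_h$ and $N_t$ are the numbers of observed heads and tails.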
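The coin-toss case makes the contrast with Laplace's rule, $p(h|D)=(N_h+1)/(N+2)$, easy to see. The sketch below uses hypothetical function names and a hypothetical data set of three heads and no tails.

```python
def ml_heads(n_heads, n_tails):
    """ML estimate of the heads probability: N_h / (N_h + N_t)."""
    return n_heads / (n_heads + n_tails)

def laplace_heads(n_heads, n_tails):
    """Laplace's rule for the next toss: (N_h + 1) / (N + 2), with N = N_h + N_t."""
    return (n_heads + 1) / (n_heads + n_tails + 2)

# Hypothetical small data set: three heads, no tails.
theta_ml = ml_heads(3, 0)
theta_laplace = laplace_heads(3, 0)
```

With three heads and no tails, the ML estimate is 1, so it assigns zero probability to ever seeing tails, whereas Laplace's rule gives $4/5$. For large $N$ the two predictions become nearly identical.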



### Recap ML for Density Estimation

Given $N$ IID observations $D=\{x_1,\dotsc,x_N\}$

- For MVG model, $p(x_n|\theta) = \mathcal{N}(x_n|\mu,\Sigma)$, we obtain ML estimates

\begin{align}
\hat \mu &= \frac{1}{N} \sum_n x_n \tag{sample mean} \\
\hat \Sigma &= \frac{1}{N} \sum_n (x_n-\hat\mu)(x_n - \hat \mu)^T \tag{sample covariance}
\end{align}

- For discrete outcomes modeled by a 1-of-K multi-noulli distribution $p(x_n|\mu) = \prod_k \mu_k^{x_{nk}}$ (with $\sum_k \mu_k  = 1$), we find

\begin{align}
\hat\mu_k  = \frac{1}{N} \sum_n x_{nk} \quad \left(= \frac{m_k}{N} \right) \tag{sample proportion}
\end{align}


Note the similarity for the means between discrete and continuous data. 

- We didn't use a covariance matrix for discrete data. Why?
