# Exponential Family (Optional)

## Exponential Family
Likelihood (pdf): for $\mathbf{x} = (x_1, \ldots, x_m) \in \mathcal{X}^m$ and $\boldsymbol { \theta } \in \Theta \subseteq \mathbb { R } ^ { d }$

$$ \begin{aligned} p ( \mathbf { x } | \boldsymbol { \theta } ) & = \frac { 1 } { Z ( \boldsymbol { \theta } ) } h ( \mathbf { x } ) \exp \left[ \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) \right] \\ 
& = h ( \mathbf { x } ) \exp \left[ \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) - A ( \boldsymbol { \theta } ) \right] \\
& = h ( \mathbf { x } ) \exp \left[ \eta ( \boldsymbol { \theta } ) ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) - A ( \eta ( \boldsymbol { \theta } ) ) \right]
\end{aligned}$$

where: $$\begin{aligned} Z ( \boldsymbol { \theta } ) & = \int _ { \mathcal { X } ^ { m } } h ( \mathbf { x } ) \exp \left[ \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) \right] d \mathbf { x } \\ A ( \boldsymbol { \theta } ) & = \log Z ( \boldsymbol { \theta } ) \end{aligned}$$

 + $\theta$: natural parameters (also called canonical parameters)
 + $\phi(x) \in \mathbb{R}^d$: vector of sufficient statistics
 + $Z(\theta)$: partition function
 + $A(\theta)$: log partition function or cumulant function
 + $h(x)$: scaling term (base measure) that does not depend on $\theta$, often 1
 + $\eta$: function that maps the parameters $\theta$ to the canonical parameters $\eta = \eta(\theta)$.
 + If $\operatorname { dim } ( \boldsymbol { \theta } ) < \operatorname { dim } ( \eta ( \boldsymbol { \theta } ) )$: curved exponential family, which means we have more sufficient statistics than parameters
 + If $\eta(\theta) = \theta$: canonical form.
 + If $\phi(x) = x$: natural exponential family
 
Properties:
 + Has finite-sized sufficient statistics: we can compress the data into a fixed-sized summary without loss of information
 + Has conjugate priors
 + Maximum entropy: it is the family of distributions that makes the fewest assumptions subject to some user-chosen constraints
 + Core of generalized linear models
 + Core of variational inference
 
Examples: 
 + Continuous: Normal, Gamma (Chi-square, Exponential), Beta
 + Discrete: Bernoulli, Binomial, Categorical, Multinomial, Poisson, Geometric
 

### Examples:
+ Bernoulli: 
$$\operatorname { Ber } ( x | \mu ) = \mu ^ { x } ( 1 - \mu ) ^ { 1 - x } = ( 1 - \mu ) \exp \left[ x \log \left( \frac { \mu } { 1 - \mu } \right) \right]$$

We have: $\phi ( x ) = x , \theta = \log \left( \frac { \mu } { 1 - \mu } \right), Z = 1 / ( 1 - \mu )$. We can recover the mean parameter $\mu$ from the natural parameter $\theta$: $\mu = \operatorname { sigm } ( \theta ) = \frac { 1 } { 1 + e ^ { - \theta } }$
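
The Bernoulli identities above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and it uses $A(\theta) = \log Z = -\log(1 - \mu) = \log(1 + e^{\theta})$, which follows from $Z = 1/(1-\mu)$ and $\mu = \operatorname{sigm}(\theta)$.

```python
import math

def bernoulli_pmf(x, mu):
    """Standard form: mu^x * (1 - mu)^(1 - x), for x in {0, 1}."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def logit(mu):
    """Natural parameter theta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def sigmoid(theta):
    """Mean parameter mu = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_expfam(x, theta):
    """Exponential family form: exp(theta * x - A(theta)) with phi(x) = x."""
    log_partition = math.log1p(math.exp(theta))  # A(theta) = log(1 + e^theta)
    return math.exp(theta * x - log_partition)

mu = 0.3  # hypothetical mean parameter chosen for illustration
theta = logit(mu)
for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), bernoulli_expfam(x, theta))
print("recovered mu:", sigmoid(theta))
```

Both columns should agree for each $x$, and `sigmoid(logit(mu))` should return the original $\mu$ up to floating-point error.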

+ Multinoulli:
$$\operatorname { Cat } ( x | \boldsymbol { \mu } ) = \prod _ { k = 1 } ^ { K } \mu _ { k } ^ { x _ { k } } = \exp \left[ \sum _ { k = 1 } ^ { K } x _ { k } \log \mu _ { k } \right] = \exp \left[ \sum _ { k = 1 } ^ { K - 1 } x _ { k } \log \left( \frac { \mu _ { k } } { \mu _ { K } } \right) + \log \mu _ { K } \right]$$
where: $\mu _ { K } = 1 - \sum _ { k = 1 } ^ { K - 1 } \mu _ { k }$

$$\begin{aligned} \operatorname { Cat } ( x | \boldsymbol { \theta } ) & = \exp \left( \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) - A ( \boldsymbol { \theta } ) \right) \\ \boldsymbol { \theta } & = \left[ \log \frac { \mu _ { 1 } } { \mu _ { K } } , \ldots , \log \frac { \mu _ { K - 1 } } { \mu _ { K } } \right] \\ \boldsymbol { \phi } ( x ) & = [ \mathbb { I } ( x = 1 ) , \ldots , \mathbb { I } ( x = K - 1 ) ] \\
A ( \boldsymbol { \theta } ) &= \log \left( 1 + \sum _ { k = 1 } ^ { K - 1 } e ^ { \theta _ { k } } \right) \\ \mu _ { k } & = \frac { e ^ { \theta _ { k } } } { 1 + \sum _ { j = 1 } ^ { K - 1 } e ^ { \theta _ { j } } }\end{aligned}$$
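
As a rough illustration of the mapping between $\boldsymbol{\mu}$ and $\boldsymbol{\theta}$ for the categorical case, the sketch below converts a probability vector to its $K-1$ natural parameters and back. The helper names and the example vector are assumptions made for illustration only.

```python
import math

def to_natural(mu):
    """theta_k = log(mu_k / mu_K) for k = 1..K-1 (last class is the reference)."""
    return [math.log(m / mu[-1]) for m in mu[:-1]]

def log_partition(theta):
    """A(theta) = log(1 + sum_k exp(theta_k))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def to_mean(theta):
    """mu_k = exp(theta_k) / (1 + sum_j exp(theta_j)); mu_K takes the remainder."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    head = [math.exp(t) / denom for t in theta]
    return head + [1.0 / denom]

mu = [0.2, 0.5, 0.3]  # hypothetical probabilities, K = 3
theta = to_natural(mu)
recovered = to_mean(theta)
print(theta)
print(recovered)
```

Note that $A(\boldsymbol{\theta}) = -\log \mu_K$, which matches the $\log \mu_K$ term pulled out of the sum in the first equation.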

+ Univariate Gaussian:
$$\begin{aligned} \mathcal { N } ( x | \mu , \sigma ^ { 2 } ) & = \frac { 1 } { \left( 2 \pi \sigma ^ { 2 } \right) ^ { \frac { 1 } { 2 } } } \exp \left[ - \frac { 1 } { 2 \sigma ^ { 2 } } ( x - \mu ) ^ { 2 } \right] \\ & = \frac { 1 } { \left( 2 \pi \sigma ^ { 2 } \right) ^ { \frac { 1 } { 2 } } } \exp \left[ - \frac { 1 } { 2 \sigma ^ { 2 } } x ^ { 2 } + \frac { \mu } { \sigma ^ { 2 } } x - \frac { 1 } { 2 \sigma ^ { 2 } } \mu ^ { 2 } \right] \\ & = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \exp \left( \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( x ) \right) \end{aligned}$$

where:

$$\begin{aligned} \boldsymbol { \theta } & = \left( \begin{array} { c } { \mu / \sigma ^ { 2 } } \\ { - \frac { 1 } { 2 \sigma ^ { 2 } } } \end{array} \right) \\ \phi ( x ) & = \left( \begin{array} { c } { x } \\ { x ^ { 2 } } \end{array} \right) \\ Z \left( \mu , \sigma ^ { 2 } \right) & = \sqrt { 2 \pi } \sigma \exp \left[ \frac { \mu ^ { 2 } } { 2 \sigma ^ { 2 } } \right] \\ A ( \boldsymbol { \theta } ) & = \frac { - \theta _ { 1 } ^ { 2 } } { 4 \theta _ { 2 } } - \frac { 1 } { 2 } \log \left( - 2 \theta _ { 2 } \right) + \frac { 1 } { 2 } \log ( 2 \pi ) \end{aligned}$$
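
A quick numerical check of the Gaussian case is sketched below. It assumes the sign convention $\theta_2 = -1/(2\sigma^2)$ (so that $\log(-2\theta_2)$ is defined) and takes $A(\boldsymbol{\theta}) = \log Z(\mu, \sigma^2)$ with the $+\frac{1}{2}\log(2\pi)$ term coming from the $\sqrt{2\pi}$ factor in $Z$. Function names and test values are illustrative.

```python
import math

def to_natural(mu, sigma2):
    """theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def log_partition(theta1, theta2):
    """A(theta) = -theta1^2 / (4 theta2) - 0.5 log(-2 theta2) + 0.5 log(2 pi)."""
    return (-theta1 ** 2 / (4.0 * theta2)
            - 0.5 * math.log(-2.0 * theta2)
            + 0.5 * math.log(2.0 * math.pi))

def log_Z(mu, sigma2):
    """log of Z(mu, sigma^2) = sqrt(2 pi) * sigma * exp(mu^2 / (2 sigma^2))."""
    return 0.5 * math.log(2.0 * math.pi * sigma2) + mu ** 2 / (2.0 * sigma2)

def gaussian_pdf(x, mu, sigma2):
    """Standard form of the univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_expfam(x, theta1, theta2):
    """Exponential family form with phi(x) = (x, x^2) and h(x) = 1."""
    return math.exp(theta1 * x + theta2 * x * x - log_partition(theta1, theta2))

mu, sigma2 = 1.5, 0.8  # hypothetical parameters
t1, t2 = to_natural(mu, sigma2)
print(log_partition(t1, t2), log_Z(mu, sigma2))
print(gaussian_pdf(0.4, mu, sigma2), gaussian_expfam(0.4, t1, t2))
```

The two values on each printed line should agree up to floating-point error.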

### Log partition function:
$$\frac { d A } { d \theta } = \frac { d } { d \theta } \left( \log \int \exp ( \theta \phi ( x ) ) h ( x ) d x \right) = \mathbb { E } [ \phi ( x ) ] $$

$$\frac { d ^ { 2 } A } { d \theta ^ { 2 } } = \int \phi ( x ) \exp ( \theta \phi ( x ) - A ( \theta ) ) h ( x ) \left( \phi ( x ) - A ^ { \prime } ( \theta ) \right) d x = \mathbb { E } \left[ \phi ^ { 2 } ( X ) \right] - \mathbb { E } [ \phi ( x ) ] ^ { 2 } = \operatorname { var } [ \phi ( x ) ] $$
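
To make the two derivative identities concrete, the sketch below differentiates the Bernoulli log partition function $A(\theta) = \log(1 + e^{\theta})$ with central finite differences and compares the results to the Bernoulli mean $\mu$ and variance $\mu(1-\mu)$. The step sizes and the value of $\theta$ are arbitrary choices for illustration.

```python
import math

def A(theta):
    """Bernoulli log partition function: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def first_derivative(f, t, eps=1e-5):
    """Central finite-difference estimate of f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)

def second_derivative(f, t, eps=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + eps) - 2.0 * f(t) + f(t - eps)) / eps ** 2

theta = 0.7                           # hypothetical natural parameter
mu = 1.0 / (1.0 + math.exp(-theta))   # E[phi(X)] = E[X] for a Bernoulli
variance = mu * (1.0 - mu)            # var[phi(X)] = var[X]

dA = first_derivative(A, theta)
d2A = second_derivative(A, theta)
print("dA/dtheta  :", dA, "vs mean    :", mu)
print("d2A/dtheta2:", d2A, "vs variance:", variance)
```

Because the second derivative is a variance, it is non-negative, so $A$ is convex in $\theta$.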

### MLE:
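pdf (likelihood): $$p ( \mathcal { D } | \boldsymbol { \theta } ) = \left[ \prod _ { i = 1 } ^ { N } h \left( \mathbf { x } _ { i } \right) \right] g ( \boldsymbol { \theta } ) ^ { N } \exp \left( \boldsymbol { \eta } ( \boldsymbol { \theta } ) ^ { T } \left[ \sum _ { i = 1 } ^ { N } \boldsymbol { \phi } \left( \mathbf { x } _ { i } \right) \right] \right)$$

where $g ( \boldsymbol { \theta } ) = 1 / Z ( \boldsymbol { \theta } )$ and $\mathcal { D } = \{ \mathbf { x } _ { 1 } , \ldots , \mathbf { x } _ { N } \}$ is a set of $N$ iid samples.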

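log-likelihood (canonical form, $\eta(\theta) = \theta$, with $\boldsymbol { \phi } ( \mathcal { D } ) = \sum _ { i = 1 } ^ { N } \boldsymbol { \phi } ( \mathbf { x } _ { i } )$ and the $\theta$-independent term $\sum _ { i } \log h ( \mathbf { x } _ { i } )$ absorbed into the constant): $$\log p ( \mathcal { D } | \boldsymbol { \theta } ) = \boldsymbol { \theta } ^ { T } \boldsymbol { \phi } ( \mathcal { D } ) - N A ( \boldsymbol { \theta } ) + \text{const}$$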

derivative of log-likelihood w.r.t $\theta$: $$\nabla _ { \boldsymbol { \theta } } \log p ( \mathcal { D } | \boldsymbol { \theta } ) = \phi ( \mathcal { D } ) - N \mathbb { E } [ \boldsymbol { \phi } ( \mathbf { X } ) ]$$

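where we used: $$\nabla _ { \boldsymbol { \theta } } A ( \boldsymbol { \theta } ) = \nabla _ { \boldsymbol { \theta } } \log Z ( \boldsymbol { \theta } ) = \mathbb { E } _ { \boldsymbol { \theta } } [ \boldsymbol { \phi } ( \mathbf { X } ) ]$$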

Setting the gradient of the log-likelihood to zero, we see that at the MLE, the empirical average of the sufficient statistics must equal the model’s theoretical expected sufficient statistics, i.e., $\hat { \boldsymbol { \theta } }$ must satisfy:

$$\mathbb { E } [ \phi ( \mathbf { X } ) ] = \frac { 1 } { N } \sum _ { i = 1 } ^ { N } \phi \left( \mathbf { x } _ { i } \right)$$
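
The moment-matching condition can be illustrated with a univariate Gaussian, where $\phi(x) = (x, x^2)$ and $\mathbb{E}[\phi(X)] = (\mu, \mu^2 + \sigma^2)$. The sketch below is only an illustration on a small made-up dataset: it solves the two moment equations for $\mu$ and $\sigma^2$, then confirms that the log-likelihood gradient $\phi(\mathcal{D}) - N\,\mathbb{E}[\phi(X)]$ vanishes at that solution.

```python
# Hypothetical data, chosen only to illustrate the computation.
data = [1.2, 0.7, 2.5, 1.9, 1.1]
N = len(data)

# Sufficient statistics summed over the dataset: phi(D) = (sum x, sum x^2).
s1 = sum(data)
s2 = sum(x * x for x in data)

# Moment matching: E[X] = s1 / N and E[X^2] = s2 / N.
mu_hat = s1 / N
sigma2_hat = s2 / N - mu_hat ** 2   # the (biased) MLE of the variance

# Expected sufficient statistics under the fitted model.
expected_phi = (mu_hat, mu_hat ** 2 + sigma2_hat)

# Gradient of the log-likelihood w.r.t. the natural parameters.
grad = (s1 - N * expected_phi[0], s2 - N * expected_phi[1])

print("mu_hat:", mu_hat, "sigma2_hat:", sigma2_hat)
print("gradient at the MLE:", grad)
```

Only `s1`, `s2`, and `N` are needed to fit the model, which is the fixed-size summary property of the exponential family in action.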