#  EM as a Message Passing Algorithm


### Preliminaries

- Goals
  -
- Materials
  - Mandatory
    - These lecture notes
  - Optional
    - [Dauwels et al. (2009)], pp. 1-5 (sections I, II and III)



### A Problem for the Multiplier Node

Consider the multiplier factor $f(x,y,\theta) = \delta(y-\theta x)$ with incoming Gaussian messages $\overrightarrow{\mu_X}(x) = \mathcal{N}(x|m_x,v_x)$ and $\overleftarrow{\mu_Y}(y) = \mathcal{N}(y|m_y,v_y)$. For simplicity's sake, we assume all variables are scalar. 

In a system identification setting, we are interested in computing the outgoing message $\overleftarrow{\mu_\Theta}(\theta)$. 

Let's compute the sum-product message:

$$\begin{align*}
\overleftarrow{\mu_\Theta}(\theta) &= \int \overrightarrow{\mu_X}(x) \, \overleftarrow{\mu_Y}(y) \, f(x,y,\theta) \, \mathrm{d}x \mathrm{d}y \\
  &= \int \mathcal{N}(x|m_x,v_x) \, \mathcal{N}(y|m_y,v_y) \, \delta(y-\theta x) \, \mathrm{d}x \mathrm{d}y \\
  &= \int \mathcal{N}(x|m_x,v_x) \,\mathcal{N}(\theta x|m_y,v_y) \, \mathrm{d}x \\
  &= \frac{1}{|\theta|} \int \mathcal{N}(x|m_x,v_x) \,\mathcal{N}\left(x \bigg| \frac{m_y}{\theta},\frac{v_y}{\theta^2}\right) \, \mathrm{d}x \\
  &= \frac{1}{|\theta|} \, \mathcal{N}\left(\frac{m_y}{\theta} \bigg| m_x, v_x + \frac{v_y}{\theta^2}\right) \cdot \int \mathcal{N}(x|m_*,v_*)\, \mathrm{d}x \tag{SRG-6} \\
  &= \frac{1}{|\theta|} \, \mathcal{N}\left(\frac{m_y}{\theta} \bigg| m_x, v_x + \frac{v_y}{\theta^2}\right) \\
  &= \mathcal{N}\left(m_y \,\big|\, \theta m_x, v_y + \theta^2 v_x\right)
\end{align*}$$

The factor $1/|\theta|$ (for $\theta \neq 0$) comes from rewriting $\mathcal{N}(\theta x|m_y,v_y)$ as a normalized Gaussian in $x$; the last line holds for all $\theta$.

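This is **not** a Gaussian message for $\Theta$: its dependence on $\theta$ is not an exponentiated quadratic. Passing this message into the graph breaks closed-form Gaussian message passing, since the sum-product messages for other factors in the graph can then no longer be computed with the standard Gaussian update rules.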

The same problem occurs in a forward message passing schedule when we try to compute a message for $Y$ from incoming Gaussian messages for both $X$ and $\Theta$. 
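
The non-Gaussian shape of this message can be checked numerically. The following is a minimal sketch with illustrative parameter values and hypothetical helper names (`gauss`, `sp_message_theta`, `sp_message_theta_closed_form`). It evaluates $\int \mathcal{N}(x|m_x,v_x)\,\mathcal{N}(\theta x|m_y,v_y)\,\mathrm{d}x$ on a grid and compares the result with the equivalent closed form $\mathcal{N}(m_y|\theta m_x, v_y + \theta^2 v_x)$.

```python
import numpy as np

def gauss(x, m, v):
    """Gaussian density N(x | m, v) with mean m and variance v."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def sp_message_theta(theta, m_x, v_x, m_y, v_y, n=20001, width=12.0):
    """Riemann-sum approximation of the backward SP message evaluated at theta."""
    half = width * np.sqrt(v_x)
    x = np.linspace(m_x - half, m_x + half, n)
    dx = x[1] - x[0]
    integrand = gauss(x, m_x, v_x) * gauss(theta * x, m_y, v_y)
    return float(np.sum(integrand) * dx)

def sp_message_theta_closed_form(theta, m_x, v_x, m_y, v_y):
    """Closed form N(m_y | theta * m_x, v_y + theta^2 * v_x)."""
    return float(gauss(m_y, theta * m_x, v_y + theta ** 2 * v_x))

# Illustrative (assumed) message parameters
m_x, v_x, m_y, v_y = 1.0, 1.0, 2.0, 0.5
for theta in (-2.0, 0.5, 3.0):
    numeric = sp_message_theta(theta, m_x, v_x, m_y, v_y)
    exact = sp_message_theta_closed_form(theta, m_x, v_x, m_y, v_y)
    print(theta, numeric, exact)
```

For large $|\theta|$ the closed form decays only like $1/|\theta|$, so as a function of $\theta$ it cannot even be normalized, let alone be Gaussian.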



### Limitations of Sum-Product Messages

The foregoing example shows that the sum-product message update rule will sometimes not do the job. For example:

- On large-dimensional **discrete** domains, the SP update rule may be computationally intractable.

- On **continuous** domains, the SP update rule may not have a closed-form solution or the rule may lead to a function that is incompatible with Gaussian message passing (the latter is the case for multiplication with an unknown coefficient). 

There are various ways to cope with 'intractable' SP update rules. In this lesson, we discuss how the EM-algorithm can be written as a message passing algorithm on factor graphs. Then, we will solve the 'multiplier node problem' with EM messages (rather than SP messages).

### EM as Message Passing

 
Consider first a general setting with a function $f(x,\theta)$. Assume that we are interested in 

$$\begin{align*}
\hat{\theta} &= \arg\max_\theta \int f(x,\theta) \mathrm{d}x\,.
\end{align*}$$

This is a common setting where $x$ holds the hidden state variables and $\theta$ holds the (yet to be determined) tuning parameters. 

If $\int f(x,\theta) \mathrm{d}x$ is intractable, we can try to apply the EM-algorithm to estimate $\hat{\theta}$, which leads to the following iterations:

$$
\hat{\theta}^{(k+1)} = \underbrace{\arg\max_\theta}_{\text{M-step}} \left( \underbrace{\int_x f(x,\hat{\theta}^{(k)})\,\log f(x,\theta)\,\mathrm{d}x}_{\text{E-step}} \right)
$$

It turns out that _for factorized functions_ $f(x,\theta)$, the EM-algorithm can be executed as a message passing algorithm on the factor graph. 

As a simple example, we consider the factorization <font color="red">[INSERT GRAPH]</font>

$$
f(x,\theta) = f_a(\theta)f_b(x,\theta)
$$

Applying the EM-algorithm to this graph leads to the following forward and backward messages over the $\theta$ edge <font color="red">[insert graph showing messages]</font>: 

$$\begin{align*}
\eta(\theta) &= \int p_b(x|\hat{\theta}^{(k)}) \log f_b(x,\theta) \,\mathrm{d}x \tag{E-step} \\
\hat{\theta}^{(k+1)}  &= \arg\max_\theta \left( f_a(\theta)\, e^{\eta(\theta)}\right) \tag{M-step}
\end{align*}$$


Proof:
$$\begin{align*}
\hat{\theta}^{(k+1)} &= \arg\max_\theta \, \int_x f(x,\hat{\theta}^{(k)}) \,\log f(x,\theta)\,\mathrm{d}x   \\
    &= \arg\max_\theta \, \int_x f_a(\hat{\theta}^{(k)})f_b(x,\hat{\theta}^{(k)}) \,\log \left( f_a(\theta)f_b(x,\theta) \right) \,\mathrm{d}x \\
    &= \arg\max_\theta \, \int_x f_b(x,\hat{\theta}^{(k)}) \cdot \left( \log f_a(\theta) + \log f_b(x,\theta) \right)  \,\mathrm{d}x \\
    &= \arg\max_\theta \left( \log f_a(\theta) +  \frac{\int f_b(x,\hat{\theta}^{(k)}) \log f_b(x,\theta) \,\mathrm{d}x }{\int f_b(x^\prime,\hat{\theta}^{(k)}) \,\mathrm{d}x^\prime} \right) \\
      &= \arg\max_\theta \left( \log f_a(\theta) +  \underbrace{\int p_b(x|\hat{\theta}^{(k)}) \log f_b(x,\theta) \,\mathrm{d}x}_{\eta(\theta)}  \right) \tag{log-domain} \\
       &= \underbrace{\arg\max_\theta}_{\text{M-step}} \left( f_a(\theta)\,\underbrace{e^{\eta(\theta)}}_{\text{E-step}}  \right) \tag{prob. domain}      
\end{align*}$$

where $p_b(x|\hat{\theta}^{(k)}) \triangleq \frac{f_b(x,\hat{\theta}^{(k)})}{\int f_b(x^\prime,\hat{\theta}^{(k)}) \,\mathrm{d}x^\prime}$. Note that the denominator $\int f_b(x^\prime,\hat{\theta}^{(k)}) \,\mathrm{d}x^\prime$ in $p_b$ does not depend on $\theta$; it only scales $\eta(\theta)$ by a positive constant. If $f_a(\theta)$ is constant (i.e., there is no prior information on $\theta$), this scaling does not change the maximizer and can be ignored, leading to a simpler message rule  $$\eta(\theta) = \int f_b(x,\hat{\theta}^{(k)}) \log f_b(x,\theta) \,\mathrm{d}x \,.$$

The quantity $\eta(\theta)$ (a.k.a. the E-log message) may be interpreted as a log-domain summary of $f_b$. The message $e^{\eta(\theta)}$ is the corresponding 'probability domain' message that is consistent with the semantics of messages as summaries of factors. In a software implementation, you can use either domain, as long as a consistent method is chosen.
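
To make the E-step/M-step message exchange concrete, here is a minimal sketch on a toy model that is not part of these notes: we assume $f_a(\theta) = \mathcal{N}(\theta|0,v_0)$ and $f_b(x,\theta) = \mathcal{N}(x|\theta,1)\,\mathcal{N}(y|x,1)$ with an observed scalar $y$. For this choice $p_b(x|\hat{\theta}) = \mathcal{N}\left(x \,|\, (\hat{\theta}+y)/2, 1/2\right)$ and $\eta(\theta) = \text{const.} + \mathrm{E}[X]\,\theta - \frac{1}{2}\theta^2$, so both messages are Gaussian in canonical form. The function name `em_by_message_passing` and all numerical values are hypothetical.

```python
def em_by_message_passing(y, v0, theta_init=0.0, n_iter=50):
    """EM on the assumed toy model
    f_a(theta) = N(theta | 0, v0), f_b(x, theta) = N(x | theta, 1) * N(y | x, 1)."""
    theta_hat = theta_init
    for _ in range(n_iter):
        # E-step: p_b(x | theta_hat) = N(x | (theta_hat + y) / 2, 1/2)
        e_x = 0.5 * (theta_hat + y)
        # E-log message: eta(theta) = const + e_x * theta - 0.5 * theta**2,
        # i.e. canonical Gaussian parameters (xi, w) = (e_x, 1).
        xi_eta, w_eta = e_x, 1.0
        # M-step: maximize log f_a(theta) + eta(theta);
        # f_a contributes canonical parameters (0, 1 / v0).
        theta_hat = xi_eta / (w_eta + 1.0 / v0)
    return theta_hat

theta_em = em_by_message_passing(y=3.0, v0=4.0)
# Direct maximizer of int f dx = N(theta | 0, v0) * N(y | theta, 2)
theta_exact = 3.0 * 4.0 / (4.0 + 2.0)
print(theta_em, theta_exact)
```

In this toy case the marginal $\int f(x,\theta)\,\mathrm{d}x$ is tractable, which is what allows the comparison; the EM iterations converge to the same maximizer.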



### EM vs SP Message Passing

Recall that in a 'regular' (non-graphical) setting, the EM-algorithm is particularly useful when the _expectation_ (E-step)
$$
f(\theta,\hat{\theta}^{(k)}) = \int_x f(x,\hat{\theta}^{(k)})\,\log f(x,\theta)\,\mathrm{d}x
$$
leads to easier expressions than the _marginalization_ (which is what we really want)
$$
f(\theta) = \int f(x,\theta) \,\mathrm{d}x .
$$

In similar fashion, EM messages are particularly useful when the _expectation_ (executed by the E-log update rule)
$$
\eta(\theta) = \int p_b(x|\hat{\theta}^{(k)}) \log f_b(x,\theta) \,\mathrm{d}x
$$
leads to easier expressions than the _marginalization_ message (by sum-product rule, which is also what we really want)
$$
\mu(\theta) = \int f_b(x,\theta) \mathrm{d}x .
$$

Just as for the sum-product (SP) and max-product (MP) messages, we can work out the outgoing E-log message on the $Y$ edge for a _general_ node $g(x_1,\ldots,x_M,y)$ with given message inputs $\overrightarrow{\mu_{X_m}}(x_m)$ <font color="red">[graph of g w messages]</font>

$$\begin{align*}
\overrightarrow{\mu_Y}(y) &= \int \overrightarrow{\mu_{X_1}}(x_1) \cdots \overrightarrow{\mu_{X_M}}(x_M)\, g(x_1,\ldots,x_M,y) \, \mathrm{d}x_1 \ldots \mathrm{d}x_M \tag{SP} \\
\overrightarrow{\mu_Y}(y) &= \max_{x_1,\ldots,x_M} \overrightarrow{\mu_{X_1}}(x_1) \cdots \overrightarrow{\mu_{X_M}}(x_M)\, g(x_1,\ldots,x_M,y) \tag{MP} \\
\overrightarrow{\eta}(y) &= \int p(x_1,\ldots,x_M | \hat{y}^{(k)})\,\log g(x_1,\ldots,x_M,y) \, \mathrm{d}x_1 \ldots \mathrm{d}x_M \tag{E-log}
\end{align*}$$

where $p(x_1,\ldots,x_M | \hat{y}^{(k)}) \triangleq \frac{\overrightarrow{\mu_{X_1}}(x_1) \cdots \overrightarrow{\mu_{X_M}}(x_M)\, g(x_1,\ldots,x_M,\hat{y}^{(k)})}{\int \overrightarrow{\mu_{X_1}}(x_1) \cdots \overrightarrow{\mu_{X_M}}(x_M)\, g(x_1,\ldots,x_M,\hat{y}^{(k)}) \, \mathrm{d}x_1 \ldots \mathrm{d}x_M}$.

- <font color="green">Exercise: prove the generic E-log message update rule.</font>
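
When no closed form is available, the E-log rule can at least be evaluated numerically. The sketch below does this for a node with a single input ($M=1$) on a uniform grid; the function name `elog_message_on_grid` and the example node $g(x,y) \propto \exp\left(-\frac{1}{2}(y-x)^2\right)$ with incoming message $\overrightarrow{\mu_X}(x) \propto \mathcal{N}(x|0,1)$ are assumptions chosen for illustration only.

```python
import numpy as np

def elog_message_on_grid(log_g, mu_x, x_grid, y_values, y_hat):
    """Approximate eta(y) = E_{p(x | y_hat)}[log g(x, y)] on a uniform x grid."""
    # Unnormalized p(x | y_hat) = mu_x(x) * g(x, y_hat), then normalize on the grid.
    weights = mu_x(x_grid) * np.exp(log_g(x_grid, y_hat))
    p = weights / np.sum(weights)
    # Expectation of log g(x, y) for each requested value of y.
    return np.array([np.sum(p * log_g(x_grid, y)) for y in y_values])

# Assumed example: unnormalized Gaussian input message and Gaussian node
x_grid = np.linspace(-10.0, 10.0, 4001)
mu_x = lambda x: np.exp(-0.5 * x ** 2)
log_g = lambda x, y: -0.5 * (y - x) ** 2
eta = elog_message_on_grid(log_g, mu_x, x_grid, [0.0, 0.5], y_hat=1.0)
print(eta)
```

For this example $p(x|\hat{y}) = \mathcal{N}(x|\hat{y}/2, 1/2)$, so $\eta(y) = -\frac{1}{2}\left((y-\hat{y}/2)^2 + \frac{1}{2}\right)$ can be used to check the numerical values.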


### A Snag for EM Message Passing on Deterministic Nodes

- The factors for deterministic nodes are (Dirac) delta functions, e.g., $\delta(y-\theta x)$ for the multiplier.

- Note that the outgoing E-log message for a deterministic node will also be a delta function, since the expectation of $\log \delta(.)$ is again a delta function. For details, consult [Dauwels et al. (2009)](.), p. 5, section F.  

- This would stall the iterative estimation process at the current estimate, since the outgoing E-log message would express complete certainty about the estimate.  

- This issue can be resolved by closing a box around a subgraph that includes $g$ and _at least one non-deterministic factor_. EM message passing may proceed with the newly created node.

- Example

### A Solution for the Multiplier Node with Unknown Coefficient

Consider again the (scalar) multiplier with unknown coefficient $f(x,y,\theta) = \delta(y-\theta x)$ and incoming messages $\overrightarrow{\mu_X}(x) = \mathcal{N}(x|m_x,v_x)$ and $\overleftarrow{\mu_Y}(y) = \mathcal{N}(y|m_y,v_y)$. We will now compute the outgoing E-log message for $\Theta$.

Since $f(x,y,\theta)$ is deterministic, we will first group $f$ with the (non-deterministic) node $\overleftarrow{\mu_Y}(y) = \mathcal{N}(y|m_y,v_y)$, leading (through the sum-product rule) to <font color="red">[graph for g(x,theta) from f(x,y)] </font>

$$\begin{align*}
g(x,\theta) &\triangleq \int \overleftarrow{\mu_Y}(y)\, f(x,y,\theta) \,\mathrm{d}y \\
  &= \int \mathcal{N}(y|m_y,v_y)\, \delta(y-\theta x) \,\mathrm{d}y \\
  &= \mathcal{N}(\theta x|m_y,v_y)\,.
\end{align*}$$

The problem now is to pass an E-log message out of $g(x,\theta)$. Assume that $g$ has received an estimate $\hat{\theta}$ from the incoming message over the $\Theta$ edge. The E-log update rule then prescribes

$$\begin{align*}
\eta(\theta) &= \mathrm{E}[ \log g(x,\theta) ] \\
  &= \mathrm{E}[ \log \mathcal{N}(\theta x|m_y,v_y) ] \\
  &= \text{const.} - \frac{1}{2v_y}\, \left( \mathrm{E}[X^2] \theta^2 - 2 m_y \mathrm{E}[X] \theta + m_y^2\right) \\
\Rightarrow \quad e^{\eta(\theta)} &\propto \mathcal{N}_{\xi} \left( \theta \,\bigg|\, \frac{m_y \mathrm{E}[X]}{v_y}, \frac{\mathrm{E}[X^2]}{v_y} \right) \tag{E-log msg.}
\end{align*}$$
where we used the 'canonical' parametrization of the Gaussian $\mathcal{N}_{\xi}(\theta|\xi,w) \propto \exp \left( \xi \theta- \frac{1}{2} w \theta^2\right)$. 

In the E-log message update rule, the expectations $\mathrm{E}[X]$ and $\mathrm{E}[X^2]$ have to be taken w.r.t. $p(x|\hat{\theta}) \propto \overrightarrow{\mu_X}(x)\,g(x,\hat{\theta})$ (consult the generic E-log update rule). A straightforward (but rather painful) derivation leads to 

$$\begin{align*}
p(x|\hat{\theta}) &\propto \overrightarrow{\mu_X}(x)\,g(x,\hat{\theta}) \\
  &= \mathcal{N}(x|m_x,v_x)\cdot \mathcal{N}(\hat{\theta} x|m_y,v_y) \\
  &\propto \mathcal{N}(x|m_x,v_x)\cdot \mathcal{N}\left(x \bigg| \frac{m_y}{\hat{\theta} },\frac{v_y}{\hat{\theta}^2} \right) \\
  &\propto \mathcal{N}_\xi( x | \xi_g  , w_g)
\end{align*}$$
where $w_g = \frac{1}{v_x} + \frac{\hat{\theta}^2}{v_y}$ and $\xi_g \triangleq w_g m_g = \frac{m_x}{v_x}+\frac{\hat{\theta}m_y}{v_y}$. It follows that  
$$\begin{align*}
\mathrm{E}[X] &= m_g \\
\mathrm{E}[X^2] &= m_g^2 + w_g^{-1}
\end{align*}$$

The E-log update formula is not fun to derive, but the result is very pleasing: the **E-log message for the multiplier with unknown coefficient is a Gaussian message** with closed-form expressions for its parameters!
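
The closed-form update translates into a few lines of code. This is a minimal sketch: the function names `elog_message_multiplier` and `em_update_theta` are hypothetical, the message parameters are illustrative, and the M-step assumes a flat (constant) incoming message on the $\Theta$ edge, so that the new estimate is simply the mode $\xi/w$ of the E-log message.

```python
import math

def elog_message_multiplier(theta_hat, m_x, v_x, m_y, v_y):
    """Canonical parameters (xi, w) of the Gaussian E-log message for theta."""
    # Posterior over x given the current estimate, in canonical form.
    w_g = 1.0 / v_x + theta_hat ** 2 / v_y
    xi_g = m_x / v_x + theta_hat * m_y / v_y
    m_g = xi_g / w_g
    # Required moments of x.
    e_x = m_g
    e_x2 = m_g ** 2 + 1.0 / w_g
    # E-log message parameters.
    xi = m_y * e_x / v_y
    w = e_x2 / v_y
    return xi, w

def em_update_theta(theta_hat, m_x, v_x, m_y, v_y):
    """One EM iteration, assuming a flat prior message on the theta edge."""
    xi, w = elog_message_multiplier(theta_hat, m_x, v_x, m_y, v_y)
    return xi / w

# Illustrative (assumed) message parameters and initial estimate
theta = 0.1
for _ in range(20):
    theta = em_update_theta(theta, m_x=1.0, v_x=0.5, m_y=2.0, v_y=0.3)
print(theta)
```

Each call only needs the means and variances of the two incoming Gaussian messages and the current estimate $\hat{\theta}$, which is what makes the rule suitable for tabulation.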






### Automating Inference

$\rightarrow$ It follows that, for an LDS with unknown coefficients, both state estimation and parameter learning can be achieved through Gaussian message passing based on SP and EM message update rules.

- These (SP, MP and EM) message update rules can be tabulated and implemented in software for a large set of factors that are common in probabilistic models.

- Tabulated EM messages for frequently occurring factors facilitate the automated derivation of nontrivial EM algorithms.

- This makes it possible to automate inference in factor graphs. Here (in the SPS group at TU/e), we have built such a factor graph toolbox in Julia. 

<font color="red">[MARCO: let ForneyLab spit out some update formulas]</font>

- There is a lot more to say about factor graphs. This is a very exciting area of research that promises both (1) to consolidate a wide range of signal processing and machine learning algorithms in one elegant framework and (2) to automate inference and learning in these models.



### Example: Linear Dynamical Systems

As before, let us consider the linear dynamical system (LDS)

$$\begin{align*}
  z_n &= A z_{n-1} + w_n \\
  x_n &= C z_n + v_n \\
  w_n &\sim \mathcal{N}(0,\Sigma_w) \\
  v_n &\sim \mathcal{N}(0,\Sigma_v)
\end{align*}$$

Again, we will consider the case where $x_n$ is observed and $z_n$ is a hidden state. $C$, $\Sigma_w$ and $\Sigma_v$ are given parameters but in contrast to the previous section, we will assume that the value of parameter $A$ is unknown. 
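
To experiment with this setting, one can first generate observations from the model with a known $A$ and then treat $A$ as unknown. The following is a minimal simulation sketch; the function name `simulate_lds`, the initial state `z0`, the dimensions, and all numerical values are assumptions for illustration.

```python
import numpy as np

def simulate_lds(A, C, Sigma_w, Sigma_v, z0, n_steps, seed=0):
    """Draw hidden states z_n and observations x_n from the LDS defined above."""
    rng = np.random.default_rng(seed)
    d_z, d_x = A.shape[0], C.shape[0]
    z = np.zeros((n_steps, d_z))
    x = np.zeros((n_steps, d_x))
    z_prev = z0
    for n in range(n_steps):
        # State transition: z_n = A z_{n-1} + w_n
        z[n] = A @ z_prev + rng.multivariate_normal(np.zeros(d_z), Sigma_w)
        # Observation: x_n = C z_n + v_n
        x[n] = C @ z[n] + rng.multivariate_normal(np.zeros(d_x), Sigma_v)
        z_prev = z[n]
    return z, x

# Assumed example with a 2-dimensional state and a scalar observation
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
z, x = simulate_lds(A, C, 0.01 * np.eye(2), 0.1 * np.eye(1),
                    z0=np.zeros(2), n_steps=100)
print(z.shape, x.shape)
```

Only `x` would be handed to the inference algorithm; `z` and `A` are kept aside to evaluate the state and parameter estimates.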

Quoting Dauwels et al. (2009):

- EM may be used to estimate unknown parameters in a factor graph model.
- EM may be used to break cycles in a factor graph.
- The EM messages are tractable expressions in some cases where the sum-product and max-product message computation rules yield intractable expressions.

