# M&M model for fine-mapping

This is the 5th version of M&M. The formulation of this version was inspired by conditional regression commonly used in fine-mapping, as discussed in [T Flutre et al 2013](https://doi.org/10.1371/journal.pgen.1003486).

$\newcommand{\bs}[1]{\boldsymbol{#1}}$
$\DeclareMathOperator*{\diag}{diag}$
$\DeclareMathOperator*{\cov}{cov}$
$\DeclareMathOperator*{\rank}{rank}$
$\DeclareMathOperator*{\var}{var}$
$\DeclareMathOperator*{\tr}{tr}$
$\DeclareMathOperator*{\veco}{vec}$
$\DeclareMathOperator*{\uniform}{\mathcal{U}niform}$
$\DeclareMathOperator*{\argmin}{arg\ min}$
$\DeclareMathOperator*{\argmax}{arg\ max}$
$\DeclareMathOperator*{\N}{N}$
$\DeclareMathOperator*{\gm}{Gamma}$
$\DeclareMathOperator*{\dif}{d}$

## M&M ASH model

We assume the following multivariate, multiple regression model with $N$ samples, $J$ effects and $R$ conditions (and *without covariates, which will be discussed separately*)
\begin{align}
\bs{Y}_{N\times R} = \bs{X}_{N \times J}\bs{B}_{J \times R} + \bs{E}_{N \times R},
\end{align}
where
\begin{align}
\bs{E} &\sim \N_{N \times R}(\bs{0}, \bs{I}_N, \bs{\Sigma}),\\
\bs{\Sigma} &= \diag(\lambda_1^{-1},\ldots,\lambda_R^{-1}).
\end{align}

Letting $\bs{\Lambda} = \bs{\Sigma}^{-1}$, we place a Gamma prior on the diagonal elements of $\bs{\Lambda}$,

$$\lambda_r \overset{iid}{\sim} \gm(\alpha, \beta),$$

and as a first pass we set $\alpha = \beta = 0$, so that inference on $\bs{\Sigma}$ is equivalent to maximum likelihood estimation.
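
To make the maximum likelihood connection concrete, the conjugate update for $\lambda_r$ can be written out. This is a sketch that conditions on $\bs{B}$ and assumes the shape-rate parameterization of the Gamma distribution (the text does not state which parameterization is intended); $e_{ir}$ and $\text{RSS}_r$ are notation introduced only for this illustration.

\begin{align}
e_{ir} &:= Y_{ir} - [\bs{X}\bs{B}]_{ir}, \quad \text{RSS}_r := \sum_{i=1}^N e_{ir}^2, \\
p(\lambda_r | \bs{Y}, \bs{B}) &\propto \lambda_r^{\alpha - 1} e^{-\beta\lambda_r} \cdot \lambda_r^{N/2} e^{-\lambda_r \text{RSS}_r / 2}, \\
\lambda_r | \bs{Y}, \bs{B} &\sim \gm\left(\alpha + \frac{N}{2}, \beta + \frac{\text{RSS}_r}{2}\right), \\
E[\lambda_r | \bs{Y}, \bs{B}] &= \frac{\alpha + N/2}{\beta + \text{RSS}_r/2} = \frac{N}{\text{RSS}_r} \quad \text{when } \alpha = \beta = 0.
\end{align}

Under these assumptions the posterior mean of $\lambda_r$ is the reciprocal of the maximum likelihood estimate of the residual variance, $\text{RSS}_r / N$.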

Let $\omega_j := p(\zeta_j = 1)$ be the prior probability that effect $j$ is non-zero,

\begin{align}
\zeta_j \sim \text{Bernoulli}(\omega_j)
\end{align}

We assume that the non-zero effects $\bs{b}_j$ (rows of $\bs{B}$) are iid, with a mixture of multivariate normals as their prior distribution,

\begin{align}
p(\bs{b}_j|\zeta_j = 1) = \sum_{p = 0}^P\pi_{jp}\N_R(\bs{b}_j | \bs{0}, \bs{V}_p),
\end{align}

where the $\bs{V}_p$'s are known $R \times R$ positive semi-definite covariance matrices ([Urbut et al 2017](https://www.biorxiv.org/content/early/2017/05/09/096552)), and the corresponding weights $\pi_{jp}$ are to be estimated. We can augment the prior of $\bs{b}_j$ with an indicator vector $\bs{\gamma}_j \in \{0, 1\}^{P+1}$ denoting membership of $\bs{b}_j$ in one of the $P + 1$ mixture components ($p = 0, \ldots, P$), and write

\begin{align}
p(\bs{b}_j|\zeta_j = 1, \bs{\gamma}_j) &= \prod_{p = 0}^P\left[\N(\bs{b}_j|\bs{0},\bs{V}_p)\right]^{\gamma_{jp}},
\end{align}

where

\begin{align}
p(\bs{\gamma}_j) &= \prod_{p = 0}^{P} \pi_{jp}^{\gamma_{jp}}
\end{align}
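
The augmented prior can be read as a two-step sampling scheme: draw the component indicator $\bs{\gamma}_j$ from the weights, then draw $\bs{b}_j$ from the selected normal. The sketch below illustrates this with the standard library only. The function name `sample_effect` is hypothetical, and restricting each $\bs{V}_p$ to a diagonal matrix is a simplifying assumption of this example, not of the model.

```python
import random

def sample_effect(weights, variances, rng):
    """Draw (gamma, b) from the augmented mixture prior.

    weights:   mixture weights pi_0..pi_P (non-negative, summing to 1)
    variances: variances[p][r] is the r-th diagonal entry of V_p
               (diagonal V_p is an assumption made for this sketch)
    """
    # Step 1: pick the mixture component, i.e. the position of the 1 in gamma.
    p = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    gamma = [1 if k == p else 0 for k in range(len(weights))]
    # Step 2: draw b from N(0, V_p); a zero variance yields a point mass at 0.
    b = [rng.gauss(0.0, v ** 0.5) for v in variances[p]]
    return gamma, b

rng = random.Random(1)
weights = [0.7, 0.2, 0.1]                       # illustrative values
variances = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.25]]
gamma, b = sample_effect(weights, variances, rng)
```

A component with all-zero variances, as in the first entry of `variances` here, plays the role of a point mass at zero.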

The densities involved are

\begin{align}
p(\bs{Y}, \bs{B},\bs{\zeta}, \bs{\Gamma} | \bs{\Sigma}, \bs{\omega}, \bs{\Pi}) & = 
p(\bs{Y}|\bs{B}, \bs{\zeta}, \bs{\Gamma}, \bs{\omega}, \bs{\Pi}, \bs{\Sigma}) p(\bs{B}|\bs{\zeta}, \bs{\Gamma})p(\bs{\zeta}, \bs{\Gamma}|\bs{\omega}, \bs{\Pi}) \\
&= p(\bs{Y}|\bs{B}, \bs{\Sigma}) p(\bs{B}|\bs{\zeta}, \bs{\Gamma})
p(\bs{\Gamma}|\bs{\zeta}, \bs{\Pi}) p(\bs{\zeta}|\bs{\omega})
\end{align}

## A variational approach to M&M

Inspired by the conditional regression approach for fine mapping, we reparameterize the model as

\begin{align}
\bs{Y}_{N\times R} = \bs{X}_{N \times J}\sum_{l=1}^L \diag(\bs{\zeta}_l) \bs{B}_l + \bs{E}_{N \times R},
\end{align}

where each $\bs{B}_l$ is a $J \times R$ matrix of effects with rows $\bs{b}_{lj}$, and $L$ is the number of non-zero effects, or an upper bound on that number if the prior distribution of $\bs{b}_{lj}$ includes a point mass at zero. For each $l = 1,\ldots, L$ we assume that *exactly 1 of the $J$ effects is non-zero*, as indicated by $\zeta_{lj}$,

\begin{align}
\bs{\zeta}_l \sim \text{Multinomial}(1, \bs{\omega}_l)
\end{align}
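
As an illustration of this parameterization, the sketch below computes the mean of $\bs{Y}$ when each $\bs{\zeta}_l$ is a one-hot vector, so that each of the $L$ terms contributes a single column of $\bs{X}$ times a single row of effects. The function name `single_effect_mean` and the use of one $J \times R$ effect matrix per $l$ are assumptions made for this example.

```python
def single_effect_mean(X, zetas, Bs):
    """Return X * sum_l diag(zeta_l) * B_l as an N x R nested list.

    X:     N x J design matrix (nested lists)
    zetas: L one-hot lists of length J
    Bs:    L effect matrices, each J x R (assumed layout for this sketch)
    """
    n = len(X)
    R = len(Bs[0][0])
    mean = [[0.0] * R for _ in range(n)]
    for zeta, B in zip(zetas, Bs):
        assert sum(zeta) == 1, "each zeta_l selects exactly one effect"
        j = zeta.index(1)  # the single non-zero effect of term l
        for i in range(n):
            for r in range(R):
                mean[i][r] += X[i][j] * B[j][r]
    return mean

X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
zetas = [[1, 0, 0], [0, 0, 1]]          # term 1 picks effect 1, term 2 picks effect 3
B1 = [[0.5, -1.0], [9.0, 9.0], [9.0, 9.0]]
B2 = [[9.0, 9.0], [9.0, 9.0], [2.0, 0.0]]
mean = single_effect_mean(X, zetas, [B1, B2])
```

Rows of each effect matrix that are not selected by the corresponding one-hot vector have no influence on the result.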

The key idea behind this parameterization is to devise a fully-factorized variational approximation based on 

\begin{align}
q(\bs{B}, \bs{\zeta}, \bs{\Gamma}) &= \prod_l q(\bs{B}_l, \bs{\zeta}_l, \bs{\Gamma}_l) \\
&=  \prod_l q(\bs{B}_l|\bs{\zeta}_l, \bs{\Gamma}_l)q(\bs{\Gamma}_l|\bs{\zeta}_l)q(\bs{\zeta}_l) \\
&=  \prod_l \prod_j q(\bs{b}_{lj}|\bs{\zeta}_l, \bs{\gamma}_{lj})q(\bs{\gamma}_{lj}|\bs{\zeta}_l)q(\bs{\zeta}_l)
\end{align}

Crucially, we do not factorize $q(\bs{\zeta}_l)$ across its $J$ elements, that is, $\bs{\zeta}_l$ is a binary vector with exactly one non-zero element. We will reveal the connection with conditional regression later, as we develop the variational algorithm.

## Evidence lower bound (ELBO)
Following [the VEM framework](https://gaow.github.io/mvarbvs/writeup/20171203_VEM.html), where $Z:=(\bs{B}, \bs{\zeta}, \bs{\Gamma})$ and $\Theta:= (\bs{\Sigma}, \bs{\Pi}, \bs{\omega})$, 

\begin{align}
\log p(\bs{Y}|\bs{\Sigma}, \bs{\Pi}, \bs{\omega}) & \ge  \mathcal{L}(q, \bs{\Sigma}, \bs{\Pi}, \bs{\omega}) \\
&= E_q[\log p(\bs{Y}|\bs{B}, \bs{\zeta}, \bs{\Gamma}, \bs{\omega}, \bs{\Pi}, \bs{\Sigma})] + 
E_q[\log\frac{p(\bs{B}, \bs{\zeta}, \bs{\Gamma}| \bs{\Sigma}, \bs{\Pi}, \bs{\omega})}{q(\bs{B}, \bs{\zeta}, \bs{\Gamma})}] \\
&= E_q[\log p(\bs{Y}|\bs{B}, \bs{\Sigma})] + 
E_q[\log\frac{p(\bs{B}|\bs{\zeta}, \bs{\Gamma})
p(\bs{\Gamma}|\bs{\zeta}, \bs{\Pi}) p(\bs{\zeta}|\bs{\omega})}{\prod_l \prod_j q(\bs{b}_{lj}|\bs{\zeta}_l, \bs{\gamma}_{lj})q(\bs{\gamma}_{lj}|\bs{\zeta}_l)q(\bs{\zeta}_l)}]
\end{align}

where $q(\cdot)$ factorizes as discussed above, so that we perform mean-field variational inference. In the discussion hereafter we keep the notation of the original model for the variational parameters unless this causes a conflict. Posterior estimates are denoted with a tilde ($\tilde{}$).

## Derivation of variational M&M

### One effect model

To develop the variational algorithm for fine-mapping with M&M, we first discuss the case with only one non-zero effect, then show that the results generalize to the case with multiple non-zero effects and naturally yield fine-mapping solutions.

In the event where only one row $\bs{B}_{1\cdot}$ is non-zero, that is, $\omega_1 = 1$, $\bs{\omega}_{-1} = \bs{0}$, M&M becomes

\begin{align}
\bs{Y}_{N \times R} = \bs{x}_1\bs{b}_1 + \bs{E}_{N \times R},
\end{align}

a multivariate, single regressor Bayesian regression with prior

\begin{align}
p(\bs{b}_1) = \sum_{p = 0}^P\pi_{1p}\N_R(\bs{b}_1 | \bs{0}, \bs{V}_p).
\end{align}

Let "$\propto$" denote equality up to an additive constant independent of $q$. We write the ELBO of this model as

\begin{align}
\mathcal{L}_1(q, \bs{\Sigma}, \bs{\pi}_1; \bs{Y}) &= 
E_q [\log p(\bs{Y} | \bs{b}_1, \bs{\Sigma})] + 
E_q[\log \frac{p(\bs{b}_1 | \bs{\gamma}_1)p(\bs{\gamma}_1 | \bs{\pi}_1)}{q(\bs{b}_1 | \bs{\gamma}_1)q(\bs{\gamma}_1 | \bs{\pi}_1)}] \\
&= -\frac{NR}{2}\log(2\pi) - 
\frac{N}{2}E_q[\log\det (\bs{\Sigma})] - 
\frac{1}{2}E_q \{ \tr[(\bs{Y} - \bs{x}_1 \bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1 \bs{b}_1)^\intercal] \} +
E_q[\log \frac{p(\bs{b}_1 | \bs{\gamma}_1)p(\bs{\gamma}_1 | \bs{\pi}_1)}{q(\bs{b}_1 | \bs{\gamma}_1)q(\bs{\gamma}_1 | \bs{\pi}_1)}]
\end{align}

We focus on $E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1)^\intercal] \}$,

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1)^\intercal] \} &= 
\tr\{\bs{Y} E_q[\bs{\Sigma}^{-1}] \bs{Y}^\intercal\} - 
2\tr\{\bs{Y}E_q[\bs{\Sigma}^{-1}]E_q[\bs{b}_1]^{\intercal}\bs{x}_1^\intercal\} +
E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] \\
& \propto E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] - 2\tr(\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{b}_1]^\intercal\bs{x}_1^\intercal)
\end{align}

Therefore, keeping the factor $-\frac{1}{2}$ that multiplies the trace term in the ELBO,

\begin{align}
\mathcal{L}_1(q, \bs{\Sigma}, \bs{\pi}_1; \bs{Y}) & \propto
-\frac{1}{2}\big\{E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] - 2\tr(\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{b}_1]^\intercal\bs{x}_1^\intercal)\big\} +
E_q[\log \frac{p(\bs{b}_1 | \bs{\gamma}_1)p(\bs{\gamma}_1 | \bs{\pi}_1)}{q(\bs{b}_1 | \bs{\gamma}_1)q(\bs{\gamma}_1 | \bs{\pi}_1)}]
\end{align}

In this case, we maximize $\mathcal{L}_1(q, \bs{\Sigma}, \bs{\pi}_1; \bs{Y})$ by simply setting variational distribution $q(\cdot)$ to the posterior, 

\begin{align}
p(\bs{b}_1, \bs{\gamma}_1 | \bs{Y}, \bs{\Sigma}, \bs{\pi}_1) = \argmax_q \mathcal{L}_1(q, \bs{\Sigma}, \bs{\pi}_1; \bs{Y})
\end{align}

since this posterior can be computed relatively easily using established results (FIXME: detail the computation here for mixture model parameters, posterior and BF, along the lines of Urbut et al 2017).
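
As a rough illustration of what this posterior computation involves, the sketch below handles the one-effect model under two simplifying assumptions that are not part of the text: every $\bs{V}_p$ is diagonal (so that, given the component, the $R$ conditions decouple because $\bs{\Sigma}$ is also diagonal), and all prior weights are strictly positive. The function name `one_effect_posterior` and its arguments are hypothetical. It is not the computation of Urbut et al 2017, which works with general $\bs{V}_p$.

```python
import math

def one_effect_posterior(x, Y, lam, weights, variances):
    """Posterior of b_1 in Y = x b_1 + E under a mixture-of-normals prior.

    x: length-N regressor; Y: N x R responses; lam: R residual precisions;
    weights: prior weights pi_p (assumed > 0); variances[p][r]: diagonal of V_p.
    Returns (psi, means, covs): posterior component weights, and the
    per-component posterior means and (diagonal) posterior variances.
    """
    R = len(lam)
    d = sum(xi * xi for xi in x)                                  # x'x
    bhat = [sum(xi * row[r] for xi, row in zip(x, Y)) / d for r in range(R)]
    s2 = [1.0 / (lam[r] * d) for r in range(R)]                   # variance of bhat
    log_w, means, covs = [], [], []
    for pi_p, v in zip(weights, variances):
        tot = [v[r] + s2[r] for r in range(R)]
        # log marginal likelihood of bhat under component p: N(bhat; 0, V_p + S)
        ll = sum(-0.5 * math.log(2 * math.pi * tot[r]) - 0.5 * bhat[r] ** 2 / tot[r]
                 for r in range(R))
        log_w.append(math.log(pi_p) + ll)
        means.append([v[r] / tot[r] * bhat[r] for r in range(R)])  # shrunken bhat
        covs.append([v[r] * s2[r] / tot[r] for r in range(R)])
    m = max(log_w)                                 # log-sum-exp for stability
    w = [math.exp(lw - m) for lw in log_w]
    psi = [wi / sum(w) for wi in w]
    return psi, means, covs

x = [1.0, -1.0, 1.0, -1.0]
Y = [[2.0, 0.0], [-2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
psi, means, covs = one_effect_posterior(
    x, Y, lam=[1.0, 1.0], weights=[0.5, 0.5],
    variances=[[0.0, 0.0], [1.0, 1.0]])
```

The overall posterior mean of $\bs{b}_1$ is then the weighted average of the per-component means, with weights `psi`.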

### Two effects model

In the event where two rows of $\bs{B}$ are non-zero, that is, $\omega_1 = \omega_2 = 1$, $\bs{\omega}_{-1, -2} = \bs{0}$, M&M becomes

\begin{align}
\bs{Y}_{N \times R} = \bs{x}_1\bs{b}_1 + \bs{x}_2\bs{b}_2 + \bs{E}_{N \times R},
\end{align}

a multivariate, two-regressor Bayesian regression with independent priors $p(\bs{b}_1, \bs{b}_2) = p(\bs{b}_1)p(\bs{b}_2)$, where 

\begin{align}
p(\bs{b}_\cdot) = \sum_{p = 0}^P\pi_p\N_R(\bs{b}_\cdot | \bs{0}, \bs{V}_p).
\end{align}

We write the ELBO of this model as

\begin{align}
\mathcal{L}_2(q, \bs{\Sigma}, \bs{\pi}_1, \bs{\pi}_2; \bs{Y}) &= 
E_q [\log p(\bs{Y} | \bs{b}_1, \bs{b}_2, \bs{\Sigma})] + 
E_q[\log \frac{p(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2 | \bs{\pi}_1, \bs{\pi}_2)}{q(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2| \bs{\pi}_1, \bs{\pi}_2)}] \\
&= -\frac{NR}{2}\log(2\pi) - 
\frac{N}{2}E_q[\log\det (\bs{\Sigma})] - 
\frac{1}{2}E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \} +
E_q[\log \frac{p(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2 | \bs{\pi}_1, \bs{\pi}_2)}{q(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2 | \bs{\pi}_1, \bs{\pi}_2)}]
\end{align}

where we choose $q(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2|\bs{\pi}_1, \bs{\pi}_2) = q_1(\bs{b}_1, \bs{\gamma}_1|\bs{\pi}_1)q_2(\bs{b}_2, \bs{\gamma}_2|\bs{\pi}_2)$ to be the variational approximation to the posterior $p(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2|\bs{Y}, \bs{\pi}_1, \bs{\pi}_2, \bs{\Sigma})$, i.e., a "fully factorized" variational approximation. This allows us to maximize the ELBO iteratively, one factor at a time.

#### Maximize over $q_2$ with $q_1$ fixed

Similar to the "one effect model" we focus on

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \},
\end{align}

and, analogous to the setup of "conditional regression", we treat $q_1$ as fixed and maximize over $q_2$ only,

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \} & \propto
E_{q_2}\{\tr(\bs{x}_2\bs{b}_2\bs{\Sigma}^{-1}\bs{b}_2^\intercal\bs{x}_2^\intercal)\} -
2\tr\{(\bs{Y} - \bs{x}_1E_{q_1}[\bs{b}_1])\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal \bs{x}_2^\intercal\}
\end{align}

Letting $\bs{\xi}_1 = \bs{Y} - \bs{x}_1E_{q_1}[\bs{b}_1]$ and keeping the factor $-\frac{1}{2}$ that multiplies the trace term in the ELBO, we have
\begin{align}
\mathcal{L}_2(q_2, \bs{\Sigma}, \bs{\pi}_2; \bs{Y}) & \propto -\frac{1}{2}\big[E_{q_2}\{\tr(\bs{x}_2\bs{b}_2\bs{\Sigma}^{-1}\bs{b}_2^\intercal\bs{x}_2^\intercal)\} -
2\tr\{\bs{\xi}_1\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal \bs{x}_2^\intercal\}\big] +
E_{q_2}[\log \frac{p(\bs{b}_2 | \bs{\gamma}_2)p(\bs{\gamma}_2|\bs{\pi}_2)}{q_2(\bs{b}_2 | \bs{\gamma}_2)q_2(\bs{\gamma}_2 | \bs{\pi}_2)}],
\end{align}

and, as in the "one effect model", $p(\bs{b}_2, \bs{\gamma}_2|\bs{\xi}_1, \bs{\pi}_2, \bs{\Sigma}) = \argmax_{q_2} \mathcal{L}_2(q_2, \bs{\Sigma}, \bs{\pi}_2; \bs{Y})$. In other words, we maximize $\mathcal{L}_2$ over $q_2$ with $q_1$ fixed by applying the same posterior computation used for maximizing $\mathcal{L}_1$ over $q_1$, but with the residualized response $\bs{\xi}_1$ in place of the original response $\bs{Y}$:

\begin{align}
\mathcal{L}_2(q_2, \bs{\Sigma}, \bs{\pi}_2; \bs{Y}) \propto \mathcal{L}_1(q_2, \bs{\Sigma}, \bs{\pi}_2; \bs{\xi}_1)
\end{align}

#### Maximize over $q_1$ with $q_2$ fixed

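Similarly, we maximize $\mathcal{L}_2$ over $q_1$ with $q_2$ fixed by applying the one-effect posterior computation to the residualized response $\bs{\xi}_2 = \bs{Y} - \bs{x}_2E_{q_2}[\bs{b}_2]$. The algorithm alternates between the two updates until convergence.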

### Generalization to arbitrary number of effects

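The arguments above generalize to $L$ non-zero effects. For the $l$-th effect we optimize $\mathcal{L}_l$ over $q_l$ with all other factors $q_{-l}$ fixed, by applying Bayesian multivariate regression to the residualized response $\bs{\xi}_{-l} := \bs{Y} - \sum_{l' \ne l} \bs{x}_{l'}E_{q_{l'}}[\bs{b}_{l'}]$, that is, to the model $\bs{\xi}_{-l} = \bs{x}_l \bs{b}_l + \bs{E}$. The algorithm cycles through $l = 1, \ldots, L$ until convergence.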
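
The cycle of residualize-then-regress updates can be sketched as follows. To keep the example short it tracks only the posterior means and makes assumptions beyond the text: the regressor $\bs{x}_l$ of each effect is fixed, the prior on each $\bs{b}_l$ is a single normal component with diagonal covariance, and $\bs{\Sigma}$ is diagonal. The function name `fit_effects` and its arguments are hypothetical.

```python
def fit_effects(xs, Y, lam, v, max_iter=1000, tol=1e-10):
    """Coordinate ascent over q_1..q_L; returns posterior means m[l][r].

    xs: list of L regressors (each of length N); Y: N x R responses;
    lam: R residual precisions; v: prior variance per condition (> 0).
    """
    N, R, L = len(Y), len(lam), len(xs)
    m = [[0.0] * R for _ in range(L)]
    for _ in range(max_iter):
        delta = 0.0
        for l in range(L):
            d = sum(xi * xi for xi in xs[l])                       # x_l' x_l
            for r in range(R):
                # residualized response: remove the other effects' fitted means
                resid = [Y[i][r] - sum(xs[k][i] * m[k][r] for k in range(L) if k != l)
                         for i in range(N)]
                xty = sum(xs[l][i] * resid[i] for i in range(N))
                new = lam[r] * xty / (lam[r] * d + 1.0 / v[r])     # posterior mean
                delta = max(delta, abs(new - m[l][r]))
                m[l][r] = new
        if delta < tol:                                            # converged
            break
    return m

xs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Y = [[1.0], [2.0], [3.0]]
m = fit_effects(xs, Y, lam=[1.0], v=[1.0])
```

In this all-Gaussian special case the fixed point of the updates coincides with the exact posterior mean, which for the toy data above is $(7/8, 11/8)$. With mixture priors the same loop applies, with the mixture posterior computation replacing the single-component update.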

### Variational updates for quantities of interest



## Derivation of ELBO

As shown before, the ELBO for the full variational M&M model is

\begin{align}
\mathcal{L}(q, \bs{\omega}, \bs{\Pi}, \bs{\Sigma}; \bs{Y}) & = E_q[\log p(\bs{Y} | \bs{B}, \bs{\zeta}, \bs{\Gamma}, \bs{\omega}, \bs{\Pi}, \bs{\Sigma})] + 
E_q[\log\frac{p(Z|\Theta)}{q(Z)}] \\
& \propto -\frac{1}{2}\big[E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} - 2\tr\{\bs{Y}\bs{\Sigma}^{-1}\bar{\bs{B}}^\intercal\bs{X}^\intercal\}\big] +
E_q[\log\frac{p(Z|\Theta)}{q(Z)}]
\end{align}

where $\bar{\bs{B}}:= E_q[\bs{B}]$, $E_q[\log\frac{p(Z|\Theta)}{q(Z)}]:= E_q[\log\frac{p(\bs{B}|\bs{\zeta}, \bs{\Gamma})
p(\bs{\Gamma}|\bs{\zeta}, \bs{\Pi}) p(\bs{\zeta}|\bs{\omega})}{\prod_l \prod_j q(\bs{b}_{lj}|\bs{\zeta}_l, \bs{\gamma}_{lj})q(\bs{\gamma}_{lj}|\bs{\zeta}_l)q(\bs{\zeta}_l)}]$. As will be shown later, we estimate $\bar{\bs{B}}$ using posterior mean $\tilde{\bs{B}}$. We need to work out $E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\}$ and $E_q[\log\frac{p(Z|\Theta)}{q(Z)}]$ next.

### $E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\}$

\begin{align}
E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} &= 
E_q\{\tr(\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal\bs{X})\} \quad \text{Cyclic permutation of trace} \\
&= E_q\{\tr(\bs{\Sigma}^{-1}\bs{B}^\intercal \bs{S} \bs{B})\} \\
&= \tr\{\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\},
\end{align}

where $\bs{S}:=\bs{X}^\intercal\bs{X}$. Now we focus on element-wise computations for $\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)$. Recall that $\bs{B} \in \mathbb{R}^{J\times R}$, $\bs{S} \in \mathbb{R}^{J\times J}$, $\bs{\Sigma}^{-1} \in \mathbb{R}^{R\times R}$,

\begin{align}
[\bs{B}^\intercal\bs{S}]_{rj'} &= \sum_j^J B_{jr} S_{jj'}, \\
[\bs{B}^\intercal\bs{S}\bs{B}]_{rr'} &= \sum_j \sum_{j'} B_{jr} S_{jj'} B_{j'r'}, \\
E_q[\bs{B}^\intercal\bs{S}\bs{B}]_{rr'} &= \sum_j \sum_{j'} S_{jj'} E_q[B_{jr} B_{j'r'}] \\
&= \sum_j \sum_{j'} S_{jj'} E_q[B_{jr}] E_q[B_{j'r'}] + \sum_j \sum_{j'} S_{jj'} \cov(B_{jr}, B_{j'r'}) \\
&= \sum_j \sum_{j'} S_{jj'} \bar{B}_{jr} \bar{B}_{j'r'} + \sum_j \sum_{j'} S_{jj'} \rho_{jrr'}\mathbb{1}(j=j'),
\end{align}

where $\rho_{jrr'}:= \cov(B_{jr}, B_{jr'})$ is the covariance of the $j$-th effect between conditions $r$ and $r'$, which can be estimated by the posterior covariance of $\bs{b}_j$. For $j \ne j'$ the covariances are zero, due to the model assumption of independent effects.

Writing $\lambda_{rk}$ for the $(r, k)$ element of $\bs{\Sigma}^{-1}$, the $(r, r')$ element of $\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)$ is thus

\begin{align}
\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)_{rr'} &= 
\sum_{k=1}^R \lambda_{rk}\big(\sum_j \sum_{j'}S_{jj'}\bar{B}_{jk}\bar{B}_{j'r'} + \sum_j S_{jj}\rho_{jkr'}\big) \\
& = \sum_{k=1}^R\sum_j\sum_{j'}\lambda_{rk}S_{jj'}\bar{B}_{jk}\bar{B}_{j'r'} + 
\sum_{k=1}^R\sum_j \lambda_{rk} S_{jj} \rho_{jkr'},
\end{align}

and finally

\begin{align}
E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} &= 
\tr\{\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\} \\
&= \sum_{r=1}^R\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)_{rr} \\
&= \sum_r \sum_{k=1}^R \sum_j \sum_{j'} \lambda_{rk} S_{jj'} \bar{B}_{jk} \bar{B}_{j'r} + 
\sum_r \sum_{k=1}^R \sum_j \lambda_{rk} S_{jj} \rho_{jkr} \\
&= \sum_r \sum_{r'} \sum_j \sum_{j'} \lambda_{rr'} S_{jj'} \bar{B}_{jr'} \bar{B}_{j'r} + 
\sum_r \sum_{r'} \sum_j \lambda_{rr'} S_{jj} \rho_{jr'r} \\
&= \tr\{\bs{\Sigma}^{-1}\bar{\bs{B}}^\intercal \bs{S} \bar{\bs{B}}\} + 
\sum_r \sum_{r'} \sum_j \lambda_{rr'} S_{jj} \rho_{jrr'}
\end{align}

The last line uses the symmetry of the covariance, $\rho_{jr'r} = \rho_{jrr'}$.
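
For the diagonal $\bs{\Sigma}$ assumed in the model definition, this result simplifies further. The following is a worked special case rather than a new result: it only substitutes $\lambda_{rr'} = \lambda_r \mathbb{1}(r = r')$ into the last line above, and $\bar{\bs{b}}_{\cdot r}$ (the $r$-th column of $\bar{\bs{B}}$) is notation introduced for this illustration.

\begin{align}
E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} &= 
\sum_r \lambda_r \bar{\bs{b}}_{\cdot r}^\intercal \bs{S} \bar{\bs{b}}_{\cdot r} + 
\sum_r \sum_j \lambda_r S_{jj} \rho_{jrr}
\end{align}

In this case only the posterior variances $\rho_{jrr}$ of each effect within a condition enter the expression, not the cross-condition covariances.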

### $E_q[\log\frac{p(Z|\Theta)}{q(Z)}]$

By definition,

\begin{align}
E_q[\log\frac{p(Z|\Theta)}{q(Z)}] &= E_q[\log p(Z|\Theta)] - E_q[\log q(Z)] \\
&= E_q[\log p(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{\omega}, \bs{\Pi})] - 
E_q[\log \prod_l \prod_j q(\bs{b}_{lj}|\bs{\zeta}_l, \bs{\gamma}_{lj})q(\bs{\gamma}_{lj}|\bs{\zeta}_l)q(\bs{\zeta}_l)]
\end{align}

#### $E_q[\log p(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{\omega}, \bs{\Pi})]$

By Bishop 2006 Equation 9.36, 

\begin{align}
E_q[\log p(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{\omega}, \bs{\Pi})] &=
E_q\big[ \sum_j \zeta_j \big( \sum_p \gamma_{jp} \{\log(\pi_p) - \frac{R}{2}\log(2\pi) - \frac{1}{2}\log \det(\bs{V}_p) - \frac{1}{2} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j\} \big) | \bs{V}, \bs{\omega}, \bs{\pi}\big] \\
& = E_q\big[ \sum_j \sum_p \zeta_j \gamma_{jp} \{ \log(\pi_p) - \frac{R}{2} \log(2\pi) - \frac{1}{2} \log \det (\bs{V}_p) \} - \frac{1}{2} \sum_j \sum_p \zeta_j \gamma_{jp} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j \big] \\
& = \sum_j \sum_p \omega_j \psi_{jp} \{ \log(\pi_p) - \frac{R}{2} \log(2\pi) - \frac{1}{2} \log \det (\bs{V}_p) \} - \frac{1}{2} \sum_j \sum_p E_q[\zeta_j \gamma_{jp} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j],
\end{align}

where $\psi_{jp} := \frac{\pi_p N(\bs{b}_j; \bar{\bs{b}}_j, \bs{U}_{jp})}{\sum_{p'} \pi_{p'} N(\bs{b}_j; \bar{\bs{b}}_j, \bs{U}_{jp'})}$, and $\omega_j$ is estimated by the normalized Bayes factor of the $j$-th effect, $\hat{\omega}_j = \frac{\text{BF}_j}{\sum_{j'} \text{BF}_{j'}}$. Here $\hat{\pi}_p$ is the estimated weight of the $p$-th prior covariance component, $\bar{\bs{b}}_j = E_q[\bs{b}_j]$ is estimated by the posterior mean $\tilde{\bs{b}}_j$, and $\bs{U}_{jp}$ is estimated by the posterior covariance for component $p$.
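
The normalization $\hat{\omega}_j = \text{BF}_j / \sum_{j'} \text{BF}_{j'}$ can be sketched in a few lines. Working with log Bayes factors and subtracting the maximum before exponentiating is an implementation choice assumed here to avoid overflow; it is not something the text prescribes, and the function name `bf_weights` is hypothetical.

```python
import math

def bf_weights(log_bfs):
    """Normalize log Bayes factors into weights omega_j = BF_j / sum_j' BF_j'."""
    m = max(log_bfs)                        # subtract the max to avoid overflow
    w = [math.exp(lb - m) for lb in log_bfs]
    total = sum(w)
    return [wi / total for wi in w]

# Bayes factors of 1 and 3 give weights of about 0.25 and 0.75.
omega_hat = bf_weights([math.log(1.0), math.log(3.0)])
```

Because only ratios of Bayes factors matter, the shift by the maximum leaves the weights unchanged.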

Now we are left to work on $E_q[\zeta_j \gamma_{jp} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j]$ which is a scalar,

\begin{align}
E_q[\zeta_j \gamma_{jp} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j] &= 
E_q\big[E_q[\zeta_j \gamma_{jp} \bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j | \zeta_j,\gamma_{jp}] \big] \\
&= E_q\big[E_q[\zeta_j \gamma_{jp} \tr(\bs{b}_j^\intercal \bs{V}_p^{-1} \bs{b}_j) | \zeta_j,\gamma_{jp}] \big] \\
&= E_q\big[E_q[\zeta_j \gamma_{jp} \tr(\bs{V}_p^{-1} \bs{b}_j \bs{b}_j^\intercal) | \zeta_j,\gamma_{jp}] \big] \\
&= \tr \big \{ E_q \big [ \bs{V}_p^{-1} E_q[\zeta_j \gamma_{jp} \bs{b}_j \bs{b}_j^\intercal | \zeta_j, \gamma_{jp}] \big ] \big \} \\
&= \tr \big \{ E_q \big [ \bs{V}_p^{-1} \zeta_j \gamma_{jp} (\bar{\bs{b}}_j \bar{\bs{b}}_j^\intercal + \bs{U}_{jp}) \big ] \big \} \\
&= \omega_j\psi_{jp} \tr[\bs{V}_p^{-1}(\bar{\bs{b}}_j \bar{\bs{b}}_j^\intercal + \bs{U}_{jp})]
\end{align}

Hence,
\begin{align}
E_q[\log p(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{V}, \bs{\omega}, \bs{\pi})] &= 
\sum_j \sum_p \omega_j \psi_{jp} \{ \log(\pi_p) - \frac{R}{2} \log(2\pi) - \frac{1}{2} \log \det (\bs{V}_p) \} -
\frac{1}{2} \sum_j \sum_p \omega_j\psi_{jp} \tr[\bs{V}_p^{-1}(\bar{\bs{b}}_j \bar{\bs{b}}_j^\intercal + \bs{U}_{jp})]
\end{align}

#### $E_q[\log q(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{V}, \bs{\omega}, \bs{\pi})]$

\begin{align}
E_q[\log q(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{V}, \bs{\omega}, \bs{\pi})] &= 
E_q\big[\sum_j \zeta_j \sum_p \gamma_{jp} \{ \log\psi_{jp} - \frac{R}{2} \log(2\pi) - \frac{1}{2} \log \det(\bs{U}_{jp}) - \frac{1}{2} (\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j) \} \big] \\
& = \sum_j \omega_j \sum_p \psi_{jp} \{ \log\psi_{jp} - \frac{R}{2} \log(2\pi) - \frac{1}{2} \log \det(\bs{U}_{jp}) \} -
\frac{1}{2} \sum_j \sum_p E_q\big[ \zeta_j \gamma_{jp} (\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j)\big],
\end{align}

where the expectation term

\begin{align}
E_q\big[ \zeta_j \gamma_{jp} (\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j)\big] &= 
E_q \big [ E_q[\zeta_j \gamma_{jp} (\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j) | \zeta_j, \gamma_{jp} ] \big] \\
&= E_q \big [ \zeta_j \gamma_{jp} E_q[(\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j) | \zeta_j, \gamma_{jp} ] \big] \\
&= E_q[R\zeta_j\gamma_{jp}] \quad \text{since } \bs{b}_j | \zeta_j, \gamma_{jp} \sim N_R(\bar{\bs{b}}_j, \bs{U}_{jp})\\ 
&= R\omega_j\psi_{jp}
\end{align}
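
The step that replaces the inner expectation by $R$ can be spelled out with the trace trick. This worked equation assumes that $\bs{U}_{jp}$ is full rank, so that $\bs{U}_{jp}^{-1}$ exists.

\begin{align}
E_q[(\bs{b}_j - \bar{\bs{b}}_j)^\intercal \bs{U}_{jp}^{-1} (\bs{b}_j - \bar{\bs{b}}_j) | \zeta_j, \gamma_{jp}] &= 
\tr\{\bs{U}_{jp}^{-1} E_q[(\bs{b}_j - \bar{\bs{b}}_j)(\bs{b}_j - \bar{\bs{b}}_j)^\intercal | \zeta_j, \gamma_{jp}]\} \\
&= \tr(\bs{U}_{jp}^{-1}\bs{U}_{jp}) = \tr(\bs{I}_R) = R
\end{align}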

Hence 

\begin{align}
E_q[\log q(\bs{B}, \bs{\zeta}, \bs{\Gamma}|\bs{V}, \bs{\omega}, \bs{\pi})] &=
\sum_j\omega_j \sum_p\psi_{jp} \big( \log\psi_{jp} - \frac{R}{2}\log(2\pi) - \frac{1}{2}\log\det(\bs{U}_{jp}) - \frac{R}{2}\big)
\end{align}