# M&M model for fine-mapping

This is the fifth attempt at formulating the M&M model. The formulation in this version was inspired by conditional regression for fine-mapping.

$\newcommand{\bs}[1]{\boldsymbol{#1}}$
$\DeclareMathOperator*{\diag}{diag}$
$\DeclareMathOperator*{\cov}{cov}$
$\DeclareMathOperator*{\rank}{rank}$
$\DeclareMathOperator*{\var}{var}$
$\DeclareMathOperator*{\tr}{tr}$
$\DeclareMathOperator*{\veco}{vec}$
$\DeclareMathOperator*{\uniform}{\mathcal{U}niform}$
$\DeclareMathOperator*{\argmin}{arg\ min}$
$\DeclareMathOperator*{\argmax}{arg\ max}$
$\DeclareMathOperator*{\N}{N}$
$\DeclareMathOperator*{\gm}{Gamma}$
$\DeclareMathOperator*{\dif}{d}$

## M&M ASH model

We assume the following multivariate, multiple regression model with $N$ samples, $J$ effects and $R$ conditions (and **without covariates, for the time being**)
\begin{align}
\bs{Y}_{N\times R} = \bs{X}_{N \times J}\bs{B}_{J \times R} + \bs{E}_{N \times R},
\end{align}
where
\begin{align}
\bs{E} &\sim \N_{N \times R}(\bs{0}, \bs{I}_N, \bs{\Sigma}),\\
\bs{\Sigma} &= \diag(\sigma_1,\ldots,\sigma_R).
\end{align}
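As a minimal sketch of the model above, the following simulates data from it. The function name `simulate_mm`, the standard-normal design matrix, the number of non-zero rows, and the reading of $\sigma_r$ as a residual variance are all assumptions made for illustration; they are not part of the model specification.

```python
import numpy as np

def simulate_mm(N=200, J=10, R=3, n_nonzero=2, sigma=(1.0, 0.5, 2.0), seed=0):
    """Draw (X, B, Y) from Y = X B + E, with rows of E iid N_R(0, diag(sigma))."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, J))                      # hypothetical design matrix
    B = np.zeros((J, R))
    nonzero = rng.choice(J, size=n_nonzero, replace=False)
    B[nonzero] = rng.normal(size=(n_nonzero, R))     # a few non-zero rows of B
    # sigma is the diagonal of Sigma, interpreted here as variances (assumption)
    E = rng.normal(size=(N, R)) * np.sqrt(np.asarray(sigma))
    Y = X @ B + E
    return X, B, Y

X, B, Y = simulate_mm()
print(X.shape, B.shape, Y.shape)
```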

Let $\bs{\Lambda} = \bs{\Sigma}^{-1}$, with diagonal entries $\lambda_r$. We place a Gamma prior on each $\lambda_r$,

$$\lambda_r \overset{iid}{\sim} \gm(\alpha, \beta),$$

and as a first pass we set $\alpha = \beta = 0$ so that it is equivalent to estimating $\bs{\Sigma}$ via maximum likelihood.

Let $\omega_j := p(\zeta_j = 1)$ be the prior probability that effect $j$ is non-zero. We assume the non-zero effects $\bs{b}_j$ (rows of $\bs{B}$) are iid, with a mixture of multivariate normals as their prior distribution

\begin{align}
p(\bs{b}_j|\zeta_j = 1) = \sum_{p = 0}^P\pi_p\N_R(\bs{b}_j | \bs{0}, \bs{V}_p),
\end{align}

where the $\bs{V}_p$'s are $R \times R$ positive semi-definite covariance matrices and the $\pi_p$'s are their weights. We can augment the prior of $\bs{b}_j$ with an indicator vector $\bs{\gamma}_j \in \{0, 1\}^{P+1}$ denoting membership of $\bs{b}_j$ in one of the $P+1$ mixture components (indexed $p = 0, \ldots, P$) and write

\begin{align}
p(\bs{b}_j|\zeta_j = 1, \bs{\gamma}_j) &= \prod_{p = 0}^P\left[\N(\bs{b}_j|\bs{0},\bs{V}_p)\right]^{\gamma_{jp}},
\end{align}

where

\begin{align}
p(\bs{\gamma}_j) &= \prod_{p = 0}^{P} \pi_p^{\gamma_{jp}}
\end{align}
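The augmented prior can be read as a two-step sampling scheme: first draw the indicator $\bs{\gamma}_j$ with probabilities $\bs{\pi}$, then draw $\bs{b}_j$ from the selected normal component. A minimal sketch follows; the function name `sample_effects` and the example weights and covariances are hypothetical.

```python
import numpy as np

def sample_effects(n, pi, V, seed=0):
    """Sample n effect vectors from sum_p pi[p] * N_R(0, V[p]) via the indicator gamma."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    V = np.asarray(V, dtype=float)               # shape (P + 1, R, R)
    R = V.shape[1]
    comp = rng.choice(len(pi), size=n, p=pi)     # component label of each draw
    gamma = np.eye(len(pi), dtype=int)[comp]     # one-hot indicators, shape (n, P + 1)
    b = np.zeros((n, R))
    for p in range(len(pi)):
        idx = np.flatnonzero(comp == p)
        if idx.size:
            # The default SVD-based factorization accepts singular (PSD) covariances.
            b[idx] = rng.multivariate_normal(np.zeros(R), V[p], size=idx.size)
    return b, gamma

# Example: a point mass at zero (V_0 = 0) mixed with an identity-covariance component.
b, gamma = sample_effects(1000, pi=[0.5, 0.5], V=[np.zeros((2, 2)), np.eye(2)])
```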

Let $Z := (\bs{B}, \bs{\zeta}, \bs{\Gamma})$, $\Theta := (\bs{\omega}, \bs{\pi}, \bs{V})$.
The densities involved are

\begin{align}
p(\bs{Y}, \bs{\Sigma}, \bs{B},\bs{\zeta}, \bs{\Gamma}, \bs{\omega}, \bs{\pi}, \bs{V}) &= 
p(\bs{Y}, \bs{\Sigma}, Z, \Theta) \\
& = p(\bs{Y}|\bs{\Sigma}, Z, \Theta) p(Z|\Theta) p(\bs{\Sigma})
\end{align}

We are interested in computing the posterior $p(Z|\Theta, \bs{Y}, \bs{\Sigma})$. Note that in solving the M&M model below we assume the $\bs{V}_p$'s and their corresponding $\pi_p$'s are known; in practice we use `mashr` to estimate these quantities and provide them to M&M.

## Evidence lower bound (ELBO)
Following from [the VEM framework](https://gaow.github.io/mvarbvs/writeup/20171203_VEM.html), 

\begin{align}
\log p(\bs{Y}|\bs{\Sigma}, \Theta) & = E_q[\log p(\bs{Y} | Z, \bs{\Sigma}, \Theta)] + 
E_q[\log\frac{p(Z|\Theta)}{q(Z)}] - E_q[\log \frac{p(Z|\bs{Y}, \bs{\Sigma}, \Theta)}{q(Z)}] \\
& = \mathcal{L}(q, \bs{\Sigma}, \Theta) + D_{kl}[q(Z) || p(Z|\bs{Y}, \bs{\Sigma}, \Theta)]
\end{align}

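where $D_{kl}[\cdot || \cdot] \ge 0$ is the Kullback–Leibler (KL) divergence, so that $\mathcal{L}(q, \bs{\Sigma}, \Theta) \le \log p(\bs{Y}|\bs{\Sigma}, \Theta)$ is the evidence lower bound (ELBO). The bound is tight when $q(Z)$ equals the posterior $p(Z|\bs{Y}, \bs{\Sigma}, \Theta)$.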
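The decomposition can be checked numerically on a toy model with a discrete latent variable, where every quantity is a finite sum. The prior and likelihood values below are arbitrary illustrative numbers, not related to M&M.

```python
import numpy as np

# Toy model: Z takes 3 values; prior p(z) and likelihood p(y | z) at the observed y.
prior = np.array([0.5, 0.3, 0.2])
lik = np.array([0.10, 0.40, 0.05])

evidence = np.sum(prior * lik)            # p(y)
posterior = prior * lik / evidence        # p(z | y)

def elbo(q):
    """E_q[log p(y | z)] + E_q[log p(z) / q(z)]."""
    return np.sum(q * (np.log(lik) + np.log(prior) - np.log(q)))

def kl(q, p):
    """KL divergence D_kl[q || p] for discrete distributions."""
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.2, 0.5, 0.3])             # an arbitrary variational distribution

assert np.isclose(elbo(q) + kl(q, posterior), np.log(evidence))
assert elbo(q) <= np.log(evidence)
assert np.isclose(elbo(posterior), np.log(evidence))   # tight at the posterior
```

In this example the log evidence is negative, so the ELBO is negative as well; only the KL term is guaranteed to be non-negative.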

## One effect model

To derive the variational algorithm for fine-mapping with M&M, we first discuss the case with only one non-zero effect, then show that the results generalize to the case with multiple non-zero effects and naturally yield fine-mapping solutions.

When only one row of $\bs{B}$ is non-zero, say the first row $\bs{b}_1 := \bs{B}_{1\cdot}$, so that $\zeta_1 = 1$ and $\bs{\zeta}_{-1} = \bs{0}$, M&M becomes


\begin{align}
\bs{Y}_{N \times R} = \bs{x}_1\bs{b}_1 + \bs{E}_{N \times R},
\end{align}

a multivariate, single regressor Bayesian regression with prior

\begin{align}
p(\bs{b}_1) = \sum_{p = 0}^P\pi_p\N_R(\bs{b}_1 | \bs{0}, \bs{V}_p).
\end{align}

Let "$\propto$" denote equality up to an additive constant independent of $q$. We write the ELBO of this model as

\begin{align}
\mathcal{L}_1(q, \bs{\Sigma}, \Theta; \bs{Y}) &= 
E_q [\log p(\bs{Y} | \bs{b}_1, \bs{\gamma}_1, \bs{\Sigma}, \Theta)] + 
E_q[\log \frac{p(\bs{b}_1, \bs{\gamma}_1 | \Theta)}{q(\bs{b}_1, \bs{\gamma}_1)}] \\
&= -\frac{NR}{2}\log(2\pi) - 
\frac{N}{2}E_q[\log\det (\bs{\Sigma})] - 
\frac{1}{2}E_q \{ \tr[(\bs{Y} - \bs{x}_1 \bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1 \bs{b}_1)^\intercal] \} +
E_q[\log \frac{p(\bs{b}_1 | \bs{\gamma}_1, \Theta)p(\bs{\gamma}_1)}{q(\bs{b}_1 | \bs{\gamma}_1)q(\bs{\gamma_1})}]
\end{align}

We focus on $E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1)^\intercal] \}$,

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1)^\intercal] \} &= 
\tr\{\bs{Y} E_q[\bs{\Sigma}^{-1}] \bs{Y}^\intercal\} - 
2\tr\{\bs{Y}E_q[\bs{\Sigma}^{-1}]E_q[\bs{b}_1]^{\intercal}\bs{x}_1^\intercal\} +
E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] \\
& \propto E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] - 2\tr(\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{b}_1]^\intercal\bs{x}_1^\intercal)
\end{align}

Therefore, 

\begin{align}
\mathcal{L}_1(q, \bs{\Sigma}, \Theta; \bs{Y}) & \propto
-\frac{1}{2}E_q[\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal \bs{x}_1^\intercal)] + \tr(\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{b}_1]^\intercal\bs{x}_1^\intercal) +
E_q[\log \frac{p(\bs{b}_1 | \bs{\gamma}_1, \Theta)p(\bs{\gamma}_1)}{q(\bs{b}_1 | \bs{\gamma}_1)q(\bs{\gamma_1})}]
\end{align}

In this case, we maximize $\mathcal{L}_1(q, \bs{\Sigma}, \Theta; \bs{Y})$ over $q$ by setting the variational distribution to the exact posterior, since it can be calculated easily (FIXME: detail the computation here, along the lines of Wakefield 2009, also possibly as implemented in `mashr` package). That is, $p(Z|\Theta, \bs{Y}, \bs{\Sigma}) = \argmax_q \mathcal{L}_1(q, \bs{\Sigma}, \Theta; \bs{Y})$.
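One way the posterior computation could look is sketched below. It assumes the standard conjugate update for a normal-mixture prior: the least-squares estimate $\hat{\bs{b}} = \bs{x}_1^\intercal\bs{Y} / \bs{x}_1^\intercal\bs{x}_1$ has sampling covariance $\bs{S} = \bs{\Sigma} / \bs{x}_1^\intercal\bs{x}_1$, component $p$ has posterior mean $\bs{V}_p(\bs{V}_p + \bs{S})^{-1}\hat{\bs{b}}$ and covariance $\bs{V}_p - \bs{V}_p(\bs{V}_p + \bs{S})^{-1}\bs{V}_p$, and the posterior weights are proportional to $\pi_p \N(\hat{\bs{b}} | \bs{0}, \bs{V}_p + \bs{S})$. The function name `single_effect_posterior` is hypothetical, and this is not claimed to match the `mashr` implementation.

```python
import numpy as np

def single_effect_posterior(x, Y, Sigma, pi, V):
    """Posterior of b under Y = x b + E with prior sum_p pi[p] N(0, V[p]).

    Returns mixture weights, per-component means and covariances,
    and the overall posterior mean.
    """
    xtx = x @ x
    bhat = x @ Y / xtx                   # least-squares estimate, shape (R,)
    S = Sigma / xtx                      # sampling covariance of bhat
    R = len(bhat)
    logw, means, covs = [], [], []
    for pi_p, V_p in zip(pi, V):
        T = V_p + S                      # marginal covariance of bhat under component p
        _, logdet = np.linalg.slogdet(T)
        Tinv_b = np.linalg.solve(T, bhat)
        # log of pi_p * N(bhat | 0, T)
        logw.append(np.log(pi_p) - 0.5 * (R * np.log(2 * np.pi) + logdet + bhat @ Tinv_b))
        means.append(V_p @ Tinv_b)                         # V_p T^{-1} bhat
        covs.append(V_p - V_p @ np.linalg.solve(T, V_p))   # V_p - V_p T^{-1} V_p
    logw = np.array(logw)
    w = np.exp(logw - logw.max())        # normalize on the log scale for stability
    w /= w.sum()
    means = np.array(means)
    return w, means, np.array(covs), w @ means

# Hypothetical example: one strong effect across R = 2 conditions.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
b_true = np.array([1.0, -2.0])
Y = np.outer(x, b_true) + 0.1 * rng.normal(size=(50, 2))
V = [np.zeros((2, 2)), 100.0 * np.eye(2)]
w, means, covs, post_mean = single_effect_posterior(x, Y, 0.01 * np.eye(2), [0.5, 0.5], V)
```

Writing the update in terms of $(\bs{V}_p + \bs{S})^{-1}$ rather than $\bs{V}_p^{-1}$ keeps it well defined when $\bs{V}_p$ is singular, including a point mass at zero.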

## Two effects model

When two rows of $\bs{B}$ are non-zero, that is, $\zeta_1 = \zeta_2 = 1$ and $\bs{\zeta}_{-1, -2} = \bs{0}$, M&M becomes

\begin{align}
\bs{Y}_{N \times R} = \bs{x}_1\bs{b}_1 + \bs{x}_2\bs{b}_2 + \bs{E}_{N \times R},
\end{align}

a multivariate, two regressor Bayesian regression with independent priors $p(\bs{b}_1, \bs{b}_2) = p(\bs{b}_1)p(\bs{b}_2)$ where 

\begin{align}
p(\bs{b}_\cdot) = \sum_{p = 0}^P\pi_p\N_R(\bs{b}_\cdot | \bs{0}, \bs{V}_p).
\end{align}

We write the ELBO of this model as

\begin{align}
\mathcal{L}_2(q, \bs{\Sigma}, \Theta; \bs{Y}) &= 
E_q [\log p(\bs{Y} | \bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2, \bs{\Sigma}, \Theta)] + 
E_q[\log \frac{p(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2 | \Theta)}{q(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2)}] \\
&= -\frac{NR}{2}\log(2\pi) - 
\frac{N}{2}E_q[\log\det (\bs{\Sigma})] - 
\frac{1}{2}E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \} +
E_q[\log \frac{p(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2 | \Theta)}{q(\bs{b}_1, \bs{\gamma}_1, \bs{b}_2, \bs{\gamma}_2)}]
\end{align}

where we choose $q(Z_1, Z_2) = q_1(Z_1)q_2(Z_2)$, with $Z_l := (\bs{b}_l, \bs{\gamma}_l)$, as the variational approximation to the posterior $p(Z|\Theta, \bs{Y}, \bs{\Sigma})$, i.e., a "fully factorized" variational approximation. This allows us to use an iterative approach to maximize the ELBO.

### Maximize over $q_2$ with $q_1$ fixed

Similar to the "one effect model" we focus on

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \},
\end{align}

and analogous to the setup of "conditional regression", we treat $q_1$ as fixed and maximize over $q_2$ only,

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \} & \propto
E_{q_2}\{\tr(\bs{x}_2\bs{b}_2\bs{\Sigma}^{-1}\bs{b}_2^\intercal\bs{x}_2^\intercal)\} -
2\tr\{(\bs{Y} - \bs{x}_1E_{q_1}[\bs{b}_1])\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal \bs{x}_2^\intercal\}
\end{align}
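To spell out the step above: expanding the quadratic form and using the factorization $q = q_1q_2$, which gives $E_q[\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_2^\intercal] = E_{q_1}[\bs{b}_1]\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal$, we have

\begin{align}
E_q \{ \tr[(\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2) \bs{\Sigma}^{-1} (\bs{Y} - \bs{x}_1\bs{b}_1 - \bs{x}_2\bs{b}_2)^\intercal] \} &=
\tr(\bs{Y}\bs{\Sigma}^{-1}\bs{Y}^\intercal) -
2\tr(\bs{Y}\bs{\Sigma}^{-1}E_{q_1}[\bs{b}_1]^\intercal\bs{x}_1^\intercal) +
E_{q_1}\{\tr(\bs{x}_1\bs{b}_1\bs{\Sigma}^{-1}\bs{b}_1^\intercal\bs{x}_1^\intercal)\} \\
& \quad + E_{q_2}\{\tr(\bs{x}_2\bs{b}_2\bs{\Sigma}^{-1}\bs{b}_2^\intercal\bs{x}_2^\intercal)\} -
2\tr(\bs{Y}\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal\bs{x}_2^\intercal) +
2\tr(\bs{x}_1E_{q_1}[\bs{b}_1]\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal\bs{x}_2^\intercal).
\end{align}

The first line does not involve $q_2$ and is absorbed into the constant; the last two terms of the second line combine into the single trace involving $\bs{Y} - \bs{x}_1E_{q_1}[\bs{b}_1]$.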

Let $\bs{\xi}_1 = \bs{Y} - \bs{x}_1E_{q_1}[\bs{b}_1]$, we have
\begin{align}
\mathcal{L}_2(q_2, \bs{\Sigma}, \Theta; \bs{Y}) & \propto -\frac{1}{2}E_{q_2}\{\tr(\bs{x}_2\bs{b}_2\bs{\Sigma}^{-1}\bs{b}_2^\intercal\bs{x}_2^\intercal)\} +
\tr\{\bs{\xi}_1\bs{\Sigma}^{-1}E_{q_2}[\bs{b}_2]^\intercal \bs{x}_2^\intercal\} +
E_{q_2}[\log \frac{p(\bs{b}_2 | \bs{\gamma}_2, \Theta)p(\bs{\gamma}_2)}{q_2(\bs{b}_2 | \bs{\gamma}_2)q_2(\bs{\gamma_2})}],
\end{align}

and similar to the case with "one effect model", $p(Z_2|\Theta, \bs{\xi}_1, \bs{\Sigma}) = \argmax_{q_2} \mathcal{L}_2(q_2, \bs{\Sigma}, \Theta; \bs{Y})$. In other words, we maximize $\mathcal{L}_2$ over $q_2$ with $q_1$ fixed by applying the same posterior computation used to maximize $\mathcal{L}_1$ over $q_1$, but with the residualized response $\bs{\xi}_1$ in place of the original response $\bs{Y}$:

\begin{align}
\mathcal{L}_2(q_2, \bs{\Sigma}, \Theta; \bs{Y}) \propto \mathcal{L}_1(q_2, \bs{\Sigma}, \Theta; \bs{\xi}_1)
\end{align}

### Maximize over $q_1$ with $q_2$ fixed

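Similarly, we can maximize $\mathcal{L}_2$ over $q_1$ with $q_2$ fixed, using the residualized response $\bs{\xi}_2 = \bs{Y} - \bs{x}_2E_{q_2}[\bs{b}_2]$. The algorithm alternates between the two updates until convergence.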

## Generalization to arbitrary number of effects

The arguments above generalize to $L$ non-zero effects. For the $l$-th effect we optimize the ELBO over $q_l$ with all other $q_{-l}$ fixed, by applying Bayesian multivariate regression to $\bs{\xi}_{-l} = \bs{x}_l \bs{b}_l + \bs{E}$, where $\bs{\xi}_{-l} := \bs{Y} - \sum_{l' \ne l}\bs{x}_{l'}E_{q_{l'}}[\bs{b}_{l'}]$ is the response residualized on all other effects. The algorithm cycles over $l = 1, \ldots, L$ until convergence.
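The iteration described above can be sketched as a coordinate ascent loop. This is a simplified illustration under stated assumptions: each effect $l$ is tied to a known regressor (column $l$ of `X`), only posterior means are tracked, $\bs{\Sigma}$ is held fixed, and convergence is monitored through the change in the posterior means rather than through the ELBO. The function names `posterior_mean` and `fit_effects` are hypothetical.

```python
import numpy as np

def posterior_mean(x, xi, Sigma, pi, V):
    """Posterior mean of b for xi = x b + E under the prior sum_p pi[p] N(0, V[p])."""
    xtx = x @ x
    bhat = x @ xi / xtx
    S = Sigma / xtx
    logw, means = [], []
    for pi_p, V_p in zip(pi, V):
        T = V_p + S
        Tinv_b = np.linalg.solve(T, bhat)
        # log weight up to a constant shared by all components
        logw.append(np.log(pi_p) - 0.5 * (np.linalg.slogdet(T)[1] + bhat @ Tinv_b))
        means.append(V_p @ Tinv_b)
    w = np.exp(np.array(logw) - max(logw))
    return (w / w.sum()) @ np.array(means)

def fit_effects(X, Y, Sigma, pi, V, max_iter=100, tol=1e-8):
    """Cycle over the columns of X, refitting each effect to its partial residual."""
    N, L = X.shape
    B = np.zeros((L, Y.shape[1]))        # row l holds E_{q_l}[b_l]
    for _ in range(max_iter):
        B_old = B.copy()
        for l in range(L):
            # xi_{-l}: remove the expected contribution of every other effect
            xi = Y - X @ B + np.outer(X[:, l], B[l])
            B[l] = posterior_mean(X[:, l], xi, Sigma, pi, V)
        if np.max(np.abs(B - B_old)) < tol:
            break
    return B

# Hypothetical example with L = 2 effects and R = 2 conditions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
B_true = np.array([[1.0, 0.5], [-1.0, 2.0]])
Y = X @ B_true + 0.1 * rng.normal(size=(500, 2))
V = [np.zeros((2, 2)), 10.0 * np.eye(2)]
B_hat = fit_effects(X, Y, 0.01 * np.eye(2), [0.5, 0.5], V)
```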

## Derivation of ELBO

As shown before, the ELBO for the full variational M&M model is

\begin{align}
\mathcal{L}(q, \bs{\Sigma}, \Theta; \bs{Y}) & = E_q[\log p(\bs{Y} | Z, \bs{\Sigma}, \Theta)] + 
E_q[\log\frac{p(Z|\Theta)}{q(Z)}] \\
& \propto -\frac{1}{2}E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} + \tr\{\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{B}]^\intercal\bs{X}^\intercal\} +
E_q[\log\frac{p(Z|\Theta)}{q(Z)}]
\end{align}

We will work out each of the 3 terms.

### Compute $\tr\{\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{B}]^\intercal\bs{X}^\intercal\}$

Let $\tilde{\bs{B}} := E_q[\bs{B}]$ denote the posterior mean of $\bs{B}$ under $q$, thus

\begin{align}
\tr\{\bs{Y}\bs{\Sigma}^{-1}E_q[\bs{B}]^\intercal\bs{X}^\intercal\} &= 
\tr\{\bs{Y}\bs{\Sigma}^{-1}\tilde{\bs{B}}^\intercal\bs{X}^\intercal\}
\end{align}


### Compute $E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\}$

\begin{align}
E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} &= 
E_q\{\tr(\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal\bs{X})\} \quad \text{Cyclic permutation of trace} \\
&= E_q\{\tr(\bs{\Sigma}^{-1}\bs{B}^\intercal \bs{S} \bs{B})\} \\
&= \tr\{\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\},
\end{align}

where $\bs{S}:=\bs{X}^\intercal\bs{X}$. Now we focus on element-wise computations for $\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)$. Recall that $\bs{B} \in \mathbb{R}^{J\times R}$, $\bs{S} \in \mathbb{R}^{J\times J}$, $\bs{\Sigma}^{-1} \in \mathbb{R}^{R\times R}$,

\begin{align}
[\bs{B}^\intercal\bs{S}]_{rj'} &= \sum_j^J B_{jr} S_{jj'}, \\
[\bs{B}^\intercal\bs{S}\bs{B}]_{rr'} &= \sum_j \sum_{j'} B_{jr} S_{jj'} B_{j'r'}, \\
E_q[\bs{B}^\intercal\bs{S}\bs{B}]_{rr'} &= \sum_j \sum_{j'} S_{jj'} E_q[B_{jr} B_{j'r'}] \\
&= \sum_j \sum_{j'} S_{jj'} E_q[B_{jr}] E_q[B_{j'r'}] + \sum_j \sum_{j'} S_{jj'} \cov(B_{jr}, B_{j'r'}) \\
&= \sum_j \sum_{j'} S_{jj'} \tilde{B}_{jr} \tilde{B}_{j'r'} + \sum_j \sum_{j'} S_{jj'} \rho_{jrr'}\mathbb{1}(j=j'),
\end{align}

where $\rho_{jrr'}:= \cov(B_{jr}, B_{jr'})$ is the covariance, under $q$, of the $j$-th effect between conditions $r$ and $r'$. For $j \ne j'$, the covariances are zero because the effects are assumed independent.

The $rr'$ element of $\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)$ is thus

\begin{align}
\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)_{rr'} &= 
\sum_l^R \lambda_{rl}\big(\sum_j \sum_{j'}S_{jj'}\tilde{B}_{jl}\tilde{B}_{j'r'} + \sum_j S_{jj}\rho_{jlr'}\big) \\
& = \sum_l^R\sum_j\sum_{j'}\lambda_{rl}S_{jj'}\tilde{B}_{jl}\tilde{B}_{j'r'} + 
\sum_l^R\sum_j \lambda_{rl} S_{jj} \rho_{jlr'},
\end{align}

and finally

\begin{align}
E_q\{\tr(\bs{X}\bs{B}\bs{\Sigma}^{-1}\bs{B}^\intercal\bs{X}^\intercal)\} &= 
\tr\{\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\} \\
&= \sum_r^R\big(\bs{\Sigma}^{-1}E_q[\bs{B}^\intercal \bs{S} \bs{B}]\big)_{rr} \\
&= \sum_r \sum_l^R \sum_j \sum_{j'} \lambda_{rl} S_{jj'} \tilde{B}_{jl} \tilde{B}_{j'r} + 
\sum_r \sum_l^R \sum_j \lambda_{rl} S_{jj} \rho_{jlr} \\
&= \sum_r \sum_{r'} \sum_j \sum_{j'} \lambda_{rr'} S_{jj'} \tilde{B}_{jr'} \tilde{B}_{j'r} + 
\sum_r \sum_{r'} \sum_j \lambda_{rr'} S_{jj} \rho_{jrr'} \\
&= \tr\{\bs{\Sigma}^{-1}\tilde{\bs{B}}^\intercal \bs{S} \tilde{\bs{B}}\} + 
\sum_r \sum_{r'} \sum_j \lambda_{rr'} S_{jj} \rho_{jrr'}
\end{align}
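The final identity can be verified numerically. The sketch below uses a hypothetical discrete $q$ with independent rows, chosen only because its expectation can be enumerated exactly; the identity depends on $q$ only through the row means $\tilde{\bs{B}}$ and the row covariances $\rho_{jrr'}$. All numerical values are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, J, R = 20, 3, 2
X = rng.normal(size=(N, J))
S = X.T @ X
Lam = np.diag([2.0, 0.5])                  # Sigma^{-1}
B_tilde = rng.normal(size=(J, R))          # E_q[B]
A = rng.normal(size=(J, R, R))
C = A @ A.transpose(0, 2, 1)               # C[j][r, r'] plays the role of rho_{j r r'}
chol = np.linalg.cholesky(C)

# Closed form: tr(Sigma^{-1} B~' S B~) + sum_j S_jj * sum_{r,r'} lambda_{rr'} rho_{jrr'}.
closed = np.trace(Lam @ B_tilde.T @ S @ B_tilde) + sum(
    S[j, j] * np.trace(Lam @ C[j]) for j in range(J)
)

def support(j):
    """2R equally likely values for row j with mean B_tilde[j] and covariance C[j]."""
    return [B_tilde[j] + s * np.sqrt(R) * chol[j][:, k]
            for k in range(R) for s in (1, -1)]

# Exact expectation by enumerating all combinations of independent rows.
values = []
for rows in itertools.product(*(support(j) for j in range(J))):
    B = np.array(rows)
    values.append(np.trace(X @ B @ Lam @ B.T @ X.T))
exact = np.mean(values)

assert np.isclose(closed, exact)
```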

### Compute $E_q[\log\frac{p(Z|\Theta)}{q(Z)}]$