# Normal approximations

**authors:** Joseph Marcus, Hussein Al-Asadi

Here we explore a computationally efficient Empirical Bayes approach for modeling low-coverage sequence data. We are not certain this approach will work, but we explore it because it provides a tractable path toward including read-level emissions in our work.

## Generative model

Consider the following generative model for read data in a single individual $i$ at SNP $j$. To generate the data, we first simulate an allele frequency trajectory under the Wright-Fisher model. We assume our individuals are observed at different time points stored in an $n$-vector $\mathbf{t}$. Let $\mathbf{f}_j(\mathbf{t})$ be the latent allele frequencies at these time points. Furthermore, let $\mu_j$ be the mean of the process (the starting allele frequency of the Markov chain) and $N_e$ be the effective population size. Given these frequencies, we sample the genotype $g_{ij}$ of each individual assuming Hardy-Weinberg equilibrium. Finally, given the genotypes, we simulate the read data $y_{ij}$, the count of reads carrying the derived allele, where $c_{ij}$ is the total coverage.

$$
\begin{aligned}
\mathbf{f}_j(\mathbf{t}) | \mu_j, N_e &\sim WF(\mu_j, N_e) \\
g_{ij} | f_{ij}(t_i) &\sim Binomial\big(2, f_{ij}(t_i)\big) \\
y_{ij} | g_{ij} &\sim Binomial\Big(c_{ij}, \frac{g_{ij}}{2}\Big)
\end{aligned}
$$
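The three sampling steps above can be sketched as a small simulation. This is a minimal illustration, not the authors' code: the function name `simulate_reads`, the use of $2N_e$ gene copies per generation, and the integer-generation encoding of $\mathbf{t}$ are all assumptions made for the example.

```python
import numpy as np

def simulate_reads(mu, n_e, t, coverage, rng):
    """Simulate latent frequencies, genotypes and read counts at one SNP.

    mu: starting allele frequency; n_e: effective population size
    (assumed diploid here, so 2 * n_e gene copies per generation);
    t: integer generation at which each individual is sampled;
    coverage: total read depth c_ij for each individual.
    """
    t = np.asarray(t)
    coverage = np.asarray(coverage)
    # Wright-Fisher trajectory: binomial resampling of 2 * n_e gene copies.
    traj = np.empty(t.max() + 1)
    traj[0] = mu
    for gen in range(1, t.max() + 1):
        traj[gen] = rng.binomial(2 * n_e, traj[gen - 1]) / (2 * n_e)
    f = traj[t]                        # latent frequency at each sampling time
    g = rng.binomial(2, f)             # genotypes under Hardy-Weinberg
    y = rng.binomial(coverage, g / 2)  # derived-allele read counts
    return f, g, y

rng = np.random.default_rng(0)
f, g, y = simulate_reads(mu=0.3, n_e=1000, t=[0, 10, 10, 50],
                         coverage=[1, 2, 0, 3], rng=rng)
```

Note that a homozygous genotype ($g_{ij} = 0$ or $2$) produces reads deterministically under this model, since no sequencing error term is included.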

## Approximation

Here we consider an approximation to the above generative model in which we use a normal approximation for each of the conditional distributions.

$$
\begin{aligned}
\mathbf{f}_j(\mathbf{t}) | \mu_j, N_e &\sim \mathcal{N}(\mu_j\mathbf{1}, \mathbf{\Sigma})\\
\mathbf{g}_j | \mathbf{f}_j(\mathbf{t}) &\sim \mathcal{N}\Big(2\mathbf{f}_j(\mathbf{t}), 2diag\big\{\mathbf{f}_j(\mathbf{t}) \cdot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big) \\
\mathbf{y}_j | \mathbf{g}_j &\sim \mathcal{N}\Bigg(\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2}, diag\Big\{\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2} \cdot \Big(\mathbf{1}-\frac{\mathbf{g}_j}{2}\Big)\Big\}\Bigg)
\end{aligned}
$$

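If we integrate out $\mathbf{f}_j(\mathbf{t})$, the laws of total expectation and total variance give the marginal mean and covariance of $\mathbf{g}_j$: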

$$
\begin{aligned}
\mathbb{E}(\mathbf{g}_j) &= \mathbb{E}\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= \mathbb{E}\big(2\mathbf{f}_j(\mathbf{t})\big) \\
&= 2\mu_j\mathbf{1}
\end{aligned}
$$

$$
\begin{aligned}
Var(\mathbf{g}_j) &= Var\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) + \mathbb{E}\Big(Var\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= 4Var\big(\mathbf{f}_j(\mathbf{t})\big) + \mathbb{E}\Big(2diag\big\{\mathbf{f}_j(\mathbf{t}) \cdot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big)\\ 
&= \dots \\
&= \mu_j(1-\mu_j)\big(\mathbf{\Sigma} + diag\{\mathbf{\Sigma}\} + 2\mathbf{I}\big)
\end{aligned}
$$

Thus we have

$$
\mathbf{g}_j | \mu_j, N_e \sim \mathcal{N}\Big(2\mu_j\mathbf{1}, \mu_j(1-\mu_j)\big(\mathbf{\Sigma} + diag\{\mathbf{\Sigma}\} + 2\mathbf{I}\big)\Big)
$$

$$
\mathbf{y}_j | \mathbf{g}_j \sim \mathcal{N}\Big(\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2}, diag\Big\{\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2} \cdot \Big(\mathbf{1}-\frac{\mathbf{g}_j}{2}\Big)\Big\}\Big)
$$

Our idea is to fix the variance in the likelihood ($\mathbf{y}_j | \mathbf{g}_j$) by plugging in an estimate $\hat{\mathbf{g}}_j$ of $\mathbf{g}_j$ computed from the data. Let $\mathbf{\Lambda}^{(j)} = diag\Big\{\mathbf{c}_j \cdot \frac{\hat{\mathbf{g}}_j}{2} \cdot \Big(\mathbf{1}-\frac{\hat{\mathbf{g}}_j}{2}\Big)\Big\}$; then we can rewrite the model as

$$
\mathbf{g}_j | \mu_j, N_e \sim \mathcal{N}\Big(2\mu_j\mathbf{1}, \mu_j(1-\mu_j)\big(\mathbf{\Sigma} + diag\{\mathbf{\Sigma}\} + 2\mathbf{I}\big)\Big)
$$

$$
\mathbf{y}_j | \mathbf{g}_j \sim \mathcal{N}\Big(\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2}, \mathbf{\Lambda}^{(j)} \Big)
$$
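As a concrete illustration of the plug-in step, the sketch below builds the diagonal of $\mathbf{\Lambda}^{(j)}$ from read counts. The text does not say how $\hat{\mathbf{g}}_j$ is obtained, so the smoothed read fraction used here, the pseudo-count `pseudo`, and the variance floor `floor` are assumptions of this example only; they are included because a raw read fraction of 0 or 1 (common at low coverage) would make the plug-in variance exactly zero.

```python
import numpy as np

def plug_in_lambda(y, c, pseudo=0.5, floor=1e-8):
    """Diagonal of Lambda^{(j)} from a plug-in genotype estimate.

    g_hat / 2 is estimated by the smoothed read fraction
    (y + pseudo) / (c + 2 * pseudo). The smoothing and the floor are
    assumptions of this sketch, not part of the model in the text.
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    p_hat = (y + pseudo) / (c + 2 * pseudo)  # estimate of g_hat / 2, strictly in (0, 1)
    lam = c * p_hat * (1 - p_hat)            # c_j * g_hat/2 * (1 - g_hat/2)
    return np.maximum(lam, floor)            # keeps Lambda invertible when c = 0

lam = plug_in_lambda(y=[0, 1, 2], c=[1, 2, 2])
```

Any other estimator of $\hat{\mathbf{g}}_j$ can be substituted; the only requirement for the later steps is that the resulting diagonal is strictly positive.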

To be clear, $\mathbf{\Lambda}^{(j)}$ is fixed! Next we can integrate out $\mathbf{g}_j$ to obtain the marginal distribution of $\mathbf{y}_j$ conditional on $\mu_j, N_e$:

$$
\begin{aligned}
\mathbb{E}(\mathbf{y}_j) &= \mathbb{E}\big(\mathbb{E}(\mathbf{y}_j | \mathbf{g}_j) \big) \\
&= \mathbf{c}_j \cdot \mathbb{E}\Big(\frac{\mathbf{g}_j}{2}\Big) \\
&=\mu_j\mathbf{c}_j 
\end{aligned}
$$

$$
\begin{aligned}
Var(\mathbf{y}_j) &= Var\big(\mathbb{E}(\mathbf{y}_j | \mathbf{g}_j)\big) + \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) \\
&= \frac{1}{4} diag\{\mathbf{c}_j\} Var\big(\mathbf{g}_j\big) diag\{\mathbf{c}_j\} + \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) \\
&= \frac{\mu_j(1-\mu_j)}{4} diag\{\mathbf{c}_j\}\big(\mathbf{\Sigma} + diag\{\mathbf{\Sigma}\} + 2\mathbf{I}\big) diag\{\mathbf{c}_j\} + \mathbf{\Lambda}^{(j)}
\end{aligned}
$$

Here the first term uses the fact that $\mathbb{E}(\mathbf{y}_j | \mathbf{g}_j) = \frac{1}{2} diag\{\mathbf{c}_j\}\mathbf{g}_j$ is linear in $\mathbf{g}_j$. Thus our marginal likelihood for $\mathbf{y}_j$ is

$$
\mathbf{y}_j | \mu_j, N_e \sim \mathcal{N}\Bigg(\mu_j \mathbf{c}_j, \frac{\mu_j(1-\mu_j)}{4} diag\{\mathbf{c}_j\}\big(\mathbf{\Sigma} + diag\{\mathbf{\Sigma}\} + 2\mathbf{I}\big) diag\{\mathbf{c}_j\} + \mathbf{\Lambda}^{(j)}\Bigg)
$$
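The marginal density can be evaluated numerically as sketched below. The sketch writes the elementwise product $\mathbf{c}_j \cdot \frac{\mathbf{g}_j}{2}$ as the matrix $diag\{\mathbf{c}_j / 2\}$ applied to $\mathbf{g}_j$, and takes $\mathbf{\Sigma}$ as an input because its dependence on $N_e$ and $\mathbf{t}$ is not specified here. The function name `marginal_loglik`, the placeholder $\mathbf{\Sigma}$, and the grid over $\mu_j$ are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_loglik(y, c, mu, sigma, lam):
    """Log-density of y_j under the Gaussian marginal, given mu_j and Sigma.

    sigma: the n x n matrix Sigma (taken as given); lam: diagonal of Lambda^{(j)}.
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    n = y.size
    # Marginal covariance of g_j from the text.
    v_g = mu * (1 - mu) * (sigma + np.diag(np.diag(sigma)) + 2 * np.eye(n))
    a = np.diag(c / 2)                     # E(y | g) = a @ g
    cov_y = a @ v_g @ a.T + np.diag(lam)   # Var(E(y|g)) + E(Var(y|g))
    return multivariate_normal(mean=mu * c, cov=cov_y).logpdf(y)

y = np.array([0.0, 1.0, 2.0])
c = np.array([1.0, 2.0, 2.0])
sigma = np.full((3, 3), 0.05) + 0.05 * np.eye(3)  # placeholder Sigma, not derived from N_e
lam = np.array([0.1875, 0.5, 0.2778])             # example plug-in variances

# Crude grid search over mu_j with Sigma held fixed.
mus = np.linspace(0.05, 0.95, 19)
ll = np.array([marginal_loglik(y, c, m, sigma, lam) for m in mus])
mu_hat = mus[np.argmax(ll)]
```

A grid is used only to keep the example short; a numerical optimizer over $(\mu_j, N_e)$ would play the same role once $\mathbf{\Sigma}$ is expressed as a function of $N_e$.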

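We estimate $\mu_j$ and $N_e$ by maximum likelihood and plug them into the full model to compute the posterior of $\mathbf{g}_j$ given $\mathbf{y}_j$ analytically. Because the prior on $\mathbf{g}_j$ and the fixed-variance likelihood are both Gaussian, and the likelihood mean is linear in $\mathbf{g}_j$, this posterior is itself Gaussian.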
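The analytic posterior follows from standard Gaussian conditioning. The sketch below is one way to compute it, assuming the plug-in estimate of $\mu_j$ and a matrix $\mathbf{\Sigma}$ are already available; the function name `genotype_posterior` and the inputs in the usage lines are hypothetical.

```python
import numpy as np

def genotype_posterior(y, c, mu, sigma, lam):
    """Gaussian posterior mean and covariance of g_j given y_j.

    Prior: g ~ N(2 * mu * 1, v_g); likelihood: y | g ~ N(a @ g, diag(lam))
    with a = diag(c / 2). sigma and lam are taken as given.
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    n = y.size
    m = np.full(n, 2 * mu)                                   # prior mean of g
    v_g = mu * (1 - mu) * (sigma + np.diag(np.diag(sigma)) + 2 * np.eye(n))
    a = np.diag(c / 2)
    s = a @ v_g @ a.T + np.diag(lam)                         # marginal covariance of y
    k = np.linalg.solve(s, a @ v_g).T                        # gain v_g @ a.T @ inv(s)
    post_mean = m + k @ (y - a @ m)
    post_cov = v_g - k @ a @ v_g
    return post_mean, post_cov

post_mean, post_cov = genotype_posterior(
    y=[0.0, 1.0, 2.0], c=[1.0, 2.0, 2.0], mu=0.5,
    sigma=np.full((3, 3), 0.05) + 0.05 * np.eye(3),          # placeholder Sigma
    lam=[0.1875, 0.5, 0.2778],                               # example plug-in variances
)
```

Since the posterior is Gaussian, its mean is not constrained to lie in $[0, 2]$; this is a property of the normal approximation rather than of the code.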