# Normal approximations

**authors:** Joseph Marcus, Hussein Al-Asadi

Here we explore a computationally efficient Empirical Bayes approach for modeling low-coverage sequence data. We are not certain this approach will work, but we explore it because it provides a tractable path toward including read-level emissions in our work.

## Generative model

Consider the following generative model for read data in a single individual $i$ at SNP $j$. To generate the data, we first simulate an allele frequency trajectory under the Wright-Fisher model. We assume our individuals are observed at different time points stored in an $n$-vector $\mathbf{t}$. Let $\mathbf{x}_{j}$ be a vector of latent allele counts in a single population at these time points. Furthermore, let $\mathbf{f}_{j} = \frac{\mathbf{x}_{j}}{2N_e}$ be the allele frequencies, where $N_e$ is the effective population size, and let $\mu_j$ be the mean of the process (the starting allele frequency of the Markov chain). Given these frequencies, we sample genotypes in an individual assuming Hardy-Weinberg equilibrium. Finally, given the genotypes, we simulate read data $y_{ij}$, the count of reads carrying the derived allele, where $c_{ij}$ is the total coverage.

$$
\begin{aligned}
\mathbf{f}_j | \mu_j, N_e &\sim WF(\mu_j, N_e) \\
g_{ij} | f_{ij} &\sim Binomial\big(2, f_{ij}\big) \\
y_{ij} | g_{ij} &\sim Binomial\Big(c_{ij}, \frac{g_{ij}}{2}\Big)
\end{aligned}
$$
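The three sampling steps above can be simulated directly. The following is a minimal sketch with NumPy, not the authors' implementation: the function name `simulate_reads`, its argument names, and the restriction to integer generation times are assumptions made for illustration.

```python
import numpy as np

def simulate_reads(mu, n_e, times, coverage, rng):
    """Simulate read counts at one SNP under the generative model.

    mu       : starting allele frequency (hypothetical argument name)
    n_e      : effective population size, so there are 2 * n_e gene copies
    times    : sampling generation of each individual (non-negative integers)
    coverage : total read depth c_ij of each individual
    rng      : a numpy.random.Generator
    """
    times = np.asarray(times)
    coverage = np.asarray(coverage)
    two_n = 2 * n_e

    # Wright-Fisher: binomial resampling of the allele count each generation.
    x = int(round(mu * two_n))
    freqs = np.empty(times.max() + 1)
    freqs[0] = x / two_n
    for t in range(1, times.max() + 1):
        x = rng.binomial(two_n, x / two_n)
        freqs[t] = x / two_n

    f = freqs[times]                   # frequency at each individual's sampling time
    g = rng.binomial(2, f)             # genotype under Hardy-Weinberg equilibrium
    y = rng.binomial(coverage, g / 2)  # derived-allele read count given the genotype
    return f, g, y
```

For example, `simulate_reads(0.3, 100, [0, 5, 10], [4, 2, 6], np.random.default_rng(0))` returns one frequency, genotype, and read count per individual.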

To motivate the normal approximation used later, we derive the mean and covariance matrix implied by the above model:

$$
\begin{aligned}
x_{j,t} | x_{j,t-1}, N_e &\sim Binomial\Big(2N_e, \frac{x_{j,t-1}}{2N_e}\Big) \\
E(x_{j,t}) &= 2N_e\mu_j \\
Var(x_{j,t}) &= (2N_e)^2\mu_j(1-\mu_j)\Bigg(1 - \Big(1 - \frac{1}{2N_e}\Big)^t \Bigg) \approx (2N_e)^2\mu_j(1-\mu_j)\big(1-e^{\frac{-t}{2N_e}}\big) \\
\Rightarrow \quad E(f_{j,t}) &= \mu_j \\
Var(f_{j,t}) &= \frac{Var(x_{j,t})}{(2N_e)^2} \approx \mu_j(1-\mu_j)\big(1-e^{\frac{-t}{2N_e}}\big) \approx \frac{\mu_j(1-\mu_j)}{2N_e} t \\
Cov(f_{j,s}, f_{j,t}) &= Cov\big(f_{j,s}, (f_{j,t} - f_{j,s}) + f_{j,s}\big) \quad (s \le t) \\
&= Cov(f_{j,s}, f_{j,t} - f_{j,s}) + Cov(f_{j,s}, f_{j,s}) \\
&= Var(f_{j,s}) \quad \text{(the increment } f_{j,t} - f_{j,s} \text{ is uncorrelated with } f_{j,s}\text{)} \\
&\approx \frac{\mu_j(1-\mu_j)}{2N_e} s
\end{aligned}
$$
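The linear-in-time variance and covariance approximations can be checked by simulation. The sketch below is an illustrative check, not part of the original analysis; the helper `simulate_frequencies` and the parameter values are assumptions. It runs many independent Wright-Fisher replicates and compares the empirical moments of the frequencies with $\frac{\mu_j(1-\mu_j)}{2N_e}t$ and $\frac{\mu_j(1-\mu_j)}{2N_e}s$.

```python
import numpy as np

def simulate_frequencies(mu, n_e, n_gen, n_rep, rng):
    """Return an (n_gen + 1, n_rep) array of Wright-Fisher allele frequencies."""
    two_n = 2 * n_e
    x = np.full(n_rep, int(round(mu * two_n)))
    freqs = np.empty((n_gen + 1, n_rep))
    freqs[0] = x / two_n
    for t in range(1, n_gen + 1):
        # Binomial resampling, vectorised over replicates.
        x = rng.binomial(two_n, x / two_n)
        freqs[t] = x / two_n
    return freqs

rng = np.random.default_rng(1)
mu, n_e, s, t = 0.5, 100, 5, 10          # illustrative values only
freqs = simulate_frequencies(mu, n_e, t, 20000, rng)

emp_var = freqs[t].var()                       # empirical Var(f_t)
emp_cov = np.cov(freqs[s], freqs[t])[0, 1]     # empirical Cov(f_s, f_t)
approx_var = mu * (1 - mu) * t / (2 * n_e)     # linear approximation
approx_cov = mu * (1 - mu) * s / (2 * n_e)

print(f"Var(f_t):      empirical {emp_var:.5f}, approximation {approx_var:.5f}")
print(f"Cov(f_s, f_t): empirical {emp_cov:.5f}, approximation {approx_cov:.5f}")
```

The approximation is expected to be good only while $t$ is small relative to $2N_e$; for longer time spans the empirical variance falls below the linear value.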

Let $\mathbf{T}$ be an $n \times n$ matrix with $\mathbf{T}_{ik} = \min(t_i, t_k)$, the smaller of the two sampling times for each pair of individuals $i$ and $k$. The variance-covariance matrix of $\mathbf{f}_j$ is then $\mathbf{\Sigma} = \frac{\mu_j(1-\mu_j)}{2N_e} \mathbf{T}$.
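As a small sketch of how this matrix might be built in practice (the function name `drift_covariance` is an assumption for illustration), the pairwise minimum can be formed with `np.minimum.outer`:

```python
import numpy as np

def drift_covariance(mu, n_e, times):
    """Covariance of allele frequencies at the sampling times: mu(1-mu)/(2 n_e) * T."""
    times = np.asarray(times, dtype=float)
    # T[i, k] = min(t_i, t_k) for every pair of individuals.
    T = np.minimum.outer(times, times)
    return mu * (1 - mu) / (2 * n_e) * T

Sigma = drift_covariance(0.5, 50, [1, 2, 3])
```

With these illustrative values the scale factor is $0.25 / 100 = 0.0025$, so the diagonal of `Sigma` is $0.0025$ times the sampling times.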

## Normal approximation

Here we consider an approximation to the above generative model in which we use a normal approximation for each of the conditional distributions. Note that we switch to continuous time because we use a Brownian motion approximation to the discrete-time Wright-Fisher Markov chain.

$$
\begin{aligned}
\mathbf{f}_j(\mathbf{t}) | \mu_j, N_e &\sim \mathcal{N}\Big(\mu_j, \frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T}\Big) \\
\mathbf{g}_j | \mathbf{f}_j(\mathbf{t}) &\sim \mathcal{N}\Big(2\mathbf{f}_j(\mathbf{t}), diag\big\{2\mathbf{f}_j(\mathbf{t}) \odot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big) \\
\mathbf{y}_j | \mathbf{g}_j &\sim \mathcal{N}\Bigg(\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2}, diag\Big\{\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2} \odot \Big(\mathbf{1}-\frac{\mathbf{g}_j}{2}\Big)\Big\}\Bigg)
\end{aligned}
$$

If we integrate out $\mathbf{f}_j(\mathbf{t})$, the laws of total expectation and total variance give the marginal moments of $\mathbf{g}_j$:

$$
\begin{aligned}
\mathbb{E}(\mathbf{g}_j) &= \mathbb{E}\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= \mathbb{E}\big(2\mathbf{f}_j(\mathbf{t})\big) \\
&= 2\mu_j\mathbf{1} \\
Var(\mathbf{g}_j) &= Var\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) + \mathbb{E}\Big(Var\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= 4Var\big(\mathbf{f}_j(\mathbf{t})\big) + \mathbb{E}\Big(diag\big\{2\mathbf{f}_j(\mathbf{t}) \odot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big)\\ 
&= 4\frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T} + diag\Big\{2\mu_j(1-\mu_j)\Big(\mathbf{1}-\frac{\mathbf{t}}{2N_e}\Big)\Big\} \\
&= 4\frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T} + 2\mu_j(1-\mu_j)\mathbf{I} - \frac{2\mu_j(1-\mu_j)}{2N_e}diag(\mathbf{T}) \\
&= \frac{2\mu_j(1-\mu_j)}{2N_e}\Big(2\mathbf{T} - diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I} \\
&= \frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I} \\
\end{aligned}
$$

Thus we have 

$$
\mathbf{g}_j | \mu_j, N_e \sim \mathcal{N}\Bigg(2\mu_j\mathbf{1}, \frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I}\Bigg)
$$

$$
\mathbf{y}_j | \mathbf{g}_j \sim \mathcal{N}\Big(\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2}, diag\Big\{\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2} \odot \Big(\mathbf{1}-\frac{\mathbf{g}_j}{2}\Big)\Big\}\Big)
$$
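The marginal covariance of $\mathbf{g}_j$ can be assembled directly from the sampling times. The following is a sketch under the normal approximation above; the function name `genotype_covariance` is hypothetical.

```python
import numpy as np

def genotype_covariance(mu, n_e, times):
    """Marginal covariance of genotypes after integrating out the frequencies.

    Implements mu(1-mu)/(2 n_e) * (4 T - 2 diag(T)) + 2 mu(1-mu) I.
    """
    times = np.asarray(times, dtype=float)
    T = np.minimum.outer(times, times)
    h = mu * (1 - mu)
    drift = h / (2 * n_e) * (4 * T - 2 * np.diag(np.diag(T)))  # shared drift history
    sampling = 2 * h * np.eye(len(times))                       # binomial genotype sampling
    return drift + sampling

V_g = genotype_covariance(0.5, 50, [1, 2])
```

The off-diagonal entries come only from the drift term, which reflects that two individuals covary only through their shared allele frequency history.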

---

Next, we integrate out $\mathbf{g}_j$ to obtain the marginal distribution of $\mathbf{y}_j$ conditional on $\mu_j, N_e$:

$$
\begin{aligned}
\mathbb{E}(\mathbf{y}_j) &= \mathbb{E}\big(\mathbb{E}(\mathbf{y}_j | \mathbf{g}_j) \big) \\
&= \mathbf{c}_j \odot \mathbb{E}\Big(\frac{\mathbf{g}_j}{2}\Big) \\
&=\mathbf{c}_j\mu_j \\
Var(\mathbf{y}_j) &= Var\big(\mathbb{E}(\mathbf{y}_j | \mathbf{g}_j)\big) + \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) \\
&= Var\Big(\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2}\Big) + \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) \\
&= \frac{\mathbf{c}_j\mathbf{c}^T_j}{4} \odot Var\big(\mathbf{g}_j\big)+ \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) \\
&= \frac{\mathbf{c}_j \mathbf{c}^T_j}{4} \odot \Bigg(\frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I}\Bigg) + \mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big)
\end{aligned}
$$


$$
\begin{aligned}
\mathbb{E}\big(Var(\mathbf{y}_j | \mathbf{g}_j)\big) &= \mathbb{E}\Bigg(diag\Big\{\mathbf{c}_j \odot \frac{\mathbf{g}_j}{2} \odot \Big(\mathbf{1}-\frac{\mathbf{g}_j}{2}\Big)\Big\}\Bigg) \\
\end{aligned}
$$

Because this matrix is diagonal, we can consider each individual separately. The $i$-th diagonal entry is

$$
\begin{aligned}
\mathbb{E}\Big(c_{ij}\frac{g_{ij}}{2}\big(1 - \frac{g_{ij}}{2}\big)\Big) &= c_{ij} \mathbb{E}\Big(\frac{g_{ij}}{2} - \frac{g^2_{ij}}{4}\Big) \\
&= c_{ij}\Big(\frac{1}{2}E(g_{ij}) - \frac{1}{4}E(g^2_{ij})\Big) \\
&= c_{ij}\Big(\frac{1}{2}2\mu_j - \frac{1}{4}\big(Var(g_{ij}) + E(g_{ij})^2\big)\Big) \\
&= c_{ij}\Big(\mu_j - \frac{1}{4}\big(Var(g_{ij}) + 4\mu^2_j\big)\Big) \\
&= c_{ij}\Big(\mu_j(1-\mu_j) - \frac{1}{4}Var(g_{ij}) \Big) \\
&= c_{ij}\Bigg(\mu_j(1-\mu_j) - \frac{1}{4}\Big(\frac{\mu_j(1 - \mu_j)}{2N_e}(4t_i - 2t_i) + 2\mu_j(1-\mu_j) \Big)\Bigg) \\
&= c_{ij}\Bigg(\mu_j(1-\mu_j) - \frac{1}{4}\Big(2\frac{\mu_j(1 - \mu_j)}{2N_e}t_i + 2\mu_j(1-\mu_j) \Big)\Bigg) \\
&= c_{ij}\Bigg(\mu_j(1-\mu_j) - \frac{1}{2}\frac{\mu_j(1 - \mu_j)}{2N_e}t_i - \frac{1}{2}\mu_j(1-\mu_j)\Bigg) \\
&= \frac{c_{ij}}{2}\Bigg(\mu_j(1-\mu_j) - \frac{\mu_j(1 - \mu_j)}{2N_e}t_i \Bigg)
\end{aligned}
$$

Thus our marginal variance is

$$
Var(\mathbf{y}_j) = \frac{\mathbf{c}_j \mathbf{c}^T_j}{4} \odot \Bigg(\frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I}\Bigg) + \frac{1}{2} diag\{\mathbf{c}_j\} \Bigg(\mu_j(1-\mu_j)\mathbf{I} - \frac{\mu_j(1-\mu_j)}{2N_e}diag(\mathbf{T}) \Bigg)
$$

Finally our marginal distribution is


$$
\mathbf{y}_j | \mu_j, N_e \sim \mathcal{N}\Bigg(\mathbf{c}_j\mu_j, \frac{\mathbf{c}_j \mathbf{c}^T_j}{4} \odot \Bigg(\frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I}\Bigg) + \frac{1}{2} diag\{\mathbf{c}_j\} \Bigg(\mu_j(1-\mu_j)\mathbf{I} - \frac{\mu_j(1-\mu_j)}{2N_e}diag(\mathbf{T}) \Bigg) \Bigg)
$$
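Since the marginal distribution is multivariate normal, its log-likelihood in $(\mu_j, N_e)$ can be evaluated in closed form. The following is a minimal NumPy sketch of that computation, not a tested implementation of the method: the function names `read_marginal_moments` and `gaussian_loglik` are assumptions, and the example inputs are illustrative values, not data.

```python
import numpy as np

def read_marginal_moments(mu, n_e, times, coverage):
    """Mean and covariance of read counts y_j given mu and n_e."""
    t = np.asarray(times, dtype=float)
    c = np.asarray(coverage, dtype=float)
    T = np.minimum.outer(t, t)
    h = mu * (1 - mu)
    # Var(g_j): note diag(T) has the sampling times on its diagonal.
    var_g = h / (2 * n_e) * (4 * T - 2 * np.diag(t)) + 2 * h * np.eye(len(t))
    # Var(E(y | g)): coverage-weighted genotype covariance.
    between = np.outer(c, c) / 4 * var_g
    # E(Var(y | g)): diagonal read-sampling noise.
    within = np.diag(c / 2 * (h - h / (2 * n_e) * t))
    return c * mu, between + within

def gaussian_loglik(y, mean, cov):
    """Log density of a multivariate normal evaluated at y."""
    resid = np.asarray(y, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (len(resid) * np.log(2 * np.pi) + logdet + quad)

mean, cov = read_marginal_moments(0.5, 50, [1, 2], [2, 4])
loglik = gaussian_loglik([1, 3], mean, cov)
```

An Empirical Bayes fit could, for instance, maximise this quantity summed over SNPs with respect to $N_e$; how well the normal approximation holds at low coverage is exactly the open question raised at the start of this note.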