# Normal approximations pt 2

**authors:** Joseph Marcus, Hussein Al-Asadi

... *not so normal* ;)

## Generative model

Consider the following generative model for read data in a single individual $i$ at SNP $j$. To generate the data, we first simulate an allele frequency trajectory under the Wright-Fisher model. We assume our individuals are observed at different time points stored in an $n$-vector $\mathbf{t}$. Let $\mathbf{x}_{j}$ be the vector of latent allele counts in a single population at these time points. Furthermore, let $N_e$ be the effective population size and $\mathbf{f}_{j} = \frac{1}{2N_e}\mathbf{x}_{j}$ the corresponding allele frequencies. Let $\mu_j$ be the mean of the process (the starting allele frequency of the Markov chain). Given these frequencies, we sample genotypes in each individual assuming Hardy-Weinberg equilibrium. Finally, given the genotypes, we simulate the read data $y_{ij}$, the count of reads carrying the derived allele, where $c_{ij}$ is the total coverage.

$$
\begin{aligned}
\mathbf{f}_j | \mu_j, N_e &\sim WF(\mu_j, N_e) \\
g_{ij} | f_{ij} &\sim Binomial\big(2, f_{ij}\big) \\
y_{ij} | g_{ij} &\sim Binomial\Big(c_{ij}, \frac{g_{ij}}{2}\Big)
\end{aligned}
$$
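
As a rough illustration, the three layers above can be simulated directly. This is a minimal sketch, not a reference implementation: the function name `simulate_reads`, the rounding of $2N_e\mu_j$ to get a starting allele count, and the requirement that the sampling times be sorted non-negative integers (in generations) are all assumptions we make here.

```python
import numpy as np

def simulate_reads(mu, n_e, times, coverage, rng):
    """Simulate derived-allele read counts at one SNP (hypothetical helper).

    times:    sorted, non-negative integer sampling times (generations)
    coverage: total read depth c_ij for each individual
    """
    times = np.asarray(times)
    coverage = np.asarray(coverage)
    two_n = 2 * n_e
    # Assumption of this sketch: start the chain at the rounded count 2*N_e*mu.
    x = int(round(mu * two_n))
    freqs = np.empty(len(times))
    t_now = 0
    for k, t in enumerate(times):
        # Advance the Wright-Fisher chain one generation at a time.
        for _ in range(t - t_now):
            x = rng.binomial(two_n, x / two_n)
        t_now = t
        freqs[k] = x / two_n
    # Genotypes under Hardy-Weinberg equilibrium, given the frequency.
    genotypes = rng.binomial(2, freqs)
    # Reads carrying the derived allele, given the genotype.
    reads = rng.binomial(coverage, genotypes / 2)
    return freqs, genotypes, reads

rng = np.random.default_rng(0)
freqs, genotypes, reads = simulate_reads(0.3, 100, [0, 5, 10], [10, 10, 10], rng)
```

Note that with no sequencing error in the model, a homozygous genotype always yields reads of a single allele.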

To motivate the normal approximation used later, we derive the mean and covariance matrix implied by the above model:

$$
\begin{aligned}
x_{j,t} | x_{j,t-1}, N_e &\sim Binomial\Big(2N_e, \frac{x_{j,t-1}}{2N_e}\Big) \\
f_{j,t} &= \frac{x_{j,t}}{2N_e} \\
\Rightarrow \quad E(f_{j,t}) &= E\Big(\frac{x_{j,t}}{2N_e}\Big) = \mu_j \\
Var(f_{j,t}) &\approx \mu_j(1-\mu_j)\big(1-e^{\frac{-t}{2N_e}}\big)\approx \frac{\mu_j(1-\mu_j)}{2N_e} t \\
Cov(f_{j,s}, f_{j,t}) &= Cov\big(f_{j,s}, (f_{j,t} - f_{j,s}) + f_{j,s}\big) \\
&= Cov(f_{j,s}, f_{j,t} - f_{j,s}) + Cov(f_{j,s}, f_{j,s}) \\
&= Var(f_{j,s}) \\
&\approx \frac{\mu_j(1-\mu_j)}{2N_e} s
\end{aligned}
$$
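
To get a feel for how tight these variance approximations are, we can compare the drift factor multiplying $\mu_j(1-\mu_j)$ in three forms: the discrete-generation Wright-Fisher expression $1-(1-\frac{1}{2N_e})^t$, the exponential form, and the linear form $\frac{t}{2N_e}$. The helper name `wf_variance_factors` and the parameter values are illustrative assumptions.

```python
import numpy as np

def wf_variance_factors(n_e, t):
    """Return the factor multiplying mu*(1-mu) in Var(f_t), in three forms."""
    two_n = 2 * n_e
    exact = 1 - (1 - 1 / two_n) ** t       # discrete-generation Wright-Fisher
    exponential = 1 - np.exp(-t / two_n)   # continuous-time form
    linear = t / two_n                     # first-order expansion in t / (2 N_e)
    return exact, exponential, linear

# The three forms agree closely when t is small relative to 2 * N_e
# and drift apart as t grows.
for t in (10, 100, 1000):
    print(t, wf_variance_factors(1000, t))
```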

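For $s \le t$, the increment $f_{j,t} - f_{j,s}$ is uncorrelated with $f_{j,s}$ because the allele frequency is a martingale, so the covariance depends only on the earlier of the two times. Let $\mathbf{T}$ be an $n \times n$ matrix with entries $\mathbf{T}_{ik} = \min(t_i, t_k)$, the smaller of the sampling times for each pair of individuals $i$ and $k$. The variance-covariance matrix is then $\mathbf{\Sigma} = \frac{\mu_j(1-\mu_j)}{2N_e} \mathbf{T}$.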
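
A short sketch of how $\mathbf{T}$ and $\mathbf{\Sigma}$ could be built numerically. The function name `wf_covariance` and the example values are assumptions for illustration.

```python
import numpy as np

def wf_covariance(mu, n_e, times):
    """Covariance of allele frequencies at the sampling times (linearized)."""
    times = np.asarray(times, dtype=float)
    # T[i, k] = min(t_i, t_k) for every pair of individuals.
    T = np.minimum.outer(times, times)
    return mu * (1 - mu) / (2 * n_e) * T

Sigma = wf_covariance(0.5, 50, [1, 2, 4])
```

The diagonal of `Sigma` holds the linearized variances, and each off-diagonal entry equals the variance at the earlier of the two sampling times.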

## Normal approximation

Here we consider an approximation to the above generative model in which we use normal approximations for each of the conditional distributions except the data emission layer. Note that we switch to continuous time, as we use a Brownian motion approximation to the discrete-time Wright-Fisher Markov chain.

$$
\begin{aligned}
\mathbf{f}_j(\mathbf{t}) | \mu_j, N_e &\sim \mathcal{N}\Big(\mu_j, \frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T}\Big) \\
\mathbf{g}_j | \mathbf{f}_j(\mathbf{t}) &\sim \mathcal{N}\Big(2\mathbf{f}_j(\mathbf{t}), diag\big\{2\mathbf{f}_j(\mathbf{t}) \odot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big) \\
y_{ij} | g_{ij} &\sim Binomial\Big(c_{ij}, \frac{[g_{ij}]}{2}\Big)
\end{aligned}
$$
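
The approximate model can be sampled layer by layer, as in the sketch below. Two choices are our own assumptions rather than part of the model as written: we clip $\mathbf{f}_j(\mathbf{t})$ to $[0, 1]$ so the genotype variance stays non-negative, and we clip the rounded genotype to $\{0, 1, 2\}$. The function name `sample_normal_approx` is hypothetical.

```python
import numpy as np

def sample_normal_approx(mu, n_e, times, coverage, rng):
    """Draw one sample from the normal-approximation model (hypothetical helper)."""
    times = np.asarray(times, dtype=float)
    n = len(times)
    # Brownian-motion layer: f ~ N(mu * 1, mu(1-mu)/(2 N_e) * T).
    cov = mu * (1 - mu) / (2 * n_e) * np.minimum.outer(times, times)
    f = rng.multivariate_normal(np.full(n, mu), cov)
    # Assumption: keep frequencies in [0, 1] so 2 f (1 - f) is a valid variance.
    f = np.clip(f, 0.0, 1.0)
    # Genotype layer: independent normals with mean 2f and variance 2f(1-f).
    g = rng.normal(2 * f, np.sqrt(2 * f * (1 - f)))
    # Round to the nearest integer and (assumption) clip to {0, 1, 2}.
    g_rounded = np.clip(np.rint(g), 0, 2).astype(int)
    # Emission layer: binomial reads with success probability [g] / 2.
    y = rng.binomial(coverage, g_rounded / 2)
    return f, g, g_rounded, y

rng = np.random.default_rng(1)
f, g, g_rounded, y = sample_normal_approx(0.4, 500, [1, 5, 20], [8, 8, 8], rng)
```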

One might ask how we can use $g_{ij}$ as a frequency if we assume it is normally distributed. For computational convenience, we make the simplifying assumption of rounding $g_{ij}$, written $[g_{ij}]$, to a value in $\{0,1,2\}$. First, let's find the marginal distribution of $\mathbf{g}_j$:

$$
\begin{aligned}
\mathbb{E}(\mathbf{g}_j) &= \mathbb{E}\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= \mathbb{E}\big(2\mathbf{f}_j(\mathbf{t})\big) \\
&= 2\mu_j\mathbf{1} \\
Var(\mathbf{g}_j) &= Var\Big(\mathbb{E}\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) + \mathbb{E}\Big(Var\big(\mathbf{g}_j | \mathbf{f}_j(\mathbf{t})\big)\Big) \\
&= 4Var\big(\mathbf{f}_j(\mathbf{t})\big) + \mathbb{E}\Big(diag\big\{2\mathbf{f}_j(\mathbf{t}) \odot \big(\mathbf{1}-\mathbf{f}_j(\mathbf{t})\big)\big\}\Big)\\ 
&= 4\frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T} + diag\Big\{2\mu_j(1-\mu_j)\Big(\mathbf{1}-\frac{\mathbf{t}}{2N_e}\Big)\Big\} \\
&= 4\frac{\mu_j(1-\mu_j)}{2N_e}\mathbf{T} + 2\mu_j(1-\mu_j)\mathbf{I} - \frac{2\mu_j(1-\mu_j)}{2N_e}diag(\mathbf{T}) \\
&= \frac{2\mu_j(1-\mu_j)}{2N_e}\Big(2\mathbf{T} - diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I} \\
&= \frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I} \\
\end{aligned}
$$
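
The expected conditional variance in the derivation above relies on the second moment of the allele frequency. Spelled out for a single individual sampled at time $t_i$, and using the linearized variance $Var(f_{ij}) \approx \frac{\mu_j(1-\mu_j)}{2N_e}t_i$, the step is:

$$
\begin{aligned}
\mathbb{E}\big(2f_{ij}(1-f_{ij})\big) &= 2\Big(\mathbb{E}(f_{ij}) - \mathbb{E}(f_{ij}^2)\Big) \\
&= 2\Big(\mu_j - Var(f_{ij}) - \mu_j^2\Big) \\
&= 2\mu_j(1-\mu_j) - 2\frac{\mu_j(1-\mu_j)}{2N_e}t_i \\
&= 2\mu_j(1-\mu_j)\Big(1 - \frac{t_i}{2N_e}\Big)
\end{aligned}
$$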

Approximating this marginal by a normal distribution with the mean and covariance above, we have

$$
\mathbf{g}_j | \mu_j, N_e \sim \mathcal{N}\Bigg(2\mu_j\mathbf{1}, \frac{\mu_j(1-\mu_j)}{2N_e} \Big(4\mathbf{T} - 2diag(\mathbf{T})\Big) + 2\mu_j(1-\mu_j)\mathbf{I}\Bigg)
$$
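
As a sanity check, the marginal covariance of $\mathbf{g}_j$ can be assembled numerically and its entries inspected. This is only a sketch; the function name `genotype_marginal_cov` and the parameter values are assumptions.

```python
import numpy as np

def genotype_marginal_cov(mu, n_e, times):
    """Marginal covariance of g_j: scale * (4T - 2 diag(T)) + 2 mu (1 - mu) I."""
    times = np.asarray(times, dtype=float)
    T = np.minimum.outer(times, times)
    scale = mu * (1 - mu) / (2 * n_e)
    drift_part = scale * (4 * T - 2 * np.diag(np.diag(T)))
    sampling_part = 2 * mu * (1 - mu) * np.eye(len(times))
    return drift_part + sampling_part

mu, n_e, times = 0.3, 200, np.array([2.0, 10.0, 30.0])
cov_g = genotype_marginal_cov(mu, n_e, times)
# Each diagonal entry reduces to 2 mu (1 - mu) (t_i / (2 N_e) + 1).
per_individual_var = 2 * mu * (1 - mu) * (times / (2 * n_e) + 1)
print(np.allclose(np.diag(cov_g), per_individual_var))
```

The off-diagonal entries come only from the shared allele frequency trajectory, since genotypes are sampled independently given the frequencies.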

Next we write the marginal distribution of each individual's genotype:

$$
g_{ij} \sim \mathcal{N}\Big(2\mu_j, 2\mu_j(1-\mu_j) \big(\frac{t_i}{2 N_e} + 1\big)\Big)
$$
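
This variance is the $i$-th diagonal entry of the marginal covariance matrix of $\mathbf{g}_j$, where $\mathbf{T}_{ii} = \min(t_i, t_i) = t_i$:

$$
\begin{aligned}
Var(g_{ij}) &= \frac{\mu_j(1-\mu_j)}{2N_e}\big(4t_i - 2t_i\big) + 2\mu_j(1-\mu_j) \\
&= 2\mu_j(1-\mu_j)\frac{t_i}{2N_e} + 2\mu_j(1-\mu_j) \\
&= 2\mu_j(1-\mu_j)\Big(\frac{t_i}{2N_e} + 1\Big)
\end{aligned}
$$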