# Motivation

**authors:** Joseph Marcus, Hussein Al-Asadi

We're interested in addressing population continuity through time in population genetic / ancient DNA data. A great starting place to tackle this problem is to visualize the data! To this end we'd like to visualize the variogram for our data. Ideally, for each individual we would observe the age of the sample $t_i$ and their genotype at a single position in the genome $g_i$. The variogram is a plot of the expected genetic distance versus time lag for a pair of samples 

$$ 
E\big((g_i - g_j)^2\big) \text{   vs.   } |t_i - t_j|
$$

$$
g_i \in \{0,1,2\}
$$

To be clear, we assume there are two alleles ($A$, $a$) at this position and $g_i$ denotes the count of the $A$ allele in sample $i$.
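If genotypes were observed directly, the variogram could be estimated by binning pairwise squared differences by time lag. The following is a minimal sketch of that idea, not the authors' implementation; the function name `empirical_variogram` and the equal-width binning are assumptions made for illustration.

```python
import numpy as np

def empirical_variogram(g, t, n_bins=10):
    """Average (g_i - g_j)^2 within equal-width bins of the time lag |t_i - t_j|.

    Hypothetical helper: the binning scheme is an assumption, not from the text.
    """
    g = np.asarray(g, dtype=float)
    t = np.asarray(t, dtype=float)
    i, j = np.triu_indices(len(g), k=1)        # all unordered pairs i < j
    lag = np.abs(t[i] - t[j])
    sq_diff = (g[i] - g[j]) ** 2
    edges = np.linspace(0.0, lag.max(), n_bins + 1)
    # Assign each pair to a bin; clip so the largest lag lands in the last bin.
    which = np.clip(np.digitize(lag, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_bins, np.nan)            # NaN marks empty bins
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            means[b] = sq_diff[mask].mean()
    return centers, means
```

Plotting `means` against `centers` gives the empirical variogram; empty bins are left as `NaN` so they drop out of the plot.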

However, in our data we don't observe genotypes; we observe sequence reads. To start, here we show an overly simple model to generate data

$$
\begin{aligned}
p(g_i = k) &= \frac{1}{3}, \quad k \in \{0, 1, 2\} \\
y_i | g_i &\sim Binomial\big(c_{i}, \frac{g_i}{2}\big)
\end{aligned}
$$

Here $y_i$ denotes the count of reads carrying the $A$ allele and $c_i$ is the observed total number of reads. Typically $c_i$ is 0, i.e. there is a lot of missing data. Now we can get some traction in visualizing the variogram by conditioning on the data we observe

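$$
E\big((g_i - g_j)^2 | y_i, y_j\big) = \sum_{g_i \in \{0, 1, 2\}} \sum_{g_j \in \{0, 1, 2\}} (g_i - g_j)^2 p(g_i, g_j | y_i, y_j) = \sum_{g_i \in \{0, 1, 2\}} \sum_{g_j \in \{0, 1, 2\}} (g_i - g_j)^2 p(g_i | y_i) p(g_j | y_j)
$$

where $p(g_i | y_i) \propto p(y_i | g_i) p(g_i)$, and the posterior factorizes because the samples are independent under this simple model.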
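As a rough illustration of the simple model and the conditional expectation above, here is a minimal NumPy sketch. The function names (`simulate_reads`, `genotype_posterior`, `expected_sq_diff`) are hypothetical, and the posterior is taken to be $p(g_i | y_i) \propto p(y_i | g_i) p(g_i)$ with the uniform prior over $\{0, 1, 2\}$.

```python
import numpy as np
from math import comb

GENOTYPES = np.array([0, 1, 2])

def simulate_reads(c, rng):
    """Draw genotypes uniformly on {0, 1, 2}, then read counts y_i ~ Binomial(c_i, g_i / 2)."""
    c = np.asarray(c)
    g = rng.integers(0, 3, size=len(c))
    y = rng.binomial(c, g / 2.0)
    return g, y

def genotype_posterior(y, c):
    """p(g | y) for a single sample under the uniform prior."""
    p = GENOTYPES / 2.0
    # Binomial likelihood p(y | g); NumPy evaluates 0.0 ** 0 as 1, which
    # handles the boundary genotypes g = 0 and g = 2 as well as c = 0.
    lik = comb(int(c), int(y)) * p ** y * (1.0 - p) ** (c - y)
    post = lik / 3.0
    return post / post.sum()

def expected_sq_diff(y_i, c_i, y_j, c_j):
    """E[(g_i - g_j)^2 | y_i, y_j] as a double sum over the two posteriors."""
    p_i = genotype_posterior(y_i, c_i)
    p_j = genotype_posterior(y_j, c_j)
    diff2 = (GENOTYPES[:, None] - GENOTYPES[None, :]) ** 2
    return float(p_i @ diff2 @ p_j)
```

When both samples have no reads ($c_i = c_j = 0$), the posteriors equal the prior and the expectation reduces to the prior value of $4/3$, so pairs with missing data carry no information about the time lag.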

**show plots!**

Of course, we'd like to use our biological knowledge of the generative process, particularly the fact that two individuals sampled close in time should have more similar genotypes than two people sampled far apart in time. Now, let $\mathbf{f}(\mathbf{t})$ be the frequency of the $A$ allele for the $n$ samples observed at different time points. 

A standard model for the allele frequencies is 

$$
\mathbf{f}(\mathbf{t}) | \mu, \theta \sim \mathcal{N}\big(\mathbf{\mu}, \mathbf{\Sigma}(\mathbf{t}; \theta)\big)
$$

Given these frequencies we can make some simplifying approximations to estimate the variogram

$$
\begin{aligned}
\mathbf{g} | \mathbf{f}(\mathbf{t}) &\sim \mathcal{N}\Bigg(2\mathbf{f}(\mathbf{t}), diag\Big(2\mathbf{f}(\mathbf{t}) \odot \big(\mathbf{1}-\mathbf{f}(\mathbf{t})\big)\Big)\Bigg) \\ 
y_i | \tilde{g}_i &\sim Binomial\Big(c_i, \frac{\tilde{g}_i}{2}\Big)
\end{aligned}
$$

where $\tilde{g}_i \in \{0, 1, 2\}$ denotes the rounded genotype and $\odot$ represents element-wise multiplication. *See Wen and Stephens 2010 for inspiration on the normal approximation to the genotypes and rounding trick.* Recall that we are not interested in estimating $\mathbf{f}(\mathbf{t})$, so we integrate it out of the model

$$
\mathbf{g} | \mu, \theta \sim \mathcal{N}\Big(2\mu, \Phi(\mathbf{t}; \theta)\Big)
$$

where $\Phi(\mathbf{t}; \theta) = Var\Big(\mathbb{E}\big(\mathbf{g} | \mathbf{f}(\mathbf{t})\big)\Big) + \mathbb{E}\Big(Var\big(\mathbf{g} | \mathbf{f}(\mathbf{t})\big)\Big)$. Rewriting our marginal model 

$$
\begin{aligned}
\mathbf{g} | \mu, \theta &\sim \mathcal{N}\Big(2\mu, \Phi(\mathbf{t}; \theta)\Big) \\
y_i | \tilde{g}_i &\sim Binomial\Big(c_i, \frac{\tilde{g}_i}{2}\Big)
\end{aligned}
$$
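For concreteness, the two terms of $\Phi(\mathbf{t}; \theta)$ can be written out under the normal model for $\mathbf{f}(\mathbf{t})$. This worked step is not in the original text; it treats $\mu$ as a vector with entries $\mu_i$, writes $\Sigma_{ij}$ for the entries of $\mathbf{\Sigma}(\mathbf{t}; \theta)$, and uses $\mathbb{E}(f_i^2) = \mu_i^2 + \Sigma_{ii}$.

$$
\begin{aligned}
Var\Big(\mathbb{E}\big(\mathbf{g} | \mathbf{f}(\mathbf{t})\big)\Big) &= Var\big(2\mathbf{f}(\mathbf{t})\big) = 4\mathbf{\Sigma}(\mathbf{t}; \theta) \\
\mathbb{E}\Big(Var\big(g_i | \mathbf{f}(\mathbf{t})\big)\Big) &= 2\,\mathbb{E}\big(f_i (1 - f_i)\big) = 2\mu_i(1 - \mu_i) - 2\Sigma_{ii} \\
\Phi_{ij} &= 4\Sigma_{ij} + \delta_{ij}\big(2\mu_i(1 - \mu_i) - 2\Sigma_{ii}\big)
\end{aligned}
$$

Here $\delta_{ij}$ is 1 when $i = j$ and 0 otherwise, so the diagonal entries are $\Phi_{ii} = 2\Sigma_{ii} + 2\mu_i(1 - \mu_i)$.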

Now we would like to estimate our model parameters $\mu$ and $\theta$ by maximum likelihood. We can maximize the marginal likelihood; however, this requires a sum over the discretized genotypes, which is computationally intractable

$$
\begin{aligned}
p(\mathbf{y} | \mu, \theta) &= \sum_{\tilde{\mathbf{g}}} p(\mathbf{y} | \tilde{\mathbf{g}}, \mu, \theta) p(\tilde{\mathbf{g}} | \mu, \theta) \\
&= \sum_{\tilde{\mathbf{g}}} \Big(\prod_{i=1}^n p(y_i|\tilde{g}_i, \mu, \theta)\Big) p(\tilde{\mathbf{g}} | \mu, \theta) 
\end{aligned}
$$

Note that the sum over the rounded genotypes has $3^n$ terms! Thus it is intractable for any reasonably sized data set. Therefore we need to come up with another strategy to compute or approximate the marginal likelihood. Note that

$$
\sum_{\tilde{\mathbf{g}}} \Big(\prod_{i=1}^n p(y_i|\tilde{g}_i, \mu, \theta)\Big) p(\tilde{\mathbf{g}} | \mu, \theta) = E_{\tilde{\mathbf{g}}}\Big(\prod_{i=1}^n p(y_i|\tilde{g}_i, \mu, \theta)\Big)
$$

Perhaps this motivates a Monte Carlo approach to estimate this expectation.
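One possible version of such an estimator is sketched below: draw $\mathbf{g}$ from the marginal normal, round to $\{0, 1, 2\}$, and average the product of binomial likelihoods over draws. Several pieces are assumptions made only for illustration and are not specified in the text: a scalar mean `mu`, a squared-exponential kernel for $\mathbf{\Sigma}(\mathbf{t}; \theta)$ with $\theta$ = (`variance`, `length_scale`), and clipping of rounded draws that fall outside $[0, 2]$.

```python
import numpy as np
from math import comb

def sq_exp_cov(t, variance, length_scale):
    """Hypothetical choice of Sigma(t; theta): a squared-exponential kernel."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def mc_marginal_likelihood(y, c, t, mu, variance, length_scale,
                           n_draws=5000, seed=0):
    """Monte Carlo estimate of p(y | mu, theta) = E_g~[ prod_i p(y_i | g~_i) ]."""
    y = np.asarray(y)
    c = np.asarray(c)
    t = np.asarray(t, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    sigma = sq_exp_cov(t, variance, length_scale)
    # Phi = Var(E[g | f]) + E[Var(g | f)] for a scalar mean mu (assumption).
    phi = 4.0 * sigma + np.diag(2.0 * mu * (1.0 - mu) - 2.0 * np.diag(sigma))
    g = rng.multivariate_normal(np.full(n, 2.0 * mu), phi, size=n_draws)
    # Rounded genotypes; clipping to [0, 2] is an assumption of this sketch.
    g_tilde = np.clip(np.rint(g), 0, 2)
    p = g_tilde / 2.0
    coef = np.array([comb(int(ci), int(yi)) for ci, yi in zip(c, y)], dtype=float)
    # Binomial likelihood per draw and sample, shape (n_draws, n);
    # NumPy evaluates 0.0 ** 0 as 1, so p = 0 and p = 1 need no special case.
    lik = coef * p ** y * (1.0 - p) ** (c - y)
    return float(lik.prod(axis=1).mean())
```

For large $n$ the per-draw product can underflow, and many draws may have likelihood exactly zero (for example, a draw with $\tilde{g}_i = 0$ when $y_i > 0$), so a practical version would likely work in log space and may need a better proposal than the prior.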