## Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    %pip install -q -r jlite-requirements.txt
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [None]:
# Load questions etc for this workbook
import ARG_workshop
workbook = ARG_workshop.Workbook1B()
display(workbook.setup)

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to perform different behaviour or additional analysis. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 1-B: Coalescent theory

We will use the simulator we wrote in the previous workbook to illustrate some fundamental principles behind non-recombining genetic genealogies. This mostly comes under the topic of coalescent theory (for a detailed introduction see e.g. [chapter 3](https://people.eecs.berkeley.edu/~jordan/sail/readings/wakeley-chapter3.pdf) of Wakeley's book of the same name).

First, we'll simulate forward in time, to create a tree sequence representing a single coalescent tree:

In [None]:
from IPython.display import SVG
N = 10  # diploid population size
ts = ARG_workshop.FwdWrightFisherSimulator(population_size=N).run(gens=50)
SVG(ts.draw_svg(size=(500, 250), y_axis=True))

## Trees in a tree sequence

Normally, tree sequences encode multiple correlated trees along the genome. Here, all the base pairs from 0..1000 share the same ancestry, so there's only one tree. This tree can be extracted using `ts.first()`:

<dl class="exercise"><dt>Exercise 1</dt>
    <dd>Use <code>ts.num_trees</code> to check that there's only one tree in this tree sequence, then show a tabular summary using <code>display(ts.first())</code></dd></dl>

In [None]:
# Use this cell to print out the number of trees and a tabular summary of the first tree


In [None]:
workbook.question("tree")

Like a tree sequence, a _tskit_ `tree` has lots of useful methods. Given a set of node IDs, the `mrca()` and `tmrca()` methods return the ID of the most recent common ancestor (MRCA) node and the time of that node ($T_{\mathrm{MRCA}}$) respectively. For instance, let's look at the MRCA of nodes 0 and 1:

In [None]:
tree = ts.first()
mrca_id = tree.mrca(0, 1)
mrca_time = tree.tmrca(0, 1)
print(
    f"Node {mrca_id}, the most recent common ancestor of nodes 0 and 1",
    f"lived {mrca_time} {ts.time_units} ago",
)

# Highlight it!
SVG(ts.draw_svg(size=(500, 250), y_axis=True, style=f".n{mrca_id}>.sym {{fill: red; r: 8px}}"))

Note, however, if we remove all the ancestry older than a given time (known as `decapitating` a tree sequence!), we might not have an MRCA between two nodes, in which case the `mrca` function will return a null value (_tskit_ uses -1 as the null ID). The resulting tree sequence will have [multiple roots](https://tskit.dev/tskit/docs/stable/data-model.html#roots), visualised as below:

In [None]:
ts_recent = ts.decapitate(time=10.0)  # delete ancestry older than 10 generations ago
print(
    "The MRCA ID between node 0 and 1 in the decapitated tree sequence is",
    ts_recent.first().mrca(0, 1)
)
SVG(ts_recent.draw_svg(size=(500, 250), y_axis=True))

In [None]:
# The .tmrca() function will fail with a helpful message in this case
print(ts_recent.first().tmrca(0, 1))

## $T_\mathrm{MRCA}$ distributions

The expected time to the most recent common ancestor for a randomly chosen pair of genomes is a classic result from coalescent theory. In a Wright-Fisher model, if the population is not tiny, the pairwise $T_\mathrm{MRCA}$ is well approximated by the negative exponential distribution with mean and standard deviation equal to the number of genomes in the population (i.e. twice the diploid population size, $2N$).


$$\frac{1}{2N}e^{-\frac{T_\mathrm{MRCA}}{2N}}$$
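
To see where the exponential comes from, here is a minimal sketch (not part of the workshop code). It assumes that each generation, two lineages pick the same parent genome with probability $1/2N$, so the pairwise time is geometric and the exponential is its continuous approximation. The names `geometric_tmrca` and `geom_times` are made up for this illustration.

```python
import math
import random

diploid_N = 10                  # diploid population size, matching N in the simulation above
p_coal = 1 / (2 * diploid_N)    # assumed per-generation chance that two lineages share a parent

def geometric_tmrca(rng):
    """Count generations back until two lineages first pick the same parent."""
    t = 1
    while rng.random() >= p_coal:
        t += 1
    return t

rng = random.Random(42)
geom_times = [geometric_tmrca(rng) for _ in range(20_000)]
geom_mean = sum(geom_times) / len(geom_times)
geom_sd = math.sqrt(sum((t - geom_mean) ** 2 for t in geom_times) / len(geom_times))
print(f"mean = {geom_mean:.1f}, sd = {geom_sd:.1f}, 2N = {2 * diploid_N}")
```

Both the mean and the standard deviation should land close to $2N$, with the standard deviation very slightly smaller because time is counted in whole generations.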


To see how well this theoretical approximation holds, one (inefficient) way to sample from the true MRCA distribution is to loop over the forward-time simulator repeatedly with different seeds.

Since this can take a few seconds, we'll wrap the loop using the [tqdm library](https://tqdm.github.io/), so that we get a progress bar:

In [None]:
import numpy as np
from tqdm.auto import tqdm

max_gens = 175  # need to run long enough to ensure there *is* an MRCA
fwdsim_tMRCAs = []
for seed in tqdm(np.arange(1, 300)):
  ts = ARG_workshop.FwdWrightFisherSimulator(N, random_seed=seed).run(gens=max_gens)
  # (arbitrarily) pick the MRCA between the first 2 nodes
  first_two_samples = ts.samples()[0:2]
  fwdsim_tMRCAs.append(ts.first().tmrca(*first_two_samples))

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Use <code>np.mean()</code> and <code>np.std()</code> to print out the mean and standard deviation of the <code>fwdsim_tMRCAs</code> array</dd></dl>

In [None]:
# Exercise 2: print the mean and standard deviation of the fwdsim_tMRCAs array


In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("expected_tmrca")

Instead of just using the mean, we can compare the _entire distribution_ to the predicted negative exponential. Below we do this as a histogram (left), and using the cumulative distribution function (right). 

<div class="alert alert-block alert-info"><b>Note:</b> As the standard deviation in $T_{MRCA}$ is proportional to the mean, coalescent times are normally best plotted on a log scale, as on the right below. This stabilises the variance, and means that older times are not given undue weight.</div>

In [None]:
from matplotlib import pyplot as plt
import scipy
T = np.arange(1, max_gens)
fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
ax_pdf.hist(fwdsim_tMRCAs, density=True, label="WF simulation", color="lightblue")
y = 1/(2 * N) * np.exp(-T/(2 * N))
formula = r"$\frac{1}{2N} e^{−T_{MRCA}/2N}$"
ax_pdf.plot(T, y, c="tab:orange", label=formula)
ax_pdf.axvline(np.mean(fwdsim_tMRCAs), ls=":", c="tab:blue")
ax_pdf.axvline(2 * N, ls=":", c="tab:orange")
ax_pdf.text(2 * N, y.max(), "Means", ha="right", va="top", rotation=90)
ax_pdf.set_xlabel(r"$T_{MRCA}$ (generations)")
ax_pdf.legend(fontsize=14)
ax_pdf.set_ylabel(r"Coalescent density")

# Empirical CDF
ax_cdf.plot(np.sort(fwdsim_tMRCAs), np.linspace(0, 1, len(fwdsim_tMRCAs), endpoint=False))
formula = r"$1 - e^{−T_{MRCA}/2N}$"
ax_cdf.plot(T, 1 - np.exp(-T/(2 * N)), c="tab:orange", label=formula)
ax_cdf.set_xlabel(r"$T_{MRCA}$ (generations)")
ax_cdf.legend(fontsize=14)
ax_cdf.set_ylabel(r"Probability")
ax_cdf.set_xscale("log")

plt.suptitle(f"Time to MRCA between two samples in a population of size $N={N}$");


The fit is impressive, even though this is a very small population, where we would expect the approximation to be potentially poor. It's only the first 5 generations that are a slightly worse fit, as a result of simulating using discrete generations. As we'll see later, in general, results are surprisingly robust to small deviations from the basic assumptions of coalescent theory.

## The incredible depth of genetic ancestry

It is often estimated that the average $T_{MRCA}$ between a random pair of human genomes (say from your mother vs your father) roughly matches that found in a constant-sized Wright-Fisher population of size 15 000 diploids.

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("Human_Ne_TRMCA")

### A popgen taster

To many people, these times might seem surprisingly old. For example, they predate the appearance of modern humans in the fossil record, around 300 000 years ago. We can perform a sanity check using real data, but instead of using node dates directly - which requires us to infer branch lengths - we'll use a proxy measure: the average genetic difference per base pair between all sample pairs (the "genetic diversity", $\pi$). If we know the mutation rate $\mu$, we can simply divide the genetic diversity by $\mu$ to estimate the branch length distance between the two samples, then divide that by 2 to get an estimate of the $T_{MRCA}$. A reasonable estimate of the nucleotide mutation rate in humans is $\mu=1.29\times10^{-8}$ per base pair per generation.
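
The arithmetic is short enough to sketch with a made-up diversity value. The value of `pi_example` below is purely hypothetical (it is not the answer for the real data), and the 27-year generation time is the same assumption used in the exercise that follows.

```python
mu = 1.29e-8          # mutation rate per base pair per generation (from the text)
pi_example = 5.0e-4   # HYPOTHETICAL genetic diversity, for illustration only
years_per_gen = 27    # assumed generation time in years

# Diversity / mu estimates the total branch length separating two genomes
branch_distance = pi_example / mu
# The MRCA sits halfway along that path: half the branch length on each side
tmrca_generations = branch_distance / 2
tmrca_years = tmrca_generations * years_per_gen
print(f"{tmrca_generations:.0f} generations, or roughly {tmrca_years:,.0f} years")
```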

To illustrate the speed of tree sequence calculations, we'll load up a tree-sequence-encoded copy of the well-known thousand genomes project (tgp). For speed and simplicity, we'll use a tree sequence that only contains 5 megabases of chromosome 20, but it's easy to do this for larger examples.

In [None]:
# Download the compressed tree sequence for 5-10 MB of human chromosome 20 arm
import tszip  # Tree sequences can be further compressed for efficient storage using "tszip"
try:
    tgp_ts = tszip.decompress("data/1kgp_chr20_5-10MB.tsz")
except FileNotFoundError:
    import urllib.request
    url = "https://raw.githubusercontent.com/ebp-nor/GARG/main/jlite/content/data/1kgp_chr20_5-10MB.tsz"
    print(f"Downloading '{url}'")
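    # Save under an explicit local filename, then load into the same variable as above
    urllib.request.urlretrieve(url, "1kgp_chr20_5-10MB.tsz")
    tgp_ts = tszip.decompress("1kgp_chr20_5-10MB.tsz")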

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Use <code>tgp_ts.num_samples</code> and <code>tgp_ts.num_individuals</code> to check on the number of genomes and number of individuals in this dataset. Then use <code>tgp_ts.diversity()</code> to print out the average sitewise difference (i.e. the proportion of sites that differ) between a random pair of genomes (this calculation should be instantaneous). Divide this by twice the mutation rate, then multiply by 27 years per generation to get a rough estimate of the average pairwise $T_\mathrm{MRCA}$ in years</dd></dl>


In [None]:
# Use this cell to print the number of samples and individuals in tgp_ts
# Then calculate the average pairwise genetic diversity, and use it to find the average TMRCA in years,
# assuming a mutation rate of 1.29e-8 per base pair per generation, and a generation time of 27 years


In [None]:
workbook.question("Human_mu_TRMCA")

Note that in this example, we are not making multiple independent estimates of the $T_{MRCA}$. Instead, we are going along the genome and taking ***

## Advantages of backward simulation

The forward-time Wright-Fisher simulator we built is flexible, but slow. For the rest of this workbook, we'll investigate theoretical predictions by simulating backward in time instead. In backward-time simulations, we define an initial set of sample genomes and randomly choose parents for those genomes, then parents for the parents, and so on. As well as avoiding the need to `simplify()` the genealogy, we can:

* avoid spending time simulating extinct lineages,

* efficiently simulate a small set of samples taken from a much larger population, and

* stop simulating as soon as the lineages from all the samples merge into a single MRCA
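
The idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the simulator used in this workbook: it assumes a haploid Wright-Fisher population of `num_genomes` genomes, tracks only the lineages ancestral to the samples, and stops at the MRCA. The function name `backward_wf_tmrca` is made up for this example.

```python
import random

def backward_wf_tmrca(num_samples, num_genomes, rng):
    """Generations until all sampled lineages merge into one (toy haploid Wright-Fisher)."""
    lineages = set(range(num_samples))  # ancestors that still carry sampled material
    generations = 0
    while len(lineages) > 1:
        # Each lineage picks a parent uniformly at random from the previous generation;
        # lineages choosing the same parent merge, because the set keeps one copy
        lineages = {rng.randrange(num_genomes) for _ in lineages}
        generations += 1
    return generations  # we stop as soon as a single MRCA remains

rng = random.Random(1)
# 5 sampled genomes from a population of 200 genomes, repeated 500 times
toy_tmrcas = [backward_wf_tmrca(5, 200, rng) for _ in range(500)]
toy_mean = sum(toy_tmrcas) / len(toy_tmrcas)
print(f"Mean T_MRCA of 5 samples: {toy_mean:.0f} generations")
```

Note that the work done per generation depends on the number of remaining lineages, not on the population size: extinct lineages are never visited.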

The major _downside_ is that genomes in the simulation must be [exchangeable](https://en.wikipedia.org/wiki/Exchangeable_random_variables). Although this can include subdivided populations and cases where population size changes over time, backward simulation cannot deal with populations distributed in continuous space, or most forms of natural selection (although there are some workarounds for simple selective sweeps).

_Tskit_ also makes it possible to combine both backward and forward time simulation in a process known as [recapitation](https://tskit.dev/pyslim/docs/latest/tutorial.html#sec-tutorial-recapitation). We will encounter this in later workbooks.

### Msprime: a backward-time simulator

[_Msprime_](https://tskit.dev/msprime/docs/stable/intro.html) is a widely-used and highly efficient backward-time simulator that outputs tree sequences. Its [discrete-time-Wright-Fisher](https://tskit.dev/msprime/docs/stable/ancestry.html#using-the-dtwf-model) (`dtwf`) capability will simulate in backward time the same process we simulated in our forward-time model, but will allow much larger population sizes:

In [None]:
import msprime
Ne = 1_000   # Much faster to simulate a small sample taken from a larger population
ts = msprime.sim_ancestry(samples=10, population_size=Ne, model="dtwf", random_seed=123)
SVG(ts.draw_svg(size=(500, 250), y_axis=True))

#### Continuous-time models: a major speedup

However, the "`dtwf`" model is still fairly slow because it has to go generation-by-generation, backwards in time. If we assume that only one coalescence occurs in any generation, we can simply sample a time back to the next coalescent event from the appropriate distribution, without having to step one generation at a time. The faster method is the default used by msprime, and it means that the size of the population no longer has an effect on the efficiency of the simulation. We can easily simulate a small sample from a population of a billion individuals: all that happens is that the time scale will change (of course, given a generation time of a few years, a constant population of a billion is not a biologically plausible assumption: the MRCAs would be older than the origin of life on Earth).

Note that the continuous-time approach means that we never get "polytomies" (i.e. multifurcations). Therefore the trees produced by the default msprime model are always strictly bifurcating.
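
As a rough sketch of the idea (an illustration only, not how msprime is actually implemented), we can jump straight from one coalescence to the next. With $k$ lineages there are $k(k-1)/2$ pairs, each assumed to coalesce at rate $1/2N$ per generation, so the waiting time to the next event is exponential with rate $\frac{k(k-1)}{2} \times \frac{1}{2N}$. The function name `coalescent_tmrca` is made up for this example.

```python
import random

def coalescent_tmrca(num_samples, num_genomes, rng):
    """Draw a T_MRCA by jumping between coalescent events (one merger per event)."""
    t = 0.0
    k = num_samples
    while k > 1:
        num_pairs = k * (k - 1) / 2
        # Exponential waiting time until any one of the pairs coalesces
        t += rng.expovariate(num_pairs / num_genomes)
        k -= 1  # exactly two lineages merge, so trees are strictly bifurcating
    return t

rng = random.Random(123)
billion_genomes = 2 * 1_000_000_000  # a (biologically implausible) billion diploids
draws = [coalescent_tmrca(10, billion_genomes, rng) for _ in range(5_000)]
mean_draw = sum(draws) / len(draws)
print(f"Mean T_MRCA of 10 genomes: {mean_draw:.3g} generations")
```

Each replicate needs only 9 random draws for 10 genomes, whatever the population size: changing `billion_genomes` only rescales the times.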

### Variation in coalescent tree depths and topologies

To demonstrate the efficiency of this approach, we'll sample 20 individuals from a population of ten thousand, until all the samples have coalesced into a single root, which should take some thousands of generations. We'll do it 30 times, to get a feel for the wide variation in coalescence times. _Msprime_ will run this in a millisecond or two.

<div class="alert alert-block alert-info"><b>Note:</b> If the <code>num_replicates</code> parameter is given to <code>sim_ancestry()</code>, the function will return a (lazy) iterator over the returned tree sequences. To turn this into a fixed list of tree sequences, we apply Python's built-in <code>list()</code> function.</div>

In [None]:
# To get a feel of the variation in coalescent tree time & topologies

import msprime
from IPython.display import HTML  # we can draw multiple tree sequences using HTML
diploid_pop_size = 10_000
tree_seqs = list(
    msprime.sim_ancestry(
        20, population_size=diploid_pop_size, sequence_length=1000, num_replicates=30, random_seed=123,
    )
)

root_times = [ts.max_time for ts in tree_seqs]
ticks = np.linspace(0, 10 ** (np.ceil(np.log10(max(root_times)))), num=6)
ticks = {t: f"{t/1_000:.0f}k" for t in ticks}

HTML("".join(
    ts.draw_svg(
        max_time=max(root_times),
        size=(200, 350),
        node_labels={},
        symbol_size=1,
        y_axis=True,
        y_ticks=ticks,
        root_svg_attributes={"style":"display: inline-block"},
    )
    for ts in tree_seqs
))

<dl class="exercise"><dt>Exercise 4</dt>
<dd>As mentioned above, it can be helpful to plot coalescence times on a log scale. Add <code>time_scale="log_time"</code> to the <code>draw_svg()</code> parameters above to see what this looks like. Then use <code>plt.hist()</code> below to plot the distribution of <code>root_times</code>.</dd></dl>

In [None]:
# Use this cell to plot a histogram of root_times

In [None]:
workbook.question("mean_tmrca")

## Information from additional samples

Increasing the number of samples from 2 to a large number is expected to at most double the time to oldest coalescence in the tree. Moreover, as the number of tips increases, there is a rapid decrease in the average time at which each sample coalesces into the rest of the tree. Specifically, the <em>n</em>th sample will coalesce with the rest of the tree at a time approximately $1/n$ of the mean pairwise $T_{MRCA}$ (as demonstrated below).
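
The "at most double" claim follows from summing the expected waiting times while there are $k = n, n-1, \ldots, 2$ lineages. Under the standard coalescent each is $\frac{2}{k(k-1)}$ in units of the mean pairwise $T_{MRCA}$, and the sum is $2(1 - 1/n)$. A small check of that sum (the function name `expected_tree_height` is made up for this example):

```python
def expected_tree_height(n):
    """Expected T_MRCA of n genomes, in units of the mean pairwise T_MRCA."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

for n in (2, 5, 10, 100, 1000):
    print(f"n = {n:>4}: {expected_tree_height(n):.3f} x pairwise T_MRCA")
```

The ratio climbs quickly towards 2 but never reaches it, so adding many more samples barely changes the expected age of the root.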

In [None]:
import msprime
import numpy as np
from matplotlib import pyplot as plt

n_tips = np.arange(2, 100)
coalescence_time_above_sample = {n: [] for n in n_tips}
for seed in range(1, 10):
    for n in n_tips:
        ts = msprime.sim_ancestry(n, population_size=diploid_pop_size, random_seed=seed)
        sample_parent_ids = ts.first().parent_array[ts.samples()]
        sample_parent_times = ts.nodes_time[sample_parent_ids]
        coalescence_time_above_sample[n].append(np.mean(sample_parent_times))
plt.plot(n_tips, [np.mean(v) for v in coalescence_time_above_sample.values()], label="Simulated")
plt.plot(n_tips, 1/n_tips * 2*diploid_pop_size, label=r"$1/n \times T^\mathrm{pairwise}_\mathrm{MRCA}$")
plt.xlabel("Number of sample tips ($n$)")
plt.ylabel("Av. coalescence time above a tip (generations)")
plt.yscale("log")
plt.legend();

In [None]:
workbook.question("additional_samples")

## Coalescence times

A simple visualisation is to plot the tskit edges by the time of their parent. As long as the tree sequence is fully simplified, the parent times should reflect the density of coalescences through time. However, as we saw above, the distribution of such times will change as the number of samples increases; in particular, the mean of the times will decrease.

In [None]:
ARG_workshop.edge_plot(ts, alpha=0.2, xaxis=False, plot_hist=True)

### Pairwise times

Rather than treat each coalescent point with equal weight, as we have done above, it sometimes makes sense to weight by the number of *pairs* of samples that coalesce at each point. Not only does this result in a distribution that does not depend on the sample size, but it gives an estimate of the average *pairwise* $T_{MRCA}$ which, as we previously saw, is directly proportional to the population size.

We have already encountered a way to measure pairwise distances, using `ts.diversity()`. We applied this function to the thousand-genomes data to calculate *site-based* distances (i.e. the pairwise density of variable sites in a tree sequence). We can use the same function to calculate *branch-based* distances between every pair of samples, by specifying `mode="branch"`. In fact, each genetic distance measure has an equivalent branch-based measure, something we will see in a later workbook. For the moment we can simply halve the branch-based diversity to get average pairwise coalescence times, e.g. as follows:

In [None]:
pairwise_coal_times = []
num_replicates = 250
for seed in range(num_replicates):
    ts = msprime.sim_ancestry(50, population_size=diploid_pop_size, random_seed=seed+1)
    # diversity(mode="branch") calculates the mean distance between pairs: 
    pairwise_coal_times.append(ts.diversity(mode="branch")/2)
plt.title(
    f"Mean coalescence times between all pairs of {ts.num_samples} genomes\n"
    f"({num_replicates} replicate simulations, diploid Ne = {diploid_pop_size})"
)
plt.hist(pairwise_coal_times, bins=25)
plt.xlabel("Time (generations)");

Notice that even though we are taking average coalescence times over all pairs of 100 genomes, there is still a lot of variation around the expected value of 20 000. In other words, inference of population history from a single tree is fairly imprecise.

### Coalescence rates

In a fixed-size panmictic population, there is a constant probability that any two lineages will coalesce in a given generation. This means that the average pairwise coalescence times directly reflect the population size, $N$. Of course, populations are unlikely to be entirely unstructured, or remain the same size over time. In this case, coalescence times are sometimes taken as a measure of "effective population size" ($N_e$). However, it is better to think directly in terms of varying probabilities of coalescence. More specifically, pairs of lineages are subject to an "instantaneous coalescence rate" which can vary over time and place. It is only in the trivial case of a fixed-size population that this rate is constant (and equal to the inverse of the number of genomes, $1/2N$).
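
To make the idea of a rate concrete, here is a toy sketch that estimates a piecewise-constant rate from a list of pairwise coalescence times: within each time interval, divide the number of coalescences by the total time that not-yet-coalesced pairs spend there. This is an illustration under simplifying assumptions (the pairs are drawn independently, which is not true of pairs within a real tree) and is not the _tskit_ implementation. The names `piecewise_rates`, `toy_pair_times` and `toy_breaks` are made up.

```python
import random

def piecewise_rates(coal_times, breaks):
    """Estimate a constant coalescence rate within each [breaks[i], breaks[i+1])."""
    estimates = []
    for start, end in zip(breaks[:-1], breaks[1:]):
        events = sum(start <= t < end for t in coal_times)
        # Time spent inside this interval by pairs that had not coalesced before it
        exposure = sum(min(t, end) - start for t in coal_times if t > start)
        estimates.append(events / exposure if exposure > 0 else float("nan"))
    return estimates

toy_N = 10_000  # diploid population size assumed for this toy example
rng = random.Random(7)
# Independent pairwise times from a constant-size population: rate 1/(2N)
toy_pair_times = [rng.expovariate(1 / (2 * toy_N)) for _ in range(50_000)]
toy_breaks = [0, 5_000, 20_000, 80_000]
toy_rates = piecewise_rates(toy_pair_times, toy_breaks)
for (start, end), rate in zip(zip(toy_breaks[:-1], toy_breaks[1:]), toy_rates):
    print(f"{start:>6} to {end:>6}: rate = {rate:.3g} (1/2N = {1 / (2 * toy_N):.3g})")
```

With a constant population size, every interval should give an estimate close to $1/2N$; a changing population size or substructure would show up as different rates in different intervals.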

_Tskit_ can be used to estimate the pairwise coalescent rate from a tree sequence as follows:

In [None]:
def pair_coalescence_rates(ts, time_breaks=None, quantiles=None):
    # NB: in the next tskit release (0.5.9), there will be an API change such that
    # this function will be directly available as `ts.pair_coalescence_rates(time_breaks)`
    d = ts.coalescence_time_distribution(weight_func="pair_coalescence_events")
    if time_breaks is None:
        time_breaks = ts.coalescence_time_distribution().quantile(np.array(quantiles)).flatten()
    return d.coalescence_rate_in_intervals(np.array(time_breaks)), time_breaks

single_timeslice = [0, ts.max_time]
display(HTML("<h3>Mean rate over all time</h3>"))
print(
    "Av. pairwise rate over all coalescences:",
    pair_coalescence_rates(ts, single_timeslice)[0]
)
print(
    "Should be equal to the reciprocal of half the branch length diversity:",
    1/(ts.diversity(mode="branch")/2)
)

More interestingly, it is possible to estimate different rates over time, to examine the variable demographic histories of a sample. This is done by taking timeslices of the Empirical CDF (as we plotted on the right a number of cells previously) and assuming a constant rate within each timeslice. This is unlikely to work well on a single tree, but will come into its own when analysing multiple trees along the genome, and is a similar approach to plots generated by pairwise inference methods such as PSMC:

In [None]:
num_timeslices = 5
multiple_timeslices = np.ceil(np.logspace(1, np.log10(ts.max_time), num_timeslices+1))
multiple_timeslices[0] = 0  # set the first timeslice to start at 0

display(HTML(f"<h3>Piecewise rate over {num_timeslices} log-spaced timeslices</h3>"))
rates, times = pair_coalescence_rates(ts, multiple_timeslices)
rates = rates.flatten()

for i in range(num_timeslices):
    print(
        f"From {times[i]} to {times[i+1]}:",
        f"pairwise coalescent rate={rates[i]:.3g}"
    )

display(SVG(ts.draw_svg(
    y_ticks=times,
    y_gridlines=True,
    size=(1000, 500),
    symbol_size=0,
    time_scale="log_time",
    y_axis=True,
    node_labels={},
    style=".y-axis .tick .grid {stroke: red}",
    x_axis=False
)))

fig, (ax_rate, ax_IICR) = plt.subplots(1, 2, figsize=(15, 5))

ax_rate.stairs(rates, times, baseline=None)
ax_rate.set_xscale("log")
ax_rate.axhline(y=1/(2*diploid_pop_size), c="tab:orange")
ax_rate.set_xlabel("Time (generations ago)")
ax_rate.set_ylabel("Instantaneous Coalescence Rate")
ax_rate.set_xlim(1, None)

ax_IICR.stairs(1/rates, multiple_timeslices, baseline=None)
ax_IICR.set_xscale("log")
ax_IICR.axhline(y=2*diploid_pop_size, c="tab:orange")
ax_IICR.set_xlabel("Time (generations ago)")
ax_IICR.set_ylabel("Inverse Instantaneous Coalescence Rate")
ax_IICR.set_xlim(1, None);
plt.suptitle("Estimated coalescence rates over time");

The right hand plot is simply the reciprocal of that on the left, giving the Inverse Instantaneous (pairwise) Coalescence Rate, or IICR. This can be numerically easier to interpret because in an unstructured population, it is a measure of (haploid) population size. However, changes in the IICR can either reflect changes in population size *or* changes in population substructure. It is therefore less misleading to label this as the IICR, rather than the effective haploid population size.

### Bonus exercise

From the tree plot, you can see that the oldest bin contains very few coalescence points, so is expected to be estimated with greater error. It is possible to choose quantiles of the coalescence times, so that an equal number of coalescent points are taken in each timeslice:

In [None]:
# Exercise: repeat the plots above providing `quantiles=[0, 0.2, 0.4, 0.6, 0.8, 1.0]` rather than the direct
# multiple_timeslices argument to the pair_coalescence_rates() function. Is there less variability in the IICR estimates?
