## Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    %pip install -q -r jlite-requirements.txt
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [None]:
# Load questions etc for this workbook
import ARG_workshop
workbook = ARG_workshop.Workbook1B()
display(workbook.setup)

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to perform different behaviour or additional analysis. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 1-B: Coalescent theory

We will use the simulator we wrote in the previous workbook to illustrate some fundamental principles behind non-recombining genetic genealogies. The main body of theory in this area is the (backward-time) coalescent. For a detailed introduction see e.g. chapter 3 of Wakeley's book "Coalescent Theory".

First, we'll simulate a tree sequence forward in time:

In [None]:
from IPython.display import SVG
ts = ARG_workshop.FwdWrightFisherSimulator(population_size=10).run(gens=50)
SVG(ts.draw_svg(size=(500, 250), y_axis=True))

## Trees in a tree sequence

Normally, tree sequences encode multiple correlated trees along the genome. However, here all the base pairs, from position 0 to 1000, share the same ancestry, so there's only one tree. This tree can be extracted using `ts.first()`:

<dl class="exercise"><dt>Exercise 1</dt>
<dd>Use <code>ts.num_trees</code> to check that there's only one tree in this tree sequence, then show a tabular summary using <code>display(ts.first())</code></dd></dl>

In [None]:
# Print out the number of trees and a tabular summary of the first tree


In [None]:
workbook.question("tree")

Like a tree sequence, a _tskit_ `tree` has lots of useful methods. Given a set of node IDs, the `mrca()` and `tmrca()` methods return the ID of the most recent common ancestor (MRCA) node and the time of that node ($T_{\mathrm{MRCA}}$) respectively. For instance, let's look at the MRCA of nodes 0 and 1:

In [None]:
mrca_id = ts.first().mrca(0, 1)
mrca_time = ts.first().tmrca(0, 1)
print(
    f"Node {mrca_id}, the most recent common ancestor of nodes 0 and 1",
    f"lived {mrca_time} {ts.time_units} ago",
)

# Highlight it!
SVG(ts.draw_svg(size=(500, 250), y_axis=True, style=f".n{mrca_id}>.sym {{fill: red; r: 8px}}"))

Note, however, that if we remove all the ancestry older than a given time (known as "decapitating" a tree sequence!), two nodes might no longer have an MRCA, in which case the `mrca()` method will return a null value (_tskit_ uses -1 as the null ID). The resulting tree sequence will have [multiple roots](https://tskit.dev/tskit/docs/stable/data-model.html#roots), visualised as below:

In [None]:
ts_recent = ts.decapitate(time=10.0)  # delete ancestry older than 10 generations ago
print(
    "The MRCA ID between node 0 and 1 in the decapitated tree sequence is",
    ts_recent.first().mrca(0, 1)
)
SVG(ts_recent.draw_svg(size=(500, 250), y_axis=True))

In [None]:
# The .tmrca() function will fail with a helpful message in this case
print(ts_recent.first().tmrca(0, 1))

## Expected $T_\mathrm{MRCA}$

The distribution of the time to the most recent common ancestor for a randomly chosen pair of genomes is a classic result from coalescent theory. In a Wright-Fisher model, if the population is not tiny, it is well approximated by the negative exponential distribution with mean and standard deviation both equal to $2N_e$ generations (so the variance is $(2N_e)^2$). Its probability density is:


$$\frac{1}{2N_e}e^{-\frac{T_\mathrm{MRCA}}{2N_e}}$$
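As a rough check on where this density comes from, the sketch below (not part of the workshop code) assumes that two lineages pick the same parent genome with probability $1/(2N_e)$ in each generation, so that the exact waiting time is geometric. It then compares this with draws from the exponential approximation. It uses only the Python standard library, and the names `geometric_wait`, `num_genomes` and `p_coalesce` are illustrative.

```python
import random
import statistics

rng = random.Random(42)
num_genomes = 20  # assumption: 2 * N_e genomes for N_e = 10 diploids
p_coalesce = 1 / num_genomes  # chance that two lineages share a parent genome


def geometric_wait(p, rng):
    """Generations until the first success, succeeding with probability p each generation."""
    generations = 1
    while rng.random() >= p:
        generations += 1
    return generations


# Exact discrete-generation waiting times versus the exponential approximation
discrete = [geometric_wait(p_coalesce, rng) for _ in range(50_000)]
continuous = [rng.expovariate(p_coalesce) for _ in range(50_000)]

for label, times in [("geometric", discrete), ("exponential", continuous)]:
    print(f"{label:>11}: mean={statistics.mean(times):.1f}, sd={statistics.stdev(times):.1f}")
```

Both means should sit close to $2N_e$, with the standard deviation of the geometric draws slightly below that of the exponential ones; the gap shrinks as the population grows.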


To see how well this theoretical approximation holds, one (inefficient) way to sample from the true MRCA distribution is to loop over the forward-time simulator repeatedly with different seeds.

Since this can take some time, we'll wrap the loop using the [tqdm library](https://tqdm.github.io/), so that we get a progress bar:

In [None]:
import numpy as np
from tqdm.auto import tqdm

pop_size = 10
max_gens = 160  # need to run long enough to ensure there *is* an MRCA
fwdsim_tMRCAs = []
for seed in tqdm(np.arange(1, 300)):
  ts = ARG_workshop.FwdWrightFisherSimulator(pop_size, random_seed=seed).run(gens=max_gens)
  # (arbitrarily) pick the MRCA between the first 2 nodes
  first_two_samples = ts.samples()[0:2]
  fwdsim_tMRCAs.append(ts.first().tmrca(*first_two_samples))

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Use <code>np.mean</code> to print out the mean of the <code>fwdsim_tMRCAs</code> array</dd></dl>

In [None]:
# Exercise 2: print the mean of fwdsim_tMRCAs


In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("expected_tmrca")

We can plot out the entire distribution, not just the mean:

In [None]:
from matplotlib import pyplot as plt
plt.hist(fwdsim_tMRCAs, bins=25, density=True, label="WF simulation", color="lightblue")
T = np.arange(1, max_gens)
y = 1/(2 * pop_size) * np.exp(-T/(2 * pop_size))
formula = r"$\frac{1}{2N_e} e^{-T_{MRCA}/2N_e}$"  # ASCII minus sign and an explicit fraction
plt.plot(T, y, c="tab:orange", label=formula)
plt.axvline(np.mean(fwdsim_tMRCAs), ls=":", c="tab:blue")
plt.axvline(2 * pop_size, ls=":", c="tab:orange")
plt.text(2 * pop_size, y.max(), "Means", ha="right", va="top", rotation=90)
plt.xlabel(r"TMRCA (generations)")
plt.legend(fontsize=14)
plt.title(f"Time to MRCA between two samples in a population of size $N_e={pop_size}$");

The fit is pretty impressive, even though this is a very small population, where we would expect the approximation to be potentially poor. As we'll see later, results from coalescent theory are surprisingly robust to changes in the details of the underlying model.

## The incredible depth of human ancestry

It is often estimated that the average TMRCA between a random pair of human genomes (say from your mother vs your father) roughly matches that from a constant-sized Wright-Fisher population of size 15 000 diploids.

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("Human_Ne_TRMCA")

### A popgen taster

To a casual observer, these times might seem too old. We can use tskit to perform a quick check on some real data, using the "genetic diversity" (or $\pi$), which measures the average genetic difference per base pair between all sample pairs. The average $T_{\mathrm{MRCA}}$ is simply this divided by twice the mutation rate (the factor of two accounts for mutations on both lineages from the common ancestor). A reasonable nucleotide mutation rate in humans is $\mu=1.29\times10^{-8}$ per base pair per generation.
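The division by $2\mu$ follows from a short calculation, sketched here under the simplifying assumption that every mutation produces a visible difference (no site mutates twice). Two genomes whose common ancestor lived $T_{\mathrm{MRCA}}$ generations ago are separated by $2T_{\mathrm{MRCA}}$ generations of evolution in total, each accumulating $\mu$ mutations per base pair on average, so

$$\mathbb{E}[\pi] = 2\mu\,\mathbb{E}[T_{\mathrm{MRCA}}] \quad\Longrightarrow\quad \mathbb{E}[T_{\mathrm{MRCA}}] \approx \frac{\pi}{2\mu}$$

The result is in generations; multiplying by a generation time converts it to years.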

To illustrate the speed of tree sequence calculations, we'll load up a tree-sequence-encoded copy of the well-known Thousand Genomes Project (TGP). For speed and simplicity, we'll use a tree sequence that only contains 5 megabases of chromosome 20.

In [None]:
# Download the compressed tree sequence for 5-10 MB of human chromosome 20 arm
import tszip  # Tree sequences can be further compressed for efficient storage using "tszip"
try:
    tgp_ts = tszip.decompress("1kgp_chr20_5-10MB.tsz")
except FileNotFoundError:
    import urllib.request
    url = "https://raw.githubusercontent.com/ebp-nor/GARG/main/jlite/content/1kgp_chr20_5-10MB.tsz"
    print(f"Downloading '{url}'")
    urllib.request.urlretrieve(url, "1kgp_chr20_5-10MB.tsz")  # save under the filename opened below
    tgp_ts = tszip.decompress("1kgp_chr20_5-10MB.tsz")

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Use <code>tgp_ts.num_samples</code> and <code>tgp_ts.num_individuals</code> to check on the number of genomes and number of individuals in this dataset. Then use <code>tgp_ts.diversity()</code> to print out the average number of nucleotide differences between a random pair of genomes (this calculation should be instantaneous). Divide this by twice the mutation rate, then multiply by 27 years per generation to get a rough estimate of the average pairwise TMRCA in years</dd></dl>

In [None]:
# Use this cell to print the number of samples and individuals in tgp_ts
# Then calculate the average pairwise genetic diversity, and use it to find the average TMRCA in years,
# assuming a mutation rate of 1.29e-8 per base pair per generation, and a generation time of 27 years


In [None]:
workbook.question("Human_mu_TRMCA")

<div class="alert alert-block alert-info"><b>Note:</b> Here we have assumed that the same coalescent predictions apply to different regions of the genome, each of which may have a slightly different ancestry. 
</div>

## Advantages of backward simulation

Our forward-time Wright-Fisher simulator does not incorporate any natural selection: it is _neutral_. Neutral simulations, or indeed any simulations where the genomes are [exchangeable](), can be implemented much more efficiently in reverse, or backward-time. In this case, we define an initial set of sample genomes and randomly choose parents for those genomes, then parents for the parents, and so on. This means we don't need to `simplify()` the genealogy, and it's far more efficient than forward simulation because:

* we don't spend time simulating extinct lineages,

* we can efficiently simulate a small set of samples taken from a much larger population, and

* we can stop simulating as soon as the lineages from all the samples merge into a single MRCA
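The backward-time idea above can be sketched in a few lines. This is a toy illustration, not the workshop simulator or _msprime_: it assumes that each lineage picks a parent genome uniformly at random, and it only counts the surviving lineages rather than recording the genealogy. The function name `backward_wf_tmrca` is hypothetical.

```python
import random
import statistics


def backward_wf_tmrca(num_samples, num_genomes, rng):
    """Count generations back until all sample lineages share a single ancestor."""
    lineages = num_samples
    generations = 0
    while lineages > 1:
        # Each remaining lineage picks a parent genome at random;
        # lineages that pick the same parent merge (coalesce).
        parents = {rng.randrange(num_genomes) for _ in range(lineages)}
        lineages = len(parents)
        generations += 1
    return generations  # we stop as soon as a single lineage remains


rng = random.Random(1)
tmrcas = [backward_wf_tmrca(num_samples=2, num_genomes=20, rng=rng) for _ in range(5_000)]
print(f"Mean TMRCA for 2 samples among 20 genomes: {statistics.mean(tmrcas):.1f} generations")
```

Each generation costs work proportional to the number of remaining lineages, not to the population size, which is why a small sample from a large population is cheap to simulate this way.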

The major _downside_ is that backward simulation cannot deal with most forms of natural selection, or populations distributed in continuous space (but in later workbooks we will see how _tskit_ can be used to combine both backward and forward time simulation in a process known as [recapitation](https://tskit.dev/pyslim/docs/latest/tutorial.html#sec-tutorial-recapitation)).

### Msprime: a backward-time simulator

[_Msprime_](https://tskit.dev/msprime/docs/stable/intro.html) is a widely-used efficient backward-time simulator that outputs tree sequences. Its [discrete-time-Wright-Fisher](https://tskit.dev/msprime/docs/stable/ancestry.html#using-the-dtwf-model) ("`dtwf`") capability will simulate in backward time the same process we simulated in our forward-time model, but will allow much larger population sizes.

In [None]:
import msprime
Ne = 1_000   # Much faster to simulate a small sample taken from a larger population
ts = msprime.sim_ancestry(samples=10, population_size=Ne, model="dtwf", random_seed=123)
SVG(ts.draw_svg(size=(500, 250), y_axis=True))

#### Continuous-time models: a major speedup

However, the "`dtwf`" model is still fairly slow because it has to go generation-by-generation, backwards in time. If we assume that at most one coalescence occurs in any generation, we can simply sample a time back to the next coalescent event from the appropriate distribution, without having to step one generation at a time. This faster method is the default used by msprime, and it means that the size of the population no longer has an effect on the efficiency of the simulation. We can easily simulate a small sample from a population of a billion individuals: all that happens is that the time scale will change (of course, given a generation time of a few years, a constant population of a billion is not a biologically plausible assumption: the MRCAs would be older than the origin of life on Earth).

Note that the continuous-time approach means that we never get "polytomies" (i.e. multifurcations). Therefore the trees produced by the default msprime model are always strictly bifurcating.
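The continuous-time shortcut can be illustrated with a minimal sketch. It is not how _msprime_ is implemented internally, and it tracks only times, not tree topology. It assumes the standard coalescent result that, while $k$ lineages remain, the waiting time to the next coalescence is exponential with rate $\binom{k}{2}$ divided by the number of genomes. The name `coalescent_tmrca` is hypothetical.

```python
import random
import statistics


def coalescent_tmrca(num_lineages, num_genomes, rng):
    """Draw one time (in generations) for num_lineages to merge into a single lineage."""
    time = 0.0
    for k in range(num_lineages, 1, -1):
        num_pairs = k * (k - 1) / 2
        # Any one of the pairs may coalesce, each at rate 1/num_genomes per generation,
        # so we jump straight to the next event instead of stepping through generations.
        time += rng.expovariate(num_pairs / num_genomes)
    return time


rng = random.Random(123)
small = [coalescent_tmrca(5, 2_000, rng) for _ in range(20_000)]
huge = [coalescent_tmrca(5, 2_000_000_000, rng) for _ in range(20_000)]
print(f"Mean TMRCA, 2 thousand genomes: {statistics.mean(small):.3g} generations")
print(f"Mean TMRCA, 2 billion genomes:  {statistics.mean(huge):.3g} generations")
```

Both runs make exactly four random draws per replicate: only the time scale differs. Because each event merges exactly one pair, a genealogy built this way would be strictly bifurcating.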

### Variation in coalescent tree depths and topologies

To demonstrate the efficiency of this approach, we'll sample 20 individuals from a population of a thousand, until all the samples have coalesced into a single root, which should take some thousands of generations. We'll do it 30 times, to get a feel for the wide variation in coalescence times. _Msprime_ will run this in a millisecond or two.

<div class="alert alert-block alert-info"><b>Note:</b> If the <code>num_replicates</code> parameter is given to <code>sim_ancestry()</code>, the function will return a (lazy) iterator over the returned tree sequences. To turn this into a fixed list of tree sequences, we apply Python's built-in <code>list()</code> function.</div>

In [None]:
# To get a feel of the variation in coalescent tree time & topologies

import msprime
from IPython.display import HTML  # we can draw multiple tree sequences using HTML

tree_seqs = list(
    msprime.sim_ancestry(
        20, population_size=1_000, sequence_length=1000, num_replicates=30, random_seed=123,
    )
)

root_times = [ts.max_time for ts in tree_seqs]
ticks = np.linspace(0, 10 ** (np.ceil(np.log10(max(root_times)))), num=6)
ticks = {t: f"{t/1_000:.0f}k" for t in ticks}

HTML("".join(
    ts.draw_svg(
        max_time=max(root_times),
        size=(200, 350),
        node_labels={},
        symbol_size=1,
        y_axis=True,
        y_ticks=ticks,
    )
    for ts in tree_seqs
))

Notice the variation in the times of the root node (i.e. the deepest coalescence).

<dl class="exercise"><dt>Exercise 4</dt>
<dd>Because coalescent times are approximately exponentially distributed, it can be helpful to plot trees using a log timescale. Add <code>time_scale="log_time"</code> to the <code>draw_svg()</code> parameters above to see what this looks like. Then use <code>plt.hist()</code> below, to plot the distribution of <code>root_times</code>.</dd></dl>

In [None]:
# Exercise 4: plot a histogram of root_times


In [None]:
workbook.question("mean_tmrca")

## Information from additional samples

In a large population, as we increase the sample size from 2 genomes to thousands or millions, the additional common ancestors added to the genealogy will tend to be at younger and younger times. Another way to put this is that small numbers of genomes tend to contain information on deep history; increasing the sample size will tend to fill out information about more recent events.

We can see this by plotting the time of the additional node (if any) that is added to the tree as we increase the number of samples.
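Before plotting simulated trees, the expected pattern can be worked out by hand. The sketch below is not taken from the workshop code. It assumes the standard coalescent expectation that $k$ lineages persist for (number of genomes)$/\binom{k}{2}$ generations on average, which holds only when the sample is much smaller than the population. The function names and the population size of two million genomes are illustrative choices.

```python
from math import comb


def expected_interval(k, num_genomes):
    """Expected number of generations during which exactly k lineages remain."""
    return num_genomes / comb(k, 2)


def expected_tmrca(n, num_genomes):
    """Expected generations back to the MRCA of n samples (sum of all intervals)."""
    return sum(expected_interval(k, num_genomes) for k in range(2, n + 1))


num_genomes = 2_000_000  # assumption: one million diploids
for n in (2, 10, 100, 1000):
    most_recent = expected_interval(n, num_genomes)  # time of the youngest coalescence
    oldest = expected_tmrca(n, num_genomes)  # time of the root
    print(f"n={n:>4}: youngest node ~{most_recent:>12,.1f} gens, root ~{oldest:>12,.1f} gens")
```

Going from 2 to 1000 samples less than doubles the expected age of the root, while the youngest coalescence moves from millions of generations ago to only a few generations ago.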

We'll compare the forward simulation to the reverse one...

In other words, in a panmictic population, as we increase the sample size we will tend to fill out information about recent timescales, but will not add

# To complete