## Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    %pip install -q -r jlite-requirements.txt
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [None]:
# Load questions etc for this workbook
from IPython.display import SVG
import tskit
import ARG_workshop
workbook = ARG_workshop.Workbook2A()
display(workbook.setup)

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to perform different behaviour or additional analysis. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 2-A: Expected ARG patterns

We'll recap on the patterns observed in simulated ARGs, so that we know what to expect in inferred ARGs. There are many single-site statistics that can be calculated on population-level genome data, e.g. windowed diversity along the genome. However, we'll focus on patterns that require some knowledge of the underlying ARG.

We will use the same simulation as in workbook 1F: a bonobo + 2 chimpanzee population model with recent selective sweeps in two of the populations. Just as a reminder, here's the demography:

In [None]:
import stdpopsim
import demesdraw
import warnings

species = stdpopsim.get_species("PanTro")
model = species.get_demographic_model("BonoboGhost_4K19")
msprime_demography = model.model

# Plot a demesdraw "tubes" view of the model
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # Ignore a minor bug in the model specification
    demesdraw.tubes(msprime_demography.to_demes(), log_time=True)

<dl class="exercise"><dt>Exercise 1</dt>
<dd>Although the log timescale can be useful when thinking about coalescence times and their expected variation, it can also be helpful to understand the demography on a linear timescale. Plot the same demography below, but with <code>log_time=False</code>.</dd>
</dl>

In [None]:
# Plot the demography in this cell on a linear timescale

In [None]:
ts = tskit.load("data/chimp_selection.trees")
print(f"Loaded data for {ts.num_samples} genomes over {ts.sequence_length/1e6:.1f} Mb")

In [None]:
# Execute code block with <shift>+Return to display question; type and press return, or click on the buttons to answer
workbook.question("chimp_demography")

## Edge plots

As on previous days, a simple way to get a feel for the tree sequence is to plot the edges.

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Use the <code>ARG_workshop.edge_plot</code> function to plot the spans and parent times of the edges in the simulated tree sequence. Add the <code>plot_hist=True</code> argument to show a weighted histogram on the right of the plot, and <code>alpha=0.1</code> to make the edges semi-transparent. You can make the plot wider using the <code>width</code> argument.</dd>
</dl>

In [None]:
# Exercise 2: plot the edges along the genome


In [None]:
# Execute code block with <shift>+Return to display question; type and press return, or click on the buttons to answer
workbook.question("edge_plot")

To illustrate the dependence on sample size, we can create the same plot with one tenth of the number of samples. This is likely to remove recent coalescence events with a far higher probability than removing older ones. As a result, information about the selective sweeps is almost obliterated: most of the coalescences are in deep time.
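A rough way to see why subsampling mostly removes recent coalescences is to look at expected waiting times under the standard neutral coalescent. This is only a sketch under that assumption (a single constant-size population with no selection, unlike the chimp simulation), and the function name is made up for illustration. Time is measured in coalescent units, in which a single pair of lineages coalesces at rate 1.

```python
from math import comb

def expected_time_to_reach(n, m):
    """Expected time (coalescent units) for n lineages to coalesce down to m
    lineages under the neutral coalescent: each stage with k lineages lasts
    1 / C(k, 2) on average."""
    return sum(1 / comb(k, 2) for k in range(m + 1, n + 1))

# Time to the root for the full sample (120) and for a tenth of it (12)
root_120 = expected_time_to_reach(120, 1)
root_12 = expected_time_to_reach(12, 1)

# Expected time of the very first coalescence in each case
first_120 = expected_time_to_reach(120, 119)
first_12 = expected_time_to_reach(12, 11)

# Time by which 108 of the 119 coalescences in the full sample have happened
most_120 = expected_time_to_reach(120, 12)

print(f"Root: {root_120:.3f} (n=120) vs {root_12:.3f} (n=12)")
print(f"First coalescence: {first_120:.5f} (n=120) vs {first_12:.5f} (n=12)")
print(f"108 of 119 coalescences expected within {most_120 / root_120:.1%} of the tree height")
```

Under these assumptions the expected depth of the root barely changes with sample size, whereas the large sample packs most of its coalescences into a short recent period. Those recent events are the ones that disappear when we subsample.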

In [None]:
import numpy as np
ARG_workshop.edge_plot(ts.simplify(ts.samples()[np.arange(0, 120, 10)]), width=15, plot_hist=True)

If our dataset is split up into different populations, we can plot those separately, which can be telling. Note, however, that our example is an extreme case, in which the 3 populations have radically different selective histories, with no migration or admixture since the selection events, so the pattern is much clearer than most real examples.

In [None]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 2, gridspec_kw={"width_ratios": [8, 1], "hspace": 0.3}, figsize=(15, 10), sharey=True)
for ax_row, pop in zip(axes, ts.populations()):
    xaxis = (pop.id==ts.population(-1).id)
    ARG_workshop.edge_plot(ts.simplify(ts.samples(population=pop.id)), ax=ax_row, xaxis=xaxis, title=pop.metadata["name"], alpha=0.5)

In [None]:
# Execute code block with <shift>+Return to display question; type and press return, or click on the buttons to answer
workbook.question("sweeps")

## Coalescent rates 

As we previously saw, another visualization is to plot the instantaneous coalescence rate over time, weighted by the number of pairs that coalesce at each node (i.e. the *pairwise* rate). The code below does this for you, with the ability to specify windows of time and genomic position. However, don't feel like you need to go through the details, as the _tskit_ API for this will be changing in the next few weeks.

In [None]:
def pair_coalescence_rates(input_ts, sample_sets=None, time_breaks=None, window_breaks=None):
    # NB: in the next tskit release (0.5.9), there will be an API change such that
    # this function will be directly available as `ts.pair_coalescence_rates(time_breaks)`
    if sample_sets is not None:
        sample_sets = [list(s) for s in sample_sets]  # work around small bug in implementation of coalescence_time_distribution
    d = input_ts.coalescence_time_distribution(
        sample_sets=sample_sets,
        window_breaks=window_breaks,
        weight_func="pair_coalescence_events",
    )
    return d.coalescence_rate_in_intervals(np.array(time_breaks))

time_windows = np.logspace(0, np.log10(ts.max_time), 30)
rates = pair_coalescence_rates(ts, time_breaks=time_windows)
fig, axes = plt.subplots(1, 2, figsize=(15, 4))
# This might complain if any rate is 0: that can be ignored
for ax, ylabel, y in zip(axes, ("Instantaneous Coalescence Rate (ICR)", "Inverse ICR (IICR)"), (rates, 1/rates)):
    ax.stairs(y.flatten(), time_windows, baseline=None)
    ax.set_xscale("log")
    ax.set_xlabel(f"Time ago ({ts.time_units})")  # f-string, so the time units are filled in
    ax.set_ylabel(ylabel)

In [None]:
# Execute code block with <shift>+Return to display question; type and press return, or click on the buttons to answer
workbook.question("ICR")

Discuss with a colleague what might be causing a dip in the IICR at ~200-300 generations ago.
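To make the pairwise weighting used in the rate calculation above more concrete, here is a small sketch that counts how many pairs of samples first meet at each internal node of a toy tree. The nested-tuple tree format and the function name are assumptions made for illustration only; they are not part of _tskit_.

```python
def count_pairs(tree):
    """Return (num_samples, pairs) for a tree given as nested tuples, where any
    non-tuple is a sample. `pairs` lists, for each internal node in post-order,
    the number of sample pairs whose most recent common ancestor is that node."""
    if not isinstance(tree, tuple):
        return 1, []
    sizes, pairs = [], []
    for child in tree:
        n, p = count_pairs(child)
        sizes.append(n)
        pairs.extend(p)
    total = sum(sizes)
    # Pairs first joined here take one sample from each of two different children
    here = (total**2 - sum(s**2 for s in sizes)) // 2
    pairs.append(here)
    return total, pairs

toy_tree = ((0, 1), (2, (3, 4)))
num_samples, pairs_per_node = count_pairs(toy_tree)
print(f"{num_samples} samples, pairs coalescing at each node: {pairs_per_node}")
```

The weights always sum to the total number of sample pairs, and nodes near the root tend to carry the largest weights, since they join the largest groups of samples.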

### "Local" coalescence rates

Selection will cause the coalescence rate to change along the genome (as we saw in the edge plots). It can therefore be helpful to plot the coalescence rates at different points along the genome (sometimes known as the "local" ICR). Here's a plotting function to do this for multiple comparisons (again, no need to go into the detail of each line of code).

In [None]:
def plot_pair_rates(input_ts, genomic_windows, num_log_timebins, sample_sets=None, indexes=None, axes=None):
    # indexes is a list of tuple pairs, e.g. [(0, 1), (1, 2)]
    time_breaks = np.logspace(0, np.log10(input_ts.max_time), num_log_timebins)
    rates = pair_coalescence_rates(input_ts, sample_sets, window_breaks=genomic_windows, time_breaks=time_breaks)
    if sample_sets is None:
        sample_sets = [input_ts.samples()]
    order = [(a, b) for a in range(len(sample_sets)) for b in range(a, len(sample_sets))]
    if indexes is None:
        indexes = np.arange(len(order))
    else:
        indexes = [order.index(i) for i in indexes]
    if axes is None:
        fig, axes = plt.subplots(len(indexes), 1, figsize=(12.5, 3 * len(indexes)))
    num_axes = 1
    try:
        num_axes = len(axes)
    except TypeError:
        axes = [axes]
    if num_axes != len(indexes):
        raise ValueError("Must have same number of axes as indexes")
    for ax, rate in zip(axes, (rates[i] for i in indexes)):
        im = ax.pcolormesh(genomic_windows, time_breaks, rate)
        ax.set_yscale("log")
        bar = plt.colorbar(im, ax=ax)
        bar.ax.set_ylabel('pairwise coalescent density', labelpad=10, rotation=270)
        ax.set_ylabel(f"Time ({input_ts.time_units})");

We'll pick 30 evenly spaced breakpoints along the genome (giving 29 windows), and 20 log-spaced breakpoints in time (giving 19 time bins):

In [None]:
genomic_windows = np.linspace(0, ts.sequence_length, 30)
plot_pair_rates(ts.simplify(), genomic_windows, num_log_timebins=20)
plt.xlabel("Genome position")
plt.ylabel(f"Time ({ts.time_units})");

This gives a similar picture to the edge plot, but using pairwise rates, which do not change depending on sample size. The selective sweeps are clearly visible. However, note that binning means that the resolution is poorer than in the edge plots, and bins are subject to differences in the amount of expected variation.
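One way to think about the variation between bins is to write down a simple rate estimator: the number of pairs that coalesce within a time bin, divided by the total time that still-uncoalesced pairs spend in that bin. The sketch below is a generic estimator of this kind with made-up input values. It is not necessarily the exact calculation that _tskit_ performs.

```python
def rates_in_intervals(pair_times, breaks):
    """Estimate a piecewise-constant pairwise coalescence rate.

    For each interval [start, end), divide the number of pairs coalescing in
    the interval by the time that uncoalesced pairs spend in it.
    Intervals that no pair reaches get None."""
    rates = []
    for start, end in zip(breaks[:-1], breaks[1:]):
        events = sum(start <= t < end for t in pair_times)
        exposure = sum(min(t, end) - start for t in pair_times if t > start)
        rates.append(events / exposure if exposure > 0 else None)
    return rates

toy_pair_times = [1, 3, 5]  # hypothetical coalescence times of three pairs
toy_breaks = [0, 2, 4, 6]
toy_rates = rates_in_intervals(toy_pair_times, toy_breaks)
print(toy_rates)
```

In this toy example the oldest bin is estimated from a single remaining pair, which is why estimates in sparsely populated bins can be noisy.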

Again, we can break this down into populations. We can use the `sample_sets` option to provide the samples in each population, and compare within sample set 0 by using the index `(0, 0)`, within sample set 1 by using `(1, 1)`, etc.
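The `plot_pair_rates` helper defined earlier looks up each index pair in an `order` list. The short sketch below reproduces that ordering for three sample sets, so you can see which positions are within-population comparisons and which are between-population ones. The `names` list is simply the population order we intend to use.

```python
from itertools import combinations_with_replacement

names = ["bonobo", "central", "western"]
# Same ordering as the `order` list built inside plot_pair_rates
order = list(combinations_with_replacement(range(len(names)), 2))
for position, (a, b) in enumerate(order):
    kind = "within" if a == b else "between"
    print(position, (a, b), kind, names[a], names[b])
```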

In [None]:
sample_sets = {pop.metadata["name"]: ts.samples(population=pop.id) for pop in ts.populations()}
sample_sets = {k: sample_sets[k] for k in ["bonobo", "central", "western"]}  # make sure bonobo first, central next, western last
indexes = [(0, 0), (1, 1), (2, 2)]
fig, axes = plt.subplots(3, 1, gridspec_kw={"hspace": 0.3}, figsize=(15, 10), sharey=True, sharex=True)
plot_pair_rates(ts, genomic_windows, num_log_timebins=20, sample_sets=sample_sets.values(), indexes=indexes, axes=axes)
axes[0].set_title("bonobo")
axes[1].set_title("central")
axes[2].set_title("western")
axes[2].set_xlabel("Genome position");

The sweeps are obvious, but perhaps surprisingly, the clear bunching of coalescence in the edge plots for bonobos at ~$10.5^4$ generations ago is not very obvious in the coalescence rate plots. There's also some noise in the Central plot at recent times, 10 generations ago, although the line at ~40000 generations ago corresponding to a small population size in the Central group is quite visible.

### Cross coalescence rates

If we want to look at migration *between* populations, we can look at the **cross coalescence rate** (i.e. only record a pairwise coalescence if one of the pair is from population A and the other is from population B). Since we have 3 populations, there are 3 possible pairs to look at:

* bonobo - central
* bonobo - western
* central - western
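As a sketch of what "cross" coalescence means at a single node, the function below counts the pairs that first meet at a binary node, restricted to one sample from each of two named populations. The function name and the dictionaries of sample counts are hypothetical and used only for illustration.

```python
def pairs_at_node(child_a, child_b, pop_x, pop_y):
    """Number of sample pairs first joined at a binary node, taking one sample
    from pop_x and one from pop_y. child_a and child_b map a population name
    to the number of samples from that population below each child."""
    if pop_x == pop_y:
        return child_a.get(pop_x, 0) * child_b.get(pop_x, 0)
    return (child_a.get(pop_x, 0) * child_b.get(pop_y, 0)
            + child_a.get(pop_y, 0) * child_b.get(pop_x, 0))

# A node joining 2 bonobo + 1 central samples with 3 central samples
left = {"bonobo": 2, "central": 1}
right = {"central": 3}
print("within central:  ", pairs_at_node(left, right, "central", "central"))
print("bonobo - central:", pairs_at_node(left, right, "bonobo", "central"))
print("bonobo - western:", pairs_at_node(left, right, "bonobo", "western"))
```

The within and cross counts together account for every pair joined at the node (here 3 x 3 = 9).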

The indexes are then `(0, 1)`, `(0, 2)`, and `(1, 2)`:

In [None]:
fig, axes = plt.subplots(3, 1, gridspec_kw={"hspace": 0.3}, figsize=(15, 10), sharey=True, sharex=True)
indexes = [(0, 1), (0, 2), (1, 2)]
plot_pair_rates(ts, genomic_windows, num_log_timebins=25, sample_sets=sample_sets.values(), indexes=indexes, axes=axes)
axes[0].set_title("bonobo - central")
axes[1].set_title("bonobo - western")
axes[2].set_title("central - western")
axes[2].set_xlabel("Genome position");

You can clearly see the introgressed regions between the bonobo and central populations, whereas bonobo and western are not connected. However, there's not much of a trace of the recent mass migration at ~4000 generations ago between the central and western chimpanzee populations. Why might this be so?

## Topological analysis

Tskit provides some functionality for working with the *topologies* of tree sequences (in which branch lengths are ignored). You have already met the GNN statistic. The other option detailed below, the topology counter, is a promising approach that has not yet seen widespread use, but we hope it will in the future, especially combined with approaches such as [twisst](https://github.com/simonhmartin/twisst). We include it here in case it seems applicable to your model system.

### GNN

The [genealogical_nearest_neighbours](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.genealogical_nearest_neighbours) method was introduced at the end of workbook 1E with the out-of-Africa simulated dataset. Below we outline a different topological method, which is also covered in the official _tskit_ [counting topologies tutorial](https://tskit.dev/tutorials/counting_topologies.html). If you need a refresher, you can look there. Here's the same analysis repeated on the simulated chimp data. In this case it's pretty boring: all the genomes within a population are each other's closest relatives.
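As a reminder of what the statistic measures, here is a rough single-tree sketch: for a focal sample, walk up to the closest ancestor that has other samples below it, then report the proportion of those neighbours that belong to each population. The function and the dictionary-based tree format are made up for illustration. The real _tskit_ method works on whole tree sequences and combines values across the trees along the genome, which this sketch does not attempt.

```python
def gnn_one_tree(parent, populations, focal):
    """GNN proportions for `focal` in a single tree.

    parent: dict mapping each node to its parent (the root maps to None).
    populations: dict mapping each sample node to a population label."""
    def samples_below(node):
        found = {node} if node in populations else set()
        for child, par in parent.items():
            if par == node:
                found |= samples_below(child)
        return found

    pop_labels = sorted(set(populations.values()))
    node = parent[focal]
    while node is not None:
        neighbours = samples_below(node) - {focal}
        if neighbours:
            return {lab: sum(populations[s] == lab for s in neighbours) / len(neighbours)
                    for lab in pop_labels}
        node = parent[node]
    return {}

# Toy tree ((0, (1, 2)), 3): node 4 joins 1 and 2, node 5 joins 0 and 4, node 6 is the root
toy_parent = {1: 4, 2: 4, 0: 5, 4: 5, 5: 6, 3: 6, 6: None}
toy_pops = {0: "A", 1: "A", 2: "B", 3: "B"}
for focal in toy_pops:
    print(focal, gnn_one_tree(toy_parent, toy_pops, focal))
```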

In [None]:
import pandas as pd
import seaborn as sns

gnn = ts.genealogical_nearest_neighbours(ts.samples(), sample_sets=list(sample_sets.values()))
df = pd.DataFrame(gnn, columns=sample_sets.keys())
df["focal_population"] = [ts.population(ts.node(u).population).metadata["name"] for u in ts.samples()]
mean_gnn = df.groupby("focal_population").mean()
sns.clustermap(mean_gnn, col_cluster=False, z_score=0, cmap="mako", cbar_pos=(1.0, 0.05, 0.05, 0.7));

### Counting topologies

If we have a tree of N tips, _tskit_ gives an easy way to list all the possible topologies (i.e. ignoring branch lengths), using the [`all_trees()` function](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.all_trees). _Tskit_ can label each of these tree topologies with a **rank** (see [here](https://tskit.dev/tskit/docs/stable/topological-analysis.html#interpreting-tree-ranks) for a detailed explanation).
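The number of possible topologies grows very quickly with the number of tips. The sketch below counts rooted, leaf-labelled topologies in two ways: strictly bifurcating trees only, and trees in which polytomies are also allowed. Both function names are made up for this illustration, and the counts come from standard combinatorial formulas rather than from _tskit_ itself.

```python
from functools import lru_cache
from math import comb

def num_binary_topologies(n):
    """Rooted, leaf-labelled, strictly bifurcating trees: (2n - 3)!!"""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count

@lru_cache(maxsize=None)
def num_topologies(n):
    """Rooted, leaf-labelled trees in which polytomies are also allowed."""
    if n <= 2:
        return 1

    def forests(m):
        # Ways to arrange m labelled tips into one or more sibling subtrees
        if m == 0:
            return 1
        return num_topologies(m) * (2 if m >= 2 else 1)

    # Condition on the size k of the root's child subtree that contains tip 0
    return sum(comb(n - 1, k - 1) * num_topologies(k) * forests(n - k)
               for k in range(1, n))

for n in range(2, 8):
    print(n, num_binary_topologies(n), num_topologies(n))
```

For 3 tips this gives 3 bifurcating topologies plus one polytomy, i.e. 4 in total.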

In [None]:
labels = {i: name for i, name in enumerate(sample_sets.keys())}  # label node 0 "bonobo", node 1 "central", etc.
for tree in tskit.all_trees(3):
    display(SVG(tree.draw_svg(node_labels=labels)))
    print(tree.rank())

We can also convert from a ranked topology back to a tree using [`tskit.Tree.unrank()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.unrank). E.g. here's the last tree topology:

In [None]:
tskit.Tree.unrank(num_leaves=3, rank=(1, 2)).draw_svg(node_labels=labels)

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Use <code>unrank(...)</code> to plot the tree topology that groups tips 1 (central) and 2 (western) together</dd>
</dl>

In [None]:
# Plot a tree that groups central and western together


If we take one genome from each of the 3 chimp populations, the coalescent tree at any one location in the genome must take one of the 4 possible topologies. For example, we can take the first sample genome from each population:

In [None]:
three_chimps = [samp[0] for samp in sample_sets.values()]
tiny_tree_sequence = ts.simplify(three_chimps)
print(f"There are {tiny_tree_sequence.num_trees} trees in the tiny 3-tip tree sequence. Here is the first:")
first_tree = tiny_tree_sequence.first()
display(SVG(first_tree.draw_svg(
    y_axis=True,
    x_axis=True,
    size=(300, 500),
    node_labels={i: lab + "_0" for i, lab in labels.items()}
)))
print(f"It can be classified as {first_tree.rank()}")

However, this is just one of a large number of possible choices of bonobo + central + western genome. What if we could count *all possible* choices (rather than just the first sample from each population)? It turns out that there is an efficient way to do this using tree-based algorithms. We call these the "embedded topologies":

In [None]:
topology_counter = ts.first().count_topologies(
    sample_sets = sample_sets.values()
)
# Careful: do not list out the topology_counter, as it will run forever
# Here's the issue if anyone wants to fix it! https://github.com/tskit-dev/tskit/issues/1462

topology_counter[0, 1, 2].most_common()

In this case, all 64000 combinations of one-tip-from-each-population give the same topology, of rank `(1, 0)`, which is the same as the one above. There is also an efficient way to do this over all the trees in the genome. Below we count all the topologies over the entire genome, and also count them weighted by the span of genome that they cover:

In [None]:
from tqdm.auto import tqdm
topology_totals = {tree.rank(): {"counts": 0, "spans": 0} for tree in tskit.all_trees(3)}

for topology_counter, tree in tqdm(zip(ts.count_topologies(sample_sets.values()), ts.trees()), total=ts.num_trees):
    embedded_topologies = topology_counter[0, 1, 2]
    weight = tree.span / embedded_topologies.total()
    for rank, count in embedded_topologies.items():
        topology_totals[rank]["counts"] += count
        topology_totals[rank]["spans"] += count * weight

In [None]:
for rank, data in topology_totals.items():
    display(SVG(tskit.Tree.unrank(num_leaves=3, rank=rank).draw_svg(node_labels=labels)))
    print(data)

So only two embedded topologies are seen in the data. We never see genealogies that link bonobo with western (which could have occurred due to incomplete lineage sorting, or ILS).
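To see how the span weighting in the loop above differs from the raw counts, here is a toy version using made-up numbers. Each tree's span is shared out equally among all the sampled triplets in that tree, so the weighted totals add up to the length of the genome. The spans, ranks and counts below are hypothetical.

```python
from collections import Counter

# Hypothetical data: (span of the tree, counts of embedded topologies by rank)
toy_trees = [
    (1000, Counter({(1, 0): 8})),
    (3000, Counter({(1, 0): 6, (1, 1): 2})),
]

totals = {}
for span, embedded in toy_trees:
    weight = span / sum(embedded.values())  # span assigned to each sampled triplet
    for rank, count in embedded.items():
        entry = totals.setdefault(rank, {"counts": 0, "spans": 0.0})
        entry["counts"] += count
        entry["spans"] += count * weight

genome_length = sum(span for span, _ in toy_trees)
total_count = sum(entry["counts"] for entry in totals.values())
count_fractions = {rank: entry["counts"] / total_count for rank, entry in totals.items()}
span_fractions = {rank: entry["spans"] / genome_length for rank, entry in totals.items()}
print("By count:", count_fractions)
print("By span: ", span_fractions)
```

The two fractions differ whenever a topology is concentrated in trees that cover unusually long or short stretches of the genome.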


In [None]:
# Execute code block with <shift>+Return to display question; type and press return, or click on the buttons to answer
workbook.question("topology_count")

In [None]:
# Cell for calculations to answer the question above
