## Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    %pip install -q -r jlite-requirements.txt
    get_ipython().__class__.__name__ = "Shell"  # Temporary hack to get the tskit_arg_visualizer working
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [None]:
# Load questions etc for this workbook
from IPython.display import SVG
import tskit
import ARG_workshop
workbook = ARG_workshop.Workbook1C()
display(workbook.setup)

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to change their behaviour or to carry out additional analysis. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 1-C: Pedigrees, graphs and recombination

Previously we explored genetic genealogies in the absence of recombination, when the relationships between a set of sample genomes can be depicted as a tree. Results from standard coalescent theory are often centred around a single tree; the theory and practice revolving around genetic genealogies with recombination is less well characterised. It is the main focus of this specific practical, and the workshop in general.

## The pedigree

In a species in which each diploid individual has two parents, the links from one individual to each of its (diploid) parents form a network known as a **pedigree**. Although the pedigree describes all possible routes of inheritance, in most cases involving real data it is unknown. However, we wrote our `FwdWrightFisherSimulator` to keep track not only of the routes of genetic transmission (in the _edge_ table), but also of the pedigree (in the `parents` column of the _individual_ table). This means we can show the simulated pedigree as well as the genetic genealogy.

In [None]:
# Show the individuals table from a simulation
ts = ARG_workshop.FwdWrightFisherSimulator(population_size=3).run(gens=2, filter_nodes=False)
ind = ts.individual(6)
print(f"The biological parents of individual {ind.id} are individuals {ind.parents[0]} and {ind.parents[1]}")
print(f"A parent value of {tskit.NULL} means the parent is unknown or undefined")
display(ts.tables.individuals)

### Genetic transmission is embedded in the pedigree

We can plot the pedigree stored in a tree sequence using the `draw_pedigree` workbook function (see also [pedigrees in the msprime docs](https://tskit.dev/msprime/docs/stable/pedigrees.html)). Below, each individual is a hexagon containing its node (genome) ids, with the pedigree shown as light grey lines. On the right, we have used `simplify()` to sample-resolve the tree sequence and retain the genetic ancestry of only 3 of the genomes at time 0 (ids 0, 4, and 10): the tskit edges are overlaid as black lines onto the pedigree. Note that we had to specify `keep_unary=True` when simplifying, because full simplification of tree sequences often results in the loss of individuals, and is therefore likely to destroy the continuity of any pedigree.

In [None]:
from matplotlib import pyplot as plt
# choose a random seed to give a nice viz without line crossing
sim = ARG_workshop.FwdWrightFisherSimulator(population_size=6, random_seed=21916)
base_ts = sim.run(gens=6, simplify=False)

# Sample-resolve to only 3 samples in the current generation (i.e. simplify, but keep all the nodes etc.)
S = [0,4,10]  # The sample IDs to use
ts = base_ts.simplify(samples=S, filter_nodes=False, keep_unary=True)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
plot_params = {
    "node_color": "skyblue",
    "font_size": 5.8,
    "node_size":350,
    "node_shape": "H",
    "edge_color": "lightgrey",
    "arrowstyle": "simple",
}
ARG_workshop.draw_pedigree(base_ts, axes[0], "Pedigree only", show_axis=True, width=0.5, **plot_params)
ARG_workshop.draw_pedigree(ts, axes[1], "Genetic transmission overlaid", width=2, ts_edge="black", **plot_params)

## The effect of recombination

In the plot above there is no recombination, so each genome can only have one parent genome. In contrast, recombination leads to *biparental inheritance*, where some regions of the DNA sequence come from the mother and some from the father. For simplicity, we'll just deal with a single chromosome of length $L$ undergoing crossover recombination. Genetic inheritance is then described by *intervals* that span different regions from $0...L$. A _tskit_ **edge** consists of a _left_ and _right_ position defining the interval, together with a _parent_ and _child_ node.
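
To make the idea of an edge interval concrete, here is a minimal pure-Python sketch (it does not use _tskit_). The `Edge` tuple, the node ids, the length of 1000, and the `parent_at` helper are all hypothetical, chosen only for illustration; the sketch assumes half-open intervals, so that `left` is included and `right` is excluded.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    left: float   # start of the inherited interval (inclusive)
    right: float  # end of the inherited interval (exclusive)
    parent: int   # id of the parent node (genome)
    child: int    # id of the child node (genome)

L = 1000  # hypothetical chromosome length

# Hypothetical child genome 4, inheriting from parent genomes 0 and 1
# with a single crossover at position 400
edges = [
    Edge(left=0, right=400, parent=0, child=4),
    Edge(left=400, right=L, parent=1, child=4),
]

def parent_at(edges, child, position):
    """Return the parent of `child` at `position`, or None if no edge covers it."""
    for e in edges:
        if e.child == child and e.left <= position < e.right:
            return e.parent
    return None

print(parent_at(edges, 4, 100))  # parent to the left of the breakpoint
print(parent_at(edges, 4, 400))  # parent from the breakpoint onwards
```

Between them, the two edges cover the whole of the child genome exactly once, which is what biparental inheritance with a single crossover looks like in this representation.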

<div class="alert alert-block alert-info"><b>Note:</b> Modelling multiple chromosomes can be done, but is an advanced topic, e.g. see the <a href="https://tskit.dev/msprime/docs/stable/ancestry.html#multiple-chromosomes">msprime documentation on the topic</a>.</div>

### A modified simulator
We can derive a new class from `ARG_workshop.FwdWrightFisherSimulator` and add recombination by overriding the `add_edges` function. We will simulate crossover recombination, in which the DNA is copied from one parent and then, at a *recombination breakpoint*, switches to being copied from the other parent ([gene conversion](https://en.wikipedia.org/wiki/Gene_conversion) is another form of recombination, which can also be incorporated into more advanced tree sequence simulators).

As is common in such simulators, we'll set a per-base-pair, per-generation "recombination rate", $\rho$ (in an average human chromosome of about $1\times10^8$ bp, $\rho$ tends to be between $1\times10^{-8}$ and $2\times10^{-8}$, giving one or two crossover recombinations per chromosome). To record biparental inheritance, we create edges from the maternal or paternal genome of each parent to the child genome, using the `left` and `right` values to specify which region of the genome is being passed on. To pick a location where recombination occurs, we choose a point at random along the chromosome, so a breakpoint can occur anywhere along the genome.
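
The procedure described above can be sketched in plain Python before we wire it into the simulator. This is only an illustration: the function name `alternating_segments`, the sequence length, the rate, and the fixed count of two crossovers are assumptions made for the example (a simulator would normally draw the number of crossovers at random).

```python
import random

def alternating_segments(breakpoints, seq_len):
    """Split [0, seq_len) at the breakpoints, alternating between parent genome 0 and 1."""
    # A set removes duplicate positions; sorting puts them in left-to-right order
    positions = sorted({0, *breakpoints, seq_len})
    segments = []
    genome = 0  # start by copying from the first parent genome
    for left, right in zip(positions[:-1], positions[1:]):
        segments.append((left, right, genome))
        genome = 1 - genome  # switch to the other parent genome at each breakpoint
    return segments

seq_len = 1000  # hypothetical sequence length in bp
rho = 1e-3      # hypothetical per-bp, per-generation recombination rate
print("Expected crossovers per transmission:", seq_len * rho)

rng = random.Random(42)
# For illustration only, place exactly two crossovers uniformly at random
breakpoints = [rng.randrange(seq_len) for _ in range(2)]
print(alternating_segments(breakpoints, seq_len))
```

Each `(left, right, genome)` triple corresponds to one edge from the chosen parent genome to the child.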

In [None]:
import numpy as np
class FwdWrightFisherRecombSim(ARG_workshop.FwdWrightFisherSimulator):
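    def add_edges(self, randomly_ordered_parent_nodes, child_node):
        L = self.tables.sequence_length
        # Draw the number of crossovers from a Poisson distribution with mean L * rate
        num_breakpoints = self.rec_rng.poisson(L * self.recombination_rate, size=1)
        # Place breakpoints at random integer positions; np.unique sorts them and drops duplicates
        breakpoint_positions = np.unique([0, *self.rec_rng.integers(L, size=num_breakpoints), L])
        choose_genome = 0  # start by copying from the first of the two parent genomes
        for left, right in zip(breakpoint_positions[:-1], breakpoint_positions[1:]):
            self.tables.edges.add_row(
                left=left,
                right=right,
                parent=randomly_ordered_parent_nodes[choose_genome],
                child=child_node,
            )
            choose_genome = 1 if choose_genome == 0 else 0  # switch to the other parent genome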

    def __init__(self, Ne, seq_len=1000, recombination_rate=1e-8, random_seed=21916):
        self.recombination_rate = recombination_rate
        # make a different random number generator to use to pick recombination breakpoints
        self.rec_rng = np.random.default_rng(seed=random_seed)
        super().__init__(Ne, seq_len, random_seed)  # calls the __init__ function of the underlying class


Here are some examples of the simulation results, showing both the pedigree and the darker lines of genetic transmission (the _tskit_ edges).

Importantly, these tree sequences have been **sample-resolved**: that is, the edges have been trimmed down at the end of the simulation to show only the ancestry of three specific samples. Sample-resolution is inherently a retrospective process. It involves traversing backwards through the ancestry from the samples, removing non-ancestral sections of DNA, to leave only edge intervals that trace the history of "ancestral genetic material". This can't be done forwards in time, because we can't know in advance which spans of genome will end up making it to the current day.
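
The backward traversal can be illustrated with a toy sketch. This is not the algorithm that _tskit_ uses in `simplify()`; it is a deliberately simple pure-Python version, and the function names, node ids, times, and edges are all hypothetical. It assumes every parent is strictly older than its children, so that visiting nodes from youngest to oldest processes each child before its parents.

```python
def merge(intervals):
    """Merge overlapping or adjacent (left, right) intervals."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged

def sample_resolve(edges, node_times, samples, seq_len):
    """Trim (left, right, parent, child) edges to the ancestry of `samples`."""
    ancestral = {u: [] for u in node_times}  # ancestral material carried by each node
    for s in samples:
        ancestral[s] = [(0, seq_len)]  # a sample is "ancestral" over its whole genome
    kept = []
    for child in sorted(node_times, key=node_times.get):  # youngest first
        segments = merge(ancestral[child])
        for left, right, parent, c in edges:
            if c != child:
                continue
            for a, b in segments:
                lo, hi = max(left, a), min(right, b)
                if lo < hi:  # the edge overlaps some ancestral material
                    kept.append((lo, hi, parent, child))
                    ancestral[parent].append((lo, hi))
    return kept

# Hypothetical example: genome 0 is a recombinant of genomes 4 and 5
node_times = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2}
edges = [
    (0, 50, 4, 0), (50, 100, 5, 0),
    (0, 100, 4, 1), (0, 100, 5, 2), (0, 100, 5, 3),
    (0, 100, 6, 4), (0, 100, 6, 5),
]
resolved = sample_resolve(edges, node_times, samples=[0], seq_len=100)
for edge in resolved:
    print(edge)
```

Notice that the edges above nodes 4 and 5 are shortened rather than removed: only the half of each that is ancestral to sample 0 survives.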

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 6))
fig.suptitle(f"Genetic ancestry of sample nodes: {S}")
sim_params = {"gens": 6, "simplify": True, "samples": S, "filter_nodes": False, "keep_unary": True}
ts_no_recomb = FwdWrightFisherRecombSim(Ne=6, recombination_rate=0).run(**sim_params)
ts_low_recomb = FwdWrightFisherRecombSim(Ne=6, recombination_rate=1e-4).run(**sim_params)
ts_high_recomb = FwdWrightFisherRecombSim(Ne=6, recombination_rate=1e-3).run(**sim_params)
for name, ts, ax in zip(("No", "Low", "High"), (ts_no_recomb, ts_low_recomb, ts_high_recomb), axes):
    ARG_workshop.draw_pedigree(ts, ax, f"{name} recombination", width=2, ts_edge="black", **plot_params)
plt.show()

<dl class="exercise"><dt>Exercise 1</dt>
<dd>Plot the edge table for the low recombination tree sequence using <code>ts_low_recomb.tables.edges</code></dd>
</dl>

In [None]:
# Use this cell to plot the ts_low_recomb edge table

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("low_rec_breaks")

## Ancestral Recombination Graphs

Above, it is clear that as the recombination rate increases, the ancestry of the samples gets more and more tangled. More specifically, the black lines form a network or **graph** rather than a tree. This inheritance structure is sometimes loosely referred to as an Ancestral Recombination Graph, or ARG, and a _tskit_ tree sequence can be thought of as a storage format that is flexible enough to store many different types of ARG. Two main features distinguish ARGs from more general phylogenetic networks:

* A genomic coordinate space is defined, such that we can pick an equivalent position in each of the sample genomes and extract a *local tree* at that position. The ability to project an ARG into a series of local trees is a direct result of particulate inheritance (a single basepair in a sampled genome can only have been inherited from one of the parental genomes)
* In the same way that we simplified trees by changing branches and removing nodes, we can simplify ARGs by changing edges and removing nodes (but the process is more intricate)
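
The first point, projecting an ARG into a local tree at a chosen position, can be sketched in a few lines of plain Python. The edge list and the helper names `local_tree` and `path_to_root` are hypothetical and only for illustration; in practice _tskit_ does this for you.

```python
# (left, right, parent, child) edges of a small hypothetical ARG on [0, 100)
edges = [
    (0, 100, 3, 0),
    (0, 40, 3, 1),    # sample 1 inherits from node 3 to the left of position 40...
    (40, 100, 4, 1),  # ...and from node 4 to the right
    (0, 100, 4, 2),
    (0, 100, 5, 3),
    (0, 100, 5, 4),
]

def local_tree(edges, position):
    """Return the local tree at `position` as a child -> parent mapping."""
    # Particulate inheritance: at any one position each child has at most one parent
    return {child: parent for left, right, parent, child in edges if left <= position < right}

def path_to_root(tree, node):
    """Follow parent links from `node` up to the root of the local tree."""
    path = [node]
    while path[-1] in tree:
        path.append(tree[path[-1]])
    return path

for x in (10, 70):
    print(f"Position {x}: ancestry of sample 1 is {path_to_root(local_tree(edges, x), 1)}")
```

The same sample traces back through different ancestors depending on the position chosen, even though the underlying graph is a single object.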

Visualization can help to illustrate both these points.

### Direct visualizations

There are 2 ways to directly visualise the relationships in a recombinant genealogy:

* A traditional graph view. Below we use an [interactive visualiser developed for tree sequences](https://github.com/kitchensjn/tskit_arg_visualizer).
* A plot of the *local trees* implied by the tree sequence. This is the standard *tskit* view, with shading indicating the region covered by each tree.

<div class="alert alert-block alert-info"><b>Note:</b> Although tree-by-tree plots are the default visualization output by the <code>draw_svg()</code> method, it's important to realise that <em>tskit</em> does not store each tree separately, but constructs it from the previous tree by adding and removing a small number of edges (the edges added and removed can be obtained using the <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.edge_diffs">.edge_diffs()</a> method). Bear in mind that in the local tree visualization, the same edge is therefore plotted multiple times, once for each local tree in which it appears.
</div>
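
To see why building each tree from the previous one is cheap, here is a toy pure-Python sketch of the idea. It is not the _tskit_ implementation: the generator below simply mimics the `(interval, edges_out, edges_in)` shape described in the linked documentation, using a hypothetical edge list.

```python
# (left, right, parent, child) edges of a small hypothetical ARG on [0, 100)
edges = [
    (0, 100, 3, 0), (0, 40, 3, 1), (40, 100, 4, 1),
    (0, 100, 4, 2), (0, 100, 5, 3), (0, 100, 5, 4),
]

def edge_diffs(edges, seq_len):
    """Yield (interval, edges_out, edges_in) for each local tree, left to right."""
    breakpoints = sorted({0, seq_len, *(e[0] for e in edges), *(e[1] for e in edges)})
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        edges_out = [e for e in edges if e[1] == left]  # edges that end here
        edges_in = [e for e in edges if e[0] == left]   # edges that start here
        yield (left, right), edges_out, edges_in

tree = {}   # the current local tree, as a child -> parent mapping
trees = []  # a copy of each local tree, kept only so we can inspect them
for interval, edges_out, edges_in in edge_diffs(edges, 100):
    for _, _, parent, child in edges_out:
        del tree[child]
    for _, _, parent, child in edges_in:
        tree[child] = parent
    trees.append((interval, dict(tree)))
    print(interval, f"removed {len(edges_out)} edge(s), added {len(edges_in)} edge(s)")
```

Moving from the first tree to the second touches only one edge out and one edge in, while the other four edges are shared between the two trees.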

In [None]:
import tskit_arg_visualizer

# Viz 1: full graph
d3arg = tskit_arg_visualizer.D3ARG.from_ts(ts_low_recomb, ignore_unattached_nodes=True)
d3arg.draw(width=600, height=300)
    
# Viz 2: local trees
display(SVG(ts_low_recomb.draw_svg(y_axis=True, size=(1500, 300))))

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Have a play with the interactive visualiser (you can drag the nodes), and in particular, mouse over the bars at the bottom. Make sure you understand how the tree-by-tree view underneath corresponds to the interactive graph view. Then try the following 2 modifications, and rerun the code:
    <ol>
    <li>Use the <code>ts_high_recomb</code> instead of the <code>ts_low_recomb</code></li>
    <li>Plot a <em>fully simplified</em> version of <code>ts_high_recomb</code>, by applying <code>.simplify(filter_nodes=False)</code></li>
    </ol>
Finally, in the cell below print the total number of trees along the genome in the original and fully simplified high recombination example. You should see that full simplification can actually change the number of local trees.</dd>
</dl>

In [None]:
# Exercise 2: use this cell to print out the number of trees in the original and fully simplified high recombination tree sequence


In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("high_rec_questions")

### Simplifying recombinant genealogies

When we simplified a tree sequence containing just a single tree, the possibilities were relatively simple. We could either *sample-resolve* (which removed edges), or *fully simplify* (which replaced edges to remove pass-through nodes). Simplifying a genealogical graph is much more complex, because:

1. Edges may not be completely removed, but instead be shortened or split up
2. Nodes may appear to be pass-through in a local tree, but have multiple children or multiple parents in the graph.

In the exercise above you saw the effect of *full simplification*, but this is the most extreme of (at least) 5 levels of simplification that can be performed:

* **Level 1** Sample-resolve only: i.e. do not remove nodes from the graph but just remove edges or change edge spans to trace only the ancestry of the chosen sample nodes (this is what happens when simplifying with `keep_unary=True`)
* **Level 2** As above but also remove pass-through graph nodes (the result is sometimes known as a "full ARG")
* **Level 3** As above but also remove "diamonds" and "super diamonds"
* **Level 4** As above but also remove any node not associated with a local coalescence.
* **Level 5** Fully simplify: as above and remove all non-coalescent segments. This is the default level of `simplify` (when `keep_unary=False`) and was the result of the last part of the exercise above.
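
The distinction between nodes that are pass-through *in the graph* and nodes that merely look unary in one local tree can be made concrete with a small sketch. This is a hypothetical pure-Python helper, not part of _tskit_ or the workshop module: it treats a non-sample node with exactly one parent and one child across all edges as pass-through, and a node with more than one parent as a recombination node.

```python
def classify_nodes(edges, samples):
    """Return (pass_through, recombinant) node ids from (left, right, parent, child) edges."""
    parents, children = {}, {}
    for left, right, parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        children.setdefault(parent, set()).add(child)
    nodes = set(parents) | set(children)
    pass_through = sorted(
        u for u in nodes
        if u not in samples
        and len(parents.get(u, ())) == 1
        and len(children.get(u, ())) == 1
    )
    recombinant = sorted(u for u in nodes if len(parents.get(u, ())) > 1)
    return pass_through, recombinant

# Hypothetical ARG: samples 0 and 1; node 3 is a recombinant of nodes 4 and 5
edges = [
    (0, 100, 2, 0),
    (0, 100, 3, 2),
    (0, 50, 4, 3), (50, 100, 5, 3),
    (0, 100, 4, 1),
    (0, 100, 6, 4), (0, 100, 6, 5),
]
pass_through, recombinant = classify_nodes(edges, samples={0, 1})
print("Pass-through in the graph:", pass_through)
print("Multiple parents (recombination):", recombinant)
```

Node 4 is unary in the local trees to the right of position 50 (where its only child is sample 1), but it has two children in the graph, so it is not pass-through and would survive to level 2.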

To investigate what's going on, we'll focus on the first section of the genome of the high recombination example, from position 0 to 52. We can do this by using the [keep_intervals](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.keep_intervals) function. We'll also highlight the recombination nodes in red.

In [None]:
from IPython.display import HTML
height = 250
def arg_parents(ts):
    # NB: this is inefficient for large ARGs, and we plan to make a more efficient version
    parents = {u: set() for u in range(ts.num_nodes)}
    for tree in ts.trees():
        for u in tree.nodes():
            parents[u].add(tree.parent(u))
    return parents

ts_level1 = ts_high_recomb.keep_intervals([[0, 52]], simplify=False).trim()  # Fully simplifies by default, which we don't want yet
display(HTML(
    "<h3>1. Sample-resolve only</h3>"
    "(Keeps all nodes that are ancestral to 0, 4, and 10, even if they are pass-through in the ARG)"
))
l1arg = tskit_arg_visualizer.D3ARG.from_ts(ts_level1, ignore_unattached_nodes=True)
# make nodes with multiple parents red
l1arg.set_node_styles([{"id": u, "fill": "#FF0000"} for u, p in arg_parents(ts_level1).items() if len(p) > 1])
l1arg.draw(height=height)

# NB: Tskit doesn't currently have an easy way to simplify to levels 2, 3, and 4, so we've provided a separate
# `ARG_workshop.partial_simplify()` function, which defaults to level 2. We'll also keep the root node 73
ts_level2 = ARG_workshop.partial_simplify(ts_level1, keep_input_roots=True, filter_nodes=False)
display(HTML(
    "<h3>2. 'Full ARG' (pass-through ARG nodes removed)</h3>"
    "(retains the same ARG topology and is hence a more efficient way of recording the genetic genealogy, "
    "although individuals associated with the removed nodes are lost, and therefore the pedigree cannot be recovered)"
))
l2arg = tskit_arg_visualizer.D3ARG.from_ts(ts_level2, ignore_unattached_nodes=True)
# make nodes with multiple parents red
l2arg.set_node_styles([{"id": u, "fill": "#FF0000"} for u, p in arg_parents(ts_level2).items() if len(p) > 1])
l2arg.draw(height=height)

# No diamonds/super-diamonds in this example, so go to level 4
ts_level4 = ARG_workshop.partial_simplify(ts_level2, remove_non_coalescent_nodes=True, keep_input_roots=True, filter_nodes=False)
display(HTML(
    "<h3>4. Remove non-coalescent nodes</h3>"
    "(removes the red recombination nodes and common-ancestor-only nodes; splits resulting in multiple parents are moved "
    "downwards to the nearest coalescent node)"))
tskit_arg_visualizer.D3ARG.from_ts(ts_level4, ignore_unattached_nodes=True).draw(height=height)

ts_level5 = ts_level4.simplify(keep_input_roots=True, filter_nodes=False)
display(HTML(
    "<h3>5. Full simplification</h3>"
    "(remap edges so that all nodes are locally coalescent, i.e. remove 'locally unary' segments; "
    "appears to make the graph more complex)"))
tskit_arg_visualizer.D3ARG.from_ts(ts_level5, ignore_unattached_nodes=True).draw(height=height)


(Since there are no diamonds or super diamonds to remove in this ARG, we skipped level 3 above)

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("simplification_nodes")

Here are the equivalent plots of the local trees. Note that we have used the [draw_svg()](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.draw_svg) `style` parameter to colour the recombination nodes in red: the [visualization tutorial](https://tskit.dev/tutorials/viz.html) gives more information about the available viz options.

In [None]:
height = 180
re_nodes = [u for u, p in arg_parents(ts_level1).items() if len(p) > 1]
red_re_node_style = ",".join([f".n{u}>.sym" for u in re_nodes]) + "{fill:red}"

display(HTML("<h3>1. Sample-resolve only</h3>"))
display(SVG(ts_level1.draw_svg(size=(600, height), style=red_re_node_style)))

display(HTML("<h3>2. Full ARG</h3>"))
display(SVG(ts_level2.draw_svg(size=(600, height), style=red_re_node_style)))

display(HTML("<h3>4. Remove non-coalescent nodes</h3>"))
display(SVG(ts_level4.draw_svg(size=(600, height))))

display(HTML("<h3>5. Full simplification</h3>"))
display(SVG(ts_level5.draw_svg(size=(600, height))))

Observe that the level 5 plot appears to be the simplest, although the associated "fully simplified" graph appeared more complex. This illustrates the counterintuitive (and little remarked) fact that removing nodes from local trees can actually increase the complexity of an ARG (and also increase the storage space required). 

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("locally_coalescent")

It should be clear that the sequence of local trees which only contains coalescent nodes has lost substantial information, especially about recombination. Another way to put this is that nodes which appear to be unary in the local trees ("locally unary" nodes) contain important information about the passage of history. To explore this point, we'll show the full ARG again, both as a graph and as a set of local trees:

In [None]:
display(HTML("<h3>2. Full ARG</h3>"))
l2arg.draw(height=250)
SVG(ts_level2.draw_svg(size=(600, height), style=red_re_node_style))

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Move your pointer over the black bars at the bottom of the graph view, to see how each breakpoint in the ARG corresponds to a switch in the ancestral path taken by a piece of genome.</dd>
</dl>

### SPR moves

In a full ARG the "locally unary" nodes indicate the exact transformation to turn one local tree into an adjacent one. For instance, we can make the second tree from the first tree by taking the branch above node 15 and changing its parent from node 38 to node 58. Likewise we can make the third tree from the second by taking the branch above 23 and changing its parent from 25 to 58. This is known as a "Subtree Prune and Regraft" (SPR) operation. As long as each recombination event in the ancestry happens at a new location, local trees will differ from each other by a single SPR: locally unary nodes provide information about where in the genetic genealogy this takes place. If we fully simplify the ARG, we retain information about the location of the breakpoints along the genome, but lose track of when in time the recombination occurred.
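
When locally unary nodes are retained, an SPR between adjacent trees amounts to changing the parent of a single node. The sketch below shows this on a child-to-parent mapping; the `spr` function and the node ids are hypothetical and do not correspond to the ARG plotted above.

```python
def spr(tree, node, new_parent):
    """Return a copy of the child -> parent map with `node` regrafted below `new_parent`."""
    if node not in tree:
        raise ValueError("Cannot prune the root")
    # The regraft point must not lie inside the subtree being pruned
    u = new_parent
    while True:
        if u == node:
            raise ValueError("Cannot regraft a subtree onto itself")
        if u not in tree:
            break
        u = tree[u]
    new_tree = dict(tree)
    new_tree[node] = new_parent
    return new_tree

# Hypothetical local tree: samples 0, 1, 2; internal nodes 4, 5; root 6
first_tree = {0: 4, 1: 4, 2: 5, 4: 6, 5: 6}
second_tree = spr(first_tree, node=1, new_parent=5)
print(second_tree)

changed = {u for u in first_tree if first_tree[u] != second_tree[u]}
print("Nodes whose parent changed:", changed)
```

After the move, node 4 has a single child in the second tree: it has become locally unary, which is exactly the kind of node that records where the regraft happened.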

The transition from one local tree to another along the genome is a key process in the ARG literature. For example, a number of key algorithms approximate ARGs by a process known as the sequential Markov coalescent (SMC or SMC'), in which the next local tree along the genome can be constructed only by consideration of the current local tree (in other words, that the left-to-right generation of local trees can be treated as Markovian). This essentially throws away "long range correlations" across the genome. We will return to this in later workbooks.

<div class="alert alert-block alert-info"><b>Note:</b> A major issue in ARG inference is that these locally unary nodes (e.g. the recombination nodes) are not knowable from the topology of the local trees, so their timing and exact position is essentially unknowable in real datasets. Comprehensive ARG inference techniques may thus need to integrate over all compatible recombination node timings.</div>

## Backward-time simulation

In the same way that we were able to simulate coalescent trees in reverse time, we can also simulate recombinant genealogies in reverse time. In fact, ARGs were originally proposed as a backward-time representation of recombinant ancestry (forward-time, or "prospective" ARGs are a rather novel concept, and have only recently become popular as a result of coupling the fast forward-time _SLiM_ simulator with _tskit_, see [this paper](https://doi.org/10.1093/genetics/iyae100)).

The usual backward-time model that generates an ARG is known as the coalescent with recombination (CwR). The approach has the same advantages as simulating a single coalescent tree backward in time. Indeed, although the local trees in an ARG are highly non-independent, each local tree on its own can be treated as a sample from the standard coalescent.

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("backward_time")

An efficient algorithm for running the CwR was introduced by Hudson (1983), and implemented in his `ms` software. It was later reimplemented more efficiently in the [_msprime_](https://tskit.dev/software/msprime.html) simulator. _Msprime_ can be used to simulate genealogies of thousands or millions of whole genomes in fractions of a second.

To simulate a genealogy using msprime, you call `msprime.sim_ancestry()` with a number of (usually diploid) individuals, a population size (or [demography](https://tskit.dev/msprime/docs/stable/demography.html)), a sequence length, and a recombination rate or genetic map. The result is a standard tree sequence:

In [None]:
# Simulate a megabase of genome
import msprime
ts = msprime.sim_ancestry(1000, population_size=1e4, sequence_length=1e6, recombination_rate=1e-8, random_seed=1)
print(f"Simulated {ts.num_samples} genomes: created {ts.num_trees} trees. Tree sequence takes up {ts.nbytes/1e6} megabytes")

<dl class="exercise"><dt>Exercise 4</dt>
<dd>In the space below, repeat the code in the cell above, simulating a million genomes instead of 2000. Also add the <code>%time</code> "magic command" at the start of the first line, which will report how long the simulator takes</dd>
</dl>

In [None]:
# Exercise: prefix the call with the "magic" %time command to see how long it takes to simulate a genealogy of a million 1Mb human genomes


In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("large_sim")

### Simplified vs full ARG simulations

By default, msprime produces *fully simplified* tree sequences. As we have seen, this means that the time and location of recombination events is not precisely recorded. For many purposes, this is not a problem, but it does mean, for example, that we cannot calculate the full likelihood of the ARG under a given model of recombination.

To carry out "full ARG" simulations, we can use the `record_full_arg` option of msprime. Note that for technical reasons involving likelihood calculations, this outputs a slightly unusual tree sequence where each recombination is represented by two nodes (the two children). Here's a small example:

In [None]:
from IPython.display import SVG 
import msprime
import tskit_arg_visualizer
ts = msprime.sim_ancestry(4, population_size=100, record_full_arg=True, sequence_length=10000, recombination_rate=1e-6, random_seed=2)
print(f"Simulated full ARG of {ts.num_samples} genomes. There are {ts.num_trees} local trees")
ts.draw_svg(time_scale="log_time", size=(1200, 300))

The _tskit_arg_visualizer_ can plot these _msprime_ full ARGs using an "orthogonal" edge style that works slightly better for small ARGs (as it avoids diagonal lines).

In [None]:
d3arg = tskit_arg_visualizer.D3ARG.from_ts(ts)
d3arg.draw(width=800, height=500, edge_type="ortho", sample_order=[1, 0, 7, 2, 3, 6, 4, 5])

Above, it is easy to see examples of "diamonds" where, backward in time, two lineages split then immediately join again. There can be a surprisingly large number of these in a full ARG, and they are essentially undetectable. The same goes for "super-diamonds", in which there is a single edge going into a cluster of nodes, and a single edge going out again.
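
As a rough illustration of what a diamond looks like in the edge list, the sketch below searches for the simplest case: a node with exactly two parents, where each of those parents has that node as its only child, and both share the same single parent. The function `find_diamonds` and the edge list are hypothetical, and this toy check does not attempt to find super-diamonds.

```python
def find_diamonds(edges):
    """Return (bottom, side_a, side_b, top) tuples for simple diamonds in the edge list."""
    parents, children = {}, {}
    for left, right, parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        children.setdefault(parent, set()).add(child)
    diamonds = []
    for bottom, sides in parents.items():
        if len(sides) != 2:
            continue
        a, b = sorted(sides)
        above_a = parents.get(a, set())
        above_b = parents.get(b, set())
        if (
            children[a] == {bottom} and children[b] == {bottom}  # sides lead only to `bottom`
            and len(above_a) == 1 and above_a == above_b         # and rejoin immediately
        ):
            diamonds.append((bottom, a, b, next(iter(above_a))))
    return diamonds

# Hypothetical ARG: the two lineages above node 0 rejoin at node 3 (a diamond),
# whereas the two lineages above node 5 go to different ancestors (not a diamond)
edges = [
    (0, 30, 1, 0), (30, 100, 2, 0), (0, 30, 3, 1), (30, 100, 3, 2),
    (0, 60, 6, 5), (60, 100, 7, 5), (0, 60, 3, 6), (60, 100, 8, 7),
    (0, 100, 8, 3),
]
diamonds = find_diamonds(edges)
print(diamonds)
```

In the diamond, the local trees on either side of position 30 have the same topology once nodes 1 and 2 are removed, which is why such events leave no trace in the data.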

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("diamonds")

Diamonds and super-diamonds are also removed by simplification. As this removes the associated breakpoints from the ARG, it thus decreases the number of breakpoints (and trees), as we can see by comparing the simplified and unsimplified tree sequences.

In [None]:
fully_simplified_ts = ts.simplify()
print(f"{ts.num_trees} trees in the full ARG, {fully_simplified_ts.num_trees} in the simplified ARG")
fully_simplified_ts.draw_svg(time_scale="log_time")  # 3 fewer trees than the unsimplified version

As you can see, the fully simplified trees are simpler to inspect than the full ARG trees. In fact, as the sequence length increases, the number of nodes in a full ARG goes up much more quickly than in the equivalent simplified version. The ARG is quickly dominated by essentially unknowable nodes, which end up slowing down simulation and analysis. This is an important reason why full ARG simulations are not the default output from _msprime_ (note the log scale below):

In [None]:
import msprime
import matplotlib.pyplot as plt
full_arg = []
simplified = []
x = (1000, 2000, 5000, 10000, 20000, 50000, 100000)
for seq_len in x:
    ts = msprime.sim_ancestry(10, record_full_arg=True, population_size=1e4, sequence_length=seq_len, recombination_rate=1e-6, random_seed=1)
    full_arg.append(ts.num_nodes)
    simplified.append(ts.simplify().num_nodes)
plt.plot(x, full_arg, label="Full ARG")
plt.plot(x, simplified, label="Simplified ARG")
plt.xlabel("Sequence length")
plt.ylabel("Number of nodes")
plt.xscale("log")
plt.yscale("log")
plt.legend();

## A tree sequence is a summary of underlying events

TODO: Explain why a tree sequence takes each node as representing a genome, rather than an event. You do not need to know the precise details of all the underlying events.

In the original ARG proposal by Griffiths (1992) each internal node represented a single event: either a common ancestor event or a recombination event.

NB: we have not mentioned hidden recombination events and "trapped material".

## Large ARGs

As you can see, it is very easy to create large ARGs of thousands or millions of nodes. In these cases, the plots we have seen so far become useless. Instead, we must visualise the ARG using various other genealogical summaries.

### Edge plots
As with a single tree, we can simply plot the edges. In most cases we will want to focus on coalescence points, and therefore want to plot the time of each edge parent from a fully simplified tree sequence. However, as there is recombination, edges will not span the entire genome. In fact, as the edge parents get older, we would expect to find edges getting shorter and shorter, as indeed tends to happen.

In [None]:
# Make a large tree sequence with many trees
import msprime
ts = msprime.sim_ancestry(500, population_size=1e4, sequence_length=1e6, recombination_rate=1e-8, random_seed=1)
print(f"Simulated {ts.num_samples} samples, {ts.num_trees} trees")
ARG_workshop.edge_plot(ts, alpha=0.1, plot_hist=True, width=20)

### Coalescent density heatmaps

Above, the plotted edges represent coalescence points. However, in the previous workbook we saw that the *pairwise* coalescent rate (i.e. coalescent nodes weighted by the number of pairs that go through that point) can be more helpful, as this is not expected to be dependent on sample size. The pairwise rates can be calculated using `pair_coalescence_rates`, and plotted as a heatmap.
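
To see what "weighted by the number of pairs" means for a single local tree, here is a small pure-Python sketch. It assumes that a node whose children subtend $k_1, k_2, \ldots$ samples is the most recent common ancestor of $\sum_{i<j} k_i k_j$ sample pairs; the tree and the helper names are hypothetical, and every leaf is treated as a sample.

```python
from itertools import combinations

def samples_below(children, node):
    """Count the leaves (treated here as samples) descending from `node`."""
    kids = children.get(node, [])
    if not kids:
        return 1
    return sum(samples_below(children, c) for c in kids)

def pairs_coalescing(children, node):
    """Number of sample pairs whose most recent common ancestor is `node`."""
    counts = [samples_below(children, c) for c in children.get(node, [])]
    return sum(a * b for a, b in combinations(counts, 2))

# Hypothetical 4-sample tree: ((0,1)4,(2,3)5)6
children = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
weights = {u: pairs_coalescing(children, u) for u in children}
print(weights)
print("Total pairs:", sum(weights.values()))
```

The root is a single coalescence point but accounts for four of the six sample pairs, so it carries four times the weight of either of the younger nodes.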

In [None]:
def pair_coalescence_rates(ts, time_breaks=None, window_breaks=None):
    # NB: in the next tskit release (0.5.9), there will be an API change such that
    # this function will be directly available as `ts.pair_coalescence_rates(time_breaks)`
    d = ts.coalescence_time_distribution(window_breaks=window_breaks, weight_func="pair_coalescence_events")
    return d.coalescence_rate_in_intervals(np.array(time_breaks))

def plot_pair_rates(ts, genomic_windows, num_log_timebins):
    time_breaks = np.logspace(0, np.log10(ts.max_time), num_log_timebins)
    rates = pair_coalescence_rates(ts, window_breaks=genomic_windows, time_breaks=time_breaks)
    fig, ax = plt.subplots(1, figsize=(12.5, 3))
    im = ax.pcolormesh(genomic_windows, time_breaks, rates[0])
    ax.set_yscale("log")
    bar = plt.colorbar(im, ax=ax)
    bar.ax.set_ylabel('pairwise coalescent density', labelpad=10, rotation=270)

In [None]:
plot_pair_rates(ts, np.arange(0, 1e6, 5e4), 10)

Note that the oldest time bins contain rather few unique coalescence points, and so are likely to be particularly noisy.

### Signatures of demography or selection

In comparison, below is what you might expect to see under a population expansion or selective sweep. This also demonstrates the power of the _msprime_ simulator to incorporate demography and simple models of selection. We will encounter this again in future workbooks.

In [None]:
# Selective sweep: see https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-models-selective-sweeps
Ne = 1e4
L = 1e6  # Length of simulated region

# define hard sweep model
sweep_model = msprime.SweepGenicSelection(
    position=L / 2,  # sweep is focused on the middle of the chromosome
    start_frequency=1.0 / (2 * Ne),
    end_frequency=1.0 - (1.0 / (2 * Ne)),
    s=0.01,
    dt=1e-6,
)
ts = msprime.sim_ancestry(
    500,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=Ne,
    recombination_rate=1e-8,
    sequence_length=L,
    random_seed=1
)
ARG_workshop.edge_plot(ts, alpha=0.1, plot_hist=True, width=20)
plot_pair_rates(ts, np.arange(0, 1e6, 5e4), 10)

The selective sweep above is in the middle of the genome, and results in a peaked coalescent density at a particular location and time, with a corresponding lack of coalescences at older times. In general, genealogical patterns that affect some areas of the genome and not others are indicative of selection rather than demography.

In [None]:
# Expansion
demography = msprime.Demography.isolated_model([10000], growth_rate=[0.001])
ts = msprime.sim_ancestry(500, demography=demography, sequence_length=1e6, recombination_rate=1e-8, random_seed=1)
ARG_workshop.edge_plot(ts, alpha=0.1, plot_hist=True, width=20)
plot_pair_rates(ts, np.arange(0, 1e6, 5e4), 10)

The timescale is much compressed (it maxes out at 1000 rather than 100000 generations, and there are many fewer very old coalescence points). The inverse of the pairwise coalescence rates (the IICR) gives an estimate of population size over time in this simple model, which we can also plot:
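
As a rough guide to what the inverse rates should look like, the sketch below works through the arithmetic under two assumptions: that a pair of genomes in a diploid population of size $N$ coalesces at rate $1/(2N)$ per generation, and that exponential growth at rate $r$ corresponds to a size of $N_0 e^{-rt}$ at $t$ generations ago. The function names are hypothetical, and the default values match the expansion model above.

```python
import math

def size_at(time_ago, initial_size=10_000, growth_rate=0.001):
    """Diploid population size `time_ago` generations in the past (assumed exponential model)."""
    return initial_size * math.exp(-growth_rate * time_ago)

def expected_pair_rate(diploid_size):
    """Expected per-generation coalescence rate for a pair of genomes."""
    return 1 / (2 * diploid_size)

for t in (0, 1000, 2000):
    rate = expected_pair_rate(size_at(t))
    print(f"{t:>5} generations ago: rate = {rate:.2e}, 1/rate = {1 / rate:.0f} haploid genomes")
```

The inverse rate is twice the diploid size, which is why it is read as a haploid population size, and it shrinks going back in time because the population was smaller in the past.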

In [None]:
num_log_timebins = 20
time_breaks = np.logspace(0, np.log10(ts.max_time), num_log_timebins)
rates = pair_coalescence_rates(ts, time_breaks=time_breaks)
plt.stairs(1/rates[0].flatten(), time_breaks)
plt.xscale("log")
plt.ylabel("Estimated (haploid) population size");
plt.xlabel(f"Time ({ts.time_units} ago)");

<dl class="exercise"><dt>Exercise 5</dt>
<dd>Run the cells above but with different seeds, to get a feel for the random variation between different ARGs simulated under the same model</dd>
</dl>