## Setup

To access material for this workbook please execute the notebook cell immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). This can be skipped if you are running this notebook locally and have already installed all the necessary packages. 

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    raise ValueError("Can't run this notebook in JupyterLite. Try colab")
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

# Workbook 2.F: _sc2ts_ analysis playground

This lab session is designed for you to experiment with your own datasets, if you have them. If not, you can either:

1. work with another attendee with an interesting dataset
2. have a play with the SARS-CoV2 ARG below. Some suggested analyses are described.

## Case study: Covid genealogies

Covid sequencing datasets consist of large numbers of genomes, known as "strains", that are sampled at multiple points through time. This can involve hundreds or thousands of genomes every day over 2 or 3 years. There will be occasional recombination events between different strains produced when one individual is co-infected with different strains. Most of these do not result in many descendant recombinant strains, but occasionally some spread widely. Standard representations of SARS-CoV2 genealogies do not account for recombination, but represent viral history as a single tree. As more and more recombinant strains emerge, a single tree becomes inappropriate.

However, inferring a large-scale recombinant viral genealogy is non-trivial. Expected genealogical patterns are very different from the coalescent-with-recombination, so that standard inference tools such as _SINGER_ or _tsinfer_ are unlikely to give reasonable results. Nevertheless, the concept behind _tsinfer_ can be used to efficiently construct large covid genealogies. In particular, we can consider older samples as potential ancestors for younger samples, using the Li & Stephens HMM algorithm to treat samples as putative recombinant mosaics of all older samples. This is the rationale behind the _sc2ts_ algorithm, described in [this preprint](https://www.biorxiv.org/content/10.1101/2023.06.08.544212v1) (an extra step builds all samples collected on a single day into a parsimonious tree where possible). The result is a tree sequence with many samples at different times in the past, and many samples which act as ancestors for other samples, as well as some extra ancestral nodes inserted by the algorithm (e.g. recombination nodes).
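
To make the "recombinant mosaic" idea concrete, here is a minimal sketch of a Li & Stephens-style Viterbi pass in plain Python. It is not the _sc2ts_ implementation: the function name `best_copying_path`, the toy sequences, and the mismatch and switch penalties are all assumptions chosen purely for illustration.

```python
def best_copying_path(query, panel, mismatch_cost=1.0, switch_cost=2.0):
    """Return (cost, path) for the cheapest way to copy `query` from `panel`.

    path[i] is the index of the panel sequence copied at site i. The costs are
    illustrative penalties (assumptions), not calibrated probabilities.
    """
    n_ref = len(panel)
    # cost[j] = cheapest cost so far if we are copying from panel[j] at the current site
    cost = [mismatch_cost * (panel[j][0] != query[0]) for j in range(n_ref)]
    back = []  # back[i - 1][j] = state at site i - 1 that leads to state j at site i
    for i in range(1, len(query)):
        best_prev = min(range(n_ref), key=lambda j: cost[j])
        new_cost, pointers = [], []
        for j in range(n_ref):
            stay = cost[j]
            switch = cost[best_prev] + switch_cost  # "recombine" from the best state
            prev, c = (j, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(c + mismatch_cost * (panel[j][i] != query[i]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final state
    state = min(range(n_ref), key=lambda j: cost[j])
    total = cost[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return total, path


# Two toy "older samples" and a query that looks like a mosaic of them
older_samples = ["AAAAAAAAAA", "CCCCCCCCCC"]
query = "AAAAACCCCC"
total_cost, path = best_copying_path(query, older_samples)
print(total_cost, path)
```

With these toy inputs the cheapest explanation is a single switch between the two older sequences halfway along, which is the kind of pattern that would be recorded as a recombination event.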

Running _sc2ts_ on millions of genomes takes considerable computing resources. In this workbook, we will simply analyse a tree sequence produced from a recent run of the _sc2ts_ algorithm on half a million SARS-CoV2 genomes.

### Data issues
We have found that the main issues when building such an ARG involve data quality. In particular, the following adversely affect inference:
1. inaccurate collection dates
2. sequencing with high error rates (which could be mistaken for recombinants)
3. accidentally mixing strains when sequencing.

Since the publication of the preprint, we have rerun the inference method using filtered, higher-quality data from the [Viridian](https://www.biorxiv.org/content/10.1101/2024.04.29.591666v1) project, which includes about a quarter of all sequenced covid genomes. A preliminary unpublished ARG containing genomes sequenced up to March 2021 is available in `data/viridian-full-mm_3-mhc_5-mgs_20-2021-03-21.tsz`. This has not been through a final quality-control (QC) process: as we shall see it contains some data problems, which serve as a useful example of how to detect QC issues.

For constructing the ARG, a stringent inclusion criterion was used. In particular, strains that have a high HMM cost (i.e. cannot easily fit into the genealogy, even allowing for recombination) are excluded. Also sites known to have high mutation rates, especially those associated with reversions or recurrent mutations, are excluded.

This means that the ARG will only contain a fraction of all collected SARS-CoV2 sequences. In addition, the ARG stops at the end of March 2021, when relatively few recombinants were circulating (e.g. the _omicron_ wave had not taken off). Analysing later data, with an expected large number of recombinants, is a full research project.

In [None]:
import tszip
ts = tszip.decompress("data/viridian-full-mm_3-mhc_5-mgs_20-2021-03-21.tsz")
print(
    f"Loaded {ts.nbytes/1e6:0.1f} megabyte SARS-CoV2 genealogy of {ts.num_samples} strains",
    f"({ts.num_trees} trees, {ts.num_mutations} mutations over {ts.sequence_length} basepairs).",
    f"Last collection date is {ts.node(ts.samples()[-1]).metadata['date']}",
)
print(f"\nFirst few nodes are\n  * {ts.node(0)}\n  * {ts.node(1)}\n  * {ts.node(2)}")

regions={"Spike": (21563, 25384), 'ORF8': (27894, 28259)}  # a few locations on the covid genome

# Create mappings between  "strain" (actually ENA accession #) and node ID
strain_to_id = {ts.node(u).metadata['strain']: u for u in ts.samples()}
id_to_strain = [node.metadata.get('strain', '') for node in ts.nodes()]

### Available metadata

As you can see above, the sample nodes (flags=1 or an odd number) have extensive metadata. Using this is often an integral part of analysis, so it's worth checking out what's available, e.g. by printing the metadata for the last node in a nicer format:

In [None]:
import json
last_sample = ts.node(ts.samples()[-1])
print(json.dumps(last_sample.metadata, indent=2))

As well as the `date` and `Country`, the `Viridian_scorpio` and `Viridian_pangolin` items look useful, to classify the strains into different groups known as Covid "variants", e.g. as assigned by the [PANGOLIN](https://en.wikipedia.org/wiki/Phylogenetic_Assignment_of_Named_Global_Outbreak_Lineages) project. In order to avoid confusion with the normal use of the term "variant" in genetics, we'll mostly refer to these as "Variant of Concern" (VoC) groups, such as _Alpha_, _Omicron_, etc. Let's see how many different groups there are in our ARG:

In [None]:
from collections import Counter
# NB decoding the metadata for all samples might take a number of seconds
samples_country = []
samples_viridian_pangolin = []
samples_viridian_scorpio = []
for u in ts.samples():
    md = ts.node(u).metadata
    samples_country.append(md.get("Country", ""))
    samples_viridian_pangolin.append(md["Viridian_pangolin"])
    samples_viridian_scorpio.append(md["Viridian_scorpio"])
print(f"{len(set(samples_viridian_pangolin))} Viridian_pangolin categories (e.g. `{last_sample.metadata['Viridian_pangolin']}`)")
print(f"{len(set(samples_viridian_scorpio))} Viridian_scorpio categories (e.g. `{last_sample.metadata['Viridian_scorpio']}`)")

In [None]:
# since there are only a few Viridian_scorpio categories, let's have a look
print("Viridian_scorpio categories are:")
scorpio_counts = Counter(samples_viridian_scorpio)
scorpio_counts

The `'.'` class seems to be cases where no `Viridian_scorpio` is designated. In many cases we might hope these will have a `Viridian_pangolin` classification: let's check.

In [None]:
i = 0
for u, scorpio, pangolin in zip(ts.samples(), samples_viridian_scorpio, samples_viridian_pangolin):
    if scorpio == '.':
        print(f"Node {u}: Viridian_scorpio='.', Viridian_pangolin={pangolin}")
        i += 1
        if i > 10:
            break

### Number of samples, collection dates

We can check on the number of samples collected on different days, which averages in the thousands per day once we get into 2021. There's a big spike just before the new year, which is probably indicative of data problems (as we shall see later).

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import datetime
dates = []
for u in ts.samples():
    dates.append(datetime.datetime.fromisoformat(ts.node(u).metadata['date']))
fig, ax = plt.subplots(figsize=(15, 5))
ax.hist(dates, bins=(np.max(dates) - np.min(dates)).days)
ax.set_ylabel("Number of samples collected per day");

In [None]:
# Break this down into different categories, and cut off the histogram so it isn't dominated by the big spike
from collections import defaultdict
import matplotlib as mpl
dates_by_voc = defaultdict(list)
for u in ts.samples():
    md = ts.node(u).metadata
    dates_by_voc[md['Viridian_scorpio']].append(datetime.datetime.fromisoformat(md['date']))
dates_by_voc = {k: dates_by_voc[k] for k in sorted(dates_by_voc.keys(), key=lambda x: -scorpio_counts[x])}

fig, ax = plt.subplots(figsize=(15, 5))
data = list(dates_by_voc.values())
labels = list(dates_by_voc.keys())
# list of indexes into the matplotlib tab20 colourmap, 0, 2, 4, ... 1, 3, 5, ... 20, 21, 22 ...
cm_indexes = np.concatenate((np.arange(0, 20, 2), np.arange(1, 20, 2), np.arange(20, len(data))))
scorpio_colors = {k: mpl.colors.rgb2hex(mpl.colormaps['tab20'](i)) for k, i in zip(labels, cm_indexes)}
ax.hist(data, bins=(np.max(dates) - np.min(dates)).days, stacked=True, label=labels, color=scorpio_colors.values())
legend = ax.legend(ncol=2, loc="upper left")
legend_handles, legend_labels = ax.get_legend_handles_labels()
ax.set_ylabel("Number of samples collected per day");
ax.set_ylim(0, 5000);

You can see that this ARG mostly only covers the Covid _Alpha_ wave. It is hard to plot out a tree with half a million samples, so we'll simplify it by taking the first sample of each identified pangolin lineage:

In [None]:
first_sample_of_pango = []
alpha_strains = set()
countries = Counter()
seen_pango = set()
for i, u in enumerate(ts.samples()):
    if samples_viridian_pangolin[i] not in seen_pango:
        md = ts.node(u).metadata
        first_sample_of_pango.append(u)
        countries[md.get('Country', '')] += 1  # some samples may lack a Country field
        if md['Viridian_scorpio'].startswith("Alpha"):
            alpha_strains.add(md["strain"])
        seen_pango.add(md['Viridian_pangolin'])
print(f"Found first sample of {len(first_sample_of_pango)} lineages ({len(alpha_strains)} identified as Alpha)")
countries

In [None]:
from IPython.display import SVG

small_ts = ts.simplify(first_sample_of_pango, update_sample_flags=False, keep_unary=True, keep_input_roots=True)
# Most samples are from the UK and USA, so use colours to label these to avoid excess labelling
node_labels = {}
uk_ids = []
usa_ids = []
alpha_ids = []
for nd in small_ts.nodes():
    country = nd.metadata.get("Country", "")
    if nd.metadata.get("strain") == "Wuhan/Hu-1/2019":
        node_labels[nd.id] = "Wuhan/Hu-1/2019"
    elif nd.metadata.get("strain") in alpha_strains:
        alpha_ids.append(nd.id)
        node_labels[nd.id] = country
    else:
        if country == "United Kingdom":
            uk_ids.append(nd.id)
        elif country == "USA":
            usa_ids.append(nd.id)
        else:
            node_labels[nd.id] = country
    if nd.metadata.get("date") == "2020-12-31":
        sampling_spike = nd.time
uk_style = ",".join([f".n{u}>.sym" for u in uk_ids]) + "{fill: green}"
usa_style = ",".join([f".n{u}>.sym" for u in usa_ids]) + "{fill: blue}"
alpha_style = ",".join([f".n{u}>.sym" for u in alpha_ids]) + f"{{fill: {scorpio_colors['Alpha (B.1.1.7-like)']}}}"
small_font_style = ".node > .lab {font-size: 10px}"
opaque_symbols_style = ".node>.sym{fill-opacity: 0.4}.node.sample>.sym{fill-opacity: 1}.mut .sym{stroke-opacity:0.4}"
rotate_tip_labels_style = ".leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}"

# Plot the spike in sampling as a dashed line, as it seems to cause big polytomies
tick_positions = np.unique(np.concatenate((np.arange(10, 480, 10),[sampling_spike])))
tick_style = ".y-axis .tick .grid {stroke: #DDDDDD}"  # Default gridline type
tick_style += f".y-axis .ticks .tick:nth-child({np.where(tick_positions==sampling_spike)[0][0]+1}) .grid {{stroke-dasharray: 4}}"

# Since most samples do not trace back to a recombination node, roughly the same tree will be seen everywhere
# We arbitrarily choose to show the tree in the middle of the spike region.
tree_loc = np.mean(regions["Spike"])
print("Tree in middle of spike: green = UK samples, blue = USA samples, orange = Alpha VoC")
SVG(small_ts.at(tree_loc).draw_svg(
    size=(2500, 800),
    mutation_labels={},
    node_labels=node_labels,
    y_axis=True,
    y_gridlines=True,
    y_ticks=tick_positions,
    style=uk_style + usa_style + alpha_style + small_font_style + tick_style + rotate_tip_labels_style + opaque_symbols_style
))

In [None]:
# Carry out the same plot, but colour by the Viridian_scorpio designation, to look for clusters
scorpio_nodes = defaultdict(list) 
for u in small_ts.samples():
    scorpio_nodes[small_ts.node(u).metadata['Viridian_scorpio']].append(u)
scorpio_styles = ".node > .sym {fill: transparent}"
for name, nodes in scorpio_nodes.items():
    scorpio_styles += ",".join([f".n{u}>.sym" for u in nodes]) + f"{{fill: {scorpio_colors[name]}}}"
fig, ax = plt.subplots(figsize=(20, 2))
# Add the legend to the new figure
ax.legend(legend_handles, legend_labels, ncols=6)
ax.axis('off')
plt.show()

SVG(small_ts.at(tree_loc).draw_svg(
    size=(1700, 800),
    omit_sites=True,
    node_labels=node_labels,
    y_axis=True,
    y_gridlines=True,
    y_ticks=tick_positions,
    style=scorpio_styles + small_font_style + tick_style + rotate_tip_labels_style
))

There are some oddities in this tree. The orange _Alpha_ variants appear to cluster as expected. However, there are two large unresolved clusters of _Omicron_ (purple) and _Delta_ (brown) that all appear in the large spike of samples taken on 2020-12-31, which are mostly UK samples and may be problematic. Historically, the earliest _Delta_ (brown) sample was found in India on 5 October 2020, but the variant only made it to the UK in Feb 2021, and the earliest _Omicron_ should be in November 2021. As these dates are well after the sampling spike, some of the _Omicron_ and _Delta_-labelled variants on 2020-12-31 are likely to be either misdated, or misclassified by Viridian. A simple solution would be to omit all samples on 2020-12-31.
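
One way to detect such a spike programmatically is to compare each day's sample count with the typical daily count, then drop the flagged days. The sketch below runs on invented toy dates rather than the ARG metadata; the function name `flag_date_spikes` and the threshold of 10 times the median are assumptions for illustration, not part of any QC pipeline described here.

```python
from collections import Counter
from statistics import median


def flag_date_spikes(dates, factor=10):
    """Return the dates whose sample count exceeds `factor` times the median daily count."""
    counts = Counter(dates)
    typical = median(counts.values())
    return sorted(d for d, n in counts.items() if n > factor * typical)


# Toy collection dates: three ordinary days and one implausibly large batch
toy_dates = (
    ["2020-12-29"] * 3 + ["2020-12-30"] * 4 + ["2020-12-31"] * 60 + ["2021-01-01"] * 5
)
suspicious = flag_date_spikes(toy_dates)
kept = [d for d in toy_dates if d not in suspicious]  # omit samples on flagged days
print(suspicious, len(kept))
```

On the real dataset the same idea could be applied to the `date` metadata field, although a fixed multiple of the median is a crude criterion and the threshold would need tuning.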

## Mutations

We can also look to see if some areas of the genome have seen more mutations:

In [None]:
# plot the mutation locations
fig, ax = plt.subplots(figsize=(15, 5))
# fastest to use the direct memory accessors, e.g. `ts.sites_position`, `ts.mutations_site`, etc.
y, _, _ = ax.hist(ts.sites_position[ts.mutations_site], bins=500)
for name, reg in regions.items():
    for x in reg:
        ax.axvline(x=x, c="orange")
    ax.text(np.mean(reg), 0.95 * y.max(), name, ha="center", c="orange", rotation=90)
ax.set_xlim(0, ts.sequence_length)
ax.set_title("Mutation locations along the SARS-CoV2 genome")
ax.set_xlabel("Genome position")
ax.set_ylabel("Density of mutations");

In [None]:
# Or instead, plot mutations weighted by number of descendants
fig, ax = plt.subplots(figsize=(15, 5))
pos = np.zeros(ts.num_mutations)
weights = np.zeros(ts.num_mutations)
for tree in ts.trees():
    for s in tree.sites():
        for m in s.mutations:
            pos[m.id] = s.position
            weights[m.id] = tree.num_samples(m.node)
y, _, _ = ax.hist(pos, weights=weights,  bins=500)
for name, reg in regions.items():
    for x in reg:
        ax.axvline(x=x, c="orange")
    ax.text(np.mean(reg), 0.95 * y.max(), name, ha="center", c="orange", rotation=90)
ax.set_xlim(0, ts.sequence_length)
ax.set_title("Mutation locations along the SARS-CoV2 genome")
ax.set_xlabel("Genome position")
ax.set_ylabel("Density of mutations, weighted by number of descendant samples");

### Recombination

The main advantage of having a _tskit_ SARS-CoV2 genealogy is that we can analyse suggested viral recombination events. Hopefully these will not be too badly affected by the QC issues associated with the 2020-12-31 sampling spike. First we'll identify recombination nodes by finding those with more than one parent. These have all been inserted by _sc2ts_ into the genealogy, and are not sample nodes.

In [None]:
# Recombination nodes have > 1 parent. They are usually inserted by sc2ts, rather than being samples themselves
import numpy as np
unique_parent_child_combo = np.unique([ts.edges_parent, ts.edges_child], axis=1)
child_ids = unique_parent_child_combo[1,:]
node_id, parent_count = np.unique(child_ids, return_counts=True)
recombination_nodes = {u: c for u, c in zip(node_id, parent_count) if c > 1}
print(ts.num_trees - 1, "breakpoints for", len(recombination_nodes), "recombination nodes")
print(len([v for v in recombination_nodes.values() if v == 2]), "RE nodes have exactly 2 parents")
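
The counting logic in the cell above can also be seen on a toy edge list, without loading the full ARG. This sketch uses plain tuples rather than tskit tables; the `(left, right, parent, child)` layout and the function name `find_multi_parent_nodes` are assumptions for illustration.

```python
from collections import defaultdict


def find_multi_parent_nodes(edges):
    """Map each child that has more than one distinct parent to its number of parents."""
    parents_of = defaultdict(set)
    for left, right, parent, child in edges:
        parents_of[child].add(parent)
    return {child: len(p) for child, p in parents_of.items() if len(p) > 1}


# Toy edges over a genome of length 100. Node 2 inherits [0, 40) from node 0 and
# [40, 100) from node 1, so it is a recombinant; node 3 has a single parent.
toy_edges = [
    (0, 40, 0, 2),
    (40, 100, 1, 2),
    (0, 100, 2, 3),
    (0, 100, 4, 0),
    (0, 100, 4, 1),
]
toy_recombinants = find_multi_parent_nodes(toy_edges)
print(toy_recombinants)
```

Collecting parents into a set per child plays the same role as the `np.unique(..., axis=1)` call: a child that has several edges to the same parent is only counted once.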

In [None]:
# find info for each recombinant: breakpoint position(s), parent IDs, descendant strain IDs, etc

recombination_info = {}
for u in recombination_nodes:
    d = {'parents': [], 'breaks': set(), 'descendant_samples': set()}
    edges = np.where(ts.edges_child == u)[0]
    # find a breakpoint and parents for each recombinant.
    # Visit edges from left to right and keep the parents in a list (a set would
    # lose the ordering) so that the first parent is the leftmost
    for e in sorted(edges, key=lambda e: ts.edge(e).left):
        edge = ts.edge(e)
        if edge.parent not in d['parents']:
            d['parents'].append(edge.parent)
        if edge.left != 0:
            d['breaks'].add(edge.left)
        if edge.right != ts.sequence_length:
            d['breaks'].add(edge.right)
    recombination_info[u] = d

# Set number of descendant samples. NB: future versions of tskit will have an efficient
# ARG API to do this quickly see https://github.com/tskit-dev/tskit/discussions/2869
for tree in ts.trees():
    for u in recombination_nodes:
        recombination_info[u]['descendant_samples'].update(tree.samples(u))

print(", ".join([f"Node {u}: breaks {info['breaks']}" for u, info in recombination_info.items()]))

All recombination nodes appear to be associated with only one breakpoint, so we can use a single breakpoint value for each:

In [None]:
assert all([len(info['breaks']) == 1 for info in recombination_info.values()])
breakpoints = [list(info['breaks'])[0] for info in recombination_info.values()]
fig, ax = plt.subplots(figsize=(15, 5))
y, _, _ = ax.hist(breakpoints, bins=30)
ax.set_title("recombination breakpoints by SARS-CoV2 genomic region")
ax.set_xlabel("Genome position")
ax.set_ylabel("Number of breakpoints")
for name, reg in regions.items():
    for x in reg:
        ax.axvline(x=x, c="orange")
    ax.text(np.mean(reg), 0.95 * y.max(), name, ha="center", c="orange", rotation=90)

An extra complication is that the immediate parents of each recombination node may not be sample nodes (so we might not know their VoC class). We can ascend through the parents on each side of the breakpoint until we find one with a pangolin designation:

In [None]:
import tskit
samples = set(ts.samples())
for u, info in recombination_info.items():
    assert len(info['parents']) == 2
    assert len(info['breaks']) == 1
    info['parents'] = list(info['parents'])
    info['breakpoint'] = list(info['breaks'])[0]
    tree = ts.at(info['breakpoint'])
    right_parent = info['parents'][1]
    # check for NULL first, so that we never try to look up the nonexistent node -1
    while right_parent != tskit.NULL and ts.node(right_parent).metadata.get('Viridian_pangolin') is None:
        right_parent = tree.parent(right_parent)
    tree.prev()
    left_parent = info['parents'][0]
    while left_parent != tskit.NULL and ts.node(left_parent).metadata.get('Viridian_pangolin') is None:
        left_parent = tree.parent(left_parent)
    info['sample_parents'] = [left_parent, right_parent]
print("Found nearest sample nodes to parents")
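
The upward search in the cell above can be illustrated on a toy parent mapping. This is only a sketch: the dictionaries stand in for a tskit tree and the node metadata, and the function name `nearest_labelled_ancestor` is an assumption for illustration.

```python
NULL = -1  # same sentinel value as tskit.NULL


def nearest_labelled_ancestor(node, parent, label):
    """Walk up `parent` from `node` until reaching a node with a label, or NULL."""
    u = node
    # Test for NULL before looking up the label, so the root's missing parent
    # is never used as a key
    while u != NULL and label.get(u) is None:
        u = parent[u]
    return u


# Toy tree: 0 is the root, and the lineage runs 3 -> 2 -> 1 -> 0
toy_parent = {0: NULL, 1: 0, 2: 1, 3: 2}
toy_label = {1: "B.1.1.7"}  # only node 1 is a sample with a lineage designation
found = nearest_labelled_ancestor(3, toy_parent, toy_label)
not_found = nearest_labelled_ancestor(0, toy_parent, toy_label)
print(found, not_found)
```

A return value of `NULL` means the search ran off the top of the tree without meeting a labelled node, which is the case handled as "B(root)" when the parents are printed out later.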

## Identifying known recombinants

[Jackson et al. (2021)](https://doi.org/10.1016/j.cell.2021.08.014) identified a set of potential recombinant strains by hand, and classified them into groups that seemed to share the same original recombination event. They found four such groups (A-D) and a few cases where the recombination appeared to be a singleton (i.e. a recombination event leading to only a single strain). Here are the ENA IDs of the strains they identified as recombinants.

In [None]:
jackson = {
    'A': {"ERR5308556", "ERR5323237", "ERR5414941", "ERR5404883"},
    'B': {"ERR5058070","ERR5058141"}, 
    'C': {"ERR5272107", "ERR5406307", "ERR5232711"},
    'D': {"ERR5349458", "ERR5335088", "ERR5340986"},
    'singleton 1': {"ERR5054123"},
    'singleton 2': {"ERR5054082"},
    'singleton 3': {"ERR5304348"},
    'singleton 4': {"ERR5238288"},
}
jackson_nodes = {}
for k, v in jackson.items():
    jackson_nodes[k] = set()
    for strain in v:
        if strain in strain_to_id:
            jackson_nodes[k].add(strain_to_id[strain])
        else:
            print(f"{strain} ({k}) not in the ARG, possibly for QC reasons")

### Pangolin recombinants

The pangolin project also classifies identified recombinants by a designation starting with the letter "X". Since the dataset only goes up to March 2021, we are not expecting many "X" designations. In fact, we only see 26 of them, all designated "XA":

In [None]:
X_samples = [s for s, p in enumerate(samples_viridian_pangolin) if p.startswith("X")]
X_nodes = ts.samples()[X_samples]
pango_x_nodes = defaultdict(set)
for u in X_nodes:
    pango_x_nodes[ts.node(u).metadata["Viridian_pangolin"]].add(u)
for k, v in pango_x_nodes.items():
    print(f"Pangolin designated '{k}': {len(v)} strains")

We can print out all the recombinants and their descendants

In [None]:
last_sample = ts.node(ts.samples()[-1])  # to calibrate dates
for u, info in recombination_info.items():
    strains = info['descendant_samples']
    parents = info['sample_parents']
    jacksons = {j: len(k & strains) for j, k in jackson_nodes.items() if len(k & strains) > 0}
    pango_x = {j: len(k & strains) for j, k in pango_x_nodes.items() if len(k & strains) > 0}
    parent_names = []
    for p in parents:
        if p == tskit.NULL:
            name = "B(root)[Wuhan]"
        else:
            md = ts.node(p).metadata  # only look up metadata for real nodes
            # build the suffix separately: reusing the same quote type inside an
            # f-string is a syntax error before Python 3.12
            country = md['Country'].replace('United Kingdom', 'UK')
            suffix = f"[{md['date']}.{country}]"
            name = md.get("Viridian_scorpio") + suffix
            if name.startswith("."):
                name = md['Viridian_pangolin'] + suffix
        parent_names.append(name)
    re_node_date = datetime.datetime.fromisoformat(last_sample.metadata["date"]) - datetime.timedelta(days=(ts.nodes_time[u] - last_sample.time))
    extra = []
    if jacksons:
        extra.append(f"JACKSON {jacksons}")
    if pango_x:
        extra.append(f"PANGO {pango_x}")
    print(
        f"RE node {u:<6} - ({re_node_date.date()}) #strains: {len(strains):<4}",
        f"Parents: {' + '.join(parent_names)}",
        f"--- ({', '.join(extra)})" if len(extra) else "",
    )
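
The date arithmetic used for `re_node_date` above can be pulled out into a small helper. This is a sketch that assumes, as the cell above does, that node times are measured in days before the present; the function name `node_time_to_date` is an assumption for illustration.

```python
import datetime


def node_time_to_date(node_time, ref_time, ref_date):
    """Convert a node time (in days ago) to a calendar date.

    `ref_time` and `ref_date` are the time and ISO-format collection date of a
    reference node, e.g. the most recently collected sample.
    """
    ref = datetime.date.fromisoformat(ref_date)
    return ref - datetime.timedelta(days=node_time - ref_time)


# A node 80 days older than a reference sample collected on 2021-03-21
example_date = node_time_to_date(80.0, 0.0, "2021-03-21")
print(example_date)
```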

#### Discussion of recombinations

It appears as if _sc2ts_ also identifies a number of the Jackson recombinants, and places them all as descendants of a single recombination node. It has found another 24 "A group" recombinant strains, on top of the 4 identified by Jackson et al. Most of these are designated pangolin "XA", although 2 out of 28 are not.

There are also a reasonable number of other recombination nodes, none of which include _Delta_ or _Omicron_-labelled strains (which as we saw previously, may be subject to QC issues). One of the recombination nodes (323158, on 2020-12-17) has produced 2183 putative recombinant samples. Although its parents can't be in the sample spike, it would be sensible to check whether the children of this node (which are the "cause" of the suggested recombination) are on the problematic date of 2020-12-31:

In [None]:
def child_dates(ts, node):
    last_sample = ts.node(ts.samples()[-1])  # to calibrate dates
    samples = set(ts.samples())
    children = set()
    dates = []
    for tree in ts.trees():
        children.update(tree.children(node))
    for i, child in enumerate(children):
        if child in samples:
            dates.append(ts.node(child).metadata['date'])
        else:
            date = datetime.datetime.fromisoformat(last_sample.metadata["date"]) - datetime.timedelta(days=(ts.nodes_time[child] - last_sample.time))
            dates.append(date.date())
    return np.array(dates)
    
def print_child_dates(ts, node):
    dates = child_dates(ts, node)
    print(f"Node {node}:")
    for i, date in enumerate(dates):  # `i` was previously undefined inside this function
        print(f" child {i} has date {date}" + (" <- ***" if date == "2020-12-31" else ""))

print_child_dates(ts, 323158)

So we should probably treat this recombination node with some caution: it may be artefactual. That's also true of node 132596 (34 descendants), but not node 86494 (57 descendants) or the Jackson "A" recombination node (id 368327, which occurs after 2020-12-31 anyway, so can't possibly have children at that date).

In [None]:
print_child_dates(ts, 132596)
print_child_dates(ts, 86494)
print_child_dates(ts, 368327)  # Jackson "A" group


Some of these recombination nodes seem potentially trustworthy. The relatively large number of recombination events is probably due to the ability of _sc2ts_ to capture recombination events between relatively closely related strains (most other methods of identifying recombinant strains involve hand-inspection). Note that the UK carried out a large proportion of the sequencing in this stage of the pandemic, so although many of the parents appear to be in the UK, this may simply reflect sampling bias.



To show the ability of the sc2ts method to detect recombination even between closely related strains, we can recapitulate Figure 5 of the sc2ts preprint, which shows the distance (in days) between the two recombinant parents. We'll skip any recombinants that have children from the dodgy date.

In [None]:
# plot the MRCA between the two parents. Note that this requires a special MRCA routine,
# as we are not looking at the MRCA between two individuals in a single tree, but
# on either side of a breakpoint

import tskit

def get_root_path(tree, node):
    u = node
    path = []
    while u != tskit.NULL:
        path.append(u)
        u = tree.parent_array[u]
    return path


def get_node_tmrca_about_break(ts, breakpoint, node):
    node_time = ts.nodes_time
    tree = ts.at(breakpoint)
    path1 = get_root_path(tree, node)[1:]
    tree.prev()
    path2 = get_root_path(tree, node)[1:]
    j1 = 0
    j2 = 0
    while True:
        if path1[j1] == path2[j2]:
            if path1[j1] == tskit.NULL:
                return np.nan
            return node_time[path1[j1]]
        elif node_time[path1[j1]] < node_time[path2[j2]]:
            j1 += 1
        elif node_time[path2[j2]] < node_time[path1[j1]]:
            j2 += 1
        else:
            # Time is equal, but the nodes differ
            j1 += 1
            j2 += 1


# find the MRCA of the two parents either side of the breakpoint
x = [[], []]  # first is a jackson-identified recombination node
y = [[], []]
size = [[], []]
for u, info in recombination_info.items():
    strains = info['descendant_samples']
    jacksons = {j: len(k & strains) for j, k in jackson_nodes.items() if len(k & strains) > 0}
    tmrca = get_node_tmrca_about_break(ts, info['breakpoint'], u)
    idx = 0 if jacksons else 1
    if np.any(child_dates(ts, u) == "2020-12-31"):
        continue
    y[idx].append(tmrca)
    x[idx].append(tmrca - ts.node(u).time)
    size[idx].append(len(strains))
plt.scatter(x[0], y[0], s=np.sqrt(size[0]) * 10, c="red", label="Jackson-identified", alpha=0.5)
plt.scatter(x[1], y[1], s=np.sqrt(size[1]) * 10, c="blue", label="Novel", alpha=0.5)
plt.xlabel("Divergence between strains combined (days)")
plt.ylabel(f"Age of MRCA (days before 2021-03-21)")
lgnd = plt.legend()
for handle in lgnd.legend_handles:
    handle.set_sizes([30.0])
plt.title("Recombination node MRCA ages\n(size of point denotes number of descendant samples)");

The Jackson recombinants are indeed those with a very high divergence between strains. However, _sc2ts_ is also identifying many recombination nodes with a divergence of < 200 days between the two parents.



### Looking at subgraphs

We can use the tskit_arg_visualizer to look at a few of the recombination nodes, and their surrounding sample nodes. We'll colour the nodes using the same scheme as before (based on the viridian scorpio classification), but use red nodes to indicate recombinants.

In [None]:
%%javascript
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://d3js.org/d3.v7.min.js';
document.head.appendChild(script);

In [None]:
# Initialise the visualiser
import tskit_arg_visualizer
d3arg = tskit_arg_visualizer.D3ARG.from_ts(ts=ts) # could take e.g. 30 secs
fills = []
symbols = []
for nd in ts.nodes():
    if nd.is_sample():
        fill = scorpio_colors[nd.metadata["Viridian_scorpio"]]
        symbol = "d3.symbolSquare"
    else:
        symbol = "d3.symbolCircle"
        if nd.id in recombination_nodes:
            fill = "#FF0000"
        else:
            fill = "#FFFFFF"
    fills.append(fill)
    symbols.append(symbol)
d3arg.set_all_node_styles(fill=fills, stroke_width=2, symbol=symbols)

Look at the Jackson "A" recombinant (the "XA" pangolin group). One of the parent lineages needs 12 hops to the nearest sample parent (possibly indicating a lack of sampling in the right area), and we need 4 hops downwards to encompass all the children of the recombination node.

In [None]:
# Set some useful labels on the ARG nodes
labels = {i: '' for i in range(ts.num_nodes)}
for i, node_id in enumerate(ts.samples()):
    labels[node_id] = id_to_strain[node_id]
    extra = []
    if samples_country[i]:
        extra.append(samples_country[i].replace("United Kingdom", "UK"))
    viridian_scorpio = samples_viridian_scorpio[i]
    if "(" in viridian_scorpio:
        extra.append(viridian_scorpio[:viridian_scorpio.find("(")].strip())  # just show first part of scorpio label
    viridian_pangolin = samples_viridian_pangolin[i]
    if viridian_pangolin.startswith("X"):
        extra.append(viridian_pangolin)
    if len(extra) > 0:
        labels[node_id] += f" ({', '.join(extra)})"
d3arg.set_node_labels(labels)

#  This has many non-sample nodes above one of the parents (these are nodes
older_node_degree = 12
younger_node_degree = 4
d3arg.draw_node(node=368327, height=500, width=1000, degree=[older_node_degree, younger_node_degree])

By dragging the nodes apart, you should be able to see that the non-XA labelled descendants are `ERR5404883` and `ERR5404883`, immediate descendants of the red recombinant node. It could be worth investigating if these really are recombinants that have been mislabelled by the Viridian pangolin classifier, as seems likely.

We could also use some form of graph algorithm to assign likely countries to the white nodes on the plot (based on the countries of the samples below them). This could give clues as to the possible country of origination of the recombinants, and is a form of geographical inference.
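
As a rough illustration of that idea, the sketch below assigns each internal node the commonest country among the samples beneath it. It is a deliberately simple majority vote on a toy tree, not a method from _sc2ts_ or tskit: the function name `infer_countries` and the dictionary-based tree are assumptions, and in a real ARG a recombinant's descendants would be counted under both of its parents.

```python
from collections import Counter


def infer_countries(children, sample_country, root):
    """Assign each unlabelled node the commonest country among its descendant samples."""
    inferred = {}

    def tally(u):
        # Start with this node's own country, if it is a sample
        counts = Counter([sample_country[u]]) if u in sample_country else Counter()
        for c in children.get(u, []):
            counts += tally(c)
        if u not in sample_country and counts:
            inferred[u] = counts.most_common(1)[0][0]
        return counts

    tally(root)
    return inferred


# Toy tree: root 0 has children 1 and 2; node 1 has two UK samples, node 2 one USA sample
toy_children = {0: [1, 2], 1: [3, 4], 2: [5]}
toy_country = {3: "UK", 4: "UK", 5: "USA"}
toy_inferred = infer_countries(toy_children, toy_country, root=0)
print(toy_inferred)
```

Because the UK contributed so much of the sequencing, a plain majority vote like this would be heavily biased towards the UK on the real data; a parsimony or likelihood approach that accounts for sampling effort would be more defensible.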

In [None]:
# Look at one of the more interesting not-previously-identified recombinants
d3arg.draw_node(node=449789, height=400, degree=[1, 2])

We can check if this is a believable set of recombinants by looking at the sequence data, comparing mutations on SRR20694798 and SRR14401372 with those on the descendants:

In [None]:
strains = ("SRR20694798", "SRR14401372", "SRR14453556", "SRR14376771", "SRR14377696", "SRR14158202", "SRR16747820")
ids = [strain_to_id[strain] for strain in strains]

show_bp = True
breakpoint = recombination_info[449789]['breakpoint']
print(" pos   parents | recombinants")
for v in ts.variants(samples=ids):
    if v.genotypes[0] == v.genotypes[1]:
        continue  # skip sites where the two parents carry the same allele
    genos = np.array(v.alleles)[v.genotypes]
    if show_bp and v.site.position >= breakpoint:
        print(f"---- breakpoint @ {int(breakpoint)} ----")
        show_bp = False
    print(f"{int(v.site.position):<5}:  {genos[0]}   {genos[1]}  |  {' '.join(genos[2:])}")

This looks plausible, although the section of genome before the breakpoint only has 3 sites that vary between the two parents, so the inferred recombinant could be due to chance similarities rather than true recombination.
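
To put a number on how much support there is on each side, we can count the parent-distinguishing sites at which a recombinant matches one parent or the other. The sketch below works on an invented list of sites rather than on `ts.variants()`; the tuple layout and the function name `count_support` are assumptions for illustration.

```python
def count_support(sites, breakpoint_pos):
    """Count informative sites supporting each parent, left and right of a breakpoint.

    `sites` is a list of (position, left_parent_allele, right_parent_allele, child_allele).
    """
    support = {
        "left": {"left_parent": 0, "right_parent": 0},
        "right": {"left_parent": 0, "right_parent": 0},
    }
    for pos, a_left, a_right, child in sites:
        if a_left == a_right:
            continue  # uninformative: the parents do not differ here
        side = "left" if pos < breakpoint_pos else "right"
        if child == a_left:
            support[side]["left_parent"] += 1
        elif child == a_right:
            support[side]["right_parent"] += 1
    return support


toy_sites = [
    (100, "A", "G", "A"),
    (250, "C", "T", "C"),
    (400, "G", "G", "G"),  # parents agree, so this site is skipped
    (700, "T", "C", "C"),
    (900, "A", "T", "T"),
]
toy_support = count_support(toy_sites, breakpoint_pos=500)
print(toy_support)
```

A convincing recombinant should match the left parent at most informative sites before the breakpoint and the right parent after it; when one side has only a handful of informative sites, as here, the evidence is correspondingly weak.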

As a further exercise, you could try repeating this output for the Jackson "A" / pango XA grouping.