# Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <shift>+<return>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [4]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    raise RuntimeError("This workbook is not designed to run in JupyterLite. Please use a Colab or local install")
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    # (NB: you can probably ignore any message about restarting the kernel)
    !pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [13]:
# Load questions etc for this workbook
import ARG_workshop
workbook = ARG_workshop.Workbook2D()
display(workbook.setup)

0
✅ Your notebook is ready to go!


In [2]:
import tsinfer
import subprocess
import sys
import zarr
import pandas as pd
import tskit
import json
import numpy as np
from tqdm import tqdm

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to perform different behaviour or additional analysis. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 2-D: inferning ARGs from real data

In this lab session we will infer ARGs from real data, using the House sparrow system as an example. You can also consult the [tsinfer documentation](https://tskit.dev/tsinfer/docs/stable/index.html) and the [tsinfer tutorial](https://tskit.dev/tsinfer/docs/stable/tutorial.html) for more information and examples.

## Input data setup

### Sparrow data

In this workbook we will look at data based on the House sparrow (*Passer domesticus*) system. House sparrows are an anthrodependent species and have spread all over the world, but little is known about their origin. [A recent publication](https://royalsocietypublishing.org/doi/10.1098/rspb.2018.1246) studied three Eurasian species, including populations from the Bactrianus sparrow that serves as ancestral proxy for house sparrows, with an inferred split between commensal house and Bactrianus occuring ~11kya. The publication identified putative regions for adaptation to an anthropogenic niche, one of which occurs on chromosome 8. The example data in this workbook is based on a region on [chromosome 8 that has a strong signal of divergence](https://royalsocietypublishing.org/cms/asset/df530257-c8ec-4a9f-945c-23dfdede4a23/rspb20181246f04.jpg) between bactrianus and House sparrows. This region harbors a gene, AMY2A, that encodes the amylase enzyme which helps digest starch that presumably became more abundant with the advent of agriculture.

### Convert BCF to vcz

The raw data is provided in binary Variant Call Format (VCF) and consists of phased bi-allelic SNP calls, which is a [requirement for tsinfer analyses](https://tskit.dev/tsinfer/docs/stable/inference.html#data-requirements). Even though tsinfer supports loading VCF/BCF data, the [current trend is to move toward other data storage formats, such as the Zarr format](https://www.biorxiv.org/content/10.1101/2024.06.11.598241v1.full), as the variant call format does not allow for easy retrieval of data based on subsets of samples or fields. The [Zarr format](https://zarr.dev/) stores data in arrays in a data store (directory) and is designed to efficiently subset data in different ways, for instance making it easy to mask samples or sites without tampering with the raw data. The [bio2zarr Python module](https://sgkit-dev.github.io/bio2zarr/intro.html) has recently been released as a tool to convert various bionformatics data formats to Zarr format. Here we will use the `vcz` suffix to designate Zarr data stores. Run the code below to convert BCF to Zarr format.

In [3]:
vcf_name = "data/chr8.subset.bcf"
zarr_file_name = "data/chr8.vcz"
try:
    subprocess.run([sys.executable, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, zarr_file_name])
except FileNotFoundError:
    print("Please install bio2zarr to convert VCF to Zarr by running !pip install bio2zarr")

    Scan: 100%|██████████| 1.00/1.00 [00:00<00:00, 5.50files/s]
 Explode: 100%|██████████| 123k/123k [00:13<00:00, 9.21kvars/s] 
  Encode: 100%|██████████| 301M/301M [00:07<00:00, 39.2MB/s] 
Finalise: 100%|██████████| 12.0/12.0 [00:00<00:00, 1.14karray/s]


We can load the data store with the `zarr` Python module.

In [4]:
ds = zarr.load(zarr_file_name)
display(ds)

<LazyLoader: call_genotype, call_genotype_mask, call_genotype_phased, contig_id, filter_id, sample_id, variant_AC, variant_AN, variant_allele, variant_contig, variant_filter, variant_id, variant_id_mask, variant_position, variant_quality>

Basically, the `ds` data structure consists of arrays of data that can be accessed in a very efficient manner. If you peek into the zarr file (actually a directory) you will see that the variable names above simply reflect the folder structure

In [5]:
!ls data/chr8.vcz

call_genotype	      filter_id   variant_allele  variant_id_mask
call_genotype_mask    sample_id   variant_contig  variant_position
call_genotype_phased  variant_AC  variant_filter  variant_quality
contig_id	      variant_AN  variant_id


in which each folder contains subfolders and binary arrays:

In [6]:
!tree data/chr8.vcz | head  # Remove the pipe to head to see all files 

[01;34mdata/chr8.vcz[0m
├── [01;34mcall_genotype[0m
│   ├── [01;34m0[0m
│   │   └── [01;34m0[0m
│   │       └── 0
│   ├── [01;34m1[0m
│   │   └── [01;34m0[0m
│   │       └── 0
│   ├── [01;34m10[0m
│   │   └── [01;34m0[0m


We don't really need to know these details as loading of the data store generates a convenient data object; suffice to say that if you look at the contents of the VCF file, you will that the columns and fields from the VCF map to the different variable names.

We can investigate the shape and type of the variables by accessing them similar to keys in a dict, showing that they are [numpy.ndarray](https://numpy.org/doc/1.26/reference/generated/numpy.ndarray.html)s:

In [7]:
for key in ["sample_id", "call_genotype", "variant_allele"]:
    print(key, ds[key].shape, ds[key].dtype, type(ds[key]))

sample_id (483,) object <class 'numpy.ndarray'>
call_genotype (122686, 483, 2) int8 <class 'numpy.ndarray'>
variant_allele (122686, 2) object <class 'numpy.ndarray'>


Since the variables are arrays, we can subset them by [ndarray indexing](https://numpy.org/doc/1.26/user/basics.indexing.html#indexing-on-ndarrays):

In [8]:
ds["sample_id"][0:10], ds["call_genotype"][0:4,0:4,:], ds["variant_allele"][0:10]

(array(['PDOM2013AUS0029U_01-78565', 'PIAG2014CPV0001F_01-Sp3',
        'PIAG2014CPV0003M_02-RBB02', 'PIAG2014CPV0002M_02-Sp4',
        'PDOM2013AUS0030U_03-78832', 'PHIS2014CPV0063M_03-Sp5',
        'PHIS2014CPV0064M_04-Sp6', 'PDOM2013AUS0031U_07-46265',
        'PDOM2013AUS0032U_10-78830', 'PIAG2014CPV0006F_11-RBC12'],
       dtype=object),
 array([[[0, 0],
         [0, 0],
         [0, 0],
         [0, 0]],
 
        [[0, 0],
         [1, 0],
         [0, 0],
         [0, 0]],
 
        [[0, 0],
         [0, 0],
         [0, 0],
         [0, 0]],
 
        [[0, 0],
         [0, 0],
         [0, 0],
         [0, 0]]], dtype=int8),
 array([['G', 'A'],
        ['T', 'C'],
        ['A', 'T'],
        ['A', 'G'],
        ['G', 'A'],
        ['G', 'T'],
        ['A', 'G'],
        ['T', 'A'],
        ['G', 'A'],
        ['A', 'C']], dtype=object))

At this point we have samples and variant data, but we would like to add more metadata, such as population information about the samples.

### Add individual and population metadata

Individual and population metadata are persistent (well, sort of) and can therefore be added to the Zarr store itself. We provide sample and population information in two tabular text files that we load below:

In [9]:
samplesfile = "data/samples.tsv"
populationfile = "data/populations.tsv"

population_df = pd.read_table(populationfile)
samples_df = pd.read_table(samplesfile).set_index("sample")
schema = json.dumps(tskit.MetadataSchema.permissive_json().schema).encode()

The last step defines a generic schema (basically [a data description format in JSON](https://json-schema.org/understanding-json-schema)) that is needed to initialize the metadata slots. 
The code below first loads the Zarr store, sets the schemas, and then adds metadata about samples and individuals. The only metadata we have is the link between individual and population, but any descriptive data could be added.  

In [18]:
ds = zarr.load(zarr_file_name)
population_set = set(samples_df.loc[ds["sample_id"]]["population"].values)  # populations table contains more samples than are present in Zarr file so take care not to add them

# Save populations and individuals metadata
zarr.save(f"{zarr_file_name}/populations_metadata_schema", schema)
zarr.save(f"{zarr_file_name}/individuals_metadata_schema", schema)
metadata = []

for row in population_df.itertuples(index=False):
    if row.population not in population_set:
        # Uncomment print statements if you want to see what is added / skipped
        # print(f"Population {row.population} not present in samples; skipping")
        continue
    data = json.dumps(row._asdict())
    # print(f"Adding population metadata: {data}")
    metadata.append(data.encode())
zarr.save(f"{zarr_file_name}/populations_metadata", metadata)

# Assign samples to population
ds = zarr.load(zarr_file_name)
num_individuals = ds["sample_id"].shape[0]
individuals_pop = np.full(num_individuals, tskit.NULL, dtype=np.int32)
populations = [
    json.loads(x.decode())["population"] for x in ds["populations_metadata"]
]

# Individual metadata here just consists of the population data, so in a way is redundant.
# However, it is included to show that *any* metadata related to individuals could be added here, e.g., phenotype, geolocation, etc
metadata = []
for i, name in enumerate(ds["sample_id"]):
    pop = samples_df.loc[name].population
    data = json.dumps(samples_df.loc[name].to_dict())
    # print(f"Individual {name}, population {pop}")
    individuals_pop[i] = populations.index(pop)
    metadata.append(data.encode())
    # print(f"Adding individual metadata: {data}")

zarr.save(f"{zarr_file_name}/individuals_population", individuals_pop)
zarr.save(f"{zarr_file_name}/individuals_metadata", metadata)

<dl class="exercise"><dt>Exercise 1</dt>
<dd>Load the data set and look at the 10 first entries of variables individuals_population and populations_metadata.</dd>
</dl>

In [11]:
# Use zarr.load to load data set and print 10 first entries of individuals_population and populations_metadata. Recall that you can slice an array a with syntax a[0:10]


In [14]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("metadata")

<IPython.core.display.Javascript object>

## Set inference parameters

We have now added metadata to our data set, but before getting on with the inference itself, we need to load the variant data into a format that tsinfer understands. Moreover, we often want to change the inference parameters, such as excluding samples of poor quality, or filtering out sites that reside in problematic genomic regions. To this end, tsinfer defines a function [VariantData](https://tskit.dev/tsinfer/docs/latest/usage.html#variantdata-and-ancestral-alleles) that takes as arguments a variant data sources (VCF file name, ZARR store), and optionally user-defined sample masks, site masks, and even ancestral allele states stored in regular Python data structures. We create this information in the following sections.

### Define ancestral allele state

An additional requirement of the input data is that the [ancestral state is known](https://tskit.dev/tsinfer/docs/stable/inference.html#data-requirements). This is not necessarily the same as the `REF` column in a VCF (or equivalently, the `0`th column of the `variant_allele` Zarr field). Most often, the sequence data has been mapped to a reference sequence from an individual in a focal population, and the allelic states of the reference could be derived alleles with respect to some outgroup population. Therefore, we need to determine the ancestral state somehow. Here we will adopt a simple method based on maximum parsimony, where we use outgroup samples contained in the data store to determine the ancestral allele. If a majority (say, 80%) of the outgroups are called as the `ALT` allele, we simply set the ancestral allele to the `ALT` allele, else we use the `REF` allele. 

<div class="alert alert-block alert-info"><b>Note:</b> There are caveats to using maximum parsimony, as discussed in <a href="https://academic.oup.com/genetics/article/209/3/897/5930981">Keightley and Jackson (2018)</a>, but nevertheless it is a widely used approach due to its simplicity. </div>

As of `tsinfer 0.4.0a`, if the VCF contains a field called `variant_AA`, it will be converted to and treated as ancestral allele information by tsinfer. However, the flexibility of tsinfer allows you to calculate your own ancestral alleles and choose which version to use later on in the inference process. Our example data does not contain the `variant_AA` field so we must do the ancestral allele calculation anyway. As outgroup, we will use the `tree` population (see phylogeny below).

<center><img src="img/sparrows.png"/></center>

The following code shows how to do this based on samples belonging to the `tree` outgroup population.

In [20]:
# For convenience generate two dictionaries that map from population name to id and the corresponding reverse mapping
pop2id = {json.loads(x.decode())["population"]:i for i, x in enumerate(ds["populations_metadata"])}
id2pop = {v:k for k, v in pop2id.items()}

In [21]:
# Find the indices of the tree individuals
tree_pop_indices = np.where(ds["individuals_population"] == pop2id["tree"])

In [22]:
n_outgroups = len(tree_pop_indices) * 2  # We have diploid samples so two alleles per outgroup individual
threshold = 0.8  # Set a (customizable) threshold
n_changes = 0
ancestral_allele = ds["variant_allele"][:, 0]  # The variant_allele variable holds the REF/ALT pairs for each bi-allelic SNP; initialize ancestral state to REF allele for all sites
for i, gt in enumerate(tqdm(ds["call_genotype"][:, tree_pop_indices, :])):  # numpy.ndarray multiindex slicing to select only tree individuals from second dimension (of three)
    if sum(gt.flatten()) / n_outgroups >= threshold:
        ancestral_allele[i] = ds["variant_allele"][i, 1]  # Swap ancestral state to ALT if outgroups consistently called as ALT
        n_changes = n_changes + 1
print(f"Ancestral allele: changed {n_changes} out of {len(ancestral_allele)} sites ({n_changes/len(ancestral_allele)*100:.2f}%)")

100%|█████████████████████████████████████████| 122686/122686 [00:00<00:00, 234302.38it/s]

Ancestral allele: changed 20979 out of 122686 sites (17.10%)





One improvement you could make here is to identify sites where the outgroup "votes" are inconsistent. This information could be used to mask sites that are deemed untrustworthy. We refrain from doing so here, but in the next section, we setup a sample mask that restrict the inference to a sample subset.

### Add sample masks

We now define a sample mask to exclude samples belonging to the outgroup `tree` and the `iago` population (Cabo Verde sparrows). A mask array is a boolean array (`False`/`True`) where samples to be excluded are assigned `True`.

In [23]:
sample_mask = (ds["individuals_population"] == pop2id["iago"]) | (ds["individuals_population"] == pop2id["tree"])
print(f"Masking {sum(sample_mask)} samples from tree population")

Masking 48 samples from tree population


## Run inference

Now we have all the relevant information we need to proceed with inference. However, note how easy it would be to apply different sample masks or allelic states to update an inference: simply update the parameters that you pass to the `tsinfer.VariantData` function below.

In [24]:
# Setup a VariantData object. You can easily change parameters without having to modify the underlying data store.
vdata = tsinfer.VariantData(zarr_file_name, ancestral_allele=ancestral_allele, sample_mask=sample_mask)

The `VariantData` object is then passed on to [tsinfer.infer](https://tskit.dev/tsinfer/docs/stable/api.html#tsinfer.infer) that runs the [full inference pipeline](https://tskit.dev/tsinfer/docs/stable/inference.html#sec-inference). With one thread, this may take up to ten minutes, so meanwhile, take a break or increase the number of threads if you can and are impatient!

<div class="alert alert-block alert-info"><b>Note:</b> Although there is functionality to run the full pipeline in a simple function call, it is not recommended to do so in more complex settings. For instance, if the ancestor matching step (<code>ma-match</code>) takes long to complete, a potential solution is to trim ancestors with <a href="https://tskit.dev/tsinfer/docs/stable/api.html#tsinfer.AncestorData.truncate_ancestors">ancestors.truncate_ancestors</a>. </div>

In [25]:
%%time
ts = tsinfer.infer(vdata, num_threads=1, progress_monitor=True)

ga-add   (1/6)  0%|                                             | 0.00/123k [00:00, ?it/s]

ga-gen   (2/6)  0%|                                            | 0.00/39.9k [00:00, ?it/s]

ma-match (3/6)  0%|                                            | 0.00/39.9k [00:00, ?it/s]

ms-muts  (4/6)  0%|                                            | 0.00/43.8k [00:00, ?it/s]

ms-match (5/6)  0%|                                              | 0.00/870 [00:00, ?it/s]

ms-paths (6/6)  0%|                                              | 0.00/870 [00:00, ?it/s]

ms-muts  (7/6)  0%|                                            | 0.00/43.8k [00:00, ?it/s]

ms-xsites (8/6)  0%|                                           | 0.00/78.9k [00:00, ?it/s]

CPU times: user 13min 27s, sys: 14.4 s, total: 13min 42s
Wall time: 2min 25s


The inference pipeline generates a tree sequence object that you by now should be familiar with. As a last inference step, we date the ancestral nodes of the tree sequence using [tsdate](https://tsdate.readthedocs.io/en/latest/index.html) using a [bird mutation rate of 2.3e-9](https://genome.cshlp.org/content/26/9/1211.short) mutations per site per year. The input to [tsdate.date](https://tskit.dev/tsdate/docs/latest/python-api.html#tsdate.date) must be a simplified tree. The function [tsdate.preprocess_ts](https://tskit.dev/tsdate/docs/latest/python-api.html#tsdate.preprocess_ts) does this, but can also do other things, like removing data-poor regions.

In [26]:
import tsdate
dated_ts = tsdate.date(tsdate.preprocess_ts(ts), mutation_rate=2.3e-9)

## Investigate tree sequences

Start by taking a look at the tree sequence objects.

In [29]:
display(ts), display(dated_ts);

Tree Sequence,Unnamed: 1
Trees,39592
Sequence Length,19699968.0
Time Units,uncalibrated
Sample Nodes,870
Total Size,28.1 MiB
Metadata,dict

Table,Rows,Size,Has Metadata
Edges,437695,13.4 MiB,
Individuals,435,56.3 KiB,✅
Migrations,0,8 Bytes,
Mutations,69546,2.5 MiB,
Nodes,61383,3.0 MiB,✅
Populations,9,521 Bytes,✅
Provenances,1,596 Bytes,
Sites,122686,6.0 MiB,✅


Tree Sequence,Unnamed: 1
Trees,39445
Sequence Length,19699968.0
Time Units,generations
Sample Nodes,870
Total Size,38.4 MiB
Metadata,dict

Table,Rows,Size,Has Metadata
Edges,545351,16.6 MiB,
Individuals,435,56.3 KiB,✅
Migrations,0,8 Bytes,
Mutations,69546,5.3 MiB,✅
Nodes,88060,6.3 MiB,✅
Populations,9,521 Bytes,✅
Provenances,3,2.0 KiB,
Sites,122686,6.0 MiB,✅


Note the difference in time units! Also, even though the raw BCF consisted of variation data from a subregion 18.7-19.7Mbp, the sequence length of the tree sequence objects is 19699968bp, corresponding to the coordinate of the last variant. This length is **not** identical to the chromosome/contig length; chromosome 8 is 49Mbp long.

### tsqc

TODO: add comment / edge plot

TODO: could be that even here we get lots of ancestors which slows down the inference process

### Setup parameters for plotting windowed summary statistics

Now that we have more populations to look at we add styling with more colors. Let's start by collecting the population metadata into a dictionary:


In [None]:
popmd = {p.id:p.metadata for p in ts.populations()}
print(popmd)

Recall that we applied a sample mask to exclude `tree` and `iago` samples, but note that this information is retained in the tree sequence object. We add a color mapping to all populations nevertheless and keep track of the population ids present in the tree sequence.

In [None]:
import matplotlib.pyplot as plt  # Plotting library
import matplotlib.colors as mcolors  # Color library
mpl_colors = mcolors.TABLEAU_COLORS
pop_in_ts, count = np.unique(ts.individuals_population, return_counts=True)  # Get population ids and counts
for pop, color in zip(ts.populations(), list(mpl_colors)[:9]):
    popmd[pop.id]["color"] = color
sample_sets = [ts.samples(i) for i in pop_in_ts]

As before, we also make a CSS style for use with subsequent tree sequence plots.

In [None]:
styles = []
for popid, md in popmd.items():
    # target the symbols only (class "sym")
    s = f".node.p{popid} > .sym " + "{" + f"fill: {md['color']}" + "}"
    styles.append(s)
    # print(f'"{s}" applies to nodes from population {md["population"]} (id {popid})')
css_string = " ".join(styles)
# print(css_string)

We finally define windows for our region of interest. 

In [None]:
window_size = 10_000
roi = (18_700_000, 19_700_000)  # Coordinates used to subset the original VCF
start_index = int(roi[0] / window_size)  # we don't want to plot regions without data
num_windows = int(roi[1] / window_size)
window_size = ts.sequence_length / num_windows
windows = np.linspace(0, roi[1], num_windows + 1)
windows[-1] = ts.sequence_length
ticks = np.arange(187, 198, 2) / 10  # Get tick marks in Mb

### Windowed genetic diversity


We begin by plotting the windowed genetic diversity (aka `pi`), or sample heterozygosity. Diversity is calculated with the [diversity](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.diversity) function, which is a [one-way](https://tskit.dev/tskit/docs/stable/stats.html#one-way-methods) method that calculates a statistic on a single sample set. By providing a list of samples sets, we can calculate windowed for all populations on the fly.

In [None]:
pi_win = ts.diversity(
    sample_sets=sample_sets,
    windows=windows,
)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))

for i, pop in enumerate(pop_in_ts):
    x = pi_win[:, i]
    plt.plot(range(len(windows[start_index:-1])), x[start_index:], color=popmd[pop]["color"], label=popmd[pop]["population"])
plt.xlabel("Window")
plt.ylabel("Diversity (pi)")
ax.xaxis.set_ticks(np.arange(0, 101, 20))
ax.xaxis.set_ticklabels(ticks)
plt.xlabel("Position (Mbp)")
plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
plt.show()

TODO: comment on low diversity in introduced_house

### Tajima's D plot

Next we plot another one-way statistic, namely Tajima's D, to scan for signs of selection. As a rule of thumb, [values smaller than -2](https://en.wikipedia.org/wiki/Tajima%27s_D) are significant and could indicate signals of selective sweeps.

In [None]:
tajd_win = ts.Tajimas_D(
    sample_sets=sample_sets,
    windows=windows,
)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
for i, pop in enumerate(pop_in_ts):
    x = tajd_win[:, i]
    plt.plot(range(len(windows[start_index:-1])), x[start_index:], color=popmd[pop]["color"], label=popmd[pop]["population"])
plt.xlabel("Window")
plt.ylabel("Tajima's D")
ax.xaxis.set_ticks(np.arange(0, 101, 20))
ax.xaxis.set_ticklabels(ticks)
plt.xlabel("Position (Mbp)")
plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
plt.show()

### Fixation indices

The fixation index $F_{ST}$ is used to assess population differentiation and can identify differentiated regions. It is a [multi-way method](https://tskit.dev/tskit/docs/stable/stats.html#sec-stats-sample-sets-multi-way) that compares 2 or more samples. We therefore need to make pairs of sample sets. To avoid too many comparisons, we group the populations into species House, Italian, Spanish, and Bactrianus sparrow.

In [None]:
house = np.concatenate((ts.samples(0), ts.samples(1), ts.samples(3)))
italian = np.concatenate((ts.samples(5), ts.samples(7)))
spanish = ts.samples(2)
bactrianus = ts.samples(8)

In [None]:
paired_sample_sets = [house, italian, spanish, bactrianus]
pair_comparisons = [(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)]
fst_win = ts.Fst(
    sample_sets=paired_sample_sets,
    indexes=pair_comparisons,
    windows=windows,
)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
comparisons = ["house_vs_italian", "house_vs_spanish", "house_vs_bactrianus", "italian_vs_spanish", "italian_vs_bactrianus", "spanish_vs_bactrianus"]
for i, comp in enumerate(comparisons):
    x = fst_win[:, i]
    plt.plot(range(len(windows[start_index:-1])), x[start_index:], label=comp, color=list(mpl_colors.values())[i])
ax.xaxis.set_ticks(np.arange(0, 101, 20))
ax.xaxis.set_ticklabels(ticks)
plt.xlabel("Position (Mbp)")
plt.ylabel("Fst")
plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
plt.show()

### Divergence




Finally, we calculate and plot divergence ($D_{XY}$).

In [None]:
dxy_win = ts.divergence(
    sample_sets=[house, italian, spanish, bactrianus],
    indexes=[(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)],
    windows=windows,
)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
comparisons = ["house_vs_italian", "house_vs_spanish", "house_vs_bactrianus", "italian_vs_spanish", "italian_vs_bactrianus", "spanish_vs_bactrianus"]
for i, comp in enumerate(comparisons):
    x = dxy_win[:, i]
    plt.plot(range(len(windows[start_index:-1])), x[start_index:], label=comp, color=list(mpl_colors.values())[i])
plt.xlabel("Window")
plt.ylabel("Divergence (dxy)")
ax.xaxis.set_ticks(np.arange(0, 101, 20))
ax.xaxis.set_ticklabels(ticks)
plt.xlabel("Position (Mbp)")
plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
plt.show()

## Tree sequence analyses

TODO: look at plots of trees etc

## GNN plot