## Setup

To access material for this workbook please execute the two notebook cells immediately below (e.g. use the shortcut <b>&lt;shift&gt;+&lt;return&gt;</b>). The first cell can be skipped if you are running this notebook locally and have already installed all the necessary packages. The second cell should print out "Your notebook is ready to go!"

In [None]:
if 'pyodide_kernel' in str(get_ipython()):  # specify packages to install under JupyterLite
    %pip install -q -r jlite-requirements.txt
elif 'google.colab' in str(get_ipython()):  # specify package location for loading in Colab
    from google.colab import drive
    drive.mount('/content/drive')
    %run /content/drive/MyDrive/GARG_workshop/Notebooks/add_module_path.py
else:  # install packages on your local machine (-q = "quiet": don't print out installation steps)
    !python -m pip install -q -r https://github.com/ebp-nor/GARG/raw/main/jlite/requirements.txt

In [None]:
# Load questions etc for this workbook
from IPython.display import SVG
import tskit
import ARG_workshop
workbook = ARG_workshop.Workbook1D()
display(workbook.setup)

### Using this workbook

This workbook is intended to be used by executing each cell as you go along. Code cells (like those above) can be modified and re-executed to change their behaviour or to run additional analyses. You can use this to complete various programming exercises, some of which have associated questions to test your understanding. Exercises are marked like this:
<dl class="exercise"><dt>Exercise XXX</dt>
<dd>Here is an exercise: normally there will be a code cell below this box for you to work in</dd>
</dl>

# Workbook 1-D: Sites, mutations, and variation

So far we have been using _tskit_ to store genetic genealogies, but we have not tackled the topic of genetic variation. _Tskit_ stores genetic variation data in an unusual manner. Rather than storing a DNA sequence for each of the sample genomes, _tskit_ records genetic differences transmitted through the genealogy. In particular this is done by placing **mutations** on the genealogy.

Each mutation occupies a row of the *mutation table*. It is placed in the local tree by recording its associated *node*, and it is located along the genome by indexing into the *site table*: each site in this table has a position and an ancestral state.

To illustrate, we'll use [msprime.sim_mutations()](https://tskit.dev/msprime/docs/stable/mutations.html) to add mutations to one of our simulated tree sequences, and display the *site* and *mutation* tables, as well as the local trees. Various kinds of mutation (indels, microsatellites, etc.) can be generated, but for the moment we'll just use single base pair substitutions under the [Jukes & Cantor model](https://tskit.dev/msprime/docs/stable/api.html#msprime.JC69), in which there is an equal probability of mutating between any of the 4 bases.

In [None]:
import msprime

no_muts_ts = ARG_workshop.FwdWrightFisherRecombSim(5, seq_len=100, recombination_rate=1e-3, random_seed=6).run(30)
ts = msprime.sim_mutations(no_muts_ts, rate=1e-3, random_seed=20)
display(ts.tables.sites)
display(ts.tables.mutations)
display(ts.draw_svg(y_axis=True, size=(1000, 400)))

The `sim_mutations()` routine simply adds neutral mutations randomly onto the genealogy at a rate proportional to the "area" of each edge (i.e. the span of the edge multiplied by the time difference between its parent and child node). This is fine for neutral mutations, which are not expected to affect the genealogical structure. However, if a mutation has some selective advantage, then it will change the probability of a genetic segment being transmitted into the next generation, and so affect the genealogy. Therefore, selectively advantageous or disadvantageous mutations normally need to be simulated at the same time as the genealogy, in forward time. The _SLiM_ software package can produce tree sequences simulated in this manner.
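To make the "area" idea concrete, here is a rough sketch that computes the expected number of mutations on a few edges. The edge coordinates are hypothetical, chosen only to illustrate the arithmetic, and the helper `edge_area` is not a tskit or msprime function.

```python
# Hypothetical edges: (left, right, parent_time, child_time), with times in generations.
edges = [
    (0, 100, 30.0, 0.0),
    (0, 40, 12.0, 0.0),
    (40, 100, 30.0, 12.0),
]
rate = 1e-3  # mutations per base pair per generation


def edge_area(left, right, parent_time, child_time):
    """Genomic span of the edge multiplied by the time it covers."""
    return (right - left) * (parent_time - child_time)


# Under a constant-rate neutral model, the expected count on an edge is rate * area.
expected_per_edge = [rate * edge_area(*edge) for edge in edges]
expected_total = sum(expected_per_edge)
print(expected_per_edge, expected_total)
```

The actual number placed on each edge is random; only its expectation scales with the area.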

<dl class="exercise"><dt>Exercise 1</dt>
<dd>Instead of the mutation IDs, it can be helpful to label the mutations with the previous state, the position and the derived state. In the cell below, use the <code>ARG_workshop.mutation_labels(ts)</code> function to create nicer mutation labels to pass to the <code>draw_svg()</code> function</dd>
</dl>

In [None]:
# Use draw_svg to replot the tree sequence with `mutation_labels=ARG_workshop.mutation_labels(ts)`


In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("sites_and_mutations")

### _Tskit_ stores genetic variation efficiently
If you are storing genetic data for (say) a million genomes, using _tskit_ takes less space than conventional methods. That's because each variable site is encoded by a single mutation, or possibly a few, rather than by a million separate letters. Of course, you need to store the genealogy as well, but this too can be stored efficiently in _tskit_: for example, adjacent trees are expected to differ by only a few (often one) subtree-prune-and-regraft (SPR) operations, meaning that only a few edges change from tree to tree, even for genealogies of millions of samples.
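The following pure-Python sketch illustrates why one mutation can stand in for a whole column of genotypes: given the tree, the samples carrying the derived allele are exactly those that descend from the mutation's node. The tree and the helper names are invented for this illustration and do not reflect how _tskit_ is implemented internally.

```python
# Toy tree stored as a child -> parent map. Nodes 0-3 are samples, node 6 is the root.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
samples = [0, 1, 2, 3]


def inherits(sample, mutation_node):
    """Return True if `sample` is, or descends from, `mutation_node`."""
    node = sample
    while node is not None:
        if node == mutation_node:
            return True
        node = parent.get(node)  # None once we step above the root
    return False


def genotype_column(mutation_node):
    """Decode one mutation into a 0/1 genotype for every sample."""
    return [int(inherits(s, mutation_node)) for s in samples]


print(genotype_column(4))  # a mutation above internal node 4
print(genotype_column(3))  # a mutation above sample node 3
```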

### Decoding genetic variation

There are two ways to go through the sites programmatically:
* the `.sites()` iterator returns a _tskit_ [Site](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Site) object, which includes a list of mutations
* the `.variants()` iterator additionally "decodes" the mutations, returning the [Site](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Site), a list of alleles, and an array specifying which samples have those alleles.

Below is an example of obtaining the `.alleles` and `.genotypes` attributes of the `.variants()` iterator.

In [None]:
import numpy as np
for v in ts.variants():
    pos = f"{v.site.position:3g}"  # The "site" object is also an attribute of a variant
    print(pos, v.alleles, v.genotypes, "==", np.array(v.alleles)[v.genotypes], sep="\t")

It is also possible (but even less efficient) to decode the *haplotypes*: that is, the sequence of letters for each of a list of sample genomes:

In [None]:
nodes = [0, 1]
for node, hap in zip(nodes, ts.haplotypes(samples=nodes)):
    print(f"Sample {node}", hap)

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Repeat the analysis above but print out the haplotype for sample node 9, and check it's the same as the last column printed above (when iterating using the <code>.variants()</code> method)</dd></dl>

In [None]:
# Exercise 2: check on the haplotype of the last sample node (id=9)

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("last_haplotype")

The `.haplotypes()` method above only returns the *variable sites* along the 100 base pair genome. If we want to show the whole sequence we can use the `.alignments()` method, as below:

<div class="alert alert-block alert-info"><b>Note:</b> It's also possible to store a reference sequence in the tree sequence, in which case this can be used instead of the <code>missing_data_character</code>. However, support for reference sequences is [preliminary in tskit](https://tskit.dev/tskit/docs/stable/data-model.html#reference-sequence).</div>

In [None]:
for node, hap in zip(nodes, ts.alignments(samples=nodes, missing_data_character=".")):
    print(f"Sample {node}", hap)

## Built-in statistics

It requires some computational effort to decode mutations and figure out the genotypes of the samples. However, this is often not necessary: it is frequently enough to know the local tree and the placement of the mutations within it. _Tskit_ contains a set of built-in [statistical routines](https://tskit.dev/tskit/docs/stable/stats.html) that implement efficient calculations on genetic variation encoded in the tree sequence. These move along the genome, using the local tree directly and caching information used to calculate the statistics. This information is updated incrementally, when relevant parts of the local tree change, which makes the calculations much more efficient than running calculations on the decoded genotypes.
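As a rough illustration of why genotypes need not be decoded, the sketch below computes sitewise diversity from nothing more than the number of samples beneath each mutation. It assumes every site carries exactly one mutation (so each site is biallelic); the function name and the example counts are made up for this illustration.

```python
from math import comb


def site_diversity(derived_counts, num_samples, sequence_length):
    """
    Mean number of pairwise differences per unit of genome.
    `derived_counts[i]` is the number of samples below the single mutation at site i.
    """
    num_pairs = comb(num_samples, 2)
    # A site with k derived copies separates k * (n - k) of the sample pairs.
    differing_pairs = sum(k * (num_samples - k) for k in derived_counts)
    return differing_pairs / num_pairs / sequence_length


# Two sites in 4 samples along a 10 bp genome: one singleton, one doubleton.
print(site_diversity([1, 2], num_samples=4, sequence_length=10))
```

The count of samples below a node is exactly the kind of quantity that can be cached and updated as the local tree changes.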

We'll run a larger example, based on the tutorials. This has a [selective sweep](https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-models-selective-sweeps) halfway along the 10 Mb genome. Despite what we said earlier about neutral mutations, it *is* possible to roughly simulate a basic selective sweep backwards in time, although it is only a limited and slightly crude approximation. Nevertheless, we will use the approach here because it is significantly faster than running a forward-time simulation using SLiM.

In [None]:
pop_size=10_000
seq_len=10_000_000
mu = 1e-8
p = 1 / (2 * pop_size)
sweep_model = msprime.SweepGenicSelection(
    position=seq_len/2, start_frequency=p, end_frequency=1 - p, s=0.1, dt=1e-6)

base_ts = msprime.sim_ancestry(
    500,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=pop_size,
    sequence_length=seq_len,
    recombination_rate=2e-8,
    random_seed=1,  # only needed for repeatability
    )
ts = msprime.sim_mutations(base_ts, rate=mu, random_seed=2)
ts

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Use <code>ts.diversity()</code> to calculate the average sitewise diversity of this simulated tree sequence. This should be instantaneous, even for 1000 genomes over 25k sites</dd>
</dl>

In [None]:
# Exercise 3: use this box to calculate the diversity
ts.diversity()

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("diversity")

Since there has been a selective sweep, we expect a "diversity dip" around the region of the sweep. The stats API can report statistics windowed over the genome: the plotting code below passes 40 evenly-spaced breakpoints, which define 39 windows.
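Windows are specified by their breakpoints, so `k` breakpoints define `k - 1` windows. A small numpy sketch of this, using the 10 Mb sequence length from the simulation above (the helper `make_windows` is hypothetical):

```python
import numpy as np


def make_windows(sequence_length, num_windows):
    """Evenly spaced breakpoints from 0 to sequence_length, defining `num_windows` windows."""
    return np.linspace(0, sequence_length, num_windows + 1)


sequence_length = 10_000_000
forty_points = np.linspace(0, sequence_length, 40)  # 40 breakpoints
forty_windows = make_windows(sequence_length, 40)   # 41 breakpoints

print(len(forty_points) - 1, "windows from 40 breakpoints")
print(len(forty_windows) - 1, "windows from 41 breakpoints")
print("window width:", np.diff(forty_windows)[0])
```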

In [None]:
from matplotlib import pyplot as plt
import numpy as np

genomic_windows = np.linspace(0, ts.sequence_length, 40)
plt.stairs(ts.diversity(windows=genomic_windows), genomic_windows, baseline=None, label="site")
plt.xlabel("Genome position")
plt.ylabel("Diversity");

## Site versus branch length statistics

The plot above counts diversity using pairwise mutational differences. However, the true (underlying) difference between two sample nodes is simply the branch length distance between them. The mutational difference simply reflects this branch distance, under the assumption that most mutations are at novel sites. In a sense, the mutational distance will tend to the branch-length distance as we get more and more mutational information (i.e. as the mutation rate increases). We can see this by overlaying mutations at different rates onto the underlying genealogy. We can switch to measuring branch length by setting `mode="branch"` when calling a statistical method; the default sitewise version will need to be divided by the mutation rate to be comparable. 
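The convergence can be sketched without any genealogy at all. Assuming a single pair of samples separated by a fixed branch distance, no recombination, and no repeated hits at a site, the number of differences is Poisson distributed with mean equal to rate × distance × sequence length. The distance and sequence length below are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
branch_distance = 40_000      # hypothetical total branch length (generations) between two samples
sequence_length = 10_000_000  # base pairs

estimates = {}
for mut_rate in [1e-10, 1e-9, 1e-8, 1e-7]:
    expected_differences = mut_rate * branch_distance * sequence_length
    num_differences = rng.poisson(expected_differences)
    # Dividing the per-base difference by the mutation rate puts it on the branch-length scale.
    estimates[mut_rate] = num_differences / sequence_length / mut_rate
    print(f"mu={mut_rate:g}: {num_differences} differences, estimated distance {estimates[mut_rate]:.0f}")
```

Higher rates give more mutations, so the rescaled estimate scatters less around the true branch distance.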


In [None]:
branch_diversity = base_ts.diversity(windows=genomic_windows, mode="branch")
plt.stairs(branch_diversity, genomic_windows, baseline=None, label="branch", color="black", lw=2)
for mut_rate in [1e-10, 1e-9, 1e-8, 1e-7]:
    ts = msprime.sim_mutations(base_ts, rate=mut_rate)
    site_diversity = ts.diversity(windows=genomic_windows) / mut_rate
    plt.stairs(site_diversity, genomic_windows, baseline=None, label=rf"site: $\mu={mut_rate}$")
plt.xlabel("Genome position")
plt.ylabel("Diversity")
plt.legend();

In [None]:
# Execute code block with <shift>+Return to display question; press on one of the buttons to answer
workbook.question("diversity")

### One-way and multi-way methods

The `diversity()` method returns an average across all the samples in the tree sequence (or a specified subset, if node IDs are provided). However, some statistical methods divide the samples into different sets, and compare within and between sets. One example of this is the classic $F_{st}$ statistic, which is available using `.Fst(sample_sets)`. For instance, the code below simulates a classic stepping-stone model and calculates the $F_{st}$ between populations.
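To show how a two-way statistic combines within-set and between-set quantities, here is a small sketch of $F_{st}$ written in terms of the diversity within each set and the divergence between them. The formula follows the definition given in the tskit documentation as I understand it; the function name is hypothetical and the input values are invented.

```python
def fst_from_diversity(d_x, d_y, d_xy):
    """
    Fst from the diversity within sets X and Y (d_x, d_y)
    and the divergence between them (d_xy).
    """
    return 1 - 2 * (d_x + d_y) / (d_x + 2 * d_xy + d_y)


# Identical within- and between-set values: no population structure.
print(fst_from_diversity(1.0, 1.0, 1.0))
# Between-set divergence three times the within-set diversity.
print(fst_from_diversity(1.0, 1.0, 3.0))
```

In practice the three inputs would come from calls such as `ts.diversity()` and `ts.divergence()` on the relevant sample sets.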

In [None]:
### TODO

In [None]:
pop_sizes = [3e4, 2e4, 3e4, 1e4, 10e4]  # pick some variable sizes for each population 
model = msprime.Demography.stepping_stone_model(pop_sizes, migration_rate=0.001, boundaries=True)
ts = msprime.sim_ancestry(
    {'pop_0': 20, 'pop_1': 20, 'pop_2': 20, 'pop_3': 20, 'pop_4': 20},
    sequence_length=1e6, demography=model, recombination_rate=1e-8)
# Pairwise Fst between the first three populations (indexes refer to positions in sample_sets)
ts.Fst(
    sample_sets=[
        ts.samples(population=0),
        ts.samples(population=1),
        ts.samples(population=2),
    ],
    indexes = [(0, 1), (0, 2), (1, 2)],
    mode="branch",
)

In [None]:
ts.samples(population=0)

### The allele frequency spectrum

We can efficiently calculate the allele frequency spectrum (AFS), here in windows along the genome.
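For readers unfamiliar with the AFS, the numpy sketch below computes a polarised spectrum from a small, invented genotype matrix: entry `i` counts the sites at which exactly `i` samples carry the derived allele. This is only an illustration of the definition. It returns raw counts, whereas, as far as I am aware, the tskit method by default also normalises by the span of each window.

```python
import numpy as np

# Hypothetical genotype matrix: rows are sites, columns are samples (0 = ancestral, 1 = derived).
genotypes = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
])
num_samples = genotypes.shape[1]

# Number of derived copies at each site, then a histogram over 0..num_samples.
derived_counts = genotypes.sum(axis=1)
afs = np.bincount(derived_counts, minlength=num_samples + 1)
print(afs)
```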

In [None]:
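ts = msprime.sim_mutations(base_ts, rate=mu, random_seed=2)  # return to the selective-sweep simulation
windowed_afs = ts.allele_frequency_spectrum(windows=genomic_windows, polarised=True)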

By the way, it's also possible to plot windowed versions of the AFS, but they aren't very sensitive to 

In [None]:
im = plt.pcolormesh(genomic_windows, np.arange(0, ts.num_samples+2), windowed_afs.T)
bar = plt.colorbar(im, ax=plt.gca())
plt.yscale("log")
plt.ylim(1, 1000)

## Mutation times
Each mutation in a tree sequence can also (optionally) be associated with a time:

In [None]:
plt.scatter(ts.sites_position[ts.mutations_site], ts.mutations_time, alpha=0.1, s=1)
plt.yscale("log")

## Multiple / recurrent mutations
