In [None]:
import sys
import tskit
import msprime


if "pyodide" in sys.modules:
    import tqdm
    import micropip
    await micropip.install('jupyterquiz')
    await micropip.install('demesdraw')


import workshop
workbook = workshop.setup_msprime_simulations()
display(workbook.setup)

# An introduction to simulations with msprime

In this exercise we will acquaint ourselves with the extremely efficient and versatile coalescent simulator [msprime](https://tskit.dev/msprime/docs/stable/intro.html). It owes much of its efficiency to the [tskit](https://tskit.dev/) (tree sequence kit) format, which stores and processes genetic and phylogenetic data compactly. Together with other software that uses this file format, it makes up an ecosystem of high-performance population genetic tools.

We will start by reproducing simulations similar to those in the previous exercise, after which we move on to more advanced examples. Many of the examples here are taken from the [msprime quickstart](https://tskit.dev/msprime/docs/stable/quickstart.html) and documentation.


## Basic simulations

Briefly, coalescent simulations in msprime are done by calling two functions in succession which, by coincidence 😉, are called `sim_ancestry` and `sim_mutations`. 

### Getting to know the tree sequence object

Let's first simulate the [ancestry](https://tskit.dev/msprime/docs/stable/ancestry.html#) of 5 samples. The call to [msprime.sim_ancestry](https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_ancestry) will return a so-called [tree sequence](https://tskit.dev/learn/), which we will call `ts`. 

In [None]:
ts = msprime.sim_ancestry(samples=5, ploidy=1, random_seed=123456)

By default, `msprime` assumes a ploidy of 2, which is why we have to manually pass the `ploidy` parameter. In addition, by setting the `random_seed`, we make sure simulation output can be reproduced. Let's print the output object:

In [None]:
ts

The `ts` object is an instance of the [Tree Sequence](https://tskit.dev/tskit/docs/latest/python-api.html#the-treesequence-class) class. Briefly, it consists of metadata, such as the `Sequence Length` or `Time Units`, and a number of tables, such as the `Edges` (the equivalent of our `branches`), `Nodes`, and `Mutations` table. The metadata and table entries can be accessed with identically-named properties or functions on `ts` (where spaces have been replaced by underscores), e.g.,

In [None]:
print(ts.sequence_length)
print(ts.time_units)
for ind in ts.individuals():
    print(ind)

So, an individual carries an `id`, a unique number that identifies the individual. This is a common feature of the `tskit` data structures in that all objects carry a unique numeric identifier `id`. Furthermore, there are references to an individual's `parents`, which is a list of integers corresponding to parent ids, and similarly for `nodes`, the nodes of the tree. Finally, there is additional information such as a `metadata` slot. 

The `ts.individuals()` function accesses individuals from the `Individuals` table, one by one. As we here have simulated a small genealogy, you can also print the table directly (but don't do it for large simulations!):

In [None]:
ts.tables.individuals

In addition to the properties and functions that map to the metadata and table names, there are a number of convenience functions that provide shortcut access to quantities of interest, e.g., `ts.num_individuals` and `ts.num_populations`:

In [None]:
ts.num_individuals, ts.num_populations

You can find all properties and functions defined on `ts` by using the Python built-in `dir`:

In [None]:
dir(ts)

Let's find some more information from the tables.

In [None]:
workbook.question("tskit_tables")

There (of course) exists functionality to easily plot a genealogy. The `ts` object has several `draw_` functions, one of which produces SVG output:

In [None]:
ts.draw_svg()

Apart from showing the genealogy, the plot includes a genome coordinate axis, which shows that the simulation assumes a sequence length of 1 nucleotide by default.

### Adding mutations

As before, we add mutations with a `sim_mutations` function, [msprime.sim_mutations](https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_mutations):

In [None]:
mutated_ts = msprime.sim_mutations(ts, rate=0.5, random_seed=54321)

Here, we specify the mutation rate via the `rate` parameter, which according to the docs is "The rate of mutation per unit of sequence length per unit time" (try varying this parameter and see how it affects the illustrated genealogy below):

In [None]:
mutated_ts.draw_svg(size=(500, 300))

Here we increase the size of the plot to see the details better. To begin with, a mutation is indicated with a red `x`. In addition, the mutations are numbered, such that the ordering along a genetic sequence is explicit. Finally, at genome position 0 you see <em>&or;</em> marks that indicate the position of a mutation.

The latter point can also be illustrated by printing all the mutations, as follows (note the information in `site`):


In [None]:
for mut in mutated_ts.mutations():
    print(mut)

As before, there are shorthand functions and properties to access quantities of interest, e.g.,


In [None]:
mutated_ts.mutations_time

### Summary statistics

There is support for calculating a variety of summary statistics on tree sequences. For instance, to calculate the diversity of `mutated_ts` you can run


In [None]:
mutated_ts.diversity()

<dl class="exercise"><dt>Exercise 1</dt>
    <dd>OPTIONAL: verify the diversity using the equation
        
$$
\pi = \frac{\sum_{i=1}^{n-1}i(n-i)\xi_i}{n(n-1)/2}
$$
        
Here, $\xi_i$ is the tally of the number of mutations that occur in $i$ samples (the *site frequency spectrum*). Recall that the `diversity` function reports a *per-site* statistic!
    </dd>
    </dl>
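To see how the equation works mechanically, here is a small pure-Python sketch applied to a made-up site frequency spectrum. The function `diversity_from_sfs` and the toy values are hypothetical and not taken from the simulation above. Dividing by the sequence length gives the per-site value.

```python
def diversity_from_sfs(sfs, n, sequence_length=1.0):
    """Per-site diversity from an unfolded SFS.

    sfs[i - 1] is the number of mutations carried by i samples, i = 1..n-1.
    """
    if len(sfs) != n - 1:
        raise ValueError("sfs must have n - 1 entries")
    numerator = sum(i * (n - i) * xi for i, xi in enumerate(sfs, start=1))
    num_pairs = n * (n - 1) / 2
    return numerator / num_pairs / sequence_length


# Toy example with n = 5 samples: xi_1 = 2, xi_2 = 1, xi_3 = 0, xi_4 = 1.
# Numerator: 1*4*2 + 2*3*1 + 3*2*0 + 4*1*1 = 18; number of pairs: 10.
toy_sfs = [2, 1, 0, 1]
pi = diversity_from_sfs(toy_sfs, n=5)
print(pi)
```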

## More realistic simulations

So far we have basically demonstrated how our homemade simulations look in msprime. However, the msprime versions of [sim_ancestry](https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_ancestry) and [sim_mutations](https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_mutations) provide many more options than our functions, and simulations can accommodate much more complex and realistic scenarios, such as *recombination*, *migration*, *demographic changes*, in some cases *selection*, and more. Briefly skim the API documentation by following the links above to get an overview of what these functions can do.

### Diploid simulations

Until now, we have focused on haploid individuals. In order to introduce recombination, we shift to diploids, which is in fact the [default setting in msprime](https://tskit.dev/msprime/docs/stable/ancestry.html#ploidy). Since a node corresponds to one chromosome, this means an individual is related to two nodes in a tree.



In [None]:
ts = msprime.sim_ancestry(samples=2, random_seed=23423)
print(ts.tables.individuals)
print(ts.tables.nodes)

Note the individual ids and how they relate to the node ids.

### Sequence length

By default sequences in `msprime` correspond to nucleotides. Let's specify a 10kb sequence with the parameter `sequence_length`. Note how the genome coordinates change in the resulting plot.


In [None]:
ts = msprime.sim_ancestry(samples=2,
                          sequence_length=10_000,
                          random_seed=123456)
ts.draw_svg()

### Recombination


So far the tree sequence consists of one tree, which corresponds to a non-recombining sequence. We can set the `recombination_rate` parameter to add recombination. Note how we now have two trees, one for each non-recombining segment of the sequence.

In [None]:
ts = msprime.sim_ancestry(samples=2,
                          sequence_length=10_000,
                          recombination_rate=1e-5, # set a high rate 
                          random_seed=12353
                         )
ts.draw_svg(size=(600, 300))

In [None]:
workbook.question("recombination")

### Population information

So far, we have not mentioned population size in our simulations, but this is something we would like to do since this parameter affects the dynamics of the system. We can set the population size with the `population_size` option, which corresponds to the *effective population size* $N_e$. Note that we decrease the sequence length and recombination rate to speed up simulation:

In [None]:
ts_small = msprime.sim_ancestry(
    samples=2,
    sequence_length=1_000,
    recombination_rate=1e-8, 
    population_size=20_000, # similar to human Ne
    random_seed=12123
)
ts_small.draw_svg()

<dl class="exercise"><dt>Exercise 2</dt>
    <dd>To demonstrate the speed of msprime, simulate a large tree sequence of 20,000 diploid individuals with 1Mbp genomes and a recombination rate of 1e-8. Use the random seed 42, save the output to a variable <tt>large_ts</tt> and print the table.

<div class="alert alert-block alert-info"><b>Tip:</b>
    Make sure you DON'T display the SVG trees! Each tree is huge, and there are a lot of them.</div>        
</dd>
</dl>

In [None]:
# Exercise: Set `large_ts` to a new large tree sequence, generated using msprime.sim_ancestry() with
# specific parameters (random_seed=42, etc.), then output the tree sequence summary table to screen.


In [None]:
workbook.question("large_ancestry")

### Adding mutations

Now we add mutations to the simulated ancestry. Note that there are more mutations than sites.


In [None]:
mts_small = msprime.sim_mutations(
    ts_small, # Use the small tree so we can visualize mutations
    rate=1e-7, # set a high mutation rate 
    random_seed=22
)

In [None]:
mts_small.draw_svg(size=(600, 300))

In [None]:
mts_small

<dl class="exercise"><dt>Exercise 3</dt>
    <dd>
    
Print out the mutations table of the <code>mts_small</code> object.
 
</dd>
</dl>

In [None]:
workbook.question("mutation_id")

We'll now add mutations to the large ancestry simulated previously. Recall that we have a sample size of 20,000, corresponding to 40,000 chromosomes. A 1Mbp sequence may contain on the order of 10,000 variant positions, which in a [Variant Call Format](https://en.wikipedia.org/wiki/Variant_Call_Format) (VCF) file would constitute a 40,000 by 10,000 matrix, requiring ca. 700 MB of (uncompressed) storage. The resulting tree sequence, however, only requires 8 MB.
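A back-of-the-envelope calculation shows where a figure of this size comes from. The sketch assumes, purely for illustration, that each diploid genotype is written as a phased field such as `0|1` followed by a tab (4 bytes). It ignores the header and the fixed VCF columns, so it only gives the order of magnitude.

```python
num_individuals = 20_000
num_chromosomes = 2 * num_individuals  # diploid: 40,000 chromosomes
num_variants = 10_000

# Size of the genotype matrix: one entry per chromosome per variant.
genotype_entries = num_chromosomes * num_variants
print(genotype_entries)

# Assumption: one "0|1"-style field plus separator per individual and variant.
bytes_per_individual_field = 4
vcf_bytes = num_individuals * num_variants * bytes_per_individual_field
print(vcf_bytes / 1e6, "MB (rough estimate)")
```

Under these assumptions the estimate comes out at several hundred megabytes, the same order of magnitude as the figure quoted above.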

<dl class="exercise"><dt>Exercise 4</dt>
    <dd>
        
Add mutations to the `large_ts` tree sequence simulated previously. Use a random seed 276 and mutation rate 1e-8.
    </dd>
    </dl>

In [None]:
# Exercise 4: add mutations to large_ts with random seed 276 and 
# mutation rate 1e-8
%time # Time the command


In [None]:
workbook.question("large_mutation")

Note that it only takes a second or two to generate this reasonably large data set, which fits handily on a laptop. With the efficient storage that the [tskit](https://tskit.dev) library provides, together with efficient simulators like `msprime`, it is now possible to simulate data for realistically large populations and genome sizes.