# Getting started with tskit
This notebook follows the step-by-step tutorial found [here](https://tskit.dev/tutorials/getting_started.html). We generate an example tree sequence using [msprime](https://tskit.dev/msprime/docs/stable/intro.html), a Python package that simulates data for use with *tskit*:

> A number of different software programs can generate tree sequences. For the purposes of this tutorial we’ll use msprime to create an example tree sequence representing the genetic genealogy of a 10Mb chromosome in twenty diploid individuals. To make it a bit more interesting, we’ll simulate the effects of a selective sweep in the middle of the chromosome, then throw some neutral mutations onto the resulting tree sequence.

Let's load the necessary packages to generate the data:

In [None]:
import itertools

import msprime

Now generate a *tree sequence* with *msprime*:

In [None]:
pop_size=10_000
seq_length=10_000_000

sweep_model = msprime.SweepGenicSelection(
    position=seq_length/2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)

ts = msprime.sim_ancestry(
    20,
    model=[sweep_model, msprime.StandardCoalescent()],
    population_size=pop_size,
    sequence_length=seq_length,
    recombination_rate=1e-8,
    random_seed=1234,  # only needed for repeatability
    )
# Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321)
ts

The `ts` object contains thousands of trees. We simulated *20 diploid* individuals, and each individual carries *two* genomes, so there are 40 sample nodes (one per genome), as described by the tutorial.
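
The relationship between individuals and sample nodes can be sketched without running a simulation. The snippet below is only a toy model: it assumes that sample nodes are numbered consecutively per individual (nodes 0 and 1 belong to individual 0, and so on), which is an assumption about the numbering rather than something stated in the tutorial. On a real tree sequence, `ts.node(n).individual` is the authoritative source.

```python
# Toy illustration: 20 diploid individuals, each carrying two genomes (sample nodes).
# ASSUMPTION: node IDs are assigned consecutively per individual.
num_individuals = 20
ploidy = 2

# Map every sample node ID to the individual it is assumed to belong to.
node_to_individual = {
    node_id: node_id // ploidy
    for node_id in range(num_individuals * ploidy)
}

num_sample_nodes = len(node_to_individual)
nodes_of_individual_3 = [n for n, ind in node_to_individual.items() if ind == 3]

print(num_sample_nodes)        # 40
print(nodes_of_individual_3)   # [6, 7]
```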

## Process metadata

Let's explore the generated metadata:

In [None]:
for sample in itertools.islice(ts.samples(), 3):
    print(sample)

In [None]:
for individual in itertools.islice(ts.individuals(), 3):
    print(individual)

In [None]:
for population in ts.populations():
    print(population)

## Processing trees

Iterate over the *trees* with the `trees()` method:

In [None]:
for tree in ts.trees():
    print(f"Tree {tree.index} covers {tree.interval}")
    if tree.index >= 4:
        print("...")
        break
print(f"Tree {ts.last().index} covers {ts.last().interval}")

There are also `first()` and `last()` methods to access the *first* and *last* trees respectively. Next, check whether all trees have coalesced (this is not always true for [forward simulations](https://tskit.dev/tutorials/forward_sims.html#sec-tskit-forward-simulations)):

In [None]:
import time
elapsed = time.time()
for tree in ts.trees():
    if tree.has_multiple_roots:
        print(f"Tree {tree.index} has not coalesced")
        break
else:
    elapsed = time.time() - elapsed
    print(f"All {ts.num_trees} trees coalesced")
    print(f"Checked in {elapsed:.6g} secs")
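
The coalescence check relies on Python's `for ... else` construct: the `else` branch runs only when the loop finishes without hitting `break`. As a minimal sketch of that control flow, the helper below (a hypothetical function, not part of tskit) works on a plain list of root counts instead of real trees.

```python
def first_uncoalesced(num_roots_per_tree):
    """Return the index of the first tree with more than one root, else None."""
    for index, num_roots in enumerate(num_roots_per_tree):
        if num_roots > 1:
            result = index
            break
    else:
        # Reached only if the loop ran to completion without `break`.
        result = None
    return result


print(first_uncoalesced([1, 1, 1]))  # None: every tree has a single root
print(first_uncoalesced([1, 3, 1]))  # 1: the second tree has three roots
```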

Now that we know all trees have coalesced, we know that at each position in the
genome all the 40 sample nodes must have one most recent common ancestor (MRCA).
Below, we iterate over the trees, finding the IDs of the root (MRCA) node for
each tree. The time of this root node can be found via the `tskit.TreeSequence.node()`
method, which returns a `Node` object whose attributes include the node time:

In [None]:
import matplotlib.pyplot as plt

kb = [0]  # Starting genomic position
mrca_time = []
for tree in ts.trees():
    kb.append(tree.interval.right/1000)  # convert to kb
    mrca = ts.node(tree.root)  # For msprime tree sequences, the root node is the MRCA
    mrca_time.append(mrca.time)
plt.stairs(mrca_time, kb, baseline=None)
plt.xlabel("Genome position (kb)")
plt.ylabel("Time of root (or MRCA) in generations")
plt.yscale("log")
plt.show()
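
A step plot needs one more edge than it has values, which is why `kb` starts with an extra `0` and each tree then contributes only its right-hand boundary. The sketch below reproduces that bookkeeping on a handful of made-up `(left, right, root_time)` triples; these numbers are purely illustrative and are not output from the simulation.

```python
# Hypothetical (left, right, root_time) triples standing in for the trees.
toy_trees = [
    (0, 2_000, 41_000.0),
    (2_000, 5_500, 38_500.0),
    (5_500, 7_000, 950.0),     # deliberately shallow, as one might see near a sweep
    (7_000, 10_000, 40_200.0),
]

edges_kb = [toy_trees[0][0] / 1000]   # left edge of the first tree
root_times = []
for left, right, root_time in toy_trees:
    edges_kb.append(right / 1000)     # each tree adds only its right edge
    root_times.append(root_time)

# N values require N + 1 edges.
assert len(edges_kb) == len(root_times) + 1

# Locate the interval with the youngest root.
shallowest = min(range(len(root_times)), key=root_times.__getitem__)
print(f"Youngest root spans {edges_kb[shallowest]} to {edges_kb[shallowest + 1]} kb")
```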

It’s obvious that there’s something unusual about the trees in the middle of
this chromosome, where the selective sweep occurred.

Although tskit is designed so that it is rapid to pass through trees sequentially,
it is also possible to pull out individual trees from the middle of a tree sequence
via the `TreeSequence.at()` method. Here’s how you can use that to extract the
tree at location 5,000,000 (the position of the sweep) and draw it using the
`Tree.draw_svg()` method:

In [None]:
swept_tree = ts.at(5_000_000)  # or you can get e.g. the nth tree using ts.at_index(n)
intvl = swept_tree.interval
print(f"Tree number {swept_tree.index}, which runs from position {intvl.left} to {intvl.right}:")
# Draw it at a wide size, to make room for all 40 tips
swept_tree.draw_svg(size=(1000, 200))

This tree shows the classic signature of a recent expansion or selection event,
with many long terminal branches, resulting in an excess of singleton mutations.

It can often be helpful to slim down a tree sequence so that it represents the
genealogy of a smaller subset of the original samples. This can be done using
the powerful `TreeSequence.simplify()` method.

The `TreeSequence.draw_svg()` method allows us to draw more than one tree:
either the entire tree sequence, or (by using the `x_lim` parameter) a smaller
region of the genome:

In [None]:
reduced_ts = ts.simplify([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # simplify to the first 10 samples
print("Genealogy of the first 10 samples for the first 5kb of the genome")
reduced_ts.draw_svg(x_lim=(0, 5000))

These are much more standard-looking coalescent trees, with far longer branches
higher up in the tree, and therefore many more mutations at higher frequencies.

> In this tutorial we refer to objects, such as sample nodes, by their numerical
> IDs. These can change after simplification, and it is often more meaningful
> to [work with metadata](https://tskit.dev/tutorials/metadata.html#sec-tutorial-metadata), 
> such as sample and population names, which can be permanently attached to
> objects in the tree sequence. Such metadata is often incorporated automatically
> by the tools generating the tree sequence.

## Processing sites and mutations

For many purposes it may be better to focus on the genealogy of your samples,
rather than the sites and mutations that define the genome sequence itself.
Nevertheless, tskit also provides efficient ways to return Site objects and Mutation
objects from a tree sequence. For instance, under the finite sites model of mutation
that we used above, multiple mutations can occur at some sites, and we can identify
them by iterating over the sites using the `TreeSequence.sites()` method:

In [None]:
import numpy as np

num_muts = np.zeros(ts.num_sites, dtype=int)

for site in ts.sites():
    num_muts[site.id] = len(site.mutations)  # site.mutations is a list of mutations at the site

# Print out some info about mutations per site
for nmuts, count in enumerate(np.bincount(num_muts)):
    info = f"{count} sites"
    if nmuts > 1:
        info += f", with IDs {np.where(num_muts==nmuts)[0]},"
    print(info, f"have {nmuts} mutation" + ("s" if nmuts != 1 else ""))
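
The summary above hinges on `np.bincount`, which turns a list of per-site mutation counts into a tally of how many sites have 0, 1, 2, ... mutations. Here is a small sketch on made-up counts (not taken from the simulation) showing that step in isolation.

```python
import numpy as np

# Made-up mutation counts for eight sites (stand-in for len(site.mutations)).
toy_num_muts = np.array([1, 1, 2, 1, 0, 3, 1, 2])

# tally[k] is the number of sites carrying exactly k mutations.
tally = np.bincount(toy_num_muts)

# IDs of sites hit by more than one mutation.
multi_hit_sites = np.flatnonzero(toy_num_muts > 1)

print(tally)            # [1 4 2 1]
print(multi_hit_sites)  # [2 5 7]
```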

## Processing genotypes

At each site, the sample nodes will have a particular allelic state (or be flagged
as missing data). The `TreeSequence.variants()` method gives access to the full
variation data. For efficiency, the genotypes at a site are returned as a numpy
array of integers:

In [None]:
np.set_printoptions(linewidth=200)  # print genotypes on a single line

print("Genotypes")
for v in ts.variants():
    print(f"Site {v.site.id}: {v.genotypes}")
    if v.site.id >= 4:  # only print up to site ID 4
        print("...")
        break

> Tree sequences are optimised to look at all samples at one site, then all samples
> at an adjacent site, and so on along the genome. It is much less efficient to look
> at all the sites for a single sample, then all the sites for the next sample,
> etc. In other words, **you should generally iterate over sites**, not samples.
> Nevertheless, all the alleles for a single sample can be obtained via the
> `TreeSequence.haplotypes()` method.

To find the actual allelic states at a site, you can refer to the alleles
provided for each Variant: the genotype value is an index into this list.
Here’s one way to print them out; for clarity this example also prints out the IDs
of both the sample nodes (i.e. the genomes) and the diploid individuals in which
each sample node resides.

In [None]:
samp_ids = ts.samples()

print("  ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
print("       ID of (sample) node: ", " ".join([f"{s:3}" for s in samp_ids]))

for v in ts.variants():
    site = v.site
    alleles = np.array(v.alleles)
    print(f"Site {site.id} (ancestral state '{site.ancestral_state}')",  alleles[v.genotypes])
    if site.id >= 4:  # only print up to site ID 4
        print("...")
        break
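
The expression `alleles[v.genotypes]` works because indexing a NumPy array with an integer array looks up one element per index. A toy sketch with a hypothetical two-allele site and six made-up genotypes:

```python
import numpy as np

# Hypothetical site: allele 0 is the ancestral state "A", allele 1 is derived "T".
toy_alleles = np.array(["A", "T"])
toy_genotypes = np.array([0, 0, 1, 0, 1, 1])  # one integer per sample node

# Integer-array indexing maps each genotype to its allelic state.
states = toy_alleles[toy_genotypes]
print(states)  # ['A' 'A' 'T' 'A' 'T' 'T']
```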