# TSINFER tutorial
Suppose we have phased haplotype data for five samples at six sites, like this:

```text
sample  haplotype
0       AGCGAT
1       TGACAG
2       AGACAC
3       ACCGCT
4       ACCGCT
```

Before inferring a tree sequence (a `tskit.TreeSequence` object) that models these data,
I need to import them into `tsinfer`, which requires knowing the ancestral allele at each site:

In [None]:
import string
import numpy as np
import tsinfer
from tskit import MISSING_DATA

with tsinfer.SampleData(sequence_length=6) as sample_data:
    sample_data.add_site(0, [0, 1, 0, 0, 0], ["A", "T"], ancestral_allele=0)
    sample_data.add_site(1, [0, 0, 0, 1, 1], ["G", "C"], ancestral_allele=0)
    sample_data.add_site(2, [0, 1, 1, 0, 0], ["C", "A"], ancestral_allele=0)
    sample_data.add_site(3, [0, 1, 1, 0, 0], ["G", "C"], ancestral_allele=MISSING_DATA)
    sample_data.add_site(4, [0, 0, 0, 1, 1], ["A", "C"], ancestral_allele=0)
    sample_data.add_site(5, [0, 1, 2, 0, 0], ["T", "G", "C"], ancestral_allele=0)
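
The genotype and allele lists above were typed by hand from the haplotype table. As a
minimal sketch, they could also be derived from the haplotype strings. The helper
`site_columns` is a hypothetical name used only here, and listing alleles in order of
first appearance is an assumption of this sketch: it says nothing about which allele
is really ancestral.

```python
# Haplotypes from the table above, one string per sample.
haplotypes = ["AGCGAT", "TGACAG", "AGACAC", "ACCGCT", "ACCGCT"]


def site_columns(haplotypes):
    """Yield (position, genotypes, alleles) for each site.

    Alleles are listed in order of first appearance, so index 0 is the
    allele carried by sample 0. This is a convention of this sketch only,
    not a statement about the ancestral state.
    """
    num_sites = len(haplotypes[0])
    for position in range(num_sites):
        # Read one column of the haplotype table: one base per sample.
        column = [h[position] for h in haplotypes]
        # Unique alleles, keeping the order in which they first appear.
        alleles = list(dict.fromkeys(column))
        # Each genotype is the index of the sample's base in `alleles`.
        genotypes = [alleles.index(base) for base in column]
        yield position, genotypes, alleles


for position, genotypes, alleles in site_columns(haplotypes):
    print(position, genotypes, alleles)
```

For this dataset the derived lists match the ones passed to `add_site()` above, because
sample 0 happens to carry allele index 0 at every site.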

`tsinfer.SampleData` is the object required for inferring a tree sequence. The
`add_site()` method adds the information for one SNP at a time. The first
argument is the *SNP position*: here, for simplicity, I number the SNPs in positional order,
but it can be any non-negative value (even a float). The only requirement is that each
position is unique and that sites are added in increasing order. The second argument holds
the *genotypes* of each sample at this position: each value is the index of an allele in
the list given as the third argument. Missing genotypes are encoded with `tskit.MISSING_DATA`.
The last argument, `ancestral_allele`, is the index of the ancestral allele, or
`tskit.MISSING_DATA` when it is unknown (as at site 3). Not all sites are used
to infer the trees: sites with missing data, sites with an unknown ancestral allele, and
sites with more than two alleles are not used for inference, but they are still modeled in
the resulting tree sequence.
Once we have the `SampleData` instance, we can infer a tree sequence using
`tsinfer.infer`:

In [None]:
ts = tsinfer.infer(sample_data)
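
The site-selection rule described above can be sketched in plain Python. This is only an
illustration of the three exclusion criteria as stated in these notes, not `tsinfer`'s
actual implementation, which may apply further criteria (for example on allele frequency).
The function `used_for_inference` and the `sites` list are hypothetical names; `MISSING`
is set to -1, the value of `tskit.MISSING_DATA`.

```python
MISSING = -1  # same value as tskit.MISSING_DATA

# (position, genotypes, alleles, ancestral_allele), as passed to add_site()
sites = [
    (0, [0, 1, 0, 0, 0], ["A", "T"], 0),
    (1, [0, 0, 0, 1, 1], ["G", "C"], 0),
    (2, [0, 1, 1, 0, 0], ["C", "A"], 0),
    (3, [0, 1, 1, 0, 0], ["G", "C"], MISSING),
    (4, [0, 0, 0, 1, 1], ["A", "C"], 0),
    (5, [0, 1, 2, 0, 0], ["T", "G", "C"], 0),
]


def used_for_inference(genotypes, alleles, ancestral_allele):
    """Apply only the three exclusion rules stated in the text (assumption)."""
    if ancestral_allele == MISSING:  # unknown ancestral allele
        return False
    if len(alleles) > 2:  # more than two alleles
        return False
    if MISSING in genotypes:  # at least one missing genotype
        return False
    return True


inference_positions = [
    pos for pos, genotypes, alleles, anc in sites
    if used_for_inference(genotypes, alleles, anc)
]
print(inference_positions)
```

Under these three rules alone, site 3 (unknown ancestral allele) and site 5 (three
alleles) are set aside, and the remaining sites are kept.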

This `ts` object is a full *Tree Sequence* object:

In [None]:
ts

This *tree sequence* can be analyzed as usual, for example by printing the haplotypes and drawing the trees:

In [None]:
print("==Haplotypes==")
for sample_id, h in enumerate(ts.haplotypes()):
    print(sample_id, h, sep="\t")
ts.draw_svg(y_axis=True)

If I understand correctly, `tsinfer` can impute missing data (this still needs to be checked). For the
data I entered, there is a *root* node with three *children*: this is known as a *polytomy*.
Every *internal* node represents an ancestral sequence. By default, the time of those
nodes is not measured in years or generations: it is the frequency of the shared
derived alleles on which the ancestral sequence is based. This is why the time is
*uncalibrated* in the graph above.
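
To make "frequency of the shared derived alleles" concrete, here is a minimal sketch that
computes the derived-allele frequency for the biallelic sites of this example that have a
known ancestral allele. It only illustrates the quantity: how `tsinfer` turns these
frequencies into node times internally is not shown here, and the names
`derived_allele_frequency` and `site_genotypes` are hypothetical.

```python
MISSING = -1  # same value as tskit.MISSING_DATA

# position -> genotypes; the ancestral allele has index 0 at all of these sites
site_genotypes = {
    0: [0, 1, 0, 0, 0],
    1: [0, 0, 0, 1, 1],
    2: [0, 1, 1, 0, 0],
    4: [0, 0, 0, 1, 1],
}


def derived_allele_frequency(genotypes, ancestral_allele=0):
    """Fraction of non-missing samples that carry a non-ancestral allele."""
    called = [g for g in genotypes if g != MISSING]
    derived = sum(1 for g in called if g != ancestral_allele)
    return derived / len(called)


frequencies = {
    pos: derived_allele_frequency(g) for pos, g in site_genotypes.items()
}
for pos, freq in frequencies.items():
    print(pos, freq)
```

Sites 1 and 4 have the same derived-allele pattern (samples 3 and 4), so they share one
frequency, while the derived allele at site 0 is carried by a single sample and has a
lower frequency.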

In [None]:
# Extra code to label and order the tips alphabetically rather than numerically
labels = {i: string.ascii_lowercase[i] for i in range(ts.num_nodes)}
genome_order = [n for n in ts.first().nodes(order="minlex_postorder") if ts.node(n).is_sample()]
labels.update({n: labels[i] for i, n in enumerate(genome_order)})
style1 = (
    ".node:not(.sample) > .sym, .node:not(.sample) > .lab {visibility: hidden;}"
    ".mut {font-size: 12px} .y-axis .tick .lab {font-size: 85%}")
sz = (800, 250)  # size of the plot, slightly larger than the default

# Place five evenly spaced ticks between time 0 and the time of the oldest root
max_time = ts.max_root_time
ticks = np.linspace(0, max_time, 5)
ts.draw_svg(
    size=sz, node_labels=labels, style=style1, y_label="Time ago (uncalibrated)",
    y_axis=True, y_ticks=ticks)