# TSINFER with SMARTER test data
I've created a test dataset consisting of 10 samples with genotype data for
chromosome *26* from the SMARTER database. I've also created a nextflow pipeline that prepares
the dataset and generates *phased/imputed* genotypes with *Beagle*. You can test the
pipeline using the nextflow test profile. First, collect the test input files in the `data` directory:

```bash
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/Oar_v3.1_chr26.fna.gz?download=" -O data/Oar_v3.1_chr26.fna.gz
wget https://raw.githubusercontent.com/cnr-ibba/nf-treeseq/master/tests/test_dataset.tsv -O data/test_dataset.tsv
wget https://raw.githubusercontent.com/cnr-ibba/nf-treeseq/master/tests/test_outgroup.tsv -O data/test_outgroup.tsv
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.bed?download=" -O data/test_dataset.bed
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.bim?download=" -O data/test_dataset.bim
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.fam?download=" -O data/test_dataset.fam
```

Then run the pipeline with the test profile:

```bash
nextflow run cnr-ibba/nf-treeseq -r v0.2.1 -profile test,singularity --plink_bfile data/test_dataset \
    --plink_keep data/test_dataset.tsv --genome data/Oar_v3.1_chr26.fna.gz \
    --outdir results-estsfs/test --with_estsfs --outgroup1 data/test_outgroup.tsv
```

Now try to read the results and build a *tree sequence* object with *tsinfer*:

In [None]:
import json

import tsinfer
import tsdate
import cyvcf2
from tqdm.notebook import tqdm
from tskit import MISSING_DATA

from tskitetude import get_project_dir
from tskitetude.helper import add_populations, add_diploid_individuals, get_ancestors_alleles

Define a helper function and the input/output paths, then open the VCF:

In [None]:
def get_chromosome_lengths(vcf):
    results = {}
    for seqname, seqlen in zip(vcf.seqnames, vcf.seqlens):
        results[seqname] = seqlen

    return results

vcf_location = get_project_dir() / "results-estsfs/test/focal/test_dataset.focal.26.vcf.gz"
samples_location = get_project_dir() / "results-estsfs/test/tsinfer/test_dataset.focal.26.samples"

vcf = cyvcf2.VCF(vcf_location)
chromosome_lengths = get_chromosome_lengths(vcf)
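
To see what `get_chromosome_lengths` returns without opening a VCF, the same pairing can be tried on a stand-in object. This is only a minimal sketch: `fake_vcf` and its values are invented for illustration, and only the `seqnames` and `seqlens` attribute names are taken from the cell above.

```python
from types import SimpleNamespace


def get_chromosome_lengths(vcf):
    # Pair each sequence name with its length, as in the cell above
    return dict(zip(vcf.seqnames, vcf.seqlens))


# Hypothetical stand-in for a cyvcf2.VCF object (the values are invented)
fake_vcf = SimpleNamespace(seqnames=["25", "26"], seqlens=[1000, 2000])

chromosome_lengths = get_chromosome_lengths(fake_vcf)
print(chromosome_lengths["26"])
```

The result is a plain dictionary keyed by sequence name, which is why the sequence length can later be looked up with `chromosome_lengths["26"]`.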

I've derived ancestral alleles with `est-sfs`. Try to load them from my results:

In [None]:
ancestors_alleles = get_ancestors_alleles(get_project_dir() / "results-estsfs/test/estsfs/samples-merged.26.ancestral.csv")
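
The return format of `get_ancestors_alleles` is not shown in this notebook. Since `add_diploid_sites` queries it with `.get((variant.CHROM, variant.POS), MISSING_DATA)`, it presumably behaves like a mapping keyed by `(chromosome, position)`. The sketch below illustrates that lookup pattern under this assumption; the keys and values are invented, and storing an allele index as the value is also an assumption.

```python
MISSING_DATA = -1  # same value as tskit.MISSING_DATA

# Hypothetical lookup: (chromosome, position) -> index of the ancestral allele
ancestors_alleles = {("26", 1500): 0, ("26", 2300): 1}

# A site present in the lookup returns its stored value
known = ancestors_alleles.get(("26", 2300), MISSING_DATA)

# A site absent from the lookup falls back to MISSING_DATA
unknown = ancestors_alleles.get(("26", 9999), MISSING_DATA)

print(known, unknown)
```

With this pattern, sites for which no ancestral allele was estimated are still added, but with their ancestral state marked as unknown.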

Now define a custom function that reads sites from the VCF and adds them to the sample data:

In [None]:
def add_diploid_sites(vcf, samples, ancestors_alleles):
    """
    Read the sites in the vcf and add them to the samples object.
    """
    # You may want to change the following line, e.g. here we allow
    # "*" (a spanning deletion) to be a valid allele state
    allele_chars = set("ATGCatgc*")
    pos = 0
    progressbar = tqdm(total=samples.sequence_length, desc="Read VCF", unit='bp')

    for variant in vcf:  # Loop over variants, each assumed at a unique site
        progressbar.update(variant.POS - pos)

        if pos == variant.POS:
            print(f"Duplicate entries at position {pos}, ignoring all but the first")
            continue

        pos = variant.POS

        if any(not phased for _, _, phased in variant.genotypes):
            raise ValueError(f"Unphased genotypes for variant at position {pos}")

        alleles = [variant.REF.upper()] + [v.upper() for v in variant.ALT]
        ancestral_allele = ancestors_alleles.get((variant.CHROM, variant.POS), MISSING_DATA)

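        # Skip the whole site if any allele contains an unexpected character
        invalid = [a for a in alleles if set(a) - allele_chars]
        if invalid:
            print(f"Ignoring site at pos {pos}: alleles {invalid} not in {allele_chars}")
            continue

        # Flatten the per-individual [allele1, allele2, phased] rows into a
        # single list of allele indexes (two entries per diploid individual)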
        genotypes = [g for row in variant.genotypes for g in row[0:2]]
        samples.add_site(pos, genotypes, alleles, ancestral_allele=ancestral_allele)
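
The allele check and the genotype flattening used in `add_diploid_sites` can be tried in isolation on plain Python lists. This is a minimal sketch: `has_valid_alleles` and `flatten_genotypes` are hypothetical helper names, and the genotype rows are invented values in the `[allele1, allele2, phased]` layout that the function above unpacks.

```python
allele_chars = set("ATGCatgc*")


def has_valid_alleles(alleles):
    # True only if every allele is made solely of permitted characters
    return all(set(a) <= allele_chars for a in alleles)


def flatten_genotypes(genotype_rows):
    # Each row is [allele1, allele2, phased]; keep the two allele indexes
    return [g for row in genotype_rows for g in row[0:2]]


# Invented values mimicking two phased diploid individuals
rows = [[0, 1, True], [1, 1, True]]
flat = flatten_genotypes(rows)

print(flat)
print(has_valid_alleles(["A", "G"]), has_valid_alleles(["A", "<DEL>"]))
```

The flattened list has two entries per individual, one for each haplotype, in the same order as the individuals in the VCF.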

Add populations, individuals and sites to an empty sample data file:

In [None]:
with tsinfer.SampleData(
        path=str(samples_location), sequence_length=chromosome_lengths["26"]) as samples:
    samples_tsv = get_project_dir() / "data/test_dataset.tsv"
    pop_lookup = add_populations(samples_tsv, samples)
    indv_lookup = add_diploid_individuals(samples_tsv, pop_lookup, samples)
    add_diploid_sites(vcf, samples, ancestors_alleles)

In [None]:
print(
    "Sample file created for {} samples ".format(samples.num_samples)
    + "({} individuals) ".format(samples.num_individuals)
    + "with {} variable sites.".format(samples.num_sites),
    flush=True,
)

# Do the inference
sparrow_ts = tsinfer.infer(samples)

print(
    "Inferred tree sequence `{}`: {} trees over {} Mb".format(
        "sparrow_ts", sparrow_ts.num_trees, sparrow_ts.sequence_length / 1e6
    )
)
# Check the metadata
for sample_node_id in sparrow_ts.samples():
    individual_id = sparrow_ts.node(sample_node_id).individual
    population_id = sparrow_ts.node(sample_node_id).population
    print(
        "Node",
        sample_node_id,
        "labels a chr26 sampled from individual",
        json.loads(sparrow_ts.individual(individual_id).metadata),
        "in",
        json.loads(sparrow_ts.population(population_id).metadata),
    )

Try to infer node *dates* on the tree sequence:

In [None]:
# Removes unary nodes (currently required in tsdate), keeps historical-only sites
inferred_ts = tsdate.preprocess_ts(sparrow_ts, filter_sites=False)
dated_ts = tsdate.date(inferred_ts, mutation_rate=1e-8, Ne=1e4)

dated_ts