# TSINFER with SMARTER test data
I've created a test dataset consisting of 10 samples with genotype data for
chromosome *26* from the SMARTER database, along with a nextflow pipeline that prepares
the dataset and generates *phased/imputed* genotypes with *Beagle*. You can test the
pipeline using the nextflow test profile. First, collect the test input files in the `data` directory:

```bash
# Make sure the target directory exists; URLs are quoted so the shell
# does not treat the "?" as a glob character
mkdir -p data
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/Oar_v3.1_chr26.fna.gz?download=" -O data/Oar_v3.1_chr26.fna.gz
wget "https://raw.githubusercontent.com/cnr-ibba/nf-treeseq/master/tests/test_dataset.tsv" -O data/test_dataset.tsv
wget "https://raw.githubusercontent.com/cnr-ibba/nf-treeseq/master/tests/test_outgroup.tsv" -O data/test_outgroup.tsv
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.bed?download=" -O data/test_dataset.bed
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.bim?download=" -O data/test_dataset.bim
wget "https://github.com/cnr-ibba/nf-treeseq/raw/master/tests/test_dataset.fam?download=" -O data/test_dataset.fam
```

Then run the pipeline with the test profile:

```bash
nextflow run cnr-ibba/nf-treeseq -r v0.2.1 -profile test,singularity --plink_bfile data/test_dataset \
    --plink_keep data/test_dataset.tsv --genome data/Oar_v3.1_chr26.fna.gz \
    --outdir results-estsfs/test --with_estsfs --outgroup1 data/test_outgroup.tsv
```

Now read the pipeline output and infer a tree sequence with *tsinfer*, then date it with *tsdate*:

In [1]:
import json

import tsinfer
import tsdate
import cyvcf2
from tqdm.notebook import tqdm
from tskit import MISSING_DATA

from tskitetude import get_project_dir
from tskitetude.helper import add_populations, add_diploid_individuals, get_ancestors_alleles

Define a helper function and the input/output paths:

In [2]:
def get_chromosome_lengths(vcf):
    results = {}
    for seqname, seqlen in zip(vcf.seqnames, vcf.seqlens):
        results[seqname] = seqlen

    return results

vcf_location = get_project_dir() / "results-estsfs/test/focal/test_dataset.focal.26.vcf.gz"
samples_location = get_project_dir() / "results-estsfs/test/tsinfer/test_dataset.focal.26.samples"

vcf = cyvcf2.VCF(vcf_location)
chromosome_lengths = get_chromosome_lengths(vcf)
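
The helper above pairs each contig name with its length. As a minimal sketch, the same mapping can be built with `dict(zip(...))`. The `FakeVCF` class below is a hypothetical stand-in that only mimics the `seqnames` and `seqlens` attributes used in the cell above, so the example runs without a real VCF file:

```python
from dataclasses import dataclass, field


@dataclass
class FakeVCF:
    """Hypothetical stand-in exposing the two attributes used above."""
    seqnames: list = field(default_factory=lambda: ["26"])
    seqlens: list = field(default_factory=lambda: [44077779])


def chromosome_lengths_from(vcf):
    # Same result as get_chromosome_lengths: {contig name: contig length}
    return dict(zip(vcf.seqnames, vcf.seqlens))


lengths = chromosome_lengths_from(FakeVCF())
print(lengths["26"])
```

The chromosome 26 length used here is the sequence length reported later in this notebook.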

I've derived ancestral alleles with `est-sfs`. Load them from the pipeline results:

In [3]:
ancestors_alleles = get_ancestors_alleles(get_project_dir() / "results-estsfs/test/estsfs/samples-merged.26.ancestral.csv")
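
The site-loading code in this notebook looks up ancestral alleles by `(chromosome, position)`. The sketch below shows one way such a lookup could be built from a CSV file. It is not the implementation of `get_ancestors_alleles`: the column names (`chrom`, `position`, `anc_allele`) and the example rows are assumptions made for illustration, and the real file may be laid out differently.

```python
import csv
import io

# Hypothetical CSV layout and made-up rows: the real est-sfs output parsed
# by get_ancestors_alleles may use different columns.
EXAMPLE_CSV = """chrom,position,anc_allele
26,166515,A
26,170000,G
"""


def load_ancestral_alleles(handle):
    """Return {(chrom, position): allele} from a CSV file handle."""
    lookup = {}
    for row in csv.DictReader(handle):
        key = (row["chrom"], int(row["position"]))
        lookup[key] = row["anc_allele"].upper()
    return lookup


example = load_ancestral_alleles(io.StringIO(EXAMPLE_CSV))
print(example.get(("26", 166515)))
print(example.get(("26", 1), "no ancestral call"))
```

Using a tuple key keeps the lookup unambiguous if several chromosomes end up in the same file.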

Now define a custom function that adds the VCF sites to the samples object:

In [4]:
def add_diploid_sites(vcf, samples, ancestors_alleles):
    """
    Read the sites in the vcf and add them to the samples object.
    """
    # You may want to change the following line, e.g. here we allow
    # "*" (a spanning deletion) to be a valid allele state
    allele_chars = set("ATGCatgc*")
    pos = 0
    progressbar = tqdm(total=samples.sequence_length, desc="Read VCF", unit='bp')

    for variant in vcf:  # Loop over variants, each assumed at a unique site
        progressbar.update(variant.POS - pos)

        if pos == variant.POS:
            print(f"Duplicate entries at position {pos}, ignoring all but the first")
            continue

        else:
            pos = variant.POS

        if any(not phased for _, _, phased in variant.genotypes):
            raise ValueError(f"Unphased genotypes for variant at position {pos}")

        alleles = [variant.REF.upper()] + [v.upper() for v in variant.ALT]
        ancestral_allele = ancestors_alleles.get((variant.CHROM, variant.POS), MISSING_DATA)

        # Skip the whole site if any allele contains characters outside allele_chars
        invalid = [a for a in alleles if set(a) - allele_chars]
        if invalid:
            print(f"Ignoring site at pos {pos}: alleles {invalid} not in {allele_chars}")
            continue

        # Flatten the per-sample [allele0, allele1, phased] rows into one
        # allele index per haplotype
        genotypes = [g for row in variant.genotypes for g in row[0:2]]
        samples.add_site(pos, genotypes, alleles, ancestral_allele=ancestral_allele)

    progressbar.close()
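
The two central steps of the function above are flattening the per-sample genotype rows into one value per haplotype and rejecting sites whose alleles contain unexpected characters. Here is a minimal sketch of both on plain lists, assuming the `[allele0, allele1, phased]` row layout that `cyvcf2` uses for diploid genotypes. The helper names are illustrative and not part of `tskitetude`:

```python
# Stand-in for cyvcf2's variant.genotypes: one [allele0, allele1, phased]
# row per diploid sample
genotypes_per_sample = [[0, 1, True], [1, 1, True], [0, 0, True]]


def flatten_diploid_genotypes(rows):
    """Return one allele index per haplotype, in sample order."""
    if any(not phased for _, _, phased in rows):
        raise ValueError("Unphased genotypes found")
    return [allele for row in rows for allele in row[0:2]]


def has_valid_alleles(alleles, allowed=frozenset("ATGC*")):
    """True when every allele uses only the allowed characters."""
    return all(set(allele) <= allowed for allele in alleles)


flat = flatten_diploid_genotypes(genotypes_per_sample)
print(flat)                               # [0, 1, 1, 1, 0, 0]
print(has_valid_alleles(["A", "G"]))      # True
print(has_valid_alleles(["A", "<DEL>"]))  # False
```

Three diploid samples yield six haplotype entries, which is why the 9 individuals in the test dataset end up as 18 sample nodes.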

Add individuals, populations and sites to an empty `SampleData` object:

In [5]:
with tsinfer.SampleData(
        path=str(samples_location), sequence_length=chromosome_lengths["26"]) as samples:
    samples_tsv = get_project_dir() / "data/test_dataset.tsv"
    pop_lookup = add_populations(samples_tsv, samples)
    indv_lookup = add_diploid_individuals(samples_tsv, pop_lookup, samples)
    add_diploid_sites(vcf, samples, ancestors_alleles)

In [6]:
print(
    "Sample file created for {} samples ".format(samples.num_samples)
    + "({} individuals) ".format(samples.num_individuals)
    + "with {} variable sites.".format(samples.num_sites),
    flush=True,
)

# Do the inference
sparrow_ts = tsinfer.infer(samples)

print(
    "Inferred tree sequence `{}`: {} trees over {} Mb".format(
        "sparrow_ts", sparrow_ts.num_trees, sparrow_ts.sequence_length / 1e6
    )
)
# Check the metadata
for sample_node_id in sparrow_ts.samples():
    individual_id = sparrow_ts.node(sample_node_id).individual
    population_id = sparrow_ts.node(sample_node_id).population
    print(
        "Node",
        sample_node_id,
        "labels a chr26 sampled from individual",
        json.loads(sparrow_ts.individual(individual_id).metadata),
        "in",
        json.loads(sparrow_ts.population(population_id).metadata),
    )

Sample file created for 18 samples (9 individuals) with 680 variable sites.


2025-12-09 16:41:31,858 - root - INFO - Max encoded genotype matrix size=12.0 KiB
2025-12-09 16:41:31,859 - tsinfer.inference - INFO - Starting addition of 680 sites
2025-12-09 16:41:31,891 - tsinfer.inference - INFO - Finished adding sites
2025-12-09 16:41:31,892 - tsinfer.inference - INFO - Ancestor builder peak RAM: 1.0 MiB
2025-12-09 16:41:31,901 - tsinfer.inference - INFO - Starting build for 525 ancestors
2025-12-09 16:41:31,920 - tsinfer.inference - INFO - Finished building ancestors
2025-12-09 16:41:32,052 - tsinfer.inference - INFO - Mismatch prevented by setting constant high recombination and low mismatch probabilities
2025-12-09 16:41:32,054 - tsinfer.inference - INFO - Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
2025-12-09 16:41:32,055 - tsinfer.inference - INFO - Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
2025-12-09 16:41:32,055 - tsinfer.inference - INFO - Matching usin

Inferred tree sequence `sparrow_ts`: 506 trees over 44.077779 Mb
Node 0 labels a chr26 sampled from individual {'sample_id': 'UYOA-TEX-000000001'} in {'breed': 'TEX'}
Node 1 labels a chr26 sampled from individual {'sample_id': 'UYOA-TEX-000000001'} in {'breed': 'TEX'}
Node 2 labels a chr26 sampled from individual {'sample_id': 'GROA-FRZ-000000170'} in {'breed': 'FRZ'}
Node 3 labels a chr26 sampled from individual {'sample_id': 'GROA-FRZ-000000170'} in {'breed': 'FRZ'}
Node 4 labels a chr26 sampled from individual {'sample_id': 'UYOA-MER-000000224'} in {'breed': 'MER'}
Node 5 labels a chr26 sampled from individual {'sample_id': 'UYOA-MER-000000224'} in {'breed': 'MER'}
Node 6 labels a chr26 sampled from individual {'sample_id': 'UYOA-CRR-000000320'} in {'breed': 'CRR'}
Node 7 labels a chr26 sampled from individual {'sample_id': 'UYOA-CRR-000000320'} in {'breed': 'CRR'}
Node 8 labels a chr26 sampled from individual {'sample_id': 'UYOA-CRL-000000380'} in {'breed': 'CRL'}
Node 9 labels a c
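
In the output above, each diploid individual contributes two consecutive sample nodes, and the metadata is decoded from JSON. The sketch below reproduces that node-to-individual relationship with plain Python. It assumes nodes were added individual by individual, so that node `n` belongs to individual `n // 2`. The byte strings stand in for the stored metadata and reuse two sample IDs from the output:

```python
import json

# Stand-ins for individual metadata stored as JSON-encoded bytes
individual_metadata = [
    b'{"sample_id": "UYOA-TEX-000000001"}',
    b'{"sample_id": "GROA-FRZ-000000170"}',
]


def describe_sample_nodes(metadata_records, ploidy=2):
    """Map each sample node id to the sample_id of its individual.

    Assumes node n belongs to individual n // ploidy.
    """
    labels = {}
    for individual_id, raw in enumerate(metadata_records):
        decoded = json.loads(raw)
        for offset in range(ploidy):
            labels[individual_id * ploidy + offset] = decoded["sample_id"]
    return labels


labels = describe_sample_nodes(individual_metadata)
for node_id, sample_id in labels.items():
    print(node_id, sample_id)
```

In a real tree sequence the mapping should be read from `node.individual`, as the cell above does, rather than derived from node order.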

Now estimate node *dates* for the inferred tree sequence:

In [8]:
# Removes unary nodes (currently required in tsdate), keeps historical-only sites
inferred_ts = tsdate.preprocess_ts(sparrow_ts.simplify(), filter_sites=False)
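# `population_size` replaces the deprecated `Ne` argument (see the provenance record below)
dated_ts = tsdate.date(inferred_ts, method="inside_outside", mutation_rate=1e-8, population_size=1e4)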

dated_ts

2025-12-09 16:43:43,054 - tsdate.util - INFO - Beginning preprocessing
2025-12-09 16:43:43,059 - tsdate.util - INFO - Minimum_gap: None and erase_flanks: None
2025-12-09 16:43:43,061 - tsdate.util - INFO - REMOVING TELOMERE: Snip topology from 0 to first site at 166515.0.
2025-12-09 16:43:43,078 - tsdate.util - INFO - REMOVING TELOMERE: Snip topology from 44004282.0 to end of sequence at 44077779.0.
2025-12-09 16:43:44,761 - tsdate.core - INFO - Inserted node and mutation metadata in 0.03436160087585449 seconds
2025-12-09 16:43:44,764 - root - INFO - Modified ages of 38 nodes to satisfy constraints
2025-12-09 16:43:44,765 - tsdate.core - INFO - Constrained node ages in 0.00 seconds
2025-12-09 16:43:44,768 - root - INFO - Set ages of 0 nonsegregating mutations to root times.
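
The `REMOVING TELOMERE` lines in the log report that the topology was removed from the two flanking regions of the chromosome. As a rough worked example, the snippet below adds up how many base pairs those two intervals cover, using only the coordinates printed in the log. The helper name is illustrative:

```python
# Values copied from the log above
sequence_length = 44_077_779
first_site = 166_515
right_flank_start = 44_004_282

# Intervals reported as snipped, as half-open (start, end) pairs
deleted_intervals = [(0, first_site), (right_flank_start, sequence_length)]


def total_span(intervals):
    """Sum of interval lengths in base pairs."""
    return sum(end - start for start, end in intervals)


removed = total_span(deleted_intervals)
retained = sequence_length - removed
print(
    f"Removed {removed} bp, retained {retained} bp "
    f"({retained / sequence_length:.2%} of the chromosome)"
)
```

Only a small fraction of the chromosome is affected, so the dated tree sequence still spans nearly all of the original coordinates.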


| Tree Sequence | |
|---|---|
| Trees | 505 |
| Sequence Length | 44 077 779 |
| Time Units | generations |
| Sample Nodes | 18 |
| Total Size | 341.8 KiB |
| Metadata | dict |

| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 4 208 | 131.5 KiB | |
| Individuals | 9 | 591 Bytes | ✅ |
| Migrations | 0 | 8 Bytes | |
| Mutations | 968 | 35.0 KiB | |
| Nodes | 1 037 | 105.2 KiB | ✅ |
| Populations | 9 | 224 Bytes | ✅ |
| Provenances | 4 | 2.8 KiB | |
| Sites | 660 | 33.5 KiB | ✅ |

| Provenance Timestamp | Software Name | Version | Command | Parameters (from the full record) |
|---|---|---|---|---|
| 09 December, 2025 at 04:43:44 PM | tsdate | 0.2.4 | inside_outside | mutation_rate: 1e-08, recombination_rate: None, time_units: None, progress: None, population_size: 10000.0, eps: 1e-10, outside_standardize: True, ignore_oldest_root: False, probability_space: logarithmic, num_threads: None, cache_inside: False |
| 09 December, 2025 at 04:43:43 PM | tsdate | 0.2.4 | preprocess_ts | minimum_gap: 1000000, erase_flanks: True, split_disjoint: True, filter_populations: False, filter_individuals: False, filter_sites: False, delete_intervals: [0, 166515.0], [44004282.0, 44077779.0] |
| 09 December, 2025 at 04:43:43 PM | tskit | 1.0.0b3 | simplify | TODO: add simplify parameters |
| 09 December, 2025 at 04:41:33 PM | tsinfer | 0.5.0 | infer | mismatch_ratio: None, path_compression: True, precision: None, post_process: None |

Environment recorded in every provenance entry: Linux 5.15.0-130-generic (x86_64), CPython 3.12.12, tskit 1.0.0b3. The tsinfer entry also records zarr 2.18.7, numcodecs 0.15.0 and lmdb 1.7.5; the tskit entry records kastore 2.1.1.
