# MicroHapulator: CLI Demo

**MicroHapulator** is software for forensic analysis of microhaplotype sequence data.
Features include the following:

- simulating simple (single-contributor) and complex (multi-contributor) DNA samples
- simulating MPS sequencing of user-specified microhap panels
- genotyping of DNA profiles from simple and complex DNA samples
- tools for deterministic and probabilistic interpretation of simple and complex samples

MicroHapulator relies on microhap marker definitions and allele frequencies from [MicroHapDB](https://github.com/bioforensics/MicroHapDB) and MPS error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

This notebook provides an interactive demonstration of MicroHapulator's command line interface.
Any line ending with `\` indicates that the command is continued on the next line.

In [1]:
head -n 10 beta-panel.txt

mh01KK-205
mh01CP-016
mh01KK-117
mh02KK-138
mh02KK-136
mh03CP-005
mh03KK-007
mh03KK-150
mh04KK-030
mh04KK-013


In [2]:
microhapdb marker --panel beta-panel.txt --format fasta > beta-panel.fasta

In [3]:
head -n 10 beta-panel.fasta

>mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 variants=48,69,157,202 Xref=SI664550A
ATGGGGTAATTTGGGGTCCAGAGCACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCC
TCATTACATCCCTGCCAAGGACAAATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTT
AGTAAAGGAAGAGGTTACCTGAGACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCGAATTGTTAAGTC
AGGTCCAGAGT
>mh01CP-016 PermID=MHDBM-021e569a GRCh38:chr1:55559012-55559056 variants=103,141,147 Xref=SI664876L
TGAGAGAGCCCAGTGACCTAAGCAGCTCCAACCCTGAGACTGGATCTAATGATGATCCAGATAATCCAGTGCCCAGCTTA
GAGCCTGGCACACAACAAGTGCTTATAATGAAAGCATTAGTGAGTAAAAGAGTGATCCCTGGCTTTGAACTCCCTCTAAG
TGTACCCCCAGGCATCTGTTCTTCCCTCAGTCACAATGCTGACCCCACTTCATGACTGGTCTCCTCTCCTTTGATTGTGC
ACACAAGGGCC


In [4]:
mhpl8r sim --seed 24680 --out sim-profile-1.json Iberian Iberian beta-panel.txt

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to sim-profile-1.json


In [5]:
head -n 25 sim-profile-1.json

{
    "markers": {
        "mh01CP-016": {
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": 0
                },
                {
                    "allele": "T,G,G",
                    "haplotype": 1
                }
            ]
        },
        "mh01KK-117": {
            "genotype": [
                {
                    "allele": "A,A,C,T",
                    "haplotype": 0
                },
                {
                    "allele": "A,G,C,T",
                    "haplotype": 1
                }
            ]


In [6]:
mhpl8r seq --threads 2 --num-reads 10000 --seed 666410524 --out sim-reads-1a.fastq.gz sim-profile-1.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::seq] Individual seed=666410524 numreads=10000


In [7]:
gunzip -c sim-reads-1a.fastq.gz | wc -l 

40000
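
The line count above can be turned into a read count, since an unwrapped FASTQ record occupies four lines (40000 / 4 = 10000, matching `--num-reads`). The following is a minimal sketch on a toy two-record file built on the fly; the file name and reads are made up for illustration, and it assumes sequences are not wrapped across lines.

```shell
# Build a tiny two-record FASTQ in a temporary directory (toy data, not real reads).
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/toy.fastq.gz"

# Each FASTQ record spans four lines, so reads = lines / 4.
lines=$(gunzip -c "$tmpdir/toy.fastq.gz" | wc -l)
reads=$(( lines / 4 ))
echo "$reads reads"

rm -r "$tmpdir"
```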


In [8]:
bwa index beta-panel.fasta
samtools faidx beta-panel.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index beta-panel.fasta
[main] Real time: 0.040 sec; CPU: 0.015 sec


In [9]:
bwa mem beta-panel.fasta sim-reads-1a.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-1a.bam -
samtools index sim-reads-1a.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10000 sequences (3010000 bp)...
[M::mem_process_seqs] Processed 10000 reads in 1.334 CPU sec, 1.335 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-1a.fastq.gz
[main] Real time: 1.468 sec; CPU: 1.380 sec


In [10]:
mhpl8r type --out obs-profile-1a.json beta-panel.fasta sim-reads-1a.bam

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::type] discarded 1624 reads with gaps or missing data at positions of interest


In [11]:
head -n 25 obs-profile-1a.json

{
    "markers": {
        "mh01CP-016": {
            "allele_counts": {
                "T,G,G": 186
            },
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": null
                }
            ],
            "max_coverage": 200,
            "mean_coverage": 199.0,
            "min_coverage": 158,
            "num_discarded_reads": 12
        },
        "mh01KK-117": {
            "allele_counts": {
                "A,A,C,G": 1,
                "A,A,C,T": 77,
                "A,G,C,G": 1,
                "A,G,C,T": 77
            },
            "genotype": [
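
The `allele_counts` for `mh01KK-117` show two well-supported alleles (77 reads each) and two alleles seen in a single read, which are most plausibly sequencing errors. One simple way to separate the two groups is a fixed read-count threshold, sketched below. The threshold of 5 is a hypothetical value chosen for illustration; the filtering rule that `mhpl8r type` actually applies is not shown in this notebook and may differ.

```shell
# Allele counts for mh01KK-117, copied from the typing output above.
counts='A,A,C,G 1
A,A,C,T 77
A,G,C,G 1
A,G,C,T 77'

# Hypothetical static threshold: keep alleles supported by at least 5 reads.
threshold=5
called=$(printf '%s\n' "$counts" | awk -v t="$threshold" '$2 >= t { print $1 }')
echo "$called"
```

With this threshold the surviving alleles are `A,A,C,T` and `A,G,C,T`, the same pair simulated for this marker in `sim-profile-1.json`.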


In [13]:
mhpl8r seq --threads 2 --num-reads 2000 --seed 3374532379 --out sim-reads-1b.fastq.gz sim-profile-1.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::seq] Individual seed=3374532379 numreads=2000


In [18]:
bwa mem beta-panel.fasta sim-reads-1b.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-1b.bam -
samtools index sim-reads-1b.bam
mhpl8r type --out obs-profile-1b.json beta-panel.fasta sim-reads-1b.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (602000 bp)...
[M::mem_process_seqs] Processed 2000 reads in 0.249 CPU sec, 0.249 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-1b.fastq.gz
[main] Real time: 0.273 sec; CPU: 0.260 sec
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::type] discarded 268 reads with gaps or missing data at positions of interest


In [26]:
mhpl8r dist obs-profile-1a.json obs-profile-1b.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "hamming_distance": 0
}
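
A distance of 0 means the two observed profiles agree at every marker. As a rough illustration, the sketch below assumes the distance is the number of markers whose allele sets differ between the two profiles; that definition is an assumption here, not something this output confirms. The two-column tables are a made-up stand-in for the JSON profiles, and profile B is invented to force one mismatch.

```shell
# Toy genotype tables: marker name, then a sorted, slash-joined allele set.
# Profile A reuses genotypes seen earlier in this notebook; profile B is invented.
profile_a='mh01CP-016 T,G,G
mh01KK-117 A,A,C,T/A,G,C,T'
profile_b='mh01CP-016 T,G,G
mh01KK-117 A,A,C,T'

distance=$(printf '%s\n%s\n' "$profile_a" "$profile_b" | awk '
    seen[$1] && geno[$1] != $2 { d++ }   # second sighting with a different genotype
    { seen[$1] = 1; geno[$1] = $2 }
    END { print d + 0 }
')
echo "hamming_distance: $distance"
```

Markers present in only one table are ignored by this sketch; a real comparison would need to decide how to score them.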

In [19]:
mhpl8r prob Iberian obs-profile-1a.json obs-profile-1b.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "rmp_likelihood_ratio": "2.604E+59"
}
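
For intuition about the magnitude of this number, the textbook random match probability (RMP) under Hardy-Weinberg equilibrium multiplies per-marker genotype frequencies: p² for a homozygote and 2pq for a heterozygote. When two profiles match, the likelihood ratio is commonly taken as 1 / RMP. The sketch below uses two hypothetical markers with invented allele frequencies (not MicroHapDB values); how `mhpl8r prob` treats mismatching profiles is not covered here.

```shell
# Hypothetical two-marker profile: marker, zygosity, allele frequency p, allele frequency q.
genotypes='markerA hom 0.2 0.2
markerB het 0.5 0.1'

rmp=$(printf '%s\n' "$genotypes" | awk '
    BEGIN { rmp = 1 }
    $2 == "hom" { rmp *= $3 * $3; next }   # homozygote: p^2
                { rmp *= 2 * $3 * $4 }     # heterozygote: 2pq
    END { printf "%.4f", rmp }
')
lr=$(awk -v r="$rmp" 'BEGIN { printf "%.0f", 1 / r }')
echo "RMP = $rmp, LR = $lr"
```

With 50 markers instead of two, the product shrinks quickly, which is why likelihood ratios for matching profiles reach very large exponents.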

In [20]:
mhpl8r sim --seed 13579 --out sim-profile-2.json Iberian Iberian beta-panel.txt
mhpl8r seq --threads 2 --num-reads 2000 --seed 3963949764 --out sim-reads-2.fastq.gz sim-profile-2.json
bwa mem beta-panel.fasta sim-reads-2.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-2.bam -
samtools index sim-reads-2.bam
mhpl8r type --out obs-profile-2.json beta-panel.fasta sim-reads-2.bam

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to sim-profile-2.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::seq] Individual seed=3963949764 numreads=2000
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (602000 bp)...
[M::mem_process_seqs] Processed 2000 reads in 0.244 CPU sec, 0.244 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-2.fastq.gz
[main] Real time: 0.264 sec; CPU: 0.256 sec
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::type] discarded 314 reads with gaps or missing data at positions of interest


In [27]:
mhpl8r dist obs-profile-1b.json obs-profile-2.json
mhpl8r prob Iberian obs-profile-1b.json obs-profile-2.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "hamming_distance": 41
}
[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "rmp_likelihood_ratio": "2.604E-115"
}

In [30]:
mhpl8r sim --seed 1234 --out mix-profile-1.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5678 --out mix-profile-2.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 1029 --out mix-profile-3.json SA004250L SA004250L beta-panel.txt

mhpl8r seq --proportions 0.7 0.2 0.1 --num-reads 20000 --threads 2 --out mix-reads.fastq.gz \
    mix-profile-1.json mix-profile-2.json mix-profile-3.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-1.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-2.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-3.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::seq] Individual seed=2442208038 numreads=14000
[MicroHapulator::seq] Individual seed=3693503682 numreads=4000
[MicroHapulator::seq] Individual seed=1629003916 numreads=2000
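
The per-contributor read counts in the log follow directly from `--proportions` and `--num-reads`: 0.7, 0.2, and 0.1 of 20000 reads. The arithmetic can be checked with a short sketch. The round-to-nearest step is an assumption made here to guard against floating-point error; how MicroHapulator handles proportions that do not divide evenly is not shown in this notebook.

```shell
total=20000
counts=$(awk -v total="$total" 'BEGIN {
    n = split("0.7 0.2 0.1", p, " ")
    for (i = 1; i <= n; i++) {
        # Add 0.5 before truncating so floating-point error cannot drop a read.
        printf "%s%d", (i > 1 ? " " : ""), int(p[i] * total + 0.5)
    }
}')
echo "$counts"
```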


In [31]:
bwa mem beta-panel.fasta mix-reads.fastq.gz | samtools view -bS - | samtools sort -o mix-reads.bam -
samtools index mix-reads.bam
mhpl8r type --out obs-profile-mix.json beta-panel.fasta mix-reads.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 20000 sequences (6020000 bp)...
[M::mem_process_seqs] Processed 20000 reads in 2.424 CPU sec, 2.425 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta mix-reads.fastq.gz
[main] Real time: 2.650 sec; CPU: 2.508 sec
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::type] discarded 1577 reads with gaps or missing data at positions of interest


In [33]:
mhpl8r contrib --json obs-profile-mix.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 1,
    "perc_loci_max_alleles": 0.02
}
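
The estimate above is consistent with a standard lower bound: each diploid contributor adds at most two alleles per marker, so a marker showing n distinct alleles needs at least ceil(n / 2) contributors. The sketch below assumes this is the rule behind `min_num_contrib`; the function name is made up for illustration.

```shell
# Lower bound on contributors: N diploid contributors show at most 2N alleles per marker.
min_contrib() {
    max_alleles=$1
    echo $(( (max_alleles + 1) / 2 ))   # integer form of ceil(max_alleles / 2)
}

min_contrib 5    # prints 3
min_contrib 6    # prints 3
min_contrib 2    # prints 1

# One marker out of the 50 in the panel reached the maximum: 1 / 50 = 0.02.
awk 'BEGIN { printf "%.2f\n", 1 / 50 }'
```

Under that assumption, a result of 3 means the most allele-rich marker showed 5 or 6 distinct alleles, and `perc_loci_max_alleles` is a fraction (1 of 50 markers) rather than a percentage.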

In [34]:
mhpl8r contain obs-profile-mix.json mix-profile-1.json
mhpl8r contain obs-profile-mix.json mix-profile-2.json
mhpl8r contain obs-profile-mix.json mix-profile-3.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "containment": 1.0,
    "contained_alleles": 78,
    "total_alleles": 78
}
[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "containment": 0.9398,
    "contained_alleles": 78,
    "total_alleles": 83
}
[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "containment": 0.9605,
    "contained_alleles": 73,
    "total_alleles": 76
}
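
Each `containment` value above is the ratio of `contained_alleles` to `total_alleles`, rounded to four decimal places. The sketch below reproduces the three ratios from the reported counts; the helper function is hypothetical and only redoes the division, it does not inspect the profiles themselves.

```shell
containment() {
    # $1 = alleles of the simple profile found in the mixture, $2 = total alleles
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.4f\n", c / t }'
}

containment 78 78    # major contributor (70% of reads)
containment 78 83    # 20% contributor
containment 73 76    # 10% contributor
```

The lower values for the minor contributors suggest that some of their alleles were not recovered from the mixture at this read depth.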

In [36]:
mhpl8r sim --seed 3847 --out mix-profile-4.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5656 --out mix-profile-5.json SA004250L SA004250L beta-panel.txt
mhpl8r contain obs-profile-mix.json mix-profile-4.json
mhpl8r contain obs-profile-mix.json mix-profile-5.json

[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-4.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-5.json
[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "containment": 0.7024,
    "contained_alleles": 59,
    "total_alleles": 84
}
[MicroHapulator] running version 0.3+25.gf76b495.dirty
{
    "containment": 0.7326,
    "contained_alleles": 63,
    "total_alleles": 86
}