# MicroHapulator Demo

**MicroHapulator** is a package for simulating and analyzing microhaplotype sequence data for forensic analysis.
In addition to simulating targeted sequencing of selected microhap loci, it can be used for calling genotypes and performing sample matching and mixture analysis.
MicroHapulator relies on microhap annotations and population allele frequencies from [MicroHapDB](https://github.com/bioforensics/microhapdb) and the Illumina error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

This notebook provides an interactive demonstration of MicroHapulator's command-line interface and Python API.
The demo includes both shell commands and Python code.
All shell commands begin with a `!` character.
Any line ending with `\` indicates that the command is continued on the next line.

> **NOTE**: *Unless otherwise specified, locus and population IDs used throughout the notebook are derived from MicroHapDB data.*

In [1]:
!microhapdb --table population | head
!microhapdb --table locus | head

             ID                Name  Source
0   MHDBP000001              Adygei  ALFRED
1   MHDBP000002              Africa    LOVD
2   MHDBP000003   African Americans  ALFRED
3   MHDBP000004   African Americans  ALFRED
4   MHDBP000005     Afro-Caribbeans  ALFRED
5   MHDBP000006                 Ami  ALFRED
6   MHDBP000007                Asia    LOVD
7   MHDBP000008              Atayal  ALFRED
8   MHDBP000009             Bengali  ALFRED
              ID Reference  Chrom      Start        End   AvgAe  Source
0    MHDBL000001    GRCh38   chr1    1551453    1551679  2.6731  ALFRED
1    MHDBL000002    GRCh38   chr1    3826567    3826827  3.1300  ALFRED
2    MHDBL000003    GRCh38   chr1    4167403    4167574  2.5802  ALFRED
3    MHDBL000004    GRCh38   chr1   11794399   11794419  1.8063  ALFRED
4    MHDBL000005    GRCh38   chr1   12788891   12788908  2.5194  ALFRED
5    MHDBL000006    GRCh38   chr1   14503432   14503449  2.4662  ALFRED
6    MHDBL000007    GRCh38   chr1   18396197   18396352 

## Preliminaries

Before using MicroHapulator for the first time, the human reference genome must be downloaded to a dedicated package directory.
Use the following command to execute the download.

In [2]:
!mhpl8r getrefr

[MicroHapulator] running version 0.2
[MicroHapulator::getrefr] Downloading GRCh38 reference
[MicroHapulator::getrefr] Decompressing reference
[MicroHapulator::getrefr] Indexing reference


## Simulate and analyze a simple sample

First, let's simulate a simple sample with a single contributor.
Providing a seed with the `--hap-seed` flag ensures that we simulate the exact same "random" genotype each time the command is run.
The population identifier `MHDBP000037` indicates that we want to use population allele frequency data to simulate an Iberian individual.
We specify the desired panel with the `--panel` flag.
In this case we'll use a preset panel of 50 microhap loci (nicknamed "beta") with reasonably optimal discriminating power.
However, MicroHapulator supports simulation of any arbitrary panel by providing a list of locus identifiers.

In [3]:
!mhpl8r sim --panel beta --hap-seed 24680 --seq-threads 2 --num-reads 10000 \
    --genotype sim.gt.bed --out sim.fastq.gz MHDBP000037

[MicroHapulator] running version 0.2
[MicroHapulator::sim] simulated microhaplotype variation at 50 loci
INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 2 cpus for read generation
INFO:iss.app:Generating 20000 reads
INFO:iss.app:Generating reads for record: MHDBL000007:hap1
INFO:iss.app:Generating reads for record: MHDBL000007:hap2
INFO:iss.app:Generating reads for record: MHDBL000009:hap1
INFO:iss.app:Generating reads for record: MHDBL000009:hap2
INFO:iss.app:Generating reads for record: MHDBL000013:hap1
INFO:iss.app:Generating reads for record: MHDBL000013:hap2
INFO:iss.app:Generating reads for record: MHDBL000019:hap1
INFO:iss.app:Generating reads for record: MHDBL000019:hap2
INFO:iss.app:Generating reads for record: MHDBL000020:hap1
INFO:iss.app:Generating reads for record: MHDBL000020:hap2
INFO:iss.app:Generating reads for record: MHDBL000027:hap

Let's confirm that MicroHapulator simulated 10,000 reads as requested.
We'll do this by counting the number of lines in the reads file.

In [4]:
!gunzip -c sim.fastq.gz | wc -l

40000


Looks good: 40,000 lines at 4 lines per record = 10,000 reads!
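
The same check can be done in Python. The following is a minimal sketch that uses only the standard library; the helper name `count_fastq_records` is hypothetical and not part of MicroHapulator. It assumes a well-formed FASTQ file with exactly 4 lines per record.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzip-compressed FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as fh:
        num_lines = sum(1 for _ in fh)
    if num_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated file
        raise ValueError(f"{num_lines} lines is not a multiple of 4")
    return num_lines // 4
```

Calling `count_fastq_records('sim.fastq.gz')` on the file simulated above should agree with the `wc -l` result divided by 4.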

In addition to the simulated reads, MicroHapulator can also output the fully phased simulated haplotypes using the `--genotype` flag.
We can use this later to see if it's possible to infer the correct genotypes directly from the reads.
Let's take a look at the first several lines of this file.

In [5]:
!head -n 15 sim.gt.bed

MHDBL000007	98	99	T|T
MHDBL000007	119	120	C|T
MHDBL000007	207	208	A|A
MHDBL000007	252	253	G|G
MHDBL000009	153	154	T|T
MHDBL000009	191	192	G|G
MHDBL000009	197	198	G|G
MHDBL000013	82	83	A|A
MHDBL000013	139	140	G|A
MHDBL000013	242	243	C|C
MHDBL000013	268	269	C|T
MHDBL000019	162	163	C|C
MHDBL000019	168	169	C|T
MHDBL000019	187	188	C|C
MHDBL000020	80	81	G|G


This BED file indicates the position and genotype of each microhaplotype.
For example, microhap MHDBL000007 has a heterozygous genotype of `T,C,A,G` and `T,T,A,G` while microhap MHDBL000009 has a homozygous genotype of `T,G,G`.
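
To make the relationship between the BED records and the haplotypes explicit, here is a minimal sketch that rebuilds the phased haplotypes from the first few lines shown above. The function name `phased_haplotypes` is hypothetical, and the sketch assumes that records for each locus appear in positional order, as they do in this file.

```python
from collections import defaultdict

# First seven records of sim.gt.bed, copied from the output above (tab-separated)
BED_TEXT = """\
MHDBL000007\t98\t99\tT|T
MHDBL000007\t119\t120\tC|T
MHDBL000007\t207\t208\tA|A
MHDBL000007\t252\t253\tG|G
MHDBL000009\t153\t154\tT|T
MHDBL000009\t191\t192\tG|G
MHDBL000009\t197\t198\tG|G
"""


def phased_haplotypes(bed_text):
    """Group per-variant phased calls by locus and join them into haplotypes."""
    calls_by_locus = defaultdict(list)
    for line in bed_text.strip().splitlines():
        locus, _start, _end, genotype = line.split("\t")
        calls_by_locus[locus].append(genotype.split("|"))
    haplotypes = {}
    for locus, calls in calls_by_locus.items():
        # zip(*calls) transposes per-variant calls into per-haplotype alleles
        haplotypes[locus] = tuple(",".join(alleles) for alleles in zip(*calls))
    return haplotypes


haps = phased_haplotypes(BED_TEXT)
print(haps["MHDBL000007"])  # ('T,C,A,G', 'T,T,A,G')
print(haps["MHDBL000009"])  # ('T,G,G', 'T,G,G')
```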

Now, to compute genotypes we need to first align the reads back to the loci of interest.
First let's retrieve the locus sequences for our "beta" panel.

In [6]:
!mhpl8r refr --out beta-panel.fasta beta

[MicroHapulator] running version 0.2


This creates a Fasta file with a single record for each microhap locus, including its absolute coordinates in the human reference genome and the offsets of the variants that define the microhap.

In [7]:
!head -n 6 beta-panel.fasta

>MHDBL000007 GRCh38:chr1:18396099-18396450 variants=98:119:207:252
GCTGAGGGAAGTCTGGGCTCTGATGCAGAGAGACCTAGAAGAAAGCACTAATGGGGTAATTTGGGGTCCAGAGCACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCCTCATTACATCCCTGCCAAGGACAAATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTTAGTAAAGGAAGAGGTTACCTGAGACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCGAATTGTTAAGTCAGGTCCAGAGTGATCCAGTCACTTATCATGAGTCATACAGTAACCAGAGGTTGAGTTGACT
>MHDBL000009 GRCh38:chr1:55558859-55559210 variants=153:191:197
CCAGAAGCCTAGGCCTCTGGGAATAGCATTATGTCCTAGGCGTAAATGGATGAGAGAGCCCAGTGACCTAAGCAGCTCCAACCCTGAGACTGGATCTAATGATGATCCAGATAATCCAGTGCCCAGCTTAGAGCCTGGCACACAACAAGTGCTTATAATGAAAGCATTAGTGAGTAAAAGAGTGATCCCTGGCTTTGAACTCCCTCTAAGTGTACCCCCAGGCATCTGTTCTTCCCTCAGTCACAATGCTGACCCCACTTCATGACTGGTCTCCTCTCCTTTGATTGTGCACACAAGGGCCAGTCTTGTGTCTTATTTTAGTATCTTTAGCACCTAGAATAGTATCTGGCA
>MHDBL000013 GRCh38:chr1:204664129-204664480 variants=82:139:242:268
GAGGTCATTGCTGCCCCTGCCTCAGTTAAAAAATTAGAAATCCTCCCCACCCAGCTCTGTTTGTCTCCCCACAAAGCATTGCAGAAGAAAA
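
The deflines are easy to parse if the coordinates or variant offsets are needed downstream. The sketch below is an illustration only: `parse_defline` is a hypothetical helper, the regular expression is inferred from the three deflines shown above, and the sketch assumes that each variant offset is relative to the start coordinate in the defline. For MHDBL000007, the first offset added to the start coordinate reproduces the `Start` value listed in the locus table at the top of the notebook.

```python
import re

DEFLINE = ">MHDBL000007 GRCh38:chr1:18396099-18396450 variants=98:119:207:252"


def parse_defline(defline):
    """Split a panel Fasta defline into locus ID, coordinates, and offsets."""
    pattern = r">(\S+) (\w+):(\w+):(\d+)-(\d+) variants=([\d:]+)"
    match = re.fullmatch(pattern, defline.strip())
    if match is None:
        raise ValueError(f"unexpected defline format: {defline!r}")
    locus, refr, chrom, start, end, variants = match.groups()
    offsets = [int(v) for v in variants.split(":")]
    return {
        "locus": locus,
        "reference": refr,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "offsets": offsets,
        # Assumption: offsets are relative to the defline's start coordinate
        "positions": [int(start) + offset for offset in offsets],
    }


record = parse_defline(DEFLINE)
print(record["positions"][0])  # 18396197
```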

Subsequent steps require that we index the panel Fasta file for read mapping and for random access, so we'll take care of that now.

In [8]:
!bwa index beta-panel.fasta
!faidx beta-panel.fasta > /dev/null

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index beta-panel.fasta
[main] Real time: 0.020 sec; CPU: 0.025 sec


Now let's map the reads to the reference loci.

In [9]:
!bwa mem beta-panel.fasta sim.fastq.gz \
    | samtools view -bS \
    | samtools sort -o sim.bam -
!samtools index sim.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10000 sequences (3010000 bp)...
[M::mem_process_seqs] Processed 10000 reads in 1.437 CPU sec, 1.440 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim.fastq.gz
[main] Real time: 1.558 sec; CPU: 1.479 sec


With a sorted and indexed BAM file, we're ready to call genotypes.
This is done with the `mhpl8r type` command.

In [10]:
!mhpl8r type --out sim.genotype.json beta-panel.fasta sim.bam

[MicroHapulator] running version 0.2
[MicroHapulator::type] discarded 1535 reads with gaps or missing data at positions of interest


This command computes various bits of genotype, mapping, and coverage information and outputs it in JSON format.
Let's take a quick look at the top of the output file.

In [11]:
!head -n 20 sim.genotype.json

{
    "MHDBL000007": {
        "allele_counts": {
            "T,C,A,G": 87,
            "T,G,A,G": 1,
            "T,T,A,A": 1,
            "T,T,A,G": 86
        },
        "genotype": [
            "T,C,A,G",
            "T,T,A,G"
        ],
        "max_coverage": 200,
        "mean_coverage": 171.9,
        "min_coverage": 3,
        "num_discarded_reads": 25
    },
    "MHDBL000009": {
        "allele_counts": {
            "T,A,G": 1,


As shown here, the output includes a nested data structure with locus IDs as keys for each record of genotype data.
If we wanted to analyze this data programmatically, Python's standard JSON library provides a convenient way to load and process the data.

In [12]:
import json
with open('sim.genotype.json', 'r') as fh:
    genotype = json.load(fh)
print('Observed genotype for locus "MHDBL000007"', genotype['MHDBL000007']['genotype'], end='\n\n')
print('All loci', *genotype.keys(), end='\n\n')
print('Per locus mean coverage:', [genotype[l]['mean_coverage'] for l in genotype.keys()])

Observed genotype for locus "MHDBL000007" ['T,C,A,G', 'T,T,A,G']

All loci MHDBL000007 MHDBL000009 MHDBL000013 MHDBL000019 MHDBL000020 MHDBL000027 MHDBL000030 MHDBL000036 MHDBL000038 MHDBL000048 MHDBL000051 MHDBL000055 MHDBL000056 MHDBL000063 MHDBL000066 MHDBL000068 MHDBL000072 MHDBL000080 MHDBL000085 MHDBL000092 MHDBL000095 MHDBL000103 MHDBL000105 MHDBL000108 MHDBL000111 MHDBL000113 MHDBL000115 MHDBL000129 MHDBL000131 MHDBL000132 MHDBL000135 MHDBL000142 MHDBL000145 MHDBL000147 MHDBL000148 MHDBL000153 MHDBL000156 MHDBL000163 MHDBL000167 MHDBL000185 MHDBL000187 MHDBL000195 MHDBL000197 MHDBL000199 MHDBL000200 MHDBL000204 MHDBL000205 MHDBL000207 MHDBL000211 MHDBL000212

Per locus mean coverage: [171.9, 171.8, 171.7, 172.2, 172.2, 171.8, 172.3, 172.4, 172.3, 172.3, 172.2, 172.4, 172.3, 172.3, 172.3, 171.8, 172.2, 172.2, 171.9, 171.8, 171.6, 172.3, 172.7, 171.8, 172.3, 171.7, 171.9, 171.8, 171.8, 172.4, 172.0, 171.7, 171.7, 172.3, 172.4, 171.9, 171.9, 171.4, 171.8, 172.3, 171.7, 172.6, 172.

By comparing to the BED file we peeked at earlier, we can visually confirm that MicroHapulator inferred the correct genotype for locus `MHDBL000007`.
If we want to check for agreement across all loci, we can do so with a few lines of Python code.

In [13]:
import microhapulator
with open('sim.gt.bed', 'r') as fh:
    simgt = microhapulator.genotype.SimulatedGenotype(frombed=fh)
obsgt = microhapulator.genotype.ObservedGenotype('sim.genotype.json')
obsgt == simgt

True

Awesome! `True` indicates perfect agreement.

## Simulate a mixture (multiple contributor) sample

MicroHapulator can also simulate and analyze mixtures.
Using the `mhpl8r mixture` command, we can specify each individual in the mixture with a dedicated `--indiv` flag followed by one or two population identifiers.
Earlier we used MicroHapDB IDs, but here we show that IDs from the database of origin (ALFRED in this case) are supported as well.
This mixture will contain 50,000 reads from three individuals of Mexican, Finn, and Punjabi origin.
By default, the mixture contains even contributions from each individual, but here we specify uneven contributions using the `--proportions` flag.

In [14]:
!mhpl8r mixture --indiv SA004049R --indiv SA004110G --indiv SA004240K --proportions 0.7 0.2 0.1 \
    --num-reads 50000 --out mixture.fastq.gz --panel beta

[MicroHapulator] running version 0.2
[MicroHapulator::mixture] Individual population=SA004049R numreads=35000
[MicroHapulator::sim] simulated microhaplotype variation at 50 loci
INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.app:Setting random seed to 2106520924
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 2 cpus for read generation
INFO:iss.app:Generating 70000 reads
INFO:iss.app:Generating reads for record: MHDBL000007:hap1
INFO:iss.app:Generating reads for record: MHDBL000007:hap2
INFO:iss.app:Generating reads for record: MHDBL000009:hap1
INFO:iss.app:Generating reads for record: MHDBL000009:hap2
INFO:iss.app:Generating reads for record: MHDBL000013:hap1
INFO:iss.app:Generating reads for record: MHDBL000013:hap2
INFO:iss.app:Generating reads for record: MHDBL000019:hap1
INFO:iss.app:Generating reads for record: MHDBL000019:hap2
INFO:iss.app:Generating reads for record: MHDBL000020:h

INFO:iss.app:Generating reads for record: MHDBL000038:hap1
INFO:iss.app:Generating reads for record: MHDBL000038:hap2
INFO:iss.app:Generating reads for record: MHDBL000048:hap1
INFO:iss.app:Generating reads for record: MHDBL000048:hap2
INFO:iss.app:Generating reads for record: MHDBL000051:hap1
INFO:iss.app:Generating reads for record: MHDBL000051:hap2
INFO:iss.app:Generating reads for record: MHDBL000055:hap1
INFO:iss.app:Generating reads for record: MHDBL000055:hap2
INFO:iss.app:Generating reads for record: MHDBL000056:hap1
INFO:iss.app:Generating reads for record: MHDBL000056:hap2
INFO:iss.app:Generating reads for record: MHDBL000063:hap1
INFO:iss.app:Generating reads for record: MHDBL000063:hap2
INFO:iss.app:Generating reads for record: MHDBL000066:hap1
INFO:iss.app:Generating reads for record: MHDBL000066:hap2
INFO:iss.app:Generating reads for record: MHDBL000068:hap1
INFO:iss.app:Generating reads for record: MHDBL000068:hap2
INFO:iss.app:Generating reads for record: MHDBL000072:ha

INFO:iss.app:Generating reads for record: MHDBL000103:hap2
INFO:iss.app:Generating reads for record: MHDBL000105:hap1
INFO:iss.app:Generating reads for record: MHDBL000105:hap2
INFO:iss.app:Generating reads for record: MHDBL000108:hap1
INFO:iss.app:Generating reads for record: MHDBL000108:hap2
INFO:iss.app:Generating reads for record: MHDBL000111:hap1
INFO:iss.app:Generating reads for record: MHDBL000111:hap2
INFO:iss.app:Generating reads for record: MHDBL000113:hap1
INFO:iss.app:Generating reads for record: MHDBL000113:hap2
INFO:iss.app:Generating reads for record: MHDBL000115:hap1
INFO:iss.app:Generating reads for record: MHDBL000115:hap2
INFO:iss.app:Generating reads for record: MHDBL000129:hap1
INFO:iss.app:Generating reads for record: MHDBL000129:hap2
INFO:iss.app:Generating reads for record: MHDBL000131:hap1
INFO:iss.app:Generating reads for record: MHDBL000131:hap2
INFO:iss.app:Generating reads for record: MHDBL000132:hap1
INFO:iss.app:Generating reads for record: MHDBL000132:ha
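
The `numreads=35000` reported for the first individual in the log above is consistent with splitting the 50,000 requested reads according to the given proportions. The following sketch shows that arithmetic. The function `reads_per_contributor` is hypothetical, and the handling of rounding remainders is an assumption; MicroHapulator's own allocation may differ in detail.

```python
def reads_per_contributor(total_reads, proportions):
    """Split a total read count among contributors by relative proportion."""
    if abs(sum(proportions) - 1.0) > 1e-6:
        raise ValueError("proportions must sum to 1")
    counts = [round(total_reads * p) for p in proportions]
    # Assumption: any rounding remainder goes to the largest contributor
    counts[counts.index(max(counts))] += total_reads - sum(counts)
    return counts


print(reads_per_contributor(50000, [0.7, 0.2, 0.1]))  # [35000, 10000, 5000]
```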

As before, let's align the reads to the loci of interest.

In [15]:
!bwa mem beta-panel.fasta mixture.fastq.gz \
    | samtools view -bS \
    | samtools sort -o mixture.bam -
!samtools index mixture.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.370 CPU sec, 5.358 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.500 CPU sec, 2.478 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta mixture.fastq.gz
[main] Real time: 8.083 sec; CPU: 7.994 sec


Next, we call genotypes for the mixture sample.

In [16]:
!mhpl8r type --out mixture.genotype.json beta-panel.fasta mixture.bam

[MicroHapulator] running version 0.2
[MicroHapulator::type] discarded 7305 reads with gaps or missing data at positions of interest


Let's take a look at the inferred genotypes for the mixture sample.

In [17]:
!head -n 25 mixture.genotype.json

{
    "MHDBL000007": {
        "allele_counts": {
            "C,C,A,G": 1,
            "C,T,A,A": 1,
            "T,C,A,G": 317,
            "T,C,G,G": 1,
            "T,T,A,A": 365,
            "T,T,A,G": 1,
            "T,T,G,A": 1,
            "T,T,G,G": 126
        },
        "genotype": [
            "T,C,A,G",
            "T,T,A,A",
            "T,T,G,G"
        ],
        "max_coverage": 1000,
        "mean_coverage": 858.9,
        "min_coverage": 10,
        "num_discarded_reads": 187
    },
    "MHDBL000009": {
        "allele_counts": {
            "T,A,A": 810,


Because our first sample had relatively low coverage, we saw very few false alleles.
Here we have 5 times higher coverage, and so we begin to see more false alleles.
However, it is trivial to distinguish the true alleles from the false based on coverage, as MicroHapulator has done.
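
To illustrate why the distinction is easy here, the sketch below applies a simple relative-count filter to the allele counts reported for `MHDBL000007`. This is not a description of MicroHapulator's genotype calling procedure: the function `filter_alleles` and the 5% threshold are hypothetical choices made purely for illustration.

```python
# Allele counts for MHDBL000007, copied from mixture.genotype.json above
ALLELE_COUNTS = {
    "C,C,A,G": 1,
    "C,T,A,A": 1,
    "T,C,A,G": 317,
    "T,C,G,G": 1,
    "T,T,A,A": 365,
    "T,T,A,G": 1,
    "T,T,G,A": 1,
    "T,T,G,G": 126,
}


def filter_alleles(counts, min_fraction=0.05):
    """Keep alleles supported by at least min_fraction of all typed reads."""
    total = sum(counts.values())
    return sorted(a for a, n in counts.items() if n / total >= min_fraction)


print(filter_alleles(ALLELE_COUNTS))  # ['T,C,A,G', 'T,T,A,A', 'T,T,G,G']
```

Each false allele is supported by a single read, while the weakest true allele is supported by 126 reads, so a wide range of thresholds would give the same result at this locus.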

Now that genotypes have been called, MicroHapulator can estimate the number of contributors to the sample by looking at the number of alleles observed at each locus.

In [18]:
!mhpl8r contrib --json mixture.genotype.json

[MicroHapulator] running version 0.2
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 2,
    "perc_loci_max_alleles": 0.04
}

In this case, MicroHapulator's estimate of 3 contributors is correct!
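
The reasoning behind this kind of estimate is that a diploid individual carries at most two alleles per locus, so a locus with *n* distinct alleles implies at least ⌈n / 2⌉ contributors. The sketch below applies that rule to toy data. The function `estimate_contributors` and the locus names are hypothetical, and the reading of the three output fields is inferred from their names rather than from MicroHapulator's code.

```python
import math


def estimate_contributors(genotypes):
    """Estimate the minimum number of diploid contributors from allele calls."""
    max_alleles = max(len(alleles) for alleles in genotypes.values())
    num_loci = sum(1 for a in genotypes.values() if len(a) == max_alleles)
    return {
        # Each contributor accounts for at most 2 alleles at any locus
        "min_num_contrib": math.ceil(max_alleles / 2),
        "num_loci_max_alleles": num_loci,
        "perc_loci_max_alleles": round(num_loci / len(genotypes), 4),
    }


# Toy data with hypothetical loci and alleles
toy_genotypes = {
    "locusA": ["A,C", "A,T", "G,C", "G,T", "T,T"],
    "locusB": ["C,C", "C,G", "T,G"],
    "locusC": ["A,A", "A,G", "C,A", "C,G", "T,A"],
    "locusD": ["G,G", "T,G"],
}
print(estimate_contributors(toy_genotypes))
```

With a maximum of five alleles at two of the four toy loci, the estimate is at least three contributors.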

## Python API

The `mhpl8r` command is the most convenient way to invoke MicroHapulator, but all of the `mhpl8r` commands above can also be easily scripted using MicroHapulator's Python API.
The following Python code shows how this would be done.

In [19]:
# Simulate simple sample
simulator = microhapulator.sim.sim(
    ['MHDBP000037'], ['beta'], hapseed=24680, gtfile='sim.gt.bed',
    seqthreads=2, numreads=1000,
)
with microhapulator.open('sim-again.fastq.gz', 'w') as fh:
    for record in simulator:
        n, defline, sequence, qualities = record
        print(defline, sequence, '+\n', qualities, sep='', end='', file=fh)
print('DEBUG simulated', n, 'reads!')

# Align reads separately using shell or Python's subprocess module

# Infer genotype
genotype = microhapulator.type.type('sim.bam', 'beta-panel.fasta')
print(genotype.data['MHDBL000187'])

[MicroHapulator::sim] simulated microhaplotype variation at 50 loci


DEBUG simulated 1000 reads!
{'mean_coverage': 171.7, 'min_coverage': 1, 'max_coverage': 200, 'num_discarded_reads': 20, 'allele_counts': {'A,G,G': 91, 'T,A,T': 88}, 'genotype': ['A,G,G', 'T,A,T']}


[MicroHapulator::type] discarded 1535 reads with gaps or missing data at positions of interest
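
The alignment step left as a comment in the code above could be scripted with Python's `subprocess` module. The following sketch reproduces the `bwa mem | samtools view | samtools sort` pipeline used earlier in this notebook. The helper names `build_alignment_commands` and `align` are hypothetical, and the sketch assumes that `bwa` and `samtools` are available on the `PATH` and that the reference has already been indexed with `bwa index`.

```python
import subprocess


def build_alignment_commands(refr, reads, bam):
    """Return argument lists mirroring the shell pipeline used in this demo."""
    return [
        ["bwa", "mem", refr, reads],
        ["samtools", "view", "-bS"],
        ["samtools", "sort", "-o", bam, "-"],
        ["samtools", "index", bam],
    ]


def align(refr, reads, bam):
    """Map reads, then sort and index the alignments (requires bwa, samtools)."""
    mem, view, sort, index = build_alignment_commands(refr, reads, bam)
    p1 = subprocess.Popen(mem, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(view, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # let bwa receive SIGPIPE if samtools view exits early
    p3 = subprocess.Popen(sort, stdin=p2.stdout)
    p2.stdout.close()
    for proc, cmd in ((p3, sort), (p2, view), (p1, mem)):
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd)
    subprocess.run(index, check=True)
```

For example, `align('beta-panel.fasta', 'sim-again.fastq.gz', 'sim-again.bam')` would align the reads simulated in the previous cell.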


There is not yet any dedicated documentation of the Python API, but all `mhpl8r` subcommand modules observe the following pattern.

- a "main" function that implements the core operation; the name of this function matches the name of the module; for example, for `mhpl8r mixture` the main function is `microhapulator.mixture.mixture`
- a "driver" function that accepts a single `args` argument, typically obtained from calling `parse_args()` on an `argparse.ArgumentParser` object; the name of this driver function is always (ironically) `main`

By inspecting the MicroHapulator code and observing these patterns, it should be straightforward to determine the Python code needed to replace a shell command that invokes `mhpl8r`.
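
As an illustration of this pattern, here is a self-contained sketch of a module for a made-up `gccontent` subcommand. Nothing in it is taken from MicroHapulator: the subcommand, the `get_parser` helper, and the argument names are all hypothetical. Only the division of labor between the "main" function and the `main` driver follows the convention described above.

```python
import argparse


def gccontent(sequence):
    """'Main' function: the core operation, named after its module."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(sequence.count(base) for base in "GC") / len(sequence)


def get_parser():
    """Build the command-line parser for the hypothetical subcommand."""
    parser = argparse.ArgumentParser(prog="gccontent")
    parser.add_argument("sequence", help="nucleotide sequence to summarize")
    return parser


def main(args):
    """'Driver' function: unpack parsed arguments and call the main function."""
    result = gccontent(args.sequence)
    print(f"{result:.2f}")
    return result


# The driver is fed by argparse, while the main function can be called directly
main(get_parser().parse_args(["GATTACA"]))
```

With this split, a script can call `gccontent(...)` directly and skip argument parsing, which is the same way the `microhapulator.sim.sim` and `microhapulator.type.type` calls above stand in for `mhpl8r sim` and `mhpl8r type`.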