# MicroHapulator: CLI Demo

**MicroHapulator** is software for forensic analysis of microhaplotype sequence data.
Features include the following:

- simulating simple (single-contributor) and complex (multi-contributor) DNA samples
- simulating MPS sequencing of user-specified microhap panels
- genotyping of DNA profiles from simple and complex DNA samples
- tools for deterministic and probabilistic interpretation of simple and complex samples

MicroHapulator relies on microhap marker definitions and allele frequencies from [MicroHapDB](https://github.com/bioforensics/MicroHapDB) and MPS error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

This notebook provides an interactive demonstration of MicroHapulator's command line interface.
Note that a `\` at the end of a line indicates that the command continues on the next line.

## Scenario

Consider a hypothetical forensic investigation, during the course of which two evidentiary DNA samples have been collected, and we must determine whether one or both of these match a reference sample from a person of interest.
We will call the evidentiary samples "e1" and "e2", and the reference sample "r1".
For now we can assume e1 and e2 are both single-source DNA samples.

Since we do not have real data for this hypothetical scenario, we will use MicroHapulator to simulate the genotype of each relevant individual as well as targeted MPS sequencing of each sample.
We will then demonstrate how to genotype, analyze, and interpret the DNA profiles obtained from each sample.

## Panel

In this demo we will assay a panel of 50 microhaplotype markers.
The identifiers for these markers are stored in a file called `beta-panel.txt`.
We can peek at the first few lines of this file to get an idea of its contents.

In [1]:
head -n 5 beta-panel.txt

mh01KK-205
mh01CP-016
mh01KK-117
mh02KK-138
mh02KK-136


Details about each of these markers are stored in MicroHapDB, so we'll use the `microhapdb` command to create a Fasta file containing the target amplicon sequence of each marker.
Later, we'll map MPS reads from each sample to these sequences to infer genotype profiles.

In [2]:
microhapdb marker --panel beta-panel.txt --format fasta > beta-panel.fasta

We can also peek at the first several lines of this file.
Each Fasta record includes the marker's name, amplicon sequence, and several bits of metadata.

In [3]:
head -n 10 beta-panel.fasta

>mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 variants=48,69,157,202 Xref=SI664550A
ATGGGGTAATTTGGGGTCCAGAGCACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCC
TCATTACATCCCTGCCAAGGACAAATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTT
AGTAAAGGAAGAGGTTACCTGAGACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCGAATTGTTAAGTC
AGGTCCAGAGT
>mh01CP-016 PermID=MHDBM-021e569a GRCh38:chr1:55559012-55559056 variants=103,141,147 Xref=SI664876L
TGAGAGAGCCCAGTGACCTAAGCAGCTCCAACCCTGAGACTGGATCTAATGATGATCCAGATAATCCAGTGCCCAGCTTA
GAGCCTGGCACACAACAAGTGCTTATAATGAAAGCATTAGTGAGTAAAAGAGTGATCCCTGGCTTTGAACTCCCTCTAAG
TGTACCCCCAGGCATCTGTTCTTCCCTCAGTCACAATGCTGACCCCACTTCATGACTGGTCTCCTCTCCTTTGATTGTGC
ACACAAGGGCC
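
As a quick sanity check, the number of Fasta records should equal the number of marker identifiers in the panel file (50 for the panel described here). The following is a minimal sketch using only `grep`; the function names `count_panel_markers` and `count_fasta_records` are hypothetical helpers and are not part of MicroHapDB or MicroHapulator.

```shell
# Hypothetical helpers (not part of microhapdb or mhpl8r).

# Count non-empty lines in a panel file (one marker identifier per line).
count_panel_markers() {
    grep -c . "$1"
}

# Count Fasta records by counting header lines, which begin with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example usage, assuming the files created above exist:
#   count_panel_markers beta-panel.txt
#   count_fasta_records beta-panel.fasta
```

If the two counts disagree, one or more identifiers in the panel file were probably not found in MicroHapDB.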


## Simulating genotypes

Next we will simulate genotype profiles for two individuals, "i1" and "i2."
MicroHapulator can use allele frequency distributions for any population or cohort in MicroHapDB to simulate genotypes that realistically reflect true allele frequencies.
We do this by invoking the `mhpl8r sim` command, which has two required inputs: the population from which each parental haplotype will be sampled, and the panel of marker loci that will be sampled.
We can also use an arbitrary "seed" value to set MicroHapulator's random number generator to a predictable state, so that the "random" profile it generates can be reproduced.

Here we use the `Iberian` population allele frequencies (ALFRED population "SA004108N"), and store the genotypes in files named `genotype-i1.json` and `genotype-i2.json`.

In [1]:
mhpl8r sim --seed 24680 --out genotype-i1.json Iberian Iberian beta-panel.txt
mhpl8r sim --seed 13579 --out genotype-i2.json Iberian Iberian beta-panel.txt

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-i1.json
[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-i2.json


Let's peek at the top of one of these files to explore its contents.
As you can see, it contains marker names and genotypes in JSON (JavaScript Object Notation) format, which is easily consumed by computers and also human readable (if a bit verbose).
For example, this profile is homozygous for the `T,G,G` allele at marker `mh01CP-016`, and heterozygous for the alleles `A,A,C,T` and `A,G,C,T` at marker `mh01KK-117`.
Here we can only see information for the first two markers, but information for all 50 markers is present in the remainder of the file.

In [2]:
head -n 25 genotype-i1.json

{
    "markers": {
        "mh01CP-016": {
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": 0
                },
                {
                    "allele": "T,G,G",
                    "haplotype": 1
                }
            ]
        },
        "mh01KK-117": {
            "genotype": [
                {
                    "allele": "A,A,C,T",
                    "haplotype": 0
                },
                {
                    "allele": "A,G,C,T",
                    "haplotype": 1
                }
            ]


## Simulated MPS Sequencing

In a real forensic case, case workers would extract DNA from each sample, amplify the targeted microhap loci, and then sequence them on their MPS platform of choice.
For this hypothetical scenario, we will simulate Illumina MiSeq sequencing of the evidentiary and reference samples.
This is done with the `mhpl8r seq` command, which has only one required input for single-source samples: a genotype profile.
Optionally, we can also specify the number of reads to simulate (MicroHapulator simulates 500,000 by default), the number of threads to use to speed up the simulation, and the name of the file to which the reads will be written (by default, they are printed directly to the terminal).

As with `mhpl8r sim`, we can also specify an arbitrary seed if we want to reproduce exactly the same sequences for the same genotype profile.
Using *two different seeds* to simulate reads from the same genotype profile twice represents sequencing two independent samples of the same individual.

First, we will create our reference sample "r1" by simulating reads from genotype "i1."

In [4]:
mhpl8r seq --num-reads 10000 --out reads-r1.fastq.gz --seed 666410524 --threads 2 genotype-i1.json

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::seq] Individual seed=666410524 numreads=10000


We will also create evidentiary sample "e1" from the same genotype, using a different seed to represent sequencing of an independent sample.

## Single-source Sample

Let's begin by simulating a simple single-source sample, such as one we would expect to obtain from a reference sample.
We use the `mhpl8r sim` command to simulate the DNA profile of the sample.
This command selects marker alleles randomly for each haplotype using the specified allele frequency distribution.
In this example, we use the `Iberian` population allele frequencies (ALFRED population "SA004108N") for both parental haplotypes.
The `--seed` argument allows us to set MicroHapulator's random number generator to a particular state, so that we can reproduce the same exact "random" genotype profile later.
Finally, we store the simulated profile in a file called `sim-profile-1.json`.

In [4]:
mhpl8r sim --seed 24680 --out sim-profile-1.json Iberian Iberian beta-panel.txt

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to sim-profile-1.json


Let's peek at the top of this file to get an idea of its contents.
It contains marker names and genotypes in JSON (JavaScript Object Notation) format, which is easily consumed by computers and also human readable (if a bit verbose).
For example, this profile is homozygous for the `T,G,G` allele at marker `mh01CP-016`, and heterozygous for the alleles `A,A,C,T` and `A,G,C,T` at marker `mh01KK-117`.

In [5]:
head -n 25 sim-profile-1.json

{
    "markers": {
        "mh01CP-016": {
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": 0
                },
                {
                    "allele": "T,G,G",
                    "haplotype": 1
                }
            ]
        },
        "mh01KK-117": {
            "genotype": [
                {
                    "allele": "A,A,C,T",
                    "haplotype": 0
                },
                {
                    "allele": "A,G,C,T",
                    "haplotype": 1
                }
            ]


We can simulate Illumina MiSeq reads from this sample with the `mhpl8r seq` command.
We instruct the computer to use 2 threads here, but on larger computers it's possible to speed up the simulation by using 32 or 64 threads.
Simulating 10,000 reads gives us about 200x coverage of each marker, and again we seed the random number generator for reproducibility.
The reads will be written in Fastq format to a file named `sim-reads-1a.fastq.gz`.

In [6]:
mhpl8r seq --threads 2 --num-reads 10000 --seed 666410524 --out sim-reads-1a.fastq.gz sim-profile-1.json

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::seq] Individual seed=666410524 numreads=10000


Let's confirm that MicroHapulator simulated 10,000 reads as requested.
We'll do this by counting the number of lines in the reads file.

In [7]:
gunzip -c sim-reads-1a.fastq.gz | wc -l 

40000


Looks good: 40,000 lines at 4 lines per record = 10,000 reads!
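
The same check can be wrapped in a small function that reports reads directly and fails loudly when the file is truncated. This is a minimal sketch using only `gunzip` and `awk`; `count_fastq_reads` and `reads_per_marker` are hypothetical helper names, and the sketch assumes exactly four lines per Fastq record (no wrapped sequence or quality lines).

```shell
# Hypothetical helpers using only gunzip and awk.

# Count reads in a gzipped Fastq file, assuming exactly 4 lines per record.
# Exits with a non-zero status if the line count is not a multiple of 4.
count_fastq_reads() {
    gunzip -c "$1" | awk 'END {
        if (NR % 4 != 0) {
            print "line count is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        print NR / 4
    }'
}

# Mean number of reads per marker if reads were spread evenly over the panel.
reads_per_marker() {
    awk -v reads="$1" -v markers="$2" 'BEGIN { print reads / markers }'
}

# Example usage, assuming the file created above exists:
#   count_fastq_reads sim-reads-1a.fastq.gz
#   reads_per_marker 10000 50
```

With 10,000 reads and 50 markers, `reads_per_marker` gives the roughly 200x per-marker figure quoted earlier; actual coverage will vary from marker to marker.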

Now we need to map these reads back to the target amplicon sequences.
First we need to index the amplicon sequences.
These commands only need to be invoked once.

In [8]:
bwa index beta-panel.fasta
samtools faidx beta-panel.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index beta-panel.fasta
[main] Real time: 0.015 sec; CPU: 0.015 sec


Now we use the `bwa mem` algorithm to align the simulated reads to the amplicon sequences.
The `samtools view`, `samtools sort`, and `samtools index` commands will format the `bwa mem` output and create a well-formed BAM file for subsequent analysis.

In [9]:
bwa mem beta-panel.fasta sim-reads-1a.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-1a.bam -
samtools index sim-reads-1a.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10000 sequences (3010000 bp)...
[M::mem_process_seqs] Processed 10000 reads in 1.320 CPU sec, 1.325 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-1a.fastq.gz
[main] Real time: 1.459 sec; CPU: 1.367 sec


With the reads aligned, we are ready to infer a genotype profile for this sample.
We do this with the `mhpl8r type` command.
We provide it the amplicon sequences and the aligned reads, and it will write the inferred profile to a file named `obs-profile-1a.json`.

In [10]:
mhpl8r type --out obs-profile-1a.json beta-panel.fasta sim-reads-1a.bam

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::type] discarded 1624 reads with gaps or missing data at positions of interest


If we peek at the top of this file, we'll see many similarities to the mock profile in the `sim-profile-1.json` file that we used to simulate the reads: specifically, marker names and genotype information in JSON format.
In contrast, this new file includes additional count/coverage data, and inter-marker haplotype phasing is absent.

In [11]:
head -n 25 obs-profile-1a.json

{
    "markers": {
        "mh01CP-016": {
            "allele_counts": {
                "T,G,G": 186
            },
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": null
                }
            ],
            "max_coverage": 200,
            "mean_coverage": 199.0,
            "min_coverage": 158,
            "num_discarded_reads": 12
        },
        "mh01KK-117": {
            "allele_counts": {
                "A,A,C,G": 1,
                "A,A,C,T": 77,
                "A,G,C,G": 1,
                "A,G,C,T": 77
            },
            "genotype": [


We can simulate sequencing of another DNA sample from this same individual by re-running the `mhpl8r seq` command with a different random seed.
Let's imagine this sample represents an evidentiary sample that, fortunately for us, is a clean and simple single-contributor sample.

In [12]:
mhpl8r seq --threads 2 --num-reads 2000 --seed 3374532379 --out sim-reads-1b.fastq.gz sim-profile-1.json

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::seq] Individual seed=3374532379 numreads=2000


We use the same procedure as before to map the reads and infer the genotype profile.
For this second sample the profile is stored in the file `obs-profile-1b.json`.

In [13]:
bwa mem beta-panel.fasta sim-reads-1b.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-1b.bam -
samtools index sim-reads-1b.bam
mhpl8r type --out obs-profile-1b.json beta-panel.fasta sim-reads-1b.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (602000 bp)...
[M::mem_process_seqs] Processed 2000 reads in 0.256 CPU sec, 0.263 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-1b.fastq.gz
[main] Real time: 0.286 sec; CPU: 0.267 sec
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::type] discarded 268 reads with gaps or missing data at positions of interest
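
Since the same map-and-genotype steps are repeated for every sample, they can be collected into a shell function. The sketch below is one possible wrapper, not something shipped with MicroHapulator; the name `map_and_type` and its argument order are hypothetical, and it uses only the commands and flags already shown in this demo.

```shell
# Hypothetical helper: align one sample's reads and infer its profile.
# Arguments: <reads.fastq.gz> <out.bam> <out-profile.json> [panel.fasta]
# The panel defaults to beta-panel.fasta, which must already be indexed.
map_and_type() {
    reads="$1"
    bam="$2"
    profile="$3"
    panel="${4:-beta-panel.fasta}"

    # Align, convert to BAM, and sort. Without "pipefail", only a failure of
    # the final stage (samtools sort) is detected here.
    bwa mem "$panel" "$reads" | samtools view -bS - | samtools sort -o "$bam" - || return 1

    # Index the sorted BAM, then infer the genotype profile.
    samtools index "$bam" || return 1
    mhpl8r type --out "$profile" "$panel" "$bam"
}

# Example usage, equivalent to the three commands above:
#   map_and_type sim-reads-1b.fastq.gz sim-reads-1b.bam obs-profile-1b.json
```

The function sets plain (global) shell variables to stay portable across POSIX shells; in bash, declaring them with `local` would be tidier.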


How then do we analyze and compare the reference sample and the evidentiary sample?
MicroHapulator implements two operations for comparing single-source samples.
The first is the `mhpl8r dist` operation, which computes the naïve Hamming distance between two sample profiles.
The Hamming distance simply represents the number of markers at which the two profiles differ.
A Hamming distance of 0 represents a perfect match, while a distance of 50 (in the case of this panel) represents a mismatch at every marker.
Let's use the `mhpl8r dist` command to compare the reference sample and the evidentiary sample.

In [14]:
mhpl8r dist obs-profile-1a.json obs-profile-1b.json

[MicroHapulator] running version 0.3+27.g6033b4e
{
    "hamming_distance": 0
}

MicroHapulator confirms that these profiles are a perfect match.
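
To make the definition concrete, here is a minimal sketch of a marker-level Hamming distance. It is not MicroHapulator's implementation: it assumes each profile has first been flattened into a hypothetical two-column, tab-separated file with the marker name in the first column and the genotype (alleles sorted and joined with `|`) in the second. The function name `hamming_distance` is likewise hypothetical.

```shell
# Hypothetical sketch, not MicroHapulator's code.
# Input: two tab-separated files, each with lines of the form
#   <marker>\t<genotype>
# Output: the number of markers whose genotype differs between the files.
# A marker present in only one file counts as a difference.
hamming_distance() {
    awk -F '\t' '
        NR == FNR { first[$1] = $2; next }          # load the first profile
        {
            seen[$1] = 1
            if (!($1 in first) || first[$1] != $2) diff++
        }
        END {
            for (m in first) if (!(m in seen)) diff++
            print diff + 0
        }
    ' "$1" "$2"
}

# Example usage with hypothetical flattened profiles:
#   hamming_distance profile-a.tsv profile-b.tsv
```

Comparing whole genotypes as strings means that a marker differing in one allele and a marker differing in both alleles each add 1 to the distance.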

The Hamming distance is simple to compute but difficult to interpret and defend.
It would be more helpful if we could report a likelihood ratio (LR) suggesting the strength of the profile match.
MicroHapulator's second operation for comparing single-source samples is the `mhpl8r prob` command, which assesses the relative likelihood of the following propositions.

- $H_p$: the reference sample and evidentiary sample were derived from the same individual
- $H_d$: the reference sample and evidentiary sample were derived from two unrelated individuals in the population

The probability $P(H_p) = \epsilon^R$, where $\epsilon$ is the per-marker rate of genotyping error and $R$ is the number of allele mismatches between the reference and evidentiary samples.
The probability $P(H_d)$ is the random match probability (RMP) of the profile.
Note that in cases of a perfect match, $P(H_p) = 1$ and thus the LR is the reciprocal of the RMP.

Now let's use `mhpl8r prob` to compare the reference and evidentiary samples.

In [15]:
mhpl8r prob Iberian obs-profile-1a.json obs-profile-1b.json

[MicroHapulator] running version 0.3+27.g6033b4e
{
    "likelihood_ratio": "2.604E+59"
}

Here we see a very large LR of $2.6 \times 10^{59}$, strongly supporting $H_p$ over $H_d$.
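
Because these two profiles match perfectly, the relationship stated above lets us back out the random match probability implied by this LR. The arithmetic below uses only the reported value and assumes that a Hamming distance of 0 corresponds to $R = 0$; it is a worked illustration, not additional output from MicroHapulator.

$$
\mathrm{LR} = \frac{P(H_p)}{P(H_d)} = \frac{\epsilon^{R}}{\mathrm{RMP}},
\qquad R = 0 \;\Rightarrow\; \mathrm{LR} = \frac{1}{\mathrm{RMP}}
$$

$$
\mathrm{RMP} = \frac{1}{2.604 \times 10^{59}} \approx 3.84 \times 10^{-60}
$$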

Let us now create another sample, but instead of simulating reads from the same individual we'll simulate a different individual, again using the `Iberian` allele frequencies for both parental haplotypes.
We use `mhpl8r sim` to create a new mock profile (using a different seed to represent a different individual), `mhpl8r seq` to simulate Illumina sequencing, `bwa mem` and `samtools` to align the reads, and `mhpl8r type` to infer the genotype profile.

In [16]:
mhpl8r sim --seed 13579 --out sim-profile-2.json Iberian Iberian beta-panel.txt
mhpl8r seq --threads 2 --num-reads 2000 --seed 3963949764 --out sim-reads-2.fastq.gz sim-profile-2.json
bwa mem beta-panel.fasta sim-reads-2.fastq.gz | samtools view -bS - | samtools sort -o sim-reads-2.bam -
samtools index sim-reads-2.bam
mhpl8r type --out obs-profile-2.json beta-panel.fasta sim-reads-2.bam

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to sim-profile-2.json
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::seq] Individual seed=3963949764 numreads=2000
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (602000 bp)...
[M::mem_process_seqs] Processed 2000 reads in 0.260 CPU sec, 0.269 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta sim-reads-2.fastq.gz
[main] Real time: 0.288 sec; CPU: 0.272 sec
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::type] discarded 314 reads with gaps or missing data at positions of interest


If we compare the profile from this new individual with the evidentiary sample from the first individual, we see a very different story.
The naïve Hamming distance shows differences at 41/50 markers, and the LR test statistic is very small, $2.6 \times 10^{-115}$, strongly supporting $H_d$ over $H_p$ for this analysis.

In [17]:
mhpl8r dist obs-profile-1b.json obs-profile-2.json
mhpl8r prob Iberian obs-profile-1b.json obs-profile-2.json

[MicroHapulator] running version 0.3+27.g6033b4e
{
    "hamming_distance": 41
}
[MicroHapulator] running version 0.3+27.g6033b4e
{
    "likelihood_ratio": "2.604E-115"
}
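
Comparing this LR with the $2.604 \times 10^{59}$ reported for the matching pair gives a rough sense of the penalty applied for mismatches. The arithmetic below assumes that both comparisons share the same RMP, which the identical mantissas (2.604) suggest but the output does not confirm.

$$
\frac{\mathrm{LR}_{\text{mismatch}}}{\mathrm{LR}_{\text{match}}}
= \frac{2.604 \times 10^{-115}}{2.604 \times 10^{59}}
= 10^{-174}
$$

Under the model $\mathrm{LR} = \epsilon^{R} / \mathrm{RMP}$, this ratio corresponds to $\epsilon^{R}$. One combination that would produce it is $\epsilon = 10^{-3}$ with $R = 58$, but neither value is reported in the output, so treat this as an illustration only.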

## Multiple-contributor Sample

In [18]:
mhpl8r sim --seed 1234 --out mix-profile-1.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5678 --out mix-profile-2.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 1029 --out mix-profile-3.json SA004250L SA004250L beta-panel.txt

mhpl8r seq --proportions 0.7 0.2 0.1 --num-reads 20000 --threads 2 --out mix-reads.fastq.gz \
    mix-profile-1.json mix-profile-2.json mix-profile-3.json

[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-1.json
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-2.json
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-3.json
[MicroHapulator] running version 0.3+27.g6033b4e
[MicroHapulator::seq] Individual seed=4210471426 numreads=14000
[MicroHapulator::seq] Individual seed=1473365171 numreads=4000
[MicroHapulator::seq] Individual seed=751608051 numreads=2000
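
The per-contributor read counts in the log above (14000, 4000, and 2000) are what we would expect from multiplying the total of 20,000 reads by each value passed to `--proportions`. The sketch below reproduces that arithmetic; `reads_per_contributor` is a hypothetical helper, and rounding to the nearest whole read is an assumption rather than a documented behavior of `mhpl8r seq`.

```shell
# Hypothetical helper: split a total read count among contributors according
# to their proportions, rounding each share to the nearest whole read.
# Arguments: <total_reads> <proportion> [<proportion> ...]
# Prints one read count per line, in the order the proportions were given.
reads_per_contributor() {
    total="$1"
    shift
    for p in "$@"; do
        awk -v n="$total" -v p="$p" 'BEGIN { printf "%d\n", int(n * p + 0.5) }'
    done
}

# Example usage:
#   reads_per_contributor 20000 0.7 0.2 0.1
```

Because each share is rounded independently, the shares are not guaranteed to sum to the total for arbitrary proportions.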


In [19]:
bwa mem beta-panel.fasta mix-reads.fastq.gz | samtools view -bS - | samtools sort -o mix-reads.bam -
samtools index mix-reads.bam
mhpl8r type --out obs-profile-mix.json beta-panel.fasta mix-reads.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 20000 sequences (6020000 bp)...
[M::mem_process_seqs] Processed 20000 reads in 2.510 CPU sec, 2.535 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta mix-reads.fastq.gz
[main] Real time: 2.778 sec; CPU: 2.590 sec
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
[MicroHapulator::type] discarded 1792 reads with gaps or missing data at positions of interest


In [20]:
mhpl8r contrib --json obs-profile-mix.json

[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 1,
    "perc_loci_max_alleles": 0.02
}
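
A common rule of thumb in mixture analysis is that each diploid contributor carries at most two alleles per locus, so a locus with $n$ distinct alleles implies at least $\lceil n/2 \rceil$ contributors. The fields reported above look consistent with that rule (one locus out of 50 is 0.02), though this is an assumption about how `mhpl8r contrib` works rather than something stated in this demo. The sketch below applies the rule to a hypothetical input file listing the number of distinct alleles observed at each locus, one integer per line.

```shell
# Hypothetical sketch, not MicroHapulator's code.
# Input: a file with one integer per line, the number of distinct alleles
# observed at each locus.
# Output (space-separated): the maximum allele count, the minimum number of
# contributors, the number of loci at that maximum, and the fraction of loci
# at that maximum.
min_contributors() {
    awk '
        NF == 0 { next }                      # skip blank lines
        {
            n = $1 + 0
            loci++
            count[n]++
            if (n > max) max = n
        }
        END {
            if (loci == 0) exit 1
            # ceil(max / 2) for a positive integer max
            printf "%d %d %d %.2f\n", max, int((max + 1) / 2), count[max], count[max] / loci
        }
    ' "$1"
}

# Example usage with a hypothetical file of per-locus allele counts:
#   min_contributors allele-counts.txt
```

This gives only a lower bound: contributors who share alleles with each other can go undetected.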

In [21]:
mhpl8r contain obs-profile-mix.json mix-profile-1.json
mhpl8r contain obs-profile-mix.json mix-profile-2.json
mhpl8r contain obs-profile-mix.json mix-profile-3.json

[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "containment": 1.0,
    "contained_alleles": 78,
    "total_alleles": 78
}
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "containment": 0.9639,
    "contained_alleles": 80,
    "total_alleles": 83
}
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "containment": 0.9211,
    "contained_alleles": 70,
    "total_alleles": 76
}
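
In each result above, the `containment` value equals `contained_alleles` divided by `total_alleles`, rounded to four decimal places (for example, 80 / 83 ≈ 0.9639). The sketch below reproduces that arithmetic; `containment` is a hypothetical helper name, and the four-decimal formatting is chosen to mirror the output shown here rather than taken from MicroHapulator's source.

```shell
# Hypothetical helper: fraction of a profile's alleles found in a mixture.
# Arguments: <contained_alleles> <total_alleles>
# Exits with a non-zero status if total_alleles is not positive.
containment() {
    awk -v c="$1" -v t="$2" 'BEGIN {
        if (t <= 0) exit 1
        printf "%.4f\n", c / t
    }'
}

# Example usage with the counts reported above:
#   containment 80 83
#   containment 70 76
```

Note that the sketch prints a full match as `1.0000`, whereas the tool prints `1.0`.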

In [22]:
mhpl8r sim --seed 3847 --out mix-profile-4.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5656 --out mix-profile-5.json SA004250L SA004250L beta-panel.txt
mhpl8r contain obs-profile-mix.json mix-profile-4.json
mhpl8r contain obs-profile-mix.json mix-profile-5.json

[MicroHapulator] running version 0.3+27.g6033b4e.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-4.json
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to mix-profile-5.json
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "containment": 0.7024,
    "contained_alleles": 59,
    "total_alleles": 84
}
[MicroHapulator] running version 0.3+27.g6033b4e.dirty
{
    "containment": 0.6977,
    "contained_alleles": 60,
    "total_alleles": 86
}