# MicroHapulator: CLI Demo

**MicroHapulator** is software for forensic analysis of microhaplotype sequence data.
Features include the following:

- simulating simple (single-contributor) and complex (multi-contributor) DNA samples
- simulating MPS sequencing of user-specified microhap panels
- genotyping of DNA profiles from simple and complex DNA samples
- tools for deterministic and probabilistic interpretation of simple and complex samples

MicroHapulator relies on microhap marker definitions and allele frequencies from [MicroHapDB](https://github.com/bioforensics/MicroHapDB) and MPS error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

## Synopsis

This notebook provides an interactive demonstration of the command line interface to MicroHapulator's analysis and interpretation tools.
Bioinformaticians and programmers may also be interested in MicroHapulator's Python API, which is described in [demo-api.ipynb](demo-api.ipynb).
MicroHapulator can also be used to construct mock genotypes and simulate MPS sequencing of simple and complex samples—see [demo-sim.ipynb](demo-sim.ipynb) for more details.

## Preliminaries

In the two scenarios below, we will be comparing reference samples and evidentiary samples that have been measured with an MPS assay.
The assay targets a panel of 50 microhaplotype markers.
To genotype and interpret our samples, we must first map the reads from each sample to the target amplicon sequences.
This requires a bit of prep work on our part.
(These steps only need to be performed once for a particular panel design.)

The identifiers for the 50 markers in our panel have been stored in a file called `beta-panel.txt`.
We can peek at the first few lines of this file to get an idea of its contents.

In [1]:
head -n 5 beta-panel.txt

mh01KK-205
mh01CP-016
mh01KK-117
mh02KK-138
mh02KK-136


Detailed information about each of these markers is stored in MicroHapDB, so we'll use the `microhapdb` command to create a Fasta file of the target amplicon sequences.
This is the file we'll use as a reference later when we're mapping MPS reads.

In [2]:
microhapdb marker --panel beta-panel.txt --format fasta > beta-panel.fasta

We can also peek at the first several lines of this file. Each Fasta record includes the marker's name, amplicon sequence, and several bits of metadata.

In [3]:
head -n 10 beta-panel.fasta

>mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 variants=48,69,157,202 Xref=SI664550A
ATGGGGTAATTTGGGGTCCAGAGCACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCC
TCATTACATCCCTGCCAAGGACAAATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTT
AGTAAAGGAAGAGGTTACCTGAGACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCGAATTGTTAAGTC
AGGTCCAGAGT
>mh01CP-016 PermID=MHDBM-021e569a GRCh38:chr1:55559012-55559056 variants=103,141,147 Xref=SI664876L
TGAGAGAGCCCAGTGACCTAAGCAGCTCCAACCCTGAGACTGGATCTAATGATGATCCAGATAATCCAGTGCCCAGCTTA
GAGCCTGGCACACAACAAGTGCTTATAATGAAAGCATTAGTGAGTAAAAGAGTGATCCCTGGCTTTGAACTCCCTCTAAG
TGTACCCCCAGGCATCTGTTCTTCCCTCAGTCACAATGCTGACCCCACTTCATGACTGGTCTCCTCTCCTTTGATTGTGC
ACACAAGGGCC
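
The metadata on each Fasta header line is easy to pull apart programmatically. The following is a minimal sketch in Python, assuming every header follows the `key=value` layout seen in the two records above; `parse_defline` is a hypothetical helper written for this illustration, not part of MicroHapDB or MicroHapulator. The `variants` values appear to be offsets of the marker's variants within the amplicon sequence, but that reading is an assumption.

```python
def parse_defline(defline):
    """Split a Fasta header like the ones shown above into its parts.

    Hypothetical helper for illustration only.
    """
    fields = defline.lstrip(">").split()
    record = {"marker": fields[0], "variants": [], "attributes": {}}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            if key == "variants":
                # Comma-separated integers, e.g. 48,69,157,202
                record["variants"] = [int(v) for v in value.split(",")]
            else:
                record["attributes"][key] = value
        elif ":" in field:
            # Field without "=", e.g. GRCh38:chr1:18396197-18396351
            record["attributes"]["coordinates"] = field
    return record


header = (
    ">mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 "
    "variants=48,69,157,202 Xref=SI664550A"
)
record = parse_defline(header)
print(record["marker"], record["variants"])
```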


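Next, we need to index the file to facilitate rapid lookups during the read mapping procedure: `bwa index` builds the index used by the aligner, and `samtools faidx` builds a sequence index for random access to the Fasta records.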
With this, the panel is ready to use for inferring genotype profiles from MPS reads.

In [4]:
bwa index beta-panel.fasta
samtools faidx beta-panel.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index beta-panel.fasta
[main] Real time: 0.015 sec; CPU: 0.015 sec


Finally, we need to download the MPS reads for our mock scenarios.

In [5]:
curl -L https://osf.io/x67ze/download > reads-EVD1.fastq.gz
curl -L https://osf.io/ukpqb/download > reads-EVD2.fastq.gz
curl -L https://osf.io/b6h3w/download > reads-EVD3.fastq.gz
curl -L https://osf.io/sj5d9/download > reads-REF1.fastq.gz
curl -L https://osf.io/x48sn/download > reads-REF2.fastq.gz
curl -L https://osf.io/p76v9/download > reads-REF3.fastq.gz
curl -L https://osf.io/b4fn8/download > reads-REF4.fastq.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0    805      0 --:--:-- --:--:-- --:--:--   803
100  149k  100  149k    0     0  91826      0  0:00:01  0:00:01 --:--:--  887k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1176      0 --:--:-- --:--:-- --:--:--  1176
100 96050  100 96050    0     0  76533      0  0:00:01  0:00:01 --:--:--  615k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0    843      0 --:--:-- --:--:-- --:--:--   842
100 6730k  100 6730k    0     0  1249k      0  0:00:05  0:00:05 --:--:-- 1849k
  % Total    % Received % Xferd  Average Speed   Tim

## Scenario 1

In this scenario, we have collected two evidentiary samples in the course of a forensic investigation.
These samples have been labeled **EVD1** and **EVD2**.
We have been assured that these are both single-source DNA samples.
We also have a reference sample labeled **REF1** collected from a person of interest in the investigation.
Each sample was assayed with our 50 microhap MPS panel, and the reads were stored in three files: `reads-EVD1.fastq.gz`, `reads-EVD2.fastq.gz`, and `reads-REF1.fastq.gz`.

In [6]:
ls -1 reads-EVD1.fastq.gz reads-EVD2.fastq.gz reads-REF1.fastq.gz

reads-EVD1.fastq.gz
reads-EVD2.fastq.gz
reads-REF1.fastq.gz


The first step in our analysis is to map the reads in each sample to the target amplicon sequences in `beta-panel.fasta`.
For this we use the `bwa mem` algorithm, but other aligners such as `bowtie2` would also be appropriate here.
We also use `samtools` to convert the uncompressed alignments in SAM format to sorted, compressed, and indexed read alignments in BAM format.

First we process sample **EVD1** with the following commands.

In [7]:
bwa mem beta-panel.fasta reads-EVD1.fastq.gz | samtools view -bS - | samtools sort -o reads-EVD1.bam -
samtools index reads-EVD1.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 5000 sequences (1505000 bp)...
[M::mem_process_seqs] Processed 5000 reads in 0.623 CPU sec, 0.626 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-EVD1.fastq.gz
[main] Real time: 0.677 sec; CPU: 0.643 sec


We can repeat this for **EVD2** and **REF1**.

In [8]:
bwa mem beta-panel.fasta reads-EVD2.fastq.gz | samtools view -bS - | samtools sort -o reads-EVD2.bam -
samtools index reads-EVD2.bam

bwa mem beta-panel.fasta reads-REF1.fastq.gz | samtools view -bS - | samtools sort -o reads-REF1.bam -
samtools index reads-REF1.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 5000 sequences (1505000 bp)...
[M::mem_process_seqs] Processed 5000 reads in 0.603 CPU sec, 0.604 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-EVD2.fastq.gz
[main] Real time: 0.655 sec; CPU: 0.622 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.408 CPU sec, 4.372 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.175 CPU sec, 2.155 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF1.fastq.gz
[main] Real time: 6.799 sec; CPU: 6.706 sec
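
Because the same pipeline is repeated for every sample, it can be convenient to generate the command lines from the sample label. The sketch below shows one way to do that in Python. `mapping_commands` is a hypothetical helper, not part of MicroHapulator, and it only builds strings that mirror the commands shown above; it does not run anything itself.

```python
import shlex


def mapping_commands(sample, reference="beta-panel.fasta"):
    """Return the two shell command lines used above for one sample label.

    Hypothetical helper; file names follow the reads-<label> convention
    used in this demo.
    """
    reads = f"reads-{sample}.fastq.gz"
    bam = f"reads-{sample}.bam"
    pipeline = (
        f"bwa mem {shlex.quote(reference)} {shlex.quote(reads)}"
        " | samtools view -bS -"
        f" | samtools sort -o {shlex.quote(bam)} -"
    )
    index = f"samtools index {shlex.quote(bam)}"
    return [pipeline, index]


for label in ["EVD1", "EVD2", "REF1"]:
    for command in mapping_commands(label):
        print(command)
```

Each string could then be passed to `subprocess.run(command, shell=True, check=True)`, assuming `bwa` and `samtools` are available on the `PATH`.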


We now have a `.bam` file for each sample.

In [9]:
ls -1 reads-EVD1.bam reads-EVD2.bam reads-REF1.bam

reads-EVD1.bam
reads-EVD2.bam
reads-REF1.bam


With the reads aligned, we have everything we need to infer a genotype profile for these samples.
This is done with the `mhpl8r type` command.
We provide it the amplicon sequences and the aligned reads, and it will store the inferred profile in the file named with `--out`, here `profile-EVD1.json`.

In [10]:
mhpl8r type --out profile-EVD1.json beta-panel.fasta reads-EVD1.bam

[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 654 reads with gaps or missing data at positions of interest


We should peek at the top of this file to get an idea of its contents.
The genotype profile is stored in JavaScript Object Notation (JSON), and lists the read counts for each allele observed at each microhap locus.
These are used to make preliminary genotype calls.
For example, marker `mh01CP-016` is called as homozygous for the allele `T,G,G` and `mh01KK-117` is heterozygous for alleles `A,A,C,T` and `A,G,C,T`.
The haplotypes *within* each marker have been resolved by the reads spanning its variants, but MicroHapulator does not attempt to resolve the haplotypes *between* markers (as indicated by the `"haplotype": null` elements).
Only the first two markers are shown here, but the rest of the file contains genotype calls for the remaining 48 markers.

In [11]:
head -n 30 profile-EVD1.json

{
    "markers": {
        "mh01CP-016": {
            "allele_counts": {
                "T,G,G": 96
            },
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": null
                }
            ],
            "max_coverage": 100,
            "mean_coverage": 99.7,
            "min_coverage": 86,
            "num_discarded_reads": 4
        },
        "mh01KK-117": {
            "allele_counts": {
                "A,A,C,T": 36,
                "A,G,C,T": 36
            },
            "genotype": [
                {
                    "allele": "A,A,C,T",
                    "haplotype": null
                },
                {
                    "allele": "A,G,C,T",
                    "haplotype": null
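
Since the profile is plain JSON, the genotype calls can be summarized with a few lines of Python. The sketch below assumes the file layout matches the excerpt above (a top-level `markers` object whose entries each carry a `genotype` list); the embedded string reproduces only the `allele_counts` and `genotype` fields of the two markers shown, and `genotype_calls` is a hypothetical helper rather than part of the MicroHapulator API.

```python
import json

# Trimmed copy of the two markers shown in the excerpt above.
profile_json = """
{
    "markers": {
        "mh01CP-016": {
            "allele_counts": {"T,G,G": 96},
            "genotype": [{"allele": "T,G,G", "haplotype": null}]
        },
        "mh01KK-117": {
            "allele_counts": {"A,A,C,T": 36, "A,G,C,T": 36},
            "genotype": [
                {"allele": "A,A,C,T", "haplotype": null},
                {"allele": "A,G,C,T", "haplotype": null}
            ]
        }
    }
}
"""


def genotype_calls(profile):
    """Map each marker to the sorted list of alleles in its genotype call."""
    return {
        marker: sorted(entry["allele"] for entry in data["genotype"])
        for marker, data in profile["markers"].items()
    }


profile = json.loads(profile_json)
for marker, alleles in genotype_calls(profile).items():
    # For a single-source sample, one allele means a homozygous call.
    zygosity = "homozygous" if len(alleles) == 1 else "heterozygous"
    print(marker, zygosity, " / ".join(alleles))
```

For a real profile, the string would be replaced by the contents of `profile-EVD1.json`, for example via `json.load(open("profile-EVD1.json"))`.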


Now we repeat this step for **EVD2** and **REF1**.

In [12]:
mhpl8r type --out profile-EVD2.json beta-panel.fasta reads-EVD2.bam
mhpl8r type --out profile-REF1.json beta-panel.fasta reads-REF1.bam

[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 666 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 7206 reads with gaps or missing data at positions of interest


We have now inferred a genotype profile for each sample.

In [13]:
ls -1 profile-EVD1.json profile-EVD2.json profile-REF1.json

profile-EVD1.json
profile-EVD2.json
profile-REF1.json


How then do we compare evidentiary samples with the reference sample?
MicroHapulator implements two operations for comparing single-source samples.
The first is the `mhpl8r dist` operation, which computes a naïve Hamming distance between two sample profiles.
Here, we define the Hamming distance as the number of markers at which the two profiles differ.
A Hamming distance of 0 represents a perfect match, while a distance of 50 (in the case of this panel) represents a mismatch at every marker.
Let's use the `mhpl8r dist` command to compare the reference sample **REF1** with the evidentiary sample **EVD1**.

In [14]:
mhpl8r dist profile-REF1.json profile-EVD1.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "hamming_distance": 0
}
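
To make the definition concrete, here is a minimal Python sketch of a marker-level Hamming distance. It assumes each profile has been reduced to a mapping from marker name to its called alleles; the function is a hypothetical illustration of the definition given above, not MicroHapulator's own implementation, and the `evd` genotype is made up for the example.

```python
def hamming_distance(calls_a, calls_b):
    """Count the markers at which two profiles have different allele sets."""
    markers = set(calls_a) | set(calls_b)
    return sum(
        1
        for marker in markers
        if set(calls_a.get(marker, ())) != set(calls_b.get(marker, ()))
    )


ref = {
    "mh01CP-016": ["T,G,G"],
    "mh01KK-117": ["A,A,C,T", "A,G,C,T"],
}
# Invented evidentiary calls: identical at the first marker,
# different at the second.
evd = {
    "mh01CP-016": ["T,G,G"],
    "mh01KK-117": ["A,A,C,T"],
}

print(hamming_distance(ref, ref))
print(hamming_distance(ref, evd))
```

Comparing allele sets rather than lists means the order in which alleles are reported does not affect the result.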

At first glance, these samples appear to be a match.
However, while Hamming distance may be simple to interpret, it doesn't provide any sense of confidence and would be difficult to defend in any formal legal context.
It would be more helpful if we could compute a likelihood ratio (LR) that quantifies the strength of the profile match.
The second operation MicroHapulator implements for comparing single-source samples is the `mhpl8r prob` command, which assesses the relative likelihood of the following propositions.

- $H_p$: the reference sample and evidentiary sample were derived from the same individual
- $H_d$: the reference sample and evidentiary sample were derived from two unrelated individuals in the population

The probability $P(H_p) = \epsilon^R$, where $\epsilon$ is a per-marker rate of genotyping error (default: 0.001) and $R$ is the number of allele mismatches between the reference and evidentiary samples.
The probability $P(H_d)$ is the random match probability (RMP) of the profile.
Note that in cases of a perfect match, $P(H_p) = 1$ and thus the LR is the reciprocal of the RMP.
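
Written out in the notation used above, the ratio being described is

$$
\text{LR} = \frac{P(H_p)}{P(H_d)} = \frac{\epsilon^R}{\text{RMP}}
$$

With $R = 0$ the numerator is 1 and the expression reduces to $\text{LR} = 1 / \text{RMP}$. Each additional mismatch multiplies the LR by $\epsilon$, which is a factor of 0.001 at the default setting.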

Now let's use `mhpl8r prob` to compare the reference and evidentiary samples.
We specify that MicroHapulator should use the `Iberian` population allele frequency distribution for computing this LR.

In [15]:
mhpl8r prob Iberian profile-REF1.json profile-EVD1.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "likelihood_ratio": "2.604E+59"
}

The result is a very large LR of $2.6 \times 10^{59}$, strongly supporting $H_p$ over $H_d$.
This gives us very strong evidence that **EVD1** and **REF1** are from the same individual.

Now, we repeat these comparisons for **EVD2**.

In [16]:
mhpl8r dist profile-REF1.json profile-EVD2.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "hamming_distance": 41
}

In [17]:
mhpl8r prob Iberian profile-REF1.json profile-EVD2.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "likelihood_ratio": "2.604E-115"
}

Here we see a very different story.
The Hamming distance shows differences at 41/50 markers, and the LR test statistic is very small, $2.6 \times 10^{-115}$, strongly supporting $H_d$ over $H_p$ for this sample.
The evidence is very strong that **EVD2** and **REF1** do not correspond to the same individual.
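
The two likelihood ratios reported in this scenario can be cross-checked against the definition of the test. The Python sketch below assumes that the LR is computed exactly as $\epsilon^R$ divided by the RMP, and that the RMP term is the same in both comparisons; the identical leading digits (2.604) are consistent with this, but it remains an assumption about the tool's internals. Under those assumptions, the ratio of the two LRs equals $\epsilon^R$, which lets us back out the implied number of allele mismatches.

```python
import math

EPSILON = 0.001  # default genotyping error rate quoted in the text

lr_evd1 = float("2.604E+59")   # REF1 vs EVD1, perfect match (R = 0)
lr_evd2 = float("2.604E-115")  # REF1 vs EVD2

# Assumption: LR = EPSILON**R / RMP with the same RMP in both comparisons,
# so log10(lr_evd2) - log10(lr_evd1) = R * log10(EPSILON).
log_ratio = math.log10(lr_evd2) - math.log10(lr_evd1)
implied_mismatches = round(log_ratio / math.log10(EPSILON))

# For a perfect match the LR is the reciprocal of the RMP.
rmp = 1 / lr_evd1

print(implied_mismatches)
print(f"{rmp:.3E}")
```

Under these assumptions the implied value is $R = 58$ allele mismatches, which would be spread across the 41 markers at which the two profiles differ.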

## Scenario 2

In this scenario, we have collected an evidentiary sample (**EVD3**) in the course of a forensic investigation, and there is some suspicion that this sample has multiple DNA contributors.
We have also collected reference samples from three persons of interest in the investigation, labeled **REF2**, **REF3**, and **REF4**.
As in the previous scenario, all four samples have been assayed with our 50 microhap MPS panel.
Reads are available in the following files.

In [18]:
ls -1 reads-EVD3.fastq.gz reads-REF2.fastq.gz reads-REF3.fastq.gz reads-REF4.fastq.gz

reads-EVD3.fastq.gz
reads-REF2.fastq.gz
reads-REF3.fastq.gz
reads-REF4.fastq.gz


As before, we will use `bwa mem` and `samtools` to align, sort, and index the reads for each sample.

In [19]:
bwa mem beta-panel.fasta reads-EVD3.fastq.gz | samtools view -bS - | samtools sort -o reads-EVD3.bam -
samtools index reads-EVD3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.188 CPU sec, 4.303 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.139 CPU sec, 2.152 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-EVD3.fastq.gz
[main] Real time: 6.711 sec; CPU: 6.450 sec


In [20]:
bwa mem beta-panel.fasta reads-REF2.fastq.gz | samtools view -bS - | samtools sort -o reads-REF2.bam -
samtools index reads-REF2.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.256 CPU sec, 4.296 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.072 CPU sec, 2.209 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF2.fastq.gz
[main] Real time: 6.784 sec; CPU: 6.450 sec


In [21]:
bwa mem beta-panel.fasta reads-REF3.fastq.gz | samtools view -bS - | samtools sort -o reads-REF3.bam -
samtools index reads-REF3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.219 CPU sec, 4.215 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.070 CPU sec, 2.047 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF3.fastq.gz
[main] Real time: 6.542 sec; CPU: 6.414 sec


In [22]:
bwa mem beta-panel.fasta reads-REF4.fastq.gz | samtools view -bS - | samtools sort -o reads-REF4.bam -
samtools index reads-REF4.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 3.990 CPU sec, 3.942 real sec
[M::mem_process_seqs] Processed 16726 reads in 1.987 CPU sec, 1.961 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF4.fastq.gz
[main] Real time: 6.172 sec; CPU: 6.100 sec


Next we use `mhpl8r type` to infer genotype profiles for each sample.

In [23]:
mhpl8r type --out profile-EVD3.json beta-panel.fasta reads-EVD3.bam
mhpl8r type --out profile-REF2.json beta-panel.fasta reads-REF2.bam
mhpl8r type --out profile-REF3.json beta-panel.fasta reads-REF3.bam
mhpl8r type --out profile-REF4.json beta-panel.fasta reads-REF4.bam

[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 3674 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 7483 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 7280 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.3+34.gb44b810.dirty
[MicroHapulator::type] discarded 7456 reads with gaps or missing data at positions of interest


We must now evaluate the evidentiary sample and see if we can confirm the presence of multiple DNA contributors.
The `mhpl8r contrib` command implements a simple check for determining the minimum number of contributors by scanning the sample profile to determine the maximum number of alleles $N_{\text{al}}$ present at any single locus.
From this, it can calculate the minimum number of sample contributors $C_{\text{min}}$ as follows.

$$
C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil
$$

In [24]:
mhpl8r contrib -j profile-EVD3.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 3,
    "perc_loci_max_alleles": 0.06
}
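
The fields in this output can be reproduced with a short calculation. The Python sketch below assumes a profile reduced to a mapping from marker name to its observed alleles, and assumes that `num_loci_max_alleles` counts the loci carrying the maximum number of alleles and that `perc_loci_max_alleles` is that count divided by the number of loci. The function is a hypothetical illustration, not MicroHapulator's implementation, and the allele labels are invented.

```python
import math


def min_contributors(alleles_by_marker):
    """Apply C_min = ceil(N_al / 2) to a marker -> alleles mapping."""
    counts = {m: len(set(a)) for m, a in alleles_by_marker.items()}
    max_alleles = max(counts.values())
    loci_at_max = sum(1 for n in counts.values() if n == max_alleles)
    return {
        "min_num_contrib": math.ceil(max_alleles / 2),
        "num_loci_max_alleles": loci_at_max,
        "perc_loci_max_alleles": round(loci_at_max / len(counts), 4),
    }


# Toy four-locus mixture with made-up allele labels.
mixture = {
    "locus1": ["a1", "a2"],
    "locus2": ["b1", "b2", "b3", "b4", "b5"],
    "locus3": ["c1", "c2", "c3"],
    "locus4": ["d1"],
}
result = min_contributors(mixture)
print(result)
```

Read this way, the output for **EVD3** would mean that 3 of the 50 markers (3/50 = 0.06) carry the maximum allele count, and a minimum of three contributors implies that this maximum is 5 or 6 alleles.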

The profile supports the presence of at least three DNA contributors in this evidentiary sample.
We must now determine which of the reference samples, if any, is a contributor.
For this we use the `mhpl8r contain` command, which calculates the "containment" of one sample profile in another.
Complete containment (or near-complete containment, allowing for genotyping error) makes it *plausible* that a simple single-contributor profile (the "query") is a contributor to a complex mixture profile (the "subject").
(Unfortunately, it cannot give positive confirmation that the query is a contributor.)
On the other hand, lack of complete or near-complete containment is strong evidence that the query is *not* a contributor to the subject.
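
As a rough illustration of the metric, the Python sketch below computes containment as the fraction of the query's alleles that also appear at the same marker in the subject. This is a hypothetical reading of the description above, not MicroHapulator's own code, and the marker names and allele labels are invented.

```python
def containment(subject_alleles, query_alleles):
    """Return (contained, total, fraction) for the query's alleles."""
    contained = 0
    total = 0
    for marker, alleles in query_alleles.items():
        subject = set(subject_alleles.get(marker, ()))
        for allele in set(alleles):
            total += 1
            if allele in subject:
                contained += 1
    if total == 0:
        raise ValueError("query profile contains no alleles")
    return contained, total, round(contained / total, 4)


# Toy mixture (subject) and single-source profile (query).
subject = {
    "locus1": ["a1", "a2", "a3"],
    "locus2": ["b1", "b2"],
    "locus3": ["c1"],
}
query = {
    "locus1": ["a1", "a3"],
    "locus2": ["b2"],
    "locus3": ["c2"],  # this allele is absent from the mixture
}

contained, total, fraction = containment(subject, query)
print(contained, total, fraction)
```

Note that the measure is asymmetric: swapping the subject and the query generally gives a different value.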

Let us calculate the containment of sample **REF2** in sample **EVD3**.

In [25]:
mhpl8r contain profile-EVD3.json profile-REF2.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "containment": 1.0,
    "contained_alleles": 83,
    "total_alleles": 83
}

This result tells us that 100% of the alleles from **REF2** are present in **EVD3**, and suggests **REF2** is a plausible contributor to **EVD3**.
What can we say about **REF3** and **REF4**?

In [26]:
mhpl8r contain profile-EVD3.json profile-REF3.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "containment": 0.7143,
    "contained_alleles": 60,
    "total_alleles": 84
}

In [27]:
mhpl8r contain profile-EVD3.json profile-REF4.json

[MicroHapulator] running version 0.3+34.gb44b810.dirty
{
    "containment": 0.6977,
    "contained_alleles": 60,
    "total_alleles": 86
}

Only about 70% of the alleles in each of these samples are present in **EVD3**, strongly suggesting that neither is a contributor to the sample.

As a final note, it is important to acknowledge several factors that can influence the value of the containment metric.
Minor contributors to a mixture may not be fully captured by the inferred genotype profile without some refinement of analytical thresholds, and thus may have a containment value < 1.0.
The amount of input DNA and the depth of sequencing coverage also influence the ability to recover minor contributors in a sample profile.
On the other hand, numerous alleles from non-contributors will likely be present in a mixture simply by chance, and as the complexity and diversity of the mixture increases so will the containment for non-contributors.
Probabilistic genotyping methods are the preferred approach for robust interpretation of complex mixtures, although these are not yet available in MicroHapulator.