# MicroHapulator: Interactive Demo

<small>Daniel Standage, 2022-01-11</small>

**MicroHapulator** is an application for empirical haplotype calling, analysis, and basic forensic interpretation of microhaplotypes with NGS data.
The software is typically run by entering commands in a shell terminal window.
This notebook provides an interactive environment intended to introduce the software, interleaving narrative text, shell commands that the reader can execute and re-execute, the output of those commands, and explanatory commentary.
To execute the code in the notebook, select the corresponding cell and click the `[> Run]` button at the top of the page (or as a keyboard shortcut, simultaneously press `[shift]` and `[enter]`).

## Overview

> *This demo assumes the reader is familiar with basic terminology and concepts related to biology, genomes, and NGS.
> A [primer on forensic DNA typing](https://microhapulator.readthedocs.io/en/latest/config.html) is available for the interested reader.*

MicroHapulator calls haplotypes by probing NGS reads aligned to a reference sequence for the corresponding marker.
Consider the following mock example.
The first line shows the reference sequence for the `mh01USC-1pD` marker.
Each subsequent line represents an NGS read aligned to the marker sequence.
The `*` symbols denote the locations of the SNPs present in the marker, and the `.` symbols denote locations where the NGS read matches the reference.

```
                       *                     *           *
AAATAGCTGGGCTAATAATGAACTGAAGCAAAGTCAACTGAAATGTCCTGGGCAGCTCCAGAAACTCCAGAATGGGGAGGA
.......................C.....................C.......
  .....................C.....................C...........A...
       .....A..........C.....................T...........C........
        ...............C.....................T...........C.........
          .............C.....................C...........A...........
            ...........C.....................C...........A.............
                 ......C.....................C...........A.................
                 ......C.....................C...........A.................
                    ...C.....................G...........A....................
                     ..C.....................T...........C.....................
```

We can examine the aligned reads to determine the number of times each allelic combination (haplotype) occurs.
We call this tally a *typing result*.
The typing result for this example is as follows: the `C,C,A` haplotype is observed 5 times, the `C,T,C` haplotype is observed 3 times, and the `C,G,A` haplotype is observed 1 time.
If we then filter out the `C,G,A` haplotype as erroneous (see below), we can infer a diploid `C,C,A / C,T,C` genotype for this marker.
We call this process *genotype prediction*.
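The tally described above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not MicroHapulator's implementation: the read coordinates and alleles are transcribed by hand from the mock alignment (0-based columns), and the names `SNP_OFFSETS`, `READS`, and `tally_haplotypes` are made up for this example.

```python
from collections import Counter

# 0-based columns of the three SNPs marked with `*` in the mock alignment.
SNP_OFFSETS = (23, 45, 57)

# Each read: (first column covered, last column covered, alleles at covered SNPs).
# Values are transcribed by hand from the mock alignment above.
READS = [
    (0, 52, "CC"),    # stops before the third SNP
    (2, 60, "CCA"),
    (7, 65, "CTC"),
    (8, 66, "CTC"),
    (10, 68, "CCA"),
    (12, 70, "CCA"),
    (17, 74, "CCA"),
    (17, 74, "CCA"),
    (20, 77, "CGA"),
    (21, 78, "CTC"),
]


def tally_haplotypes(reads, snp_offsets):
    """Count haplotypes over reads that span every SNP; count the rest as discarded."""
    tally = Counter()
    discarded = 0
    for start, end, alleles in reads:
        if start > min(snp_offsets) or end < max(snp_offsets):
            discarded += 1
            continue
        tally[",".join(alleles)] += 1
    return tally, discarded


tally, discarded = tally_haplotypes(READS, SNP_OFFSETS)
print(dict(tally), "discarded:", discarded)
```

The resulting counts match the typing result stated above, with the first read discarded because it does not reach the third SNP.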

It is helpful to point out a few observations about this example.
- The first aligned read does not span all SNPs at the marker, so it is discarded.
- The third aligned read shows an `A` at the 13th position of the marker reference sequence. Whether this reflects true genetic variation or is a technical artifact resulting from sequencing error, it is ignored by MicroHapulator because it is not one of the three SNPs of interest.
- The `C,G,A` haplotype is only observed once and is likely a false haplotype resulting from sequencing error at the second SNP of interest in the haplotype. When dozens or hundreds (or thousands!) of reads are successfully sequenced and aligned to the marker reference, it is usually simple to distinguish signal (true haplotypes) from noise (false haplotypes resulting from sequencing error). When the depth of sequencing coverage is low, as it is in this example, it can be more difficult to distinguish signal from noise. Determining appropriate per-marker thresholds (detection thresholds and analytical thresholds) for filtering will typically require a non-trivial amount of testing with the laboratory's NGS instrument(s).

While this mock example is helpful in building intuition about haplotype calling, manual visual examination is not feasible for performing this task on dozens of markers and (potentially) millions of NGS reads.
The MicroHapulator software provides tools that automate haplotype calling and genotype prediction, as well as assist with basic interpretation of the forensic typing result.

## Setup

Two mock scenarios are presented in this demo, in which a number of reference and evidentiary samples have been sequenced on an Illumina MiSeq.
In both cases, the sequencing assay targeted a panel of 23 microhaplotype markers.
The identifiers for these markers are shown below.

In [1]:
cat panel.txt

mh01USC-1pD
mh02USC-2pC
mh03USC-3qC
mh04USC-4pA
mh05USC-5pA
mh06USC-6pB
mh07USC-7pB
mh08USC-8pA
mh09USC-9pA
mh0XUSC-XqH
mh10USC-10qC
mh11USC-11pB
mh12USC-12qB
mh13USC-13qA
mh14USC-14qA
mh15USC-15qA
mh16USC-16qB
mh17USC-17pA
mh18USC-18qC
mh19USC-19qB
mh20USC-20qB
mh21USC-21qA
mh22USC-22qB

Following the instructions in the [MicroHapulator configuration manual](https://microhapulator.readthedocs.io/en/latest/config.html), configuration files were prepared previously with marker reference sequences, microhaplotype SNP definitions, and haplotype frequencies for the population of interest.
These files are listed as follows.
(The full contents of these files are available HERE (LINK FIXME).)

In [2]:
ls -1 refr-seqs.fasta marker-defn.tsv frequencies.tsv

frequencies.tsv
marker-defn.tsv
refr-seqs.fasta

Prior to haplotype calling, the NGS reads must be mapped to the reference sequences.
This mapping procedure requires the construction of a search index for the reference sequences.
The indexing task only needs to be performed once for any given reference sequence file.

In [3]:
bwa index refr-seqs.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index refr-seqs.fasta
[main] Real time: 0.009 sec; CPU: 0.010 sec

And then of course, we need to download the NGS reads for our mock scenarios.

In [4]:
echo FIXME

FIXME

## Scenario 1

In this scenario, we have collected two evidentiary samples in the course of a forensic investigation.
These samples have been labeled **EVD1** and **EVD2**.
The case worker suspects that these are both single-source DNA samples.
We also have a reference sample labeled **REF1** collected from a person of interest in the investigation.
Each sample was assayed with our 23-plex NGS panel, and the reads were stored in three pairs of files: `EVD1-reads-R*.fastq.gz`, `EVD2-reads-R*.fastq.gz`, and `REF1-reads-R*.fastq.gz`.

In [5]:
ls -1 EVD1-reads-R*.fastq.gz EVD2-reads-R*.fastq.gz REF1-reads-R*.fastq.gz

EVD1-reads-R1.fastq.gz
EVD1-reads-R2.fastq.gz
EVD2-reads-R1.fastq.gz
EVD2-reads-R2.fastq.gz
REF1-reads-R1.fastq.gz
REF1-reads-R2.fastq.gz

### Preprocessing

The first step in our workflow is to merge overlapping read pairs into a single fragment per pair.
For this we use the FLASH program.
We'll first merge the reads from sample **EVD1** with the following command.

In [6]:
flash EVD1-reads-R1.fastq.gz EVD1-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=EVD1 --allow-outies

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]  
[FLASH] Input files:
[FLASH]     EVD1-reads-R1.fastq.gz
[FLASH]     EVD1-reads-R2.fastq.gz
[FLASH]  
[FLASH] Output files:
[FLASH]     ./EVD1.extendedFrags.fastq
[FLASH]     ./EVD1.notCombined_1.fastq
[FLASH]     ./EVD1.notCombined_2.fastq
[FLASH]     ./EVD1.hist
[FLASH]     ./EVD1.histogram
[FLASH]  
[FLASH] Parameters:
[FLASH]     Min overlap:           100
[FLASH]     Max overlap:           325
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   true
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      8
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]  
[FLASH] Starting reader and writer threads
[FLASH] Starting 8 combiner threads
[FLASH] Processed 2500 read pairs
[FLASH]  
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      2500
[FLASH]     Combined pairs:  


As noted in the FLASH output, the merged reads are now stored, one fragment per read pair, in the file `EVD1.extendedFrags.fastq`.
The next step in our workflow is to map the reads to the target amplicon sequences in `refr-seqs.fasta`.
In this notebook we use the `bwa mem` algorithm, but other aligners such as `bowtie2` would also be appropriate here.
We also use `samtools` to convert the plain text alignments in SAM format to sorted, compressed, and indexed read alignments in BAM format.

In [7]:
bwa mem refr-seqs.fasta EVD1.extendedFrags.fastq | samtools view -b | samtools sort -o EVD1-reads.bam
samtools index EVD1-reads.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2472 sequences (765992 bp)...
[M::mem_process_seqs] Processed 2472 reads in 0.245 CPU sec, 0.245 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem refr-seqs.fasta EVD1.extendedFrags.fastq
[main] Real time: 0.269 sec; CPU: 0.259 sec
[?2004h[?2004l

: 1

We can now repeat the data preprocessing for **EVD2** and **REF1**.

In [8]:
flash EVD2-reads-R1.fastq.gz EVD2-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=EVD2 --allow-outies
bwa mem refr-seqs.fasta EVD2.extendedFrags.fastq | samtools view -b | samtools sort -o EVD2-reads.bam
samtools index EVD2-reads.bam

flash REF1-reads-R1.fastq.gz REF1-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=REF1 --allow-outies
bwa mem refr-seqs.fasta REF1.extendedFrags.fastq | samtools view -b | samtools sort -o REF1-reads.bam
samtools index REF1-reads.bam

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]  
[FLASH] Input files:
[FLASH]     EVD2-reads-R1.fastq.gz
[FLASH]     EVD2-reads-R2.fastq.gz
[FLASH]  
[FLASH] Output files:
[FLASH]     ./EVD2.extendedFrags.fastq
[FLASH]     ./EVD2.notCombined_1.fastq
[FLASH]     ./EVD2.notCombined_2.fastq
[FLASH]     ./EVD2.hist
[FLASH]     ./EVD2.histogram
[FLASH]  
[FLASH] Parameters:
[FLASH]     Min overlap:           100
[FLASH]     Max overlap:           325
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   true
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      8
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]  
[FLASH] Starting reader and writer threads
[FLASH] Starting 8 combiner threads
[FLASH] Processed 2500 read pairs
[FLASH]  
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      2500
[FLASH]     Combined pairs:  


We now have a `.bam` file with aligned reads for each sample.

In [9]:
ls -1 EVD1-reads.bam EVD2-reads.bam REF1-reads.bam

EVD1-reads.bam
EVD2-reads.bam
REF1-reads.bam
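Because the same three commands are run for every sample, they could be generated from the sample label. The sketch below only builds the command strings, using exactly the flags shown above; it executes nothing, and `preprocessing_commands` is a hypothetical helper written for this illustration.

```python
def preprocessing_commands(sample, refr="refr-seqs.fasta"):
    """Build the merge, align/sort, and index commands for one sample (not executed)."""
    merged = f"{sample}.extendedFrags.fastq"
    bam = f"{sample}-reads.bam"
    return [
        f"flash {sample}-reads-R1.fastq.gz {sample}-reads-R2.fastq.gz "
        f"--min-overlap=100 --max-overlap=325 --output-prefix={sample} --allow-outies",
        f"bwa mem {refr} {merged} | samtools view -b | samtools sort -o {bam}",
        f"samtools index {bam}",
    ]


for sample in ("EVD1", "EVD2", "REF1"):
    for command in preprocessing_commands(sample):
        print(command)
```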

### Haplotype Calling and Genotype Prediction

With the reads aligned to their respective reference sequences, we have everything we need to perform haplotype calling and infer a genotype for these samples.
This is done with the `mhpl8r type` command.
In brief, MicroHapulator iterates over each aligned read, determining the allele at each SNP of interest and, in aggregate, the combination of alleles across all SNPs, i.e., the read's haplotype for that marker.
We'll call the complete tally of all observed haplotypes the sample's *typing result*.

Due to sequencing errors, some of the haplotypes observed in a typing result will be technical artifacts.
After computing a typing result, the `mhpl8r type` command can also apply naïve static and/or dynamic filters to distinguish true haplotypes from false and determine the genotype of the sample.
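As a rough illustration of what such filters do, the sketch below applies a fixed read-count threshold and a proportional threshold to a typing result. The exact semantics of MicroHapulator's filters are an assumption here: a static filter is treated as a minimum read count and a dynamic filter as a minimum fraction of the marker's total read count, both inclusive. `apply_thresholds` is a hypothetical helper, not part of the `mhpl8r` API. The tally is the one from the mock example in the Overview, and the threshold values are chosen only for illustration.

```python
def apply_thresholds(typing_result, static=0, dynamic=0.0):
    """Return the haplotypes whose read counts pass both thresholds.

    Assumed semantics (not taken from MicroHapulator's source code):
      static  -- minimum number of supporting reads
      dynamic -- minimum fraction of the marker's total read count
    """
    total = sum(typing_result.values())
    cutoff = max(static, dynamic * total)
    return sorted(hap for hap, count in typing_result.items() if count >= cutoff)


mock_tally = {"C,C,A": 5, "C,T,C": 3, "C,G,A": 1}
genotype = apply_thresholds(mock_tally, static=2, dynamic=0.1)
print(genotype)
```

With these illustrative values the singleton `C,G,A` is filtered out and the two remaining haplotypes form the genotype. Raising the static threshold to 5 would also drop `C,T,C`, which is one way to see why threshold choice is delicate at low coverage.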

In addition to the BAM file containing read alignments, we also need to specify the configuration file containing marker definitions for the 23-plex panel.
MicroHapulator will compute both the typing result and genotype call, storing them in a file named `EVD1-result.json`.

In [10]:
mhpl8r type marker-defn.tsv EVD1-reads.bam --dynamic 0.1 --static 5 --out EVD1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
[MicroHapulator::type] discarded 12 reads with gaps or missing data at positions of interest

We can peek at the first few lines of this file to get an idea of its contents.
The data is stored in JavaScript Object Notation (JSON), and includes for each marker a handful of coverage statistics, the typing result (haplotype tallies), and a genotype call.
In this case, the first two markers listed (`mh01USC-1pD` and `mh02USC-2pC`) are both called as heterozygous.
Only the first two markers are shown here, but the rest of the file contains haplotype tallies and genotype calls for the remaining 21 markers.

In [11]:
cat EVD1-result.json | head -n 38

{
    "markers": {
        "mh01USC-1pD": {
            "genotype": [
                {
                    "haplotype": "C,C,A"
                },
                {
                    "haplotype": "C,C,C"
                }
            ],
            "max_coverage": 109,
            "mean_coverage": 103.8,
            "min_coverage": 4,
            "num_discarded_reads": 0,
            "typing_result": {
                "C,C,A": 55,
                "C,C,C": 54
            }
        },
        "mh02USC-2pC": {
            "genotype": [
                {
                    "haplotype": "A,C,G,T"
                },
                {
                    "haplotype": "G,C,G,T"
                }
            ],
            "max_coverage": 106,
            "mean_coverage": 101.2,
            "min_coverage": 8,
            "num_discarded_reads": 0,
            "typing_result": {
                "A,C,G,T": 53,
                "G,C,G,T": 53
            }
        },
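For downstream scripting, a result with this structure can be read with Python's standard `json` module. The sketch below works on an inline excerpt reduced to one marker from the output above; `summarize` is a hypothetical helper, and the zygosity labels assume that a single-source genotype call lists one haplotype when homozygous and two when heterozygous.

```python
import json

# Excerpt of the structure shown above, reduced to a single marker.
excerpt = """
{
    "markers": {
        "mh01USC-1pD": {
            "genotype": [
                {"haplotype": "C,C,A"},
                {"haplotype": "C,C,C"}
            ],
            "typing_result": {"C,C,A": 55, "C,C,C": 54}
        }
    }
}
"""


def summarize(result):
    """Map each marker to (zygosity label, sorted haplotypes in its genotype call)."""
    labels = {1: "homozygous", 2: "heterozygous"}
    summary = {}
    for marker, data in result["markers"].items():
        haplotypes = sorted(entry["haplotype"] for entry in data["genotype"])
        summary[marker] = (labels.get(len(haplotypes), "other"), haplotypes)
    return summary


summary = summarize(json.loads(excerpt))
print(summary)
```

To apply the same function to a full result file, the excerpt can be replaced by the parsed contents of `EVD1-result.json`.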

As a sanity check, it's always critical to examine the *interlocus balance* of a sample. Are there any markers with a disproportionately high or low number of aligned reads? We can assess this with the `mhpl8r balance` command.

In [12]:
mhpl8r balance EVD1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty

[0mmh01USC-1pD : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh03USC-3qC : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh05USC-5pA : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh07USC-7pB : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh09USC-9pA : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh0XUSC-XqH : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh12USC-12qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh13USC-13qA: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh14USC-14qA: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 108.00
mh20USC-20qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 107.00
mh19USC-19qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 107.00
mh18USC-18qC: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 107.00
mh16USC-16qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇

: 1

Because this sample data is simulated, the interlocus balance is artificially high in this case. With real data coming off an NGS sequencer, you expect to see some variation in the number of reads aligned to each marker. Some of this variation could be due to amplification dynamics (e.g. if using multiplex PCR to amplify the target loci), and some is simply due to stochastic factors in sequencing. Interlocus balance shouldn't typically be a problem unless it is extreme, but it will have implications for accurate genotype prediction at the markers with the lowest coverage, especially for low input samples.
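One simple way to put a number on interlocus balance is to total the reads tallied at each marker and compare the extremes. The sketch below does this for the two markers excerpted from `EVD1-result.json` earlier. Whether `mhpl8r balance` derives its counts in exactly this way is an assumption, and both `reads_per_marker` and the min/max ratio are illustrative choices rather than MicroHapulator features.

```python
def reads_per_marker(result):
    """Total number of reads tallied in the typing result of each marker."""
    return {
        marker: sum(data["typing_result"].values())
        for marker, data in result["markers"].items()
    }


# Typing results for the two markers shown in the EVD1 excerpt above.
result = {
    "markers": {
        "mh01USC-1pD": {"typing_result": {"C,C,A": 55, "C,C,C": 54}},
        "mh02USC-2pC": {"typing_result": {"A,C,G,T": 53, "G,C,G,T": 53}},
    }
}

counts = reads_per_marker(result)
ratio = min(counts.values()) / max(counts.values())
print(counts, f"min/max ratio: {ratio:.3f}")
```

A ratio near 1 indicates even coverage across markers; values far below 1 would flag markers that deserve a closer look.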

Now we repeat these steps (haplotype calling and the balance check) for **EVD2** and **REF1**.

In [13]:
mhpl8r type marker-defn.tsv EVD2-reads.bam --dynamic 0.1 --static 5 --out EVD2-result.json
mhpl8r balance EVD2-result.json
mhpl8r type marker-defn.tsv REF1-reads.bam --dynamic 0.1 --static 5 --out REF1-result.json
mhpl8r balance REF1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
[MicroHapulator::type] discarded 12 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1+47.g2b51220.dirty

[0mmh01USC-1pD : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh0XUSC-XqH : [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh20USC-20qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh19USC-19qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh18USC-18qC: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh16USC-16qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh15USC-15qA: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh13USC-13qA: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh12USC-12qB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh11USC-11pB: [0m▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 109.00
mh09USC-9pA : [0m▇▇▇▇▇▇▇▇▇▇

: 1

With a typing result for each sample and no concerns about interlocus balance, we can now move on to forensic interpretation.

### Interpretation

Forensic interpretation refers to the process of determining the conclusions, if any, that can be responsibly drawn by comparing two or more DNA profiles. How confident can one be that two single-source profiles came from the same individual? How confident can one be that one single-source profile is a contributor to a second mixture profile? A variety of approaches exist for addressing these types of questions, some very simple and some very complex. MicroHapulator implements a handful of tools for simple forensic interpretation tasks. It also supports the export of typing results to a format that can be used by state-of-the-art probabilistic genotyping (probgen) programs. Probgen is beyond the scope of this tutorial, but we will demonstrate the basic interpretation capabilities provided by MicroHapulator.

One question often encountered during a study or investigation is the number of DNA contributors in an evidentiary sample (or sometimes even a sample collected from a person of interest). A single-source profile will have at most two distinct alleles at any given marker, one from each parental haplotype. However, it's likewise possible that a profile with two contributors will *also* have no more than two distinct alleles, even if the contributing genotypes are different (e.g. contributor A may be homozygous for one allele and contributor B homozygous for another allele). But chances are that two or more DNA contributors will result in three or more alleles for at least a *small* number of markers. We can use this to estimate the minimum number of contributors for a profile.

The `mhpl8r contrib` command implements a procedure to scan a typing result to determine the maximum number of alleles $N_{\text{al}}$ present at any single locus. From this, it can calculate the minimum number of DNA contributors $C_{\text{min}}$ as follows.

$$
C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil
$$
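The formula is easy to check by hand. In the sketch below, `min_contributors` is a hypothetical helper that mirrors the equation, and the per-marker allele counts are invented for illustration rather than taken from any sample in this demo.

```python
import math


def min_contributors(alleles_per_marker):
    """Minimum number of contributors implied by the marker with the most alleles."""
    n_al = max(alleles_per_marker.values())
    return math.ceil(n_al / 2)


# Hypothetical allele counts per marker.
single_source = {"mh01USC-1pD": 2, "mh02USC-2pC": 2, "mh03USC-3qC": 1}
mixture = {"mh01USC-1pD": 2, "mh02USC-2pC": 5, "mh03USC-3qC": 3}

print(min_contributors(single_source))
print(min_contributors(mixture))
```

A maximum of two alleles is compatible with a single contributor, while five alleles at any one marker require at least three.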

We begin by applying this to the three profiles in our mock scenario.

In [14]:
mhpl8r contrib EVD1-result.json
mhpl8r contrib EVD2-result.json
mhpl8r contrib REF1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "min_num_contrib": 1,
    "num_loci_max_alleles": 23,
    "perc_loci_max_alleles": 1.0
}
[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "min_num_contrib": 1,
    "num_loci_max_alleles": 23,
    "perc_loci_max_alleles": 1.0
}
[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "min_num_contrib": 1,
    "num_loci_max_alleles": 23,
    "perc_loci_max_alleles": 1.0
}

In all three profiles, none of the 23 markers has evidence for more than a single contributor. It would thus be reasonable to pursue additional interpretation under the assumption that these are all single-source samples.

Depending on the details of the investigation, it might be necessary to investigate whether the two evidentiary samples came from the same individual, or whether one of the evidentiary samples matches a reference sample collected from a person of interest. The most basic approach to addressing this question is to examine a pair of profiles and determine, marker by marker, whether there are any differences between the genotypes. Given two MicroHapulator typing results, the `mhpl8r diff` command will print any alleles that are present in one profile but not the other.
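Conceptually, this comparison is a per-marker set difference. The sketch below is an illustration of that idea, not the `mhpl8r diff` implementation: `diff_genotypes` is a hypothetical helper, the first profile uses the two **EVD1** genotype calls shown earlier, and the second profile is invented.

```python
def diff_genotypes(profile_a, profile_b):
    """Yield (marker, only_in_a, only_in_b) for markers whose haplotype sets differ."""
    for marker in sorted(set(profile_a) | set(profile_b)):
        a = set(profile_a.get(marker, ()))
        b = set(profile_b.get(marker, ()))
        if a != b:
            yield marker, sorted(a - b), sorted(b - a)


profile_a = {
    "mh01USC-1pD": ["C,C,A", "C,C,C"],
    "mh02USC-2pC": ["A,C,G,T", "G,C,G,T"],
}
# Hypothetical second profile: homozygous at the first marker.
profile_b = {
    "mh01USC-1pD": ["C,C,A"],
    "mh02USC-2pC": ["A,C,G,T", "G,C,G,T"],
}

differences = list(diff_genotypes(profile_a, profile_b))
print(differences)
```

Markers with identical genotype calls produce no output, so two matching profiles yield an empty list.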

In [15]:
mhpl8r diff EVD1-result.json REF1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty

In [16]:
mhpl8r diff EVD1-result.json EVD2-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
mh01USC-1pD
>>> C,C,C
mh02USC-2pC
>>> A,C,G,T
<<< G,C,T,T
mh03USC-3qC
>>> A,C,C,A,G
<<< G,T,C,G,G
mh04USC-4pA
<<< G,T,A,A
mh06USC-6pB
>>> T,A,A
<<< C,G,A
mh07USC-7pB
>>> A,G,G,C
<<< T,A,A,C
mh08USC-8pA
>>> A,G,T,A
<<< A,G,T,G
mh09USC-9pA
>>> G,T,C,A,C
>>> G,T,C,G,C
<<< A,T,C,A,A
<<< A,T,T,G,C
mh0XUSC-XqH
>>> T,C,T
<<< G,T,T
<<< T,C,C
mh10USC-10qC
<<< G,C,A
mh11USC-11pB
>>> C,A,T,G
<<< C,G,C,G
mh12USC-12qB
>>> A,A,A,T
<<< G,A,A,C
mh13USC-13qA
>>> A,T,A,A
<<< A,C,A,G
mh14USC-14qA
>>> G,C,G
<<< T,C,G
mh15USC-15qA
>>> A,C,A,A
>>> A,T,A,G
<<< A,T,A,A
<<< G,T,A,A
mh16USC-16qB
>>> G,A,T
<<< A,A,T
mh17USC-17pA
>>> T,C,A
<<< T,C,C
mh18USC-18qC
>>> G,A,A,G
mh19USC-19qB
>>> A,G,A
>>> G,G,G
<<< G,C,A
<<< G,G,A
mh20USC-20qB
<<< T,C,G,C
mh21USC-21qA
>>> G,T,G,G
mh22USC-22qB
>>> G,G,C,G,T

The first result shows that there are no differences between the **EVD1** profile and the **REF1** profile, so there is already strong evidence that these were derived from the same individual.

The second result shows numerous differences between **EVD1** and **EVD2**. It's not uncommon to see minor discrepancies between different samples originating from the same individual—variability in laboratory processing or sample degradation could explain some of these differences. But we would not typically expect to see discordant alleles at almost every marker. The much more likely explanation in this case is that **EVD1** and **EVD2** come from different individuals.

But beyond a basic check for their presence or absence, an exhaustive listing of discordant alleles isn't very helpful (except perhaps for troubleshooting purposes).
What we *really* want is a quantitative measure of our confidence in a match between two profiles.
We derive this measure by assessing the likelihood of two competing "propositions" or explanations for the data.
Depending on the details of the investigation, we might formulate the likelihood ratio (LR) test as follows.

- $H_p$: **REF1** and **EVD1** originated from the same individual
- $H_d$: **REF1** and **EVD1** originated from two unrelated individuals in the population

We then compute the likelihood ratio $LR = \frac{P(H_p)}{P(H_d)}$.
Large LR values are strong evidence in favor of $H_p$, small LR values support $H_d$, and LR values close to 1.0 are inconclusive.

The probability $P(H_p) = \epsilon^R$, where $\epsilon$ is a per-marker rate of genotyping error (default: 0.001) and $R$ is the number of markers with discordant alleles between samples.
The probability $P(H_d)$ is the random match probability (RMP) of the profile, which is essentially the product of the observed haplotype frequencies in the population.
Note that in cases of a perfect match, $P(H_p) = 1$ and the LR is then simply the reciprocal of the RMP.
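The arithmetic behind this test can be sketched directly from the definitions above. `likelihood_ratio` is a hypothetical helper, not the `mhpl8r prob` implementation, and the RMP value used here is an arbitrary round number chosen for illustration.

```python
def likelihood_ratio(rmp, num_discordant=0, error_rate=0.001):
    """LR = P(H_p) / P(H_d), with P(H_p) = error_rate ** R and P(H_d) = RMP."""
    return error_rate ** num_discordant / rmp


illustrative_rmp = 1e-20  # arbitrary value, not taken from any sample in this demo

lr_match = likelihood_ratio(illustrative_rmp)                       # R = 0
lr_mismatch = likelihood_ratio(illustrative_rmp, num_discordant=2)  # R = 2
print(f"{lr_match:.3e} {lr_mismatch:.3e}")
```

With no discordant markers the LR is the reciprocal of the RMP, and each additional discordant marker lowers the LR by a factor of $\epsilon$.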

We can use `mhpl8r prob` both to compute the RMP and to perform the LR test as formulated above.

In [17]:
mhpl8r prob frequencies.tsv EVD1-result.json
mhpl8r prob frequencies.tsv EVD1-result.json REF1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "random_match_probability": "1.394E-23"
}
[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "likelihood_ratio": "7.176E+22"
}

The result is a very large LR of $7.18 \times 10^{22}$, lending strong support to $H_p$ over $H_d$, consistent with our earlier observations.

Now let's consider what would happen if, disregarding our earlier observations, we had formulated the LR test as follows.

- $H_p$: **REF1** and **EVD2** originated from the same individual
- $H_d$: **REF1** and **EVD2** originated from two unrelated individuals in the population

In [18]:
mhpl8r prob frequencies.tsv EVD2-result.json
mhpl8r prob frequencies.tsv EVD2-result.json REF1-result.json

[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "random_match_probability": "1.066E-23"
}
[MicroHapulator] running version 0.4.1+47.g2b51220.dirty
{
    "likelihood_ratio": "9.384E-59"
}

Here we see a very different result.
The RMP for **EVD2** is of similar magnitude to that of **EVD1**, but the LR test statistic is very small.
A tremendous amount of error would be required if **EVD2** and **REF1** were from the same individual—much more likely is that these samples originated from two unrelated individuals.

## Scenario 2

In this scenario, we have collected an evidentiary sample (**EVD3**) in the course of a forensic investigation, and there is some suspicion that this sample has multiple DNA contributors.
We have also collected reference samples from three persons of interest in the investigation, labeled **REF2**, **REF3**, and **REF4**.
All four samples have been assayed with a 50-marker microhaplotype NGS panel (reference sequences in `beta-panel.fasta`).
Reads are available in the following files.

In [18]:
ls -1 reads-EVD3.fastq.gz reads-REF2.fastq.gz reads-REF3.fastq.gz reads-REF4.fastq.gz

reads-EVD3.fastq.gz
reads-REF2.fastq.gz
reads-REF3.fastq.gz
reads-REF4.fastq.gz


As before, we will use `bwa mem` and `samtools` to align, sort, and index the reads for each sample.

In [19]:
bwa mem beta-panel.fasta reads-EVD3.fastq.gz | samtools view -bS - | samtools sort -o reads-EVD3.bam -
samtools index reads-EVD3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.250 CPU sec, 5.263 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.613 CPU sec, 2.593 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-EVD3.fastq.gz
[main] Real time: 8.409 sec; CPU: 8.020 sec


In [20]:
bwa mem beta-panel.fasta reads-REF2.fastq.gz | samtools view -bS - | samtools sort -o reads-REF2.bam -
samtools index reads-REF2.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.867 CPU sec, 4.818 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.628 CPU sec, 2.589 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF2.fastq.gz
[main] Real time: 8.040 sec; CPU: 7.652 sec


In [21]:
bwa mem beta-panel.fasta reads-REF3.fastq.gz | samtools view -bS - | samtools sort -o reads-REF3.bam -
samtools index reads-REF3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.123 CPU sec, 5.075 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.730 CPU sec, 2.691 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF3.fastq.gz
[main] Real time: 8.406 sec; CPU: 8.015 sec


In [22]:
bwa mem beta-panel.fasta reads-REF4.fastq.gz | samtools view -bS - | samtools sort -o reads-REF4.bam -
samtools index reads-REF4.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.039 CPU sec, 5.005 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.701 CPU sec, 2.647 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF4.fastq.gz
[main] Real time: 8.291 sec; CPU: 7.900 sec


Next we use `mhpl8r type` to infer genotype profiles for each sample.

In [23]:
mhpl8r type --out profile-EVD3.json beta-panel.fasta reads-EVD3.bam
mhpl8r type --out profile-REF2.json beta-panel.fasta reads-REF2.bam
mhpl8r type --out profile-REF3.json beta-panel.fasta reads-REF3.bam
mhpl8r type --out profile-REF4.json beta-panel.fasta reads-REF4.bam

[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 3674 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7483 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7280 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7456 reads with gaps or missing data at positions of interest


We must now evaluate the evidentiary sample and see if we can confirm the presence of multiple DNA contributors.
The `mhpl8r contrib` command implements a simple check for determining the minimum number of contributors by scanning the sample profile to determine the maximum number of alleles $N_{\text{al}}$ present at any single locus.
From this, it can calculate the minimum number of sample contributors $C_{\text{min}}$ as follows.

$$
C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil
$$

In [24]:
mhpl8r contrib -j profile-EVD3.json

[MicroHapulator] running version 0.4.1
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 3,
    "perc_loci_max_alleles": 0.06
}

The profile supports the presence of at least three DNA contributors in this evidentiary sample.
We must now determine which of the reference samples, if any, is a contributor.
For this we use the `mhpl8r contain` command, which calculates the "containment" of one sample profile in another.
Complete containment (or near-complete containment, allowing for genotyping error) suggests the *plausibility* that a simple single-contributor profile (the "query") is a contributor to a complex mixture profile (the "subject").
(Unfortunately, it cannot give positive confirmation that the query is a contributor.)
On the other hand, lack of complete or near-complete containment is strong evidence that the query is *not* a contributor to the subject.
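The metric itself is a simple proportion: the number of query alleles also found in the subject, divided by the total number of query alleles. The sketch below illustrates that definition with invented marker names and genotypes; `containment` is a hypothetical helper rather than the `mhpl8r contain` implementation.

```python
def containment(subject, query):
    """Return (contained, total, fraction) of query alleles present in the subject."""
    contained = 0
    total = 0
    for marker, haplotypes in query.items():
        subject_haplotypes = set(subject.get(marker, ()))
        total += len(haplotypes)
        contained += sum(1 for hap in haplotypes if hap in subject_haplotypes)
    fraction = contained / total if total else 0.0
    return contained, total, fraction


# Hypothetical mixture (subject) and single-source (query) profiles.
subject = {"markerA": ["A,C", "A,T", "G,T"], "markerB": ["C,C", "T,C"]}
query = {"markerA": ["A,C", "G,T"], "markerB": ["C,C", "C,T"]}

contained, total, fraction = containment(subject, query)
print(contained, total, fraction)
```

Three of the four query alleles occur in the subject here, giving a containment of 0.75.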

Let us calculate the containment of sample **REF2** in sample **EVD3**.

In [25]:
mhpl8r contain profile-EVD3.json profile-REF2.json

[MicroHapulator] running version 0.4.1
{
    "containment": 1.0,
    "contained_alleles": 83,
    "total_alleles": 83
}

This result tells us that 100% of the alleles from **REF2** are present in **EVD3**, and suggests **REF2** is a plausible contributor to **EVD3**.
What can we say about **REF3** and **REF4**?

In [26]:
mhpl8r contain profile-EVD3.json profile-REF3.json

[MicroHapulator] running version 0.4.1
{
    "containment": 0.7143,
    "contained_alleles": 60,
    "total_alleles": 84
}

In [27]:
mhpl8r contain profile-EVD3.json profile-REF4.json

[MicroHapulator] running version 0.4.1
{
    "containment": 0.6977,
    "contained_alleles": 60,
    "total_alleles": 86
}

Only about 70% of the alleles in both of these samples are present in **EVD3**, strongly suggesting that they are not contributors to the sample.

As a final note, it is important to acknowledge several factors that can influence the value of the containment metric.
Minor contributors to a mixture may not be fully captured by the inferred genotype profile without some refinement of analytical thresholds, and thus may have a containment value < 1.0.
The amount of input DNA and the depth of sequencing coverage also influence the ability to recover minor contributors in a sample profile.
On the other hand, numerous alleles from non-contributors will likely be present in a mixture simply by chance, and as the complexity and diversity of the mixture increases so will the containment for non-contributors.
Probabilistic genotyping methods are the preferred approach for robust interpretation of complex mixtures, although these are not yet available in MicroHapulator.