# MicroHapulator: Interactive Demo

**MicroHapulator** is an application for empirical haplotype calling, analysis, and basic forensic interpretation of microhaplotypes from NGS data.
The software is typically run by entering commands in a shell terminal window.
This notebook provides an interactive introduction to the software, containing not only narrative text but also shell commands that the reader can execute and re-execute, as well as the output of those commands.
To execute the code in the notebook, select the corresponding cell and click the `[> Run]` button at the top of the page (or as a keyboard shortcut, simultaneously press `[shift]` and `[enter]`).

## Overview

> *This demo assumes the reader is familiar with basic terminology and concepts related to biology, genomes, and NGS.
> A [primer on forensic DNA typing](https://microhapulator.readthedocs.io/en/latest/config.html) is available for the interested reader.*

MicroHapulator calls haplotypes by examining NGS reads aligned to a reference sequence for the corresponding marker.
Consider the following mock example.
The first line shows the reference sequence for the `mh01USC-1pD` marker.
Each subsequent line represents an NGS read aligned to the marker sequence.
The `*` symbols denote the locations of the SNPs present in the marker, and the `.` symbols denote locations where the NGS read matches the reference.

```
                       *                     *           *
AAATAGCTGGGCTAATAATGAACTGAAGCAAAGTCAACTGAAATGTCCTGGGCAGCTCCAGAAACTCCAGAATGGGGAGGA
.......................C.....................C.......
  .....................C.....................C...........A...
       .....A..........C.....................T...........C........
        ...............C.....................T...........C.........
          .............C.....................C...........A...........
            ...........C.....................C...........A.............
                 ......C.....................C...........A.................
                 ......C.....................C...........A.................
                    ...C.....................G...........A....................
                     ..C.....................T...........C.....................
```

We can examine the aligned reads to determine the number of times each allelic combination (haplotype) occurs.
We call this tally a *typing result*.
The typing result for this example is as follows: the `C,C,A` haplotype is observed 5 times, the `C,T,C` haplotype is observed 3 times, and the `C,G,A` haplotype is observed 1 time.
If we then filter out the `C,G,A` haplotype as erroneous (see below), we can infer a diploid `C,C,A / C,T,C` genotype for this marker.
We call this process *genotype prediction*.

It is helpful to point out a few observations about this example.
- The first aligned read does not span all SNPs at the marker, so it is discarded.
- The third aligned read shows an `A` at the 13th position of the marker reference sequence. Whether this reflects true genetic variation or is a technical artifact resulting from sequencing error, it is ignored by MicroHapulator because it is not one of the three SNPs of interest.
- The `C,G,A` haplotype is only observed once and is likely a false haplotype resulting from sequencing error at the second SNP of interest. When dozens or hundreds (or thousands!) of reads are successfully sequenced and aligned to the marker reference, it is simple to distinguish signal (true haplotypes) from noise (false haplotypes resulting from sequencing error). When the depth of sequencing coverage is low, as it is in this example, it can be more difficult to distinguish signal from noise. Determining appropriate per-marker thresholds (detection thresholds and analytical thresholds) for filtering will typically require a non-trivial amount of testing with the laboratory's NGS instrument(s).
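
The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration applied to the mock alignment, not MicroHapulator's implementation: the variable names are invented for this example, and the minimum read count of 2 used to filter the tally is an assumed threshold chosen so that the singleton haplotype is dropped.

```python
from collections import Counter

# The mock alignment, copied verbatim from the example above.
ALIGNMENT = """\
                       *                     *           *
AAATAGCTGGGCTAATAATGAACTGAAGCAAAGTCAACTGAAATGTCCTGGGCAGCTCCAGAAACTCCAGAATGGGGAGGA
.......................C.....................C.......
  .....................C.....................C...........A...
       .....A..........C.....................T...........C........
        ...............C.....................T...........C.........
          .............C.....................C...........A...........
            ...........C.....................C...........A.............
                 ......C.....................C...........A.................
                 ......C.....................C...........A.................
                    ...C.....................G...........A....................
                     ..C.....................T...........C.....................
"""

lines = ALIGNMENT.splitlines()
snp_positions = [i for i, char in enumerate(lines[0]) if char == "*"]
reference = lines[1]
reads = [line.rstrip() for line in lines[2:]]

tally = Counter()
num_discarded = 0
for read in reads:
    start = len(read) - len(read.lstrip(" "))  # offset of the first aligned base
    if start > snp_positions[0] or len(read) <= snp_positions[-1]:
        num_discarded += 1  # the read does not span every SNP of interest
        continue
    # A "." means the read matches the reference at that position.
    alleles = [reference[p] if read[p] == "." else read[p] for p in snp_positions]
    tally[",".join(alleles)] += 1

MIN_READS = 2  # assumed threshold, for illustration only
genotype = sorted(hap for hap, count in tally.items() if count >= MIN_READS)

print(dict(tally))    # {'C,C,A': 5, 'C,T,C': 3, 'C,G,A': 1}
print(num_discarded)  # 1
print(genotype)       # ['C,C,A', 'C,T,C']
```

Note that the mismatch at the 13th position never enters the tally, since only the positions marked with `*` are examined.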

While this mock example is helpful in building intuition about haplotype calling, manual visual examination is not feasible for performing this task on dozens of markers and (potentially) millions of NGS reads.
The MicroHapulator software provides tools that automate haplotype calling and genotype prediction, as well as assist with basic interpretation of the forensic typing result.

## Setup

Two mock scenarios are presented in this demo, in which a number of reference and evidentiary samples have been sequenced on an Illumina MiSeq.
In both cases, the sequencing assay targeted a panel of 23 microhaplotype markers.
The identifiers for these markers are shown below.

In [1]:
cat panel.txt

mh01USC-1pD
mh02USC-2pC
mh03USC-3qC
mh04USC-4pA
mh05USC-5pA
mh06USC-6pB
mh07USC-7pB
mh08USC-8pA
mh09USC-9pA
mh0XUSC-XqH
mh10USC-10qC
mh11USC-11pB
mh12USC-12qB
mh13USC-13qA
mh14USC-14qA
mh15USC-15qA
mh16USC-16qB
mh17USC-17pA
mh18USC-18qC
mh19USC-19qB
mh20USC-20qB
mh21USC-21qA
mh22USC-22qB

Following the instructions in the [MicroHapulator configuration manual](https://microhapulator.readthedocs.io/en/latest/config.html), configuration files were prepared previously with marker reference sequences, microhaplotype SNP definitions, and haplotype frequencies for the population of interest.
These files are listed below.

In [2]:
ls -1 refr-seqs.fasta marker-defn.tsv frequencies.tsv

frequencies.tsv
marker-defn.tsv
refr-seqs.fasta

Prior to haplotype calling, the NGS reads must be mapped to the reference sequences.
This mapping procedure requires the construction of a search index for the reference sequences.
The indexing task only needs to be performed once for any given reference sequence file.

In [3]:
bwa index refr-seqs.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index refr-seqs.fasta
[main] Real time: 0.017 sec; CPU: 0.015 sec

And then of course, we need to download the NGS reads for our mock scenarios.

In [4]:
echo FIXME

FIXME

## Scenario 1

In this scenario, we have collected two evidentiary samples in the course of a forensic investigation.
These samples have been labeled **EVD1** and **EVD2**.
We have been assured that these are both single-source DNA samples.
We also have a reference sample labeled **REF1** collected from a person of interest in the investigation.
Each sample was assayed with our 23-plex NGS panel, and the reads were stored in three pairs of files: `EVD1-reads-R*.fastq.gz`, `EVD2-reads-R*.fastq.gz`, and `REF1-reads-R*.fastq.gz`.

In [5]:
ls -1 EVD1-reads-R*.fastq.gz EVD2-reads-R*.fastq.gz REF1-reads-R*.fastq.gz

EVD1-reads-R1.fastq.gz
EVD1-reads-R2.fastq.gz
EVD2-reads-R1.fastq.gz
EVD2-reads-R2.fastq.gz
REF1-reads-R1.fastq.gz
REF1-reads-R2.fastq.gz

### Preprocessing

The first step in our workflow is to merge overlapping read pairs into a single fragment per pair.
For this we use the FLASH program.
We'll first merge the reads from sample **EVD1** with the following command.

In [6]:
flash EVD1-reads-R1.fastq.gz EVD1-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=EVD1 --allow-outies

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]  
[FLASH] Input files:
[FLASH]     EVD1-reads-R1.fastq.gz
[FLASH]     EVD1-reads-R2.fastq.gz
[FLASH]  
[FLASH] Output files:
[FLASH]     ./EVD1.extendedFrags.fastq
[FLASH]     ./EVD1.notCombined_1.fastq
[FLASH]     ./EVD1.notCombined_2.fastq
[FLASH]     ./EVD1.hist
[FLASH]     ./EVD1.histogram
[FLASH]  
[FLASH] Parameters:
[FLASH]     Min overlap:           100
[FLASH]     Max overlap:           325
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   true
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      8
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]  
[FLASH] Starting reader and writer threads
[FLASH] Starting 8 combiner threads
[FLASH] Processed 2500 read pairs
[FLASH]  
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      2500
[FLASH]     Combined pairs:  

As noted in the FLASH output, the merged reads are now stored in the file `EVD1.extendedFrags.fastq`, with one fragment per read pair.
The next step in our workflow is to map the reads to the target amplicon sequences in `refr-seqs.fasta`.
In this notebook we use the `bwa mem` algorithm, but other aligners such as `bowtie2` would also be appropriate here.
We also use `samtools` to convert the plain text alignments in SAM format to sorted, compressed, and indexed read alignments in BAM format.

In [7]:
bwa mem refr-seqs.fasta EVD1.extendedFrags.fastq | samtools view -b | samtools sort -o EVD1-reads.bam
samtools index EVD1-reads.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2472 sequences (765992 bp)...
[M::mem_process_seqs] Processed 2472 reads in 0.212 CPU sec, 0.212 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem refr-seqs.fasta EVD1.extendedFrags.fastq
[main] Real time: 0.239 sec; CPU: 0.224 sec

We can now repeat the data preprocessing for **EVD2** and **REF1**.
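
Since the same three commands are repeated for every sample, they can also be generated from the sample name. The following is a minimal sketch; the helper `preprocessing_commands` is hypothetical and simply fills the sample name into the commands already used for **EVD1**, assuming all read files follow the `<sample>-reads-R1.fastq.gz` / `<sample>-reads-R2.fastq.gz` naming pattern.

```python
def preprocessing_commands(sample, refr="refr-seqs.fasta"):
    """Return the shell commands that merge, align, sort, and index one sample."""
    return [
        f"flash {sample}-reads-R1.fastq.gz {sample}-reads-R2.fastq.gz "
        f"--min-overlap=100 --max-overlap=325 --output-prefix={sample} --allow-outies",
        f"bwa mem {refr} {sample}.extendedFrags.fastq | samtools view -b "
        f"| samtools sort -o {sample}-reads.bam",
        f"samtools index {sample}-reads.bam",
    ]


# Print the commands for the two remaining samples.
for sample in ("EVD2", "REF1"):
    for command in preprocessing_commands(sample):
        print(command)
```

The printed commands can be pasted into a shell cell as-is.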

In [8]:
flash EVD2-reads-R1.fastq.gz EVD2-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=EVD2 --allow-outies
bwa mem refr-seqs.fasta EVD2.extendedFrags.fastq | samtools view -b | samtools sort -o EVD2-reads.bam
samtools index EVD2-reads.bam

flash REF1-reads-R1.fastq.gz REF1-reads-R2.fastq.gz --min-overlap=100 --max-overlap=325 --output-prefix=REF1 --allow-outies
bwa mem refr-seqs.fasta REF1.extendedFrags.fastq | samtools view -b | samtools sort -o REF1-reads.bam
samtools index REF1-reads.bam

[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]  
[FLASH] Input files:
[FLASH]     EVD2-reads-R1.fastq.gz
[FLASH]     EVD2-reads-R2.fastq.gz
[FLASH]  
[FLASH] Output files:
[FLASH]     ./EVD2.extendedFrags.fastq
[FLASH]     ./EVD2.notCombined_1.fastq
[FLASH]     ./EVD2.notCombined_2.fastq
[FLASH]     ./EVD2.hist
[FLASH]     ./EVD2.histogram
[FLASH]  
[FLASH] Parameters:
[FLASH]     Min overlap:           100
[FLASH]     Max overlap:           325
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   true
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      8
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]  
[FLASH] Starting reader and writer threads
[FLASH] Starting 8 combiner threads
[FLASH] Processed 2500 read pairs
[FLASH]  
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      2500
[FLASH]     Combined pairs:  

We now have a `.bam` file with aligned reads for each sample.

In [9]:
ls -1 EVD1-reads.bam EVD2-reads.bam REF1-reads.bam

EVD1-reads.bam
EVD2-reads.bam
REF1-reads.bam

### Haplotype Calling

With the reads aligned to their respective reference sequences, we have everything we need to perform haplotype calling and infer a genotype for these samples.
This is done with the `mhpl8r type` command.
In brief, MicroHapulator iterates over each aligned read, determining the allele at each SNP of interest and, from these, the combination of alleles across all SNPs in the marker, i.e., the read's haplotype for that marker.
We'll call the complete tally of all observed haplotypes the sample's *typing result*.

Due to sequencing errors, some of the haplotypes observed in a typing result will be technical artifacts.
After computing a typing result, the `mhpl8r type` command can also apply naïve static and/or dynamic filters to distinguish true haplotypes from false and determine the genotype of the sample.
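
To build intuition for how such filters behave, here is a minimal sketch. It assumes that the static filter is a fixed minimum read count and that the dynamic filter is a minimum fraction of the marker's total read count; whether MicroHapulator applies the thresholds in exactly this way (for example, inclusive versus exclusive comparisons) is not confirmed here. The function name and the read counts are invented for illustration.

```python
def apply_filters(typing_result, static=5, dynamic=0.1):
    """Keep haplotypes that pass both a fixed and a proportional read count threshold."""
    total = sum(typing_result.values())
    return {
        haplotype: count
        for haplotype, count in typing_result.items()
        if count >= static and count >= dynamic * total
    }


# Hypothetical typing result for one marker: two well-supported haplotypes and one
# low-count haplotype that is presumably a sequencing artifact.
counts = {"C,C,A": 60, "C,T,C": 52, "C,G,A": 3}
print(apply_filters(counts))  # {'C,C,A': 60, 'C,T,C': 52}
```

A static filter alone is sensitive to sequencing depth, while a dynamic filter scales with it, which is one reason the two may be used together.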

In addition to the BAM file containing read alignments, we also need to specify the configuration file containing marker definitions for the 23-plex panel.
MicroHapulator will compute both the typing result and genotype call, storing them in a file named `EVD1-result.json`.

In [10]:
mhpl8r type marker-defn.tsv EVD1-reads.bam --dynamic 0.1 --static 5 -o EVD1-result.json

[MicroHapulator] running version 0.4.1+44.ge3fd75c.dirty
[MicroHapulator::type] discarded 12 reads with gaps or missing data at positions of interest

We can peek at the first few lines of this file to get an idea of its contents.
The data is stored in JavaScript Object Notation (JSON), and includes the typing results, a genotype call, and a handful of coverage statistics.

The `typing_result` entry lists the read counts for each allelic combination (haplotype) observed at each marker.
These counts are used to make a preliminary genotype call, which is stored in the `genotype` entry.
In this case, the first two markers listed (`mh01USC-1pD` and `mh02USC-2pC`) are both called as heterozygous.
The haplotypes *within* each marker have been resolved by the reads spanning the corresponding SNPs, but MicroHapulator does not attempt to resolve the haplotypes *between* markers (as indicated by the `"index": null` elements).
Only the first two markers are shown here, but the rest of the file contains haplotype tallies and genotype calls for the remaining 21 markers.

In [11]:
cat EVD1-result.json | head -n 42

{
    "markers": {
        "mh01USC-1pD": {
            "genotype": [
                {
                    "haplotype": "C,C,A",
                    "index": null
                },
                {
                    "haplotype": "C,C,C",
                    "index": null
                }
            ],
            "max_coverage": 109,
            "mean_coverage": 103.8,
            "min_coverage": 4,
            "num_discarded_reads": 0,
            "typing_result": {
                "C,C,A": 55,
                "C,C,C": 54
            }
        },
        "mh02USC-2pC": {
            "genotype": [
                {
                    "haplotype": "A,C,G,T",
                    "index": null
                },
                {
                    "haplotype": "G,C,G,T",
                    "index": null
                }
            ],
            "max_coverage": 106,
            "mean_coverage": 101.2,
            "min_coverage": 8,
            "num_discarded_reads": 0,
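
Because the result is plain JSON, it can also be inspected programmatically. The sketch below parses an excerpt containing only the complete `mh01USC-1pD` entry shown above and summarizes it. The field names are taken from the output, while the `summarize` helper is hypothetical.

```python
import json

# Excerpt of the result shown above, limited to the first marker.
# To read the real file instead: json.load(open("EVD1-result.json"))
excerpt = json.loads("""
{
    "markers": {
        "mh01USC-1pD": {
            "genotype": [
                {"haplotype": "C,C,A", "index": null},
                {"haplotype": "C,C,C", "index": null}
            ],
            "typing_result": {"C,C,A": 55, "C,C,C": 54}
        }
    }
}
""")


def summarize(result):
    """List the called haplotypes and the total tallied reads for each marker."""
    summary = {}
    for marker, data in result["markers"].items():
        summary[marker] = {
            "called_haplotypes": [entry["haplotype"] for entry in data["genotype"]],
            "reads_tallied": sum(data["typing_result"].values()),
        }
    return summary


print(summarize(excerpt))
```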

Now we repeat this step for **EVD2** and **REF1**.

In [12]:
mhpl8r type --out profile-EVD2.json beta-panel.fasta reads-EVD2.bam
mhpl8r type --out profile-REF1.json beta-panel.fasta reads-REF1.bam

[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 666 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7206 reads with gaps or missing data at positions of interest


We have now inferred a genotype profile for each sample.

In [13]:
ls -1 profile-EVD1.json profile-EVD2.json profile-REF1.json

profile-EVD1.json
profile-EVD2.json
profile-REF1.json


How then do we compare evidentiary samples with the reference sample?
MicroHapulator implements two operations for comparing single-source samples.
The first is the `mhpl8r dist` operation, which computes a naïve Hamming distance between two sample profiles.
Here, we define the Hamming distance as the number of markers at which the two profiles differ.
A Hamming distance of 0 represents a perfect match, while a distance of 50 (in the case of this panel) represents a mismatch at every marker.
Let's use the `mhpl8r dist` command to compare the reference sample and the evidentiary sample.

In [14]:
mhpl8r dist profile-REF1.json profile-EVD1.json

[MicroHapulator] running version 0.4.1
{
    "hamming_distance": 0
}
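
The Hamming distance as defined above can be sketched as follows. This assumes each profile has been reduced to a mapping from marker name to the set of haplotypes called at that marker; the representation, the function name, and the example profiles are all invented for illustration and do not reflect MicroHapulator's internals.

```python
def hamming_distance(profile1, profile2):
    """Count the markers at which two profiles have different genotype calls."""
    markers = set(profile1) | set(profile2)
    return sum(1 for marker in markers if profile1.get(marker) != profile2.get(marker))


# Hypothetical three-marker profiles.
ref = {"mhA": {"C,C,A", "C,T,C"}, "mhB": {"A,G"}, "mhC": {"T,T", "T,C"}}
evd = {"mhA": {"C,C,A", "C,T,C"}, "mhB": {"A,G", "G,G"}, "mhC": {"T,T", "T,C"}}

print(hamming_distance(ref, ref))  # 0
print(hamming_distance(ref, evd))  # 1
```

A marker missing from one of the profiles is counted as a mismatch in this sketch.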

Our first glance suggests that these samples are likely a match.
However, while Hamming distance may be simple to interpret, it doesn't provide any sense of confidence and would be difficult to defend in any formal legal context.
It would be more helpful if we could compute a likelihood ratio (LR) that quantifies the strength of the profile match.
The second operation MicroHapulator implements for comparing single-source samples is the `mhpl8r prob` command, which assesses the relative likelihood of the following propositions.

- $H_p$: the reference sample and evidentiary sample were derived from the same individual
- $H_d$: the reference sample and evidentiary sample were derived from two unrelated individuals in the population

Under $H_p$, the probability of the observed evidence is $P(E \mid H_p) = \epsilon^R$, where $\epsilon$ is a per-marker rate of genotyping error (default: 0.001) and $R$ is the number of allele mismatches between the reference and evidentiary samples.
Under $H_d$, the probability of the evidence $P(E \mid H_d)$ is the random match probability (RMP) of the profile.
The LR is the ratio $P(E \mid H_p) / P(E \mid H_d)$; in cases of a perfect match, $R = 0$, so $P(E \mid H_p) = 1$ and the LR is the reciprocal of the RMP.
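
As a rough sketch of this calculation, the LR is $\epsilon^R$ divided by the RMP. The RMP computation below assumes Hardy-Weinberg proportions at each marker (the squared frequency for a homozygous call, twice the product of the two frequencies for a heterozygous call) and independence between markers. These are common textbook assumptions rather than a confirmed description of MicroHapulator's implementation, and the function names and frequencies are invented for illustration.

```python
def random_match_probability(genotype, frequencies):
    """Multiply per-marker genotype frequencies, assuming Hardy-Weinberg proportions."""
    rmp = 1.0
    for marker, haplotypes in genotype.items():
        freqs = [frequencies[marker][hap] for hap in sorted(haplotypes)]
        if len(freqs) == 1:
            rmp *= freqs[0] ** 2  # homozygous call
        else:
            rmp *= 2 * freqs[0] * freqs[1]  # heterozygous call
    return rmp


def likelihood_ratio(rmp, num_mismatches=0, error_rate=0.001):
    """LR of H_p versus H_d: error_rate ** R divided by the RMP."""
    return error_rate**num_mismatches / rmp


# Hypothetical two-marker profile and haplotype frequencies.
genotype = {"mhA": {"C,C,A", "C,T,C"}, "mhB": {"A,G"}}
frequencies = {"mhA": {"C,C,A": 0.2, "C,T,C": 0.3}, "mhB": {"A,G": 0.5}}

rmp = random_match_probability(genotype, frequencies)
print(rmp)
print(likelihood_ratio(rmp))                    # perfect match: 1 / RMP
print(likelihood_ratio(rmp, num_mismatches=1))  # one mismatch: 0.001 / RMP
```

With a realistic panel, each additional marker multiplies the RMP by a number smaller than 1, which is why the LR for a perfect match grows so quickly.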

Now let's use `mhpl8r prob` to compare the reference and evidentiary samples.
We specify that MicroHapulator should use the `Iberian` population allele frequency distribution for computing this LR.

In [15]:
mhpl8r prob Iberian profile-REF1.json profile-EVD1.json

[MicroHapulator] running version 0.4.1
{
    "likelihood_ratio": "2.604E+59"
}

The result is a very large LR of $2.6 \times 10^{59}$, strongly supporting $H_p$ over $H_d$.
This gives us very strong evidence that **EVD1** and **REF1** are from the same individual.

Now, we repeat these comparisons for **EVD2**.

In [16]:
mhpl8r dist profile-REF1.json profile-EVD2.json

[MicroHapulator] running version 0.4.1
{
    "hamming_distance": 41
}

In [17]:
mhpl8r prob Iberian profile-REF1.json profile-EVD2.json

[MicroHapulator] running version 0.4.1
{
    "likelihood_ratio": "2.604E-115"
}

Here we see a very different story.
The Hamming distance shows differences at 41/50 markers, and the LR test statistic is very small, $2.6 \times 10^{-115}$, strongly supporting $H_d$ over $H_p$ for this sample.
The evidence is very strong that **EVD2** and **REF1** do not correspond to the same individual.

## Scenario 2

In this scenario, we have collected an evidentiary sample (**EVD3**) in the course of a forensic investigation, and there is some suspicion that this sample has multiple DNA contributors.
We have also collected reference samples from three persons of interest in the investigation, labeled **REF2**, **REF3**, and **REF4**.
As in the previous scenario, all four samples have been assayed with our 50 microhap MPS panel.
Reads are available in the following files.

In [18]:
ls -1 reads-EVD3.fastq.gz reads-REF2.fastq.gz reads-REF3.fastq.gz reads-REF4.fastq.gz

reads-EVD3.fastq.gz
reads-REF2.fastq.gz
reads-REF3.fastq.gz
reads-REF4.fastq.gz


As before, we will use `bwa mem` and `samtools` to align, sort, and index the reads for each sample.

In [19]:
bwa mem beta-panel.fasta reads-EVD3.fastq.gz | samtools view -bS - | samtools sort -o reads-EVD3.bam -
samtools index reads-EVD3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.250 CPU sec, 5.263 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.613 CPU sec, 2.593 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-EVD3.fastq.gz
[main] Real time: 8.409 sec; CPU: 8.020 sec


In [20]:
bwa mem beta-panel.fasta reads-REF2.fastq.gz | samtools view -bS - | samtools sort -o reads-REF2.bam -
samtools index reads-REF2.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 4.867 CPU sec, 4.818 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.628 CPU sec, 2.589 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF2.fastq.gz
[main] Real time: 8.040 sec; CPU: 7.652 sec


In [21]:
bwa mem beta-panel.fasta reads-REF3.fastq.gz | samtools view -bS - | samtools sort -o reads-REF3.bam -
samtools index reads-REF3.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.123 CPU sec, 5.075 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.730 CPU sec, 2.691 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF3.fastq.gz
[main] Real time: 8.406 sec; CPU: 8.015 sec


In [22]:
bwa mem beta-panel.fasta reads-REF4.fastq.gz | samtools view -bS - | samtools sort -o reads-REF4.bam -
samtools index reads-REF4.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 33224 sequences (10000424 bp)...
[M::process] read 16726 sequences (5034526 bp)...
[M::mem_process_seqs] Processed 33224 reads in 5.039 CPU sec, 5.005 real sec
[M::mem_process_seqs] Processed 16726 reads in 2.701 CPU sec, 2.647 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem beta-panel.fasta reads-REF4.fastq.gz
[main] Real time: 8.291 sec; CPU: 7.900 sec


Next we use `mhpl8r type` to infer genotype profiles for each sample.

In [23]:
mhpl8r type --out profile-EVD3.json beta-panel.fasta reads-EVD3.bam
mhpl8r type --out profile-REF2.json beta-panel.fasta reads-REF2.bam
mhpl8r type --out profile-REF3.json beta-panel.fasta reads-REF3.bam
mhpl8r type --out profile-REF4.json beta-panel.fasta reads-REF4.bam

[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 3674 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7483 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7280 reads with gaps or missing data at positions of interest
[MicroHapulator] running version 0.4.1
[MicroHapulator::type] discarded 7456 reads with gaps or missing data at positions of interest


We must now evaluate the evidentiary sample and see if we can confirm the presence of multiple DNA contributors.
The `mhpl8r contrib` command implements a simple check for determining the minimum number of contributors by scanning the sample profile to determine the maximum number of alleles $N_{\text{al}}$ present at any single locus.
From this, it can calculate the minimum number of sample contributors $C_{\text{min}}$ as follows.

$$
C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil
$$
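
A minimal sketch of this formula follows, assuming the profile is available as a mapping from marker name to the set of haplotypes observed at that marker. The function name and the example profile are hypothetical.

```python
from math import ceil


def min_contributors(profile):
    """Return (C_min, N_al) for a profile of per-marker haplotype sets."""
    max_alleles = max(len(haplotypes) for haplotypes in profile.values())
    return ceil(max_alleles / 2), max_alleles


# Hypothetical mixture profile: one marker shows five distinct haplotypes.
mixture = {
    "mhA": {"C,C,A", "C,T,C", "T,C,A"},
    "mhB": {"A,G", "G,G", "A,A", "G,A", "T,G"},
    "mhC": {"T,T", "T,C"},
}
print(min_contributors(mixture))  # (3, 5)
```

Each diploid contributor can carry at most two distinct haplotypes at a marker, so five haplotypes cannot be explained by fewer than three contributors.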

In [24]:
mhpl8r contrib -j profile-EVD3.json

[MicroHapulator] running version 0.4.1
{
    "min_num_contrib": 3,
    "num_loci_max_alleles": 3,
    "perc_loci_max_alleles": 0.06
}

The profile supports the presence of at least three DNA contributors in this evidentiary sample.
We must now determine which of the reference samples, if any, is a contributor.
For this we use the `mhpl8r contain` command, which calculates the "containment" of one sample profile in another.
Complete containment (or near-complete containment, allowing for genotyping error) suggests the *plausibility* that a simple single-contributor profile (the "query") is a contributor to a complex mixture profile (the "subject").
(Unfortunately, it cannot give positive confirmation that the query is a contributor.)
On the other hand, lack of complete or near-complete containment is strong evidence that the query is *not* a contributor to the subject.
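
The containment metric can be sketched as the fraction of the query's alleles that are also present in the subject. The representation (a mapping from marker name to a set of haplotypes), the function name, and the example profiles are assumptions made for illustration.

```python
def containment(subject, query):
    """Return (contained, total, fraction) of query alleles found in the subject."""
    contained = 0
    total = 0
    for marker, alleles in query.items():
        total += len(alleles)
        contained += len(alleles & subject.get(marker, set()))
    return contained, total, contained / total


# Hypothetical mixture (subject) and single-source (query) profiles.
subject = {"mhA": {"C,C,A", "C,T,C", "T,C,A"}, "mhB": {"A,G", "G,G"}}
query = {"mhA": {"C,C,A", "C,T,C"}, "mhB": {"A,G", "A,A"}}

print(containment(subject, query))  # (3, 4, 0.75)
```

In this sketch a homozygous call contributes one allele to the total and a heterozygous call contributes two, and an empty query would raise a division-by-zero error.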

Let us calculate the containment of sample **REF2** in sample **EVD3**.

In [25]:
mhpl8r contain profile-EVD3.json profile-REF2.json

[MicroHapulator] running version 0.4.1
{
    "containment": 1.0,
    "contained_alleles": 83,
    "total_alleles": 83
}

This result tells us that 100% of the alleles from **REF2** are present in **EVD3**, and suggests **REF2** is a plausible contributor to **EVD3**.
What can we say about **REF3** and **REF4**?

In [26]:
mhpl8r contain profile-EVD3.json profile-REF3.json

[MicroHapulator] running version 0.4.1
{
    "containment": 0.7143,
    "contained_alleles": 60,
    "total_alleles": 84
}

In [27]:
mhpl8r contain profile-EVD3.json profile-REF4.json

[MicroHapulator] running version 0.4.1
{
    "containment": 0.6977,
    "contained_alleles": 60,
    "total_alleles": 86
}

Only about 70% of the alleles in both of these samples are present in **EVD3**, strongly suggesting that they are not contributors to the sample.

As a final note, it is important to acknowledge several factors that can influence the value of the containment metric.
Minor contributors to a mixture may not be fully captured by the inferred genotype profile without some refinement of analytical thresholds, and thus may have a containment value < 1.0.
The amount of input DNA and the depth of sequencing coverage also influence the ability to recover minor contributors in a sample profile.
On the other hand, numerous alleles from non-contributors will likely be present in a mixture simply by chance, and as the complexity and diversity of the mixture increases so will the containment for non-contributors.
Probabilistic genotyping methods are the preferred approach for robust interpretation of complex mixtures, although these are not yet available in MicroHapulator.