# MicroHapulator: API Demo

**MicroHapulator** is software for forensic analysis of microhaplotype sequence data.
Features include the following:

- simulating simple (single-contributor) and complex (multi-contributor) DNA samples
- simulating MPS sequencing of user-specified microhap panels
- genotyping of DNA profiles from simple and complex DNA samples
- tools for deterministic and probabilistic interpretation of simple and complex samples

MicroHapulator relies on microhap marker definitions and allele frequencies from [MicroHapDB](https://github.com/bioforensics/MicroHapDB) and MPS error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

## Synopsis

This notebook provides an interactive demonstration of MicroHapulator's Python API.
Readers may also be interested in the CLI demo in [demo-cli.ipynb](demo-cli.ipynb) and the simulation demo in [demo-sim.ipynb](demo-sim.ipynb).

To use MicroHapulator in a Python program or interactive Python interpreter, simply load it with `import microhapulator`.
Additional functions and classes can also be imported for convenience.

In [1]:
import microhapulator
from microhapulator.profile import ObservedProfile
from microhapulator.type import observe_genotypes

Before we begin the demo proper, let's grab some mock data.
See the CLI demo for a description of the data.

In [2]:
!curl -sL https://osf.io/5rmgw/download > reads-EVD1.bam
!curl -sL https://osf.io/x8y9j/download > reads-EVD1.bam.bai
!curl -sL https://osf.io/jtbr8/download > reads-EVD2.bam
!curl -sL https://osf.io/w7c59/download > reads-EVD2.bam.bai
!curl -sL https://osf.io/4657h/download > reads-EVD3.bam
!curl -sL https://osf.io/g4zqf/download > reads-EVD3.bam.bai
    
!curl -sL https://osf.io/zdtcn/download > reads-REF1.bam
!curl -sL https://osf.io/5t4kd/download > reads-REF1.bam.bai
!curl -sL https://osf.io/qakjt/download > reads-REF2.bam
!curl -sL https://osf.io/6bjk3/download > reads-REF2.bam.bai
!curl -sL https://osf.io/na23r/download > reads-REF3.bam
!curl -sL https://osf.io/6vm5u/download > reads-REF3.bam.bai
!curl -sL https://osf.io/sh7ya/download > reads-REF4.bam
!curl -sL https://osf.io/x9kg6/download > reads-REF4.bam.bai

!curl -sL https://osf.io/fjdnq/download > beta-panel.fasta

## Working with Genotype Profiles

On the command line, the `mhpl8r type` command is used to infer genotype profiles from MPS read alignments.
In the Python API, this functionality is implemented in the `ObservedProfile` class or, alternatively, in the `microhapulator.type` module.
The following code shows how a genotype profile is inferred from an indexed BAM file and a reference FASTA file containing the sequences of the marker target amplicons.

In [3]:
bamfile = 'reads-REF1.bam'
refrfasta = 'beta-panel.fasta'

profile = ObservedProfile()
for markerid, cov_by_pos, counts, ndiscarded in observe_genotypes(bamfile, refrfasta):
    profile.record_coverage(markerid, cov_by_pos, ndiscarded)
    for allele, count in counts.items():
        profile.record_allele(markerid, allele, count)
profile.infer()

[MicroHapulator::type] discarded 7206 reads with gaps or missing data at positions of interest


Here's a breakdown of the code.

- we create an `ObservedProfile` object
- we iterate over the markers using the `observe_genotypes` function; for each marker, this function scans the aligned reads and aggregates per-base coverage, allele counts, and the number of discarded reads
- the `profile.record_coverage` method stores the aggregate coverage information
- the `profile.record_allele` method stores allele counts
- finally, the `profile.infer` method scans the allele counts at each marker and makes a genotype call
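
To make the last step more concrete, here is a minimal sketch of how a genotype call could be derived from allele counts. It assumes a simple fixed read-count threshold; the `call_genotype` function and the threshold value of 10 are hypothetical and chosen for illustration only, and MicroHapulator's actual calling logic may differ. The counts are the ones reported for marker `mh13KK-223` in this notebook.

```python
# Hypothetical sketch: keep every allele whose read count meets a fixed
# threshold. The threshold of 10 is an assumption for illustration, not
# MicroHapulator's actual default or algorithm.
def call_genotype(allele_counts, threshold=10):
    """Return the set of alleles supported by at least `threshold` reads."""
    return {allele for allele, count in allele_counts.items() if count >= threshold}


# Allele counts reported for marker mh13KK-223 in this notebook
counts = {
    'T,G,C,T': 407, 'T,G,C,G': 2, 'T,G,C,A': 2,
    'C,G,C,T': 407, 'C,G,C,G': 2, 'C,G,C,A': 2,
}
genotype = call_genotype(counts)
print(sorted(genotype))
```

With these counts, the two alleles supported by 407 reads each pass the threshold, while the alleles supported by only 2 reads (likely sequencing errors) are filtered out.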

Alternatively, we can perform the same operation with the `microhapulator.type.type` function.

In [4]:
profile = microhapulator.type.type(bamfile, refrfasta)

[MicroHapulator::type] discarded 7206 reads with gaps or missing data at positions of interest


There should be 50 markers in our profile.
Let's confirm this and grab the identifiers of the first 5 markers in the profile.

In [5]:
markers = list(profile.markers())
print(len(markers))
print(markers[:5])

50
['mh17CP-001', 'mh09KK-033', 'mh11KK-037', 'mh04KK-017', 'mh12CP-008']


We can grab the alleles for the genotype called at any marker in the profile, for example `mh13KK-223`.

In [6]:
print(profile.alleles('mh13KK-223'))

{'C,G,C,T', 'T,G,C,T'}


The underlying raw data is stored in a large nested data structure.
We can access this data through the `profile.data` member variable.

In [7]:
profile.data['markers']['mh13KK-223']

{'mean_coverage': 994.1,
 'min_coverage': 800,
 'max_coverage': 998,
 'num_discarded_reads': 176,
 'allele_counts': {'T,G,C,T': 407,
  'T,G,C,G': 2,
  'T,G,C,A': 2,
  'C,G,C,T': 407,
  'C,G,C,G': 2,
  'C,G,C,A': 2},
 'genotype': [{'allele': 'C,G,C,T', 'haplotype': None},
  {'allele': 'T,G,C,T', 'haplotype': None}]}

Let's use the `json.dumps` function to get a nicer view.

In [8]:
import json
print(json.dumps(profile.data['markers']['mh13KK-223'], indent=4))

{
    "mean_coverage": 994.1,
    "min_coverage": 800,
    "max_coverage": 998,
    "num_discarded_reads": 176,
    "allele_counts": {
        "T,G,C,T": 407,
        "T,G,C,G": 2,
        "T,G,C,A": 2,
        "C,G,C,T": 407,
        "C,G,C,G": 2,
        "C,G,C,A": 2
    },
    "genotype": [
        {
            "allele": "C,G,C,T",
            "haplotype": null
        },
        {
            "allele": "T,G,C,T",
            "haplotype": null
        }
    ]
}


In fact, when saving a profile, it is this raw data that is rendered in JSON and written to an output file.
(The `profile.dump` method is a wrapper around the `json.dump` function.)
We can then construct a new profile object from the JSON file stored on disk.
The following code round-trips the profile from memory to disk and then back into memory.
The `==` operator simply checks whether two `Profile` objects have the same allele calls.

In [9]:
profile.dump('profile.json')
profile_copy = ObservedProfile(fromfile='profile.json')
profile == profile_copy

True
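
The round trip works because the raw data consists only of JSON-compatible types. The following sketch illustrates the same idea with the standard `json` module on a plain dictionary. The `data` dictionary is a trimmed-down stand-in for the real profile data, and the code does not use MicroHapulator itself.

```python
import json
import os
import tempfile

# Trimmed-down stand-in for the raw profile data; a real profile holds
# many markers and additional fields.
data = {
    'markers': {
        'mh13KK-223': {
            'allele_counts': {'T,G,C,T': 407, 'C,G,C,T': 407},
            'genotype': [
                {'allele': 'C,G,C,T', 'haplotype': None},
                {'allele': 'T,G,C,T', 'haplotype': None},
            ],
        }
    }
}

# Write the data to disk as JSON, then read it back into memory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'profile.json')
    with open(path, 'w') as fh:
        json.dump(data, fh, indent=4)
    with open(path) as fh:
        data_copy = json.load(fh)

# None is written as null and restored as None, so the copy compares equal
roundtrip_ok = data == data_copy
print(roundtrip_ok)
```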

## Single-source Profile Comparisons

We can easily replicate the analyses described in the [CLI demo](demo-cli.ipynb) using the Python API.
First let's infer genotype profiles for each sample in Scenario 1.

In [10]:
evd1 = microhapulator.type.type('reads-EVD1.bam', 'beta-panel.fasta')
evd2 = microhapulator.type.type('reads-EVD2.bam', 'beta-panel.fasta')
ref1 = microhapulator.type.type('reads-REF1.bam', 'beta-panel.fasta')

[MicroHapulator::type] discarded 654 reads with gaps or missing data at positions of interest
[MicroHapulator::type] discarded 666 reads with gaps or missing data at positions of interest
[MicroHapulator::type] discarded 7206 reads with gaps or missing data at positions of interest


We can calculate the Hamming distance between two profiles using the `microhapulator.dist.dist` function.

In [11]:
microhapulator.dist.dist(ref1, evd1)

0

In [12]:
microhapulator.dist.dist(ref1, evd2)

41
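
Conceptually, the Hamming distance here is the number of markers at which the two profiles have different genotype calls. The sketch below shows the idea using plain dictionaries that map marker identifiers to sets of alleles. This representation, the `hamming_distance` helper, and the marker names are hypothetical; the sketch is not MicroHapulator's implementation.

```python
# Hypothetical sketch: profiles are plain dicts mapping a marker
# identifier to the set of alleles called at that marker.
def hamming_distance(profile_a, profile_b):
    """Count the markers at which the two profiles' allele sets differ."""
    markers = set(profile_a) | set(profile_b)
    return sum(1 for m in markers if profile_a.get(m) != profile_b.get(m))


# Toy data with made-up marker names and alleles
ref = {'mh01': {'A,C', 'A,T'}, 'mh02': {'G,G'}, 'mh03': {'C,T', 'T,T'}}
evd_same = {'mh01': {'A,T', 'A,C'}, 'mh02': {'G,G'}, 'mh03': {'T,T', 'C,T'}}
evd_diff = {'mh01': {'A,C', 'G,T'}, 'mh02': {'G,G'}, 'mh03': {'C,C'}}

print(hamming_distance(ref, evd_same))
print(hamming_distance(ref, evd_diff))
```

A distance of 0, as for REF1 vs EVD1 above, means the two profiles agree at every marker.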

We can also use the `rand_match_prob` method of `ObservedProfile` objects to compute the random match probability (RMP), and the `rmp_lr_test` method to compute the likelihood ratio (LR) test statistic, as described in the CLI demo.

In [13]:
rmp = ref1.rand_match_prob('SA004108N')
rmp

3.840230248298337e-60

In [14]:
lrt_stat = ref1.rmp_lr_test(evd1, 'SA004108N')
lrt_stat

2.6040105289080383e+59

In [15]:
lrt_stat = ref1.rmp_lr_test(evd2, 'SA004108N')
lrt_stat

2.6040105289080417e-115
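
For intuition, the sketch below computes an RMP for a toy profile as the product of per-marker genotype frequencies, assuming Hardy-Weinberg proportions (p² for a homozygous genotype, 2pq for a heterozygous one). The helper functions, marker names, and allele frequencies are hypothetical, and this is not necessarily how MicroHapulator computes the value. The final lines check an observation from the numbers reported above: for the perfectly matching pair REF1 vs EVD1, the LR statistic equals the reciprocal of the RMP. The sketch does not attempt to reproduce the statistic for the non-matching pair.

```python
import math


def genotype_frequency(alleles, freqs):
    """Hardy-Weinberg genotype frequency: p*p if homozygous, 2*p*q if heterozygous."""
    if len(alleles) == 1:
        p = freqs[alleles[0]]
        return p * p
    p, q = freqs[alleles[0]], freqs[alleles[1]]
    return 2 * p * q


def toy_rmp(profile, freqs):
    """Product of genotype frequencies over all markers (assumes independent markers)."""
    return math.prod(genotype_frequency(gt, freqs[m]) for m, gt in profile.items())


# Toy profile and made-up allele frequencies
profile = {'mhA': ('x', 'y'), 'mhB': ('z',)}
freqs = {'mhA': {'x': 0.5, 'y': 0.25}, 'mhB': {'z': 0.5}}

rmp_toy = toy_rmp(profile, freqs)
lr_toy = 1 / rmp_toy  # LR for a perfect match, taken as the reciprocal of the RMP
print(rmp_toy, lr_toy)

# Values reported in this notebook for REF1 and for REF1 vs EVD1
reported_rmp = 3.840230248298337e-60
reported_lr = 2.6040105289080383e+59
print(math.isclose(1 / reported_rmp, reported_lr, rel_tol=1e-9))
```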

## Mixture Analysis

The following code reproduces the analysis of Scenario 2 in the CLI demo.

In [16]:
evd3 = microhapulator.type.type('reads-EVD3.bam', 'beta-panel.fasta')
ref2 = microhapulator.type.type('reads-REF2.bam', 'beta-panel.fasta')
ref3 = microhapulator.type.type('reads-REF3.bam', 'beta-panel.fasta')
ref4 = microhapulator.type.type('reads-REF4.bam', 'beta-panel.fasta')

[MicroHapulator::type] discarded 3674 reads with gaps or missing data at positions of interest
[MicroHapulator::type] discarded 7483 reads with gaps or missing data at positions of interest
[MicroHapulator::type] discarded 7280 reads with gaps or missing data at positions of interest
[MicroHapulator::type] discarded 7456 reads with gaps or missing data at positions of interest


In [17]:
min_contrib, num_markers, perc_markers = microhapulator.contrib.contrib(evd3)
print('Minimum contributors in sample:', min_contrib)
print('    - number of supporting markers with max allele count:', num_markers)
print('    - percentage of markers overall with max allele count:', perc_markers)

Minimum contributors in sample: 3
    - number of supporting markers with max allele count: 3
    - percentage of markers overall with max allele count: 0.06
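
The estimate above is consistent with the maximum allele count method: each diploid contributor carries at most two alleles per marker, so a marker with n distinct alleles implies at least ceil(n / 2) contributors. The following sketch illustrates that reasoning on toy data. The `min_contributors` helper and the dictionary-of-sets profile representation are hypothetical and are not MicroHapulator's implementation.

```python
import math


# Hypothetical sketch: a profile is a dict mapping a marker identifier to
# the set of alleles observed at that marker.
def min_contributors(profile):
    """Return (minimum contributors, markers at the max allele count, fraction of markers)."""
    max_alleles = max(len(alleles) for alleles in profile.values())
    # Each diploid contributor carries at most two alleles per marker
    min_contrib = math.ceil(max_alleles / 2)
    supporting = sum(1 for alleles in profile.values() if len(alleles) == max_alleles)
    return min_contrib, supporting, supporting / len(profile)


# Toy mixture with made-up marker names and alleles
mixture = {
    'mh01': {'a', 'b'},
    'mh02': {'a', 'b', 'c', 'd', 'e'},
    'mh03': {'a', 'b', 'c'},
    'mh04': {'a', 'b', 'c', 'd', 'e'},
}
result = min_contributors(mixture)
print(result)
```

Note that the third value is a fraction of markers; in the output above, 0.06 corresponds to 3 of the 50 markers.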


In [18]:
for ref, label in zip((ref2, ref3, ref4), ('REF2', 'REF3', 'REF4')):
    contained, total = microhapulator.contain.contain(evd3, ref)
    output = '{:s} vs EVD3: {:d}/{:d} ({:.3f})'.format(label, contained, total, contained / total)
    print(output)

REF2 vs EVD3: 83/83 (1.000)
REF3 vs EVD3: 60/84 (0.714)
REF4 vs EVD3: 60/86 (0.698)
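
The containment values above can be read as the number of reference alleles that are also present in the evidence profile, out of the total number of reference alleles. The sketch below shows one way such a statistic could be computed. The `containment` helper, the dictionary-of-sets profile representation, and the toy data are hypothetical and are not MicroHapulator's implementation.

```python
# Hypothetical sketch: profiles are dicts mapping a marker identifier to
# the set of alleles observed at that marker.
def containment(evidence, reference):
    """Return (reference alleles found in the evidence, total reference alleles)."""
    contained = 0
    total = 0
    for marker, ref_alleles in reference.items():
        evd_alleles = evidence.get(marker, set())
        total += len(ref_alleles)
        contained += len(ref_alleles & evd_alleles)
    return contained, total


# Toy data with made-up marker names and alleles
evidence = {'mh01': {'A', 'B', 'C'}, 'mh02': {'D', 'E'}}
ref_in = {'mh01': {'A', 'B'}, 'mh02': {'D'}}
ref_out = {'mh01': {'A', 'Z'}, 'mh02': {'Y'}}

for label, ref in (('ref_in', ref_in), ('ref_out', ref_out)):
    contained, total = containment(evidence, ref)
    print('{:s}: {:d}/{:d} ({:.3f})'.format(label, contained, total, contained / total))
```

A reference profile that is fully contained in the evidence, like REF2 above, is consistent with that individual being a contributor to the mixture, whereas a substantially lower value, as for REF3 and REF4, is not.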
