# MicroHapulator: Simulation Demo

**MicroHapulator** is software for forensic analysis of microhaplotype sequence data.
Features include the following:

- simulating simple (single-contributor) and complex (multi-contributor) DNA samples
- simulating MPS sequencing of user-specified microhap panels
- genotyping of DNA profiles from simple and complex DNA samples
- tools for deterministic and probabilistic interpretation of simple and complex samples

MicroHapulator relies on microhap marker definitions and allele frequencies from [MicroHapDB](https://github.com/bioforensics/MicroHapDB) and MPS error models included with [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq/).

MPS sequence data from microhap loci of real DNA samples is not easily accessible for a variety of reasons, including privacy, intellectual property, cost, and the novelty of microhaplotypes as a forensic marker.
The development of methods and tools for analysis and interpretation of microhaps can therefore benefit immensely from simulated data.
This notebook provides an interactive demonstration of MicroHapulator's features for simulating genotype profiles and targeted MPS sequencing of the corresponding loci.

## Scenario 1 — Simulating Genotypes

First we will construct a hypothetical forensic scenario involving a reference sample obtained from a person of interest, as well as two evidentiary samples whose origin is unknown in the context of the investigation.

To fully construct this scenario we need to simulate genotype profiles for two individuals, **i1** and **i2**.
To do this, we must specify the population from which each parental haplotype originates and the panel of microhap markers that will be assayed.
We invoke the `mhpl8r sim` command to simulate each genotype.

In [2]:
mhpl8r sim --seed 24680 --out genotype-i1.json Iberian Iberian beta-panel.txt

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-i1.json


Let's break down what each part of this command means.

- `mhpl8r sim`: The command to simulate a genotype profile.
- `--seed 24680`: MicroHapulator uses a random number generator to sample alleles at each marker for each parental haplotype. Setting this generator to a known state with a specific "seed" value allows the same sequence of "random" numbers to be reproduced later. In other words, specifying a seed says that we don't want *just any* random genotype, we want *that particular* random genotype. The seed value itself can be chosen arbitrarily, since it has no other significance or meaning.
- `--out genotype-i1.json`: The file to which the simulated genotype will be written.
- `Iberian Iberian`: Use the "Iberian" population allele frequencies (ALFRED population "SA004108N") to sample alleles for both the maternal and paternal haplotypes. MicroHapulator can use allele frequency distributions from any population in the MicroHapDB database to simulate genotypes that realistically reflect true allele frequencies. MicroHapDB includes definitions for 290 microhaplotype markers and allele frequencies for 102 global populations and cohorts, which together represent nearly all published microhap data.
- `beta-panel.txt`: A file containing identifiers for the 50 microhap markers we will assay in our hypothetical scenario.
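
The roles of the seed and the two population arguments can be illustrated with a minimal Python sketch. This is not MicroHapulator's implementation: the function `simulate_genotype`, the marker names, and the allele frequencies below are all hypothetical, chosen only to show how a seeded generator draws one allele per parental haplotype from a frequency distribution.

```python
import random

# Hypothetical allele frequencies, for illustration only.
# Real frequencies come from MicroHapDB, not from this table.
TOY_FREQUENCIES = {
    "marker-A": {"A,C": 0.6, "A,T": 0.3, "G,T": 0.1},
    "marker-B": {"C,C,G": 0.5, "C,T,G": 0.4, "T,T,G": 0.1},
}


def simulate_genotype(maternal_freqs, paternal_freqs, seed):
    """Draw one allele per parental haplotype at every marker."""
    rng = random.Random(seed)  # a private generator set to a known state
    genotype = {}
    for marker in sorted(maternal_freqs):
        pair = []
        for freqs in (maternal_freqs[marker], paternal_freqs[marker]):
            alleles = list(freqs)
            weights = [freqs[allele] for allele in alleles]
            # Weighted draw: common alleles are sampled more often.
            pair.append(rng.choices(alleles, weights=weights, k=1)[0])
        genotype[marker] = pair
    return genotype


first = simulate_genotype(TOY_FREQUENCIES, TOY_FREQUENCIES, seed=24680)
again = simulate_genotype(TOY_FREQUENCIES, TOY_FREQUENCIES, seed=24680)
print(first == again)  # True: the same seed reproduces the same genotype
```

Passing the same table twice mirrors the `Iberian Iberian` arguments in the command above. Two different tables would correspond to parents drawn from different populations.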

We can peek at the top of the output file, `genotype-i1.json`, to explore its contents.
As you can see, it contains marker names and genotypes in JSON (JavaScript Object Notation) format, which is easily consumed by computers and also human readable (if a bit verbose).
For example, this genotype is homozygous for the `T,G,G` allele at marker `mh01CP-016`, and heterozygous for the alleles `A,A,C,T` and `A,G,C,T` at marker `mh01KK-117`.
Here we can only see information for the first two markers, but information for all 50 markers is present in the remainder of the file.

In [3]:
head -n 25 genotype-i1.json

{
    "markers": {
        "mh01CP-016": {
            "genotype": [
                {
                    "allele": "T,G,G",
                    "haplotype": 0
                },
                {
                    "allele": "T,G,G",
                    "haplotype": 1
                }
            ]
        },
        "mh01KK-117": {
            "genotype": [
                {
                    "allele": "A,A,C,T",
                    "haplotype": 0
                },
                {
                    "allele": "A,G,C,T",
                    "haplotype": 1
                }
            ]
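
Because the profile is plain JSON, the homozygous and heterozygous calls described above can be reproduced with a few lines of Python. The sketch below embeds only the two markers visible in the excerpt and relies only on the `markers`, `genotype`, and `allele` keys shown there. The helper name `zygosity` is hypothetical, and any structure in the rest of the file is not assumed.

```python
import json

# The two markers visible in the excerpt above, closed off into valid JSON.
PROFILE_JSON = """
{
    "markers": {
        "mh01CP-016": {
            "genotype": [
                {"allele": "T,G,G", "haplotype": 0},
                {"allele": "T,G,G", "haplotype": 1}
            ]
        },
        "mh01KK-117": {
            "genotype": [
                {"allele": "A,A,C,T", "haplotype": 0},
                {"allele": "A,G,C,T", "haplotype": 1}
            ]
        }
    }
}
"""


def zygosity(profile):
    """Label each marker by whether its two haplotypes carry the same allele."""
    calls = {}
    for marker, data in profile["markers"].items():
        distinct = {entry["allele"] for entry in data["genotype"]}
        calls[marker] = "homozygous" if len(distinct) == 1 else "heterozygous"
    return calls


profile = json.loads(PROFILE_JSON)
for marker, call in zygosity(profile).items():
    print(marker, call)
# mh01CP-016 homozygous
# mh01KK-117 heterozygous
```

To run this against the real file, `json.loads(PROFILE_JSON)` could be replaced with `json.load(open("genotype-i1.json"))`, assuming the remaining markers follow the same layout.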


Now let's simulate our second individual, storing its genotype in the file `genotype-i2.json`.

In [4]:
mhpl8r sim --seed 13579 --out genotype-i2.json Iberian Iberian beta-panel.txt

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-i2.json


## Scenario 1 — Simulated MPS Sequencing

Next we will simulate Illumina MiSeq sequencing of three samples from these two individuals: a reference sample **r1** and two evidentiary samples **e1** and **e2**.
Samples **r1** and **e1** originate from individual **i1**, and sample **e2** originates from individual **i2**.
Of course, in a real case the identity of the reference sample donor would be known, but the identity of the evidentiary sample donor(s) would be unknown.

Let's start with sample **r1**.

In [5]:
mhpl8r seq --num-reads 10000 --out reads-r1.fastq.gz --seed 666410524 --threads 2 genotype-i1.json

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::seq] Individual seed=666410524 numreads=10000


Let's break this command down.

- `mhpl8r seq`: The command to simulate MPS sequencing.
- `--num-reads 10000`: Simulate 10,000 reads.
- `--out reads-r1.fastq.gz`: Write the results to a file named `reads-r1.fastq.gz`.
- `--seed 666410524`: Seed the random number generator. As with `mhpl8r sim`, running `mhpl8r seq` with the same random seed on the same genotype will produce exactly the same "random" reads. However, running `mhpl8r seq` on the same genotype twice using two different seeds represents sequencing two independent samples of the same individual.
- `--threads 2`: Accelerate the simulation using 2 threads. On larger machines, you may be able to use as many as 32 or 64 threads to accelerate the simulation.
- `genotype-i1.json`: The file containing the genotype to be sequenced.
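
A quick sanity check on a simulated read file is to count its records. The sketch below assumes standard four-line FASTQ records in a gzip-compressed file. The helper `count_fastq_records` is hypothetical, and the demonstration runs on a tiny made-up file rather than on `reads-r1.fastq.gz`. Whether the count for a real output file equals the `--num-reads` value depends on how MicroHapulator counts paired reads, which is not shown here.

```python
import gzip
import os
import tempfile


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        num_lines = sum(1 for _ in handle)
    if num_lines % 4 != 0:
        raise ValueError(f"{path}: line count {num_lines} is not a multiple of 4")
    return num_lines // 4


# Toy data for illustration only: two made-up reads.
toy_records = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"

with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy.fastq.gz")
    with gzip.open(toy_path, "wt") as handle:
        handle.write(toy_records)
    toy_count = count_fastq_records(toy_path)

print(toy_count)  # 2
```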

Now let's create sample **e1** from the same genotype.

In [6]:
mhpl8r seq --threads 2 --num-reads 2000 --seed 3374532379 --out reads-e1.fastq.gz genotype-i1.json

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::seq] Individual seed=3374532379 numreads=2000


And finally, sample **e2** from the genotype of individual **i2**.

In [7]:
mhpl8r seq --threads 2 --num-reads 2000 --seed 3963949764 --out reads-e2.fastq.gz genotype-i2.json

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::seq] Individual seed=3963949764 numreads=2000


## Scenario 2

More info here.

## Multiple-contributor Sample

In [9]:
# Contributors
mhpl8r sim --seed 1234 --out genotype-m1.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5678 --out genotype-m2.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 1029 --out genotype-m3.json SA004250L SA004250L beta-panel.txt

# Non-contributors
mhpl8r sim --seed 3847 --out genotype-m4.json SA004250L SA004250L beta-panel.txt
mhpl8r sim --seed 5656 --out genotype-m5.json SA004250L SA004250L beta-panel.txt

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-m1.json
[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-m2.json
[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-m3.json
[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-m4.json
[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::sim] simulated microhaplotype variation at 50 markers
[MicroHapulator::sim] profile JSON written to genotype-m5.json


In [10]:
mhpl8r seq --proportions 0.7 0.2 0.1 --num-reads 20000 --threads 2 --out mix-reads.fastq.gz \
    genotype-m1.json genotype-m2.json genotype-m3.json

[MicroHapulator] running version 0.3+28.g2cc0a4b.dirty
[MicroHapulator::seq] Individual seed=3944980044 numreads=14000
[MicroHapulator::seq] Individual seed=1814986775 numreads=4000
[MicroHapulator::seq] Individual seed=2550701427 numreads=2000
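
The per-contributor read counts in the log follow from the requested proportions: 70%, 20%, and 10% of 20,000 reads give 14,000, 4,000, and 2,000 reads. The arithmetic can be sketched as follows. The function `allocate_reads` is hypothetical, and its handling of rounding remainders (assigning them to the largest contributor) is an assumption for illustration, not a description of how MicroHapulator rounds.

```python
def allocate_reads(total_reads, proportions):
    """Split a total read count among contributors according to proportions."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    counts = [round(total_reads * p) for p in proportions]
    # Assumption: any rounding remainder goes to the largest contributor
    # so that the counts always add up to the requested total.
    remainder = total_reads - sum(counts)
    counts[counts.index(max(counts))] += remainder
    return counts


print(allocate_reads(20000, [0.7, 0.2, 0.1]))  # [14000, 4000, 2000]
```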
