# MicroHapDB Tutorial

Daniel Standage  
2023-08-29  
MicroHapDB version 0.11  
Updated 2025-04-24

## Overview

This tutorial provides a brief overview of the MicroHapDB database, its contents, and its features. MicroHapDB is a comprehensive collection of microhaplotype marker and frequency data. It is distributed as a software package that can be installed with pip or conda: see https://microhapdb.readthedocs.io/en/latest/install.html for more information. Once installed, MicroHapDB can be accessed on the terminal via the command `microhapdb`.

<small>(The contents of MicroHapDB are also available as Pandas DataFrame objects via a Python API. This will not be covered in this tutorial.)</small>

The contents of this tutorial are presented as an interactive notebook, which can be executed in the cloud via mybinder or locally using Jupyter. Each gray box contains a terminal command, and the output of each command will be displayed below the corresponding box as the notebook is executed.

## Getting started

To get started, let's run the `microhapdb` command with no arguments. This prints a usage statement describing a handful of MicroHapDB operations or "subcommands."

In [None]:
microhapdb

The `microhapdb` command has 5 subcommands: `frequency`, `lookup`, `marker`, `population`, and `summarize`. The `summarize` subcommand, as its name suggests, provides a high-level summary of the database contents—this is a good place to start.

In [None]:
microhapdb summarize

Before proceeding, we need to download the GRCh38 human reference genome.

In [None]:
microhapdb --download

## Markers

The `microhapdb marker` command is used to retrieve information about microhaps from the database. Running the command with no other configuration will dump the entire contents of the markers table onto the screen. Given that there are more than 3000 microhap allele definitions in the database, this usually isn't very useful.

In [None]:
microhapdb marker

If we add a microhap identifier, MicroHapDB will display info only for that microhap.

In [None]:
microhapdb marker mh03USC-3pA

Adding `--format=detail` will provide a richly detailed report for the selected microhap(s).

In [None]:
microhapdb marker mh03USC-3pA --format=detail

If the provided identifier refers to a microhap locus with multiple allele definitions, all allele definitions will be displayed.

In [None]:
microhapdb marker mh02KK-014

The user can configure which population is represented in the $A_e$ column, and configure which columns are displayed. See `microhapdb marker --help` below for more instructions.

In [None]:
microhapdb marker mh02KK-014 --ae-pop=TSI

In [None]:
microhapdb marker mh02KK-014 --ae-pop=TSI --columns=xca
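
For readers unfamiliar with the $A_e$ column: the effective number of alleles is commonly defined as $A_e = 1 / \sum_i p_i^2$, where $p_i$ is the frequency of the $i$-th allele in the chosen population. The Python sketch below illustrates that formula only. The function name is made up for illustration, the frequencies are hypothetical rather than taken from MicroHapDB, and it is assumed here (not confirmed by this tutorial) that MicroHapDB computes the column in the same way.

```python
def effective_number_of_alleles(frequencies):
    """Return A_e = 1 / sum(p_i^2) for a list of allele frequencies."""
    total = sum(frequencies)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"frequencies should sum to 1, got {total}")
    return 1.0 / sum(p * p for p in frequencies)


# Hypothetical frequencies for a three-allele microhap (not real data)
toy_frequencies = [0.5, 0.3, 0.2]
ae = effective_number_of_alleles(toy_frequencies)
print(f"{ae:.3f}")
```

A marker with four equally frequent alleles would have $A_e = 4$; skewed frequencies push the value lower, which is why the population chosen with `--ae-pop` matters.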

MicroHapDB will also return results for merged and deprecated records. For example, the microhap `mh05WL-043` is synonymous with `mh05KK-178`, and the database lookup resolves this seamlessly.

In [None]:
microhapdb marker mh05WL-043

Markers can be retrieved by more than just name or identifier. Users can also filter by genomic region or by more advanced queries, as shown below.

In [None]:
microhapdb marker --region=chr10:1-4000000
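
The region string in the command above follows the familiar `chrom:start-end` pattern. As a rough illustration of what a region filter does, here is a minimal Python sketch that parses such a string and selects overlapping records from a toy list. The marker records, the helper names, and the overlap rule are all assumptions made for illustration; MicroHapDB's own coordinate handling may differ.

```python
import re

# chrom:start-end, e.g. "chr10:1-4000000"
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = REGION_PATTERN.match(region)
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region start exceeds end: {region!r}")
    return match["chrom"], start, end


def overlaps(marker, region):
    """Return True if the marker interval overlaps the region (assumed rule)."""
    chrom, start, end = parse_region(region)
    return (
        marker["chrom"] == chrom
        and marker["start"] <= end
        and marker["end"] >= start
    )


# Hypothetical marker records, not taken from MicroHapDB
toy_markers = [
    {"name": "toyA", "chrom": "chr10", "start": 1_200_000, "end": 1_200_150},
    {"name": "toyB", "chrom": "chr10", "start": 5_000_000, "end": 5_000_090},
    {"name": "toyC", "chrom": "chr2", "start": 100, "end": 250},
]
hits = [m["name"] for m in toy_markers if overlaps(m, "chr10:1-4000000")]
print(hits)
```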

In [None]:
microhapdb marker --query='20 < Extent < 25'

In [None]:
microhapdb marker --query='Source.str.contains("NimaGen2023")'

When designing amplicons or probes for panel, assay, or kit development, you can configure `microhapdb` to return the microhap sequence in FASTA format with the `--format=fasta` flag. The `--format=offsets` flag prints the corresponding SNP offsets instead. See `microhapdb marker --help` below for additional options for specifying the boundaries of the target locus.

In [None]:
microhapdb marker mh02WL-005 mh08PK-46625 mh09LS-9pC --format=fasta

In [None]:
microhapdb marker mh02WL-005 mh08PK-46625 mh09LS-9pC --format=offsets
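
If the FASTA output is saved to a file for downstream amplicon or probe design, it can be read back with a few lines of Python. The sketch below is a generic FASTA reader using only the standard library. The records in `toy_fasta` are hypothetical and do not reproduce MicroHapDB's actual header format, and `read_fasta` is an illustrative helper rather than part of MicroHapDB.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping record name to sequence."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[name].append(line)
    return {record: "".join(parts) for record, parts in chunks.items()}


# Hypothetical FASTA text standing in for saved command output
toy_fasta = """>toyA hypothetical record
ACGTACGT
ACGT
>toyB
GGCC
"""
sequences = read_fasta(toy_fasta.splitlines())
for record, sequence in sequences.items():
    print(record, len(sequence))
```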

To see a usage statement describing all of the configuration options shown above, and more, run `microhapdb marker --help`.

In [None]:
microhapdb marker --help

## Frequencies

The `microhapdb frequency` command is used to retrieve population allele frequency estimates. Let's run `microhapdb frequency --help` to get an idea of what our configuration options are.

In [None]:
microhapdb frequency --help

The frequencies table is massive, so we won't attempt to print it to the screen by running `microhapdb frequency` without any additional configuration. Instead, we'll grab frequencies for all alleles observed at mh02KK-014.v5 for the CEU population.

In [None]:
microhapdb frequency --marker=mh02KK-014.v5 --population=CEU

To get information about a particular population (or all populations) for which frequency data is available, use the `microhapdb population` command.

In [None]:
microhapdb population CEU

By default, frequencies are displayed in a tabular format, but they can also be formatted for compatibility with popular probabilistic genotyping programs like EuroForMix using the `--format=efm` flag.

In [None]:
microhapdb frequency --population=CEU --format=efm --marker mh02WL-005 mh08PK-46625 mh09LS-9pC