# MicroHapDB Tutorial

Daniel Standage  
2023-08-29  
MicroHapDB version 0.11  
Updated 2025-04-24

## Overview

This tutorial provides a brief overview of the MicroHapDB database, its contents, and its features. MicroHapDB is a comprehensive collection of microhaplotype marker and frequency data. It is distributed as a software package that can be installed with pip or conda; see https://microhapdb.readthedocs.io/en/latest/install.html for more information. Once installed, MicroHapDB can be accessed from the terminal via the `microhapdb` command.

<small>(The contents of MicroHapDB are also available as Pandas DataFrame objects via a Python API. This will not be covered in this tutorial.)</small>

The contents of this tutorial are presented as an interactive notebook, which can be executed in the cloud via mybinder or locally using Jupyter. Each gray box contains a terminal command, and the output of each command will be displayed below the corresponding box as the notebook is executed.

## Getting started

To get started, let's run the `microhapdb` command with no arguments. This prints a usage statement listing MicroHapDB's operations, or "subcommands."

In [1]:
microhapdb

usage: microhapdb [-h] [-v] [-f] [--download] cmd ...

≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠
 __  __ _            _  _           ___  ___
|  \/  (_)__ _ _ ___| || |__ _ _ __|   \| _ )
| |\/| | / _| '_/ _ \ __ / _` | '_ \ |) | _ \
|_|  |_|_\__|_| \___/_||_\__,_| .__/___/|___/
                              |_|
≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠≠

Subcommands:
  cmd            frequency, lookup, marker, population, summarize

Configuration:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
  -f, --files    print data table filenames and exit
  --download     download the GRCh38 genome and exit


The `microhapdb` command has five subcommands: `frequency`, `lookup`, `marker`, `population`, and `summarize`. As its name suggests, `summarize` provides a high-level summary of the database contents, so it is a good place to start.

In [2]:
microhapdb summarize

[microhaplotypes]
  - 3053 allele definitions
  - 2413 distinct loci
[frequencies]
  - 59704 haplotypes
  - 124 population groups
  - 885503 total microhap frequencies


Before proceeding, we need to download the GRCh38 human reference genome.

In [None]:
microhapdb --download

## Markers

The `microhapdb marker` command is used to retrieve information about microhaps from the database. Running the command with no other configuration will dump the entire contents of the markers table onto the screen. Given that there are more than 3000 microhap allele definitions in the database, this usually isn't very useful.

In [3]:
microhapdb marker

                Name  NumVars  Extent Chrom     Start       End     Ae                                  Source
            mh01LW-3        4      58  chr1     31670     31727  1.000                               Zhang2023
   mh01SCUZJ-0000740       15     345  chr1    928119    928463  7.362                                 Zhu2023
          mh01WL-090        5     195  chr1   1049886   1050080  3.230                                Yu2022G1
       mh01KK-172.v1        3     226  chr1   1551454   1551679  2.932                                Kidd2018
       mh01KK-172.v2        7     226  chr1   1551454   1551679  3.481                            Gandotra2020
       mh01KK-172.v3        8     226  chr1   1551454   1551679  3.481                             Pakstis2021
         mh01USC-1pA        4      49  chr1   1594570   1594618  3.828                          delaPuente2020
mh01SCUZJ-0005171.v1       10     310  chr1   1935166   1935475  1.000                                 Zhu2023

If we add a microhap identifier, MicroHapDB will display information only for that microhap.

In [4]:
microhapdb marker mh03USC-3pA

       Name  NumVars  Extent Chrom   Start     End    Ae         Source
mh03USC-3pA        4      69  chr3 5492730 5492798 3.703 delaPuente2020


Adding `--format=detail` will provide a richly detailed report for the selected microhap(s).

In [5]:
microhapdb marker mh03USC-3pA --format=detail

--------------------------------------------------------------[ MicroHapDB ]----
MH Locus: mh03USC-3pA
Marker:   mh03USC-3pA (Source: delaPuente2020)

Marker Definition
    Marker extent
        - chr3:5492730-5492798 (69 bp)
    Target locus
        - chr3:5492719-5492808 (89 bp)
    Constituent variants
        - chromosome offsets (GRCh37): 5534416, 5534454, 5534459, 5534484
        - chromosome offsets (GRCh38): 5492729, 5492767, 5492772, 5492797
        - marker offsets: 0, 38, 43, 68
        - target offsets: 10, 48, 53, 78
        - cross-references: rs11710159, rs9853649, rs9849667, rs1991577
    Observed haplotypes
        - A:C:G:T
        - A:T:G:T
        - C:C:A:T
        - C:C:G:C
        - C:C:G:T
        - C:T:A:C
        - C:T:A:T
        - C:T:G:C


--[ Core Marker Sequence ]--
>mh03USC-3pA
CTGTCTCACTCCATCAGGGAGGGCAGGTCTTTTTAAGCTGAGGACCTAGGGGTCAGTTCTTAGTGGCAT


--[ Marker Target Sequence with MH alleles (haplotypes) ]--
          *                                     

If the provided identifier refers to a microhap locus with multiple allele definitions, all allele definitions will be displayed.

In [6]:
microhapdb marker mh02KK-014

         Name  NumVars  Extent Chrom     Start       End     Ae       Source
mh02KK-014.v5       12     192  chr2 227659272 227659463  8.130  NimaGen2023
mh02KK-014.v3       18     155  chr2 227659318 227659472  7.421      Fan2022
mh02KK-014.v1       13     239  chr2 227659356 227659594 12.074 Gandotra2020
mh02KK-014.v2       16     239  chr2 227659356 227659594 12.127  Pakstis2021
mh02KK-014.v4        8     298  chr2 227659394 227659691 15.826      Zhu2023
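
The relationship between a locus and its allele definitions can be sketched in a few lines of Python. The example below parses the table text shown above and groups the marker names by locus. The helper names `parse_table` and `locus_name` are hypothetical, and the rule "strip a trailing `.vN` suffix to get the locus name" is an assumption inferred from the names in this tutorial, not a documented MicroHapDB convention.

```python
from collections import defaultdict

# Table text copied from the `microhapdb marker mh02KK-014` output above.
TABLE = """\
         Name  NumVars  Extent Chrom     Start       End     Ae       Source
mh02KK-014.v5       12     192  chr2 227659272 227659463  8.130  NimaGen2023
mh02KK-014.v3       18     155  chr2 227659318 227659472  7.421      Fan2022
mh02KK-014.v1       13     239  chr2 227659356 227659594 12.074 Gandotra2020
mh02KK-014.v2       16     239  chr2 227659356 227659594 12.127  Pakstis2021
mh02KK-014.v4        8     298  chr2 227659394 227659691 15.826      Zhu2023
"""


def parse_table(text):
    """Split whitespace-aligned table text into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def locus_name(marker_name):
    """Drop a trailing '.vN' suffix if present (assumed naming rule, see lead-in)."""
    base, sep, suffix = marker_name.rpartition(".v")
    return base if sep and suffix.isdigit() else marker_name


rows = parse_table(TABLE)
by_locus = defaultdict(list)
for row in rows:
    by_locus[locus_name(row["Name"])].append(row["Name"])

for locus, names in by_locus.items():
    print(locus, len(names), ", ".join(sorted(names)))
```

Parsing on whitespace works here because none of the displayed values contain spaces; multiple sources are joined with `;` rather than a space.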


The `--ae-pop` option selects which population is represented in the $A_e$ column, and the `--columns` option controls which columns are displayed. See `microhapdb marker --help` below for more instructions.

In [7]:
microhapdb marker mh02KK-014 --ae-pop=TSI

         Name  NumVars  Extent Chrom     Start       End     Ae       Source
mh02KK-014.v5       12     192  chr2 227659272 227659463  6.273  NimaGen2023
mh02KK-014.v3       18     155  chr2 227659318 227659472  4.942      Fan2022
mh02KK-014.v1       13     239  chr2 227659356 227659594  8.363 Gandotra2020
mh02KK-014.v2       16     239  chr2 227659356 227659594  8.363  Pakstis2021
mh02KK-014.v4        8     298  chr2 227659394 227659691 11.809      Zhu2023


In [8]:
microhapdb marker mh02KK-014 --ae-pop=TSI --columns=xca

         Name  Extent Chrom     Ae       Source
mh02KK-014.v5     192  chr2  6.273  NimaGen2023
mh02KK-014.v3     155  chr2  4.942      Fan2022
mh02KK-014.v1     239  chr2  8.363 Gandotra2020
mh02KK-014.v2     239  chr2  8.363  Pakstis2021
mh02KK-014.v4     298  chr2 11.809      Zhu2023


MicroHapDB will also return results for merged and deprecated records. For example, the microhap `mh05WL-043` is synonymous with `mh05KK-178`, and the database lookup resolves this seamlessly.

In [9]:
microhapdb marker mh05WL-043

         Name  NumVars  Extent Chrom    Start      End    Ae            Source
mh05KK-178.v4        5     249  chr5 68013857 68014105 4.910          Yu2022G3
mh05KK-178.v3        4     170  chr5 68013936 68014105 4.870 Yu2022G1;Yu2022G2
mh05KK-178.v1        8     231  chr5 68013936 68014166 5.106      Gandotra2020
mh05KK-178.v2        9     231  chr5 68013936 68014166 5.106       Pakstis2021


Markers do not have to be retrieved by name or identifier. Users can also filter by genomic region or by more advanced queries, as shown below.

In [10]:
microhapdb marker --region=chr10:1-4000000

             Name  NumVars  Extent Chrom   Start     End    Ae              Source
mh10SCUZJ-0001540        4     124 chr10  418145  418268 1.000             Zhu2023
       mh10WL-040        4     148 chr10  523356  523503 1.000            Yu2022G1
mh10SCUZJ-0003737        8     214 chr10 1126155 1126368 9.957             Zhu2023
       mh10WL-045        5     151 chr10 1558616 1558766 1.000            Yu2022G1
mh10SCUZJ-0011512       10     312 chr10 2050582 2050893 3.602             Zhu2023
    mh10KK-162.v3        9     235 chr10 3118431 3118665 3.903         NimaGen2023
    mh10KK-162.v1       12     266 chr10 3118460 3118725 5.872        Gandotra2020
    mh10KK-162.v2       13     266 chr10 3118460 3118725 5.872         Pakstis2021
    mh10KK-163.v2        3      77 chr10 3120218 3120294 3.677         Staadig2021
    mh10KK-163.v1        5     260 chr10 3120218 3120477 5.135 Kidd2018;Turchi2019
    mh10KK-163.v3        5     191 chr10 3120287 3120477 5.383            Yu2022G1
mh10

In [11]:
microhapdb marker --query='20 < Extent < 25'

           Name  NumVars  Extent Chrom     Start       End    Ae                     Source
     mh01CP-010        3      23  chr1  85240118  85240140 2.668                   Chen2019
     mh01HYP-01        2      23  chr1 207512419 207512441 2.076                    Zou2022
     mh02KK-105        2      22  chr2  96700566  96700587 2.017                   Kidd2018
    mh02USC-2qD        3      21  chr2 205957896 205957916 3.214             delaPuente2020
    mh02USC-2qE        3      24  chr2 240602475 240602498 3.108             delaPuente2020
     mh04KK-029        2      23  chr4  99420682  99420704 1.398                   Kidd2018
    mh08ZBF-001        3      21  chr8   2133359   2133379 2.314                    Jin2020
  mh09KK-152.v2        2      23  chr9  83193793  83193815 2.338                Staadig2021
     mh10WL-009        3      24 chr10  28211460  28211483 3.062 Yu2022G1;Yu2022G2;Yu2022G4
     mh11CP-003        3      23 chr11   5854637   5854659 3.410                

In [12]:
microhapdb marker --query='Source.str.contains("NimaGen2023")'

           Name  NumVars  Extent Chrom     Start       End     Ae                               Source
  mh01WL-003.v1        5     198  chr1  85527716  85527913  1.146        Yu2022G1;Yu2022G2;NimaGen2023
  mh01KK-212.v1       11     243  chr1 202647419 202647661 10.054             Gandotra2020;NimaGen2023
  mh01WL-006.v3        7     213  chr1 236518814 236519026 10.016                          NimaGen2023
  mh02KK-022.v3        7     259  chr2   3168667   3168925  5.581                          NimaGen2023
  mh02KK-029.v1       13     229  chr2  68911825  68912053  6.168             Gandotra2020;NimaGen2023
  mh02KK-134.v2        6     104  chr2 160222900 160223003  5.206             Gandotra2020;NimaGen2023
  mh02KK-014.v5       12     192  chr2 227659272 227659463  8.130                          NimaGen2023
  mh03WL-006.v4        6      83  chr3   2598095   2598177  5.010                          NimaGen2023
   mh03LV-06.v5       11     309  chr3  11914418  11914726 14.015        
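
To make the meaning of the two `--query` expressions concrete, here is a plain-Python analogue applied to a handful of rows copied from the tables above. The query strings resemble pandas `DataFrame.query` expressions, but that MicroHapDB evaluates them this way is an assumption. The function names `short_markers` and `from_source` are hypothetical and are not part of MicroHapDB.

```python
# A few complete rows copied from the marker tables above:
# (Name, NumVars, Extent, Source)
ROWS = [
    ("mh02KK-105", 2, 22, "Kidd2018"),
    ("mh02USC-2qD", 3, 21, "delaPuente2020"),
    ("mh02KK-014.v5", 12, 192, "NimaGen2023"),
    ("mh01KK-212.v1", 11, 243, "Gandotra2020;NimaGen2023"),
    ("mh10KK-163.v1", 5, 260, "Kidd2018;Turchi2019"),
]


def short_markers(rows, low=20, high=25):
    """Analogue of the query '20 < Extent < 25': both bounds are exclusive."""
    return [name for name, _numvars, extent, _source in rows if low < extent < high]


def from_source(rows, source):
    """Analogue of 'Source.str.contains("...")': a substring match on the Source field."""
    return [name for name, _numvars, _extent, src in rows if source in src]


print(short_markers(ROWS))
print(from_source(ROWS, "NimaGen2023"))
```

Because the second filter is a substring match, it also selects markers whose `Source` field lists several sources separated by `;`, as in the `Gandotra2020;NimaGen2023` row.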

When designing amplicons or probes for panel, assay, or kit development, you can configure `microhapdb` to return the microhap sequence in FASTA format with the `--format=fasta` flag. The `--format=offsets` flag prints the corresponding SNP offsets. See `microhapdb marker --help` below for additional options for specifying the boundaries of the target locus.

In [13]:
microhapdb marker mh02WL-005 mh08PK-46625 mh09LS-9pC --format=fasta

>mh02WL-005 GRCh38:chr2:30759849-30760028 mh02WL-005=10,79,167,168
CCCATGCTCAGCAGTGGGGCATTGCTGTGCAGGAGCCGCACAGTCACACATGGGCCCAGGAGCCTGGCTCGGATAAGGCT
CTTTTCTGTTGCTTCCCTTGCCTGCCCCGCTCCCTGGCCAGTACCTCAGCAGAAGGGGCTCTGACTTGCAATGCCTCCAA
GAGGCACGAACCCAGGTTT
>mh08PK-46625 GRCh38:chr8:1194322-1194402 mh08PK-46625=30,34,42,49
CTGGTGGAGGGAGCCCGGATGCCTGGCAGACAGTCAGTGGTCGGTTGGCGGCCGGCCCACATAAGGGCACCATGCTCACC
>mh09LS-9pC GRCh38:chr9:36010354-36010565 mh09LS-9pC=10,78,200
ATCAAATAGACCTCCTCAACCTGCAAGAATGTCTGAGATCATCTGGCCTCACCCTGCCATCATGCTGAGGAAATGGGGGC
TTGCCAGGGACCATACAGCTGGTAGTCCTTCTCACGATTGGTTGTCCTTTCTTGATAAAAACTGTTGGGCTCTTTCTCAC
TTCATAGTTAGATACAAAGTCTGGCAATAATAATAAGCATGTTTCTGAATC


In [14]:
microhapdb marker mh02WL-005 mh08PK-46625 mh09LS-9pC --format=offsets

Marker	Offset	Chrom	OffsetHg38
mh02WL-005	10	chr2	30759859
mh02WL-005	79	chr2	30759928
mh02WL-005	167	chr2	30760016
mh02WL-005	168	chr2	30760017
mh08PK-46625	30	chr8	1194352
mh08PK-46625	34	chr8	1194356
mh08PK-46625	42	chr8	1194364
mh08PK-46625	49	chr8	1194371
mh09LS-9pC	10	chr9	36010364
mh09LS-9pC	78	chr9	36010432
mh09LS-9pC	200	chr9	36010554
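
The two formats are related by simple arithmetic: in the output above, each `OffsetHg38` value equals the start coordinate in the FASTA header plus the marker-relative offset. The sketch below reproduces the offsets table from the FASTA headers alone. The regular expression and the function name `genomic_offsets` are hypothetical, written only against the header layout shown in this tutorial.

```python
import re

# FASTA header lines copied from the `--format=fasta` output above.
FASTA_HEADERS = [
    ">mh02WL-005 GRCh38:chr2:30759849-30760028 mh02WL-005=10,79,167,168",
    ">mh08PK-46625 GRCh38:chr8:1194322-1194402 mh08PK-46625=30,34,42,49",
    ">mh09LS-9pC GRCh38:chr9:36010354-36010565 mh09LS-9pC=10,78,200",
]

# Header layout assumed from the examples: >NAME GRCh38:CHROM:START-END NAME=OFFSETS
HEADER_RE = re.compile(r"^>(\S+) GRCh38:(\w+):(\d+)-(\d+) \S+=([\d,]+)$")


def genomic_offsets(header):
    """Return (marker, offset, chrom, genomic offset) tuples for one FASTA header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unrecognized FASTA header: {header!r}")
    name, chrom, start, _end, offsets = match.groups()
    return [
        (name, int(offset), chrom, int(start) + int(offset))
        for offset in offsets.split(",")
    ]


print("Marker\tOffset\tChrom\tOffsetHg38")
for header in FASTA_HEADERS:
    for record in genomic_offsets(header):
        print(*record, sep="\t")
```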


To see a usage statement describing all of the configuration options shown above, and more, run `microhapdb marker --help`.

In [15]:
microhapdb marker --help

usage: microhapdb marker [-h] [--ae-pop POP] [--panel FILE] [--region RGN]
                         [--query QRY] [--format {table,detail,fasta,offsets}]
                         [--columns C] [--delta D] [--min-length L]
                         [--extend-mode E] [--notrunc]
                         [id ...]

Retrieve marker records by identifier or query

Required Arguments:
  id                    one or more marker identifiers

Options:
  -h, --help            show this help message and exit

Data Retrieval:
  Configure how marker records are retrieved from the database.

  --ae-pop POP          specify the 1000 Genomes population from which to
                        report effective number of alleles in the "Ae" column;
                        by default, the Ae value averaged over all 26 1KGP
                        populations is reported
  --panel FILE          file containing a list of marker names/identifiers,
                        one per line
  --region RGN          rest
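
The `--panel FILE` option listed above expects a plain text file with one marker identifier per line. As a minimal sketch, the Python below writes such a file and assembles the matching command. The file name `panel.txt` and the helper names are hypothetical, and the command is only executed if `microhapdb` is found on the `PATH`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Marker identifiers used in the FASTA example earlier in this tutorial.
MARKERS = ["mh02WL-005", "mh08PK-46625", "mh09LS-9pC"]


def write_panel(path, markers):
    """Write one marker identifier per line, the layout described for --panel."""
    path = Path(path)
    path.write_text("".join(f"{marker}\n" for marker in markers))
    return path


def marker_command(panel_path, fmt="fasta"):
    """Build the argument list for `microhapdb marker` using a panel file."""
    return ["microhapdb", "marker", "--panel", str(panel_path), f"--format={fmt}"]


with tempfile.TemporaryDirectory() as tmpdir:
    panel = write_panel(Path(tmpdir) / "panel.txt", MARKERS)
    command = marker_command(panel)
    print(" ".join(command))
    # Run the command only when MicroHapDB is actually installed.
    if shutil.which("microhapdb"):
        result = subprocess.run(command, capture_output=True, text=True)
        print(result.stdout)
```

The `frequency` subcommand accepts the same kind of file through its own `--panel` option, so one panel file can drive both marker and frequency retrieval.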

## Frequencies

The `microhapdb frequency` command is used to retrieve population allele frequency estimates. Let's run `microhapdb frequency --help` to get an idea of what our configuration options are.

In [16]:
microhapdb frequency --help

usage: microhapdb frequency [-h] [--format {table,mhpl8r,efm}]
                            [--marker ID [ID ...] | --panel FILE]
                            [--population ID [ID ...]] [--allele ID]

Retrieve population allele frequencies

optional arguments:
  -h, --help            show this help message and exit
  --format {table,mhpl8r,efm}
  --marker ID [ID ...]  restrict frequencies by marker
  --panel FILE          restrict frequencies to markers listed in FILE, one ID
                        per line
  --population ID [ID ...]
                        restrict frequencies by population
  --allele ID           restrict frequencies by allele

Examples::

    microhapdb frequency --marker=mh22KK-060 --population=SA000001B
    microhapdb frequency --marker=mh22KK-060 --allele='C|A'


The frequencies table is massive, so we won't attempt to print it to the screen by running `microhapdb frequency` without any additional configuration. Instead, we'll grab frequencies for all alleles observed at mh02KK-014.v5 for the CEU population.

In [17]:
microhapdb frequency --marker=mh02KK-014.v5 --population=CEU

Marker	Population	Allele	Frequency	Count	Source
mh02KK-014.v5	CEU	A:G:C:T:A:C:C:C:G:T:A:A	0.03689	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:C:C:C:G:T:A:G	0.04508	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:C:C:A:T:A:A	0.05328	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:C:C:A:T:A:G	0.11885	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:C:C:G:T:A:A	0.2541	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:C:C:G:T:A:G	0.23361	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:T:C:G:T:A:A	0.04918	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:A:G:T:C:G:T:A:G	0.0041	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:G:C:C:C:A:T:A:G	0.02869	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:G:C:C:C:G:T:A:G	0.0082	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:G:G:C:C:A:T:A:G	0.02049	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:G:G:C:C:G:T:A:A	0.0123	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	A:G:C:T:G:G:C:C:G:T:A:G	0.04918	244	Byrska-Bishop2022
mh02KK-014.v5	CEU	G:

To get information about a particular population (or all populations) for which frequency data is available, use the `microhapdb population` command.

In [18]:
microhapdb population CEU

 ID                                                              Name            Source
CEU Utah Residents (CEPH) with Northern and Western European Ancestry Byrska-Bishop2022


By default, frequencies are displayed in a tabular format, but the `--format=efm` flag formats them for compatibility with popular probabilistic genotyping programs like EuroForMix.

In [19]:
microhapdb frequency --population=CEU --format=efm --marker mh02WL-005 mh08PK-46625 mh09LS-9pC

[microhapdb] retrieved and ordered 13 distinct haplotypes
[microhapdb] constructed frequency table for 13 haplotypes and 3 markers
Allele,mh02WL-005,mh08PK-46625,mh09LS-9pC
G:C:G:G,0.082,,
G:T:G:A,0.393,,
G:T:G:G,0.004,,
T:C:G:G,0.303,,
T:T:A:G,0.016,,
T:T:G:A,0.201,,
C:C:G:G,,0.41,
G:C:T:G,,0.193,
G:G:G:G,,0.016,
G:G:T:G,,0.381,
A:T:A,,,0.119
C:G:G,,,0.779
C:T:G,,,0.102
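
The EFM table also makes it easy to illustrate the effective number of alleles reported in the `Ae` column of the marker tables, commonly defined as $A_e = 1 / \sum_i p_i^2$, where $p_i$ is the frequency of haplotype $i$. The sketch below applies that formula to the CEU frequencies printed above. That MicroHapDB computes its `Ae` column with exactly this formula is an assumption, and the function names are hypothetical.

```python
import csv
import io

# EFM-formatted frequency table copied from the output above (CEU population).
EFM_TABLE = """\
Allele,mh02WL-005,mh08PK-46625,mh09LS-9pC
G:C:G:G,0.082,,
G:T:G:A,0.393,,
G:T:G:G,0.004,,
T:C:G:G,0.303,,
T:T:A:G,0.016,,
T:T:G:A,0.201,,
C:C:G:G,,0.41,
G:C:T:G,,0.193,
G:G:G:G,,0.016,
G:G:T:G,,0.381,
A:T:A,,,0.119
C:G:G,,,0.779
C:T:G,,,0.102
"""


def effective_alleles(frequencies):
    """Effective number of alleles: the reciprocal of the sum of squared frequencies."""
    return 1.0 / sum(p * p for p in frequencies)


def ae_by_marker(efm_text):
    """Collect the non-empty frequencies in each marker column and compute Ae."""
    frequencies = {}
    for row in csv.DictReader(io.StringIO(efm_text)):
        for marker, value in row.items():
            if marker != "Allele" and value:
                frequencies.setdefault(marker, []).append(float(value))
    return {marker: effective_alleles(p) for marker, p in frequencies.items()}


for marker, ae in ae_by_marker(EFM_TABLE).items():
    print(f"{marker}\t{ae:.3f}")
```

These values are specific to CEU and are computed from frequencies rounded to three decimal places, so they should not be expected to match the default `Ae` column, which is averaged over all 26 1KGP populations.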
