API rewrite (#99)
standage committed Nov 14, 2022
1 parent 5134431 commit 5aceb7f
Showing 28 changed files with 2,263 additions and 1,810 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/cibuild.yml
@@ -22,10 +22,10 @@ jobs:
           python -m pip install --upgrade pip
           pip install .
           make devdeps
-      - name: Style check
-        run: make style
       - name: Test with pytest
         run: make test
+      - name: Style check
+        run: make style
       - uses: codecov/codecov-action@v1
         with:
           token: ${{ secrets.CODECOV_TOKEN }} #required
12 changes: 8 additions & 4 deletions Makefile
@@ -6,18 +6,22 @@ help: Makefile
 
 ## test: execute the automated test suite
 test:
-	pytest --cov=microhapdb --cov-report=term --cov-report=xml --doctest-modules microhapdb/cli/*.py microhapdb/retrieve.py microhapdb/tests/test_*.py
+	pytest --cov=microhapdb --cov-report=term --cov-report=xml --doctest-modules microhapdb/cli/*.py microhapdb/marker.py microhapdb/population.py microhapdb/tests/test_*.py
 
 ## devdeps: install development dependencies
 devdeps:
 	pip install --upgrade pip setuptools
 	pip install wheel twine
-	pip install pycodestyle 'pytest>=5.0' pytest-cov pytest-sugar
+	pip install black==22.10 'pytest>=5.0' pytest-cov
 
 ## clean: remove development artifacts
 clean:
 	rm -rf __pycache__/ microhapdb/__pycache__/ microhapdb/*/__pycache__ build/ dist/ *.egg-info/ dbbuild/.snakemake
 
-## style: check code style against PEP8
+## style: check code style
 style:
-	pycodestyle --ignore=E501,W503 microhapdb/*.py
+	black --line-length=99 --check *.py microhapdb/*.py microhapdb/*/*.py
+
+## format: autoformat code with Black
+format:
+	black --line-length=99 *.py microhapdb/*.py microhapdb/*/*.py
91 changes: 53 additions & 38 deletions README.md
@@ -11,7 +11,7 @@ https://github.com/bioforensics/microhapdb
The database includes a comprehensive collection of marker and allele frequency data from numerous databases and published research articles.<sup>[5-19]</sup>
Effective number of allele (*A<sub>e</sub>*)<sup>[2]</sup> and informativeness for assignment (*I<sub>n</sub>*)<sup>[3]</sup> statistics are included so that markers can be ranked for different forensic applications.
The entire contents of the database are distributed with each copy of MicroHapDB, and instructions for adding private data to a local copy of the database are provided.
MicroHapDB is designed to be user-friendly for both practitioners and researchers, supporting a range of access methods from browsing to simple text queries to complex queries to full programmatic access via a Python API.
MicroHapDB is designed to be user-friendly for both practitioners and researchers, supporting a range of access methods from browsing and simple text queries to complex queries and full programmatic access via a Python API.
MicroHapDB is also designed as a community resource requiring minimal infrastructure to use and maintain.


@@ -35,53 +35,68 @@ Conda ensures the correct installation of Python version 3 and the [Pandas][] li

## Usage

### Browsing
There is no web interface for MicroHapDB.
The database must be installed locally to the computer(s) on which it will be used.
The entire contents of the database are distributed with each copy of MicroHapDB.
Once installed, users can access the database contents in any of the following ways.
(Click on each arrow for more information.)

Typing `microhapdb marker` on the command line will print a complete listing of all microhap markers in MicroHapDB to your terminal window.
The commands `microhapdb population` and `microhapdb frequency` will do the same for population descriptions and allele frequencies.
<details>
<summary>Command line interface</summary>

> *__WARNING__: it's unlikely the entire data table will fit on your screen at once, so you may have to scroll back in your terminal to view all rows of the table.*
### Command line interface

Alternatively, the files `marker.tsv`, `population.tsv`, and `frequency.tsv` can be opened in Excel or loaded into your statistics/data science environment of choice.
Type `microhapdb --files` on the command line to see the location of these files.
MicroHapDB provides a simple and user-friendly text interface for database query and retrieval.
Using the `microhapdb` command in the terminal, a user can provide filtering criteria to select population, marker, and frequency data, and format this data in a variety of ways.

### Database queries
- `microhapdb population` for retrieving information on populations for which MicroHapDB has allele frequency data
- `microhapdb marker` for retrieving marker information
- `microhapdb frequency` for retrieving microhap population frequencies
- `microhapdb lookup` for retrieving individual records of any type

The `microhapdb lookup <identifier>` command searches all data tables for records matching a user-provided name, identifier, or description, such as `mh06PK-24844`, `rs8192488`, or `Yoruba`.
Executing `microhapdb population --help`, `microhapdb marker --help`, and so on in the terminal will print a usage statement with detailed instructions for configuring and running database queries.

The `microhapdb marker <identifier>` command searches the microhap markers with one or more user-provided names or identifiers.
The command also supports region-based queries (such as `chr1` or `chr12:1000000-5000000`), and can print either a tabular report or a detailed report.
Run `microhapdb marker --help` for additional details.
</details>

The `microhapdb population <identifier>` command searches the population & cohort table with one or more user-provided names or identifiers.
Run `microhapdb population --help` for additional details.

The `microhapdb frequency --marker <markerID> --population <popID> --allele <allele>` command searches the allele frequency table.
The search can be restricted using all query terms (marker, population, and allele), or broadened by dropping one or more of the query terms.
Run `microhapdb frequency --help` for additional details.
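
For scripted workflows, the same queries can also be issued through the Python API described below. A minimal sketch, using only identifiers already shown as examples above:

```python
import microhapdb

# Rough programmatic equivalents of the CLI queries above
print(microhapdb.Marker.table_from_ids(["mh06PK-24844"]))            # marker lookup by name
print(microhapdb.Marker.table_from_region("chr12:1000000-5000000"))  # region-based query
print(microhapdb.Population.table_from_ids(["Yoruba"]))              # population lookup by name
```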

<img alt="MicroHapDB UNIX CLI" src="img/microhapdb-unix-cli.gif" width="600px" />
<details>
<summary>Python API</summary>

### Python API

To access MicroHapDB from Python, simply invoke `import microhapdb` and query the following tables.

- `microhapdb.markers`
- `microhapdb.populations`
- `microhapdb.frequencies`

Each is a [Pandas][]<sup>[1]</sup> dataframe object, supporting convenient and efficient listing, subsetting, and query capabilities.

<img alt="MicroHapDB Python API" src="img/microhapdb-python-api.gif" width="600px" />

See the [Pandas][] documentation for more details on dataframe access and query methods.
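
For example, routine Pandas operations work directly on these tables. A brief sketch, using only column names that appear elsewhere in this document:

```python
import microhapdb

# Markers on chromosome 18, showing a few columns of the marker table
chr18 = microhapdb.markers[microhapdb.markers.Chrom == "chr18"]
print(chr18[["Name", "Chrom", "Ae", "In"]].head())

# Populations sourced from the 1000 Genomes Project (1KGP)
print(microhapdb.populations.query("Source == '1KGP'").head())
```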

MicroHapDB also includes 4 auxiliary tables, which may be useful in a variety of scenarios; a brief example follows the list below.

- `microhapdb.variantmap`: contains a mapping of dbSNP variants to their corresponding microhap markers
- `microhapdb.idmap`: cross-references external names and identifiers with internal MicroHapDB identifiers
- `microhapdb.sequences`: contains the sequence spanning and flanking each microhap locus
- `microhapdb.indels`: contains variant information for markers that include insertion/deletion variants
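
As one illustration, the `variantmap` table can be used to find which marker(s) include a given dbSNP variant. A minimal sketch, using an rsID mentioned above:

```python
import microhapdb

# Which microhap marker(s) include dbSNP variant rs8192488?
hits = microhapdb.variantmap[microhapdb.variantmap.Variant == "rs8192488"]
print(hits)
```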
For users with programming experience, the contents of MicroHapDB can be accessed programmatically from the `microhapdb` Python package.
Including `import microhapdb` at the top of a Python program will provide access to the following resources (a short sketch follows the list below).

- database tables, stored in memory as [Pandas][]<sup>[1]</sup> DataFrame objects
- `microhapdb.markers`
- `microhapdb.populations`
- `microhapdb.frequencies`
- convenience functions for data retrieval
- some return `Marker` or `Population` objects with numerous auxiliary attributes and methods
- `marker = microhapdb.Marker.from_id("mh02USC-2pC")`
- `for marker in microhapdb.Marker.from_ids(idlist):`
- `for marker in microhapdb.Marker.from_query("Source == '10.1007/s00414-020-02483-x'"):`
- `for marker in microhapdb.Marker.from_region("chr11:25000000-50000000"):`
- `population = microhapdb.Population.from_id("IBS")`
- `for population in microhapdb.Population.from_ids(idlist):`
- `for population in microhapdb.Population.from_query("Name.str.contains('Afr')"):`
- others return DataFrame objects with subsets of the primary database tables
- `result = microhapdb.Marker.table_from_ids(idlist)`
- `result = microhapdb.Marker.table_from_query("Source == '10.1007/s00414-020-02483-x'")`
- `result = microhapdb.Marker.table_from_region("chr11:25000000-50000000")`
- `result = microhapdb.Population.table_from_ids(idlist)`
- `result = microhapdb.Population.table_from_query("Name.str.contains('Afr')")`
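
A short sketch putting a few of these together, using identifiers and column names taken from the examples above:

```python
import microhapdb

# Markers in a region of chromosome 11, as a DataFrame ranked by effective
# number of alleles (Ae), highest first
region = microhapdb.Marker.table_from_region("chr11:25000000-50000000")
print(region.sort_values("Ae", ascending=False).head())

# Individual Marker and Population objects
marker = microhapdb.Marker.from_id("mh02USC-2pC")
population = microhapdb.Population.from_id("IBS")
print(marker, population)
```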
</details>

<details>
<summary>Direct file access</summary>

### Direct file access

For users who want to circumvent MicroHapDB's CLI and API, the database tables can be accessed directly and loaded into R, Python, Excel, or any preferred environment.
Running `microhapdb --files` on the command line will reveal the location of these files.

> *__WARNING__: Modifying the contents of the database files may cause problems with MicroHapDB. Any user wishing to sort, filter, or otherwise manipulate the contents of the core database files should instead copy those files and manipulate the copies.*
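
A minimal sketch of that workflow, assuming a hypothetical install location (run `microhapdb --files` to see the real paths):

```python
import shutil

import pandas as pd

# Copy a table out of the package data directory (the path below is hypothetical),
# then work on the copy, as the warning above advises
shutil.copy("/path/to/microhapdb/data/marker.tsv", "marker-copy.tsv")
markers = pd.read_csv("marker-copy.tsv", sep="\t")
print(markers.head())
```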
</details>

## Ranking Markers
42 changes: 21 additions & 21 deletions dbbuild/sources/kureshi2020/marker.tsv
@@ -1,21 +1,21 @@
-Name Xref NumVars Chrom OffsetsHg37 OffsetsHg38 VarRef
-mh02ZHA-012 3 chr2 "rs949778,rs867005,rs952210"
-mh03ZHA-001 4 chr3 "rs4858685,rs4858686,rs75773180,rs9838878"
-mh04ZHA-001 4 chr4 "rs6830692,rs9714725,rs12501341,rs10939388"
-mh04ZHA-002 4 chr4 "rs10939597,rs79276692,rs62409414,rs62409415"
-mh04ZHA-004 4 chr4 "rs10049992,rs1914740,rs1714017,rs6835177"
-mh04ZHA-007 3 chr4 "rs6819048,rs62308082,rs74383997"
-mh05ZHA-004 3 chr5 "rs2457087,rs2644662,rs2662178"
-mh07ZHA-003 4 chr7 "rs4724041,rs378367,rs433709,rs404569"
-mh07ZHA-004 3 chr7 "rs6971410,rs2971679,rs3808323"
-mh07ZHA-009 4 chr7 "rs144858626,rs149890778,rs11773043,rs7792859"
-mh08ZHA-011 4 chr8 "rs4831247,rs13265601,rs4831248,rs13268053"
-mh09ZHA-008 3 chr9 "rs11506774,rs10981667,rs10739387"
-mh10ZHA-002 4 chr10 "rs10764175,rs148665640,rs10827896,rs10827897"
-mh11ZHA-006a 4 chr11 "rs3809057,rs3809056,rs3809055,rs3809054"
-mh14ZHA-003 3 chr14 "rs4902946,rs8012670,rs4902947"
-mh16ZHA-009 4 chr16 "rs76047588,rs11641186,rs11641193,rs80213582"
-mh17ZHA-001 3 chr17 "rs56023444,rs4131415,rs4260117"
-mh19ZHA-007 4 chr19 "rs8106726,rs8102417,rs59490836,rs10406130"
-mh19ZHA-009 5 chr19 "rs74178308,rs8108729,rs8107824,rs8108835,rs2560950"
-mh22ZHA-008 3 chr22 "rs11568183,rs8142282,rs8136173"
+Name Xref NumVars Chrom OffsetsHg37 OffsetsHg38 VarRef
+mh02ZHA-012 3 chr2 "rs949778,rs867005,rs952210"
+mh03ZHA-001 4 chr3 "rs4858685,rs4858686,rs75773180,rs9838878"
+mh04ZHA-001 4 chr4 "rs6830692,rs9714725,rs12501341,rs10939388"
+mh04ZHA-002 4 chr4 "rs10939597,rs79276692,rs62409414,rs62409415"
+mh04ZHA-004 4 chr4 "rs10049992,rs1914740,rs1714017,rs6835177"
+mh04ZHA-007 3 chr4 "rs6819048,rs62308082,rs74383997"
+mh05ZHA-004 3 chr5 "rs2457087,rs2644662,rs2662178"
+mh07ZHA-003 4 chr7 "rs4724041,rs378367,rs433709,rs404569"
+mh07ZHA-004 3 chr7 "rs6971410,rs2971679,rs3808323"
+mh07ZHA-009 4 chr7 "rs144858626,rs149890778,rs11773043,rs7792859"
+mh08ZHA-011 4 chr8 "rs4831247,rs13265601,rs4831248,rs13268053"
+mh09ZHA-008 3 chr9 "rs11506774,rs10981667,rs10739387"
+mh10ZHA-002 4 chr10 "rs10764175,rs148665640,rs10827896,rs10827897"
+mh11ZHA-006a 4 chr11 "rs3809057,rs3809056,rs3809055,rs3809054"
+mh14ZHA-003 3 chr14 "rs4902946,rs8012670,rs4902947"
+mh16ZHA-009 4 chr16 "rs76047588,rs11641186,rs11641193,rs80213582"
+mh17ZHA-001 3 chr17 "rs56023444,rs4131415,rs4260117"
+mh19ZHA-007 4 chr19 "rs8106726,rs8102417,rs59490836,rs10406130"
+mh19ZHA-009 5 chr19 "rs74178308,rs8108729,rs8107824,rs8108835,rs2560950"
+mh22ZHA-008 3 chr22 "rs11568183,rs8142282,rs8136173"
110 changes: 77 additions & 33 deletions microhapdb/__init__.py
@@ -1,58 +1,102 @@
#!/usr/bin/env python3
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2018, DHS.
#
# -----------------------------------------------------------------------------
# Copyright (c) 2018, Battelle National Biodefense Institute.
# This file is part of MicroHapDB (http://github.com/bioforensics/MicroHapDB) and is licensed under
# the BSD license: see LICENSE.txt.
#
# This file is part of MicroHapDB (http://github.com/bioforensics/microhapdb)
# and is licensed under the BSD license: see LICENSE.txt.
# -----------------------------------------------------------------------------
# This software was prepared for the Department of Homeland Security (DHS) by the Battelle National
# Biodefense Institute, LLC (BNBI) as part of contract HSHQDC-15-C-00064 to manage and operate the
# National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and
# Development Center.
# -------------------------------------------------------------------------------------------------


from microhapdb.util import data_file
from .tables import markers, populations, frequencies, variantmap, idmap, sequences, indels
from .population import Population
from .marker import Marker
from microhapdb import cli
from microhapdb import retrieve
from microhapdb import marker
from microhapdb import panel
from microhapdb import population
import os
import pandas as pd
from pkg_resources import resource_filename
import pandas
from ._version import get_versions
__version__ = get_versions()['version']

__version__ = get_versions()["version"]
del get_versions


def data_file(path):
return resource_filename("microhapdb", f"data/{path}")


def set_ae_population(popid=None):
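# Replace the Ae column of the global marker table with values computed for the
# specified population; when popid is None, restore the default Ae values from marker.tsv.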
global markers
columns = ['Name', 'PermID', 'Reference', 'Chrom', 'Offsets', 'Ae', 'In', 'Fst', 'Source']
columns = ["Name", "PermID", "Reference", "Chrom", "Offsets", "Ae", "In", "Fst", "Source"]
if popid is None:
defaults = pandas.read_csv(data_file('marker.tsv'), sep='\t')
defaults = defaults[['Name', 'Ae']]
markers = markers.drop(columns=['Ae']).join(defaults.set_index('Name'), on='Name')[columns]
defaults = pd.read_csv(data_file("marker.tsv"), sep="\t")
defaults = defaults[["Name", "Ae"]]
markers = markers.drop(columns=["Ae"]).join(defaults.set_index("Name"), on="Name")[columns]
else:
aes = pandas.read_csv(data_file('marker-aes.tsv'), sep='\t')
aes = pd.read_csv(data_file("marker-aes.tsv"), sep="\t")
if popid not in aes.Population.unique():
raise ValueError(f'no Ae data for population "{popid}"')
popaes = aes[aes.Population == popid].drop(columns=['Population'])
markers = markers.drop(columns=['Ae']).join(popaes.set_index('Marker'), on='Name')[columns]
popaes = aes[aes.Population == popid].drop(columns=["Population"])
markers = markers.drop(columns=["Ae"]).join(popaes.set_index("Marker"), on="Name")[columns]


def set_reference(refr):
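# Switch the coordinates in the global marker table between GRCh38 (the default)
# and GRCh37 offsets.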
global markers
assert refr in (37, 38)
columns = ['Name', 'PermID', 'Reference', 'Chrom', 'Offsets', 'Ae', 'In', 'Fst', 'Source']
columns = ["Name", "PermID", "Reference", "Chrom", "Offsets", "Ae", "In", "Fst", "Source"]
if refr == 38:
defaults = pandas.read_csv(data_file('marker.tsv'), sep='\t')[['Name', 'Reference', 'Offsets']]
markers = markers.drop(columns=['Reference', 'Offsets']).join(defaults.set_index('Name'), on='Name')[columns]
defaults = pd.read_csv(data_file("marker.tsv"), sep="\t")[["Name", "Reference", "Offsets"]]
markers = markers.drop(columns=["Reference", "Offsets"]).join(
defaults.set_index("Name"), on="Name"
)[columns]
else:
o37 = pandas.read_csv(data_file('marker-offsets-GRCh37.tsv'), sep='\t')
markers = markers.drop(columns=['Reference', 'Offsets']).join(o37.set_index('Marker'), on='Name')[columns]
o37 = pd.read_csv(data_file("marker-offsets-GRCh37.tsv"), sep="\t")
markers = markers.drop(columns=["Reference", "Offsets"]).join(
o37.set_index("Marker"), on="Name"
)[columns]


markers = pandas.read_csv(data_file('marker.tsv'), sep='\t')
populations = pandas.read_csv(data_file('population.tsv'), sep='\t')
frequencies = pandas.read_csv(data_file('frequency.tsv'), sep='\t')
variantmap = pandas.read_csv(data_file('variantmap.tsv'), sep='\t')
idmap = pandas.read_csv(data_file('idmap.tsv'), sep='\t')
sequences = pandas.read_csv(data_file('sequences.tsv'), sep='\t')
indels = pandas.read_csv(data_file('indels.tsv'), sep='\t')
def retrieve_by_id(ident):
"""Retrieve records by name or identifier
>>> retrieve_by_id("mh17KK-014")
Name PermID Reference Chrom Offsets Ae In Fst Source
510 mh17KK-014 MHDBM-83a239de GRCh38 chr17 4497060,4497088,4497096 2.0215 0.6423 0.3014 ALFRED
>>> retrieve_by_id("SI664726F")
Name PermID Reference Chrom Offsets Ae In Fst Source
510 mh17KK-014 MHDBM-83a239de GRCh38 chr17 4497060,4497088,4497096 2.0215 0.6423 0.3014 ALFRED
>>> retrieve_by_id("MHDBM-ea520d26")
Name PermID Reference Chrom Offsets Ae In Fst Source
539 mh18KK-285 MHDBM-ea520d26 GRCh38 chr18 24557354,24557431,24557447,24557489 2.7524 0.1721 0.0836 ALFRED
>>> retrieve_by_id("PJL")
ID Name Source
82 PJL Punjabi from Lahore, Pakistan 1KGP
>>> retrieve_by_id("Asia")
ID Name Source
7 MHDBP-936bc36f79 Asia 10.1016/j.fsigen.2018.05.008
>>> retrieve_by_id("Japanese")
ID Name Source
45 MHDBP-63967b883e Japanese 10.1016/j.legalmed.2015.06.003
46 SA000010B Japanese ALFRED
"""

def id_in_series(ident, series):
return series.str.contains(ident).any()

if id_in_series(ident, idmap.Xref):
result = idmap[idmap.Xref == ident]
assert len(result) == 1
ident = result.ID.iloc[0]
id_in_pop_ids = id_in_series(ident, populations.ID)
id_in_pop_names = id_in_series(ident, populations.Name)
id_in_variants = id_in_series(ident, variantmap.Variant)
id_in_marker_names = id_in_series(ident, markers.Name)
id_in_marker_permids = id_in_series(ident, markers.PermID)
if id_in_pop_ids or id_in_pop_names:
return Population.table_from_ids([ident])
elif id_in_variants or id_in_marker_names or id_in_marker_permids:
return Marker.table_from_ids([ident])
else:
raise ValueError(f'identifier "{ident}" not found in MicroHapDB')