API rewrite (#99)
standage committed Nov 14, 2022
1 parent 5134431 commit 5aceb7f
Showing 28 changed files with 2,263 additions and 1,810 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/cibuild.yml
@@ -22,10 +22,10 @@ jobs:
           python -m pip install --upgrade pip
           pip install .
           make devdeps
-      - name: Style check
-        run: make style
       - name: Test with pytest
         run: make test
+      - name: Style check
+        run: make style
       - uses: codecov/codecov-action@v1
         with:
           token: ${{ secrets.CODECOV_TOKEN }} #required
12 changes: 8 additions & 4 deletions Makefile
@@ -6,18 +6,22 @@ help: Makefile
 
 ## test: execute the automated test suite
 test:
-	pytest --cov=microhapdb --cov-report=term --cov-report=xml --doctest-modules microhapdb/cli/*.py microhapdb/retrieve.py microhapdb/tests/test_*.py
+	pytest --cov=microhapdb --cov-report=term --cov-report=xml --doctest-modules microhapdb/cli/*.py microhapdb/marker.py microhapdb/population.py microhapdb/tests/test_*.py
 
 ## devdeps: install development dependencies
 devdeps:
 	pip install --upgrade pip setuptools
 	pip install wheel twine
-	pip install pycodestyle 'pytest>=5.0' pytest-cov pytest-sugar
+	pip install black==22.10 'pytest>=5.0' pytest-cov
 
 ## clean: remove development artifacts
 clean:
 	rm -rf __pycache__/ microhapdb/__pycache__/ microhapdb/*/__pycache__ build/ dist/ *.egg-info/ dbbuild/.snakemake
 
-## style: check code style against PEP8
+## style: check code style
 style:
-	pycodestyle --ignore=E501,W503 microhapdb/*.py
+	black --line-length=99 --check *.py microhapdb/*.py microhapdb/*/*.py
+
+## format: autoformat code with Black
+format:
+	black --line-length=99 *.py microhapdb/*.py microhapdb/*/*.py
91 changes: 53 additions & 38 deletions README.md
@@ -11,7 +11,7 @@ https://github.com/bioforensics/microhapdb
The database includes a comprehensive collection of marker and allele frequency data from numerous databases and published research articles.<sup>[5-19]</sup>
Effective number of allele (*A<sub>e</sub>*)<sup>[2]</sup> and informativeness for assignment (*I<sub>n</sub>*)<sup>[3]</sup> statistics are included so that markers can be ranked for different forensic applications.
The entire contents of the database are distributed with each copy of MicroHapDB, and instructions for adding private data to a local copy of the database are provided.
MicroHapDB is designed to be user-friendly for both practitioners and researchers, supporting a range of access methods from browsing to simple text queries to complex queries to full programmatic access via a Python API.
MicroHapDB is designed to be user-friendly for both practitioners and researchers, supporting a range of access methods from browsing and simple text queries to complex queries and full programmatic access via a Python API.
MicroHapDB is also designed as a community resource requiring minimal infrastructure to use and maintain.


@@ -35,53 +35,68 @@ Conda ensures the correct installation of Python version 3 and the [Pandas][] li

## Usage

### Browsing
There is no web interface for MicroHapDB.
The database must be installed locally to the computer(s) on which it will be used.
The entire contents of the database are distributed with each copy of MicroHapDB.
Once installed, users can access the database contents in any of the following ways.
(Click on each arrow for more information.)

Typing `microhapdb marker` on the command line will print a complete listing of all microhap markers in MicroHapDB to your terminal window.
The commands `microhapdb population` and `microhapdb frequency` will do the same for population descriptions and allele frequencies.
<details>
<summary>Command line interface</summary>

> *__WARNING__: it's unlikely the entire data table will fit on your screen at once, so you may have to scroll back in your terminal to view all rows of the table.*
### Command line interface

Alternatively, the files `marker.tsv`, `population.tsv`, and `frequency.tsv` can be opened in Excel or loaded into your statistics/data science environment of choice.
Type `microhapdb --files` on the command line to see the location of these files.
MicroHapDB provides a simple and user-friendly text interface for database query and retrieval.
Using the `microhapdb` command in the terminal, a user can provide filtering criteria to select population, marker, and frequency data, and format this data in a variety of ways.

### Database queries
- `microhapdb population` for retrieving information on populations for which MicroHapDB has allele frequency data
- `microhapdb marker` for retrieving marker information
- `microhapdb frequency` for retrieving microhap population frequencies
- `microhapdb lookup` for retrieving individual records of any type

The `microhapdb lookup <identifier>` command searches all data tables for records matching a user-provided name, identifier, or description, such as `mh06PK-24844`, `rs8192488`, or `Yoruba`.
Executing `microhapdb population --help`, `microhapdb marker --help`, and so on in the terminal will print a usage statement with detailed instructions for configuring and running database queries.

The `microhapdb marker <identifier>` command searches the microhap markers with one or more user-provided names or identifiers.
The command also supports region-based queries (such as `chr1` or `chr12:1000000-5000000`), and can print either a tabular report or a detailed report.
Run `microhapdb marker --help` for additional details.
</details>

The `microhapdb population <identifier>` command searches the population & cohort table with one or more user-provided names or identifiers.
Run `microhapdb population --help` for additional details.

The `microhapdb frequency --marker <markerID> --population <popID> --allele <allele>` command searches the allele frequency table.
The search can be restricted using all query terms (marker, population, and allele), or broadened by dropping one or more of the query terms.
Run `microhapdb frequency --help` for additional details.
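
For scripted workflows, the same queries can also be issued through the Python API described below. A minimal sketch, using only identifiers already shown as examples above:

```python
import microhapdb

# Rough programmatic equivalents of the CLI queries above
print(microhapdb.Marker.table_from_ids(["mh06PK-24844"]))            # marker lookup by name
print(microhapdb.Marker.table_from_region("chr12:1000000-5000000"))  # region-based query
print(microhapdb.Population.table_from_ids(["Yoruba"]))              # population lookup by name
```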

<img alt="MicroHapDB UNIX CLI" src="img/microhapdb-unix-cli.gif" width="600px" />
<details>
<summary>Python API</summary>

### Python API

To access MicroHapDB from Python, simply invoke `import microhapdb` and query the following tables.

- `microhapdb.markers`
- `microhapdb.populations`
- `microhapdb.frequencies`

Each is a [Pandas][]<sup>[1]</sup> dataframe object, supporting convenient and efficient listing, subsetting, and query capabilities.

<img alt="MicroHapDB Python API" src="img/microhapdb-python-api.gif" width="600px" />

See the [Pandas][] documentation for more details on dataframe access and query methods.
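
For example, routine Pandas operations work directly on these tables. A brief sketch, using only column names that appear elsewhere in this document:

```python
import microhapdb

# Markers on chromosome 18, showing a few columns of the marker table
chr18 = microhapdb.markers[microhapdb.markers.Chrom == "chr18"]
print(chr18[["Name", "Chrom", "Ae", "In"]].head())

# Populations sourced from the 1000 Genomes Project (1KGP)
print(microhapdb.populations.query("Source == '1KGP'").head())
```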

MicroHapDB also includes 4 auxiliary tables, which may be useful in a variety of scenarios; a brief example follows the list below.

- `microhapdb.variantmap`: contains a mapping of dbSNP variants to their corresponding microhap markers
- `microhapdb.idmap`: cross-references external names and identifiers with internal MicroHapDB identifiers
- `microhapdb.sequences`: contains the sequence spanning and flanking each microhap locus
- `microhapdb.indels`: contains variant information for markers that include insertion/deletion variants
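
As one illustration, the `variantmap` table can be used to find which marker(s) include a given dbSNP variant. A minimal sketch, using an rsID mentioned above:

```python
import microhapdb

# Which microhap marker(s) include dbSNP variant rs8192488?
hits = microhapdb.variantmap[microhapdb.variantmap.Variant == "rs8192488"]
print(hits)
```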
For users with programming experience, the contents of MicroHapDB can be accessed programmatically from the `microhapdb` Python package.
Including `import microhapdb` at the top of a Python program will provide access to the following resources (a short sketch follows the list below).

- database tables, stored in memory as [Pandas][]<sup>[1]</sup> DataFrame objects
- `microhapdb.markers`
- `microhapdb.populations`
- `microhapdb.frequencies`
- convenience functions for data retrieval
- some return `Marker` or `Population` objects with numerous auxiliary attributes and methods
- `marker = microhapdb.Marker.from_id("mh02USC-2pC")`
- `for marker in microhapdb.Marker.from_ids(idlist):`
- `for marker in microhapdb.Marker.from_query("Source == '10.1007/s00414-020-02483-x'"):`
- `for marker in microhapdb.Marker.from_region("chr11:25000000-50000000"):`
- `population = microhapdb.Population.from_id("IBS")`
- `for population in microhapdb.Population.from_ids(idlist):`
- `for population in microhapdb.Population.from_query("Name.str.contains('Afr')"):`
- others return DataFrame objects with subsets of the primary database tables
- `result = microhapdb.Marker.table_from_ids(idlist)`
- `result = microhapdb.Marker.table_from_query("Source == '10.1007/s00414-020-02483-x'")`
- `result = microhapdb.Marker.table_from_region("chr11:25000000-50000000")`
- `result = microhapdb.Population.table_from_ids(idlist)`
- `result = microhapdb.Population.table_from_query("Name.str.contains('Afr')")`
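
A short sketch putting a few of these together, using identifiers and column names taken from the examples above:

```python
import microhapdb

# Markers in a region of chromosome 11, as a DataFrame ranked by effective
# number of alleles (Ae), highest first
region = microhapdb.Marker.table_from_region("chr11:25000000-50000000")
print(region.sort_values("Ae", ascending=False).head())

# Individual Marker and Population objects
marker = microhapdb.Marker.from_id("mh02USC-2pC")
population = microhapdb.Population.from_id("IBS")
print(marker, population)
```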
</details>

<details>
<summary>Direct file access</summary>

### Direct file access

For users who want to circumvent MicroHapDB's CLI and API, the database tables can be accessed directly and loaded into R, Python, Excel, or any preferred environment.
Running `microhapdb --files` on the command line will reveal the location of these files.

> *__WARNING__: Modifying the contents of the database files may cause problems with MicroHapDB. Any user wishing to sort, filter, or otherwise manipulate the contents of the core database files should instead copy those files and manipulate the copies.*
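
A minimal sketch of that workflow, assuming a hypothetical install location (run `microhapdb --files` to see the real paths):

```python
import shutil

import pandas as pd

# Copy a table out of the package data directory (the path below is hypothetical),
# then work on the copy, as the warning above advises
shutil.copy("/path/to/microhapdb/data/marker.tsv", "marker-copy.tsv")
markers = pd.read_csv("marker-copy.tsv", sep="\t")
print(markers.head())
```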
</details>

## Ranking Markers
42 changes: 21 additions & 21 deletions dbbuild/sources/kureshi2020/marker.tsv
@@ -1,21 +1,21 @@
-Name Xref NumVars Chrom OffsetsHg37 OffsetsHg38 VarRef
-mh02ZHA-012 3 chr2 "rs949778,rs867005,rs952210"
-mh03ZHA-001 4 chr3 "rs4858685,rs4858686,rs75773180,rs9838878"
-mh04ZHA-001 4 chr4 "rs6830692,rs9714725,rs12501341,rs10939388"
-mh04ZHA-002 4 chr4 "rs10939597,rs79276692,rs62409414,rs62409415"
-mh04ZHA-004 4 chr4 "rs10049992,rs1914740,rs1714017,rs6835177"
-mh04ZHA-007 3 chr4 "rs6819048,rs62308082,rs74383997"
-mh05ZHA-004 3 chr5 "rs2457087,rs2644662,rs2662178"
-mh07ZHA-003 4 chr7 "rs4724041,rs378367,rs433709,rs404569"
-mh07ZHA-004 3 chr7 "rs6971410,rs2971679,rs3808323"
-mh07ZHA-009 4 chr7 "rs144858626,rs149890778,rs11773043,rs7792859"
-mh08ZHA-011 4 chr8 "rs4831247,rs13265601,rs4831248,rs13268053"
-mh09ZHA-008 3 chr9 "rs11506774,rs10981667,rs10739387"
-mh10ZHA-002 4 chr10 "rs10764175,rs148665640,rs10827896,rs10827897"
-mh11ZHA-006a 4 chr11 "rs3809057,rs3809056,rs3809055,rs3809054"
-mh14ZHA-003 3 chr14 "rs4902946,rs8012670,rs4902947"
-mh16ZHA-009 4 chr16 "rs76047588,rs11641186,rs11641193,rs80213582"
-mh17ZHA-001 3 chr17 "rs56023444,rs4131415,rs4260117"
-mh19ZHA-007 4 chr19 "rs8106726,rs8102417,rs59490836,rs10406130"
-mh19ZHA-009 5 chr19 "rs74178308,rs8108729,rs8107824,rs8108835,rs2560950"
-mh22ZHA-008 3 chr22 "rs11568183,rs8142282,rs8136173"
+Name Xref NumVars Chrom OffsetsHg37 OffsetsHg38 VarRef
+mh02ZHA-012 3 chr2 "rs949778,rs867005,rs952210"
+mh03ZHA-001 4 chr3 "rs4858685,rs4858686,rs75773180,rs9838878"
+mh04ZHA-001 4 chr4 "rs6830692,rs9714725,rs12501341,rs10939388"
+mh04ZHA-002 4 chr4 "rs10939597,rs79276692,rs62409414,rs62409415"
+mh04ZHA-004 4 chr4 "rs10049992,rs1914740,rs1714017,rs6835177"
+mh04ZHA-007 3 chr4 "rs6819048,rs62308082,rs74383997"
+mh05ZHA-004 3 chr5 "rs2457087,rs2644662,rs2662178"
+mh07ZHA-003 4 chr7 "rs4724041,rs378367,rs433709,rs404569"
+mh07ZHA-004 3 chr7 "rs6971410,rs2971679,rs3808323"
+mh07ZHA-009 4 chr7 "rs144858626,rs149890778,rs11773043,rs7792859"
+mh08ZHA-011 4 chr8 "rs4831247,rs13265601,rs4831248,rs13268053"
+mh09ZHA-008 3 chr9 "rs11506774,rs10981667,rs10739387"
+mh10ZHA-002 4 chr10 "rs10764175,rs148665640,rs10827896,rs10827897"
+mh11ZHA-006a 4 chr11 "rs3809057,rs3809056,rs3809055,rs3809054"
+mh14ZHA-003 3 chr14 "rs4902946,rs8012670,rs4902947"
+mh16ZHA-009 4 chr16 "rs76047588,rs11641186,rs11641193,rs80213582"
+mh17ZHA-001 3 chr17 "rs56023444,rs4131415,rs4260117"
+mh19ZHA-007 4 chr19 "rs8106726,rs8102417,rs59490836,rs10406130"
+mh19ZHA-009 5 chr19 "rs74178308,rs8108729,rs8107824,rs8108835,rs2560950"
+mh22ZHA-008 3 chr22 "rs11568183,rs8142282,rs8136173"
110 changes: 77 additions & 33 deletions microhapdb/__init__.py
@@ -1,58 +1,102 @@
#!/usr/bin/env python3
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2018, DHS.
#
# -----------------------------------------------------------------------------
# Copyright (c) 2018, Battelle National Biodefense Institute.
# This file is part of MicroHapDB (http://github.com/bioforensics/MicroHapDB) and is licensed under
# the BSD license: see LICENSE.txt.
#
# This file is part of MicroHapDB (http://github.com/bioforensics/microhapdb)
# and is licensed under the BSD license: see LICENSE.txt.
# -----------------------------------------------------------------------------
# This software was prepared for the Department of Homeland Security (DHS) by the Battelle National
# Biodefense Institute, LLC (BNBI) as part of contract HSHQDC-15-C-00064 to manage and operate the
# National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and
# Development Center.
# -------------------------------------------------------------------------------------------------


from microhapdb.util import data_file
from .tables import markers, populations, frequencies, variantmap, idmap, sequences, indels
from .population import Population
from .marker import Marker
from microhapdb import cli
from microhapdb import retrieve
from microhapdb import marker
from microhapdb import panel
from microhapdb import population
import os
import pandas as pd
from pkg_resources import resource_filename
import pandas
from ._version import get_versions
__version__ = get_versions()['version']

__version__ = get_versions()["version"]
del get_versions


def data_file(path):
return resource_filename("microhapdb", f"data/{path}")


def set_ae_population(popid=None):
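# Replace the Ae column of the global marker table with values computed for the
# specified population; when popid is None, restore the default Ae values from marker.tsv.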
global markers
columns = ['Name', 'PermID', 'Reference', 'Chrom', 'Offsets', 'Ae', 'In', 'Fst', 'Source']
columns = ["Name", "PermID", "Reference", "Chrom", "Offsets", "Ae", "In", "Fst", "Source"]
if popid is None:
defaults = pandas.read_csv(data_file('marker.tsv'), sep='\t')
defaults = defaults[['Name', 'Ae']]
markers = markers.drop(columns=['Ae']).join(defaults.set_index('Name'), on='Name')[columns]
defaults = pd.read_csv(data_file("marker.tsv"), sep="\t")
defaults = defaults[["Name", "Ae"]]
markers = markers.drop(columns=["Ae"]).join(defaults.set_index("Name"), on="Name")[columns]
else:
aes = pandas.read_csv(data_file('marker-aes.tsv'), sep='\t')
aes = pd.read_csv(data_file("marker-aes.tsv"), sep="\t")
if popid not in aes.Population.unique():
raise ValueError(f'no Ae data for population "{popid}"')
popaes = aes[aes.Population == popid].drop(columns=['Population'])
markers = markers.drop(columns=['Ae']).join(popaes.set_index('Marker'), on='Name')[columns]
popaes = aes[aes.Population == popid].drop(columns=["Population"])
markers = markers.drop(columns=["Ae"]).join(popaes.set_index("Marker"), on="Name")[columns]


def set_reference(refr):
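# Switch the coordinates in the global marker table between GRCh38 (the default)
# and GRCh37 offsets.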
global markers
assert refr in (37, 38)
columns = ['Name', 'PermID', 'Reference', 'Chrom', 'Offsets', 'Ae', 'In', 'Fst', 'Source']
columns = ["Name", "PermID", "Reference", "Chrom", "Offsets", "Ae", "In", "Fst", "Source"]
if refr == 38:
defaults = pandas.read_csv(data_file('marker.tsv'), sep='\t')[['Name', 'Reference', 'Offsets']]
markers = markers.drop(columns=['Reference', 'Offsets']).join(defaults.set_index('Name'), on='Name')[columns]
defaults = pd.read_csv(data_file("marker.tsv"), sep="\t")[["Name", "Reference", "Offsets"]]
markers = markers.drop(columns=["Reference", "Offsets"]).join(
defaults.set_index("Name"), on="Name"
)[columns]
else:
o37 = pandas.read_csv(data_file('marker-offsets-GRCh37.tsv'), sep='\t')
markers = markers.drop(columns=['Reference', 'Offsets']).join(o37.set_index('Marker'), on='Name')[columns]
o37 = pd.read_csv(data_file("marker-offsets-GRCh37.tsv"), sep="\t")
markers = markers.drop(columns=["Reference", "Offsets"]).join(
o37.set_index("Marker"), on="Name"
)[columns]


markers = pandas.read_csv(data_file('marker.tsv'), sep='\t')
populations = pandas.read_csv(data_file('population.tsv'), sep='\t')
frequencies = pandas.read_csv(data_file('frequency.tsv'), sep='\t')
variantmap = pandas.read_csv(data_file('variantmap.tsv'), sep='\t')
idmap = pandas.read_csv(data_file('idmap.tsv'), sep='\t')
sequences = pandas.read_csv(data_file('sequences.tsv'), sep='\t')
indels = pandas.read_csv(data_file('indels.tsv'), sep='\t')
def retrieve_by_id(ident):
"""Retrieve records by name or identifier
>>> retrieve_by_id("mh17KK-014")
Name PermID Reference Chrom Offsets Ae In Fst Source
510 mh17KK-014 MHDBM-83a239de GRCh38 chr17 4497060,4497088,4497096 2.0215 0.6423 0.3014 ALFRED
>>> retrieve_by_id("SI664726F")
Name PermID Reference Chrom Offsets Ae In Fst Source
510 mh17KK-014 MHDBM-83a239de GRCh38 chr17 4497060,4497088,4497096 2.0215 0.6423 0.3014 ALFRED
>>> retrieve_by_id("MHDBM-ea520d26")
Name PermID Reference Chrom Offsets Ae In Fst Source
539 mh18KK-285 MHDBM-ea520d26 GRCh38 chr18 24557354,24557431,24557447,24557489 2.7524 0.1721 0.0836 ALFRED
>>> retrieve_by_id("PJL")
ID Name Source
82 PJL Punjabi from Lahore, Pakistan 1KGP
>>> retrieve_by_id("Asia")
ID Name Source
7 MHDBP-936bc36f79 Asia 10.1016/j.fsigen.2018.05.008
>>> retrieve_by_id("Japanese")
ID Name Source
45 MHDBP-63967b883e Japanese 10.1016/j.legalmed.2015.06.003
46 SA000010B Japanese ALFRED
"""

def id_in_series(ident, series):
return series.str.contains(ident).any()

if id_in_series(ident, idmap.Xref):
result = idmap[idmap.Xref == ident]
assert len(result) == 1
ident = result.ID.iloc[0]
id_in_pop_ids = id_in_series(ident, populations.ID)
id_in_pop_names = id_in_series(ident, populations.Name)
id_in_variants = id_in_series(ident, variantmap.Variant)
id_in_marker_names = id_in_series(ident, markers.Name)
id_in_marker_permids = id_in_series(ident, markers.PermID)
if id_in_pop_ids or id_in_pop_names:
return Population.table_from_ids([ident])
elif id_in_variants or id_in_marker_names or id_in_marker_permids:
return Marker.table_from_ids([ident])
else:
raise ValueError(f'identifier "{ident}" not found in MicroHapDB')