Database backend rebuild #106

Merged
merged 43 commits into from
Feb 23, 2023
Conversation

Member

@standage standage commented Feb 23, 2023

The MicroHapDB "backend" is really just a handful of tabular files. These are the most usable and useful contribution of MicroHapDB to the community, but the code and procedure for compiling these tables from primary data sources (the real "backend") is also an important community contribution.

This PR made some changes to marker nomenclature, the structure of the database marker table, and the frequency estimation procedure. In the process, the database build code was revisited and streamlined.

The most important changes are detailed below.

  • MicroHapDB now explicitly handles markers that are defined in different ways but using the same name. The standard MH nomenclature (Kidd 2016) applies to the locus, and when there are multiple competing marker definitions at a locus, MicroHapDB assigns a suffix to the locus name so that the various markers can be distinguished.
    • Example: mh11KK-191 (chr11:100009431-100009620)
      • mh11KK-191.v1 (Kidd 2018): rs12421109;rs12289401;rs12420819;rs770566
      • mh11KK-191.v2 (Gandotra 2020): rs12421109;rs12289401;rs12420819;rs11222337;rs770566
      • mh11KK-191.v3 (Staadig 2021): rs12421109;rs12289401
      • mh11KK-191.v4 (Pakstis 2021): rs12421109;rs12289401;rs12420819;rs1315919758;rs11222337;rs770566
  • The main marker table now includes a superset of the features a user may want to query. Rather than being stored in a separate table, fields such as GRCh37 coordinates are now kept in the main table. In the default tabular mode, the microhapdb marker command shows only a subset of fields for each marker, and the user can select which fields to show.
  • The Staadig 2021 paper replaces the preliminary data from the 2019 ISFG poster used previously.
  • The NYGC 2022 update to the 1000 Genomes Project (Byrska-Bishop 2022) was used to re-estimate haplotype frequencies for all MHs. Markers with rare SNPs were not discarded; instead, the reference allele was assumed at those SNPs. (Indel handling was not validated, so use of MicroHapDB frequencies for markers with indels in the definition is not recommended.)
    • The updated estimates are for non-admixed superpopulation groups (EUR, AFR, SAS, EAS) and aggregated over all 1000 Genomes data (1KGP).
    • Previously computed estimates for the 26 global populations from the 2015 data are still accessible.
  • Support for query by database cross-reference has been minimized.
  • The "sequences" table has been dropped. Instead, the pyfaidx package is now used to retrieve MH reference sequences directly from the GRCh38 reference assembly. The reference assembly must be downloaded as part of the install process.
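
To illustrate the marker naming change described in the first item above, here is a minimal sketch of how version suffixes could be assigned when several definitions share a locus name. This is not the actual MicroHapDB build code: the function name `assign_suffixes`, the tuple layout, and the rule that suffixes follow input order are all assumptions made for illustration. The second locus (`mh00XX-000`, `rs0000001`) is a made-up placeholder, included only to show that a locus with a single definition keeps its bare name.

```python
from collections import defaultdict

# (locus name, source, variant IDs) -- the first four rows come from the
# mh11KK-191 example in this PR; the last row is a hypothetical placeholder.
DEFINITIONS = [
    ("mh11KK-191", "Kidd 2018", "rs12421109;rs12289401;rs12420819;rs770566"),
    ("mh11KK-191", "Gandotra 2020", "rs12421109;rs12289401;rs12420819;rs11222337;rs770566"),
    ("mh11KK-191", "Staadig 2021", "rs12421109;rs12289401"),
    ("mh11KK-191", "Pakstis 2021", "rs12421109;rs12289401;rs12420819;rs1315919758;rs11222337;rs770566"),
    ("mh00XX-000", "Hypothetical 0000", "rs0000001"),
]


def assign_suffixes(definitions):
    """Return (marker name, source, variant IDs) tuples.

    Loci with several competing definitions get .v1, .v2, ... suffixes in
    input order (an assumption); loci with one definition keep the bare name.
    """
    by_locus = defaultdict(list)
    for locus, source, variants in definitions:
        by_locus[locus].append((source, variants))

    named = []
    for locus, defs in by_locus.items():
        if len(defs) == 1:
            source, variants = defs[0]
            named.append((locus, source, variants))
        else:
            for version, (source, variants) in enumerate(defs, start=1):
                named.append((f"{locus}.v{version}", source, variants))
    return named


if __name__ == "__main__":
    for name, source, variants in assign_suffixes(DEFINITIONS):
        print(f"{name}\t{source}\t{variants}")
```

With the inputs above, the four mh11KK-191 definitions come out as mh11KK-191.v1 through mh11KK-191.v4, matching the example in the list, and the placeholder locus is left unsuffixed.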

  • Update change log
  • Update documentation

@standage standage marked this pull request as ready for review February 23, 2023 17:45
@standage standage merged commit eefa6ae into master Feb 23, 2023
@standage standage deleted the rebuild branch February 23, 2023 18:23
@standage standage mentioned this pull request Feb 23, 2023
@standage standage mentioned this pull request Mar 8, 2023