Genome, transcript, and protein sequence variants are typically reported using the variation nomenclature ("varnomen") recommendations provided by the Human Genome Variation Society (HGVS) (Taschner and den Dunnen, 2011). Most variants look deceptively simple, such as NM_021960.4:c.740C>T. In reality, the varnomen standard provides for much more complex concepts and representations.

As high-throughput sequencing becomes commonplace in the investigation and diagnosis of disease, it is essential that variants can be communicated easily and accurately, both from sequencing projects to the scientific community and from diagnostic laboratories to health care providers. The HGVS mutation nomenclature recommendations are generally accepted for the communication of sequence variation: they are widely endorsed by professional organizations, mandated by numerous journals, and are the prevalent representation used by databases and interactive scientific software tools. The guidelines, originally devised to standardize the representation of variants discovered before the advent of high-throughput sequencing, are now approved by the HGVS and continue to evolve under the auspices of the Human Variome Project. Unfortunately, the complexity of biological phenomena and the breadth of the varnomen standard make it difficult to implement the standard in software, which in turn makes the standard difficult to use in high-throughput analyses.

This package, hgvs, is an easy-to-use Python library for parsing, representing, formatting, and mapping variants between genome, transcript, and protein sequences. The current implementation handles most (but not all) of the varnomen standard for precisely defined sequence variants. The intent is to centralize the subset of HGVS variant manipulation that is routinely used in modern, high-throughput sequencing analysis.

Features of the hgvs Package

  • Convenient object representation. Manipulate variants conceptually rather than by modifying text strings. Classes model HGVS concepts such as :class:`Interval <hgvs.location.Interval>`, intronic offsets (in :class:`BaseOffsetPosition <hgvs.location.BaseOffsetPosition>`), uncertainty, and types of variation (:mod:`hgvs.edit`).
  • A grammar-based parser. hgvs uses :doc:`a formal grammar <hgvs_railroad>` to parse HGVS variants rather than string partitioning or regular expression pattern matching. This makes parsing easier to understand, extend, and validate.
  • Simple variant formatting. Object representations of variants may be turned into HGVS strings simply by printing or "stringifying" them.
  • Robust variant mapping. The package includes tools to map variants between genome, transcript, and protein sequences (:class:`VariantMapper <hgvs.variantmapper.VariantMapper>`) and to perform liftover between two transcripts via a common reference (:class:`Projector <hgvs.projector.Projector>`). The hgvs mapper is specifically designed to reliably handle regions of reference-transcript indel discrepancy that are not covered by other tools.
  • Additional variant validation. The package includes tools to validate variants, separate from the syntactic validation provided by the grammar.
  • Extensible data sources. Mapping and sequence data come from `UTA`_ by default, but the package includes a well-defined service interface that enables alternative data sources.
  • Extensive automated tests. We run extensive automated tests consisting of all supported variant types on many genes for every single commit to the source code repository. Test results are displayed publicly and immediately.
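
The object-representation and formatting features listed above can be illustrated with a toy model. The sketch below is not the hgvs API: the names `ToyPosition`, `ToySubstitution`, `ToyVariant`, and `parse_toy_substitution` are hypothetical, it covers only single-base substitutions on `c.` and `n.` coordinates, and it splits the string by hand for brevity, which is the approach the real package avoids in favor of a formal grammar. It only aims to show why structured objects are easier to work with than raw strings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyPosition:
    """Transcript position with an optional intronic offset (e.g. 740 or 88+2)."""
    base: int
    offset: int = 0

    def __str__(self) -> str:
        # An offset of 0 means the position lies on the transcript itself.
        return str(self.base) if self.offset == 0 else f"{self.base}{self.offset:+d}"


@dataclass(frozen=True)
class ToySubstitution:
    """Single-base substitution such as C>T."""
    ref: str
    alt: str

    def __str__(self) -> str:
        return f"{self.ref}>{self.alt}"


@dataclass(frozen=True)
class ToyVariant:
    """Hypothetical stand-in for a parsed variant; not an hgvs class."""
    accession: str
    kind: str
    position: ToyPosition
    edit: ToySubstitution

    def __str__(self) -> str:
        # "Stringifying" the object reproduces the HGVS-style text.
        return f"{self.accession}:{self.kind}.{self.position}{self.edit}"


def parse_toy_substitution(text: str) -> ToyVariant:
    """Parse '<accession>:<c|n>.<base>[+/-offset]<ref>><alt>' (toy subset only)."""
    accession, colon, rest = text.partition(":")
    kind, dot, posedit = rest.partition(".")
    if not (accession and colon and dot) or kind not in ("c", "n"):
        raise ValueError(f"unsupported variant string: {text!r}")
    if len(posedit) < 4 or posedit[-2] != ">":
        raise ValueError(f"only single-base substitutions are supported: {text!r}")
    ref, alt = posedit[-3], posedit[-1]
    if ref not in "ACGT" or alt not in "ACGT":
        raise ValueError(f"unexpected base in {text!r}")
    pos_text = posedit[:-3]
    # Search from index 1 so that a leading sign is not mistaken for an offset.
    sign_at = max(pos_text.find("+", 1), pos_text.find("-", 1))
    if sign_at == -1:
        position = ToyPosition(int(pos_text))
    else:
        position = ToyPosition(int(pos_text[:sign_at]), int(pos_text[sign_at:]))
    return ToyVariant(accession, kind, position, ToySubstitution(ref, alt))


variant = parse_toy_substitution("NM_021960.4:c.740C>T")
print(variant.position.base, variant.edit.alt)  # 740 T

# Change the alternate base on the object instead of editing the string.
changed = replace(variant, edit=ToySubstitution("C", "G"))
print(changed)  # NM_021960.4:c.740C>G
```

Because each part of the variant is a field, changing the edit or inspecting an intronic offset does not require any string surgery, and printing the object always yields a consistently formatted string.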


Some HGVS recommendations are intentionally absent. This package is primarily concerned with the subset of the `VarNomen`_ recommendations that are relevant for high-throughput sequencing. See `issues`_ for a full set of bugs and feature requests.

Related tools

  • Mutalyzer provides a web interface to variant validation and mapping.
  • The Counsyl hgvs package provides functionality conceptually similar to that of the Invitae hgvs package.


