Problem
Three gaps in MaveDB's current reverse translation system:
- Coverage ceiling: We can only generate reverse translations for variants already registered in ClinGen. Variants absent from ClinGen have no nucleotide-level equivalents surfaced.
- Annotation linkage: Derived nucleotide alleles (the NT variants encoding a protein change) have no first-class representation in our data model and cannot be independently annotated.
- API semantics: There is no structured way to tell the UI and API consumers that a set of nucleotide variants is a categorical representation of a protein-level scored variant rather than a set of directly measured variants.
Architecture
Data model (AssayedVariant → MappingRecord → Allele):
- MappingRecord replaces MappedVariant as the provenance record per mapping run. It carries vrs_digest (indexed, pre-mapped variant), pre_mapped JSONB, assay_level, mapping metadata, and QC fields. It has an M:N relationship to Allele rows via a mapping_record_alleles association table.
- Allele is a flat table (no inheritance) deduplicated by VRS digest across all score sets. Key columns:

| Column | Notes |
| --- | --- |
| vrs_digest | unique |
| level | enum: genomic, coding, protein |
| transcript | NOT NULL; present for all levels |
| hgvs_g / hgvs_c / hgvs_p | nullable; populated in post-processing, enforced at application layer |
| clingen_allele_id | nullable, populated where available |
| post_mapped | JSONB; raw mapper output for this allele at this level |

The same allele appearing in multiple score sets shares a single row; annotation results are shared accordingly.
- Annotation FKs point to allele_id. AnnotationStatus is scoped to QC/audit only. Annotation data lives in first-class per-type tables with superseded_at for temporal queries: VEPAnnotation (new), GnomADVariant (updated), ClinicalControl (updated).
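The data model above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the actual MaveDB schema: the primary-key names, the helper functions get_or_create_allele and link_allele, and the use of TEXT in place of JSONB are assumptions made so the example runs standalone. It shows the two properties the design relies on: one Allele row per VRS digest, and an M:N association back to mapping records.

```python
import sqlite3

# Table and column names follow the issue text; id columns and TEXT-for-JSONB
# are assumptions for this sketch.
SCHEMA = """
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL,      -- digest of the pre-mapped variant
    pre_mapped TEXT,               -- JSONB in the proposed schema
    assay_level TEXT NOT NULL
);
CREATE INDEX ix_mapping_records_vrs_digest ON mapping_records (vrs_digest);

CREATE TABLE alleles (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL UNIQUE,
    level TEXT NOT NULL CHECK (level IN ('genomic', 'coding', 'protein')),
    transcript TEXT NOT NULL,
    hgvs_g TEXT,
    hgvs_c TEXT,
    hgvs_p TEXT,
    clingen_allele_id TEXT,
    post_mapped TEXT               -- JSONB in the proposed schema
);

CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL REFERENCES mapping_records (id),
    allele_id INTEGER NOT NULL REFERENCES alleles (id),
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""


def get_or_create_allele(conn, vrs_digest, level, transcript):
    """Return the id of the allele with this digest, inserting it if absent."""
    conn.execute(
        "INSERT OR IGNORE INTO alleles (vrs_digest, level, transcript) "
        "VALUES (?, ?, ?)",
        (vrs_digest, level, transcript),
    )
    row = conn.execute(
        "SELECT id FROM alleles WHERE vrs_digest = ?", (vrs_digest,)
    ).fetchone()
    return row[0]


def link_allele(conn, mapping_record_id, allele_id):
    """Associate a mapping record with an allele (idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO mapping_record_alleles VALUES (?, ?)",
        (mapping_record_id, allele_id),
    )


def new_database():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Two mapping records from different score sets that resolve to the same digest end up linked to the same alleles row, which is what lets annotation results be shared.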
Schema rule: Fields stable by construction (HGVS strings, ClinGen IDs) live as columns on the alleles / mapping_records tables. External interpretations subject to revision (VEP, gnomAD, ClinVar) live in temporal annotation tables with superseded_at.
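One way to read the schema rule is sketched below with sqlite3. Only allele_id and superseded_at come from the text above; the vep_annotations table name, the consequence and created_at columns, and both helper functions are hypothetical. The idea is that a new annotation never overwrites the old one: the current row is stamped with superseded_at and a fresh row is inserted, so a query can recover what was believed at any point in time.

```python
import sqlite3


def create_annotation_table(conn):
    # Hypothetical temporal annotation table; timestamps are ISO-8601 strings,
    # which compare correctly as plain text.
    conn.execute(
        """
        CREATE TABLE vep_annotations (
            id INTEGER PRIMARY KEY,
            allele_id INTEGER NOT NULL,
            consequence TEXT NOT NULL,
            created_at TEXT NOT NULL,
            superseded_at TEXT
        )
        """
    )


def record_annotation(conn, allele_id, consequence, now):
    """Supersede the current annotation for this allele and insert a new one."""
    conn.execute(
        "UPDATE vep_annotations SET superseded_at = ? "
        "WHERE allele_id = ? AND superseded_at IS NULL",
        (now, allele_id),
    )
    conn.execute(
        "INSERT INTO vep_annotations (allele_id, consequence, created_at) "
        "VALUES (?, ?, ?)",
        (allele_id, consequence, now),
    )


def annotation_as_of(conn, allele_id, when):
    """Return the consequence that was current at time `when`, or None."""
    row = conn.execute(
        "SELECT consequence FROM vep_annotations "
        "WHERE allele_id = ? AND created_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (allele_id, when, when),
    ).fetchone()
    return row[0] if row else None
```

Stable fields such as HGVS strings need none of this machinery, which is why they can live directly on the alleles and mapping_records tables.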
Pipeline:
- Mapping pipeline (dcd_mapping) produces a MappingRecord + Allele rows at all applicable levels for every variant. For protein-level targets, reverse translation enumerates all coding variants encoding each amino acid change.
- Coding and genomic Allele rows (level = 'coding' or level = 'genomic') are submitted to ClinGen pre-registration. Alleles already registered are skipped.
- Existing annotation jobs (VEP, ClinVar, gnomAD) are extended to annotate Allele rows via allele_id, with level-appropriate routing driven by the level column.
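The reverse translation step in the pipeline can be illustrated at the level of a single codon. The sketch below is not the dcd_mapping implementation; the function name reverse_translate and its return shape are assumptions, and it assumes the standard genetic code. Given the reference codon at a position and the substituted amino acid (one-letter code), it lists every codon that encodes that amino acid along with how many bases differ from the reference.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_translate(ref_codon, alt_aa):
    """List (codon, n_changed_bases) for every codon encoding alt_aa.

    The reference codon itself is excluded, so a synonymous query returns
    only the alternative codons. Results are sorted by number of changes,
    then alphabetically.
    """
    ref_codon = ref_codon.upper()
    if ref_codon not in CODON_TABLE:
        raise ValueError(f"not a DNA codon: {ref_codon!r}")
    results = []
    for codon, aa in CODON_TABLE.items():
        if aa != alt_aa or codon == ref_codon:
            continue
        n_changes = sum(r != a for r, a in zip(ref_codon, codon))
        results.append((codon, n_changes))
    return sorted(results, key=lambda item: (item[1], item[0]))


# Met (ATG) to Leu: six Leu codons, two of them reachable by a single base change.
print(reverse_translate("ATG", "L"))
# [('CTG', 1), ('TTG', 1), ('CTA', 2), ('CTC', 2), ('CTT', 2), ('TTA', 2)]
```

Each resulting codon would correspond to one Allele row with level = 'coding'; turning a codon into a transcript-anchored variant requires the transcript sequence and coordinates, which this sketch leaves out.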
API transit:
- CatVRS (GA4GH Categorical Variation Representation Specification) is used as a transit layer in API responses to express that a set of coding alleles share an implied score — they encode a protein change that was scored, but are not scored directly.
- Traversal: AssayedVariant with MappingRecord.assay_level = protein → mapping_record_alleles → Allele rows where level = 'coding' → CatVRS members.
- Storage remains in the alleles table; CatVRS lives only in responses.
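The traversal above might look like the following at response-build time. This is a rough sketch under stated assumptions: the dataclasses stand in for ORM models, the function name categorical_view is hypothetical, and the dictionary keys (including implied_score and directly_measured) are illustrative only and have not been validated against the CatVRS schema. The point is that the categorical grouping is computed from stored Allele rows when a response is built and is never persisted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Allele:
    vrs_digest: str
    level: str                      # 'genomic', 'coding', or 'protein'
    hgvs: Optional[str] = None


@dataclass
class MappingRecord:
    vrs_digest: str
    assay_level: str
    # Stand-in for the mapping_record_alleles association.
    alleles: List[Allele] = field(default_factory=list)


def categorical_view(record, score):
    """Build a response-only grouping of coding alleles for a protein-level record.

    Returns None when the variant was not assayed at the protein level, since
    its alleles were then measured directly and need no categorical wrapper.
    """
    if record.assay_level != "protein":
        return None
    members = [a for a in record.alleles if a.level == "coding"]
    return {
        "type": "CategoricalVariant",   # illustrative shape, not a validated CatVRS object
        "members": [{"digest": a.vrs_digest, "hgvs": a.hgvs} for a in members],
        "implied_score": score,         # hypothetical key: score inherited from the protein variant
        "directly_measured": False,     # hypothetical key
    }
```

Keeping this as a pure function over already-loaded rows means the alleles table stays the single source of truth and the response format can change without a migration.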
Child Issues
dcd_mapping2
Data Model & Storage
ClinGen Pre-Registration
Annotation Pipeline
API Layer
Backfill
Deferred