Context
Depends on: #742
The new mapping pipeline covers score sets going forward. Existing score sets in MaveDB have MappedVariant records under the old schema. A two-phase backfill is needed: (1) migrate all existing MappedVariant rows to the new MappingRecord + flat Allele schema, and (2) run reverse translation for protein-level score sets that currently lack coding and genomic Allele rows.
Goal
A one-time (re-runnable) backfill job that migrates existing MappedVariant data to the new schema and ensures all protein-level score sets have a full complement of coding and genomic Allele rows with ClinGen registration and annotations.
Phase 1 — Schema migration of existing data
For each existing MappedVariant:
- Create a
MappingRecord row carrying provenance fields (pre_mapped, mapping_api_version, mapped_date, current, QC fields)
- Create
Allele rows from existing hgvs_g, hgvs_c, hgvs_p columns, populating level, transcript, and post_mapped as available (upsert by VRS digest)
- Create
mapping_record_alleles association rows
- Migrate
VariantAnnotationStatus rows from variant_id to allele_id
- Repoint
gnomad_variants and clinical_controls M2M associations from mapped_variant_id to allele_id
Phase 2 — Reverse translation for protein-level score sets
For each protein-level score set with MappingRecord rows but no associated coding/genomic Allele rows:
Open Question: variant_translations deprecation
Once all score sets have Allele rows at all levels, the variant_translations flat lookup table may be redundant. Whether the FK join (AssayedVariant → MappingRecord → Allele) is fast enough for search at scale should be benchmarked during the backfill. If performant, variant_translations can be deprecated; if not, it stays as a denormalized search index populated from Allele FK data.
Acceptance Criteria
- Phase 1 migrates all existing
MappedVariant rows to MappingRecord + Allele rows without data loss
- All
VariantAnnotationStatus, gnomad_variants, and clinical_controls associations are correctly remapped to allele_id
- Phase 2 identifies protein-level score sets with no coding/genomic
Allele rows and runs the full pipeline for each
- Job is idempotent — re-running does not create duplicate records
- Progress and failures are observable via logging and status tracking
- FK join query performance (
AssayedVariant → MappingRecord → Allele) is benchmarked against variant_translations flat lookup at scale to inform the deprecation decision
Context
Depends on: #742
The new mapping pipeline covers score sets going forward. Existing score sets in MaveDB have
MappedVariantrecords under the old schema. A two-phase backfill is needed: (1) migrate all existingMappedVariantrows to the newMappingRecord+ flatAlleleschema, and (2) run reverse translation for protein-level score sets that currently lack coding and genomicAllelerows.Goal
A one-time (re-runnable) backfill job that migrates existing
MappedVariantdata to the new schema and ensures all protein-level score sets have a full complement of coding and genomicAllelerows with ClinGen registration and annotations.Phase 1 — Schema migration of existing data
For each existing
MappedVariant:MappingRecordrow carrying provenance fields (pre_mapped,mapping_api_version,mapped_date,current, QC fields)Allelerows from existinghgvs_g,hgvs_c,hgvs_pcolumns, populatinglevel,transcript, andpost_mappedas available (upsert by VRS digest)mapping_record_allelesassociation rowsVariantAnnotationStatusrows fromvariant_idtoallele_idgnomad_variantsandclinical_controlsM2M associations frommapped_variant_idtoallele_idPhase 2 — Reverse translation for protein-level score sets
For each protein-level score set with
MappingRecordrows but no associated coding/genomicAllelerows:Allelerows (level = 'coding'andlevel = 'genomic') and link viamapping_record_allelesOpen Question:
variant_translationsdeprecationOnce all score sets have
Allelerows at all levels, thevariant_translationsflat lookup table may be redundant. Whether the FK join (AssayedVariant → MappingRecord → Allele) is fast enough for search at scale should be benchmarked during the backfill. If performant,variant_translationscan be deprecated; if not, it stays as a denormalized search index populated fromAlleleFK data.Acceptance Criteria
MappedVariantrows toMappingRecord+Allelerows without data lossVariantAnnotationStatus,gnomad_variants, andclinical_controlsassociations are correctly remapped toallele_idAllelerows and runs the full pipeline for eachAssayedVariant → MappingRecord → Allele) is benchmarked againstvariant_translationsflat lookup at scale to inform the deprecation decision