feat: Retroactive backfill — migrate MappedVariant data and run reverse translation for existing score sets #747

@bencap

Context

Depends on: #742

The new mapping pipeline covers score sets going forward. Existing score sets in MaveDB have MappedVariant records under the old schema. A two-phase backfill is needed: (1) migrate all existing MappedVariant rows to the new MappingRecord + flat Allele schema, and (2) run reverse translation for protein-level score sets that currently lack coding and genomic Allele rows.

Goal

A one-time (re-runnable) backfill job that migrates existing MappedVariant data to the new schema and ensures all protein-level score sets have a full complement of coding and genomic Allele rows with ClinGen registration and annotations.

Phase 1 — Schema migration of existing data

For each existing MappedVariant:

  • Create a MappingRecord row carrying provenance fields (pre_mapped, mapping_api_version, mapped_date, current, QC fields)
  • Create Allele rows from existing hgvs_g, hgvs_c, hgvs_p columns, populating level, transcript, and post_mapped as available (upsert by VRS digest)
  • Create mapping_record_alleles association rows
  • Migrate VariantAnnotationStatus rows from variant_id to allele_id
  • Repoint gnomad_variants and clinical_controls M2M associations from mapped_variant_id to allele_id
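The first three steps above could be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database, not the actual MaveDB models: the table layouts, the `legacy_mapped_variant_id` column, and the `placeholder_digest` helper are all assumptions made for the example. In particular, a real implementation would key Allele rows on the actual VRS digest rather than a hash of the HGVS string. The annotation-status and M2M repointing steps are not covered.

```python
import hashlib
import sqlite3

# Hypothetical, heavily simplified schema; the real MaveDB tables differ.
SCHEMA = """
CREATE TABLE mapped_variants (
    id INTEGER PRIMARY KEY, variant_id INTEGER NOT NULL,
    hgvs_g TEXT, hgvs_c TEXT, hgvs_p TEXT,
    mapping_api_version TEXT, mapped_date TEXT, current INTEGER
);
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    legacy_mapped_variant_id INTEGER UNIQUE,  -- assumption: lets re-runs be no-ops
    variant_id INTEGER NOT NULL,
    mapping_api_version TEXT, mapped_date TEXT, current INTEGER
);
CREATE TABLE alleles (
    id INTEGER PRIMARY KEY, digest TEXT NOT NULL UNIQUE,
    level TEXT NOT NULL, hgvs TEXT NOT NULL
);
CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL, allele_id INTEGER NOT NULL,
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""

# (allele level, legacy column it is read from)
LEVELS = (("genomic", "hgvs_g"), ("coding", "hgvs_c"), ("protein", "hgvs_p"))


def placeholder_digest(hgvs: str) -> str:
    """Stand-in upsert key derived from the HGVS string. NOT a real VRS digest."""
    return hashlib.sha256(hgvs.encode("utf-8")).hexdigest()[:32]


def migrate_mapped_variant(conn: sqlite3.Connection, mv: sqlite3.Row) -> None:
    """Create the MappingRecord, Allele, and association rows for one legacy row."""
    conn.execute(
        "INSERT OR IGNORE INTO mapping_records "
        "(legacy_mapped_variant_id, variant_id, mapping_api_version, mapped_date, current) "
        "VALUES (?, ?, ?, ?, ?)",
        (mv["id"], mv["variant_id"], mv["mapping_api_version"], mv["mapped_date"], mv["current"]),
    )
    record_id = conn.execute(
        "SELECT id FROM mapping_records WHERE legacy_mapped_variant_id = ?", (mv["id"],)
    ).fetchone()[0]
    for level, column in LEVELS:
        hgvs = mv[column]
        if hgvs is None:
            continue  # populate only the levels the legacy row actually has
        digest = placeholder_digest(hgvs)
        # Upsert by digest: insert if new, otherwise reuse the existing allele.
        conn.execute(
            "INSERT OR IGNORE INTO alleles (digest, level, hgvs) VALUES (?, ?, ?)",
            (digest, level, hgvs),
        )
        allele_id = conn.execute(
            "SELECT id FROM alleles WHERE digest = ?", (digest,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO mapping_record_alleles VALUES (?, ?)",
            (record_id, allele_id),
        )


def run_phase_one(conn: sqlite3.Connection) -> int:
    """Migrate every legacy row; returns the number of rows visited."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM mapped_variants").fetchall()
    for mv in rows:
        migrate_mapped_variant(conn, mv)
    conn.commit()
    return len(rows)
```

Because every insert is guarded by a uniqueness constraint, calling `run_phase_one` a second time leaves the row counts unchanged, which is one way to get the re-runnable behavior the issue asks for.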

Phase 2 — Reverse translation for protein-level score sets

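For each protein-level score set with MappingRecord rows but no associated coding/genomic Allele rows, run the full pipeline so that the score set gains coding and genomic Allele rows with ClinGen registration and annotations.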
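The selection of Phase 2 candidates might look something like the query below. It is only a sketch against a hypothetical, simplified schema: it assumes `mapping_records` carries a `score_set_id` directly and that `alleles.level` takes the values `'protein'`, `'coding'`, and `'genomic'`, neither of which is confirmed by this issue. "Protein-level" is approximated here as "has at least one protein allele".

```python
import sqlite3

# Hypothetical, simplified schema; in MaveDB the path from a mapping record
# to its score set likely goes through the variant table.
SCHEMA = """
CREATE TABLE mapping_records (id INTEGER PRIMARY KEY, score_set_id INTEGER NOT NULL);
CREATE TABLE alleles (id INTEGER PRIMARY KEY, level TEXT NOT NULL);
CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL REFERENCES mapping_records(id),
    allele_id INTEGER NOT NULL REFERENCES alleles(id),
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""

# Score sets that have protein alleles but no coding or genomic alleles.
NEEDS_REVERSE_TRANSLATION = """
SELECT mr.score_set_id
FROM mapping_records AS mr
JOIN mapping_record_alleles AS mra ON mra.mapping_record_id = mr.id
JOIN alleles AS a ON a.id = mra.allele_id
GROUP BY mr.score_set_id
HAVING SUM(CASE WHEN a.level = 'protein' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN a.level IN ('coding', 'genomic') THEN 1 ELSE 0 END) = 0
ORDER BY mr.score_set_id
"""


def score_sets_needing_reverse_translation(conn: sqlite3.Connection) -> list:
    """Return score set ids that are candidates for Phase 2."""
    return [row[0] for row in conn.execute(NEEDS_REVERSE_TRANSLATION)]
```

Once Phase 2 has added coding and genomic alleles for a score set, the same query no longer returns it, so the selection is naturally safe to re-run.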

Open Question: variant_translations deprecation

Once all score sets have Allele rows at all levels, the variant_translations flat lookup table may be redundant. Whether the FK join (AssayedVariant → MappingRecord → Allele) is fast enough for search at scale should be benchmarked during the backfill. If performant, variant_translations can be deprecated; if not, it stays as a denormalized search index populated from Allele FK data.
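A benchmark harness for this comparison could start from something like the sketch below. Everything about the schema is an assumption for illustration, including the columns of `variant_translations` and the indexes chosen, and an in-memory SQLite database says nothing about how the production database will behave. The point is only the shape of the comparison: check that both paths return the same answer, then time them on the same input.

```python
import sqlite3
import timeit

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE assayed_variants (id INTEGER PRIMARY KEY, urn TEXT NOT NULL);
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    assayed_variant_id INTEGER NOT NULL REFERENCES assayed_variants(id)
);
CREATE TABLE alleles (id INTEGER PRIMARY KEY, hgvs TEXT NOT NULL);
CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL, allele_id INTEGER NOT NULL,
    PRIMARY KEY (mapping_record_id, allele_id)
);
CREATE TABLE variant_translations (hgvs TEXT NOT NULL, assayed_variant_id INTEGER NOT NULL);
CREATE INDEX ix_alleles_hgvs ON alleles (hgvs);
CREATE INDEX ix_mra_allele ON mapping_record_alleles (allele_id);
CREATE INDEX ix_vt_hgvs ON variant_translations (hgvs);
"""

JOIN_QUERY = """
SELECT av.id
FROM alleles AS a
JOIN mapping_record_alleles AS mra ON mra.allele_id = a.id
JOIN mapping_records AS mr ON mr.id = mra.mapping_record_id
JOIN assayed_variants AS av ON av.id = mr.assayed_variant_id
WHERE a.hgvs = ?
ORDER BY av.id
"""

FLAT_QUERY = """
SELECT assayed_variant_id FROM variant_translations
WHERE hgvs = ? ORDER BY assayed_variant_id
"""


def build_db(n_variants: int) -> sqlite3.Connection:
    """Populate both representations with the same synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    n_alleles = max(1, n_variants // 2)  # each allele is shared by about two variants
    conn.executemany(
        "INSERT INTO alleles VALUES (?, ?)",
        [(j, f"example:p.{j}") for j in range(n_alleles)],
    )
    for i in range(n_variants):
        allele_id = i % n_alleles
        conn.execute("INSERT INTO assayed_variants VALUES (?, ?)", (i, f"urn:example:{i}"))
        conn.execute("INSERT INTO mapping_records VALUES (?, ?)", (i, i))
        conn.execute("INSERT INTO mapping_record_alleles VALUES (?, ?)", (i, allele_id))
        conn.execute(
            "INSERT INTO variant_translations VALUES (?, ?)", (f"example:p.{allele_id}", i)
        )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, query: str, hgvs: str) -> list:
    return [row[0] for row in conn.execute(query, (hgvs,))]


def benchmark(conn: sqlite3.Connection, hgvs: str, repeats: int = 200) -> dict:
    """Total seconds spent on `repeats` lookups for each access path."""
    return {
        "fk_join": timeit.timeit(lambda: lookup(conn, JOIN_QUERY, hgvs), number=repeats),
        "flat_lookup": timeit.timeit(lambda: lookup(conn, FLAT_QUERY, hgvs), number=repeats),
    }
```

A real measurement would run against a copy of the backfilled production data with realistic search predicates, and would compare query plans as well as wall-clock time.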

Acceptance Criteria

  • Phase 1 migrates all existing MappedVariant rows to MappingRecord + Allele rows without data loss
  • All VariantAnnotationStatus, gnomad_variants, and clinical_controls associations are correctly remapped to allele_id
  • Phase 2 identifies protein-level score sets with no coding/genomic Allele rows and runs the full pipeline for each
  • Job is idempotent — re-running does not create duplicate records
  • Progress and failures are observable via logging and status tracking
  • FK join query performance (AssayedVariant → MappingRecord → Allele) is benchmarked against variant_translations flat lookup at scale to inform the deprecation decision
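The idempotency and observability criteria could be met with a driver loop along these lines. This is a sketch under stated assumptions: `BackfillStatus` is a hypothetical in-memory stand-in for whatever persisted status tracking the job ends up using, and `process` is a placeholder for the per-score-set work.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set

logger = logging.getLogger("backfill")


@dataclass
class BackfillStatus:
    """Hypothetical in-memory stand-in for a persisted status table."""
    completed: Set[int] = field(default_factory=set)
    failed: Dict[int, str] = field(default_factory=dict)


def run_backfill(
    score_set_ids: Iterable[int],
    process: Callable[[int], None],
    status: BackfillStatus,
) -> BackfillStatus:
    """Process each score set once; record failures and keep going."""
    for score_set_id in score_set_ids:
        if score_set_id in status.completed:
            # Already done in an earlier pass: skipping keeps re-runs cheap.
            logger.info("score set %s already backfilled, skipping", score_set_id)
            continue
        try:
            process(score_set_id)
        except Exception as exc:
            # One bad score set should not abort the whole job.
            status.failed[score_set_id] = repr(exc)
            logger.exception("backfill failed for score set %s", score_set_id)
            continue
        status.completed.add(score_set_id)
        status.failed.pop(score_set_id, None)  # clear any failure from a previous pass
    logger.info(
        "backfill pass done: %d completed, %d failed",
        len(status.completed),
        len(status.failed),
    )
    return status
```

Skipping completed items only avoids repeated work; protection against duplicate rows still has to come from uniqueness constraints or upserts in `process` itself, since a pass can fail partway through an item.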

Metadata

Labels

  • app: backend (Task implementation touches the backend)
  • app: worker (Task implementation touches the worker)
  • type: feature (New feature)
  • workstream: clinical (Task relates to clinical features)
