feat: Retroactive backfill — migrate MappedVariant data and run reverse translation for existing score sets #747

## Context

Depends on: [#742](https://github.com/VariantEffect/mavedb-api/issues/742)

The new mapping pipeline covers score sets going forward. Existing score sets in MaveDB have `MappedVariant` records under the old schema. A two-phase backfill is needed: (1) migrate all existing `MappedVariant` rows to the new `MappingRecord` + flat `Allele` schema, and (2) run reverse translation for protein-level score sets that currently lack coding and genomic `Allele` rows.

## Goal

A backfill job, intended to run once but safe to re-run, that migrates existing `MappedVariant` data to the new schema and ensures every protein-level score set has a full complement of coding and genomic `Allele` rows, each with ClinGen registration and annotations.

## Phase 1 — Schema migration of existing data

For each existing `MappedVariant`:
- Create a `MappingRecord` row carrying provenance fields (`pre_mapped`, `mapping_api_version`, `mapped_date`, `current`, QC fields)
- Create `Allele` rows from existing `hgvs_g`, `hgvs_c`, `hgvs_p` columns, populating `level`, `transcript`, and `post_mapped` as available (upsert by VRS digest)
- Create `mapping_record_alleles` association rows
- Migrate `VariantAnnotationStatus` rows from `variant_id` to `allele_id`
- Repoint `gnomad_variants` and `clinical_controls` M2M associations from `mapped_variant_id` to `allele_id`
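
The upsert-by-digest step is what makes Phase 1 safe to re-run, so a minimal sketch may help. It uses an in-memory SQLite database and a heavily simplified, hypothetical schema: the table and column names other than `hgvs_g`, `hgvs_c`, `hgvs_p`, `level`, and `mapping_api_version` are assumptions, as are the `'protein'` level name and the `digests` key on the input record. The real job would presumably run against the production database through the project's ORM.

```python
import sqlite3

# Hypothetical, simplified tables; the real MaveDB schema will differ.
SCHEMA = """
CREATE TABLE mapping_records (
    id INTEGER PRIMARY KEY,
    mapped_variant_id INTEGER NOT NULL UNIQUE,
    mapping_api_version TEXT
);
CREATE TABLE alleles (
    id INTEGER PRIMARY KEY,
    vrs_digest TEXT NOT NULL UNIQUE,
    level TEXT NOT NULL,
    hgvs TEXT NOT NULL
);
CREATE TABLE mapping_record_alleles (
    mapping_record_id INTEGER NOT NULL REFERENCES mapping_records(id),
    allele_id INTEGER NOT NULL REFERENCES alleles(id),
    PRIMARY KEY (mapping_record_id, allele_id)
);
"""

# (level, legacy column) pairs; the 'protein' level name is an assumption.
LEVEL_COLUMNS = (("genomic", "hgvs_g"), ("coding", "hgvs_c"), ("protein", "hgvs_p"))


def migrate_mapped_variant(conn, mapped_variant):
    """Migrate one legacy record (a dict here); return its mapping record id."""
    # One MappingRecord per legacy MappedVariant; a re-run inserts nothing.
    conn.execute(
        "INSERT OR IGNORE INTO mapping_records (mapped_variant_id, mapping_api_version) "
        "VALUES (?, ?)",
        (mapped_variant["id"], mapped_variant.get("mapping_api_version")),
    )
    record_id = conn.execute(
        "SELECT id FROM mapping_records WHERE mapped_variant_id = ?",
        (mapped_variant["id"],),
    ).fetchone()[0]

    for level, column in LEVEL_COLUMNS:
        hgvs = mapped_variant.get(column)
        if not hgvs:
            continue  # populate only the levels that are available
        digest = mapped_variant["digests"][level]  # assumed to be precomputed
        # Upsert by VRS digest: insert if new, then read back the id either way.
        conn.execute(
            "INSERT OR IGNORE INTO alleles (vrs_digest, level, hgvs) VALUES (?, ?, ?)",
            (digest, level, hgvs),
        )
        allele_id = conn.execute(
            "SELECT id FROM alleles WHERE vrs_digest = ?", (digest,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO mapping_record_alleles (mapping_record_id, allele_id) "
            "VALUES (?, ?)",
            (record_id, allele_id),
        )
    return record_id
```

Because every insert is keyed on a unique constraint, calling `migrate_mapped_variant` twice with the same input leaves the row counts unchanged. On PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`.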

## Phase 2 — Reverse translation for protein-level score sets

For each protein-level score set with `MappingRecord` rows but no associated coding/genomic `Allele` rows:
- Run reverse translation to enumerate the coding variants consistent with each protein-level variant
- Upsert resulting `Allele` rows (`level = 'coding'` and `level = 'genomic'`) and link via `mapping_record_alleles`
- Run ClinGen pre-registration (#741) and annotation (#742) for new alleles
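
To illustrate what enumerating coding variants involves, here is a minimal sketch of the core codon step under the standard genetic code. It is not the pipeline's actual implementation, which this issue does not describe; the function names are hypothetical, and real reverse translation also has to handle transcript coordinates, HGVS formatting, and the projection to genomic positions.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def candidate_codons(ref_codon, alt_aa):
    """Return every codon encoding alt_aa, excluding the reference codon itself."""
    ref_codon = ref_codon.upper()
    if ref_codon not in CODON_TABLE:
        raise ValueError(f"not a valid DNA codon: {ref_codon!r}")
    return sorted(
        codon
        for codon, aa in CODON_TABLE.items()
        if aa == alt_aa and codon != ref_codon
    )


def nucleotide_changes(ref_codon, alt_codon):
    """List (offset_in_codon, ref_base, alt_base) for each differing position."""
    return [
        (offset, ref, alt)
        for offset, (ref, alt) in enumerate(zip(ref_codon, alt_codon))
        if ref != alt
    ]
```

For a methionine codon `ATG` changed to leucine, `candidate_codons("ATG", "L")` yields all six leucine codons, and `nucleotide_changes` shows which of them need one, two, or three base changes. Whether the backfill keeps every candidate or only some of them is a pipeline decision this sketch does not settle.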

## Open Question: `variant_translations` deprecation

Once all score sets have `Allele` rows at all levels, the `variant_translations` flat lookup table may be redundant. The backfill should benchmark whether the FK join (`AssayedVariant → MappingRecord → Allele`) is fast enough for search at scale. If it is, `variant_translations` can be deprecated; if not, it stays as a denormalized search index populated from `Allele` FK data.
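
The comparison could be structured roughly as follows. This is only a sketch of the benchmark's shape, run on in-memory SQLite with synthetic placeholder rows; all table and column names are assumptions, including the layout of `variant_translations`. A meaningful result needs the real schema, production-scale data, and the production database engine.

```python
import sqlite3
import time

# Hypothetical lookup by allele string via the FK join path.
JOIN_SQL = """
SELECT av.id
FROM alleles a
JOIN mapping_record_alleles mra ON mra.allele_id = a.id
JOIN mapping_records mr ON mr.id = mra.mapping_record_id
JOIN assayed_variants av ON av.id = mr.variant_id
WHERE a.hgvs = ?
"""
# The same lookup against a flat, denormalized table.
FLAT_SQL = "SELECT variant_id FROM variant_translations WHERE hgvs = ?"


def build_db(n_variants):
    """Create a toy database with one allele per assayed variant."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assayed_variants (id INTEGER PRIMARY KEY);
        CREATE TABLE mapping_records (id INTEGER PRIMARY KEY, variant_id INTEGER);
        CREATE TABLE alleles (id INTEGER PRIMARY KEY, hgvs TEXT);
        CREATE TABLE mapping_record_alleles (mapping_record_id INTEGER, allele_id INTEGER);
        CREATE TABLE variant_translations (hgvs TEXT, variant_id INTEGER);
        CREATE INDEX ix_alleles_hgvs ON alleles (hgvs);
        CREATE INDEX ix_mra_allele ON mapping_record_alleles (allele_id);
        CREATE INDEX ix_vt_hgvs ON variant_translations (hgvs);
    """)
    for i in range(1, n_variants + 1):
        name = f"allele-{i}"  # synthetic placeholder, not a real HGVS string
        conn.execute("INSERT INTO assayed_variants (id) VALUES (?)", (i,))
        conn.execute("INSERT INTO mapping_records (id, variant_id) VALUES (?, ?)", (i, i))
        conn.execute("INSERT INTO alleles (id, hgvs) VALUES (?, ?)", (i, name))
        conn.execute("INSERT INTO mapping_record_alleles VALUES (?, ?)", (i, i))
        conn.execute("INSERT INTO variant_translations VALUES (?, ?)", (name, i))
    return conn


def time_query(conn, sql, params, repeats=100):
    """Return (mean seconds per execution, rows from the last execution)."""
    rows = []
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeats, rows
```

Running `time_query` with `JOIN_SQL` and `FLAT_SQL` on the same parameter should return identical rows, which doubles as a correctness check before comparing timings. On the production database, query plans are likely to be more informative than wall-clock means from a toy dataset.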

## Acceptance Criteria

- Phase 1 migrates all existing `MappedVariant` rows to `MappingRecord` + `Allele` rows without data loss
- All `VariantAnnotationStatus`, `gnomad_variants`, and `clinical_controls` associations are correctly remapped to `allele_id`
- Phase 2 identifies protein-level score sets with no coding/genomic `Allele` rows and runs the full pipeline for each
- Job is idempotent — re-running does not create duplicate records
- Progress and failures are observable via logging and status tracking
- FK join query performance (`AssayedVariant → MappingRecord → Allele`) is benchmarked against `variant_translations` flat lookup at scale to inform the deprecation decision
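
The idempotency and observability criteria could be met by a driver loop along these lines. This is a minimal sketch with hypothetical names: `process` stands in for the per-score-set work from Phases 1 and 2, and `status` is a plain dict where the real job would presumably use a persistent status table.

```python
import logging

logger = logging.getLogger("backfill")


def run_backfill(score_set_urns, process, status):
    """Process each score set once, recording 'done' or 'failed' per URN.

    Score sets already marked 'done' are skipped, so a re-run only
    retries the ones that failed or were never attempted.
    """
    for urn in score_set_urns:
        if status.get(urn) == "done":
            logger.info("skipping %s: already backfilled", urn)
            continue
        try:
            process(urn)
        except Exception:
            # Log the traceback and keep going so one failure does not stop the job.
            logger.exception("backfill failed for %s", urn)
            status[urn] = "failed"
        else:
            status[urn] = "done"
            logger.info("backfilled %s", urn)
    return status
```

Skipping completed score sets is an optimization on top of the upserts, not a replacement for them: the per-row writes should still be idempotent in case a run dies partway through a score set.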

