Summary
Variant is foreign-keyed to ScoreSet but not to TargetGene. For multi-target score sets, a variant's target association must be recovered by re-parsing its HGVS string against the score set's target sequences — information we have at validation time and then throw away. Add variants.target_gene_id and populate it during variant creation so downstream systems (mapping, QC, publication) can attribute per variant without re-parsing or guessing.
Problem
The validation pipeline knows which target each variant was scored against — it parses HGVS strings against specific target sequences to accept or reject the row. That knowledge is discarded once validation passes. The schema retains only variant.score_set_id.
Consequences we are already paying for:
- API responses cannot easily group variants by target, so downstream consumers cannot use them directly against per-target data.
- Per-target QC ("what fraction of target 2's variants mapped cleanly?") cannot be answered directly from the database.
- Any future system that wants to partition variants by target — ClinVar submission, LDH, annotation rollups — re-derives this attribution ad hoc.
Single-target score sets (the common case) are unaffected in practice, which is why the pain has been tolerated; multi-target score sets pay the full cost.
Proposed behavior
Add a NOT NULL target_gene_id FK to variants. Populate it in create_variants_data / create_variants at the point where HGVS is already being validated against target sequences — the target identity is already known there and just needs to be carried through.
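As an illustration of "carrying the target through", the sketch below has the validation step return the matched target rather than only a pass/fail result. It is not the existing create_variants_data / create_variants code: the Target and ValidatedVariant types, the resolve_target and validate_row helpers, and the assumption that multi-target HGVS strings carry a "label:" prefix are all hypothetical stand-ins for the real parser.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Target:  # hypothetical stand-in for a TargetGene row
    id: int
    label: str


@dataclass(frozen=True)
class ValidatedVariant:  # hypothetical: what validation hands to variant creation
    hgvs_nt: str
    target_gene_id: int


def resolve_target(hgvs: str, targets: Sequence[Target]) -> Optional[Target]:
    """Return the target an HGVS string refers to, or None if it is ambiguous.

    Assumption: multi-target strings look like "<label>:c.1A>G"; unprefixed
    strings are only attributable when the score set has exactly one target.
    """
    prefix, sep, _ = hgvs.partition(":")
    if sep:
        matches = [t for t in targets if t.label == prefix]
        return matches[0] if len(matches) == 1 else None
    return targets[0] if len(targets) == 1 else None


def validate_row(hgvs: str, targets: Sequence[Target]) -> ValidatedVariant:
    """Accept or reject a row, keeping the target identity instead of discarding it."""
    target = resolve_target(hgvs, targets)
    if target is None:
        raise ValueError(f"cannot attribute {hgvs!r} to exactly one target")
    return ValidatedVariant(hgvs_nt=hgvs, target_gene_id=target.id)
```

The point of the sketch is only the return type: whatever the real parser does internally, the row that leaves validation already carries target_gene_id, so the insert path never has to look it up again.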
ScoreSet → Variant stays queryable via a viewonly relationship that joins through target_genes so existing call sites continue to work during the transition. Provenance tables that currently FK to MappedVariant (or will — e.g., TargetGeneMapping) can attribute precisely: the mapped variant points at the variant, which points at the target, closing the loop.
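The join described above can be sketched in plain SQL. The example below uses an in-memory SQLite database with a deliberately reduced schema (only id and FK columns; the real tables have many more), and the helper names variants_for_score_set and variant_counts_by_target are hypothetical. It shows that score-set-level listing and per-target slicing both fall out of the new FK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE score_sets (id INTEGER PRIMARY KEY);
    CREATE TABLE target_genes (
        id INTEGER PRIMARY KEY,
        score_set_id INTEGER NOT NULL REFERENCES score_sets(id)
    );
    CREATE TABLE variants (
        id INTEGER PRIMARY KEY,
        score_set_id INTEGER NOT NULL REFERENCES score_sets(id),
        target_gene_id INTEGER NOT NULL
            REFERENCES target_genes(id) ON DELETE CASCADE
    );
    INSERT INTO score_sets (id) VALUES (1);
    INSERT INTO target_genes (id, score_set_id) VALUES (10, 1), (11, 1);
    INSERT INTO variants (id, score_set_id, target_gene_id)
        VALUES (1, 1, 10), (2, 1, 10), (3, 1, 11);
    """
)


def variants_for_score_set(conn, score_set_id):
    """All variant ids of a score set, reached through target_genes."""
    rows = conn.execute(
        """
        SELECT v.id
        FROM variants v
        JOIN target_genes t ON t.id = v.target_gene_id
        WHERE t.score_set_id = ?
        ORDER BY v.id
        """,
        (score_set_id,),
    )
    return [row[0] for row in rows]


def variant_counts_by_target(conn, score_set_id):
    """Per-target variant counts, answerable directly once the FK exists."""
    rows = conn.execute(
        """
        SELECT v.target_gene_id, COUNT(*)
        FROM variants v
        JOIN target_genes t ON t.id = v.target_gene_id
        WHERE t.score_set_id = ?
        GROUP BY v.target_gene_id
        """,
        (score_set_id,),
    )
    return dict(rows.fetchall())
```

In the ORM, the first query is roughly what a viewonly relationship with a secondary join through target_genes would emit; the second is the shape a per-target QC query would take.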
Acceptance criteria
- variants.target_gene_id column added, NOT NULL, FK to target_genes.id with ondelete="CASCADE".
- Variant.target_gene relationship defined with back_populates on TargetGene.variants.
- Variant creation paths set target_gene_id from the target used during HGVS validation; no code path inserts a variant without it.
- ScoreSet.variants continues to return all variants of the score set (via join through targets, viewonly) so existing callers need no change.
- The VRS mapping worker's TargetGeneMapping attribution uses variant.target_gene_id instead of min(target_gene_id).
- Alembic migration back-fills target_gene_id for existing rows:
  - Single-target score sets: trivially set to that score set's only target.
  - Multi-target score sets: re-parse hgvs_nt / hgvs_pro to match target accession or label; rows that cannot be attributed unambiguously are reported in a migration summary and require manual triage before the NOT NULL constraint is applied.
- Migration includes a pre-flight SQL query (runnable standalone) that counts ambiguous rows so the scope of manual triage is known before the migration is applied in production.
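One possible shape for the pre-flight query is sketched below. It assumes, hypothetically, that target_genes has a label column and that multi-target HGVS strings are prefixed with "<label>:"; the real matching rule (accession vs. label, hgvs_nt vs. hgvs_pro precedence) would need to mirror the validation parser. The SQL is written for SQLite so the example runs standalone; on PostgreSQL, split_part(h, ':', 1) could replace the instr/substr pair.

```python
import sqlite3

# Counts variants in multi-target score sets whose HGVS prefix does not match
# exactly one target label of that score set. Column names are assumptions.
PREFLIGHT_SQL = """
WITH multi AS (
    SELECT score_set_id
    FROM target_genes
    GROUP BY score_set_id
    HAVING COUNT(*) > 1
),
candidates AS (
    SELECT v.id, v.score_set_id, COALESCE(v.hgvs_nt, v.hgvs_pro) AS h
    FROM variants v
    JOIN multi m ON m.score_set_id = v.score_set_id
)
SELECT COUNT(*)
FROM candidates c
WHERE (
    SELECT COUNT(*)
    FROM target_genes t
    WHERE t.score_set_id = c.score_set_id
      AND instr(c.h, ':') > 0
      AND substr(c.h, 1, instr(c.h, ':') - 1) = t.label
) <> 1
"""


def count_ambiguous(conn: sqlite3.Connection) -> int:
    """Run the pre-flight query and return the number of rows needing triage."""
    return conn.execute(PREFLIGHT_SQL).fetchone()[0]
```

Because the query only reads existing columns, it can be run against production before the migration adds anything, which is what makes the triage scope knowable in advance.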
Implementation notes
- Preferred order of operations in the migration: add column nullable → run back-fill for single-target sets → run back-fill for multi-target sets with exact accession/label match → emit a report of remaining NULLs → require zero NULLs before ALTER COLUMN ... SET NOT NULL. If the report is non-empty, the migration fails loudly and requires a data-fix step.
- The back-fill's HGVS re-parsing logic should reuse the same machinery as the validation pipeline rather than a migration-local reimplementation; keeping a single parser avoids the back-fill and validation disagreeing.
- URN generation for variants currently encodes position within the score set (<score_set_urn>#<n>). This issue does not change URN semantics: variants remain addressable by score-set-scoped URN, but the FK makes per-target slicing efficient.
- Publication, variant listing, and dataset-columns logic should be audited for implicit assumptions that "all variants in a score set share one target"; those call sites were previously masking the multi-target case.
- Consider whether target_sequence / target_accession re-derivation elsewhere in the codebase can be simplified once this FK exists.
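The order of operations in the first note can be sketched as follows. This is a SQLite-backed illustration, not an Alembic migration: the function names, the label column, and the attribute callback are hypothetical, and the callback stands in for the shared validation parser so that the back-fill does not grow its own matching logic.

```python
import sqlite3
from typing import Callable, List, Optional, Sequence, Tuple

# attribute(hgvs, [(target_id, label), ...]) -> target_id, or None if ambiguous.
Attribute = Callable[[str, Sequence[Tuple[int, str]]], Optional[int]]


def backfill_target_gene_id(conn: sqlite3.Connection, attribute: Attribute) -> List[int]:
    """Add the column as nullable, back-fill it, and return ids still NULL."""
    # Step 1: nullable column first, so existing rows stay valid.
    conn.execute(
        "ALTER TABLE variants ADD COLUMN target_gene_id INTEGER "
        "REFERENCES target_genes(id)"
    )
    # Step 2: single-target score sets are attributed without parsing.
    conn.execute(
        """
        UPDATE variants
        SET target_gene_id = (
            SELECT MIN(t.id) FROM target_genes t
            WHERE t.score_set_id = variants.score_set_id
        )
        WHERE (
            SELECT COUNT(*) FROM target_genes t
            WHERE t.score_set_id = variants.score_set_id
        ) = 1
        """
    )
    # Step 3: multi-target score sets go through the shared parser.
    pending = conn.execute(
        "SELECT id, score_set_id, hgvs_nt FROM variants "
        "WHERE target_gene_id IS NULL"
    ).fetchall()
    for variant_id, score_set_id, hgvs in pending:
        targets = conn.execute(
            "SELECT id, label FROM target_genes WHERE score_set_id = ?",
            (score_set_id,),
        ).fetchall()
        target_id = attribute(hgvs, targets)
        if target_id is not None:
            conn.execute(
                "UPDATE variants SET target_gene_id = ? WHERE id = ?",
                (target_id, variant_id),
            )
    # Step 4: report what is left for manual triage.
    rows = conn.execute(
        "SELECT id FROM variants WHERE target_gene_id IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def require_no_nulls(unattributed: Sequence[int]) -> None:
    """Fail loudly before the NOT NULL constraint would be applied."""
    if unattributed:
        raise RuntimeError(
            f"{len(unattributed)} variants lack target_gene_id: {list(unattributed)}"
        )
    # A real migration would issue ALTER COLUMN ... SET NOT NULL here.
```

Keeping the report (the returned ids) separate from the failure (require_no_nulls) lets the same back-fill run in a dry-run mode that only prints the triage list.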