Consolidate per-(target gene, layer) reference identity into a single structured home #763

@bencap

Summary

Per-(target gene, layer) reference identity — the pre/post-mapped sequence_id, sequence_type, and reference accession(s) for each annotation layer (genomic / cdna / protein) — has no single structured home. It is split between structured columns on TargetGeneMapping (a QC/provenance table) and JSON blobs on TargetGene. This issue tracks consolidating it into one source that view models can map from cleanly.

Problem

Reference identity currently lives in two overlapping places:

  • TargetGeneMapping columns (reference_accession, reference_sequence_id, reference_assembly) — per (target_gene, alignment_level), structured and queryable, but on a table whose purpose is alignment QC/provenance (scores, mismatch/gap counts, CIGAR, etc.).
  • TargetGene.pre_mapped_metadata / post_mapped_metadata — JSON blobs keyed by layer (e.g. {"cdna": {"sequence_accessions": ["NM_003345.5"]}, "genomic": {"sequence_id": "ga4gh:SQ…", "sequence_type": "dna", "sequence_accessions": ["NC_000016.10"]}}), read via fragile .get("layer", {}).get("field") chains in lib/mapping/metadata.py, lib/target_genes.py, and the statistics / score_sets routers.
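As a rough illustration of the blob access pattern described above, the sketch below uses the example blob from this issue. `get_layer_field` is a hypothetical stand-in for the chained lookups, not the actual code in lib/mapping/metadata.py or lib/target_genes.py.

```python
from typing import Any, Optional

# Example blob copied from this issue (sequence_id is truncated in the source).
post_mapped_metadata: dict[str, Any] = {
    "cdna": {"sequence_accessions": ["NM_003345.5"]},
    "genomic": {
        "sequence_id": "ga4gh:SQ…",
        "sequence_type": "dna",
        "sequence_accessions": ["NC_000016.10"],
    },
}


def get_layer_field(blob: Optional[dict], layer: str, field: str) -> Any:
    """Hypothetical helper mirroring the .get(layer, {}).get(field) chains."""
    return (blob or {}).get(layer, {}).get(field)


# A populated field comes back as expected.
genomic_accessions = get_layer_field(post_mapped_metadata, "genomic", "sequence_accessions")

# A field the layer never recorded yields None with no error.
cdna_sequence_id = get_layer_field(post_mapped_metadata, "cdna", "sequence_id")

# A misspelled layer key also yields None, indistinguishable from "absent".
typo_layer = get_layer_field(post_mapped_metadata, "cDNA", "sequence_accessions")
```

Nothing in this pattern distinguishes a layer that was never mapped from a key that was mistyped, which is the fragility a structured home would remove.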

Consequences:

  • Duplication. TargetGene.post_mapped_metadata[layer].sequence_accessions overlaps TargetGeneMapping(layer).reference_accession; …sequence_id overlaps reference_sequence_id.
  • A non-idiomatic shortcut. The reverse-translation worker job sources its coding (NM_) transcript from TargetGeneMapping.reference_accession, reaching into the QC table, because that is the only structured path. The idiomatic source (per-target-gene reference metadata) is reachable only through the JSON blobs.
  • Painful extraction. Every consumer that needs a per-layer accession or sequence id does ad-hoc, unvalidated dictionary digging into the blobs.

Proposed behavior

Per-(target gene, layer) reference identity is read and written from one structured, queryable home, such that:

  • The reverse-translation job (and other consumers) resolve the coding/genomic/protein reference from that home rather than from the QC table or the JSON blobs.
  • A view model can map the reference data from a single source, not by reassembling it across multiple tables.
  • The TargetGene.{pre,post}_mapped_metadata blobs are retired (or derived), removing the duplication.

Two candidate shapes (decide during implementation):

  1. Promote onto TargetGeneMapping (TGM), which is already per-(target_gene, layer). Add the remaining structured fields and retire the blobs. This is the smallest schema change, but it relaxes the "TGM = QC only" framing.
  2. Dedicated reference table keyed by (target_gene, layer) — keeps TGM purely QC, moves reference identity (incl. the existing reference_accession / reference_sequence_id) into the new table.
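To make the second shape more concrete, here is a minimal sketch of what one record keyed by (target_gene, layer) might hold, using plain dataclasses rather than the project's ORM. All names here (`Layer`, `ReferenceIdentity`, `LayerReference`, `references_from_blobs`) are hypothetical. The blob keys (`sequence_id`, `sequence_type`, `sequence_accessions`) follow the example earlier in this issue. The same fields would apply to option 1 if they were promoted onto TargetGeneMapping instead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Layer(str, Enum):
    GENOMIC = "genomic"
    CDNA = "cdna"
    PROTEIN = "protein"


@dataclass(frozen=True)
class ReferenceIdentity:
    """Reference identity for one side (pre- or post-mapped) of a layer."""
    sequence_id: Optional[str] = None
    sequence_type: Optional[str] = None
    accessions: tuple[str, ...] = ()


@dataclass(frozen=True)
class LayerReference:
    """Hypothetical record keyed by (target_gene_id, layer)."""
    target_gene_id: int
    layer: Layer
    pre_mapped: Optional[ReferenceIdentity] = None
    post_mapped: Optional[ReferenceIdentity] = None


def _identity(blob: Optional[dict[str, Any]], layer: Layer) -> Optional[ReferenceIdentity]:
    # Return None when the blob has no dict entry for this layer.
    entry = (blob or {}).get(layer.value)
    if not isinstance(entry, dict):
        return None
    return ReferenceIdentity(
        sequence_id=entry.get("sequence_id"),
        sequence_type=entry.get("sequence_type"),
        accessions=tuple(entry.get("sequence_accessions") or ()),
    )


def references_from_blobs(
    target_gene_id: int,
    pre: Optional[dict[str, Any]],
    post: Optional[dict[str, Any]],
) -> dict[Layer, LayerReference]:
    """Build one record per layer that appears in either blob."""
    rows: dict[Layer, LayerReference] = {}
    for layer in Layer:
        pre_id, post_id = _identity(pre, layer), _identity(post, layer)
        if pre_id is not None or post_id is not None:
            rows[layer] = LayerReference(target_gene_id, layer, pre_id, post_id)
    return rows


# Usage with the cdna entry from the example blob (target gene id 1 is illustrative).
rows = references_from_blobs(1, None, {"cdna": {"sequence_accessions": ["NM_003345.5"]}})
```

A conversion like `references_from_blobs` could also serve as a one-off backfill when the blobs are retired, though that is a design choice this issue leaves open.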

Acceptance criteria

  • Per-layer reference identity (pre/post sequence_id, sequence_type, accession(s)) is readable from a single structured source without parsing JSON blobs.
  • The reverse-translation job resolves its coding transcript from that source rather than from a QC table as a side effect.
  • lib/mapping/metadata.py, lib/target_genes.py, and the statistics / score_sets routers no longer dig per-layer reference fields out of TargetGene.{pre,post}_mapped_metadata.
  • The TargetGene metadata blobs are retired or explicitly derived; the TGM-vs-blob duplication is gone.
  • View models expose reference identity by mapping from one source; storage promotion does not bloat API responses (response models stay curated).
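The second criterion could be sketched as follows. `resolve_coding_transcript` and its input mapping are hypothetical. The mapping stands in for whichever structured source is chosen, and selecting by the `NM_` prefix on the cdna layer is an assumption based on how this issue describes the coding transcript.

```python
from typing import Mapping, Sequence


def resolve_coding_transcript(accessions_by_layer: Mapping[str, Sequence[str]]) -> str:
    """Return the first NM_ accession recorded for the cdna layer.

    Hypothetical helper: raises instead of returning None so a missing
    transcript is reported at the point of lookup.
    """
    for accession in accessions_by_layer.get("cdna", ()):
        if accession.startswith("NM_"):
            return accession
    raise LookupError("no coding (NM_) transcript recorded for the cdna layer")


# Accessions taken from the example blob in this issue.
coding = resolve_coding_transcript(
    {"cdna": ["NM_003345.5"], "genomic": ["NC_000016.10"]}
)
```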

Implementation notes

  • View-model bloat is avoidable. Storage and the API response are separate layers — promote columns/table for internal query ergonomics and keep the response view models curated. The constraint that matters is the opposite one: avoid scattering reference identity across tables, because then every view model must reassemble it.
  • sequence_genes is not reference identity. It is gene annotation (HGNC symbols per layer, backfilled by scripts/mapped_gene_from_mapped_variant.py, read by the statistics router). It needs its own home and should not ride along in a reference table by default.
  • Reconcile cardinality. sequence_accessions is a list in the blob; TargetGeneMapping.reference_accession is singular. Decide on an array column vs. "primary accession + the rest."
  • Deferred intentionally — there is no hot-path correctness or performance problem today; the cost is ergonomic. This issue records the design so it is not re-derived each time the area is touched.
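For the cardinality note, the "primary accession + the rest" option might look like the sketch below. Treating the first list element as the primary accession is an assumption, since the issue does not say how a primary would be chosen. `split_primary` and `merge_primary` are hypothetical names.

```python
from typing import Optional, Sequence


def split_primary(accessions: Sequence[str]) -> tuple[Optional[str], list[str]]:
    """Split a blob-style accession list into (primary, rest).

    Assumption: the first element is the primary accession.
    """
    if not accessions:
        return None, []
    return accessions[0], list(accessions[1:])


def merge_primary(primary: Optional[str], rest: Sequence[str]) -> list[str]:
    """Rebuild the list form, e.g. for a derived blob or an array column."""
    return ([primary] if primary is not None else []) + list(rest)


# The single-accession case from the example blob maps onto the singular column.
primary, rest = split_primary(["NM_003345.5"])
```

With an array column neither helper is needed, but consumers that want a single accession, such as the reverse-translation job, would then have to apply their own selection rule.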

🤖 Generated with Claude Code
