Summary
Per-(target gene, layer) reference identity — the pre/post-mapped sequence_id, sequence_type, and reference accession(s) for each annotation layer (genomic / cdna / protein) — has no single structured home. It is split between structured columns on TargetGeneMapping (a QC/provenance table) and JSON blobs on TargetGene. This issue tracks consolidating it into one source that view models can map from cleanly.
Problem
Reference identity currently lives in two overlapping places:
- TargetGeneMapping columns (reference_accession, reference_sequence_id, reference_assembly): per (target_gene, alignment_level), structured and queryable, but on a table whose purpose is alignment QC/provenance (scores, mismatch/gap counts, CIGAR, etc.).
- TargetGene.pre_mapped_metadata / post_mapped_metadata: JSON blobs keyed by layer (e.g. {"cdna": {"sequence_accessions": ["NM_003345.5"]}, "genomic": {"sequence_id": "ga4gh:SQ…", "sequence_type": "dna", "sequence_accessions": ["NC_000016.10"]}}), read via fragile .get("layer", {}).get("field") chains in lib/mapping/metadata.py, lib/target_genes.py, and the statistics / score_sets routers.
Consequences:
- Duplication. TargetGene.post_mapped_metadata[layer].sequence_accessions overlaps TargetGeneMapping(layer).reference_accession; …sequence_id overlaps reference_sequence_id.
- A non-idiomatic shortcut. The reverse-translation worker job sources its coding (NM_) transcript from TargetGeneMapping.reference_accession, reaching into the QC table, because that is the only structured path; the idiomatic source (per-target-gene reference metadata) is stuck in the JSON blob.
- Painful extraction. Every consumer that needs a per-layer accession or sequence id does ad-hoc, unvalidated dictionary digging into the blobs.
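To make the "painful extraction" point concrete, here is a minimal sketch contrasting the ad-hoc digging pattern with a single validated accessor. The blob shape follows the example above; LayerReference, fragile_first_accession, and parse_layer_reference are hypothetical names for illustration, not existing code in the repository.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class LayerReference:
    """Hypothetical typed view of one layer's reference identity."""

    layer: str
    sequence_id: Optional[str]
    sequence_type: Optional[str]
    sequence_accessions: Tuple[str, ...]


def fragile_first_accession(blob: Dict[str, Any], layer: str) -> Optional[str]:
    # The ad-hoc pattern: a missing layer, a misspelled key, and an empty
    # list all collapse into the same silent None.
    accessions = blob.get(layer, {}).get("sequence_accessions") or []
    return accessions[0] if accessions else None


def parse_layer_reference(blob: Dict[str, Any], layer: str) -> LayerReference:
    # Validate once at the boundary so callers receive a typed object
    # or an explicit error, never a silently wrong None.
    entry = blob.get(layer)
    if not isinstance(entry, dict):
        raise KeyError(f"no reference metadata for layer {layer!r}")
    accessions = entry.get("sequence_accessions", [])
    if not isinstance(accessions, list) or not all(
        isinstance(a, str) for a in accessions
    ):
        raise TypeError(
            f"sequence_accessions for layer {layer!r} must be a list of strings"
        )
    return LayerReference(
        layer=layer,
        sequence_id=entry.get("sequence_id"),
        sequence_type=entry.get("sequence_type"),
        sequence_accessions=tuple(accessions),
    )
```

With the fragile helper, a lookup for "cDNA" instead of "cdna" returns None without any signal; the validated accessor raises instead. A structured home would make even this parsing step unnecessary.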
Proposed behavior
Per-(target gene, layer) reference identity is read and written from one structured, queryable home, such that:
- The reverse-translation job (and other consumers) resolve the coding/genomic/protein reference from that home rather than from the QC table or the JSON blobs.
- A view model can map the reference data from a single source, not by reassembling it across multiple tables.
- The TargetGene.{pre,post}_mapped_metadata blobs are retired (or derived), removing the duplication.
Two candidate shapes (decide during implementation):
- Promote onto TargetGeneMapping: it is already per-(target_gene, layer). Add the remaining structured fields and retire the blobs. Smallest schema change; relaxes the "TGM = QC only" framing.
- Dedicated reference table keyed by (target_gene, layer): keeps TGM purely QC and moves reference identity (incl. the existing reference_accession / reference_sequence_id) into the new table.
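As a rough illustration of the second shape, the sketch below models a dedicated table keyed by (target_gene, layer) using the standard-library sqlite3 module so that it runs on its own. The table name target_gene_references, the resolve_reference helper, and the demo rows are assumptions for illustration; the real schema would live in the project's ORM models and migrations. How pre-mapped and post-mapped identity are distinguished (for example, an extra key column) is left open here, and the first shape would add the same fields to TargetGeneMapping instead.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per (target gene, layer).
SCHEMA = """
CREATE TABLE target_gene_references (
    target_gene_id      INTEGER NOT NULL,
    layer               TEXT NOT NULL
                        CHECK (layer IN ('genomic', 'cdna', 'protein')),
    sequence_id         TEXT,
    sequence_type       TEXT,
    reference_accession TEXT,
    PRIMARY KEY (target_gene_id, layer)
);
"""


def resolve_reference(
    conn: sqlite3.Connection, target_gene_id: int, layer: str
) -> Optional[sqlite3.Row]:
    # Single structured lookup; returns None when the layer has no row.
    return conn.execute(
        "SELECT sequence_id, sequence_type, reference_accession "
        "FROM target_gene_references "
        "WHERE target_gene_id = ? AND layer = ?",
        (target_gene_id, layer),
    ).fetchone()


def demo_connection() -> sqlite3.Connection:
    # In-memory database seeded with values from the blob example above.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO target_gene_references VALUES (?, ?, ?, ?, ?)",
        [
            (1, "cdna", None, None, "NM_003345.5"),
            # "ga4gh:SQ.PLACEHOLDER" is a stand-in, not a real digest.
            (1, "genomic", "ga4gh:SQ.PLACEHOLDER", "dna", "NC_000016.10"),
        ],
    )
    return conn
```

Under this shape, a consumer such as the reverse-translation job would call something like resolve_reference(conn, target_gene_id, "cdna") rather than reading the QC table or the blobs.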
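Acceptance criteria
- Per-(target gene, layer) reference identity (sequence_id, sequence_type, accession(s)) is readable from a single structured source without parsing JSON blobs.
- lib/mapping/metadata.py, lib/target_genes.py, and the statistics / score_sets routers no longer dig per-layer reference fields out of TargetGene.{pre,post}_mapped_metadata.
- The TargetGene metadata blobs are retired or explicitly derived; the TGM-vs-blob duplication is gone.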
Implementation notes
- View-model bloat is avoidable. Storage and the API response are separate layers — promote columns/table for internal query ergonomics and keep the response view models curated. The constraint that matters is the opposite one: avoid scattering reference identity across tables, because then every view model must reassemble it.
- sequence_genes is not reference identity. It is gene annotation (HGNC symbols per layer, backfilled by scripts/mapped_gene_from_mapped_variant.py, read by the statistics router). It needs its own home and should not ride along in a reference table by default.
- Reconcile cardinality. sequence_accessions is a list in the blob; TargetGeneMapping.reference_accession is singular. Decide on an array column vs. "primary accession + the rest."
- Deferred intentionally — there is no hot-path correctness or performance problem today; the cost is ergonomic. This issue records the design so it is not re-derived each time the area is touched.
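The cardinality question can be sketched as follows. This assumes, purely for illustration, that the first accession in the blob list is treated as the primary one; the issue does not specify how a primary accession would be chosen, and SplitAccessions and split_accessions are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class SplitAccessions(NamedTuple):
    """Hypothetical "primary accession + the rest" representation."""

    primary: Optional[str]
    additional: Tuple[str, ...]


def split_accessions(sequence_accessions: Sequence[str]) -> SplitAccessions:
    # Drop duplicates while preserving the original order.
    unique = tuple(dict.fromkeys(sequence_accessions))
    if not unique:
        return SplitAccessions(primary=None, additional=())
    # Assumption: the first listed accession is the primary one.
    return SplitAccessions(primary=unique[0], additional=unique[1:])
```

An array column would instead store the de-duplicated tuple whole and leave the choice of primary to each consumer; the split form keeps a singular column compatible with the existing TargetGeneMapping.reference_accession.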
🤖 Generated with Claude Code