feat: Target level mapping metadata #97

Draft
bencap wants to merge 6 commits into feature/bencap/improved-error-visibility from feature/bencap/86/target-level-mapping-metadata

bencap (Collaborator) commented Apr 30, 2026

This pull request introduces several improvements to the mapping output pipeline, focusing on standardizing the output format, simplifying result assembly, and improving maintainability. The changes ensure that mapping outputs conform to a documented schema, add automated schema validation to CI, and refactor the API code to use a single authoritative mapping output builder. Documentation has also been expanded to clarify the mapping output structure and regeneration workflow.

Mapping output standardization and documentation:

  • Added a comprehensive "Mapping output" section to README.md, documenting the structure and fields of the mapping output JSON, including top-level keys, per-variant audit flags, and instructions for regenerating the schema.

Schema generation and validation:

  • Added a new script scripts/generate_schema.py to generate schema.json from the current Pydantic models, ensuring the schema stays in sync with code changes.
  • Updated the GitHub Actions workflow (.github/workflows/checks.yaml) to verify that schema.json is up to date on every CI run.
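The "schema is up to date" check can be pictured as a byte-for-byte comparison between a freshly rendered schema and the committed file. The sketch below is illustrative only and is not the contents of scripts/generate_schema.py: `render_schema` and `schema_is_current` are hypothetical names, and the schema dict is assumed to come from the Pydantic models (for example via `model_json_schema()` if the project is on Pydantic v2).

```python
import json
from pathlib import Path


def render_schema(schema: dict) -> str:
    """Serialize a schema dict deterministically so diffs stay stable."""
    # sort_keys plus a fixed indent means the same models always yield the same bytes.
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


def schema_is_current(schema: dict, path: Path) -> bool:
    """Return True when the file at `path` matches the freshly rendered schema."""
    # Hypothetical helper: a CI step could exit non-zero when this returns False.
    return path.exists() and path.read_text(encoding="utf-8") == render_schema(schema)
```

A CI job built this way would fail whenever a model change is committed without regenerating schema.json, which is the behavior the workflow change describes.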

API and codebase refactoring:

  • Refactored the /map API endpoint in src/api/routers/map.py to use the new build_scoreset_mapping function for assembling mapping output, removing manual construction of the output structure and ensuring consistency with the schema. All error responses now consistently use the verbatim upstream metadata.

Code quality and style:

  • Updated pyproject.toml to ignore additional docstring-related lint rules in test files, reducing noise from style checks.

Gene symbol normalization:

  • Improved the get_gene_symbol function in src/dcd_mapping/lookup.py to tokenize the target name and try each token for gene normalization, making gene symbol extraction more robust.
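As a rough illustration of the token-by-token approach, the sketch below splits a target name and returns the first token a normalizer accepts. It is not the actual `get_gene_symbol` implementation: `first_normalizable_token`, the splitting regex, and the toy normalizer are all assumptions standing in for the real gene normalization lookup.

```python
from __future__ import annotations

import re
from collections.abc import Callable


def first_normalizable_token(
    target_name: str, normalize: Callable[[str], str | None]
) -> str | None:
    """Return the first gene symbol that `normalize` resolves from any token."""
    # Assumed tokenization: split on anything other than letters, digits, or hyphens.
    tokens = [t for t in re.split(r"[^A-Za-z0-9-]+", target_name) if t]
    for token in tokens:
        symbol = normalize(token)
        if symbol is not None:
            return symbol
    return None


# Toy normalizer for demonstration; the real lookup would query a gene normalizer.
KNOWN_SYMBOLS = {"BRCA1", "TP53"}


def toy_normalize(token: str) -> str | None:
    upper = token.upper()
    return upper if upper in KNOWN_SYMBOLS else None


symbol = first_normalizable_token("human BRCA1 RING domain", toy_normalize)
```

With a first-word-only strategy the example name would fail on "human"; trying each token in order lets the second token resolve.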

bencap and others added 5 commits April 28, 2026 15:41
- Introduce `AlignmentQc` schema capturing per-alignment BLAT quality
  metrics (identity, CIGAR, mismatch positions, gap intervals) with
  in-memory-only position lists excluded from serialization
- Add `TargetMapping` schema representing a per-(target, layer) mapping
  row with variant counts, tool parameters, and preferred-layer flag
- Add `VrsMapResult` NamedTuple to pair VRS mappings with their
  `TargetMapping` rows from `vrs_map`
- Rename `annotation_layer` → `alignment_level` on `MappedScore` /
  `ScoreAnnotation` to align with new terminology
- Rename `ident_pct` → `percent_identity` on `AlignmentResult`; add
  `score`, `next_best_score`, `alignment_qc`, `aligner_parameters`, and
  `reference_assembly` fields
- Implement `build_scoreset_mapping` in `annotate.py` to assemble
  `ScoresetMapping` with populated `target_mappings` list, per-variant
  locus-quality flags (`at_mismatched_locus`, `near_gap`), and reference
  sequence metadata
- Restore canonical BLAT PSL scoring (`matches - misMatches -
  qNumInsert - tNumInsert`) in `_get_best_hsp`; previous BioPython port
  used raw identity count, causing noisy alignments to outrank clean ones
- Update JSON schema, API router, CI workflow, and README to reflect new
  output shape
- Add `test_annotate_target_mapping.py` and expand `test_align.py` /
  `test_annotate.py` with unit tests for new logic
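
To see why ranking by raw identity count can favor a noisy alignment, the sketch below applies the scoring formula quoted in the commit message (`matches - misMatches - qNumInsert - tNumInsert`) to two made-up hits. `PslCounts` and `psl_score` are hypothetical names for illustration, not the code in `_get_best_hsp`, and the counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PslCounts:
    """Subset of PSL columns used by the score quoted in the commit message."""

    matches: int
    mismatches: int
    q_num_insert: int
    t_num_insert: int


def psl_score(c: PslCounts) -> int:
    """Score = matches - misMatches - qNumInsert - tNumInsert."""
    return c.matches - c.mismatches - c.q_num_insert - c.t_num_insert


# Invented example: a slightly longer but noisy hit versus a clean one.
clean = PslCounts(matches=300, mismatches=0, q_num_insert=0, t_num_insert=0)
noisy = PslCounts(matches=310, mismatches=25, q_num_insert=3, t_num_insert=2)

best_by_identity = max([clean, noisy], key=lambda c: c.matches)  # picks noisy
best_by_psl = max([clean, noisy], key=psl_score)  # picks clean (300 vs 280)
```

Penalizing mismatches and insert counts drops the noisy hit to 280, so the clean hit at 300 ranks first.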

Co-authored-by: Copilot <copilot@github.com>
…alignment_qc

- protein-vs-DNA (-q=prot -t=dnax) BLAT runs store target coords in
  nucleotide space and query coords in amino-acid space (3:1 ratio);
  minus-strand target blocks have ts > te, making seq[ts:te] return "".
  Comparison was crashing with ValueError from zip(strict=True); the
  per-base mismatch loop is now skipped entirely for this mode, setting
  mismatch_positions_unavailable=True so at_mismatched_locus is
  correctly left as None (not evaluated) rather than a false False.
  The preferred layer for protein scoresets is PROTEIN, flagged from
  the downstream protein-to-protein alignment, so no signal is lost.

- For nucleotide-vs-nucleotide runs, replace the bare zip(strict=True)
  with an explicit length-mismatch guard that logs a WARNING and falls
  through to zip(strict=False), preserving all mismatches in the
  overlapping prefix rather than crashing or discarding the block.
…s for better matching

This is mostly useful in multi-word target names where gene information is available but not in the first word of the target name.
…and adjust related annotations

Co-authored-by: Copilot <copilot@github.com>