feat: Target level mapping metadata by bencap · Pull Request #97 · VariantEffect/dcd_mapping2

bencap · 2026-04-30T15:47:37Z

This pull request introduces several improvements to the mapping output pipeline, focusing on standardizing the output format, simplifying result assembly, and improving maintainability. The changes ensure that mapping outputs conform to a documented schema, add automated schema validation to CI, and refactor the API code to use a single authoritative mapping output builder. Documentation has also been expanded to clarify the mapping output structure and regeneration workflow.

Mapping output standardization and documentation:

Added a comprehensive "Mapping output" section to README.md, documenting the structure and fields of the mapping output JSON, including top-level keys, per-variant audit flags, and instructions for regenerating the schema.

Schema generation and validation:

Added a new script scripts/generate_schema.py to generate schema.json from the current Pydantic models, ensuring the schema stays in sync with code changes.
Updated the GitHub Actions workflow (.github/workflows/checks.yaml) to verify that schema.json is up to date on every CI run.
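
As a rough sketch of the kind of freshness check the CI step performs (not the actual workflow code), the committed schema.json can be compared against a freshly generated schema. The function name `schema_is_current` and passing the generated schema in as a dict are assumptions made for illustration; in the repository the dict would presumably come from the Pydantic models via scripts/generate_schema.py.

```python
import json
from pathlib import Path


def schema_is_current(generated: dict, committed_path: Path) -> bool:
    """Return True when the committed schema file matches the generated schema.

    Hypothetical helper: `generated` stands in for whatever the schema
    generation script produces from the models.
    """
    if not committed_path.exists():
        return False
    committed = json.loads(committed_path.read_text(encoding="utf-8"))
    # Compare parsed JSON rather than raw text, so key order and
    # whitespace differences do not cause false failures.
    return committed == generated
```

A CI job could exit non-zero when this returns False, which forces the schema to be regenerated and committed together with any model change.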

API and codebase refactoring:

Refactored the /map API endpoint in src/api/routers/map.py to use the new build_scoreset_mapping function for assembling mapping output, removing manual construction of the output structure and ensuring consistency with the schema. All error responses now consistently use the verbatim upstream metadata.

Code quality and style:

Updated pyproject.toml to ignore additional docstring-related lint rules in test files, reducing noise from style checks.

Gene symbol normalization:

Improved the get_gene_symbol function in src/dcd_mapping/lookup.py to tokenize the target name and try each token for gene normalization, making gene symbol extraction more robust.
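
The token-by-token lookup can be sketched roughly as follows. This is not the body of `get_gene_symbol`; the helper name `first_normalizable_token`, the separator set, and the `normalize` callable (standing in for the real gene normalizer) are all assumptions for illustration.

```python
import re
from typing import Callable, Optional


def first_normalizable_token(
    target_name: str, normalize: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try each token of the target name; return the first gene symbol found.

    `normalize` is a hypothetical stand-in for a gene normalizer lookup that
    returns a symbol on success and None otherwise.
    """
    # Split on whitespace and a few common separators; drop empty strings.
    # Hyphens are kept, since some gene symbols contain them.
    tokens = [t for t in re.split(r"[\s,;:/()]+", target_name) if t]
    for token in tokens:
        symbol = normalize(token)
        if symbol is not None:
            return symbol
    return None


# Toy normalizer backed by a dict, for demonstration only.
KNOWN = {"BRCA1": "BRCA1", "TP53": "TP53"}
symbol = first_normalizable_token("Human BRCA1 RING domain", KNOWN.get)
```

With a first-word-only lookup, "Human" would fail to normalize and no symbol would be found; trying every token recovers `BRCA1` from the second word.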

- Introduce `AlignmentQc` schema capturing per-alignment BLAT quality metrics (identity, CIGAR, mismatch positions, gap intervals) with in-memory-only position lists excluded from serialization
- Add `TargetMapping` schema representing a per-(target, layer) mapping row with variant counts, tool parameters, and preferred-layer flag
- Add `VrsMapResult` NamedTuple to pair VRS mappings with their `TargetMapping` rows from `vrs_map`
- Rename `annotation_layer` → `alignment_level` on `MappedScore` / `ScoreAnnotation` to align with new terminology
- Rename `ident_pct` → `percent_identity` on `AlignmentResult`; add `score`, `next_best_score`, `alignment_qc`, `aligner_parameters`, and `reference_assembly` fields
- Implement `build_scoreset_mapping` in `annotate.py` to assemble `ScoresetMapping` with populated `target_mappings` list, per-variant locus-quality flags (`at_mismatched_locus`, `near_gap`), and reference sequence metadata
- Restore canonical BLAT PSL scoring (`matches - misMatches - qNumInsert - tNumInsert`) in `_get_best_hsp`; previous BioPython port used raw identity count, causing noisy alignments to outrank clean ones
- Update JSON schema, API router, CI workflow, and README to reflect new output shape
- Add `test_annotate_target_mapping.py` and expand `test_align.py` / `test_annotate.py` with unit tests for new logic

Co-authored-by: Copilot <copilot@github.com>
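
The effect of the scoring fix in `_get_best_hsp` can be illustrated with a minimal sketch. The formula is the one quoted in the commit message; the `PslCounts` type, the `psl_score` helper, and the example numbers are invented for illustration and are not taken from the repository.

```python
from typing import NamedTuple


class PslCounts(NamedTuple):
    """Hypothetical container for the PSL columns used by the score."""

    matches: int
    mismatches: int
    q_num_insert: int
    t_num_insert: int


def psl_score(c: PslCounts) -> int:
    # matches - misMatches - qNumInsert - tNumInsert, as in the commit message.
    return c.matches - c.mismatches - c.q_num_insert - c.t_num_insert


# A shorter clean hit versus a longer hit with mismatches and gaps.
clean = PslCounts(matches=95, mismatches=0, q_num_insert=0, t_num_insert=0)
noisy = PslCounts(matches=100, mismatches=20, q_num_insert=3, t_num_insert=2)

best_by_identity = max([clean, noisy], key=lambda c: c.matches)  # picks noisy
best_by_psl = max([clean, noisy], key=psl_score)  # picks clean (95 vs 75)
```

Ranking by raw match count alone rewards the longer, noisier alignment; subtracting mismatches and inserts penalizes it, so the clean alignment wins.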

…alignment_qc

- Protein-vs-DNA (-q=prot -t=dnax) BLAT runs store target coords in nucleotide space and query coords in amino-acid space (3:1 ratio); minus-strand target blocks have ts > te, making seq[ts:te] return "". Comparison was crashing with ValueError from zip(strict=True); the per-base mismatch loop is now skipped entirely for this mode, setting mismatch_positions_unavailable=True so at_mismatched_locus is correctly left as None (not evaluated) rather than a false False. The preferred layer for protein scoresets is PROTEIN, flagged from the downstream protein-to-protein alignment, so no signal is lost.
- For nucleotide-vs-nucleotide runs, replace the bare zip(strict=True) with an explicit length-mismatch guard that logs a WARNING and falls through to zip(strict=False), preserving all mismatches in the overlapping prefix rather than crashing or discarding the block.

bencap and others added 5 commits April 28, 2026 15:41

fix(annotate): ensure JSON output is expanded by serializing without JSON mode

b449001

feat(lookup): enhance gene symbol retrieval by tokenizing target names for better matching

736db82

This is mostly useful in multi-word target names where gene information is available but not in the first word of the target name.

fix(mapping): update aligner type to reference_accession_passthrough and adjust related annotations

ef26cbd

Co-authored-by: Copilot <copilot@github.com>

bencap mentioned this pull request Apr 30, 2026

feat: Target Gene Mapping Table VariantEffect/mavedb-api#719

Draft

chore(schema): Update schema to reflect updated output

6cc9b66

bencap wants to merge 6 commits into feature/bencap/improved-error-visibility from feature/bencap/86/target-level-mapping-metadata
