feat: Target level mapping metadata #97
Draft
bencap wants to merge 6 commits into feature/bencap/improved-error-visibility from
- Introduce `AlignmentQc` schema capturing per-alignment BLAT quality metrics (identity, CIGAR, mismatch positions, gap intervals) with in-memory-only position lists excluded from serialization
- Add `TargetMapping` schema representing a per-(target, layer) mapping row with variant counts, tool parameters, and preferred-layer flag
- Add `VrsMapResult` NamedTuple to pair VRS mappings with their `TargetMapping` rows from `vrs_map`
- Rename `annotation_layer` → `alignment_level` on `MappedScore` / `ScoreAnnotation` to align with new terminology
- Rename `ident_pct` → `percent_identity` on `AlignmentResult`; add `score`, `next_best_score`, `alignment_qc`, `aligner_parameters`, and `reference_assembly` fields
- Implement `build_scoreset_mapping` in `annotate.py` to assemble `ScoresetMapping` with populated `target_mappings` list, per-variant locus-quality flags (`at_mismatched_locus`, `near_gap`), and reference sequence metadata
- Restore canonical BLAT PSL scoring (`matches - misMatches - qNumInsert - tNumInsert`) in `_get_best_hsp`; previous BioPython port used raw identity count, causing noisy alignments to outrank clean ones
- Update JSON schema, API router, CI workflow, and README to reflect new output shape
- Add `test_annotate_target_mapping.py` and expand `test_align.py` / `test_annotate.py` with unit tests for new logic

Co-authored-by: Copilot <copilot@github.com>
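The scoring change in the last-but-two bullet can be illustrated with a minimal sketch. It uses only the formula quoted in the commit message; the `PslCounts` container and the `psl_score` / `best_hit` helpers are hypothetical names for illustration and are not the project's `_get_best_hsp` implementation.

```python
from typing import Iterable, NamedTuple


class PslCounts(NamedTuple):
    """Hypothetical container for the four PSL columns used in the score."""

    matches: int
    mis_matches: int
    q_num_insert: int
    t_num_insert: int


def psl_score(c: PslCounts) -> int:
    # Formula as quoted in the commit message:
    # matches - misMatches - qNumInsert - tNumInsert
    return c.matches - c.mis_matches - c.q_num_insert - c.t_num_insert


def best_hit(hits: Iterable[PslCounts]) -> PslCounts:
    # Rank by the penalised score rather than by the raw match count.
    return max(hits, key=psl_score)


# A noisy alignment has more matching bases but also mismatches and gaps.
noisy = PslCounts(matches=100, mis_matches=10, q_num_insert=3, t_num_insert=2)
clean = PslCounts(matches=98, mis_matches=0, q_num_insert=0, t_num_insert=0)

by_raw_matches = max([noisy, clean], key=lambda c: c.matches)
by_psl_score = best_hit([noisy, clean])
```

Ranking by raw match count picks `noisy` here, while the penalised score picks `clean`, which is the behaviour the commit describes restoring.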
…alignment_qc

- Protein-vs-DNA (`-q=prot -t=dnax`) BLAT runs store target coords in nucleotide space and query coords in amino-acid space (3:1 ratio); minus-strand target blocks have `ts > te`, making `seq[ts:te]` return `""`. The comparison was crashing with a `ValueError` from `zip(strict=True)`. The per-base mismatch loop is now skipped entirely for this mode, setting `mismatch_positions_unavailable=True` so that `at_mismatched_locus` is correctly left as `None` (not evaluated) rather than reported as a spurious `False`. The preferred layer for protein scoresets is PROTEIN, flagged from the downstream protein-to-protein alignment, so no signal is lost.
- For nucleotide-vs-nucleotide runs, replace the bare `zip(strict=True)` with an explicit length-mismatch guard that logs a WARNING and falls through to `zip(strict=False)`, preserving all mismatches in the overlapping prefix rather than crashing or discarding the block.
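A rough sketch of the two behaviours described in this commit, under stated assumptions: `mismatch_offsets` and `at_mismatched_locus` below are hypothetical stand-ins, not the project's functions, and passing `None` models the case where `mismatch_positions_unavailable` is set.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def mismatch_offsets(query_block: str, target_block: str) -> List[int]:
    """Return 0-based offsets at which two aligned blocks differ."""
    if len(query_block) != len(target_block):
        # Explicit guard: warn instead of letting zip(strict=True) raise.
        logger.warning(
            "Block length mismatch (query=%d, target=%d); "
            "comparing the overlapping prefix only",
            len(query_block),
            len(target_block),
        )
    # Plain zip() behaves like zip(strict=False): it stops at the shorter input,
    # so mismatches in the overlapping prefix are still reported.
    return [
        i for i, (q, t) in enumerate(zip(query_block, target_block)) if q != t
    ]


def at_mismatched_locus(
    position: int, mismatches: Optional[List[int]]
) -> Optional[bool]:
    """Three-state flag: None means 'not evaluated', not 'no mismatch'."""
    if mismatches is None:
        return None
    return position in mismatches
```

Keeping `None` distinct from `False` lets downstream consumers tell "no mismatch found" apart from "mismatch data was never computed".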
…s for better matching

This is mostly useful in multi-word target names, where gene information is available but not in the first word of the target name.
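The idea of trying each word of a target name can be sketched as follows. This is an illustration only: `first_normalizable_token`, `toy_normalize`, and the `KNOWN` lookup table are hypothetical, and whitespace splitting is an assumption about how the name is tokenized; the real `get_gene_symbol` queries a gene normalizer rather than a dictionary.

```python
from typing import Callable, Optional


def first_normalizable_token(
    target_name: str, normalize: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try each whitespace-separated token until one normalizes to a symbol."""
    for token in target_name.split():
        symbol = normalize(token)
        if symbol is not None:
            return symbol
    return None


# Toy stand-in for a gene normalizer lookup (assumption, for illustration).
KNOWN = {"brca1": "BRCA1", "tp53": "TP53"}


def toy_normalize(term: str) -> Optional[str]:
    return KNOWN.get(term.lower())


hit = first_normalizable_token("human BRCA1 RING domain", toy_normalize)
miss = first_normalizable_token("synthetic linker construct", toy_normalize)
```

With only the first word tried, the first name would fail on "human"; scanning all tokens finds the gene in second position.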
…and adjust related annotations

Co-authored-by: Copilot <copilot@github.com>
This pull request introduces several improvements to the mapping output pipeline, focusing on standardizing the output format, simplifying result assembly, and improving maintainability. The changes ensure that mapping outputs conform to a documented schema, add automated schema validation to CI, and refactor the API code to use a single authoritative mapping output builder. Documentation has also been expanded to clarify the mapping output structure and regeneration workflow.
Mapping output standardization and documentation:
- `README.md`: documents the structure and fields of the mapping output JSON, including top-level keys, per-variant audit flags, and instructions for regenerating the schema.

Schema generation and validation:
- `scripts/generate_schema.py`: generates `schema.json` from the current Pydantic models, ensuring the schema stays in sync with code changes.
- CI workflow (`.github/workflows/checks.yaml`): verifies that `schema.json` is up to date on every CI run.

API and codebase refactoring:
- `/map` API endpoint in `src/api/routers/map.py`: now uses the new `build_scoreset_mapping` function for assembling mapping output, removing manual construction of the output structure and ensuring consistency with the schema. All error responses now consistently use the verbatim upstream metadata.

Code quality and style:
- `pyproject.toml`: ignores additional docstring-related lint rules in test files, reducing noise from style checks.

Gene symbol normalization:
- `get_gene_symbol` function in `src/dcd_mapping/lookup.py`: tokenizes the target name and tries each token for gene normalization, making gene symbol extraction more robust.
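The "schema is up to date" CI check described above could work roughly like the sketch below. This is not the contents of `scripts/generate_schema.py`: `render_schema` and `schema_is_current` are hypothetical helpers, and the schema dictionaries are toy values. In the real project the dictionary would presumably be produced from the Pydantic models (Pydantic v2 exposes `model_json_schema()` for this purpose).

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def render_schema(schema: Dict[str, Any]) -> str:
    """Serialize deterministically so that diffs reflect real changes only."""
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


def schema_is_current(schema: Dict[str, Any], path: Path) -> bool:
    """True when the committed file matches a freshly rendered schema."""
    return path.exists() and path.read_text(encoding="utf-8") == render_schema(schema)


# Demonstration with toy schema dictionaries in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "schema.json"
    current = {"title": "ScoresetMapping", "type": "object"}
    schema_path.write_text(render_schema(current), encoding="utf-8")
    fresh_ok = schema_is_current(current, schema_path)

    # Simulate a model change that was not followed by regeneration.
    changed = {**current, "required": ["target_mappings"]}
    stale_ok = schema_is_current(changed, schema_path)
```

A CI step built on this pattern would exit non-zero when the check returns `False`, prompting the contributor to regenerate and commit the schema.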