Construction-time scanner schema, unified BED model, and CRAM reference support #161
Merged
nvictus merged 6 commits into abdenlab:main on Mar 6, 2026
Extract the rich AutoSql-derived type system (FieldType, FieldDef,
FieldBuilder) from bbi/model/base/ into bed/model/field_def.rs so both
BED and BBI formats share a common schema vocabulary. The bed module
has no dependency on bigtools.
- Add bed/model/field_def.rs with FieldType (37 variants), FieldDef,
generic string-parsing FieldBuilder, and bed_standard_fields()
- Replace minimal BedSchema {n, m} with rich BedSchema {n, m, fields}
supporting new(), new_from_nm(), new_bedgraph(), and FromStr
- Slim bbi/model/base/field.rs to only TryFrom<&AutosqlField> impls
- Remove bbi/model/base/schema.rs; BBI re-exports from bed module
- Arrow type mappings remain format-specific (BED: Int64, BBI: UInt32)
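The richer `{n, m, fields}` schema described above can be illustrated with a small Python sketch. This is not the PR's Rust code: the `from_nm` and `parse` names, the `bedN+M` spec notation, the UCSC-style column names, and the `fieldK` naming of extra columns are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# The 12 standard BED columns, using UCSC names (an assumption; the PR's
# bed_standard_fields() may name them differently).
BED_STANDARD_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
)


@dataclass(frozen=True)
class BedSchema:
    n: int         # number of standard BED columns (3..12)
    m: int         # number of extra, non-standard columns
    fields: tuple  # resolved column names, length n + m

    @classmethod
    def from_nm(cls, n, m=0):
        """Build a schema from column counts alone (hypothetical helper)."""
        if not 3 <= n <= 12:
            raise ValueError(f"n must be between 3 and 12, got {n}")
        if m < 0:
            raise ValueError(f"m must be non-negative, got {m}")
        # Extra columns get placeholder names numbered by column position.
        extra = tuple(f"field{i}" for i in range(n + 1, n + m + 1))
        return cls(n, m, BED_STANDARD_FIELDS[:n] + extra)

    @classmethod
    def parse(cls, spec):
        """Parse a 'bedN' or 'bedN+M' string (assumed notation)."""
        match = re.fullmatch(r"bed(\d+)(?:\+(\d+))?", spec.strip().lower())
        if match is None:
            raise ValueError(f"not a BED schema spec: {spec!r}")
        return cls.from_nm(int(match.group(1)), int(match.group(2) or 0))
```

For example, `BedSchema.parse("bed6+2")` would resolve to the first six standard columns plus two placeholder columns, so a BED reader and a BigBed reader could share the same field list and differ only in how they map it to Arrow types.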
Move `fields` (and related schema parameters) from scan-time arguments to construction-time arguments across all scanner types (BED, BigBed, BigWig, BBIZoom, SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA, FASTQ). Scanners now validate and cache their Arrow schema at construction, and scan methods accept only dynamic parameters: `columns` (projection), `batch_size`, and `limit`. This aligns the Rust scanner API with the Arrow dataset/fragment pattern used by the Python DataSource classes, where schema is known up front and scans only control how data is read.
Move schema-defining parameters (fields, tag_defs, attr_defs, info_fields, genotype_fields, samples, genotype_by, repo) from scan method calls to Scanner constructors. Scan methods now only accept runtime parameters (columns, batch_size, limit). Also update tag_defs() and attribute_defs() to be called as static methods on the scanner type rather than instance methods.
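The split between construction-time and scan-time parameters can be sketched with a toy in-memory scanner. `ToyScanner`, its `AVAILABLE` field list, and the dict-based batches are hypothetical stand-ins, not the PR's Rust or Python API; only the parameter names `fields`, `columns`, `batch_size`, and `limit` come from the description above.

```python
from itertools import islice


class ToyScanner:
    """Hypothetical scanner over a list of dict rows (illustration only)."""

    # Fields this toy format can produce (an assumption for the sketch).
    AVAILABLE = ("chrom", "start", "end", "name")

    def __init__(self, rows, fields=None):
        fields = tuple(fields) if fields is not None else self.AVAILABLE
        unknown = [f for f in fields if f not in self.AVAILABLE]
        if unknown:
            raise ValueError(f"unknown fields: {unknown}")
        self._rows = rows
        # The schema is validated once here and cached for every later scan.
        self.schema = fields

    def scan(self, columns=None, batch_size=1024, limit=None):
        """Yield batches; only runtime parameters are accepted here.

        This is a generator, so argument errors surface when iteration starts.
        """
        columns = tuple(columns) if columns is not None else self.schema
        missing = [c for c in columns if c not in self.schema]
        if missing:
            raise ValueError(f"columns not in schema: {missing}")
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        batch = []
        # islice with limit=None reads every row.
        for row in islice(self._rows, limit):
            batch.append({c: row[c] for c in columns})
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
```

A scanner built once, for instance `ToyScanner(rows, fields=["chrom", "start", "end"])`, can then be scanned repeatedly with different `columns`, `batch_size`, or `limit` values without re-deriving the schema.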
This also fixes a bug in which PyCramScanner's tag_defs() method created a noodles CRAM reader without the reference sequence repository, causing failures when decoding CRAM files that store bases as diffs against an external reference. On the Python side, extract _tag_discovery_kwargs() so CramFile can pass reference/reference_index to the throwaway scanner used during tag discovery in __init__. Add tests for CRAM decoding with an external reference (full scan and regional query).
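The `_tag_discovery_kwargs()` hook can be pictured roughly as follows. Only the names `CramFile`, `_tag_discovery_kwargs`, `reference`, and `reference_index` come from the description; `FakeCramScanner`, `AlignmentFileBase`, the returned tag list, and the file paths are hypothetical, and the real scanner's constructor and `tag_defs()` signatures may differ.

```python
class FakeCramScanner:
    """Hypothetical scanner that cannot decode records without a reference."""

    def __init__(self, path, reference=None, reference_index=None):
        self.path = path
        self.reference = reference
        self.reference_index = reference_index

    def tag_defs(self):
        # Mimics the failure mode described above: bases are stored as
        # diffs against an external reference, so decoding needs it.
        if self.reference is None:
            raise RuntimeError("cannot decode records without a reference")
        return [("NM", "i"), ("MD", "Z")]  # made-up tag definitions


class AlignmentFileBase:
    """Hypothetical base class shared by alignment file data sources."""

    _scanner_type = None

    def __init__(self, path):
        self._path = path
        # Throwaway scanner used only to discover tag definitions.
        probe = self._scanner_type(path, **self._tag_discovery_kwargs())
        self.tag_defs = probe.tag_defs()

    def _tag_discovery_kwargs(self):
        # Most formats need nothing extra for tag discovery.
        return {}


class CramFile(AlignmentFileBase):
    _scanner_type = FakeCramScanner

    def __init__(self, path, reference=None, reference_index=None):
        # Stored before super().__init__ so the hook below can see them.
        self._reference = reference
        self._reference_index = reference_index
        super().__init__(path)

    def _tag_discovery_kwargs(self):
        return {
            "reference": self._reference,
            "reference_index": self._reference_index,
        }
```

With this shape, `CramFile("x.cram", reference="ref.fa", reference_index="ref.fa.fai")` (placeholder paths) forwards the reference to the probe scanner, whereas omitting the reference reproduces the decoding failure at construction time.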
The previous scanner API accepted schema-defining params (fields, tag_defs, info_fields, etc.) at scan() time. Moving these to the constructor makes the lifecycle clearer: construct once with schema params, then scan multiple times with different projections/batch sizes. This helps separate the concerns of defining an initial schema on a file vs projecting columns onto that schema. It also simplifies the logic in the Python API since the schema doesn't need to be re-discovered on every query.
Additional refactors and changes:
- Unify BED schema model across bed and bbi modules: extract shared FieldType, FieldDef, and BedSchema into bed/model/ and re-export from bbi, eliminating ~1200 lines of duplication.
- Support CRAM external reference in from_cram. Add test fixtures (sample-ref.*) and Python tests verifying decoded bases against a reference FASTA.