Construction-time scanner schema, unified BED model, and CRAM reference support by nvictus · Pull Request #161 · abdenlab/oxbow

nvictus · 2026-03-06T11:35:37Z

The previous scanner API accepted schema-defining params (fields, tag_defs, info_fields, etc.) at scan() time. Moving these to the constructor makes the lifecycle clearer: construct once with schema params, then scan multiple times with different projections/batch sizes. This helps separate the concerns of defining an initial schema on a file vs projecting columns onto that schema. It also simplifies the logic in the Python API since the schema doesn't need to be re-discovered on every query.

- Move schema-defining params to scanner construction time for all formats (SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA, FASTQ, BED, BigBed, BigWig).
- Update Python bindings and DataSource classes to match the new Rust scanner API. All Py*Scanner wrappers and DataSource subclasses updated accordingly.
- Update R bindings to use the new construction-time API for all 12 format readers.

Additional refactors and changes:

- Unify BED schema model across bed and bbi modules: extract shared FieldType, FieldDef, and BedSchema into bed/model/ and re-export from bbi, eliminating ~1200 lines of duplication.
- Support CRAM external reference in from_cram. Add test fixtures (sample-ref.*) and Python tests verifying decoded bases against a reference FASTA.

Extract the rich AutoSql-derived type system (FieldType, FieldDef, FieldBuilder) from bbi/model/base/ into bed/model/field_def.rs so both BED and BBI formats share a common schema vocabulary. The bed module has no dependency on bigtools.

- Add bed/model/field_def.rs with FieldType (37 variants), FieldDef, generic string-parsing FieldBuilder, and bed_standard_fields()
- Replace minimal BedSchema {n, m} with rich BedSchema {n, m, fields} supporting new(), new_from_nm(), new_bedgraph(), and FromStr
- Slim bbi/model/base/field.rs to only TryFrom<&AutosqlField> impls
- Remove bbi/model/base/schema.rs; BBI re-exports from bed module
- Arrow type mappings remain format-specific (BED: Int64, BBI: UInt32)
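To make the {n, m, fields} shape concrete, here is a rough Python sketch of a schema built from an n/m pair or parsed from a spec string. It is not the Rust implementation: the `bedN+M` string grammar, the `from_nm` and `parse` names, the `field1`, `field2`, ... naming for custom columns, and the use of UCSC column names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# The 12 standard BED columns, using the names from the UCSC BED description.
BED_STANDARD_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
)


@dataclass(frozen=True)
class BedSchema:
    """Hypothetical stand-in for a schema carrying n, m, and field names."""

    n: int          # number of standard BED columns (3 to 12)
    m: int          # number of custom columns that follow
    fields: tuple   # resolved column names, standard first

    @classmethod
    def from_nm(cls, n, m=0):
        if not 3 <= n <= 12:
            raise ValueError(f"n must be between 3 and 12, got {n}")
        if m < 0:
            raise ValueError(f"m must be non-negative, got {m}")
        # Custom column names are placeholders (assumed naming scheme).
        custom = tuple(f"field{i}" for i in range(1, m + 1))
        return cls(n, m, BED_STANDARD_FIELDS[:n] + custom)

    @classmethod
    def parse(cls, spec):
        # Accepts "bed6" or "bed6+2" (assumed grammar for this sketch).
        match = re.fullmatch(r"bed(\d+)(?:\+(\d+))?", spec.strip().lower())
        if match is None:
            raise ValueError(f"unrecognized BED spec: {spec!r}")
        return cls.from_nm(int(match.group(1)), int(match.group(2) or 0))


schema = BedSchema.parse("bed6+2")
print(schema.n, schema.m, schema.fields)
```

Carrying the resolved field list on the schema is what lets BED and BBI readers share one vocabulary while still mapping fields to different Arrow types.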

Move `fields` (and related schema parameters) from scan-time arguments to construction-time arguments across all scanner types (BED, BigBed, BigWig, BBIZoom, SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA, FASTQ). Scanners now validate and cache their Arrow schema at construction, and scan methods accept only dynamic parameters: `columns` (projection), `batch_size`, and `limit`. This aligns the Rust scanner API with the Arrow dataset/fragment pattern used by the Python DataSource classes, where schema is known up front and scans only control how data is read.
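The lifecycle described above can be sketched with a toy, pure-Python scanner. `ToyBedScanner`, `KNOWN_FIELDS`, and the dict-based rows are hypothetical and exist only to illustrate the split; the real scanners are Rust types that produce Arrow record batches. Only the parameter names `fields`, `columns`, `batch_size`, and `limit` come from this PR.

```python
from itertools import islice


class ToyBedScanner:
    """Hypothetical scanner: schema fixed at construction, projection at scan."""

    KNOWN_FIELDS = {"chrom": str, "start": int, "end": int, "name": str}

    def __init__(self, rows, fields=None):
        fields = list(fields) if fields is not None else list(self.KNOWN_FIELDS)
        unknown = [f for f in fields if f not in self.KNOWN_FIELDS]
        if unknown:
            raise ValueError(f"unknown fields: {unknown}")
        self._rows = rows
        # Validated once and cached; every later scan reuses this schema.
        self.schema = {f: self.KNOWN_FIELDS[f] for f in fields}

    def scan(self, columns=None, batch_size=1024, limit=None):
        # Only dynamic parameters here: projection, batch size, row limit.
        # Because this is a generator, the check below runs on first iteration.
        columns = list(columns) if columns is not None else list(self.schema)
        missing = [c for c in columns if c not in self.schema]
        if missing:
            raise ValueError(f"columns not in schema: {missing}")
        rows = iter(self._rows if limit is None else islice(self._rows, limit))
        while True:
            batch = [{c: row[c] for c in columns} for row in islice(rows, batch_size)]
            if not batch:
                return
            yield batch


rows = [
    {"chrom": "chr1", "start": 0, "end": 10, "name": "a"},
    {"chrom": "chr1", "start": 10, "end": 20, "name": "b"},
    {"chrom": "chr2", "start": 5, "end": 15, "name": "c"},
]
scanner = ToyBedScanner(rows, fields=["chrom", "start", "end"])  # construct once
first = list(scanner.scan(columns=["chrom"], batch_size=2))      # scan many times
second = list(scanner.scan(columns=["start", "end"], limit=1))
```

In this sketch a projection can only narrow the construction-time schema: asking for `name` fails because it was not among the `fields` given to the constructor.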

Move schema-defining parameters (fields, tag_defs, attr_defs, info_fields, genotype_fields, samples, genotype_by, repo) from scan method calls to Scanner constructors. Scan methods now only accept runtime parameters (columns, batch_size, limit). Also update tag_defs() and attribute_defs() to be called as static methods on the scanner type rather than instance methods.
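One reason discovery fits a static method is that its result feeds the constructor, so it has to be callable before any scanner instance exists. The sketch below illustrates that ordering on plain SAM text lines. `ToySamScanner`, `discover_tag_defs`, and `scan_rows` are hypothetical names, not the signatures of the actual bindings; the only format fact relied on is that SAM records carry 11 mandatory tab-separated fields followed by optional `TAG:TYPE:VALUE` fields.

```python
class ToySamScanner:
    """Hypothetical scanner whose tag definitions are fixed at construction."""

    def __init__(self, lines, tag_defs=()):
        self._lines = lines
        self.tag_defs = tuple(tag_defs)

    @staticmethod
    def discover_tag_defs(lines, scan_rows=1024):
        """Collect (tag, type) pairs from the first `scan_rows` records."""
        seen = {}
        n_records = 0
        for line in lines:
            if line.startswith("@"):  # skip header lines
                continue
            if n_records >= scan_rows:
                break
            n_records += 1
            # Optional fields start after the 11 mandatory SAM columns.
            for opt in line.rstrip("\n").split("\t")[11:]:
                tag, type_code, _value = opt.split(":", 2)
                seen.setdefault(tag, type_code)  # keep the first type seen
        return sorted(seen.items())


lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tRG:Z:grp1",
    "r2\t16\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\tAS:i:4",
]
# Discovery needs no instance, so its result can be passed to the constructor.
defs = ToySamScanner.discover_tag_defs(lines)
scanner = ToySamScanner(lines, tag_defs=defs)
```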

This also fixes a bug in which PyCramScanner's tag_defs() method created a noodles CRAM reader without the reference sequence repository, causing failures when decoding CRAM files that store bases as diffs against an external reference. On the Python side, extract _tag_discovery_kwargs() so that CramFile can pass reference/reference_index to the throwaway scanner used during tag discovery in __init__. Add tests for CRAM decoding with an external reference (full scan and regional query).
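The hook pattern behind `_tag_discovery_kwargs()` can be sketched as follows. Only the names `_tag_discovery_kwargs`, `CramFile`, `reference`, and `reference_index` come from this PR; `AlignmentFile`, `discover_tags`, and the rule that every `.cram` path needs a reference are simplifications invented for this toy example.

```python
def discover_tags(path, reference=None, reference_index=None):
    """Stand-in for the throwaway scanner used during tag discovery."""
    # Toy rule: pretend every .cram file is reference-compressed.
    if path.endswith(".cram") and reference is None:
        raise ValueError("cannot decode CRAM records without a reference")
    return [("NM", "i")]


class AlignmentFile:
    """Hypothetical base class: discovers tags in __init__ unless given."""

    def __init__(self, path, tag_defs=None):
        self.path = path
        if tag_defs is None:
            # Subclasses decide what extra arguments discovery needs.
            tag_defs = discover_tags(path, **self._tag_discovery_kwargs())
        self.tag_defs = tag_defs

    def _tag_discovery_kwargs(self):
        return {}  # most formats need nothing extra


class CramFile(AlignmentFile):
    def __init__(self, path, reference=None, reference_index=None, tag_defs=None):
        # Set these first: the base __init__ calls the hook during discovery.
        self.reference = reference
        self.reference_index = reference_index
        super().__init__(path, tag_defs=tag_defs)

    def _tag_discovery_kwargs(self):
        return {
            "reference": self.reference,
            "reference_index": self.reference_index,
        }


cram = CramFile("sample.cram", reference="ref.fa", reference_index="ref.fa.fai")
print(cram.tag_defs)
```

Without the override, discovery would run with no reference and fail in the same way as the bug described above.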

nvictus added 6 commits March 5, 2026 15:52

Format rust

866d856

Format python

52df649

nvictus merged commit 024d866 into abdenlab:main Mar 6, 2026
7 checks passed

nvictus merged 6 commits into abdenlab:main from nvictus:refactor-scanner-api
