Construction-time scanner schema, unified BED model, and CRAM reference support #161

Merged
nvictus merged 6 commits into abdenlab:main from nvictus:refactor-scanner-api
Mar 6, 2026

@nvictus nvictus commented Mar 6, 2026

The previous scanner API accepted schema-defining params (fields, tag_defs, info_fields, etc.) at scan() time. Moving these to the constructor makes the lifecycle clearer: construct once with schema params, then scan multiple times with different projections/batch sizes. This helps separate the concerns of defining an initial schema on a file vs projecting columns onto that schema. It also simplifies the logic in the Python API since the schema doesn't need to be re-discovered on every query.

  • Move schema-defining params to scanner construction time for all formats (SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA, FASTQ, BED, BigBed, BigWig).
    • Update the Python bindings to match the new Rust scanner API, including all Py*Scanner wrappers and DataSource subclasses.
    • Update R bindings to use the new construction-time API for all 12 format readers.
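
For reviewers who want the intended lifecycle in one place, here is a minimal sketch in plain Python. `ToyScanner`, `ALL_FIELDS`, and the record layout are hypothetical stand-ins, not the actual Rust or Python scanner classes. The point is only that schema choices are validated once at construction, while each scan picks a projection over that fixed schema.

```python
# Hypothetical stand-in for a format scanner; not the project's real API.
ALL_FIELDS = ("chrom", "start", "end", "name", "score", "strand")


class ToyScanner:
    def __init__(self, fields):
        # Schema-defining parameters are checked once, at construction time.
        unknown = [f for f in fields if f not in ALL_FIELDS]
        if unknown:
            raise ValueError(f"unknown fields: {unknown}")
        self.schema = tuple(fields)

    def scan(self, records, columns=None):
        # Scan-time parameters only select from the schema fixed above.
        cols = self.schema if columns is None else tuple(columns)
        missing = [c for c in cols if c not in self.schema]
        if missing:
            raise KeyError(f"columns not in schema: {missing}")
        return [{c: rec[c] for c in cols} for rec in records]


records = [
    {"chrom": "chr1", "start": 0, "end": 10, "name": "a", "score": 5, "strand": "+"},
    {"chrom": "chr2", "start": 5, "end": 20, "name": "b", "score": 7, "strand": "-"},
]

scanner = ToyScanner(fields=["chrom", "start", "end", "name"])  # construct once
full = scanner.scan(records)                                   # scan with the full schema
narrow = scanner.scan(records, columns=["chrom", "name"])      # scan again, narrower projection
```

In this toy version a column outside the construction-time schema (for example `score`) is rejected at scan time rather than silently widening the schema.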

Additional refactors and changes:

  • Unify BED schema model across bed and bbi modules — extract shared FieldType, FieldDef, and BedSchema into bed/model/ and re-export from bbi, eliminating ~1200 lines of duplication.

  • Support CRAM external reference in from_cram. Add test fixtures (sample-ref.*) and Python tests verifying decoded bases against a reference FASTA.

nvictus added 6 commits March 5, 2026 15:52
Extract the rich AutoSql-derived type system (FieldType, FieldDef,
FieldBuilder) from bbi/model/base/ into bed/model/field_def.rs so both
BED and BBI formats share a common schema vocabulary. The bed module
has no dependency on bigtools.

- Add bed/model/field_def.rs with FieldType (37 variants), FieldDef,
  generic string-parsing FieldBuilder, and bed_standard_fields()
- Replace minimal BedSchema {n, m} with rich BedSchema {n, m, fields}
  supporting new(), new_from_nm(), new_bedgraph(), and FromStr
- Slim bbi/model/base/field.rs to only TryFrom<&AutosqlField> impls
- Remove bbi/model/base/schema.rs; BBI re-exports from bed module
- Arrow type mappings remain format-specific (BED: Int64, BBI: UInt32)
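
As a rough illustration of the richer `{n, m, fields}` shape, here is a Python sketch. The real types are Rust. `ToyBedSchema`, its method names, the standard field names, the `fieldN` placeholders for extra columns, and the `bedN+M` specifier format are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Assumed names for the 12 standard BED columns (illustrative only).
STANDARD_FIELDS = (
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
)


@dataclass(frozen=True)
class ToyBedSchema:
    n: int         # number of standard BED columns
    m: int         # number of extra columns
    fields: tuple  # resolved field names, standard columns first

    @classmethod
    def from_nm(cls, n, m=0):
        if not 3 <= n <= 12:
            raise ValueError("n must be between 3 and 12")
        if m < 0:
            raise ValueError("m must be non-negative")
        # Placeholder names for extra columns; a real schema would carry types too.
        extras = tuple(f"field{i}" for i in range(n + 1, n + m + 1))
        return cls(n, m, STANDARD_FIELDS[:n] + extras)

    @classmethod
    def parse(cls, spec):
        # Accepts only "bedN" or "bedN+M" in this sketch.
        match = re.fullmatch(r"bed(\d+)(?:\+(\d+))?", spec.strip().lower())
        if match is None:
            raise ValueError(f"unrecognized BED specifier: {spec!r}")
        return cls.from_nm(int(match[1]), int(match[2] or 0))


schema = ToyBedSchema.parse("bed6+3")
```

Carrying `fields` alongside `n` and `m` is what lets two formats share one schema vocabulary while each keeps its own Arrow type mapping.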
Move `fields` (and related schema parameters) from scan-time arguments
to construction-time arguments across all scanner types (BED, BigBed,
BigWig, BBIZoom, SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA, FASTQ).
Scanners now validate and cache their Arrow schema at construction,
and scan methods accept only dynamic parameters: `columns` (projection),
`batch_size`, and `limit`.

This aligns the Rust scanner API with the Arrow dataset/fragment pattern
used by the Python DataSource classes, where schema is known up front
and scans only control how data is read.
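
To make the "dynamic parameters" concrete, here is a small sketch of how `batch_size` and `limit` can interact when rows are streamed. `iter_batches` is a hypothetical helper written for this comment, not a function from the codebase, and the semantics shown (limit counts rows, not batches) are an assumption.

```python
from itertools import islice


def iter_batches(rows, batch_size=1024, limit=None):
    """Yield lists of at most `batch_size` rows, stopping after `limit` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    # Apply the row limit first, then chunk whatever remains.
    it = iter(rows) if limit is None else islice(rows, limit)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(10), batch_size=4, limit=7))
```

Neither parameter changes which columns exist or their types, which is why they can vary between scans of the same scanner.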
Move schema-defining parameters (fields, tag_defs, attr_defs,
info_fields, genotype_fields, samples, genotype_by, repo) from scan
method calls to Scanner constructors. Scan methods now only accept
runtime parameters (columns, batch_size, limit).

Also update tag_defs() and attribute_defs() to be called as static
methods on the scanner type rather than instance methods.
This also fixes a bug where PyCramScanner's tag_defs() method created
a noodles CRAM reader without the reference sequence repository,
causing failures when decoding CRAM files that store bases as diffs
against an external reference.

On the Python side, extract _tag_discovery_kwargs() so CramFile can
pass reference/reference_index to the throwaway scanner used during
tag discovery in __init__. Add tests for CRAM decoding with an
external reference (full scan and regional query).
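
The failure mode and the fix can be sketched with a toy model in Python. Everything here is hypothetical: `ToyReferenceRepo`, `decode_bases`, `discover_tags`, `ToyCramFile`, and the record layout are made up for illustration and greatly simplify how CRAM actually encodes reads. The only name taken from this PR is `_tag_discovery_kwargs`.

```python
class ToyReferenceRepo:
    """Hypothetical in-memory reference lookup."""

    def __init__(self, sequences):
        self._seqs = dict(sequences)

    def fetch(self, name, start, length):
        return self._seqs[name][start:start + length]


def decode_bases(record, repo):
    # Bases are stored as substitutions against the reference in this toy model,
    # so decoding without a reference cannot succeed.
    if repo is None:
        raise ValueError("reference-compressed record needs a reference")
    bases = list(repo.fetch(record["ref"], record["start"], record["length"]))
    for offset, base in record["subs"].items():
        bases[offset] = base
    return "".join(bases)


def discover_tags(records, repo=None):
    tags = set()
    for rec in records:
        decode_bases(rec, repo)  # in this toy, records must decode before tags are read
        tags.update(rec["tags"])
    return sorted(tags)


class ToyCramFile:
    def __init__(self, records, reference=None):
        self.records = records
        self.reference = reference
        # Tag discovery runs in __init__, so it must see the reference too.
        self.tags = discover_tags(records, **self._tag_discovery_kwargs())

    def _tag_discovery_kwargs(self):
        return {"repo": self.reference}


repo = ToyReferenceRepo({"chr1": "ACGTACGTAC"})
record = {"ref": "chr1", "start": 2, "length": 5, "subs": {1: "A"}, "tags": {"NM": 1}}
decoded = decode_bases(record, repo)
cram = ToyCramFile([record], reference=repo)
```

Routing the reference through a single kwargs helper keeps the throwaway discovery path and the main scan path from drifting apart.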
@nvictus nvictus merged commit 024d866 into abdenlab:main Mar 6, 2026
7 checks passed