refactor: File-independent declarative data model structs for each format family by nvictus · Pull Request #169 · abdenlab/oxbow

nvictus · 2026-03-14T22:05:19Z

This PR introduces declarative Model structs for every format family, providing a single source of truth for Arrow schema configuration that is independent of any file instance (e.g., its header).

Each Model encapsulates the schema-defining parameters for its formats (which fields to include, and which composite sub-fields such as tags, attributes, info, and sample genotypes to materialize) and caches the resulting Arrow schema. Scanners and BatchBuilders now construct from Models rather than raw parameter bags. Batch builders take headers separately as needed (e.g., for dictionary-encoded chrom fields).

Data Models introduced

SequenceModel: fields only with format-aware defaults (new_fasta / new_fastq)
AlignmentModel: fields + tag_defs: Option
GxfModel: fields + attr_defs: Option
VariantModel: fields + info_defs + genotype_defs + samples + genotype_by, with from_header() for header-based derivation
BedModel (shared by BED, BigBed, BigWig scanners): wraps BedSchema (parsing interpretation) + a fields projection on that schema
BBIZoomModel: fixed 8-field zoom summary schema

Key design decisions

Semantics for standard (primary) fields parameters:
- None = include all default primary fields
- Some(vec![...]) = project only specified fields
Consistent Option semantics for composite fields (tags, attributes, info) parameters:
- None = omit column entirely
- Some(vec![]) = empty struct
- Some(vec![...]) = struct with sub-fields
project() method on each Model for column projection with validation

However, the Python API semantics are currently unchanged (e.g., empty tag/attribute results converted to None before passing to scanner).
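The Option semantics listed under "Key design decisions" can be illustrated with a minimal, dependency-free sketch. The `Column` enum and the `resolve_columns` function are hypothetical stand-ins invented for this illustration; they are not oxbow's types and do not touch Arrow.

```rust
/// Hypothetical stand-in for a column in the resulting schema (not an oxbow type).
#[derive(Debug, PartialEq)]
enum Column {
    /// A standard (primary) field, e.g. "qname".
    Primary(String),
    /// A composite struct column with zero or more named sub-fields.
    Struct { name: String, subfields: Vec<String> },
}

/// Resolve the output columns from the two kinds of Option parameters.
/// - `fields`: None = all defaults, Some(list) = only the listed fields.
/// - `composite_defs`: None = omit the struct column, Some(list) = emit it
///   with the listed sub-fields (an empty list yields an empty struct).
fn resolve_columns(
    defaults: &[&str],
    fields: Option<Vec<String>>,
    composite_name: &str,
    composite_defs: Option<Vec<String>>,
) -> Vec<Column> {
    let primary: Vec<String> = match fields {
        None => defaults.iter().map(|s| s.to_string()).collect(),
        Some(selected) => selected,
    };
    let mut columns: Vec<Column> = primary.into_iter().map(Column::Primary).collect();
    if let Some(subfields) = composite_defs {
        columns.push(Column::Struct {
            name: composite_name.to_string(),
            subfields,
        });
    }
    columns
}
```

In this sketch, passing `None` for the composite parameter leaves the struct column out, while `Some(vec![])` still emits an empty struct column, which is the distinction the Python layer currently hides.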

New features

BedSchema::from_defs() enables fully customized BED schemas (e.g., BED narrowPeak)

Additional changes

Rename batch_builder.rs → batch.rs across all format modules
Add TagDef::to_tuple(), AttributeDef::to_tuple() for clean round-trip conversion
Simplify PyO3 scanner structs by deriving pickle state from the model

Introduce a data model that encapsulates schema-defining parameters for alignment record projections (SAM/BAM/CRAM), independent of any file header.

- `fields`: selects standard SAM fields (None = all 12 defaults)
- `tag_defs`: controls the tags struct column independently
  - None = no tags column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

The model produces an Arrow schema, supports column projection, and round-trips through Display/FromStr. BatchBuilder and all three scanners now construct from the model. Python and PyO3 APIs updated to match.
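As a rough illustration of construction-time validation and column projection on such a model, here is a small sketch. `SketchModel`, its method signatures, and the four-name `DEFAULT_FIELDS` list are assumptions made for this example; the real AlignmentModel has 12 default fields and builds an Arrow schema, neither of which is reproduced here.

```rust
/// Illustrative subset of field names (assumption; not the real 12 defaults).
const DEFAULT_FIELDS: [&str; 4] = ["qname", "flag", "rname", "pos"];

/// Hypothetical model: validated primary fields plus an optional tags column.
#[derive(Debug, Clone, PartialEq)]
struct SketchModel {
    fields: Vec<String>,
    /// (tag name, type code) pairs; None means no tags column at all.
    tag_defs: Option<Vec<(String, String)>>,
}

impl SketchModel {
    /// Reject unknown field names when the model is built, not at scan time.
    fn new(
        fields: Option<Vec<String>>,
        tag_defs: Option<Vec<(String, String)>>,
    ) -> Result<Self, String> {
        let fields: Vec<String> = match fields {
            None => DEFAULT_FIELDS.iter().map(|s| s.to_string()).collect(),
            Some(names) => {
                for name in &names {
                    if !DEFAULT_FIELDS.contains(&name.as_str()) {
                        return Err(format!("invalid field name: {}", name));
                    }
                }
                names
            }
        };
        Ok(Self { fields, tag_defs })
    }

    /// Top-level column names: primary fields, then "tags" if present.
    fn column_names(&self) -> Vec<String> {
        let mut names = self.fields.clone();
        if self.tag_defs.is_some() {
            names.push("tags".to_string());
        }
        names
    }

    /// Return a narrower model; fails if any requested column is unknown.
    /// The model's own column order is kept, not the order of the request.
    fn project(&self, columns: &[&str]) -> Result<Self, String> {
        let available = self.column_names();
        for col in columns {
            if !available.iter().any(|a| a.as_str() == *col) {
                return Err(format!("unknown column: {}", col));
            }
        }
        let fields: Vec<String> = self
            .fields
            .iter()
            .filter(|f| columns.contains(&f.as_str()))
            .cloned()
            .collect();
        let tag_defs = if columns.contains(&"tags") {
            self.tag_defs.clone()
        } else {
            None
        };
        Ok(Self { fields, tag_defs })
    }
}
```

Validating in the constructor means an invalid name surfaces as an error before any file is opened, which is consistent with the test changes described later in this PR (ValueError in addition to OSError).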

Introduce a data model for GXF (GFF/GTF) feature record projections, mirroring the AlignmentModel pattern.

- `fields`: selects standard GXF fields (None = all 8 defaults)
- `attr_defs`: controls the attributes struct column independently
  - None = no attributes column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies. Also adds `AttributeDef::to_tuple()` for clean round-trip conversion.

Introduce a data model for sequence record projections (FASTA/FASTQ), following the same pattern as AlignmentModel and GxfModel. Format-aware constructors `new_fasta()` / `new_fastq()` provide different field defaults (3 vs 4 fields). No composite columns — the simplest of the three models. BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies.

Introduce a data model for variant record projections (VCF/BCF), the most complex of the four format models.

- `fields`: selects standard VCF fields (None = all 7 defaults)
- `info_defs`: controls the INFO struct column (None = no info)
- `genotype_defs` + `samples`: control per-sample/per-field genotype columns (both must be present for genotype output)
- `genotype_by`: layout mode (Sample or Field)

`from_header()` derives INFO and FORMAT definitions from a VCF header with optional name filtering, while `new()` accepts pre-validated definitions for header-independent construction. Also adds `GenotypeType::arrow_type()` and `GenotypeDef::get_arrow_field()` so the Model can build its schema without instantiating builders.

BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies.
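The interaction between `genotype_defs`, `samples`, and `genotype_by` can be sketched as follows. This is only an illustration under stated assumptions: the `genotype_columns` function is hypothetical, and the layout it models (one struct column per sample containing the FORMAT fields for Sample mode, one struct column per FORMAT field containing the samples for Field mode) is my reading of the description rather than something the PR text spells out.

```rust
/// Layout mode for genotype columns (mirrors the names used in the PR text).
#[derive(Debug, Clone, Copy, PartialEq)]
enum GenotypeBy {
    Sample,
    Field,
}

/// Hypothetical helper: returns (column name, sub-field names) pairs.
/// If either `genotype_defs` or `samples` is None, no genotype columns
/// are produced, since both must be present for genotype output.
fn genotype_columns(
    genotype_defs: Option<&[&str]>,
    samples: Option<&[&str]>,
    genotype_by: GenotypeBy,
) -> Vec<(String, Vec<String>)> {
    let (defs, samples) = match (genotype_defs, samples) {
        (Some(d), Some(s)) => (d, s),
        _ => return Vec::new(),
    };
    // Assumed layout: the outer names become columns, the inner names
    // become the sub-fields of each struct column.
    let (outer, inner) = match genotype_by {
        GenotypeBy::Sample => (samples, defs),
        GenotypeBy::Field => (defs, samples),
    };
    outer
        .iter()
        .map(|name| {
            let subfields: Vec<String> = inner.iter().map(|s| s.to_string()).collect();
            (name.to_string(), subfields)
        })
        .collect()
}
```

With two FORMAT fields and three samples, Sample mode gives three columns of two sub-fields each, and Field mode gives two columns of three sub-fields each.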

Introduce data models for BED/BBI record projections, completing the Model pattern across all formats. BedModel wraps a BedSchema (parsing interpretation) with field projection. BedSchema gains `from_defs()` for fully custom schemas (e.g., narrowPeak). Shared by BED, BigBed, and BigWig scanners — BBI re-exports both BedModel and BedSchema from the bed module. BBIZoomModel covers the fixed 8-field zoom summary schema. All scanners now store and expose their model. PyO3 scanner structs simplified by deriving pickle state from the model.

- Convert empty tag/attribute discovery results to None in Python (alignment, GXF), preventing empty struct column creation when no tags or attributes are found in a file.
- Update test exception handlers to catch ValueError in addition to OSError, since Model validation now rejects invalid field names at construction time rather than at scan time.

Regenerated manifests.

Introduce separate data models for BED and BBI record projections, each producing format-appropriate Arrow types from a shared BedSchema (parsing interpretation).

BedSchema refactored to be type-agnostic for standard fields: it stores only the field count (n), extension mode (m), and typed custom FieldDefs. Each Model constructs its own Arrow types:

- BedModel: BED-spec types (Int64 positions) via specialized Field enum
- BBIBaseModel: AutoSql types (UInt32 positions) via FieldDef
- BBIZoomModel: fixed 8-field zoom summary schema

Features added:

- BED BatchBuilder updated to use generic FieldBuilder for custom fields, enabling custom naming and typed parsing (float, uint, etc.) instead of treating all custom columns as strings.
- BedSchema gains custom field support via tuple form in Python:

      ox.from_bed("peaks.bed", schema=("bed6", [("signal", "float"), ...]))
      ox.from_bed("peaks.bed", schema=("bed6", {"signal": "float", ...}))
      ox.from_bigbed("peaks.bb", schema=("bed6", {"signal": "float", ...}))

Bug fixes:

- BBI BigBed Push impl: standard fields 4-12 in rest were silently skipped when custom_field_count was 0 (e.g., bed6 BigBed files)
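To make the idea of typed custom fields concrete, the sketch below parses the columns that follow the n standard BED fields according to per-field type definitions instead of keeping them as strings. Everything here is hypothetical: `SketchBedSchema`, `FieldType`, `Value`, and `parse_custom_fields` are invented for illustration, the extension mode (m) is left out, and the real BatchBuilder writes into Arrow builders rather than returning values.

```rust
/// Hypothetical type tags, loosely following the "float", "uint" names above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Float,
    UInt,
    Str,
}

/// A parsed custom value (stand-in for an Arrow array slot).
#[derive(Debug, PartialEq)]
enum Value {
    Float(f64),
    UInt(u64),
    Str(String),
}

/// Type-agnostic schema sketch: n standard fields plus typed custom fields.
struct SketchBedSchema {
    n: usize,
    custom: Vec<(String, FieldType)>,
}

/// Parse only the custom columns of one tab-delimited line.
/// Standard columns 0..n are skipped; a model would type those itself.
fn parse_custom_fields(
    schema: &SketchBedSchema,
    line: &str,
) -> Result<Vec<(String, Value)>, String> {
    let cols: Vec<&str> = line.split('\t').collect();
    let expected = schema.n + schema.custom.len();
    if cols.len() < expected {
        return Err(format!(
            "expected at least {} columns, found {}",
            expected,
            cols.len()
        ));
    }
    schema
        .custom
        .iter()
        .zip(&cols[schema.n..])
        .map(|((name, ty), raw)| -> Result<(String, Value), String> {
            let value = match ty {
                FieldType::Float => {
                    Value::Float(raw.parse::<f64>().map_err(|e| e.to_string())?)
                }
                FieldType::UInt => {
                    Value::UInt(raw.parse::<u64>().map_err(|e| e.to_string())?)
                }
                FieldType::Str => Value::Str(raw.to_string()),
            };
            Ok((name.clone(), value))
        })
        .collect()
}
```

A schema equivalent to the Python tuple form `("bed6", [("signal", "float"), ...])` would, in this sketch, be `SketchBedSchema { n: 6, custom: vec![("signal".to_string(), FieldType::Float)] }`.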

nvictus added 8 commits March 14, 2026 10:33

Rename batch_builder.rs modules to batch.rs

6f0adf3

nvictus force-pushed the feat-datamodel branch from 97a4ccb to a683ebd Compare March 15, 2026 12:38

nvictus merged commit 3d17f2b into abdenlab:main Mar 15, 2026
8 checks passed
