refactor: File-independent declarative data model structs for each format family #169

Merged
nvictus merged 8 commits into abdenlab:main from nvictus:feat-datamodel on Mar 15, 2026

nvictus commented Mar 14, 2026

This PR introduces declarative Model structs for every format family, providing a single source of truth for Arrow schema configuration that is independent of any particular file instance (e.g., its header).

Each Model encapsulates the schema-defining parameters for its formats (which fields to include, and which composite sub-fields such as tags, attributes, info, and sample genotypes to materialize) and caches the resulting Arrow schema. Scanners and BatchBuilders are now constructed from Models rather than from raw parameter bags. Batch builders take headers separately when needed (e.g., for dictionary-encoded chrom fields).

Data Models introduced

  • SequenceModel: fields only with format-aware defaults (new_fasta / new_fastq)
  • AlignmentModel: fields + tag_defs: Option
  • GxfModel: fields + attr_defs: Option
  • VariantModel: fields + info_defs + genotype_defs + samples + genotype_by, with from_header() for header-based derivation
  • BedModel (shared by BED, BigBed, BigWig scanners): wraps BedSchema (parsing interpretation) + a fields projection on that schema
  • BBIZoomModel: fixed 8-field zoom summary schema

Key design decisions

  • Semantics for standard (primary) fields parameters:
    • None = include all default primary fields
    • Some(vec![...]) = project only specified fields
  • Consistent Option semantics for composite fields (tags, attributes, info) parameters:
    • None = omit column entirely
    • Some(vec![]) = empty struct
    • Some(vec![...]) = struct with sub-fields
  • project() method on each Model for column projection with validation
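
To make the two sets of Option semantics concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the `Model` struct, the three default field names, and the string-based sub-field list are hypothetical stand-ins, and the real Models build Arrow schemas rather than plain name lists.

```rust
// Hypothetical stand-in for the Model pattern; names and defaults are
// illustrative and not taken from the actual crate.
const DEFAULT_FIELDS: [&str; 3] = ["qname", "flag", "pos"];

#[derive(Debug, Clone, PartialEq)]
struct Model {
    fields: Vec<String>,
    // None = no "tags" column, Some(vec![]) = empty struct column,
    // Some(vec![...]) = struct column with the listed sub-fields.
    tag_defs: Option<Vec<String>>,
}

impl Model {
    fn new(fields: Option<Vec<&str>>, tag_defs: Option<Vec<&str>>) -> Self {
        // Primary fields: None selects every default, Some(...) projects.
        let fields: Vec<String> = match fields {
            None => DEFAULT_FIELDS.iter().map(|s| s.to_string()).collect(),
            Some(names) => names.iter().map(|s| s.to_string()).collect(),
        };
        let tag_defs =
            tag_defs.map(|defs| defs.iter().map(|s| s.to_string()).collect::<Vec<String>>());
        Model { fields, tag_defs }
    }

    // Top-level column names the resulting schema would contain.
    fn column_names(&self) -> Vec<String> {
        let mut cols = self.fields.clone();
        if self.tag_defs.is_some() {
            cols.push("tags".to_string());
        }
        cols
    }

    // Column projection with validation: unknown names are rejected.
    fn project(&self, columns: &[&str]) -> Result<Model, String> {
        let available = self.column_names();
        for c in columns {
            if !available.iter().any(|a| a.as_str() == *c) {
                return Err(format!("unknown column: {}", c));
            }
        }
        let fields: Vec<String> = self
            .fields
            .iter()
            .filter(|f| columns.iter().any(|c| *c == f.as_str()))
            .cloned()
            .collect();
        let tag_defs = if columns.iter().any(|c| *c == "tags") {
            self.tag_defs.clone()
        } else {
            None
        };
        Ok(Model { fields, tag_defs })
    }
}

fn main() {
    let all = Model::new(None, None);
    println!("{:?}", all.column_names());
    let empty_tags = Model::new(Some(vec!["qname"]), Some(vec![]));
    println!("{:?}", empty_tags.column_names());
    println!("{:?}", all.project(&["pos", "mapq"]));
}
```

Note that `Some(vec![])` still yields a `tags` column in this sketch, which is the distinction from `None` that the list above describes.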

However, the Python API semantics are currently unchanged (e.g., empty tag/attribute discovery results are converted to None before being passed to the scanner).

New features

  • BedSchema::from_defs() enables fully customized BED schemas (e.g., BED narrowPeak)

Additional changes

  • Rename batch_builder.rs to batch.rs across all format modules
  • Add TagDef::to_tuple(), AttributeDef::to_tuple() for clean round-trip conversion
  • Simplify PyO3 scanner structs by deriving pickle state from the model

nvictus added 8 commits March 14, 2026 10:33
Introduce a data model that encapsulates schema-defining parameters for
alignment record projections (SAM/BAM/CRAM), independent of any file
header.

- `fields`: selects standard SAM fields (None = all 12 defaults)
- `tag_defs`: controls the tags struct column independently
  - None = no tags column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

The model produces an Arrow schema, supports column projection, and
round-trips through Display/FromStr. BatchBuilder and all three scanners
now construct from the model. Python and PyO3 APIs updated to match.
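
For readers unfamiliar with the Display/FromStr round trip, a rough sketch follows. The `fields=...;tags=...` text format is invented for this example and is not the serialization used by the PR; the point is only that the string form has to keep "no tags column" and "empty tags struct" distinguishable.

```rust
use std::fmt;
use std::str::FromStr;

// Hypothetical model; the text format below is an assumption made for
// illustration, not the format implemented in the PR.
#[derive(Debug, Clone, PartialEq)]
struct Model {
    fields: Vec<String>,
    tags: Option<Vec<String>>,
}

impl fmt::Display for Model {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "fields={}", self.fields.join(","))?;
        // The tags segment is written only when the column exists, so that
        // None and Some(vec![]) stay distinguishable after parsing.
        if let Some(tags) = &self.tags {
            write!(f, ";tags={}", tags.join(","))?;
        }
        Ok(())
    }
}

// Split a comma-separated list, treating an empty string as an empty list.
fn split_names(s: &str) -> Vec<String> {
    s.split(',').filter(|p| !p.is_empty()).map(String::from).collect()
}

impl FromStr for Model {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut fields = None;
        let mut tags = None;
        for part in s.split(';') {
            match part.split_once('=') {
                Some(("fields", v)) => fields = Some(split_names(v)),
                Some(("tags", v)) => tags = Some(split_names(v)),
                _ => return Err(format!("unrecognized segment: {}", part)),
            }
        }
        let fields = fields.ok_or_else(|| "missing fields segment".to_string())?;
        Ok(Model { fields, tags })
    }
}

fn main() {
    let model = Model {
        fields: vec!["qname".to_string(), "pos".to_string()],
        tags: Some(vec![]),
    };
    let text = model.to_string();
    let parsed: Model = text.parse().unwrap();
    println!("{} -> {:?}", text, parsed);
}
```

A string form like this is one plausible way to carry a model across process boundaries, which is presumably what makes deriving pickle state from the model straightforward.
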

Introduce a data model for GXF (GFF/GTF) feature record projections,
mirroring the AlignmentModel pattern.

- `fields`: selects standard GXF fields (None = all 8 defaults)
- `attr_defs`: controls the attributes struct column independently
  - None = no attributes column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

BatchBuilder gains `from_model()` and both scanners now store and
expose the model. PyO3 scanner structs simplified by deriving pickle
state from the model instead of storing raw copies.

Also adds `AttributeDef::to_tuple()` for clean round-trip conversion.

Introduce a data model for sequence record projections (FASTA/FASTQ),
following the same pattern as AlignmentModel and GxfModel.

Format-aware constructors `new_fasta()` / `new_fastq()` provide
different field defaults (3 vs 4 fields). No composite columns — the
simplest of the three models.
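
A sketch of what format-aware constructors could look like is below. The concrete default field names are an assumption made for the example (the commit only states that FASTA has 3 default fields and FASTQ has 4), and this `SequenceModel` is a simplified stand-in with no schema caching or validation.

```rust
// Assumed default field names; only the counts (3 vs 4) come from the
// commit message.
const FASTA_DEFAULTS: [&str; 3] = ["name", "description", "sequence"];
const FASTQ_DEFAULTS: [&str; 4] = ["name", "description", "sequence", "quality"];

#[derive(Debug, Clone, PartialEq)]
struct SequenceModel {
    fields: Vec<String>,
}

impl SequenceModel {
    // Shared helper: fall back to the format's defaults when no explicit
    // projection is given.
    fn with_defaults(fields: Option<Vec<&str>>, defaults: &[&str]) -> Self {
        let names: Vec<String> = match fields {
            Some(names) => names.iter().map(|s| s.to_string()).collect(),
            None => defaults.iter().map(|s| s.to_string()).collect(),
        };
        SequenceModel { fields: names }
    }

    fn new_fasta(fields: Option<Vec<&str>>) -> Self {
        Self::with_defaults(fields, &FASTA_DEFAULTS)
    }

    fn new_fastq(fields: Option<Vec<&str>>) -> Self {
        Self::with_defaults(fields, &FASTQ_DEFAULTS)
    }
}

fn main() {
    println!("{:?}", SequenceModel::new_fasta(None));
    println!("{:?}", SequenceModel::new_fastq(None));
    println!("{:?}", SequenceModel::new_fastq(Some(vec!["name", "quality"])));
}
```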

BatchBuilder gains `from_model()` and both scanners now store and
expose the model. PyO3 scanner structs simplified by deriving pickle
state from the model instead of storing raw copies.

Introduce a data model for variant record projections (VCF/BCF), the
most complex of the four format models.

- `fields`: selects standard VCF fields (None = all 7 defaults)
- `info_defs`: controls the INFO struct column (None = no info)
- `genotype_defs` + `samples`: control per-sample/per-field genotype
  columns (both must be present for genotype output)
- `genotype_by`: layout mode (Sample or Field)
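
The interaction between `genotype_defs`, `samples`, and `genotype_by` can be sketched roughly as follows. This reflects one reading of the description (Sample = one struct column per sample with FORMAT fields as sub-fields, Field = one struct column per FORMAT field with samples as sub-fields); the function name and the use of plain strings instead of typed definitions are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GenotypeBy {
    Sample,
    Field,
}

// Each entry is (column name, sub-field names). Hypothetical helper, not
// part of the PR.
fn genotype_columns(
    genotype_defs: Option<&[&str]>,
    samples: Option<&[&str]>,
    by: GenotypeBy,
) -> Vec<(String, Vec<String>)> {
    // Both inputs must be present; otherwise no genotype columns are emitted.
    let (defs, samples) = match (genotype_defs, samples) {
        (Some(d), Some(s)) => (d, s),
        _ => return Vec::new(),
    };
    // The layout mode decides which axis becomes the outer column.
    let (outer, inner) = match by {
        GenotypeBy::Sample => (samples, defs),
        GenotypeBy::Field => (defs, samples),
    };
    outer
        .iter()
        .map(|o| (o.to_string(), inner.iter().map(|i| i.to_string()).collect()))
        .collect()
}

fn main() {
    let defs: &[&str] = &["GT", "DP"];
    let samples: &[&str] = &["s1", "s2", "s3"];
    println!("{:?}", genotype_columns(Some(defs), Some(samples), GenotypeBy::Sample));
    println!("{:?}", genotype_columns(Some(defs), Some(samples), GenotypeBy::Field));
    println!("{:?}", genotype_columns(Some(defs), None, GenotypeBy::Sample));
}
```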

`from_header()` derives INFO and FORMAT definitions from a VCF header
with optional name filtering, while `new()` accepts pre-validated
definitions for header-independent construction.

Also adds `GenotypeType::arrow_type()` and `GenotypeDef::get_arrow_field()`
so the Model can build its schema without instantiating builders.

BatchBuilder gains `from_model()` and both scanners now store and
expose the model. PyO3 scanner structs simplified by deriving pickle
state from the model instead of storing raw copies.

Introduce data models for BED/BBI record projections, completing the
Model pattern across all formats.

BedModel wraps a BedSchema (parsing interpretation) with field
projection. BedSchema gains `from_defs()` for fully custom schemas
(e.g., narrowPeak). Shared by BED, BigBed, and BigWig scanners —
BBI re-exports both BedModel and BedSchema from the bed module.

BBIZoomModel covers the fixed 8-field zoom summary schema.

All scanners now store and expose their model. PyO3 scanner structs
simplified by deriving pickle state from the model.

- Convert empty tag/attribute discovery results to None in Python
  (alignment, GXF), preventing empty struct column creation when no
  tags or attributes are found in a file.

- Update test exception handlers to catch ValueError in addition to
  OSError, since Model validation now rejects invalid field names at
  construction time rather than at scan time. Regenerated manifests.
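
The shift from scan-time to construction-time failure can be illustrated with a small sketch. Everything here is hypothetical (the error enum, the `Scanner` struct, the `open_scanner` helper, and the GXF field names, which follow the GFF3 column names); it only shows why an invalid field name now fails before any file is touched, which on the Python side would plausibly surface as the ValueError mentioned above.

```rust
// Assumed field names, following the first eight GFF3 columns.
const GXF_FIELDS: [&str; 8] = [
    "seqid", "source", "type", "start", "end", "score", "strand", "phase",
];

#[derive(Debug, PartialEq)]
enum ModelError {
    UnknownField(String),
}

#[derive(Debug, PartialEq)]
struct GxfModel {
    fields: Vec<String>,
}

impl GxfModel {
    // Validation happens here, when the model is built.
    fn new(fields: Option<Vec<&str>>) -> Result<Self, ModelError> {
        let names: Vec<&str> = match fields {
            Some(names) => names,
            None => GXF_FIELDS.to_vec(),
        };
        for name in &names {
            if !GXF_FIELDS.iter().any(|known| known == name) {
                return Err(ModelError::UnknownField(name.to_string()));
            }
        }
        Ok(GxfModel {
            fields: names.iter().map(|s| s.to_string()).collect(),
        })
    }
}

// Hypothetical scanner that is constructed from a model.
#[derive(Debug)]
struct Scanner {
    path: String,
    model: GxfModel,
}

fn open_scanner(path: &str, fields: Option<Vec<&str>>) -> Result<Scanner, ModelError> {
    // An invalid field name fails on this line, before any file I/O.
    let model = GxfModel::new(fields)?;
    Ok(Scanner { path: path.to_string(), model })
}

fn main() {
    match open_scanner("features.gff", Some(vec!["seqid", "bogus"])) {
        Ok(scanner) => println!("{} {:?}", scanner.path, scanner.model.fields),
        Err(err) => println!("rejected at construction: {:?}", err),
    }
}
```
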

Introduce separate data models for BED and BBI record projections,
each producing format-appropriate Arrow types from a shared BedSchema
(parsing interpretation). BedSchema refactored to be type-agnostic for
standard fields — stores only the field count (n), extension mode (m),
and typed custom FieldDefs.

Each Model constructs its own Arrow types:
- BedModel: BED-spec types (Int64 positions) via specialized Field enum
- BBIBaseModel: AutoSql types (UInt32 positions) via FieldDef
- BBIZoomModel: fixed 8-field zoom summary schema
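
As a rough illustration of a type-agnostic schema feeding two models, the sketch below derives typed field lists from one shared `BedSchema`. It is heavily simplified and hypothetical: the `DataType` enum stands in for Arrow types, only position fields are typed (other standard fields are marked `Other`), the extension mode (m) is omitted, and treating thickStart/thickEnd as position fields is an assumption. The narrowPeak custom fields follow the published bed6+4 layout.

```rust
// Simplified stand-in for Arrow data types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    UInt32,
    Float64,
    Other, // standard-field types not modeled in this sketch
}

const STANDARD_FIELDS: [&str; 12] = [
    "chrom", "start", "end", "name", "score", "strand", "thickStart",
    "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
];

// Type-agnostic for standard fields: stores only how many are present,
// plus typed custom field definitions.
struct BedSchema {
    n: usize,
    custom: Vec<(String, DataType)>,
}

// Each model supplies its own position type for the shared schema.
fn model_fields(schema: &BedSchema, position_type: DataType) -> Vec<(String, DataType)> {
    let mut out: Vec<(String, DataType)> = STANDARD_FIELDS
        .iter()
        .take(schema.n)
        .map(|name| {
            let dtype = match *name {
                "start" | "end" | "thickStart" | "thickEnd" => position_type,
                _ => DataType::Other,
            };
            (name.to_string(), dtype)
        })
        .collect();
    out.extend(schema.custom.iter().cloned());
    out
}

// narrowPeak is bed6 plus four custom fields.
fn narrow_peak() -> BedSchema {
    BedSchema {
        n: 6,
        custom: vec![
            ("signalValue".to_string(), DataType::Float64),
            ("pValue".to_string(), DataType::Float64),
            ("qValue".to_string(), DataType::Float64),
            ("peak".to_string(), DataType::Int64),
        ],
    }
}

fn main() {
    let schema = narrow_peak();
    // BED-style model: Int64 positions. BBI-style model: UInt32 positions.
    println!("{:?}", model_fields(&schema, DataType::Int64));
    println!("{:?}", model_fields(&schema, DataType::UInt32));
}
```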

Features added:
- BED BatchBuilder updated to use generic FieldBuilder for custom fields,
  enabling custom naming and typed parsing (float, uint, etc.) instead of
  treating all custom columns as strings.
- BedSchema gains custom field support via tuple form in Python:
  ox.from_bed("peaks.bed", schema=("bed6", [("signal", "float"), ...]))
  ox.from_bed("peaks.bed", schema=("bed6", {"signal": "float", ...}))
  ox.from_bigbed("peaks.bb", schema=("bed6", {"signal": "float", ...}))

Bug fixes:
- BBI BigBed Push impl: standard fields 4-12 in rest were silently
  skipped when custom_field_count was 0 (e.g., bed6 BigBed files)

nvictus merged commit 3d17f2b into abdenlab:main on Mar 15, 2026
8 checks passed