refactor: File-independent declarative data model structs for each format family #169
Merged
nvictus merged 8 commits into abdenlab:main from Mar 15, 2026
Introduce a data model that encapsulates schema-defining parameters for alignment record projections (SAM/BAM/CRAM), independent of any file header.

- `fields`: selects standard SAM fields (None = all 12 defaults)
- `tag_defs`: controls the tags struct column independently
  - None = no tags column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

The model produces an Arrow schema, supports column projection, and round-trips through Display/FromStr. BatchBuilder and all three scanners now construct from the model. Python and PyO3 APIs updated to match.
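The three states of `tag_defs` can be illustrated with a small standard-library Python sketch. The class name `AlignmentModelSketch`, the `columns()` helper, and the four-entry `DEFAULT_FIELDS` list are hypothetical and exist only for illustration; they are not the crate's API, and the list is not the real set of 12 defaults.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of standard fields, for illustration only.
DEFAULT_FIELDS = ["qname", "flag", "rname", "pos"]


@dataclass(frozen=True)
class AlignmentModelSketch:
    fields: Optional[list[str]] = None                # None = all defaults
    tag_defs: Optional[list[tuple[str, str]]] = None  # None = no tags column

    def __post_init__(self):
        # Reject unknown field names at construction time.
        for name in self.fields or []:
            if name not in DEFAULT_FIELDS:
                raise ValueError(f"unknown field: {name!r}")

    def columns(self) -> list:
        """List the output columns; the tags struct is a (name, sub-fields) pair."""
        cols = list(DEFAULT_FIELDS if self.fields is None else self.fields)
        if self.tag_defs is not None:
            # An empty list still yields a struct column, just without sub-fields.
            cols.append(("tags", [name for name, _ in self.tag_defs]))
        return cols
```

Under these assumptions, `AlignmentModelSketch().columns()` contains no tags entry, while `AlignmentModelSketch(tag_defs=[]).columns()` ends with an empty `("tags", [])` struct.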
Introduce a data model for GXF (GFF/GTF) feature record projections, mirroring the AlignmentModel pattern.

- `fields`: selects standard GXF fields (None = all 8 defaults)
- `attr_defs`: controls the attributes struct column independently
  - None = no attributes column
  - Some([]) = empty struct column
  - Some([...]) = struct with specified sub-fields

BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies.

Also adds `AttributeDef::to_tuple()` for clean round-trip conversion.
Introduce a data model for sequence record projections (FASTA/FASTQ), following the same pattern as AlignmentModel and GxfModel.

Format-aware constructors `new_fasta()` / `new_fastq()` provide different field defaults (3 vs 4 fields). There are no composite columns, which makes this the simplest of the three models.

BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies.
Introduce a data model for variant record projections (VCF/BCF), the most complex of the four format models.

- `fields`: selects standard VCF fields (None = all 7 defaults)
- `info_defs`: controls the INFO struct column (None = no info)
- `genotype_defs` + `samples`: control per-sample/per-field genotype columns (both must be present for genotype output)
- `genotype_by`: layout mode (Sample or Field)

`from_header()` derives INFO and FORMAT definitions from a VCF header with optional name filtering, while `new()` accepts pre-validated definitions for header-independent construction.

Also adds `GenotypeType::arrow_type()` and `GenotypeDef::get_arrow_field()` so the Model can build its schema without instantiating builders.

BatchBuilder gains `from_model()` and both scanners now store and expose the model. PyO3 scanner structs simplified by deriving pickle state from the model instead of storing raw copies.
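One way to picture how `genotype_defs`, `samples`, and `genotype_by` interact is the following standard-library Python sketch. The function `genotype_columns` and the string layout names are hypothetical, and it assumes that the Sample layout means one struct column per sample and the Field layout means one struct column per FORMAT field; the commit message does not spell this out.

```python
from typing import Optional


def genotype_columns(
    genotype_defs: Optional[list[str]],
    samples: Optional[list[str]],
    genotype_by: str = "sample",
) -> dict[str, list[str]]:
    """Map each top-level genotype column to its sub-field names."""
    # Both parts must be present, otherwise no genotype output is produced.
    if genotype_defs is None or samples is None:
        return {}
    if genotype_by == "sample":
        # Assumed layout: one column per sample, sub-fields are FORMAT fields.
        return {sample: list(genotype_defs) for sample in samples}
    if genotype_by == "field":
        # Assumed layout: one column per FORMAT field, sub-fields are samples.
        return {field: list(samples) for field in genotype_defs}
    raise ValueError(f"unknown layout: {genotype_by!r}")
```

For example, `genotype_columns(["GT", "DP"], ["NA1", "NA2"], "field")` groups the two samples under each of `GT` and `DP`, whereas omitting either argument yields no genotype columns at all.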
Introduce data models for BED/BBI record projections, completing the Model pattern across all formats.

BedModel wraps a BedSchema (parsing interpretation) with field projection. BedSchema gains `from_defs()` for fully custom schemas (e.g., narrowPeak). Shared by BED, BigBed, and BigWig scanners: BBI re-exports both BedModel and BedSchema from the bed module.

BBIZoomModel covers the fixed 8-field zoom summary schema.

All scanners now store and expose their model. PyO3 scanner structs simplified by deriving pickle state from the model.
- Convert empty tag/attribute discovery results to None in Python (alignment, GXF), preventing empty struct column creation when no tags or attributes are found in a file.
- Update test exception handlers to catch ValueError in addition to OSError, since Model validation now rejects invalid field names at construction time rather than at scan time.

Regenerated manifests.
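The first item above amounts to a small normalization step on the Python side. A minimal sketch, assuming a hypothetical helper named `normalize_discovered` (the actual code in the package may be structured differently):

```python
from typing import Optional, Sequence


def normalize_discovered(
    defs: Optional[Sequence[tuple[str, str]]],
) -> Optional[list[tuple[str, str]]]:
    """Map an empty discovery result to None so no struct column is created."""
    # Both None and an empty sequence mean "nothing found": return None, which
    # the model treats as "omit the column" rather than "empty struct column".
    return list(defs) if defs else None
```

With this in place, a file with no tags produces no tags column, while a non-empty result such as `[("NM", "i")]` is passed through unchanged.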
Introduce separate data models for BED and BBI record projections, each producing format-appropriate Arrow types from a shared BedSchema (parsing interpretation). BedSchema refactored to be type-agnostic for standard fields: it stores only the field count (n), extension mode (m), and typed custom FieldDefs.

Each Model constructs its own Arrow types:
- BedModel: BED-spec types (Int64 positions) via specialized Field enum
- BBIBaseModel: AutoSql types (UInt32 positions) via FieldDef
- BBIZoomModel: fixed 8-field zoom summary schema

Features added:
- BED BatchBuilder updated to use generic FieldBuilder for custom fields, enabling custom naming and typed parsing (float, uint, etc.) instead of treating all custom columns as strings.
- BedSchema gains custom field support via tuple form in Python:
    ox.from_bed("peaks.bed", schema=("bed6", [("signal", "float"), ...]))
    ox.from_bed("peaks.bed", schema=("bed6", {"signal": "float", ...}))
    ox.from_bigbed("peaks.bb", schema=("bed6", {"signal": "float", ...}))

Bug fixes:
- BBI BigBed Push impl: standard fields 4-12 in rest were silently skipped when custom_field_count was 0 (e.g., bed6 BigBed files).
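The tuple form shown under "Features added" accepts the custom fields either as a list of pairs or as a dict. A minimal sketch of how the two spellings could be reduced to one shape, assuming a hypothetical helper `normalize_schema` that is not part of the package:

```python
from collections.abc import Mapping, Sequence
from typing import Union

# Custom fields given as [(name, type), ...] or {name: type, ...}.
CustomDefs = Union[Sequence[tuple[str, str]], Mapping[str, str]]


def normalize_schema(
    schema: tuple[str, CustomDefs],
) -> tuple[str, list[tuple[str, str]]]:
    """Return (base, [(name, type), ...]) for either accepted spelling."""
    base, custom = schema
    if isinstance(custom, Mapping):
        # Dicts preserve insertion order, so column order is kept.
        custom = list(custom.items())
    return base, [(str(name), str(ty)) for name, ty in custom]
```

Both `("bed6", [("signal", "float")])` and `("bed6", {"signal": "float"})` then normalize to the same value, which keeps downstream parsing code independent of how the caller wrote the schema.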
This PR introduces declarative `Model` structs for every format family, providing a single source of truth for Arrow schema configuration independent of file instances (headers and the like).

Each Model encapsulates the schema-defining parameters for its formats (which fields to include, and which composite sub-fields such as tags, attributes, info, and sample genotypes to materialize) and caches the resulting Arrow schema.

Scanners and `BatchBuilder`s now construct from Models rather than raw parameter bags. Batch builders take headers separately as needed (e.g., for dictionary-encoded chrom fields).

Data Models introduced

- Sequence (FASTA/FASTQ): `fields` only, with format-aware defaults (`new_fasta` / `new_fastq`)
- Alignment (SAM/BAM/CRAM): `fields` + `tag_defs: Option`
- GXF (GFF/GTF): `fields` + `attr_defs: Option`
- Variant (VCF/BCF): `fields` + `info_defs` + `genotype_defs` + `samples` + `genotype_by`, with `from_header()` for header-based derivation
- BED/BBI: `BedSchema` (parsing interpretation) + a `fields` projection on that schema

Key design decisions

- Primary fields: `None` = include all default primary fields; `Some(vec![...])` = project only specified fields
- `Option` semantics for composite field parameters (tags, attributes, info): `None` = omit column entirely; `Some(vec![])` = empty struct; `Some(vec![...])` = struct with sub-fields
- `project()` method on each Model for column projection with validation

However, the Python API semantics are currently unchanged (e.g., empty tag/attribute results are converted to `None` before being passed to the scanner).

New features

- `BedSchema::from_defs()` enables fully customized BED schemas (e.g., BED narrowPeak)

Additional changes

- `batch_builder.rs` → `batch.rs` across all format modules
- `TagDef::to_tuple()`, `AttributeDef::to_tuple()` for clean round-trip conversion
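The idea of column projection with validation, as provided by `project()` on each Model, can be sketched in a few lines of standard-library Python. The free function below and its behavior of returning columns in the requested order are assumptions for illustration; the real method's signature and ordering rules are not described in this PR text.

```python
def project(available: list[str], requested: list[str]) -> list[str]:
    """Return the requested columns after validating them against the schema."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        # Fail early, before any scanning happens.
        raise ValueError(f"unknown column(s): {', '.join(unknown)}")
    return list(requested)
```

Validating at projection time means that a misspelled column name surfaces as a `ValueError` immediately instead of during a scan.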