feat: Resolve coordinate system semantics for input and output positions#173
nvictus merged 3 commits into abdenlab:main
Introduce a `CoordSystem` enum (`OneClosed` / `ZeroHalfOpen`) that
controls how start positions are represented in output Arrow batches.
Each format defaults to its native coordinate convention: 1-based
for SAM/BAM/CRAM, VCF/BCF, and GFF/GTF; 0-based for BED, BigBed,
and BigWig — but callers can request either system explicitly.
Rust changes:
* `CoordSystem` enum in `oxbow::lib` with `Display/FromStr` and `start_offset_from()` for computing the adjustment between systems.
* Every Model now carries a `coord_system` field (alignment, variant, gxf, bed, bbi base, bbi zoom, sequence).
* Offset applied at the FieldBuilder level (alignment, variant, gxf, bed) or BatchBuilder level (bbi base, bbi zoom) during `push()`.
* All scanner constructors accept an explicit `CoordSystem` parameter.
Also added `Default` trait impl on alignment and gxf Models.
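To make the offset idea concrete, here is a minimal Python model of the enum, not the crate's Rust code. The member names, the string values (borrowed from the "01"/"11" shorthands), the helper `convert_start`, and the direction convention of `start_offset_from` (the value to add to a start expressed in the source system) are all assumptions made for illustration.

```python
from enum import Enum


class CoordSystem(Enum):
    """Hypothetical Python mirror of the Rust enum described above."""

    ONE_CLOSED = "11"      # 1-based start, closed end
    ZERO_HALF_OPEN = "01"  # 0-based start, half-open end

    def start_offset_from(self, source: "CoordSystem") -> int:
        """Value to add to a start given in `source` to express it in `self`.

        The direction of the offset is an assumption of this sketch.
        """
        if self is source:
            return 0
        # 0-based -> 1-based adds one; 1-based -> 0-based subtracts one.
        return 1 if self is CoordSystem.ONE_CLOSED else -1


def convert_start(start: int, source: CoordSystem, target: CoordSystem) -> int:
    """Hypothetical helper: re-express a start position in another system."""
    return start + target.start_offset_from(source)
```

Under these assumptions, a BED start of 1000 requested in the 1-based system becomes `convert_start(1000, CoordSystem.ZERO_HALF_OPEN, CoordSystem.ONE_CLOSED)`, i.e. 1001. End positions need no adjustment because a half-open end and a closed 1-based end are numerically equal.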
Python changes:
* All pyo3 scanner classes accept a `coords` keyword argument ("01" or "11") and serialize it through `__getnewargs_ex__`.
* All DataSource classes and from_* factory functions expose coords: `Literal["01", "11"]` with format-appropriate defaults.
* BBI zoom scanners inherit `coord_system` from their base scanner.
* All read_* functions pass the format-native CoordSystem.
R changes:
* All read_*_impl functions pass the format-native CoordSystem.
Introduce `oxbow::Region`: a coordinate-system-aware genomic region type that normalizes all coordinates to 0-based half-open internally. Supports two parsing styles:
* Ambiguous UCSC notation (`chr1:10,000-20,000`) interpreted using a provided `CoordSystem`. Accepts `,` and `_` as thousands separators.
* Explicit bracket notation (`chr1:[10_000,20_000)` or `chr1:[10_001,20_000]`) that is self-describing and overrides any provided coordinate system. Only `_` is accepted as a thousands separator (since `,` delimits start and end).
`Region::to_noodles()` converts to a noodles `Region` for index-based seeking.
All `scan_query` methods now accept `oxbow::Region` instead of `noodles::core::Region`, performing the conversion internally.
`CoordSystem` and `Region` are extracted into a new `oxbow::coords` module and re-exported from the crate root.
py-oxbow scanner classes parse user region strings using the scanner's `model().coord_system()` when using ambiguous notation. Standalone `read_*` functions use the format-native default. r-oxbow follows the same convention.
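The two parsing styles can be modelled in a few lines of Python. This is an illustrative sketch of the rules stated above, not the `oxbow::Region` implementation: the `Region` tuple, the `parse_region` function, and its `coords` parameter are hypothetical names, and open-ended ranges are left out for brevity.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical normalized region: always 0-based, half-open."""

    name: str
    start: int           # 0-based, inclusive
    end: Optional[int]   # exclusive; None means "to the end of the sequence"


_BRACKET = re.compile(r"^([^:]+):\[([0-9_]+),([0-9_]+)([\)\]])$")
_UCSC = re.compile(r"^([^:]+):([0-9,_]+)-([0-9,_]+)$")


def parse_region(text: str, coords: str = "11") -> Region:
    """Parse a region string; `coords` only matters for UCSC notation."""
    m = _BRACKET.match(text)
    if m:
        name, start_s, end_s, close = m.groups()
        start = int(start_s.replace("_", ""))
        end = int(end_s.replace("_", ""))
        if close == "]":
            start -= 1  # "[a,b]" is 1-based closed: shift the start only
        return Region(name, start, end)
    m = _UCSC.match(text)
    if m:
        name, start_s, end_s = m.groups()
        start = int(re.sub(r"[,_]", "", start_s))
        end = int(re.sub(r"[,_]", "", end_s))
        if coords == "11":
            start -= 1  # ambiguous notation read as 1-based closed
        return Region(name, start, end)
    if ":" not in text:
        return Region(text, 0, None)  # bare sequence name: whole sequence
    raise ValueError(f"unrecognized region string: {text!r}")
```

In this sketch `chr1:[10_000,20_000)` and `chr1:[10_001,20_000]` both normalize to start 10000 and end 20000 whatever `coords` is, while `chr1:10,000-20,000` yields start 9999 under "11" and start 10000 under "01".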
    write!(f, "{}", self.name)?;
    match (self.start, self.end) {
        (0, None) => {}
        (s, None) => write!(f, ":[{s},)")?,
Region { start: 5000, end: None } displays as chr1:[5000,), but try_parse_bracket can’t parse an empty end — it splits on , and fails "".parse::<u64>(). Because try_parse_bracket returns Some(Err(…)) instead of None, the UCSC fallback path is never reached, so FromStr roundtrips break for any open-ended region with an explicit start.
Either extend bracket parsing to accept [start,) as open-ended, or have Display emit UCSC notation for the unbounded case (e.g., a 1-based chr1:5001- under the default OneClosed assumption).
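A sketch of the first option, written as a small Python model rather than a patch to the Rust code. `format_region` and `parse_bracket` are hypothetical stand-ins for `Display` and `try_parse_bracket`; the bounded output form `[start,end)` is an assumption, and only the half-open bracket is handled to keep the example short.

```python
from typing import Optional, Tuple

Parsed = Tuple[str, int, Optional[int]]


def format_region(name: str, start: int, end: Optional[int]) -> str:
    """Model of the Display impl quoted above (bounded form is assumed)."""
    if end is None:
        return name if start == 0 else f"{name}:[{start},)"
    return f"{name}:[{start},{end})"


def parse_bracket(text: str) -> Optional[Parsed]:
    """Parse "name:[start,end)" where the end may be empty (open-ended).

    Returns None when the text is not bracket notation at all, so a caller
    can still fall back to UCSC parsing. Malformed bracket input raises.
    """
    name, sep, rest = text.partition(":[")
    if not sep or not rest.endswith(")"):
        return None
    start_s, comma, end_s = rest[:-1].partition(",")
    if not comma:
        raise ValueError(f"missing ',' in bracket range: {text!r}")
    start = int(start_s.replace("_", ""))
    # An empty end is accepted and means "to the end of the sequence".
    end = int(end_s.replace("_", "")) if end_s else None
    return name, start, end
```

With the empty end accepted, `parse_bracket(format_region("chr1", 5000, None))` gives back `("chr1", 5000, None)`, which is the roundtrip that currently fails.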
    let (start, end) = match coord_system {
        CoordSystem::OneClosed => {
            // 1-based closed → 0-based half-open: start -= 1, end unchanged
            (start.map(|s| s.saturating_sub(1)), end)
saturating_sub(1) correctly avoids wrapping to u64::MAX, but it also silently accepts start = 0, which isn’t a valid 1-based position. In UCSC mode with OneClosed, chr1:0-100 quietly normalizes to start = 0 (0-based) — the user gets a plausible-looking result from an invalid input. The same applies to bracket notation [0,100] at line 236.
A guard like if start == 0 { return Err(invalid_input("1-based position must be >= 1")) } before the subtraction would catch the mistake early.
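The intended behaviour, sketched in Python for illustration only. The function name is hypothetical and the error message simply reuses the wording suggested above. The point is that the validity check happens before the subtraction, so an invalid start is rejected instead of clamped.

```python
from typing import Optional, Tuple


def one_closed_to_zero_half_open(
    start: Optional[int], end: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Convert a 1-based closed range to 0-based half-open.

    Unlike a saturating subtraction, a start of 0 is rejected because it
    is not a valid 1-based position.
    """
    if start is not None and start < 1:
        raise ValueError("1-based position must be >= 1")
    # start -= 1, end unchanged; a missing start stays missing.
    return (None if start is None else start - 1, end)
```

Here `(1, 100)` maps to `(0, 100)`, while `(0, 100)` raises `ValueError` rather than quietly producing the same result.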
@nvictus sorry I'm a little late, but noted a couple of things.
This PR adds the ability to honor genomic coordinate system semantics in the way input query ranges are interpreted, and the way output positions (namely, start positions) are returned. Resolves #114.
Coordinate systems
There are two "coordinate system" conventions in genomics:
* 0-based, half-open: [1000, 2000)
* 1-based, closed: [1001, 2000]
We introduce shorthands for these in Python: "01" and "11". Every scanner is now coordinate-system aware in its output, defaulting to a format-family "native" system.
Range notation
We further support 3 range string notations:
* chr1:10,000-20,000 (ambiguous UCSC notation)
* chr1:[10_000,20_000) (explicit bracket notation, half-open)
* chr1:[10_001,20_001] (explicit bracket notation, closed)
Thousands separators
, and _ are supported (only the latter in bracket notation).
Example
Malformatted files
This feature alone does not solve the problem of malformatted files, most often observed with text formats. BED is natively "01", but many BED-like files in the wild contain intervals with 1-based starts. Likewise, GTF is supposed to be "11", but many files use 0-based starts. If one knows the start positions in a file don't match what the format natively expects, one has to choose a coordinate system interpretation and increment or decrement the start positions accordingly to correct them.
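As a minimal illustration of that manual correction, the following sketch shifts the starts of a BED-like table whose author used 1-based starts. The records and the `shift_starts` helper are made up for the example and are not part of oxbow.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]


def shift_starts(records: List[Interval], offset: int) -> List[Interval]:
    """Return a copy of the records with `offset` added to every start."""
    return [(chrom, start + offset, end) for chrom, start, end in records]


# Hypothetical BED-like rows whose starts were written 1-based by mistake.
rows = [("chr1", 1001, 2000), ("chr2", 1, 500)]

# Decrement the starts so the rows match BED's native "01" convention.
fixed = shift_starts(rows, -1)
```

The opposite case, a GTF-like file with 0-based starts, would use an offset of `+1`. Ends are left alone in both directions.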
Changes
* Add a `CoordSystem` enum (`OneClosed` / `ZeroHalfOpen`) that controls how start positions are represented in output Arrow batches and how user-supplied query regions are interpreted.
* Add an `oxbow::Region` type with coordinate-system-aware parsing, supporting ambiguous UCSC notation (interpreted using a `CoordSystem`) and an explicit bracket notation.
* Each format defaults to its native coordinate convention (1-based for SAM/BAM/CRAM, VCF/BCF, GFF/GTF, FASTA; 0-based for BED, BigBed, BigWig) but callers can request either system explicitly via a `coords` parameter.
* All scanner `scan_query` methods now accept `oxbow::Region` instead of `noodles::core::Region`, converting internally.
* Python API exposes `coords` as literals ("01", "11") on all `DataSource` classes and `from_*` factory functions, as well as pyo3 scanner classes.
* pyo3 `read_*` functions and R bindings updated to use format-native coordinate systems.