Honor Table.coordinate_system and decouple bedtools-compat from interval_type in DISTANCE / NEAREST — Closes #89 #91

Merged
conradbzura merged 4 commits into main from 89-honor-coord-system-distance-nearest on Apr 28, 2026

conradbzura (Collaborator) commented Apr 27, 2026

Summary

Make DISTANCE and NEAREST honor Table.coordinate_system and stop conflating Table.interval_type with bedtools-style "+1" gap counting, so logically-equivalent intervals stored under different conventions yield the same numeric distance. Reuse the _canonical_start and _canonical_end helpers introduced in #88 to canonicalize all four endpoint expressions to 0-based half-open before they reach _generate_distance_case, and drop the implicit "+1" gap counting that closed-interval tables previously triggered.
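As a rough illustration of the canonicalization described above, the sketch below maps a stored (start, end) pair to 0-based half-open for each of the four conventions. The function name `to_canonical` and the string values `"0based"`, `"1based"`, and `"half_open"` are assumptions made for this example (only `interval_type="closed"` appears in this PR); the real `_canonical_start` / `_canonical_end` helpers wrap endpoint expressions rather than integers.

```python
def to_canonical(start, end, coordinate_system, interval_type):
    """Convert a stored interval to 0-based half-open (illustrative only)."""
    if coordinate_system == "1based":
        # Shift both endpoints from 1-based to 0-based positions.
        start -= 1
        end -= 1
    if interval_type == "closed":
        # An inclusive end becomes an exclusive end one position later.
        end += 1
    return start, end


# The same logical interval, stored under each of the four conventions.
stored = {
    ("0based", "half_open"): (10, 20),
    ("0based", "closed"): (10, 19),
    ("1based", "half_open"): (11, 21),
    ("1based", "closed"): (11, 20),
}

# All four encodings should collapse to a single canonical pair.
canonical = {to_canonical(s, e, cs, it) for (cs, it), (s, e) in stored.items()}
print(canonical)
```

Note that for a 1-based closed table the two end adjustments cancel, which is why only the start needs a shift in that convention.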

This is a behavior change with no opt-in: tables declared with interval_type="closed" no longer apply the "+1" gap quirk for either DISTANCE or NEAREST, and interval_type no longer drives any gap arithmetic. Callers who relied on the legacy "+1" must add it explicitly in the consuming query (e.g. wrap the result in ... + 1). The two pre-existing bedtools-compat tests are rewritten to assert the new behavior, and parametrized regression tests cover canonicalization across all four (coordinate_system, interval_type) combinations for both operators.

Literal-form DISTANCE (still NotImplementedError) and the spatial-predicate code paths fixed in #88 are out of scope.

Closes #89

Proposed changes

src/giql/generators/base.py

  • giqldistance_sql: resolve each side's Table config via _resolve_table and wrap raw start/end with _canonical_start / _canonical_end before they reach _generate_distance_case.
  • giqlnearest_sql: hoist the target-Table lookup and wrap target endpoints via the helpers.
  • _resolve_nearest_reference: canonicalize column-ref returns via _canonical_start / _canonical_end in all three branches — standalone column reference, correlated explicit reference, and correlated implicit reference (literal returns are already canonical via RangeParser.to_zero_based_half_open).
  • _generate_distance_case: drop the add_one_for_gap parameter and the gap_adj term; the CASE expression now operates entirely in canonical 0-based half-open space.
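The changes listed above can be sketched at the SQL-string level as follows. This is a minimal approximation, not the code in base.py: the function names, the convention strings, and the exact shape of the CASE expression are assumptions, and the chromosome comparison and signed/stranded variants are left out.

```python
def canonical_start_sql(expr: str, coordinate_system: str) -> str:
    """Wrap a start expression so it reads as a 0-based start."""
    return f"({expr} - 1)" if coordinate_system == "1based" else expr


def canonical_end_sql(expr: str, coordinate_system: str, interval_type: str) -> str:
    """Wrap an end expression so it reads as a 0-based exclusive end."""
    shift = (1 if interval_type == "closed" else 0) - (
        1 if coordinate_system == "1based" else 0
    )
    if shift == 0:
        # 0-based half-open and 1-based closed ends are already canonical.
        return expr
    return f"({expr} {'+' if shift > 0 else '-'} 1)"


def distance_case_sql(a_start: str, a_end: str, b_start: str, b_end: str) -> str:
    """Unsigned gap between two canonical intervals, with no '+ 1' term."""
    return (
        f"CASE WHEN {a_start} < {b_end} AND {b_start} < {a_end} THEN 0 "
        f"WHEN {a_end} <= {b_start} THEN {b_start} - {a_end} "
        f"ELSE {a_start} - {b_end} END"
    )


# Two 0-based closed tables: only the end expressions get wrapped.
sql = distance_case_sql(
    canonical_start_sql("a.start", "0based"),
    canonical_end_sql("a.end", "0based", "closed"),
    canonical_start_sql("b.start", "0based"),
    canonical_end_sql("b.end", "0based", "closed"),
)
print(sql)
```

Because every endpoint is canonicalized before the CASE expression is built, the CASE builder itself needs no knowledge of either table's convention.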

tests/generators/test_base.py

  • Rename the two pre-existing closed-interval tests to the BDD pattern and rewrite them to assert that closed-interval tables emit no +1 gap counting.
  • Add parametrized canonicalization tests for DISTANCE across all four (coordinate_system, interval_type) combinations, a mixed-conventions DISTANCE test, parametrized NEAREST target-side tests, a column-ref reference NEAREST test (standalone mode), and a correlated/implicit-reference NEAREST test that exercises canonicalization of the outer table's columns.
  • Tighten the NEAREST gap-counting assertions to require both gap branches to carry the expected formulas (the previous "or" chain would have silently passed even if only one branch were correct).
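To see why an or-chained substring assertion is too weak, consider the small sketch below. The emitted SQL string and the expected fragments are invented for illustration and are not taken from the test suite; the second gap branch is deliberately wrong.

```python
# Hypothetical generator output in which the second gap branch is wrong:
# it still carries a stray "+ 1".
emitted = (
    "CASE WHEN a_end <= b_start THEN b_start - a_end "
    "ELSE a_start - b_end + 1 END"
)

expected_first = "THEN b_start - a_end"
expected_second = "ELSE a_start - b_end END"

# Loose check: passes as soon as either branch matches.
loose = expected_first in emitted or expected_second in emitted

# Strict check: every branch must carry its expected formula.
strict = expected_first in emitted and expected_second in emitted

print(loose, strict)
```

The loose check accepts the broken output, while the strict check rejects it, which is the failure mode the tightened assertions guard against.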

docs/transpilation/index.rst

  • Note that DISTANCE/NEAREST canonicalize table-side endpoints based on their (coordinate_system, interval_type) config, and add a .. versionchanged:: directive flagging the removal of the implicit +1 gap counting.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|------------|-------|------|------|-----------------|
| 1 | TestBaseGIQLGenerator | Two 0-based closed tables joined under DISTANCE | giqldistance_sql is called | The CASE expression canonicalizes each end as (end + 1) and emits no + 1 in the gap branches | Removal of the legacy implicit "+1" in DISTANCE for closed-interval tables |
| 2 | TestBaseGIQLGenerator | A 0-based closed target table for NEAREST | giqlnearest_sql is called | The distance CASE expression canonicalizes the target end as (end + 1) and emits no + 1 in the gap branches | Removal of the legacy implicit "+1" in NEAREST for closed-interval target tables |
| 3 | TestBaseGIQLGenerator | Two tables sharing one of the four (coordinate_system, interval_type) combinations | DISTANCE is called between aliased columns from each table | Each side's start/end is wrapped per the canonical 0-based half-open conversion, leaving comparison and gap formulas otherwise unchanged | DISTANCE canonicalization across all four conventions |
| 4 | TestBaseGIQLGenerator | A 0-based half-open table joined against a 1-based closed table | DISTANCE is called between aliased columns from each table | The 1-based-closed side's start is wrapped as (start - 1), the default side stays raw, and the comparison/gap formulas use the canonicalized expressions | DISTANCE under mixed conventions |
| 5 | TestBaseGIQLGenerator | A target table declared with one of the four (coordinate_system, interval_type) combinations and a literal reference range | giqlnearest_sql is called | The distance CASE expression wraps target start/end per the canonical 0-based half-open conversion against the already-canonical literal reference | NEAREST target canonicalization across all four conventions |
| 6 | TestBaseGIQLGenerator | A 0-based half-open target table and an explicit reference column from a 1-based closed table | giqlnearest_sql is called | The reference's start is wrapped as (start - 1), the reference's end stays raw, and the target side stays raw | NEAREST reference canonicalization for column-ref references in standalone mode |
| 7 | TestBaseGIQLGenerator | A 1-based closed outer table and a 0-based half-open target table joined via CROSS JOIN LATERAL with no reference argument | giqlnearest_sql is called | The distance CASE expression wraps the outer table's start as (start - 1), leaves its end raw, and leaves the target side raw | NEAREST reference canonicalization for the correlated/implicit-reference branch |

@conradbzura conradbzura self-assigned this Apr 27, 2026
@conradbzura conradbzura force-pushed the 89-honor-coord-system-distance-nearest branch from 78e5673 to e3f204d Compare April 28, 2026 17:27
@conradbzura conradbzura marked this pull request as ready for review April 28, 2026 17:39
@conradbzura conradbzura force-pushed the 89-honor-coord-system-distance-nearest branch from e3f204d to 3465575 Compare April 28, 2026 18:01
…ISTANCE/NEAREST

DISTANCE and NEAREST previously ignored Table.coordinate_system and
conflated Table.interval_type with bedtools-style "+1" gap counting.
A 1-based-closed table fed raw start/end into the gap formula with no
canonical (start - 1) shift, producing distances that were off-by-one
for any non-default coordinate system. The "+1" gap quirk was implicitly
enabled whenever interval_type was "closed", coupling a counting
convention to a storage convention.

Reuse the _canonical_start and _canonical_end helpers introduced in #88
to canonicalize all four endpoint expressions to 0-based half-open
before they reach _generate_distance_case, so the CASE expression
operates entirely in canonical space. Drop the implicit "+1" gap
counting outright — interval_type no longer drives any gap arithmetic.

BREAKING CHANGE: bedtools-style "+1" gap counting is removed from
DISTANCE and NEAREST. Tables previously declared with
interval_type="closed" will produce different numeric distances —
specifically, gaps will be one less than before. If the legacy "+1"
semantics are required, add the adjustment explicitly in the consuming
query (e.g. wrap the result in `... + 1`).
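The off-by-one described in this commit message can be reproduced with plain arithmetic. The sketch below sets the separate legacy "+1" adjustment aside and looks only at the missing (start - 1) shift; the interval values and the `gap` helper are made up for illustration.

```python
def gap(a_end, b_start):
    """Gap between an upstream interval's exclusive end and a downstream start."""
    return b_start - a_end


# Canonical 0-based half-open intervals A = [10, 20) and B = [30, 40).
true_gap = gap(20, 30)

# The same intervals stored as 1-based closed: A = (11, 20), B = (31, 40).
a_end_stored, b_start_stored = 20, 31

# Feeding raw stored values into the gap formula ignores the coordinate system.
raw_gap = gap(a_end_stored, b_start_stored)

# Canonicalizing first: a 1-based closed end stays as-is, the start shifts by -1.
canonical_gap = gap(a_end_stored, b_start_stored - 1)

print(true_gap, raw_gap, canonical_gap)
```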
…oval

Rename the two pre-existing closed-interval tests to the BDD pattern
and rewrite them to assert that closed-interval tables emit no "+1"
gap adjustment — the legacy interval_type-driven implicit "+1" is gone
and there is no replacement opt-in.

Add parametrized tests covering each of the four
(coordinate_system, interval_type) combinations for DISTANCE,
plus a mixed-conventions DISTANCE test, a parametrized target-side
NEAREST test across all four combinations, and a column-ref
reference NEAREST test. Together they assert that logically
equivalent intervals stored under different conventions yield the
same canonical gap arithmetic.
Closes two coverage gaps left by the previous test commit on this
branch and tightens one brittle assertion:

- Add a NEAREST counterpart to the existing DISTANCE default-mode
  test, verifying that on a closed-interval target table the legacy
  interval_type-driven implicit "+1" is gone — NEAREST canonicalizes
  the target end as (end + 1) but no longer adds "+1" to the gap
  branches unless bedtools_compat is set.
- Add coverage for _resolve_nearest_reference's correlated-implicit
  branch with a 1-based closed outer table, exercising the new
  canonicalization with a non-trivial convention. Existing
  implicit-reference tests use 0-based half-open tables, so the
  canonicalization was a no-op and the new code path was not
  meaningfully observed.
- Replace the or-chained substring assertion in the existing
  bedtools-compat-set NEAREST test with three positive assertions
  that both gap branches carry the bedtools "+1" tail. The previous
  or chain would silently pass even if only one branch were correct.
…NEAREST

Add a new tests/integration/coordinate_space/ package that exercises
DISTANCE and NEAREST end-to-end via DuckDB across all four
(coordinate_system, interval_type) combinations, including mixed pairs.

Each test parametrizes the same canonical 0-based half-open interval(s)
across the four conventions, re-encoding the stored start/end values via
the inverse of _canonical_start / _canonical_end and feeding them through
transpile() into a real DuckDB connection. This proves the canonicalization
fix for #89 holds at the execution layer, not just at SQL emission.

Coverage: same-convention DISTANCE (gap, overlap, adjacent, cross-chrom,
signed +/-), 16 mixed-convention DISTANCE pairs, same-convention NEAREST
(k=1, k=3 membership, max_distance, standalone literal reference,
adjacent boundary), and 16 mixed-convention NEAREST pairs. 76 invocations
total. The new package does not require pybedtools or the bedtools
binary, so it runs on machines without bedtools installed.
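The execution-level approach described in this commit message can be approximated in a few lines. The sketch below is not the integration suite itself: it uses the standard-library sqlite3 module as a stand-in for DuckDB, builds the SQL by hand instead of calling transpile(), and all helper names and convention strings are assumptions. It re-encodes one pair of canonical intervals under all 16 mixed-convention pairs and collects the computed distances.

```python
import itertools
import sqlite3

CONVENTIONS = list(itertools.product(("0based", "1based"), ("half_open", "closed")))


def encode(start, end, cs, it):
    """Inverse of canonicalization: 0-based half-open -> stored values."""
    if it == "closed":
        end -= 1
    if cs == "1based":
        start, end = start + 1, end + 1
    return start, end


def start_sql(col, cs):
    return f"({col} - 1)" if cs == "1based" else col


def end_sql(col, cs, it):
    shift = (1 if it == "closed" else 0) - (1 if cs == "1based" else 0)
    return col if shift == 0 else f"({col} {'+' if shift > 0 else '-'} 1)"


def distance_sql(conv_a, conv_b):
    """Hand-built distance query over canonicalized endpoints."""
    a_s, a_e = start_sql("a.s", conv_a[0]), end_sql("a.e", *conv_a)
    b_s, b_e = start_sql("b.s", conv_b[0]), end_sql("b.e", *conv_b)
    return (
        f"SELECT CASE WHEN {a_s} < {b_e} AND {b_s} < {a_e} THEN 0 "
        f"WHEN {a_e} <= {b_s} THEN {b_s} - {a_e} "
        f"ELSE {a_s} - {b_e} END FROM a, b"
    )


results = set()
for conv_a, conv_b in itertools.product(CONVENTIONS, repeat=2):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (s INTEGER, e INTEGER)")
    con.execute("CREATE TABLE b (s INTEGER, e INTEGER)")
    # Canonical intervals [10, 20) and [30, 40), re-encoded per convention.
    con.execute("INSERT INTO a VALUES (?, ?)", encode(10, 20, *conv_a))
    con.execute("INSERT INTO b VALUES (?, ?)", encode(30, 40, *conv_b))
    (distance,) = con.execute(distance_sql(conv_a, conv_b)).fetchone()
    results.add(distance)
    con.close()

print(results)
```

If canonicalization is applied consistently, the result set contains a single distance regardless of which conventions the two tables use.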
@conradbzura conradbzura force-pushed the 89-honor-coord-system-distance-nearest branch from 3934bb0 to 2a75520 Compare April 28, 2026 19:37
@conradbzura conradbzura merged commit e65afce into main Apr 28, 2026
3 checks passed