Honor Table.coordinate_system and decouple bedtools-compat from interval_type in DISTANCE / NEAREST — Closes #89#91
Merged
Conversation
78e5673 to
e3f204d
Compare
conradbzura
commented
Apr 28, 2026
e3f204d to
3465575
Compare
…ISTANCE/NEAREST DISTANCE and NEAREST previously ignored Table.coordinate_system and conflated Table.interval_type with bedtools-style "+1" gap counting. A 1-based-closed table fed raw start/end into the gap formula with no canonical (start - 1) shift, producing distances that were off-by-one for any non-default coordinate system. The "+1" gap quirk was implicitly enabled whenever interval_type was "closed", coupling a counting convention to a storage convention. Reuse the _canonical_start and _canonical_end helpers introduced in #88 to canonicalize all four endpoint expressions to 0-based half-open before they reach _generate_distance_case, so the CASE expression operates entirely in canonical space. Drop the implicit "+1" gap counting outright — interval_type no longer drives any gap arithmetic. BREAKING CHANGE: bedtools-style "+1" gap counting is removed from DISTANCE and NEAREST. Tables previously declared with interval_type="closed" will produce different numeric distances — specifically, gaps will be one less than before. If the legacy "+1" semantics are required, add the adjustment explicitly in the consuming query (e.g. wrap the result in `... + 1`).
…oval Rename the two pre-existing closed-interval tests to the BDD pattern and rewrite them to assert that closed-interval tables emit no "+1" gap adjustment — the legacy interval_type-driven implicit "+1" is gone and there is no replacement opt-in. Add parametrized tests covering each of the four (coordinate_system, interval_type) combinations for DISTANCE, plus a mixed-conventions DISTANCE test, a parametrized target-side NEAREST test across all four combinations, and a column-ref reference NEAREST test. Together they assert that logically equivalent intervals stored under different conventions yield the same canonical gap arithmetic.
Closes two coverage gaps left by the previous test commit on this branch and tightens one brittle assertion: - Add a NEAREST counterpart to the existing DISTANCE default-mode test, verifying that on a closed-interval target table the legacy interval_type-driven implicit "+1" is gone — NEAREST canonicalizes the target end as (end + 1) but no longer adds "+1" to the gap branches unless bedtools_compat is set. - Add coverage for _resolve_nearest_reference's correlated-implicit branch with a 1-based closed outer table, exercising the new canonicalization with a non-trivial convention. Existing implicit-reference tests use 0-based half-open tables, so the canonicalization was a no-op and the new code path was not meaningfully observed. - Replace the or-chained substring assertion in the existing bedtools-compat-set NEAREST test with three positive assertions that both gap branches carry the bedtools "+1" tail. The previous or chain would silently pass even if only one branch were correct.
…NEAREST Add a new tests/integration/coordinate_space/ package that exercises DISTANCE and NEAREST end-to-end via DuckDB across all four (coordinate_system, interval_type) combinations, including mixed pairs. Each test parametrizes the same canonical 0-based half-open interval(s) across the four conventions, re-encoding the stored start/end values via the inverse of _canonical_start / _canonical_end and feeding them through transpile() into a real DuckDB connection. This proves the canonicalization fix for #89 holds at the execution layer, not just at SQL emission. Coverage: same-convention DISTANCE (gap, overlap, adjacent, cross-chrom, signed +/-), 16 mixed-convention DISTANCE pairs, same-convention NEAREST (k=1, k=3 membership, max_distance, standalone literal reference, adjacent boundary), and 16 mixed-convention NEAREST pairs. 76 invocations total. The new package does not require pybedtools or the bedtools binary, so it runs on machines without bedtools installed.
3934bb0 to
2a75520
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make DISTANCE and NEAREST honor
Table.coordinate_systemand stop conflatingTable.interval_typewith bedtools-style "+1" gap counting, so logically-equivalent intervals stored under different conventions yield the same numeric distance. Reuse the_canonical_startand_canonical_endhelpers introduced in #88 to canonicalize all four endpoint expressions to 0-based half-open before they reach_generate_distance_case, and drop the implicit "+1" gap counting that closed-interval tables previously triggered.This is a behavior change with no opt-in: tables declared with
interval_type="closed"no longer apply the "+1" gap quirk for either DISTANCE or NEAREST, andinterval_typeno longer drives any gap arithmetic. Callers who relied on the legacy "+1" must add it explicitly in the consuming query (e.g. wrap the result in... + 1). The two pre-existing bedtools-compat tests are rewritten to assert the new behavior, and parametrized regression tests cover canonicalization across all four(coordinate_system, interval_type)combinations for both operators.Literal-form DISTANCE (still
NotImplementedError) and the spatial-predicate code paths fixed in #88 are out of scope.Closes #89
Proposed changes
src/giql/generators/base.pygiqldistance_sql: resolve each side'sTableconfig via_resolve_tableand wrap rawstart/endwith_canonical_start/_canonical_endbefore they reach_generate_distance_case.giqlnearest_sql: hoist the target-Tablelookup and wrap target endpoints via the helpers._resolve_nearest_reference: canonicalize column-ref returns via_canonical_start/_canonical_endin all three branches — standalone column reference, correlated explicit reference, and correlated implicit reference (literal returns are already canonical viaRangeParser.to_zero_based_half_open)._generate_distance_case: drop theadd_one_for_gapparameter and thegap_adjterm; the CASE expression now operates entirely in canonical 0-based half-open space.tests/generators/test_base.py+1gap counting.(coordinate_system, interval_type)combinations, a mixed-conventions DISTANCE test, parametrized NEAREST target-side tests, a column-ref reference NEAREST test (standalone mode), and a correlated/implicit-reference NEAREST test that exercises canonicalization of the outer table's columns.orchain would have silently passed even if only one branch were correct).docs/transpilation/index.rst(coordinate_system, interval_type)config, and add a.. versionchanged::directive flagging the removal of the implicit+1gap counting.Test cases
TestBaseGIQLGeneratorgiqldistance_sqlis calledendas(end + 1)and emits no+ 1in the gap branchesTestBaseGIQLGeneratorgiqlnearest_sqlis calledendas(end + 1)and emits no+ 1in the gap branchesTestBaseGIQLGenerator(coordinate_system, interval_type)combinationsstart/endis wrapped per the canonical 0-based half-open conversion, leaving comparison and gap formulas otherwise unchangedTestBaseGIQLGeneratorstartis wrapped as(start - 1), the default side stays raw, and the comparison/gap formulas use the canonicalized expressionsTestBaseGIQLGenerator(coordinate_system, interval_type)combinations and a literal reference rangegiqlnearest_sqlis calledstart/endper the canonical 0-based half-open conversion against the already-canonical literal referenceTestBaseGIQLGeneratorgiqlnearest_sqlis calledstartis wrapped as(start - 1), the reference'sendstays raw, and the target side stays rawTestBaseGIQLGeneratorCROSS JOIN LATERALwith noreferenceargumentgiqlnearest_sqlis calledstartas(start - 1), leaves itsendraw, and leaves the target side raw