Add DISJOIN operator for splitting intervals at reference breakpoints (Closes #87) #100
Merged
DISJOIN is a FROM-clause table function that cuts each target interval at every reference breakpoint strictly interior to it, so no resulting sub-interval partially overlaps a reference interval. The target row passes through unchanged and the sub-interval is appended as disjoin_chrom, disjoin_start and disjoin_end. A coverage filter drops sub-intervals overlapping no reference interval. When no reference is given it defaults to the target set, so selecting the distinct sub-intervals reproduces a Bioconductor-style disjoin. The generator emits a portable WITH-CTE subquery using only UNION, LEAD and EXISTS, so it transpiles unchanged to DuckDB, PostgreSQL and SQLite. Sub-interval boundaries are canonicalized to 0-based half-open regardless of the target or reference coordinate system.
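The splitting rule above can be modelled in a few lines of plain Python. This is a minimal sketch of the semantics only, not the generator's implementation: the `disjoin` function name and the tuple layout are hypothetical, and intervals are assumed to be 0-based half-open already.

```python
from typing import Iterable, Iterator, Optional, Tuple

# (chrom, start, end), assumed 0-based half-open for this sketch.
Interval = Tuple[str, int, int]


def disjoin(
    targets: Iterable[Interval],
    reference: Optional[Iterable[Interval]] = None,
) -> Iterator[Tuple[Interval, Interval]]:
    """Yield (target, sub_interval) pairs following the DISJOIN rule."""
    targets = list(targets)
    # With no reference, the target set is its own reference (self-mode).
    refs = targets if reference is None else list(reference)
    for chrom, start, end in targets:
        same_chrom = [(s, e) for c, s, e in refs if c == chrom]
        # Cut list: the target's own endpoints plus every reference
        # breakpoint strictly interior to the target.
        cuts = {start, end}
        for s, e in same_chrom:
            cuts.update(b for b in (s, e) if start < b < end)
        ordered = sorted(cuts)
        for seg_start, seg_end in zip(ordered, ordered[1:]):
            # Coverage filter: keep only segments inside some reference
            # interval. Segments never straddle a reference boundary, so
            # testing the segment start is sufficient.
            if any(s <= seg_start < e for s, e in same_chrom):
                yield (chrom, start, end), (chrom, seg_start, seg_end)
```

In self-mode, collecting the distinct second elements gives the disjoin-style partition of the input set; in reference mode, segments that fall outside every reference interval are dropped.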
Add DISJOIN to the MCP server operator metadata and documentation path map so it is discoverable through list_operators, explain_operator and search_docs. The list_operators count assertion is updated to match the new operator total.
Add parsing tests for the GIQLDisjoin AST node, transpilation tests for the generated SQL shape, end-to-end execution tests on DuckDB, and a coordinate-space matrix asserting convention-invariant results across all four coordinate-system and interval-type encodings.
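The coordinate-space matrix rests on the fact that each of the four encodings maps to the same 0-based half-open interval. A minimal sketch of that mapping follows; the function name and the `one_based` / `closed` flags are hypothetical and do not reflect how GIQL describes a table's encoding.

```python
from typing import Tuple


def to_zero_based_half_open(
    start: int, end: int, *, one_based: bool, closed: bool
) -> Tuple[int, int]:
    """Convert an interval from one of four encodings to 0-based half-open."""
    # A 1-based coordinate system shifts both endpoints down by one.
    shift = 1 if one_based else 0
    # A closed interval includes its end position, so the half-open end
    # is one past it.
    canonical_start = start - shift
    canonical_end = end - shift + (1 if closed else 0)
    return canonical_start, canonical_end
```

The same ten bases written as (10, 20), (10, 19), (11, 21) and (11, 20) in the four encodings all canonicalize to (10, 20), which is the invariance the matrix asserts.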
Add a DISJOIN section to the aggregation operators reference and a new disjoin recipe covering set partitioning and reference-grid splitting, wired into the recipes index and toctree.
Add tests/usage_patterns.py, a catalogue of the query contexts an operator appears in, and tests/test_usage_patterns.py, a functional suite that transpiles every DISJOIN usage pattern, executes it against a database-engine matrix (DuckDB now; SQLite and PostgreSQL are future entries), and snapshots the result rows with pytest-manifest. The catalogue is a per-operator-class descriptor framework. Only the table-function class is populated -- DISJOIN, exercised across 21 usage patterns in self- and reference-mode. AGGREGATE, PREDICATE, and SCALAR are defined extension points. Three patterns GIQL cannot transpile (a subquery, CTE, or nested operator in a table-function's target position) are catalogued as strict xfail. Add pytest-manifest as a dev dependency and register the usage marker.
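One way to picture a per-operator-class catalogue is a small set of frozen descriptors. The sketch below is illustrative only: the class names, fields, pattern names and the subquery-target query text are hypothetical and are not taken from tests/usage_patterns.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class OperatorClass(Enum):
    TABLE_FUNCTION = "table_function"
    AGGREGATE = "aggregate"  # extension point, no patterns yet
    PREDICATE = "predicate"  # extension point, no patterns yet
    SCALAR = "scalar"  # extension point, no patterns yet


@dataclass(frozen=True)
class UsagePattern:
    name: str
    operator_class: OperatorClass
    query: str
    # Set when the pattern is known not to transpile (a strict xfail).
    xfail_reason: Optional[str] = None


CATALOGUE = (
    UsagePattern(
        "self_mode_bare",
        OperatorClass.TABLE_FUNCTION,
        "SELECT * FROM DISJOIN(features)",
    ),
    UsagePattern(
        "reference_mode_bare",
        OperatorClass.TABLE_FUNCTION,
        "SELECT * FROM DISJOIN(features, reference := refs)",
    ),
    UsagePattern(
        "subquery_target",
        OperatorClass.TABLE_FUNCTION,
        "SELECT * FROM DISJOIN((SELECT * FROM features))",
        xfail_reason="subquery in a table-function target position",
    ),
)


def patterns_for(operator_class: OperatorClass) -> List[UsagePattern]:
    """Return the catalogued patterns for one operator class."""
    return [p for p in CATALOGUE if p.operator_class is operator_class]
```

A functional suite could parametrize over `patterns_for(...)` and mark entries with an `xfail_reason` accordingly.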
This was referenced May 18, 2026
DISJOIN bound only the first positional argument and silently discarded any extras, so DISJOIN(features, refs) dropped the intended reference set without warning. Unknown named arguments were passed through unchecked as well. from_arg_list now raises a ParseError naming the single-positional-arg limit and rejecting any named argument outside the DISJOIN schema, so a mistyped or misplaced reference fails loudly at parse time.
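The validation described above might look roughly like this. It is a sketch under assumptions: `ParseError` is a local stand-in for the parser's error type, `check_disjoin_args` is a hypothetical helper rather than the real from_arg_list, and the schema is assumed to allow only the `reference` named argument.

```python
from typing import Mapping, Sequence


class ParseError(ValueError):
    """Local stand-in for the parser's error type (hypothetical)."""


# Assumption: `reference` is the only named argument in the DISJOIN schema.
ALLOWED_NAMED_ARGS = frozenset({"reference"})


def check_disjoin_args(positional: Sequence[str], named: Mapping[str, str]) -> None:
    """Reject extra positional arguments and unknown named arguments."""
    if len(positional) > 1:
        raise ParseError(
            "DISJOIN accepts a single positional argument (the target), "
            f"got {len(positional)}; pass the reference as reference := <relation>"
        )
    unknown = sorted(set(named) - ALLOWED_NAMED_ARGS)
    if unknown:
        raise ParseError(
            f"DISJOIN does not accept named argument(s): {', '.join(unknown)}"
        )
```

With this check, `DISJOIN(features, refs)` fails at parse time instead of silently running in self-mode.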
A bare DISJOIN reference name was matched against registered tables first and, failing that, assumed to be a canonical CTE. A CTE sharing a registered table's name therefore silently inherited that table's coordinate system, and a name matching neither produced SQL referencing a relation that does not exist. Reference resolution now checks enclosing WITH clauses first, letting an in-query CTE shadow a registered table the way SQL scoping does, and raises a ValueError when a bare name matches neither a CTE nor a registered table. Bare SELECT and UNION references are accepted alongside Subquery nodes, and a reference using the reserved __giql_dj_ prefix is rejected since that prefix names the operator's internal CTEs.
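The resolution order can be sketched as follows. The function name, its arguments and its return values are hypothetical; only the `__giql_dj_` prefix and the CTE-before-table ordering come from the change described above.

```python
from typing import Collection

RESERVED_PREFIX = "__giql_dj_"


def resolve_reference(
    name: str,
    cte_names: Collection[str],
    registered_tables: Collection[str],
) -> str:
    """Classify a bare reference name as 'cte' or 'table', or raise."""
    if name.startswith(RESERVED_PREFIX):
        raise ValueError(
            f"reference {name!r} uses the reserved {RESERVED_PREFIX} prefix"
        )
    # Enclosing WITH clauses are checked first, so an in-query CTE shadows
    # a registered table of the same name, as SQL scoping would.
    if name in cte_names:
        return "cte"
    if name in registered_tables:
        return "table"
    raise ValueError(
        f"DISJOIN reference {name!r} matches neither a CTE nor a registered table"
    )
```

Returning the kind lets the caller decide which coordinate system applies: the registered table's own, or the canonical one for a CTE.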
The end-to-end DISJOIN suite ran against DuckDB only. Parametrize it over a run fixture so every execution test also runs on in-memory SQLite, skipping SQLite below 3.25 where the LEAD() window function DISJOIN emits is unavailable.
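A version gate of this kind can be written with the standard library alone. The sketch below is not the project's run fixture; the helper names are hypothetical, and it only shows the check plus a small LEAD query on in-memory SQLite.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple

# Window functions, including LEAD(), first shipped in SQLite 3.25.0.
LEAD_MIN_VERSION = (3, 25, 0)


def sqlite_supports_lead(
    version_info: Sequence[int] = sqlite3.sqlite_version_info,
) -> bool:
    """Report whether the given SQLite library version provides LEAD()."""
    return tuple(version_info) >= LEAD_MIN_VERSION


def lead_pairs(values: Sequence[int]) -> List[Tuple[int, Optional[int]]]:
    """Pair each value with its successor using LEAD on in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE cuts (pos INTEGER)")
        conn.executemany("INSERT INTO cuts VALUES (?)", [(v,) for v in values])
        return conn.execute(
            "SELECT pos, LEAD(pos) OVER (ORDER BY pos) FROM cuts ORDER BY pos"
        ).fetchall()
    finally:
        conn.close()
```

In a pytest suite, `not sqlite_supports_lead()` could serve as the condition of a `pytest.mark.skipif` marker on the SQLite parametrization.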
DISJOIN was documented under the aggregation operators, but it multiplies rows rather than aggregating them. Move its reference into a new set-operators page, list the page in the dialect index and toctree, and point the MCP operator catalog at the relocated doc.
Ground each recipe in a concrete genomics scenario -- pooled ChIP-seq peak calls, ATAC-seq features against a callable mask, a fixed-width binned coverage matrix -- and note that the uniform-grid recipe relies on DuckDB-specific range() syntax and must span every chromosome present in the input.
e115800 to a3c6168
DISJOIN, NEAREST, MERGE, and CLUSTER each bound the target only when a positional argument was present, so a call with no target -- such as DISJOIN(reference := refs) or MERGE(stranded := true) -- built a node with the target unset. The failure surfaced far downstream as the misleading "Target table 'None' not found". Each from_arg_list now raises a ParseError naming the missing target the moment the call is parsed, so the diagnostic points at the real problem.
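A shared guard for the four operators could be as small as the sketch below. `ParseError` is again a local stand-in and `require_target` is a hypothetical helper, not the code in each from_arg_list.

```python
from typing import Sequence


class ParseError(ValueError):
    """Local stand-in for the parser's error type (hypothetical)."""


def require_target(operator: str, positional: Sequence[str]) -> str:
    """Return the target, or fail at parse time when none was given."""
    if not positional:
        raise ParseError(
            f"{operator} requires a target as its first positional argument"
        )
    return positional[0]
```

Raising here means a call such as `MERGE(stranded := true)` is reported as a missing target rather than as an unknown table named 'None'.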
DISJOIN names its internal CTEs with a __giql_dj_ prefix and already rejects a reference relation that uses it. A target table with the same prefix was unguarded and would emit a self-referential CTE that fails with no actionable error. giqldisjoin_sql now rejects a resolved target name carrying the reserved prefix, symmetric with the existing reference check.
The DISJOIN SQL was assembled as one ~30-fragment string concatenation, where a dropped inter-fragment space would silently produce invalid SQL. Build each __giql_dj_ CTE as a named local and join them once at the return, so every CTE reads as a discrete block. The generated SQL is unchanged.
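The named-local style can be illustrated with a simplified query builder. This is a sketch, not the generated SQL: apart from `__giql_dj_ref` and the `s.kc` / `seg_start` / `seg_end` names, the CTE and column names are hypothetical, tables are assumed to hold 0-based half-open `chrom`, `start`, `end` columns, table names are trusted identifiers, and the join back to full target rows is omitted.

```python
import sqlite3


def build_disjoin_sql(target: str, reference: str = "") -> str:
    """Assemble a simplified DISJOIN query, one CTE per named local."""
    reference = reference or target  # self-mode when no reference is given

    ref_cte = (
        "__giql_dj_ref AS ("
        f'SELECT "chrom", "start", "end" FROM {reference}'
        ")"
    )
    # Distinct reference breakpoints; UNION removes duplicates.
    bp_cte = (
        "__giql_dj_bp AS ("
        'SELECT "chrom" AS kc, "start" AS pos FROM __giql_dj_ref '
        'UNION SELECT "chrom", "end" FROM __giql_dj_ref'
        ")"
    )
    # Each target's cut list: own endpoints plus interior breakpoints.
    cuts_cte = (
        "__giql_dj_cuts AS ("
        'SELECT "chrom" AS kc, "start" AS ts, "end" AS te, "start" AS pos '
        f"FROM {target} "
        'UNION SELECT "chrom", "start", "end", "end" '
        f"FROM {target} "
        'UNION SELECT t."chrom", t."start", t."end", b.pos '
        f"FROM {target} AS t JOIN __giql_dj_bp AS b "
        'ON b.kc = t."chrom" AND b.pos > t."start" AND b.pos < t."end"'
        ")"
    )
    # Consecutive cuts within one target form its sub-intervals.
    seg_cte = (
        "__giql_dj_seg AS ("
        "SELECT kc, ts, te, pos AS seg_start, "
        "LEAD(pos) OVER (PARTITION BY kc, ts, te ORDER BY pos) AS seg_end "
        "FROM __giql_dj_cuts"
        ")"
    )
    # Coverage filter: keep segments lying inside some reference interval.
    final_select = (
        "SELECT s.kc, s.ts, s.te, s.seg_start, s.seg_end "
        "FROM __giql_dj_seg AS s "
        "WHERE s.seg_end IS NOT NULL AND s.seg_end > s.seg_start "
        "AND EXISTS (SELECT 1 FROM __giql_dj_ref AS r "
        'WHERE r."chrom" = s.kc AND r."start" <= s.seg_start '
        'AND r."end" > s.seg_start) '
        "ORDER BY s.kc, s.ts, s.te, s.seg_start"
    )
    # The blocks are joined exactly once, at the return.
    ctes = [ref_cte, bp_cte, cuts_cte, seg_cte]
    return "WITH " + ", ".join(ctes) + " " + final_select


def run_demo():
    """Run the self-mode query on two overlapping intervals in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE features ("chrom" TEXT, "start" INTEGER, "end" INTEGER)'
        )
        conn.executemany(
            "INSERT INTO features VALUES (?, ?, ?)",
            [("chr1", 0, 10), ("chr1", 5, 15)],
        )
        return conn.execute(build_disjoin_sql("features")).fetchall()
    finally:
        conn.close()
```

Because each CTE is a self-contained string, a missing space can only occur inside one block, where it is easy to spot, and the single join controls all inter-block separators.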
The DISJOIN transpilation and execution test docstrings used a free-form GIVEN/WHEN/THEN prose block. Convert them to the structured Given/When/Then form the test guide specifies and add the Arrange-Act-Assert phase comments, matching test_disjoin_parsing.py. Also add a transpilation test for the reserved-prefix target guard.
Drop the structural banner-comment rule lines that divide usage_patterns.py into sections, keeping the explanatory prose. Make _annotations delegate to _features_like rather than duplicate it. Hoist the duckdb import in test_usage_patterns.py to module scope.
Note in the set-operators reference that disjoin_chrom, disjoin_start, and disjoin_end are reserved output column names that collide with a same-named target column. Trim the recipe's "Coming from Bedtools?" section to drop analogies it does not deliver, and correct two section underlines to their title lengths.
The usage-pattern snapshot suite depends on pytest-manifest, declared in the dev dependency group but never mirrored into the pixi environment that CI builds. CI therefore ran without the plugin and every snapshot test errored with "fixture 'manifest' not found". pytest-manifest has no conda-forge package, so add it under tool.pixi.pypi-dependencies -- the pixi environment now installs it from PyPI alongside the conda-managed test dependencies.
conradbzura commented May 19, 2026
"WHERE s.seg_end IS NOT NULL AND s.seg_end > s.seg_start "
"AND EXISTS (SELECT 1 FROM __giql_dj_ref AS r "
f'WHERE r."{ref_chrom}" = s.kc AND {r_start} <= s.seg_start '
f"AND {r_end} > s.seg_start)"
It would be nice to have template files for these strings — possibly using Jinja?
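As a rough illustration of the idea, the coverage filter from the snippet above could live in a template. The sketch uses the standard library's `string.Template` as a stand-in for Jinja, so it shows the shape of the approach rather than Jinja syntax; `render_coverage_filter` is a hypothetical helper, and the placeholder names mirror the variables in the reviewed lines.

```python
from string import Template

# The fragment from the reviewed lines, with $-placeholders in place of
# f-string interpolation.
COVERAGE_FILTER = Template(
    "WHERE s.seg_end IS NOT NULL AND s.seg_end > s.seg_start "
    "AND EXISTS (SELECT 1 FROM __giql_dj_ref AS r "
    'WHERE r."$ref_chrom" = s.kc AND $r_start <= s.seg_start '
    "AND $r_end > s.seg_start)"
)


def render_coverage_filter(ref_chrom: str, r_start: str, r_end: str) -> str:
    """Fill the template; substitute() raises KeyError on a missing name."""
    return COVERAGE_FILTER.substitute(
        ref_chrom=ref_chrom, r_start=r_start, r_end=r_end
    )
```

A template file would keep the SQL readable as SQL, at the cost of moving the placeholder names away from the code that computes them.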
Add a DISJOIN section to the demo notebook covering self-mode and reference-mode. Self-mode is shown twice: once partitioning the features_a peak set at scale, and once on a small hand-built set of overlapping intervals where every cut is traceable by eye. The reference-mode example splits features_a at the breakpoints of features_b. The summary now lists DISJOIN as the eighth operator.
Summary
Add DISJOIN, a table function that splits a set of target intervals against a reference set of intervals. DISJOIN(target, reference := ref) appears in the FROM clause, like NEAREST, and cuts each target interval at every reference breakpoint strictly interior to it, so no resulting sub-interval partially overlaps a reference interval. The full target row passes through unchanged and the sub-interval is appended as disjoin_chrom, disjoin_start, and disjoin_end; sub-intervals overlapping no reference interval are dropped. When reference is omitted it defaults to the target set, so SELECT DISTINCT disjoin_* FROM DISJOIN(features) reproduces a Bioconductor-style disjoin() partition.
The operator is generator-only: it emits a self-contained WITH-CTE subquery using only UNION, LEAD, and EXISTS (no LATERAL, no generate_series), so it transpiles unchanged to DuckDB, PostgreSQL, and SQLite. Sub-interval boundaries are canonicalized to 0-based half-open coordinates regardless of the target or reference table encoding. The output schema is uniform across all backends: rather than renaming the target's columns (not portable without schema introspection), the target row passes through intact and the sub-interval is appended under distinct disjoin_* names.
A bin_width convenience parameter and a revmap-style array of covering reference identifiers are out of scope here and tracked as possible follow-ups.
Closes #87
Proposed changes
DISJOIN operator
Add the GIQLDisjoin AST node and register DISJOIN with the parser. The generator emits a parenthesized WITH-CTE subquery: it collects distinct reference breakpoints, builds each target's cut list (own endpoints plus interior breakpoints), forms sub-intervals with a LEAD window, and applies an EXISTS coverage filter. Segments join back to target rows on raw geometry, so duplicate target rows split independently with no synthetic row id. The reference resolves from a table, a CTE, or a subquery, defaulting to the target set.
MCP operator catalog
Register DISJOIN in the MCP server operator metadata and documentation path map so it surfaces through list_operators, explain_operator, and search_docs.
Documentation
Add a DISJOIN section to the aggregation operators reference and a new disjoin recipe covering set partitioning and reference-grid splitting, wired into the recipes index and toctree.
Test cases
TestDisjoinParsing: DISJOIN(features) call with a positional target; this
TestDisjoinParsing: DISJOIN(features) call with no reference; reference argument
TestDisjoinParsing: reference := refs named argument
TestDisjoinParsing: reference => refs named argument
TestDisjoinParsing: (SELECT ...) subquery
TestDisjoinTranspilation: SELECT * FROM DISJOIN(features) query
TestDisjoinTranspilation: disjoin_chrom / disjoin_start / disjoin_end
TestDisjoinTranspilation (5 more cases)
TestDisjoinExecution (10 cases)
TestDisjoinCoordinateSpace (1 more case)
TestDisjoinCoordinateSpace: DISJOIN(features, reference := refs) runs