Add a DuckDB-specific generator that transpiles column-to-column INTERSECTS joins into dynamically-generated per-chromosome IEJoin queries, triggering DuckDB's native IE_JOIN operator instead of the current binned equi-join.
DuckDB's IEJoin handles a.start < b.end AND a.end > b.start in O(n log n + k), but only activates for pure inequality joins — adding an equality condition like chrom = chrom forces a HASH_JOIN that degrades to O(n*m) per partition. The workaround is to emit a two-step dynamic SQL pattern that builds per-partition subqueries at query time:
-- Step 1: dynamically build UNION ALL of per-chrom IEJoin subqueries
SET VARIABLE iejoin_query = (
    SELECT string_agg(
        'SELECT a.start, a."end" AS a_end, b.start AS b_start, b."end" AS b_end ' ||
        'FROM (SELECT start, "end" FROM t1 WHERE chrom = ''' || chrom || ''') a ' ||
        'JOIN (SELECT start, "end" FROM t2 WHERE chrom = ''' || chrom || ''') b ' ||
        'ON a.start < b."end" AND a."end" > b.start',
        ' UNION ALL '
    )
    FROM (
        SELECT DISTINCT a.chrom
        FROM t1 a
        INNER JOIN (SELECT DISTINCT chrom FROM t2) b ON a.chrom = b.chrom
        ORDER BY a.chrom
    )
);
-- Step 2: execute the generated query
EXECUTE getvariable('iejoin_query');
This discovers partition values from the data at runtime — no hardcoded chromosome list needed. Each subquery gets the IE_JOIN plan. The string_agg construction adds negligible overhead (~0.1s).
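To make the shape of the generated text concrete, here is a minimal Python sketch of what the iejoin_query variable would hold for a given set of shared chromosomes. It mirrors the string_agg expression on the client side rather than inside DuckDB. The function name expand_iejoin_query is hypothetical and not part of GIQL, and the sketch assumes chromosome names contain no single quotes.

```python
def expand_iejoin_query(chroms, left="t1", right="t2"):
    """Return the UNION ALL text that Step 1 would store in iejoin_query.

    Hypothetical helper for illustration only; assumes chromosome names
    contain no single quotes (no escaping is done here).
    """
    branches = []
    for chrom in sorted(chroms):  # sorted() stands in for ORDER BY a.chrom
        branches.append(
            'SELECT a.start, a."end" AS a_end, b.start AS b_start, b."end" AS b_end '
            f"FROM (SELECT start, \"end\" FROM {left} WHERE chrom = '{chrom}') a "
            f"JOIN (SELECT start, \"end\" FROM {right} WHERE chrom = '{chrom}') b "
            'ON a.start < b."end" AND a."end" > b.start'
        )
    return " UNION ALL ".join(branches)


if __name__ == "__main__":
    # Only chromosomes present in both tables get a branch.
    shared = {"chr1", "chr2"} & {"chr1", "chr2", "chrX"}
    print(expand_iejoin_query(shared))
```

Each branch is a pure inequality join over one chromosome, which is the shape the text above relies on to avoid the equality predicate.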
Motivation
Benchmarking on real genomic data (20M BigWig intervals vs 2.3M cCREs, 7.2M overlapping pairs):
| Strategy | vs. binned equi-join |
| --- | --- |
| DuckDB binned equi-join (current GIQL) | baseline |
| DuckDB dynamic IEJoin (SET + EXECUTE) | ~2x faster |
| polars-bio Coitrees | ~2.6x faster |
| polars-bio SuperIntervals | ~4.4x faster |
The binned equi-join pays overhead from UNNEST (row inflation) and DISTINCT (deduplication) that the IEJoin approach avoids entirely. This optimization is DuckDB-specific — vanilla DataFusion uses NestedLoopJoinExec for inequality joins and would not benefit.
Expected outcome
A DuckDB-specific code path in the transpiler that emits the two-step SET VARIABLE / EXECUTE pattern for INTERSECTS joins.
The partition column defaults to the chromosome column but is configurable.
The generic binned equi-join remains the default when no dialect is specified.
transpile() accepts a parameter (e.g. dialect="duckdb") to opt in.
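One way the opt-in could be wired is sketched below, under stated assumptions. IntersectsStrategy and select_intersects_strategy are hypothetical names, not GIQL's existing API; the only dialect string taken from this issue is "duckdb", and raising on an unknown dialect is an illustrative choice (falling back to the binned plan would be equally plausible).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntersectsStrategy:
    """Hypothetical description of how an INTERSECTS join gets emitted."""
    name: str                               # "binned_equi_join" or "dynamic_iejoin"
    partition_column: Optional[str] = None  # only used by the dynamic path


def select_intersects_strategy(dialect: Optional[str] = None,
                               partition_column: str = "chrom") -> IntersectsStrategy:
    """Pick the code path for a column-to-column INTERSECTS join."""
    if dialect is None:
        # No dialect given: keep the generic binned equi-join as the default.
        return IntersectsStrategy("binned_equi_join")
    if dialect.lower() == "duckdb":
        # DuckDB opt-in: per-partition SET VARIABLE / EXECUTE pattern.
        return IntersectsStrategy("dynamic_iejoin", partition_column)
    raise ValueError(f"unsupported dialect: {dialect!r}")
```

For example, select_intersects_strategy("duckdb", partition_column="contig") would select the dynamic path partitioned on a column named contig.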
Implementation considerations
Surfaced during a benchmark/review session before implementation begins. The first three are correctness-load-bearing; the rest are quality concerns.
Correctness concerns
Outer join semantics. A UNION ALL of per-chromosome INNER joins drops LEFT / RIGHT / FULL semantics for chromosomes present on only one side. The current binned plan handles outer joins natively. The dialect path must either recover that (e.g. add a separate "unmatched chromosomes" branch) or restrict itself to inner joins and fall back to the binned plan for outer-join shapes.
Coordinate system and interval type. PR #88 ("Table.coordinate_system and Table.interval_type are silently ignored by spatial-predicate transpilation") made spatial predicates honor each Table's coordinate system and interval type (half-open vs closed-closed). The example uses < and >, which is correct only for half-open intervals. The dialect path must thread the same comparison-operator selection through the dynamic-SQL builder, or it will silently regress for tables declared as 1-based / closed-closed.
Empty chromosome intersection. When the inner join over distinct chromosomes returns no rows, string_agg returns NULL and EXECUTE getvariable('iejoin_query') fails at runtime. The dialect needs a fallback that emits an empty result set with the correct output schema.
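The three correctness concerns above can each be reduced to a small decision in the builder. The sketch below is illustrative only: all function names are hypothetical, it assumes both tables share the same interval type (mixed conventions would need per-side adjustments), and the column types passed to the empty-result helper are supplied by the caller.

```python
from typing import Sequence, Tuple


def overlap_predicate(closed: bool) -> str:
    """Overlap test for two intervals that share one interval type.

    Half-open intervals need strict comparisons; closed-closed intervals
    need inclusive ones, because touching endpoints then do overlap.
    """
    lt, gt = ("<=", ">=") if closed else ("<", ">")
    return f'a.start {lt} b."end" AND a."end" {gt} b.start'


def use_dynamic_iejoin(join_type: str) -> bool:
    """Only inner joins take the per-chromosome path; others fall back."""
    return join_type.strip().upper() in {"INNER", "JOIN", "INNER JOIN"}


def empty_result_sql(columns: Sequence[Tuple[str, str]]) -> str:
    """SELECT that yields zero rows but keeps the output schema.

    `columns` is a sequence of (name, sql_type) pairs such as
    ("a_end", "BIGINT").
    """
    select_list = ", ".join(
        f'CAST(NULL AS {sql_type}) AS "{name}"' for name, sql_type in columns
    )
    return f"SELECT {select_list} WHERE false"
```

Inside the SQL itself, one option (not benchmarked here) is to wrap the aggregate as coalesce(string_agg(...), fallback_text), with fallback_text produced by something like empty_result_sql, so that the variable is never NULL when no chromosomes are shared.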
Implementation gaps
Arbitrary projections. The example selects only start / end. Real queries project arbitrary user columns, including renames and aliases from the original SELECT. The dynamic-SQL builder must pass those through, not hardcode a shape.
Chromosome-name escaping. Direct string concatenation breaks if a contig name contains a single quote. Genomic data is usually clean, but non-human assemblies occasionally produce unusual names. Use quote_literal() or DuckDB parameter substitution.
Many-contig genomes. Wheat, maize, and draft assemblies can have thousands of scaffolds. A UNION ALL with several thousand branches may hit parser/planner cliffs or simply lose to binning on plan time alone. Worth benchmarking before committing, and documenting an upper limit (or an automatic fallback above a threshold).
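The three gaps above (projection pass-through, literal escaping, and a branch-count ceiling) can be sketched together in a client-side builder. This is a hedged illustration, not the proposed implementation: it assumes the chromosome list has already been fetched by a separate DISTINCT query, the names sql_literal and build_union are hypothetical, and MAX_BRANCHES is an arbitrary placeholder that would need to come from the benchmarking the issue calls for.

```python
from typing import Optional, Sequence

MAX_BRANCHES = 500  # hypothetical cutoff; the real value needs benchmarking


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_union(chroms: Sequence[str], select_list: str, predicate: str,
                left: str = "t1", right: str = "t2",
                max_branches: int = MAX_BRANCHES) -> Optional[str]:
    """Return the per-chromosome UNION ALL, or None to signal a fallback.

    `select_list` is the user's projection (with aliases) passed through
    unchanged, so the output shape is not hardcoded to start / end.
    """
    if not chroms or len(chroms) > max_branches:
        # Nothing shared, or too many contigs: let the caller use the
        # binned plan (or an empty-result query) instead.
        return None
    branches = [
        f"SELECT {select_list} "
        f"FROM (SELECT * FROM {left} WHERE chrom = {sql_literal(c)}) a "
        f"JOIN (SELECT * FROM {right} WHERE chrom = {sql_literal(c)}) b "
        f"ON {predicate}"
        for c in chroms
    ]
    return " UNION ALL ".join(branches)
```

Returning None rather than raising keeps the fallback decision with the caller, which already knows how to emit the binned plan.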
Smaller refinements
Soften the "~2× faster" claim. The synthetic N=1M benchmark showed 1.56× and the real-data BigWig × cCRE benchmark showed 2×. State as "1.5–2× depending on workload."
Don't promise IE_JOIN specifically. The O(n log n + k) claim applies to IEJoin; DuckDB may pick PIECEWISE_MERGE_JOIN for single-inequality cases. Phrase as "DuckDB's range-join family (IE_JOIN or PIECEWISE_MERGE_JOIN)."
Benchmark context
Benchmarks added during the review session (currently uncommitted on 88-honor-table-coordinate-system-and-interval-type):
benchmarks/bench_chrom_vs_binned.py: binned vs raw chrom-equality vs per-chrom IEJoin
benchmarks/bench_iejoin_coercion.py: encoded-BIGINT trick to coerce global IEJoin without UNION ALL
benchmarks/bench_pragma_range_joins.py: SET prefer_range_joins=true test
benchmarks/bench_1m.py: head-to-head at N=1M
The N=1M synthetic comparison (8.4M output rows, 24 chromosomes, intervals 200–50 kb) included prefer_range_joins=true + chrom-equality and the per-chrom UNION ALL (this issue).
Two alternatives investigated and rejected:
SET prefer_range_joins=true does coerce IE_JOIN globally on a chrom-equality query, but is slower than binned because the global IEJoin generates cross-chromosome candidate pairs that get discarded as residuals. Not viable.
Encoded BIGINT (pack chrom_id * 1e9 + coord into a single value to eliminate the equality predicate entirely) also coerces global IE_JOIN, but matches binned within ~10%. Same blind spot as the pragma: no per-chrom partitioning. Not adopted.
The per-chrom UNION ALL wins precisely because it never materializes cross-chromosome candidate pairs. That's also why its advantage grows with N (1.07× at 250 k → 1.56× at 1 M).
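For readers unfamiliar with the encoded-BIGINT alternative, the following Python sketch works through the arithmetic. It shows why packing chrom_id * 1e9 + coord lets a pure inequality predicate stand in for the chromosome equality check: intervals on different chromosomes land in disjoint value ranges, so they can never satisfy the overlap test. The helper names are hypothetical, and the sketch assumes every coordinate lies in [0, 1e9) and that intervals are half-open.

```python
OFFSET = 1_000_000_000  # 1e9; assumes every coordinate is in [0, 1e9)


def pack(chrom_id: int, coord: int) -> int:
    """Encode (chrom_id, coord) as one integer: chrom_id * 1e9 + coord."""
    assert 0 <= coord < OFFSET
    return chrom_id * OFFSET + coord


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Half-open overlap test, the same predicate as the SQL example."""
    return a_start < b_end and a_end > b_start


def packed_overlaps(a, b) -> bool:
    """a and b are (chrom_id, start, end); compare packed values only."""
    (ca, s1, e1), (cb, s2, e2) = a, b
    return overlaps(pack(ca, s1), pack(ca, e1), pack(cb, s2), pack(cb, e2))
```

The encoding is correct, but as noted above it still runs one global join rather than one join per chromosome, which is consistent with it only matching the binned plan.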