Skip to content

Transpile INTERSECTS joins to per-chromosome IEJoin queries for DuckDB #85

@conradbzura

Description

@conradbzura

Description

Add a DuckDB-specific generator that transpiles column-to-column INTERSECTS joins into dynamically-generated per-chromosome IEJoin queries, triggering DuckDB's native IE_JOIN operator instead of the current binned equi-join.

DuckDB's IEJoin handles a.start < b.end AND a.end > b.start in O(n log n + k), but only activates for pure inequality joins — adding an equality condition like chrom = chrom forces a HASH_JOIN that degrades to O(n*m) per partition. The workaround is to emit a two-step dynamic SQL pattern that builds per-partition subqueries at query time:

-- Step 1: dynamically build UNION ALL of per-chrom IEJoin subqueries
SET VARIABLE iejoin_query = (
    SELECT string_agg(
        'SELECT a.start, a."end" AS a_end, b.start AS b_start, b."end" AS b_end '
        || 'FROM (SELECT start, "end" FROM t1 WHERE chrom = ''' || chrom || ''') a '
        || 'JOIN (SELECT start, "end" FROM t2 WHERE chrom = ''' || chrom || ''') b '
        || 'ON a.start < b."end" AND a."end" > b.start',
        ' UNION ALL '
    )
    FROM (
        SELECT DISTINCT a.chrom FROM t1 a
        INNER JOIN (SELECT DISTINCT chrom FROM t2) b ON a.chrom = b.chrom
        ORDER BY a.chrom
    )
);

-- Step 2: execute the generated query
EXECUTE getvariable('iejoin_query');

This discovers partition values from the data at runtime — no hardcoded chromosome list needed. Each subquery gets the IE_JOIN plan. The string_agg construction adds negligible overhead (~0.1s).

Motivation

Benchmarking on real genomic data (20M BigWig intervals vs 2.3M cCREs, 7.2M overlapping pairs):

Strategy vs. binned equi-join
DuckDB binned equi-join (current GIQL) baseline
DuckDB dynamic IEJoin (SET + EXECUTE) ~2x faster
polars-bio Coitrees ~2.6x faster
polars-bio SuperIntervals ~4.4x faster

The binned equi-join pays overhead from UNNEST (row inflation) and DISTINCT (deduplication) that the IEJoin approach avoids entirely. This optimization is DuckDB-specific — vanilla DataFusion uses NestedLoopJoinExec for inequality joins and would not benefit.

Expected outcome

  • A DuckDB-specific code path in the transpiler that emits the two-step SET VARIABLE / EXECUTE pattern for INTERSECTS joins.
  • The partition column defaults to the chromosome column but is configurable.
  • The generic binned equi-join remains the default when no dialect is specified.
  • transpile() accepts a parameter (e.g. dialect="duckdb") to opt in.

Implementation considerations

Surfaced during a benchmark/review session before implementation begins. The first three are correctness-load-bearing; the rest are quality concerns.

Correctness concerns

  1. Outer join semantics. A UNION ALL of per-chromosome INNER joins drops LEFT / RIGHT / FULL semantics for chromosomes present on only one side. The current binned plan handles outer joins natively. The dialect path must either recover that (e.g. add a separate "unmatched chromosomes" branch) or restrict itself to inner joins and fall back to the binned plan for outer-join shapes.

  2. Coordinate system and interval type. PR Table.coordinate_system and Table.interval_type are silently ignored by spatial-predicate transpilation #88 made spatial predicates honor each Table's coordinate system and interval type (half-open vs closed-closed). The example uses < and >, which is correct only for half-open. The dialect path must thread the same comparison-operator selection through the dynamic-SQL builder or it will silently regress for tables declared as 1-based / closed-closed.

  3. Empty chromosome intersection. When the inner join over distinct chromosomes returns no rows, string_agg returns NULL and EXECUTE getvariable('iejoin_query') fails at runtime. The dialect needs a fallback that emits an empty result set with the correct output schema.

Implementation gaps

  1. Arbitrary projections. The example selects only start / end. Real queries project arbitrary user columns, including renames and aliases from the original SELECT. The dynamic-SQL builder must pass those through, not hardcode a shape.

  2. Chromosome-name escaping. Direct string concatenation breaks if a contig name contains a single quote. Genomic data is usually clean, but non-human assemblies occasionally produce unusual names. Use quote_literal() or DuckDB parameter substitution.

  3. Many-contig genomes. Wheat, maize, and draft assemblies can have thousands of scaffolds. A UNION ALL with several thousand branches may hit parser/planner cliffs or simply lose to binning on plan time alone. Worth benchmarking before committing, and documenting an upper limit (or an automatic fallback above a threshold).

Smaller refinements

  1. Soften the "~2× faster" claim. The synthetic N=1M benchmark showed 1.56× and the real-data BigWig × cCRE benchmark showed 2×. State as "1.5–2× depending on workload."

  2. Don't promise IE_JOIN specifically. The O(n log n + k) claim applies to IEJoin; DuckDB may pick PIECEWISE_MERGE_JOIN for single-inequality cases. Phrase as "DuckDB's range-join family (IE_JOIN or PIECEWISE_MERGE_JOIN)."

Benchmark context

Benchmarks added during the review session (currently uncommitted on 88-honor-table-coordinate-system-and-interval-type):

  • benchmarks/bench_chrom_vs_binned.py — binned vs raw chrom-equality vs per-chrom IEJoin
  • benchmarks/bench_iejoin_coercion.py — encoded-BIGINT trick to coerce global IEJoin without UNION ALL
  • benchmarks/bench_pragma_range_joins.pySET prefer_range_joins=true test
  • benchmarks/bench_1m.py — head-to-head at N=1M

N=1M synthetic results (8.4M output rows, 24 chromosomes, intervals 200–50 kb):

Plan Time (s) vs binned
Binned (current) 0.273 1.00×
prefer_range_joins=true + chrom-equality 0.554 2.03× slower
Per-chrom UNION ALL (this issue) 0.175 1.56× faster

Two alternatives investigated and rejected:

  • SET prefer_range_joins=true does coerce IE_JOIN globally on a chrom-equality query, but is slower than binned because the global IEJoin generates cross-chromosome candidate pairs that get discarded as residuals. Not viable.
  • Encoded BIGINT (pack chrom_id * 1e9 + coord into a single value to eliminate the equality predicate entirely) also coerces global IE_JOIN, but matches binned within ~10%. Same blind spot as the pragma — no per-chrom partitioning. Not adopted.

The per-chrom UNION ALL wins precisely because it never materializes cross-chromosome candidate pairs. That's also why its advantage grows with N (1.07× at 250 k → 1.56× at 1 M).

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions