Transpile INTERSECTS joins to per-chromosome IEJoin queries for DuckDB

## Description

Add a DuckDB-specific generator that transpiles column-to-column `INTERSECTS` joins into dynamically-generated per-chromosome IEJoin queries, triggering DuckDB's native `IE_JOIN` operator instead of the current binned equi-join.

DuckDB's IEJoin handles `a.start < b.end AND a.end > b.start` in O(n log n + k), but only activates for pure inequality joins — adding an equality condition like `chrom = chrom` forces a `HASH_JOIN` that degrades to O(n*m) per partition. The workaround is to emit a two-step dynamic SQL pattern that builds per-partition subqueries at query time:

```sql
-- Step 1: dynamically build UNION ALL of per-chrom IEJoin subqueries
SET VARIABLE iejoin_query = (
    SELECT string_agg(
        'SELECT a.start, a."end" AS a_end, b.start AS b_start, b."end" AS b_end '
        || 'FROM (SELECT start, "end" FROM t1 WHERE chrom = ''' || chrom || ''') a '
        || 'JOIN (SELECT start, "end" FROM t2 WHERE chrom = ''' || chrom || ''') b '
        || 'ON a.start < b."end" AND a."end" > b.start',
        ' UNION ALL '
    )
    FROM (
        SELECT DISTINCT a.chrom FROM t1 a
        INNER JOIN (SELECT DISTINCT chrom FROM t2) b ON a.chrom = b.chrom
        ORDER BY a.chrom
    )
);

-- Step 2: execute the generated query
EXECUTE getvariable('iejoin_query');
```

This discovers partition values from the data at runtime — no hardcoded chromosome list needed. Each subquery gets the `IE_JOIN` plan. The `string_agg` construction adds negligible overhead (~0.1s).

## Motivation

Benchmarking on real genomic data (20M BigWig intervals vs 2.3M cCREs, 7.2M overlapping pairs):

| Strategy | vs. binned equi-join |
|---|---|
| DuckDB binned equi-join (current GIQL) | baseline |
| DuckDB dynamic IEJoin (SET + EXECUTE) | ~2x faster |
| polars-bio Coitrees | ~2.6x faster |
| polars-bio SuperIntervals | ~4.4x faster |

The binned equi-join pays overhead from `UNNEST` (row inflation) and `DISTINCT` (deduplication) that the IEJoin approach avoids entirely. This optimization is DuckDB-specific — vanilla DataFusion uses `NestedLoopJoinExec` for inequality joins and would not benefit.

## Expected outcome

- A DuckDB-specific code path in the transpiler that emits the two-step `SET VARIABLE` / `EXECUTE` pattern for `INTERSECTS` joins.
- The partition column defaults to the chromosome column but is configurable.
- The generic binned equi-join remains the default when no dialect is specified.
- `transpile()` accepts a parameter (e.g. `dialect="duckdb"`) to opt in.

## Implementation considerations

Surfaced during a benchmark/review session before implementation begins. The first three are correctness-load-bearing; the rest are quality concerns.

### Correctness concerns

1. **Outer join semantics.** A `UNION ALL` of per-chromosome `INNER` joins drops `LEFT` / `RIGHT` / `FULL` semantics for chromosomes present on only one side. The current binned plan handles outer joins natively. The dialect path must either recover that (e.g. add a separate "unmatched chromosomes" branch) or restrict itself to inner joins and fall back to the binned plan for outer-join shapes.

2. **Coordinate system and interval type.** PR #88 made spatial predicates honor each `Table`'s coordinate system and interval type (half-open vs closed-closed). The example uses `<` and `>`, which is correct only for half-open. The dialect path must thread the same comparison-operator selection through the dynamic-SQL builder or it will silently regress for tables declared as 1-based / closed-closed.

3. **Empty chromosome intersection.** When the inner join over distinct chromosomes returns no rows, `string_agg` returns `NULL` and `EXECUTE getvariable('iejoin_query')` fails at runtime. The dialect needs a fallback that emits an empty result set with the correct output schema.

### Implementation gaps

4. **Arbitrary projections.** The example selects only `start` / `end`. Real queries project arbitrary user columns, including renames and aliases from the original `SELECT`. The dynamic-SQL builder must pass those through, not hardcode a shape.

5. **Chromosome-name escaping.** Direct string concatenation breaks if a contig name contains a single quote. Genomic data is usually clean, but non-human assemblies occasionally produce unusual names. Use `quote_literal()` or DuckDB parameter substitution.

6. **Many-contig genomes.** Wheat, maize, and draft assemblies can have thousands of scaffolds. A `UNION ALL` with several thousand branches may hit parser/planner cliffs or simply lose to binning on plan time alone. Worth benchmarking before committing, and documenting an upper limit (or an automatic fallback above a threshold).

### Smaller refinements

7. **Soften the "~2× faster" claim.** The synthetic N=1M benchmark showed 1.56× and the real-data BigWig × cCRE benchmark showed 2×. State as "1.5–2× depending on workload."

8. **Don't promise `IE_JOIN` specifically.** The `O(n log n + k)` claim applies to IEJoin; DuckDB may pick `PIECEWISE_MERGE_JOIN` for single-inequality cases. Phrase as "DuckDB's range-join family (`IE_JOIN` or `PIECEWISE_MERGE_JOIN`)."

### Benchmark context

Benchmarks added during the review session (currently uncommitted on `88-honor-table-coordinate-system-and-interval-type`):

- `benchmarks/bench_chrom_vs_binned.py` — binned vs raw chrom-equality vs per-chrom IEJoin
- `benchmarks/bench_iejoin_coercion.py` — encoded-BIGINT trick to coerce global IEJoin without UNION ALL
- `benchmarks/bench_pragma_range_joins.py` — `SET prefer_range_joins=true` test
- `benchmarks/bench_1m.py` — head-to-head at N=1M

N=1M synthetic results (8.4M output rows, 24 chromosomes, intervals 200–50 kb):

| Plan | Time (s) | vs binned |
|---|---|---|
| Binned (current) | 0.273 | 1.00× |
| `prefer_range_joins=true` + chrom-equality | 0.554 | 2.03× slower |
| Per-chrom `UNION ALL` (this issue) | 0.175 | 1.56× faster |

Two alternatives investigated and rejected:

- **`SET prefer_range_joins=true`** does coerce `IE_JOIN` globally on a chrom-equality query, but is *slower* than binned because the global IEJoin generates cross-chromosome candidate pairs that get discarded as residuals. Not viable.
- **Encoded BIGINT** (pack `chrom_id * 1e9 + coord` into a single value to eliminate the equality predicate entirely) also coerces global `IE_JOIN`, but matches binned within ~10%. Same blind spot as the pragma — no per-chrom partitioning. Not adopted.

The per-chrom `UNION ALL` wins precisely because it never materializes cross-chromosome candidate pairs. That's also why its advantage *grows* with N (1.07× at 250 k → 1.56× at 1 M).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transpile INTERSECTS joins to per-chromosome IEJoin queries for DuckDB #85

Description

Motivation

Expected outcome

Implementation considerations

Correctness concerns

Implementation gaps

Smaller refinements

Benchmark context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Strategy	vs. binned equi-join
DuckDB binned equi-join (current GIQL)	baseline
DuckDB dynamic IEJoin (SET + EXECUTE)	~2x faster
polars-bio Coitrees	~2.6x faster
polars-bio SuperIntervals	~4.4x faster

Plan	Time (s)	vs binned
Binned (current)	0.273	1.00×
`prefer_range_joins=true` + chrom-equality	0.554	2.03× slower
Per-chrom `UNION ALL` (this issue)	0.175	1.56× faster

Transpile INTERSECTS joins to per-chromosome IEJoin queries for DuckDB #85

Description

Description

Motivation

Expected outcome

Implementation considerations

Correctness concerns

Implementation gaps

Smaller refinements

Benchmark context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions