perf: view-based compute, schema derivation, and ADBC fast-path decoders by hugobarauna · Pull Request #44 · elixir-dux/dux

hugobarauna · 2026-03-27T16:32:22Z

Context

This PR is the result of an AI-driven performance optimization experiment using Claude Code. The code was not human-reviewed for correctness beyond passing the existing test suite (411 tests, 0 failures). The intent is to show potential optimization approaches — feel free to close this PR if the direction doesn't align with the project's goals, or cherry-pick individual ideas.

Approach: autoresearch loop

I set up an autonomous optimization loop where Claude Code would:

Pick the highest-impact untried idea from a ranked list
Implement the change in lib/ only
Run a gate script (compile with warnings-as-errors + 411 tests + benchmark best-of-3)
If improved → commit; if regressed → revert and log as dead end
Repeat

The full context (objective, ideas, dead ends, progress log) is in auto/autoresearch.md and auto/autoresearch.ideas.md.

Results

Pipeline definition (`compute/1`) — 1M rows

Operation	Before	After
filter	13.5x slower than Explorer	12x faster
mutate	12.4x slower	8x faster
summarise	1.2x slower	12x faster

End-to-end with data read (`to_rows/1`) — 1M rows

Operation	Before	After
filter	~37x slower	~1.0x (parity)
mutate	~62x slower	~0.95x (faster)
summarise	~1.6x slower	~1.4x slower

The remaining summarise gap is DuckDB executing GROUP BY through a view — real compute, not overhead.

Memory (BEAM-side, per operation)

Operation	Before	After	Explorer
filter	23.9 MB	2.5 KB	4.8 KB
mutate	28.4 MB	2.6 KB	4.1 KB
summarise	43.4 KB	4.1 KB	4.7 KB

Dux now uses less memory than Explorer on every operation.

Key changes

1. Views instead of temp tables in `compute/1` (the big win)

compute/1 previously ran CREATE TEMP TABLE AS (SELECT ...) which materialized all rows (~50ms for 500k rows). Now it creates CREATE TEMP VIEW which is near-instant (~0.5ms) since no data is copied.

Uses adbc_execute_on_gc_new with DROP VIEW IF EXISTS for proper GC cleanup
TableRef gains a deps field to keep source tables alive while views reference them
Falls back to CTAS for PIVOT ops (DuckDB doesn't support data-driven PIVOT in views) and when view chain depth exceeds 3 (prevents unbounded memory in iterative algorithms)
No-ops shortcut: compute() on an already-materialized table with no pending ops returns immediately

2. Schema derivation (skip DESCRIBE)

Instead of calling DESCRIBE on every compute result, derive the output schema from the source schema and ops:

Preserved: filter, sort, head, slice, distinct, drop_nil, group_by, ungroup
Derived: mutate (append column names), select/discard (subset), rename (remap), summarise (groups + agg names)

3. SQL flattening

Single-op pipelines emit flat SQL without CTE wrapping
group_by + summarise pattern emits a single SELECT ... GROUP BY instead of two CTEs
Table sources referenced directly ("table_name") instead of (SELECT * FROM "table_name") __src

4. Batch normalize_value

Per-column type check instead of per-value: skip normalize_value entirely for columns without Decimals. Saves ~200ms for 400k rows × 5 columns.

5. Batch-by-batch row building in `to_rows`

Process each Arrow record batch independently (~800 rows) instead of first concatenating 489 batches into huge column lists then transposing. More cache-friendly, avoids O(n) list concatenation.

ADBC dependency

These changes use the new APIs from upstream livebook-dev/adbc (main branch):

adbc_execute_on_gc_new/2 — custom SQL on GC (for DROP VIEW IF EXISTS)
Adbc.Connection.execute/3 — command dispatch for DDL
Fast-path decoders and single-batch to_map (merged upstream)

Currently pointing to {:adbc, github: "livebook-dev/adbc"} (main). Once a new ADBC version is released to hex, this can switch to a version pin.

Benchmark code

The benchmarks used during development:

Compute benchmark (auto/bench_quick.exs) — measures pipeline definition speed

Generates N rows (default 1M), runs filter/mutate/summarise on both Dux and Explorer, reports ratio (Dux time / Explorer time). Includes stale view leak detection.

mix run auto/bench_quick.exs           # 1M rows
BENCH_ROWS=10000000 mix run auto/bench_quick.exs  # 10M rows

E2E benchmark (auto/bench_e2e_check.exs) — measures compute + data read

Same operations but calls to_rows() / to_columns() to actually materialize results back to Elixir.

mix run auto/bench_e2e_check.exs

Benchee benchmark (bench/dux_v_explorer.exs) — full statistical analysis with memory

Uses Benchee for statistically rigorous comparison including memory profiling. This is the benchmark from the dux_v_explorer comparison.

mix run bench/dux_v_explorer.exs

Full commit log

Each experiment is a separate commit with a perf: prefix describing the change and its measured impact.

🤖 Generated with Claude Code

Reduce per-compute overhead by removing 2 of 4 DuckDB round-trips: 1. Merge duplicate DESCRIBE calls: table_names + table_dtypes each ran DESCRIBE separately. New table_schema/2 does a single DESCRIBE call returning both names and dtypes. 2. Extract row count from CTAS result: The CREATE TEMPORARY TABLE AS statement already returns the inserted row count in a "Count" column. New query_with_count/2 captures this, eliminating the separate SELECT count(*) query that was only needed for telemetry metadata. Net effect: compute/1 now does 2 DuckDB queries (CTAS + DESCRIBE) instead of 4 (CTAS + DESCRIBE + DESCRIBE + COUNT). Benchmark (1M rows, best of 3): combined_ratio 7.02 -> 6.49 (~7.5%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…reserving ops Two optimizations: 1. QueryBuilder: emit flat SQL without CTE wrapping for single-op pipelines, and use table name directly instead of (SELECT * FROM "table") __src subquery. 2. compute/1: skip DESCRIBE round-trip when all ops preserve the source schema (filter, sort, head, slice, distinct, drop_nil, group_by, ungroup). combined_ratio: 5.71 → 5.25 (~8% improvement) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three changes that eliminate CTAS materialization overhead: 1. compute/1 now creates TEMP VIEWs instead of TEMP TABLEs for non-PIVOT pipelines where the source is a materialized table with no chained deps. Views are near-instant (~0.7ms vs ~50ms CTAS) since no data is copied. 2. No-ops shortcut: compute() on an already-materialized table with no pending ops returns immediately without creating a redundant view/table. 3. TableRef gains a `deps` field to keep source tables alive while views reference them, preventing use-after-GC. ADBC connection handler updated to DROP VIEW for __dux_v_ prefixed names. combined_ratio: 5.71 → 0.16 (97% reduction — Dux is now faster than Explorer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rd/rename Instead of always calling DESCRIBE on the result view/table, derive the output schema from the source schema and the ops list when possible. Handles: filter, sort, head, slice, distinct, drop_nil, group_by, ungroup (preserved), mutate (append columns), select (subset), discard (remove), rename (remap names). Eliminates ~0.8ms DESCRIBE round-trip for mutate, bringing it from 0.40x to 0.15x (Explorer parity). combined_ratio: 0.16 → 0.12 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extend derive_schema to track group_by state and use it to derive summarise output schema (group columns + aggregate column names). Eliminates DESCRIBE round-trip for group_by+summarise pipelines. summarise_ratio: 0.14 → 0.06, combined_ratio: 0.12 → 0.10 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously, any view with deps immediately fell back to CTAS on the next compute. Now views can chain up to 3 levels deep, which benefits iterative patterns (e.g., PageRank iterations) by keeping intermediate results as views instead of materializing them. Falls back to CTAS at depth >= 3 to bound memory growth. No benchmark impact (benchmark uses single-level views), but improves real-world iterative workloads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Detect the common group_by+summarise two-op pattern and emit a single flat SELECT with GROUP BY instead of wrapping in CTEs. The group_by op generates a no-op CTE (SELECT * FROM prev) that this eliminates. summarise_ratio: 0.10 → 0.05, combined_ratio: 0.11 → 0.09 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add Adbc.Connection.execute/2 that uses command dispatch instead of stream dispatch for DDL statements that return no data. Saves the Arrow stream setup, next() call, and unlock cast overhead. Backend.query_view now uses execute instead of query. combined_ratio: 0.09 → 0.08 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

from_list for large lists (>500 rows) called table_names and table_dtypes separately, each triggering DESCRIBE. Replaced with single table_schema/2. No benchmark impact (from_list is setup, not measured), but faster data loading. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace separate table_names + table_dtypes calls with single table_schema/2 in graph.ex (BFS, connected components) and merger.ex. Halves the metadata query count in these paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of calling normalize_value on every value (2M calls for 400k rows × 5 columns), check the first non-nil value per column. Only apply normalization to columns containing Decimals (aggregation results). Skips ~200ms of function dispatch overhead for typical data (integers, floats, strings, dates). to_rows e2e: 1.52x → 1.13x slower than Explorer to_columns: now at parity with Explorer (0.98x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…catenation) Instead of to_map (which flat_maps 489 batches into huge column lists then transposes to rows), process each batch independently: materialize columns → to_list → build row maps → flat_map results. Small batches (~800 rows) transpose faster (cache-friendly) and avoid the O(n) column list concatenation across 489 batches. e2e_combined: 1.39x → 1.04x (mutate_rows now faster than Explorer at 0.95x) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…names Process each Arrow batch independently (build row maps per batch, then flat_map) instead of first building huge column lists then transposing. Small batches (~800 rows) transpose faster due to cache locality. Also pre-compute col_names once (same across all batches) and use :maps.from_list/:lists.zip for map construction. filter to_rows: 1.26x → ~1.02x (at parity with Explorer) mutate to_rows: 1.25x → ~1.01x (at parity with Explorer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Switch from local path dep to GitHub fork so the repo works standalone. The fork includes fast-path decoders and view GC cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nges - Extract derive_op_step/2 to fix credo max-depth violation in derive_schema - Update mix.lock to ADBC fork commit that includes execute/2 and GC handler Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These files document the optimization approach and provide reproducible benchmarks for the performance changes in this branch. - auto/autoresearch.md — objective, progress log, architecture context - auto/autoresearch.ideas.md — ideas tried, dead ends, learnings - auto/autoresearch_e2e.md — end-to-end optimization context - auto/bench_quick.exs — compute benchmark (pipeline definition speed) - auto/bench_e2e_check.exs — e2e benchmark (compute + data read) - auto/autoresearch.sh — gate script (compile + test + benchmark) - auto/PROMPT.md — loop instructions - bench/dux_v_explorer.exs — Benchee benchmark (from akoutmos comparison) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

josevalim · 2026-03-27T16:41:44Z

+        raise ArgumentError, "DuckDB query failed: #{Exception.message(err)}"
+    end
+
+    gc_ref = Adbc.Nif.adbc_delete_on_gc_new(conn, name)


This will not delete views, so on main I converted this into a more generic function called adbc_execute_on_gc_new where you pass a statement to run.

Jose adapted our ADBC changes into upstream (livebook-dev/adbc): - adbc_execute_on_gc_new replaces adbc_delete_on_gc_new (takes full SQL) - execute/3 added for DDL dispatch - to_ipc_stream now requires StreamResult (use query_pointer) - Fast-path decoders + single-batch to_map merged upstream Updated Backend to use new APIs: - query_view uses adbc_execute_on_gc_new with DROP VIEW IF EXISTS - query/query_with_count use adbc_execute_on_gc_new with DROP TABLE IF EXISTS - table_to_ipc uses query_pointer + StreamResult.to_ipc_stream Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Infer column type from the first row value and use typed constructors (s64, f64, string, date32, boolean) instead of Column.new which scans all values for type inference. from_list at 1M rows: 1.06x → 1.02x (at parity with Explorer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This reverts commit b25cbd2.

cigrainger · 2026-03-29T22:02:15Z

Thanks for this @hugobarauna! The view path is one I explored but gave up on early because of GC but this is such a good reminder of why we shouldn't let path direction keep us from driving towards the right thing! I'll dig through the results against where we are now (e.g. that 37x number was already closed to 3x) and what is going to give us wins in a maintainable way. This is really great.

cigrainger · 2026-03-30T00:15:19Z

Closing in favour of #46 -- please open another PR if I've missed anything important.

hugobarauna and others added 16 commits March 26, 2026 22:47

chore: point ADBC dep to hugobarauna/adbc fork (dux-perf branch)

e9610ad

Switch from local path dep to GitHub fork so the repo works standalone. The fork includes fast-path decoders and view GC cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

josevalim reviewed Mar 27, 2026

View reviewed changes

hugobarauna mentioned this pull request Mar 27, 2026

perf: fast-path column decoders for non-null data + execute/2 for DDL livebook-dev/adbc#154

Closed

Hugo and others added 3 commits March 27, 2026 14:32

Revert "perf: use typed Adbc.Column constructors in rows_to_columns"

ecd827f

This reverts commit b25cbd2.

cigrainger mentioned this pull request Mar 29, 2026

perf: view-based compute with schema derivation #46

Merged

5 tasks

cigrainger closed this Mar 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: view-based compute, schema derivation, and ADBC fast-path decoders#44

perf: view-based compute, schema derivation, and ADBC fast-path decoders#44
hugobarauna wants to merge 19 commits into
elixir-dux:mainfrom
hugobarauna:main

hugobarauna commented Mar 27, 2026 •

edited

Loading

Uh oh!

josevalim Mar 27, 2026

Uh oh!

cigrainger commented Mar 29, 2026

Uh oh!

cigrainger commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hugobarauna commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Approach: autoresearch loop

Results

Pipeline definition (compute/1) — 1M rows

End-to-end with data read (to_rows/1) — 1M rows

Memory (BEAM-side, per operation)

Key changes

1. Views instead of temp tables in compute/1 (the big win)

2. Schema derivation (skip DESCRIBE)

3. SQL flattening

4. Batch normalize_value

5. Batch-by-batch row building in to_rows

ADBC dependency

Benchmark code

Full commit log

Uh oh!

josevalim Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

cigrainger commented Mar 29, 2026

Uh oh!

cigrainger commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hugobarauna commented Mar 27, 2026 •

edited

Loading

Pipeline definition (`compute/1`) — 1M rows

End-to-end with data read (`to_rows/1`) — 1M rows

1. Views instead of temp tables in `compute/1` (the big win)

5. Batch-by-batch row building in `to_rows`