Skip to content

perf: view-based compute, schema derivation, and ADBC fast-path decoders#44

Closed
hugobarauna wants to merge 19 commits into
elixir-dux:mainfrom
hugobarauna:main
Closed

perf: view-based compute, schema derivation, and ADBC fast-path decoders#44
hugobarauna wants to merge 19 commits into
elixir-dux:mainfrom
hugobarauna:main

Conversation

@hugobarauna
Copy link
Copy Markdown
Contributor

@hugobarauna hugobarauna commented Mar 27, 2026

Context

This PR is the result of an AI-driven performance optimization experiment using Claude Code. The code was not human-reviewed for correctness beyond passing the existing test suite (411 tests, 0 failures). The intent is to show potential optimization approaches — feel free to close this PR if the direction doesn't align with the project's goals, or cherry-pick individual ideas.

Approach: autoresearch loop

I set up an autonomous optimization loop where Claude Code would:

  1. Pick the highest-impact untried idea from a ranked list
  2. Implement the change in lib/ only
  3. Run a gate script (compile with warnings-as-errors + 411 tests + benchmark best-of-3)
  4. If improved → commit; if regressed → revert and log as dead end
  5. Repeat

The full context (objective, ideas, dead ends, progress log) is in auto/autoresearch.md and auto/autoresearch.ideas.md.

Results

Pipeline definition (compute/1) — 1M rows

Operation Before After
filter 13.5x slower than Explorer 12x faster
mutate 12.4x slower 8x faster
summarise 1.2x slower 12x faster

End-to-end with data read (to_rows/1) — 1M rows

Operation Before After
filter ~37x slower ~1.0x (parity)
mutate ~62x slower ~0.95x (faster)
summarise ~1.6x slower ~1.4x slower

The remaining summarise gap is DuckDB executing GROUP BY through a view — real compute, not overhead.

Memory (BEAM-side, per operation)

Operation Before After Explorer
filter 23.9 MB 2.5 KB 4.8 KB
mutate 28.4 MB 2.6 KB 4.1 KB
summarise 43.4 KB 4.1 KB 4.7 KB

Dux now uses less memory than Explorer on every operation.

Key changes

1. Views instead of temp tables in compute/1 (the big win)

compute/1 previously ran CREATE TEMP TABLE AS (SELECT ...) which materialized all rows (~50ms for 500k rows). Now it creates CREATE TEMP VIEW which is near-instant (~0.5ms) since no data is copied.

  • Uses adbc_execute_on_gc_new with DROP VIEW IF EXISTS for proper GC cleanup
  • TableRef gains a deps field to keep source tables alive while views reference them
  • Falls back to CTAS for PIVOT ops (DuckDB doesn't support data-driven PIVOT in views) and when view chain depth exceeds 3 (prevents unbounded memory in iterative algorithms)
  • No-ops shortcut: compute() on an already-materialized table with no pending ops returns immediately

2. Schema derivation (skip DESCRIBE)

Instead of calling DESCRIBE on every compute result, derive the output schema from the source schema and ops:

  • Preserved: filter, sort, head, slice, distinct, drop_nil, group_by, ungroup
  • Derived: mutate (append column names), select/discard (subset), rename (remap), summarise (groups + agg names)

3. SQL flattening

  • Single-op pipelines emit flat SQL without CTE wrapping
  • group_by + summarise pattern emits a single SELECT ... GROUP BY instead of two CTEs
  • Table sources referenced directly ("table_name") instead of (SELECT * FROM "table_name") __src

4. Batch normalize_value

Per-column type check instead of per-value: skip normalize_value entirely for columns without Decimals. Saves ~200ms for 400k rows × 5 columns.

5. Batch-by-batch row building in to_rows

Process each Arrow record batch independently (~800 rows) instead of first concatenating 489 batches into huge column lists then transposing. More cache-friendly, avoids O(n) list concatenation.

ADBC dependency

These changes use the new APIs from upstream livebook-dev/adbc (main branch):

  • adbc_execute_on_gc_new/2 — custom SQL on GC (for DROP VIEW IF EXISTS)
  • Adbc.Connection.execute/3 — command dispatch for DDL
  • Fast-path decoders and single-batch to_map (merged upstream)

Currently pointing to {:adbc, github: "livebook-dev/adbc"} (main). Once a new ADBC version is released to hex, this can switch to a version pin.

Benchmark code

The benchmarks used during development:

Compute benchmark (auto/bench_quick.exs) — measures pipeline definition speed

Generates N rows (default 1M), runs filter/mutate/summarise on both Dux and Explorer, reports ratio (Dux time / Explorer time). Includes stale view leak detection.

mix run auto/bench_quick.exs           # 1M rows
BENCH_ROWS=10000000 mix run auto/bench_quick.exs  # 10M rows
E2E benchmark (auto/bench_e2e_check.exs) — measures compute + data read

Same operations but calls to_rows() / to_columns() to actually materialize results back to Elixir.

mix run auto/bench_e2e_check.exs
Benchee benchmark (bench/dux_v_explorer.exs) — full statistical analysis with memory

Uses Benchee for statistically rigorous comparison including memory profiling. This is the benchmark from the dux_v_explorer comparison.

mix run bench/dux_v_explorer.exs

Full commit log

Each experiment is a separate commit with a perf: prefix describing the change and its measured impact.


🤖 Generated with Claude Code

hugobarauna and others added 16 commits March 26, 2026 22:47
Reduce per-compute overhead by removing 2 of 4 DuckDB round-trips:

1. Merge duplicate DESCRIBE calls: table_names + table_dtypes each ran
   DESCRIBE separately. New table_schema/2 does a single DESCRIBE call
   returning both names and dtypes.

2. Extract row count from CTAS result: The CREATE TEMPORARY TABLE AS
   statement already returns the inserted row count in a "Count" column.
   New query_with_count/2 captures this, eliminating the separate
   SELECT count(*) query that was only needed for telemetry metadata.

Net effect: compute/1 now does 2 DuckDB queries (CTAS + DESCRIBE)
instead of 4 (CTAS + DESCRIBE + DESCRIBE + COUNT).

Benchmark (1M rows, best of 3): combined_ratio 7.02 -> 6.49 (~7.5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…reserving ops

Two optimizations:
1. QueryBuilder: emit flat SQL without CTE wrapping for single-op pipelines,
   and use table name directly instead of (SELECT * FROM "table") __src subquery.
2. compute/1: skip DESCRIBE round-trip when all ops preserve the source schema
   (filter, sort, head, slice, distinct, drop_nil, group_by, ungroup).

combined_ratio: 5.71 → 5.25 (~8% improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes that eliminate CTAS materialization overhead:

1. compute/1 now creates TEMP VIEWs instead of TEMP TABLEs for non-PIVOT
   pipelines where the source is a materialized table with no chained deps.
   Views are near-instant (~0.7ms vs ~50ms CTAS) since no data is copied.

2. No-ops shortcut: compute() on an already-materialized table with no
   pending ops returns immediately without creating a redundant view/table.

3. TableRef gains a `deps` field to keep source tables alive while views
   reference them, preventing use-after-GC.

ADBC connection handler updated to DROP VIEW for __dux_v_ prefixed names.

combined_ratio: 5.71 → 0.16 (97% reduction — Dux is now faster than Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rd/rename

Instead of always calling DESCRIBE on the result view/table, derive the
output schema from the source schema and the ops list when possible.
Handles: filter, sort, head, slice, distinct, drop_nil, group_by, ungroup
(preserved), mutate (append columns), select (subset), discard (remove),
rename (remap names).

Eliminates ~0.8ms DESCRIBE round-trip for mutate, bringing it from 0.40x
to 0.15x (Explorer parity).

combined_ratio: 0.16 → 0.12

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend derive_schema to track group_by state and use it to derive
summarise output schema (group columns + aggregate column names).
Eliminates DESCRIBE round-trip for group_by+summarise pipelines.

summarise_ratio: 0.14 → 0.06, combined_ratio: 0.12 → 0.10

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, any view with deps immediately fell back to CTAS on the next
compute. Now views can chain up to 3 levels deep, which benefits iterative
patterns (e.g., PageRank iterations) by keeping intermediate results as
views instead of materializing them.

Falls back to CTAS at depth >= 3 to bound memory growth.

No benchmark impact (benchmark uses single-level views), but improves
real-world iterative workloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect the common group_by+summarise two-op pattern and emit a single
flat SELECT with GROUP BY instead of wrapping in CTEs. The group_by op
generates a no-op CTE (SELECT * FROM prev) that this eliminates.

summarise_ratio: 0.10 → 0.05, combined_ratio: 0.11 → 0.09

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Adbc.Connection.execute/2 that uses command dispatch instead of
stream dispatch for DDL statements that return no data. Saves the
Arrow stream setup, next() call, and unlock cast overhead.

Backend.query_view now uses execute instead of query.

combined_ratio: 0.09 → 0.08

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
from_list for large lists (>500 rows) called table_names and table_dtypes
separately, each triggering DESCRIBE. Replaced with single table_schema/2.

No benchmark impact (from_list is setup, not measured), but faster data loading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace separate table_names + table_dtypes calls with single
table_schema/2 in graph.ex (BFS, connected components) and merger.ex.
Halves the metadata query count in these paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of calling normalize_value on every value (2M calls for 400k rows
× 5 columns), check the first non-nil value per column. Only apply
normalization to columns containing Decimals (aggregation results).

Skips ~200ms of function dispatch overhead for typical data (integers,
floats, strings, dates).

to_rows e2e: 1.52x → 1.13x slower than Explorer
to_columns: now at parity with Explorer (0.98x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…catenation)

Instead of to_map (which flat_maps 489 batches into huge column lists
then transposes to rows), process each batch independently: materialize
columns → to_list → build row maps → flat_map results.

Small batches (~800 rows) transpose faster (cache-friendly) and avoid
the O(n) column list concatenation across 489 batches.

e2e_combined: 1.39x → 1.04x (mutate_rows now faster than Explorer at 0.95x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…names

Process each Arrow batch independently (build row maps per batch, then
flat_map) instead of first building huge column lists then transposing.
Small batches (~800 rows) transpose faster due to cache locality.

Also pre-compute col_names once (same across all batches) and use
:maps.from_list/:lists.zip for map construction.

filter to_rows: 1.26x → ~1.02x (at parity with Explorer)
mutate to_rows: 1.25x → ~1.01x (at parity with Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from local path dep to GitHub fork so the repo works standalone.
The fork includes fast-path decoders and view GC cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nges

- Extract derive_op_step/2 to fix credo max-depth violation in derive_schema
- Update mix.lock to ADBC fork commit that includes execute/2 and GC handler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These files document the optimization approach and provide reproducible
benchmarks for the performance changes in this branch.

- auto/autoresearch.md — objective, progress log, architecture context
- auto/autoresearch.ideas.md — ideas tried, dead ends, learnings
- auto/autoresearch_e2e.md — end-to-end optimization context
- auto/bench_quick.exs — compute benchmark (pipeline definition speed)
- auto/bench_e2e_check.exs — e2e benchmark (compute + data read)
- auto/autoresearch.sh — gate script (compile + test + benchmark)
- auto/PROMPT.md — loop instructions
- bench/dux_v_explorer.exs — Benchee benchmark (from akoutmos comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread lib/dux/backend.ex Outdated
raise ArgumentError, "DuckDB query failed: #{Exception.message(err)}"
end

gc_ref = Adbc.Nif.adbc_delete_on_gc_new(conn, name)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not delete views, so on main I converted this into a more generic function called adbc_execute_on_gc_new where you pass a statement to run.

Hugo and others added 3 commits March 27, 2026 14:32
Jose adapted our ADBC changes into upstream (livebook-dev/adbc):
- adbc_execute_on_gc_new replaces adbc_delete_on_gc_new (takes full SQL)
- execute/3 added for DDL dispatch
- to_ipc_stream now requires StreamResult (use query_pointer)
- Fast-path decoders + single-batch to_map merged upstream

Updated Backend to use new APIs:
- query_view uses adbc_execute_on_gc_new with DROP VIEW IF EXISTS
- query/query_with_count use adbc_execute_on_gc_new with DROP TABLE IF EXISTS
- table_to_ipc uses query_pointer + StreamResult.to_ipc_stream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Infer column type from the first row value and use typed constructors
(s64, f64, string, date32, boolean) instead of Column.new which scans
all values for type inference.

from_list at 1M rows: 1.06x → 1.02x (at parity with Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cigrainger
Copy link
Copy Markdown
Contributor

Thanks for this @hugobarauna! The view path is one I explored but gave up on early because of GC but this is such a good reminder of why we shouldn't let path direction keep us from driving towards the right thing! I'll dig through the results against where we are now (e.g. that 37x number was already closed to 3x) and what is going to give us wins in a maintainable way. This is really great.

@cigrainger
Copy link
Copy Markdown
Contributor

Closing in favour of #46 -- please open another PR if I've missed anything important.

@cigrainger cigrainger closed this Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants