perf: view-based compute, schema derivation, and ADBC fast-path decoders #44
Closed
hugobarauna wants to merge 19 commits into
Conversation
Reduce per-compute overhead by removing 2 of 4 DuckDB round-trips:

1. Merge duplicate DESCRIBE calls: table_names + table_dtypes each ran DESCRIBE separately. The new table_schema/2 does a single DESCRIBE call returning both names and dtypes.
2. Extract the row count from the CTAS result: the CREATE TEMPORARY TABLE AS statement already returns the inserted row count in a "Count" column. The new query_with_count/2 captures this, eliminating the separate SELECT count(*) query that was only needed for telemetry metadata.

Net effect: compute/1 now does 2 DuckDB queries (CTAS + DESCRIBE) instead of 4 (CTAS + DESCRIBE + DESCRIBE + COUNT).

Benchmark (1M rows, best of 3): combined_ratio 7.02 -> 6.49 (~7.5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
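The merged round-trip above can be sketched as a single pass over DuckDB's DESCRIBE output, which emits one row per column with `column_name` and `column_type` fields. The module and function below are illustrative, not Dux's actual internals:

```elixir
# Illustrative sketch: derive both names and dtypes from one DESCRIBE result,
# so callers of table_names/table_dtypes share a single round-trip.
# The row shape mirrors DuckDB's DESCRIBE output (column_name, column_type).
defmodule SchemaSketch do
  @doc "Returns {names, dtypes} from DESCRIBE rows in one pass."
  def table_schema(describe_rows) do
    names = Enum.map(describe_rows, & &1["column_name"])
    dtypes = Map.new(describe_rows, fn row -> {row["column_name"], row["column_type"]} end)
    {names, dtypes}
  end
end
```

Both consumers then destructure the tuple instead of issuing their own query.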
…reserving ops

Two optimizations:

1. QueryBuilder: emit flat SQL without CTE wrapping for single-op pipelines, and use the table name directly instead of a (SELECT * FROM "table") __src subquery.
2. compute/1: skip the DESCRIBE round-trip when all ops preserve the source schema (filter, sort, head, slice, distinct, drop_nil, group_by, ungroup).

combined_ratio: 5.71 → 5.25 (~8% improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes that eliminate CTAS materialization overhead:

1. compute/1 now creates TEMP VIEWs instead of TEMP TABLEs for non-PIVOT pipelines where the source is a materialized table with no chained deps. Views are near-instant (~0.7ms vs ~50ms CTAS) since no data is copied.
2. No-ops shortcut: compute() on an already-materialized table with no pending ops returns immediately without creating a redundant view/table.
3. TableRef gains a `deps` field to keep source tables alive while views reference them, preventing use-after-GC.

The ADBC connection handler is updated to DROP VIEW for __dux_v_ prefixed names.

combined_ratio: 5.71 → 0.16 (97% reduction — Dux is now faster than Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
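The view-vs-CTAS decision above can be sketched as pure DDL-string construction. The option names `materialized_source?` and `pivot?` are assumptions for illustration, not Dux's actual flags:

```elixir
# Sketch of the view-vs-CTAS choice: non-PIVOT pipelines over an
# already-materialized source become near-instant TEMP VIEWs; everything
# else still materializes via CTAS (which copies the rows).
defmodule ComputeSketch do
  def ddl(name, select_sql, opts) do
    if opts[:materialized_source?] && !opts[:pivot?] do
      # No rows are copied: the view merely wraps the SELECT.
      {:view, ~s(CREATE TEMP VIEW "#{name}" AS #{select_sql})}
    else
      # Fallback: materialize (in DuckDB this also reports the row count).
      {:table, ~s(CREATE TEMPORARY TABLE "#{name}" AS #{select_sql})}
    end
  end
end
```

The `deps` bookkeeping is what makes the view branch safe: the view is only valid while its source table exists, so the source's ref must stay reachable.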
…rd/rename

Instead of always calling DESCRIBE on the result view/table, derive the output schema from the source schema and the ops list when possible. Handles: filter, sort, head, slice, distinct, drop_nil, group_by, ungroup (preserved); mutate (append columns); select (subset); discard (remove); rename (remap names).

Eliminates the ~0.8ms DESCRIBE round-trip for mutate, bringing it from 0.40x to 0.15x (Explorer parity).

combined_ratio: 0.16 → 0.12

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
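The derivation above can be sketched as a fold over the ops list. The op tuples, the schema-as-keyword-list shape, and the `:unknown` fallback (signalling "use DESCRIBE") are illustrative assumptions, not Dux's actual data structures:

```elixir
# Sketch of schema derivation: fold ops over the source schema; return
# :unknown for any op whose output schema cannot be derived, so the caller
# can fall back to a DESCRIBE round-trip.
defmodule DeriveSchema do
  @preserving [:filter, :sort, :head, :slice, :distinct, :drop_nil, :group_by, :ungroup]

  def derive(schema, ops), do: Enum.reduce_while(ops, schema, &step/2)

  defp step({op, _}, schema) when op in @preserving, do: {:cont, schema}
  defp step({:mutate, cols}, schema), do: {:cont, schema ++ cols}
  defp step({:select, keep}, schema), do: {:cont, Enum.filter(schema, fn {n, _} -> n in keep end)}
  defp step({:discard, drop}, schema), do: {:cont, Enum.reject(schema, fn {n, _} -> n in drop end)}
  defp step({:rename, map}, schema), do: {:cont, Enum.map(schema, fn {n, t} -> {map[n] || n, t} end)}
  defp step(_op, _schema), do: {:halt, :unknown}
end
```

Returning a sentinel instead of raising keeps the fast path optional: unsupported ops simply degrade to the old DESCRIBE behavior.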
Extend derive_schema to track group_by state and use it to derive the summarise output schema (group columns + aggregate column names). Eliminates the DESCRIBE round-trip for group_by+summarise pipelines.

summarise_ratio: 0.14 → 0.06, combined_ratio: 0.12 → 0.10

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, any view with deps immediately fell back to CTAS on the next compute. Now views can chain up to 3 levels deep, which benefits iterative patterns (e.g., PageRank iterations) by keeping intermediate results as views instead of materializing them. Falls back to CTAS at depth >= 3 to bound memory growth.

No benchmark impact (the benchmark uses single-level views), but it improves real-world iterative workloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect the common group_by+summarise two-op pattern and emit a single flat SELECT with GROUP BY instead of wrapping in CTEs. The group_by op generates a no-op CTE (SELECT * FROM prev) that this eliminates.

summarise_ratio: 0.10 → 0.05, combined_ratio: 0.11 → 0.09

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
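The two-op fast path above can be sketched as follows; the op tuples and the `{name, sql_expr}` aggregate shape are simplified assumptions:

```elixir
# Sketch: when the pipeline is exactly [group_by, summarise], emit one flat
# SELECT ... GROUP BY instead of wrapping each op in a CTE.
defmodule FlatSql do
  def to_sql(table, [{:group_by, groups}, {:summarise, aggs}]) do
    group_cols = Enum.map_join(groups, ", ", &quote_ident/1)
    agg_cols = Enum.map_join(aggs, ", ", fn {name, expr} -> "#{expr} AS #{quote_ident(name)}" end)
    "SELECT #{group_cols}, #{agg_cols} FROM #{quote_ident(table)} GROUP BY #{group_cols}"
  end

  defp quote_ident(name), do: ~s("#{name}")
end
```

A real builder would keep the general CTE path as the fallback clause; only this exact two-op shape takes the flat route.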
Add Adbc.Connection.execute/2, which uses command dispatch instead of stream dispatch for DDL statements that return no data. Saves the Arrow stream setup, the next() call, and the unlock cast overhead. Backend.query_view now uses execute instead of query.

combined_ratio: 0.09 → 0.08

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
from_list for large lists (>500 rows) called table_names and table_dtypes separately, each triggering DESCRIBE. Replaced with a single table_schema/2. No benchmark impact (from_list is setup, not measured), but faster data loading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace separate table_names + table_dtypes calls with single table_schema/2 in graph.ex (BFS, connected components) and merger.ex. Halves the metadata query count in these paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of calling normalize_value on every value (2M calls for 400k rows × 5 columns), check the first non-nil value per column. Only apply normalization to columns containing Decimals (aggregation results). Skips ~200ms of function dispatch overhead for typical data (integers, floats, strings, dates).

to_rows e2e: 1.52x → 1.13x slower than Explorer
to_columns: now at parity with Explorer (0.98x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
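The per-column check above can be sketched like this; `Dec` is a local stand-in for the decimal package's struct so the example is self-contained:

```elixir
# Stand-in for the decimal package's struct (assumption, for a runnable sketch).
defmodule Dec do
  defstruct [:value]
end

defmodule NormalizeSketch do
  # Inspect only the first non-nil value: normalize the whole column only if
  # it holds decimals (aggregation results); otherwise return it untouched,
  # skipping one function call per value.
  def normalize_column(values) do
    case Enum.find(values, &(&1 != nil)) do
      %Dec{} -> Enum.map(values, &normalize_value/1)
      _ -> values
    end
  end

  defp normalize_value(%Dec{value: v}), do: v * 1.0
  defp normalize_value(other), do: other
end
```

The sampling assumption is that a column is type-homogeneous, which holds for columns decoded from an Arrow result set.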
…catenation)

Instead of to_map (which flat_maps 489 batches into huge column lists and then transposes to rows), process each batch independently: materialize columns → to_list → build row maps → flat_map results. Small batches (~800 rows) transpose faster (cache-friendly) and avoid the O(n) column list concatenation across 489 batches.

e2e_combined: 1.39x → 1.04x (mutate_rows now faster than Explorer at 0.95x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…names

Process each Arrow batch independently (build row maps per batch, then flat_map) instead of first building huge column lists and then transposing. Small batches (~800 rows) transpose faster due to cache locality. Also pre-compute col_names once (same across all batches) and use :maps.from_list/:lists.zip for map construction.

filter to_rows: 1.26x → ~1.02x (at parity with Explorer)
mutate to_rows: 1.25x → ~1.01x (at parity with Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
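The per-batch transpose above can be sketched as follows; the batch/column shapes are simplified (real batches come from Arrow decoding), while the :maps.from_list/:lists.zip construction is the one the commit names:

```elixir
# Sketch of batch-wise row building: zip each batch's columns into row maps
# independently, then flat_map — no cross-batch column concatenation.
# `batches` is a list of column-lists; `col_names` is computed once up front.
defmodule RowsSketch do
  def to_rows(col_names, batches) do
    Enum.flat_map(batches, fn columns ->
      columns
      |> Enum.zip()
      |> Enum.map(fn row -> :maps.from_list(:lists.zip(col_names, Tuple.to_list(row))) end)
    end)
  end
end
```

Because each batch is small, the zipped tuples and intermediate lists stay hot in cache, and no list as long as the full result is ever built column-wise.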
Switch from local path dep to GitHub fork so the repo works standalone. The fork includes fast-path decoders and view GC cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nges

- Extract derive_op_step/2 to fix a credo max-depth violation in derive_schema
- Update mix.lock to the ADBC fork commit that includes execute/2 and the GC handler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These files document the optimization approach and provide reproducible benchmarks for the performance changes in this branch.

- auto/autoresearch.md — objective, progress log, architecture context
- auto/autoresearch.ideas.md — ideas tried, dead ends, learnings
- auto/autoresearch_e2e.md — end-to-end optimization context
- auto/bench_quick.exs — compute benchmark (pipeline definition speed)
- auto/bench_e2e_check.exs — e2e benchmark (compute + data read)
- auto/autoresearch.sh — gate script (compile + test + benchmark)
- auto/PROMPT.md — loop instructions
- bench/dux_v_explorer.exs — Benchee benchmark (from akoutmos comparison)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
josevalim reviewed Mar 27, 2026
raise ArgumentError, "DuckDB query failed: #{Exception.message(err)}"
end

gc_ref = Adbc.Nif.adbc_delete_on_gc_new(conn, name)
This will not delete views, so on main I converted this into a more generic function called adbc_execute_on_gc_new, where you pass a statement to run.
Jose adapted our ADBC changes into upstream (livebook-dev/adbc):

- adbc_execute_on_gc_new replaces adbc_delete_on_gc_new (takes full SQL)
- execute/3 added for DDL dispatch
- to_ipc_stream now requires StreamResult (use query_pointer)
- Fast-path decoders + single-batch to_map merged upstream

Updated Backend to use the new APIs:

- query_view uses adbc_execute_on_gc_new with DROP VIEW IF EXISTS
- query/query_with_count use adbc_execute_on_gc_new with DROP TABLE IF EXISTS
- table_to_ipc uses query_pointer + StreamResult.to_ipc_stream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Infer the column type from the first row value and use typed constructors (s64, f64, string, date32, boolean) instead of Column.new, which scans all values for type inference.

from_list at 1M rows: 1.06x → 1.02x (at parity with Explorer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
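The first-value inference can be sketched like this; the dtype atoms come from the commit message, while the dispatch itself is an illustrative assumption:

```elixir
# Sketch: pick a dtype from the first non-nil value instead of scanning the
# whole list; fall back to full inference when the value is unrecognized.
defmodule InferSketch do
  def dtype_of(values) do
    case Enum.find(values, &(&1 != nil)) do
      v when is_integer(v) -> :s64
      v when is_float(v) -> :f64
      v when is_boolean(v) -> :boolean
      v when is_binary(v) -> :string
      %Date{} -> :date32
      # Unknown type or an all-nil column: fall back to full-scan inference.
      _ -> :infer
    end
  end
end
```

The trade-off is that a mixed-type list would be mis-typed by its first value; that is acceptable here because from_list rows are expected to be column-homogeneous.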
This reverts commit b25cbd2.
Contributor
Thanks for this @hugobarauna! The view path is one I explored but gave up on early because of GC, but this is such a good reminder of why we shouldn't let past direction keep us from driving towards the right thing! I'll dig through the results against where we are now (e.g. that 37x number was already close to 3x) and what is going to give us wins in a maintainable way. This is really great.
Contributor
Closing in favour of #46 -- please open another PR if I've missed anything important.
Context
This PR is the result of an AI-driven performance optimization experiment using Claude Code. The code was not human-reviewed for correctness beyond passing the existing test suite (411 tests, 0 failures). The intent is to show potential optimization approaches — feel free to close this PR if the direction doesn't align with the project's goals, or cherry-pick individual ideas.
Approach: autoresearch loop
I set up an autonomous optimization loop where Claude Code would propose a change (touching lib/ only), run the gate script (compile + tests + benchmark), and keep or revert it based on the measured ratio.

The full context (objective, ideas, dead ends, progress log) is in auto/autoresearch.md and auto/autoresearch.ideas.md.

Results
Pipeline definition (compute/1) — 1M rows

End-to-end with data read (to_rows/1) — 1M rows

The remaining summarise gap is DuckDB executing GROUP BY through a view — real compute, not overhead.
Memory (BEAM-side, per operation)
Dux now uses less memory than Explorer on every operation.
Key changes
1. Views instead of temp tables in compute/1 (the big win)

compute/1 previously ran CREATE TEMP TABLE AS (SELECT ...), which materialized all rows (~50ms for 500k rows). Now it creates a CREATE TEMP VIEW, which is near-instant (~0.5ms) since no data is copied.

- adbc_execute_on_gc_new with DROP VIEW IF EXISTS for proper GC cleanup
- TableRef gains a deps field to keep source tables alive while views reference them
- compute() on an already-materialized table with no pending ops returns immediately

2. Schema derivation (skip DESCRIBE)
Instead of calling DESCRIBE on every compute result, derive the output schema from the source schema and the ops: filter/sort/head/slice/distinct/drop_nil/group_by/ungroup preserve it, mutate appends columns, select subsets, discard removes, rename remaps, and group_by + summarise yields the group columns plus the aggregate names.

3. SQL flattening
- The group_by + summarise pattern emits a single SELECT ... GROUP BY instead of two CTEs
- Single-op pipelines reference the table directly ("table_name") instead of (SELECT * FROM "table_name") __src

4. Batch normalize_value
Per-column type check instead of per-value: skip normalize_value entirely for columns without Decimals. Saves ~200ms for 400k rows × 5 columns.

5. Batch-by-batch row building in to_rows
Process each Arrow record batch independently (~800 rows) instead of first concatenating 489 batches into huge column lists and then transposing. More cache-friendly, and it avoids the O(n) list concatenation.
ADBC dependency
These changes use the new APIs from upstream
livebook-dev/adbc (main branch):

- adbc_execute_on_gc_new/2 — custom SQL on GC (for DROP VIEW IF EXISTS)
- Adbc.Connection.execute/3 — command dispatch for DDL
- Fast-path decoders and single-batch to_map (merged upstream)

Currently pointing to
{:adbc, github: "livebook-dev/adbc"} (main). Once a new ADBC version is released to hex, this can switch to a version pin.

Benchmark code
The benchmarks used during development:
Compute benchmark (auto/bench_quick.exs) — measures pipeline definition speed
Generates N rows (default 1M), runs filter/mutate/summarise on both Dux and Explorer, reports ratio (Dux time / Explorer time). Includes stale view leak detection.
E2E benchmark (auto/bench_e2e_check.exs) — measures compute + data read
Same operations but calls
to_rows()/to_columns()to actually materialize results back to Elixir.Benchee benchmark (bench/dux_v_explorer.exs) — full statistical analysis with memory
Uses Benchee for statistically rigorous comparison including memory profiling. This is the benchmark from the dux_v_explorer comparison.
Full commit log
Each experiment is a separate commit with a
perf:prefix describing the change and its measured impact.🤖 Generated with Claude Code