Summary
Constructing XorqBuckarooWidget issues 51 separate SQL queries to the backend for an 8-column dataframe (50k rows), and 87 queries for a 26-column dataframe. This is the dominant cost of widget construction (700–800 ms vs polars's 60–100 ms doing the same work in-memory).
The expectation, based on how polars does it, is that the entire stats pipeline collapses into one complex SQL query (or at most a small constant number).
Repro
.venv/bin/python -W ignore::DeprecationWarning -c "
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import xorq.api as xo
from buckaroo.pluggable_analysis_framework.xorq_stat_pipeline import XorqStatPipeline
from buckaroo.xorq_buckaroo import XorqBuckarooWidget
import sys, traceback
sys.path.insert(0, 'scripts/perf')
from perf_data import make_pandas
from collections import Counter
orig = XorqStatPipeline._execute
sites = []
def counting(self, q):
stack = traceback.extract_stack()
site = next((f'{f.filename.split(\"/\")[-1]}:{f.lineno}:{f.name}'
for f in reversed(stack)
if 'buckaroo' in f.filename and 'xorq_stat_pipeline' not in f.filename), '?')
sites.append(site)
return orig(self, q)
XorqStatPipeline._execute = counting
pdf = make_pandas(50_000)
con = xo.connect()
table = con.create_table('t', pdf)
XorqBuckarooWidget(table)
print(f'total queries: {len(sites)}')
for site, n in Counter(sites).most_common():
print(f' {n:3} {site}')
"
Output (datafusion backend, 50k × 8)
total queries: 51
40 xorq_stats_v2.py:245:_categorical_histogram
6 xorq_stats_v2.py:213:_numeric_histogram
4 dataflow.py:421:_get_summary_sd # the batched-aggregate query
1 dataflow.py:345:setup_options_from_analysis
Same numbers on duckdb backend.
On 50k × 26 (Boston restaurant inspections head): 87 queries.
Two problems stacked
A. Pipeline runs the full query set 4× per widget construction
dataflow.py:421:_get_summary_sd is called 4 times. Each call re-runs the entire stat pipeline — meaning 4 batched aggregates and ~4× the per-column histogram queries. Likely a traitlet observer chain firing on every transient trait assignment during __init__. Most of these calls are redundant; the final widget_args_tuple is the only one whose output is consumed.
A user constructing the widget once should be paying for one pipeline run, not four. This is similar in spirit to issue #706 (over-eager change-detection on traitlets-held DataFrames) — except here it's doing actual work, not just comparing.
B. Histograms are one query per column instead of one query total
_numeric_histogram and _categorical_histogram (buckaroo/customizations/xorq_stats_v2.py:191 and :238) each issue a per-column query — an expr.group_by(col).aggregate(count).order_by(...).limit(10) for categoricals, a bucketed group-by for numerics. 40 + 6 = 46 of the 51 queries (90%) are histograms.
For categorical histograms specifically, all N column queries can be expressed as one SQL pass using array_agg/struct_agg over per-column top-K subqueries, or one CTE that pivots. For numerics, the bucket aggregation can be unified across columns by using the already-known min/max per column (which the batched aggregate already computes).
The Phase-1 batched aggregate (table.aggregate(...) for length / null_count / min / max / etc.) already proves the pattern works — extend it to cover histograms too.
Why this matters
Comparison to pandas/polars on the same data (no SQL, all in-memory):
| widget |
Boston 883k × 26 |
BuckarooInfiniteWidget (pandas, post-#706 fix) |
715 ms |
PolarsBuckarooInfiniteWidget |
96 ms |
XorqBuckarooInfiniteWidget (datafusion) |
732 ms |
XorqBuckarooInfiniteWidget (duckdb) |
680 ms |
Even on an in-process embedded engine like datafusion, per-query overhead × dozens of queries ≈ pandas's worst case (which itself is dominated by an unrelated traitlets bug). On a remote backend (Trino, DuckDB-over-network, etc.) the per-query latency would push this from "slow" into "completely unusable."
Suggested fix shape
In rough priority order:
- Stop redundant pipeline runs. Investigate why
_get_summary_sd fires 4× per construction — likely some @observe chain. Coalesce into one run, or short-circuit when widget_args_tuple hasn't actually changed shape.
- Batch the histograms. Either:
- Fold all per-column histogram aggregates into the existing Phase-1
table.aggregate(...) call (one SELECT, many …name(f"{col}|histogram_bucket_{i}") columns).
- Or run all categorical histograms as one
UNPIVOT / UNION ALL query, all numeric histograms as another.
- Goal: one batched query for stats, plus one per-column-group histogram query (still much better than 46). Long-term: one query for everything, like polars.
Test plan
References
- Investigation:
docs/research/perf-polars-stats.md (xorq section)
- Bench:
scripts/perf/perf_xorq.py (on branch perf/polars-stats-instrumentation)
- Stat pipeline:
buckaroo/pluggable_analysis_framework/xorq_stat_pipeline.py:168 (process_table)
- Histograms:
buckaroo/customizations/xorq_stats_v2.py:191,238
Summary
Constructing
XorqBuckarooWidgetissues 51 separate SQL queries to the backend for an 8-column dataframe (50k rows), and 87 queries for a 26-column dataframe. This is the dominant cost of widget construction (700–800 ms vs polars's 60–100 ms doing the same work in-memory).The expectation, based on how polars does it, is that the entire stats pipeline collapses into one complex SQL query (or at most a small constant number).
Repro
Output (datafusion backend, 50k × 8)
Same numbers on duckdb backend.
On 50k × 26 (Boston restaurant inspections head): 87 queries.
Two problems stacked
A. Pipeline runs the full query set 4× per widget construction
dataflow.py:421:_get_summary_sdis called 4 times. Each call re-runs the entire stat pipeline — meaning 4 batched aggregates and ~4× the per-column histogram queries. Likely a traitlet observer chain firing on every transient trait assignment during__init__. Most of these calls are redundant; the finalwidget_args_tupleis the only one whose output is consumed.A user constructing the widget once should be paying for one pipeline run, not four. This is similar in spirit to issue #706 (over-eager change-detection on traitlets-held DataFrames) — except here it's doing actual work, not just comparing.
B. Histograms are one query per column instead of one query total
_numeric_histogramand_categorical_histogram(buckaroo/customizations/xorq_stats_v2.py:191and:238) each issue a per-column query — anexpr.group_by(col).aggregate(count).order_by(...).limit(10)for categoricals, a bucketed group-by for numerics. 40 + 6 = 46 of the 51 queries (90%) are histograms.For categorical histograms specifically, all N column queries can be expressed as one SQL pass using
array_agg/struct_aggover per-column top-K subqueries, or one CTE that pivots. For numerics, the bucket aggregation can be unified across columns by using the already-knownmin/maxper column (which the batched aggregate already computes).The Phase-1 batched aggregate (
table.aggregate(...)for length / null_count / min / max / etc.) already proves the pattern works — extend it to cover histograms too.Why this matters
Comparison to pandas/polars on the same data (no SQL, all in-memory):
BuckarooInfiniteWidget(pandas, post-#706 fix)PolarsBuckarooInfiniteWidgetXorqBuckarooInfiniteWidget(datafusion)XorqBuckarooInfiniteWidget(duckdb)Even on an in-process embedded engine like datafusion, per-query overhead × dozens of queries ≈ pandas's worst case (which itself is dominated by an unrelated traitlets bug). On a remote backend (Trino, DuckDB-over-network, etc.) the per-query latency would push this from "slow" into "completely unusable."
Suggested fix shape
In rough priority order:
_get_summary_sdfires 4× per construction — likely some@observechain. Coalesce into one run, or short-circuit whenwidget_args_tuplehasn't actually changed shape.table.aggregate(...)call (one SELECT, many…name(f"{col}|histogram_bucket_{i}")columns).UNPIVOT/UNION ALLquery, all numeric histograms as another.Test plan
len(queries) <= Nfor some small N (probably ≤ 10) on a 26-column input. Pipeline-runs-4× alone should be caught with≤ 2.scripts/perf/perf_xorq.py. Boston datafusion target: well under 200 ms (down from 793 ms).References
docs/research/perf-polars-stats.md(xorq section)scripts/perf/perf_xorq.py(on branchperf/polars-stats-instrumentation)buckaroo/pluggable_analysis_framework/xorq_stat_pipeline.py:168(process_table)buckaroo/customizations/xorq_stats_v2.py:191,238