Skip to content

XorqStatPipeline issues 51 SQL queries per widget construction (should be 1-2) #709

@paddymul

Description

@paddymul

Summary

Constructing XorqBuckarooWidget issues 51 separate SQL queries to the backend for an 8-column dataframe (50k rows), and 87 queries for a 26-column dataframe. This is the dominant cost of widget construction (700–800 ms vs polars's 60–100 ms doing the same work in-memory).

The expectation, based on how polars does it, is that the entire stats pipeline collapses into one complex SQL query (or at most a small constant number).

Repro

.venv/bin/python -W ignore::DeprecationWarning -c "
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import xorq.api as xo
from buckaroo.pluggable_analysis_framework.xorq_stat_pipeline import XorqStatPipeline
from buckaroo.xorq_buckaroo import XorqBuckarooWidget
import sys, traceback
sys.path.insert(0, 'scripts/perf')
from perf_data import make_pandas
from collections import Counter

orig = XorqStatPipeline._execute
sites = []
def counting(self, q):
    stack = traceback.extract_stack()
    site = next((f'{f.filename.split(\"/\")[-1]}:{f.lineno}:{f.name}'
                 for f in reversed(stack)
                 if 'buckaroo' in f.filename and 'xorq_stat_pipeline' not in f.filename), '?')
    sites.append(site)
    return orig(self, q)
XorqStatPipeline._execute = counting

pdf = make_pandas(50_000)
con = xo.connect()
table = con.create_table('t', pdf)
XorqBuckarooWidget(table)
print(f'total queries: {len(sites)}')
for site, n in Counter(sites).most_common():
    print(f'  {n:3}  {site}')
"

Output (datafusion backend, 50k × 8)

total queries: 51
   40  xorq_stats_v2.py:245:_categorical_histogram
    6  xorq_stats_v2.py:213:_numeric_histogram
    4  dataflow.py:421:_get_summary_sd          # the batched-aggregate query
    1  dataflow.py:345:setup_options_from_analysis

Same numbers on duckdb backend.

On 50k × 26 (Boston restaurant inspections head): 87 queries.

Two problems stacked

A. Pipeline runs the full query set 4× per widget construction

dataflow.py:421:_get_summary_sd is called 4 times. Each call re-runs the entire stat pipeline — meaning 4 batched aggregates and ~4× the per-column histogram queries. Likely a traitlet observer chain firing on every transient trait assignment during __init__. Most of these calls are redundant; the final widget_args_tuple is the only one whose output is consumed.

A user constructing the widget once should be paying for one pipeline run, not four. This is similar in spirit to issue #706 (over-eager change-detection on traitlets-held DataFrames) — except here it's doing actual work, not just comparing.

B. Histograms are one query per column instead of one query total

_numeric_histogram and _categorical_histogram (buckaroo/customizations/xorq_stats_v2.py:191 and :238) each issue a per-column query — an expr.group_by(col).aggregate(count).order_by(...).limit(10) for categoricals, a bucketed group-by for numerics. 40 + 6 = 46 of the 51 queries (90%) are histograms.

For categorical histograms specifically, all N column queries can be expressed as one SQL pass using array_agg/struct_agg over per-column top-K subqueries, or one CTE that pivots. For numerics, the bucket aggregation can be unified across columns by using the already-known min/max per column (which the batched aggregate already computes).

The Phase-1 batched aggregate (table.aggregate(...) for length / null_count / min / max / etc.) already proves the pattern works — extend it to cover histograms too.

Why this matters

Comparison to pandas/polars on the same data (no SQL, all in-memory):

widget Boston 883k × 26
BuckarooInfiniteWidget (pandas, post-#706 fix) 715 ms
PolarsBuckarooInfiniteWidget 96 ms
XorqBuckarooInfiniteWidget (datafusion) 732 ms
XorqBuckarooInfiniteWidget (duckdb) 680 ms

Even on an in-process embedded engine like datafusion, per-query overhead × dozens of queries ≈ pandas's worst case (which itself is dominated by an unrelated traitlets bug). On a remote backend (Trino, DuckDB-over-network, etc.) the per-query latency would push this from "slow" into "completely unusable."

Suggested fix shape

In rough priority order:

  1. Stop redundant pipeline runs. Investigate why _get_summary_sd fires 4× per construction — likely some @observe chain. Coalesce into one run, or short-circuit when widget_args_tuple hasn't actually changed shape.
  2. Batch the histograms. Either:
    • Fold all per-column histogram aggregates into the existing Phase-1 table.aggregate(...) call (one SELECT, many …name(f"{col}|histogram_bucket_{i}") columns).
    • Or run all categorical histograms as one UNPIVOT / UNION ALL query, all numeric histograms as another.
  3. Goal: one batched query for stats, plus one per-column-group histogram query (still much better than 46). Long-term: one query for everything, like polars.

Test plan

  • Add a regression test that asserts len(queries) <= N for some small N (probably ≤ 10) on a 26-column input. Pipeline-runs-4× alone should be caught with ≤ 2.
  • Re-run scripts/perf/perf_xorq.py. Boston datafusion target: well under 200 ms (down from 793 ms).
  • Confirm the resulting summary stats are byte-identical to the pre-fix pipeline on a representative test fixture.

References

  • Investigation: docs/research/perf-polars-stats.md (xorq section)
  • Bench: scripts/perf/perf_xorq.py (on branch perf/polars-stats-instrumentation)
  • Stat pipeline: buckaroo/pluggable_analysis_framework/xorq_stat_pipeline.py:168 (process_table)
  • Histograms: buckaroo/customizations/xorq_stats_v2.py:191,238

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions