feat(paf): process_table_scalars / _aggregates entry points (Phase 2) by paddymul · Pull Request #809 · buckaroo-data/buckaroo

paddymul · 2026-05-21T11:27:54Z

Summary

Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline by StatFunc.cost (added in #808) so the JS-driven progressive-stats router can run cheap stats immediately on state_change and expensive ones after a debounce.

This PR ships the server-side API surface only. The WS handler that calls these new entry points + the JS orchestrator (Phase 3) are separate PRs. No existing caller changes behavior — every existing path defaults to cost_classes=None which runs all costs.

Changes

StatPipeline (pandas):

New cost_classes kwarg on process_column / process_df.
process_df_scalars(df) — runs only cost=scalar stats.
process_df_aggregates(df) — runs full pipeline (aggregates need scalar inputs), filters response to aggregate-cost provides only.
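
The split between the two pandas entry points can be pictured with a toy model. This is only a sketch, not buckaroo's implementation: `ToyStatFunc`, `TOY_STATS`, and the three `process_column*` helpers are hypothetical stand-ins, and the only things taken from the description above are the `cost` field, the `cost_classes` kwarg with its `None` default, and the "run everything, then filter" behavior of the aggregates path.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Set


def _histogram(col: List[float], acc: Dict[str, Any]) -> List[int]:
    # Two-bin histogram; it reads the scalar stats already stored in `acc`.
    mid = (acc["min"] + acc["max"]) / 2
    return [sum(v < mid for v in col), sum(v >= mid for v in col)]


@dataclass(frozen=True)
class ToyStatFunc:
    # Hypothetical stand-in for a stat function carrying a cost class.
    name: str
    cost: str  # "scalar" or "aggregate"
    func: Callable[[List[float], Dict[str, Any]], Any]


TOY_STATS = [
    ToyStatFunc("min", "scalar", lambda col, acc: min(col)),
    ToyStatFunc("max", "scalar", lambda col, acc: max(col)),
    ToyStatFunc("histogram", "aggregate", _histogram),
]


def process_column(col: List[float],
                   cost_classes: Optional[Set[str]] = None) -> Dict[str, Any]:
    # cost_classes=None runs every stat, mirroring the back-compat default.
    acc: Dict[str, Any] = {}
    for sf in TOY_STATS:
        if cost_classes is not None and sf.cost not in cost_classes:
            continue
        acc[sf.name] = sf.func(col, acc)
    return acc


def process_column_scalars(col: List[float]) -> Dict[str, Any]:
    # Cheap path: aggregate-cost stats never execute.
    return process_column(col, cost_classes={"scalar"})


def process_column_aggregates(col: List[float]) -> Dict[str, Any]:
    # Aggregates need scalar inputs, so run everything and filter the output.
    full = process_column(col)
    keep = {sf.name for sf in TOY_STATS if sf.cost == "aggregate"}
    return {k: v for k, v in full.items() if k in keep}
```

In this toy, the scalars path saves work because the histogram function is never called, while the aggregates path saves nothing on compute and only trims the response.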

XorqStatPipeline:

Same shape: cost_classes kwarg on process_table + process_table_scalars + process_table_aggregates convenience methods.
For xorq, the scalars-only path skips the per-column histogram query loop in phase 2 entirely — that's the ~6.5 s of work that dominates boston state_change latency.
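
Why the scalars-only path is cheaper on xorq can be illustrated by counting queries. The sketch below assumes one batched query for scalar stats followed by one histogram query per column; `CountingBackend` and the query descriptions are hypothetical and do not reflect the SQL that xorq actually emits.

```python
from typing import Callable, List, Optional, Set


class CountingBackend:
    # Hypothetical backend that records each query instead of running it.
    def __init__(self) -> None:
        self.queries: List[str] = []

    def execute(self, description: str) -> None:
        self.queries.append(description)


def process_table(backend: CountingBackend, columns: List[str],
                  cost_classes: Optional[Set[str]] = None) -> None:
    def wanted(cost: str) -> bool:
        return cost_classes is None or cost in cost_classes

    if wanted("scalar"):
        # Phase 1 (assumed): a single batched query covers every column.
        backend.execute("batched scalar aggregates over: " + ", ".join(columns))
    if wanted("aggregate"):
        # Phase 2 (assumed): one histogram query per column.
        for col in columns:
            backend.execute(f"histogram for {col}")


def process_table_scalars(backend: CountingBackend, columns: List[str]) -> None:
    process_table(backend, columns, cost_classes={"scalar"})


def count_queries(runner: Callable[[CountingBackend, List[str]], None],
                  columns: List[str]) -> int:
    backend = CountingBackend()
    runner(backend, columns)
    return len(backend.queries)
```

Under these assumptions the full pipeline issues one query plus one per column, while the scalars-only path issues a single query, which is the shape of the "strictly fewer SQL queries" assertion in the xorq tests.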

Tests

3 new in TestStatPipeline (pandas) — scalars-only filters histograms out, aggregates-only ships only aggregate keys, default runs all.
3 new in TestCostClassFilter (xorq) — scalars-only emits strictly fewer SQL queries than full pipeline, scalars output omits histogram key, aggregates output omits scalar keys.
Full Python suite: 974 passed, 1 skipped (the pre-existing flaky MCP test).

Depends on #808

This PR branches from feat/stat-cost-decorator and uses the StatFunc.cost field added there. Once #808 lands on main, this can rebase to main.

What's still ahead

Phase 3 — JS BuckarooStateOrchestrator class with the 2× adaptive debounce. Separate PR.
Phase 4 — wire the new server entry points into a compute_stat_group WS handler + flip the default client to use it.

🤖 Generated with Claude Code

Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline by cost class so the JS-driven progressive-stats router can run cheap stats immediately on state_change and expensive stats after a debounce.

Server-side changes:

- ``StatPipeline.process_df(cost_classes=None)``: new kwarg filtering which stat funcs execute. Default ``None`` = run all costs (back-compat for every existing caller).
- ``StatPipeline.process_df_scalars(df)``: convenience wrapper for ``cost_classes={"scalar"}``. Histograms etc. skipped entirely.
- ``StatPipeline.process_df_aggregates(df)``: runs full pipeline (aggregates depend on scalar inputs), filters the response to just aggregate-cost provides.
- ``XorqStatPipeline.process_table(cost_classes=None)`` + ``process_table_scalars()`` + ``process_table_aggregates()``. For xorq the scalars-only path skips the per-column query loop in ``process_table`` phase 2, exactly the ~6.5 s of histogram queries that dominates the boston state_change latency. Test ``test_scalars_only_skips_histogram_queries`` asserts strictly fewer queries vs the full pipeline.

This commit ships the API surface; the WS handler that calls these new entry points is the next PR (Phase 4 wiring). No behavior change for any existing caller: all default to ``cost_classes=None``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…kips histogram on filt+clean

Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a backend declare "don't run stat X when computing scope Y's SD". The motivating case (and only override shipping here): ``XorqDataflow`` sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}``. On xorq the per-column histogram queries dominate state_change latency on remote tables, and the user only renders the bare (raw) histogram in ``merged_sd``. Running the histogram on filt + clean too was wasted work that didn't show up in the UI. Pandas / polars dataflows keep the default empty dict, so their filt/clean histograms continue to appear.

## Plumbing

``skip_stat_names`` threads from the dataflow into the underlying pipelines:

- ``_summary_sd`` (filt scope): applies filt skip when ``quick_command_args`` is truthy (the cascade hasn't updated ``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes): uses ``_effective_skip(scope, chains)`` which returns ``None`` when the scope is degenerate (filt == clean → no filter; clean == raw → no cleaning). Keeps the no-filter / no-cleaning case collapsing raw + clean + filt under one cache key so the lazy-postprocessor ``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget`` still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` / ``process_df_v1_compat``: skip at iteration (stat-func name match) plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers where one ``StatFunc`` provides many keys: ``DefaultSummaryStats__ series`` provides ``mean``, ``max``, ``std``, ...; we can't skip the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``: skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the output level since v1 ``AnalysisPipeline`` doesn't have per-stat filtering.

## Tests (failing-then-fix split per repo TDD policy)

Five new tests in the previous commit, all green now:

- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: skip ``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present, other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: no override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram``: Polars side still has empty skiplist.

## Regression

Full ``tests/unit/`` suite green (1049 passed, 6 skipped), including ``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget`` which caught the early version of the cache-key change that recomputed raw+clean scopes whenever skip differed from filt.

Adapted from the working draft on smorg/post-785-playground (``4949e56c``), stripped of references to the closed #787 spike (``recompute_summary_sd`` / ``_defer_summary_sd``) and the not-yet-merged #809 (``cost_classes``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
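
The interaction between the per-scope skiplist and the cache key can be sketched in isolation. Everything below is an assumption made for illustration: the `Chains` tuple, the function signatures, and the key layout are hypothetical, and only the rule "return ``None`` for a degenerate scope, and fold the effective skip into the cache key" comes from the commit message.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional, Set, Tuple


class Chains(NamedTuple):
    # Hypothetical container for the operations applied on top of raw data.
    cleaning: Tuple[str, ...] = ()
    filtering: Tuple[str, ...] = ()


# The override described for XorqDataflow; other dataflows default to {}.
XORQ_SKIP: Dict[str, Set[str]] = {"filt": {"histogram"}, "clean": {"histogram"}}


def effective_skip(scope: str, chains: Chains,
                   skip_by_scope: Dict[str, Set[str]]) -> Optional[FrozenSet[str]]:
    # A degenerate scope (no cleaning for "clean", no filter for "filt")
    # reports no skip, so it can share a cache entry with its parent scope.
    if scope == "clean" and not chains.cleaning:
        return None
    if scope == "filt" and not chains.filtering:
        return None
    names = skip_by_scope.get(scope)
    return frozenset(names) if names else None


def scope_cache_key(scope: str, chains: Chains,
                    skip_by_scope: Dict[str, Set[str]]) -> tuple:
    # Assumed key layout: the operations that produced the scope's data,
    # plus the effective skip so differently skipped scopes never collide.
    ops: Tuple[str, ...] = ()
    if scope in ("clean", "filt"):
        ops += chains.cleaning
    if scope == "filt":
        ops += chains.filtering
    return (ops, effective_skip(scope, chains, skip_by_scope))
```

With no cleaning and no filter, all three scopes produce the same key in this sketch, so a stat computation (and any postprocessor attached to it) would run once. Once a filter is present, the filt key carries the histogram skip and no longer matches the raw key.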

paddymul mentioned this pull request May 21, 2026

feat(js): StateOrchestrator for progressive-stats protocol (Phase 3) #810

Merged

paddymul added a commit that referenced this pull request May 21, 2026

Merge #809: process_table_scalars / _aggregates (Phase 2)

63bf655

paddymul mentioned this pull request May 21, 2026

Status bar shows no signal that the server / dataflow is still computing after a state_change #813

Closed

paddymul added the triaged label May 21, 2026

paddymul mentioned this pull request May 21, 2026

spike(rows-first): two-message state_change protocol behind env gate #787

Closed

paddymul mentioned this pull request May 21, 2026

feat(dataflow): skip_stats_by_scope — per-scope stat skiplist; xorq skips histogram on filt+clean #829

Open

5 tasks

paddymul wants to merge 1 commit into feat/stat-cost-decorator from feat/progressive-stats-server
