feat(paf): process_table_scalars / _aggregates entry points (Phase 2) #809
Open
paddymul wants to merge 1 commit into
Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline
by cost class so the JS-driven progressive-stats router can run
cheap stats immediately on state_change and expensive stats after a
debounce.
Server-side changes:
- ``StatPipeline.process_df(cost_classes=None)`` — new kwarg
filtering which stat funcs execute. Default ``None`` = run all
costs (back-compat for every existing caller).
- ``StatPipeline.process_df_scalars(df)`` — convenience wrapper for
``cost_classes={"scalar"}``. Histograms etc. skipped entirely.
- ``StatPipeline.process_df_aggregates(df)`` — runs full pipeline
(aggregates depend on scalar inputs), filters the response to
just aggregate-cost provides.
- ``XorqStatPipeline.process_table(cost_classes=None)`` +
``process_table_scalars()`` + ``process_table_aggregates()``.
For xorq the scalars-only path skips the per-column query loop in
``process_table`` phase 2 — exactly the ~6.5 s of histogram queries
that dominates the boston state_change latency. Test
``test_scalars_only_skips_histogram_queries`` asserts strictly fewer
queries vs the full pipeline.
This commit ships the API surface; the WS handler that calls these
new entry points is the next PR (Phase 4 wiring). No behavior change
for any existing caller — all default to ``cost_classes=None``.
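The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual ``StatPipeline`` code: the ``StatFunc`` dataclass, ``run_pipeline``, ``run_scalars``, ``run_aggregates``, and ``two_bin_histogram`` are hypothetical stand-ins, and treating the histogram as an ``"aggregate"``-cost stat is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class StatFunc:
    """Hypothetical stand-in for the real StatFunc; only the fields used here."""
    name: str
    cost: str                 # assumed values: "scalar" or "aggregate"
    requires: List[str]       # names of earlier stats passed in as inputs
    func: Callable[..., object]


def two_bin_histogram(col, lo, hi):
    """Toy 'expensive' stat that depends on the scalar min/max."""
    mid = (lo + hi) / 2
    return [sum(1 for v in col if v <= mid), sum(1 for v in col if v > mid)]


def run_pipeline(stat_funcs: List[StatFunc], col,
                 cost_classes: Optional[Set[str]] = None) -> Dict[str, object]:
    """Run stat funcs in order; cost_classes=None means run every cost."""
    results: Dict[str, object] = {}
    for sf in stat_funcs:
        if cost_classes is not None and sf.cost not in cost_classes:
            continue  # filtered out: the function never executes
        args = [results[name] for name in sf.requires]
        results[sf.name] = sf.func(col, *args)
    return results


def run_scalars(stat_funcs, col):
    """Scalars-only path: expensive stats are skipped, not just hidden."""
    return run_pipeline(stat_funcs, col, cost_classes={"scalar"})


def run_aggregates(stat_funcs, col):
    """Aggregates need scalar inputs, so run everything, then filter the output."""
    full = run_pipeline(stat_funcs, col)
    aggregate_names = {sf.name for sf in stat_funcs if sf.cost == "aggregate"}
    return {k: v for k, v in full.items() if k in aggregate_names}


STATS = [
    StatFunc("min", "scalar", [], lambda col: min(col)),
    StatFunc("max", "scalar", [], lambda col: max(col)),
    StatFunc("histogram", "aggregate", ["min", "max"], two_bin_histogram),
]

scalars = run_scalars(STATS, [1, 2, 3, 10])
aggregates = run_aggregates(STATS, [1, 2, 3, 10])
```

In this sketch, passing ``cost_classes={"aggregate"}`` straight to ``run_pipeline`` would fail because ``min`` and ``max`` were never computed, which is why the aggregates wrapper filters the response rather than the execution.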
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 21, 2026
…kips histogram on filt+clean
Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a
backend declare "don't run stat X when computing scope Y's SD". The
motivating case (and only override shipping here): ``XorqDataflow``
sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}`` — on xorq
the per-column histogram queries dominate state_change latency on
remote tables, and the user only renders the bare (raw) histogram in
``merged_sd``. Running the histogram on filt + clean too was wasted
work that didn't show up in the UI. Pandas / polars dataflows keep
the default empty dict, so their filt/clean histograms continue to
appear.
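A per-backend override of this kind might look like the sketch below. The class bodies are illustrative only: the ``skip_stats_by_scope`` attribute and the ``XorqDataflow`` value come from the description above, while the ``skip_for`` helper is a hypothetical name introduced for the example.

```python
from typing import Dict, Optional, Set


class DataFlow:
    # Default: an empty mapping, so no scope skips any stat.
    skip_stats_by_scope: Dict[str, Set[str]] = {}

    def skip_for(self, scope: str) -> Optional[Set[str]]:
        """Hypothetical helper: stat names to skip for a scope, or None."""
        skip = self.skip_stats_by_scope.get(scope)
        # Return a copy so callers cannot mutate the class-level default.
        return set(skip) if skip else None


class XorqDataflow(DataFlow):
    # Only the raw scope keeps its histogram on xorq.
    skip_stats_by_scope = {"filt": {"histogram"}, "clean": {"histogram"}}


xorq_flow = XorqDataflow()
base_flow = DataFlow()
```

Here ``xorq_flow.skip_for("filt")`` yields ``{"histogram"}``, while ``xorq_flow.skip_for("raw")`` and every scope on ``base_flow`` yield ``None``.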
## Plumbing
``skip_stat_names`` threads from the dataflow into the underlying
pipelines:
- ``_summary_sd`` (filt scope) — applies filt skip when
``quick_command_args`` is truthy (the cascade hasn't updated
``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes) — uses
``_effective_skip(scope, chains)`` which returns ``None`` when the
scope is degenerate (filt == clean → no filter; clean == raw → no
cleaning). Keeps the no-filter / no-cleaning case collapsing raw +
clean + filt under one cache key so the lazy-postprocessor
``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget``
still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache
doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` /
``process_df_v1_compat``: skip at iteration (stat-func name match)
plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers
where one ``StatFunc`` provides many keys, e.g.
``DefaultSummaryStats__series`` provides ``mean``, ``max``, ``std``, ...;
we can't skip the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``:
skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the
output level since v1 ``AnalysisPipeline`` doesn't have per-stat
filtering.
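The degenerate-scope rule, the cache key, and the output-level strip from the list above can be sketched as follows. This is an assumption-heavy illustration rather than the real code: the function names drop the leading underscore, ``chains`` is modelled as a hypothetical dict mapping each scope to a tuple of operations, and how the real ``_effective_skip`` handles the mixed case (cleaning but no filter) is not stated here, so the sketch simply returns ``None`` for any degenerate scope.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

SKIP = {"filt": {"histogram"}, "clean": {"histogram"}}


def effective_skip(skip_by_scope: Dict[str, Set[str]], scope: str,
                   chains: Dict[str, Tuple[str, ...]]) -> Optional[FrozenSet[str]]:
    """Skip set for a scope, or None when the scope is degenerate."""
    if scope == "filt" and chains["filt"] == chains["clean"]:
        return None  # no filter applied
    if scope == "clean" and chains["clean"] == chains["raw"]:
        return None  # no cleaning applied
    skip = skip_by_scope.get(scope)
    return frozenset(skip) if skip else None


def scope_cache_key(skip_by_scope, scope, chains):
    """Key on the operation chain plus the effective skip for that scope."""
    return (chains[scope], effective_skip(skip_by_scope, scope, chains))


def strip_skipped(summary: Dict[str, Dict[str, object]], skip):
    """Output-level strip: drop skipped keys from every column's stats."""
    if not skip:
        return summary
    return {col: {k: v for k, v in stats.items() if k not in skip}
            for col, stats in summary.items()}


# No filter, no cleaning: all three scopes collapse onto one cache key.
chains_same = {"raw": (), "clean": (), "filt": ()}
same_keys = {scope_cache_key(SKIP, s, chains_same) for s in ("raw", "clean", "filt")}

# Cleaning and filtering both active: raw and filt get distinct keys.
chains_diff = {"raw": (), "clean": ("dropna",), "filt": ("dropna", "x > 1")}
raw_key = scope_cache_key(SKIP, "raw", chains_diff)
filt_key = scope_cache_key(SKIP, "filt", chains_diff)
```

Because the degenerate case returns ``None`` for every scope, the three scopes share one key and the cached result is computed once, which is the property the lazy-postprocessor invariant relies on.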
## Tests (failing-then-fix split per repo TDD policy)
Five new tests in the previous commit, all green now:
- ``test_skip_stats_by_scope_excludes_named_stat_from_filt`` — skip
``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present,
other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean`` — skip on
raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats`` — no
override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean`` — pins
XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram`` — Polars side still has
empty skiplist.
## Regression
Full ``tests/unit/`` suite green (1049 passed, 6 skipped) — including
``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget`` which
caught the early version of the cache-key change that recomputed
raw+clean scopes whenever skip differed from filt.
Adapted from the working draft on smorg/post-785-playground
(``4949e56c``), stripped of references to the closed #787 spike
(``recompute_summary_sd`` / ``_defer_summary_sd``) and the
not-yet-merged #809 (``cost_classes``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline by ``StatFunc.cost`` (added in #808) so the JS-driven progressive-stats router can run cheap stats immediately on state_change and expensive ones after a debounce.
This PR ships the server-side API surface only. The WS handler that calls these new entry points + the JS orchestrator (Phase 3) are separate PRs. No existing caller changes behavior: every existing path defaults to ``cost_classes=None``, which runs all costs.
## Changes
- ``StatPipeline`` (pandas):
  - ``cost_classes`` kwarg on ``process_column`` / ``process_df``.
  - ``process_df_scalars(df)``: runs only cost=scalar stats.
  - ``process_df_aggregates(df)``: runs full pipeline (aggregates need scalar inputs), filters response to aggregate-cost provides only.
- ``XorqStatPipeline``: ``cost_classes`` kwarg on ``process_table`` + ``process_table_scalars`` + ``process_table_aggregates`` convenience methods.
## Tests
- ``TestStatPipeline`` (pandas): scalars-only filters histograms out, aggregates-only ships only aggregate keys, default runs all.
- ``TestCostClassFilter`` (xorq): scalars-only emits strictly fewer SQL queries than full pipeline, scalars output omits histogram key, aggregates output omits scalar keys.
## Depends on #808
This PR branches from ``feat/stat-cost-decorator`` and uses the ``StatFunc.cost`` field added there. Once #808 lands on main, this can rebase to main.
## What's still ahead
- ``BuckarooStateOrchestrator`` class with the 2× adaptive debounce. Separate PR.
- ``compute_stat_group`` WS handler + flip the default client to use it.
🤖 Generated with Claude Code