
feat(paf): process_table_scalars / _aggregates entry points (Phase 2)#809

Open
paddymul wants to merge 1 commit into feat/stat-cost-decorator from feat/progressive-stats-server


Conversation

@paddymul
Collaborator

Summary

Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline by StatFunc.cost (added in #808) so the JS-driven progressive-stats router can run cheap stats immediately on state_change and expensive ones after a debounce.

This PR ships the server-side API surface only. The WS handler that calls these new entry points + the JS orchestrator (Phase 3) are separate PRs. No existing caller changes behavior — every existing path defaults to cost_classes=None which runs all costs.

Changes

StatPipeline (pandas):

  • New cost_classes kwarg on process_column / process_df.
  • process_df_scalars(df) — runs only cost=scalar stats.
  • process_df_aggregates(df) — runs full pipeline (aggregates need scalar inputs), filters response to aggregate-cost provides only.

XorqStatPipeline:

  • Same shape: a cost_classes kwarg on process_table, plus process_table_scalars and process_table_aggregates convenience methods.
  • For xorq, the scalars-only path skips the per-column histogram query loop in phase 2 entirely — that's the ~6.5 s of work that dominates boston state_change latency.
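The split described above can be illustrated with a minimal sketch. It is not the PR's code: `ToyStatFunc`, `TOY_FUNCS`, and the `process_column*` helpers below are hypothetical stand-ins, and the only things taken from the description are the `cost` field, the `cost_classes=None` default that runs everything, and the rule that the aggregates path runs the full pipeline and then filters the response.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass(frozen=True)
class ToyStatFunc:
    """Hypothetical stand-in for a stat function tagged with a cost class."""
    name: str
    cost: str  # "scalar" or "aggregate"
    func: Callable[[list, dict], object]


def _length(values, acc):
    return len(values)


def _mean(values, acc):
    # Depends on the scalar "length" computed earlier.
    return sum(values) / acc["length"]


def _histogram(values, acc):
    # Aggregate-cost stat that needs the scalar "mean" as input.
    below = sum(1 for v in values if v < acc["mean"])
    return {"below_mean": below, "at_or_above_mean": acc["length"] - below}


TOY_FUNCS = [
    ToyStatFunc("length", "scalar", _length),
    ToyStatFunc("mean", "scalar", _mean),
    ToyStatFunc("histogram", "aggregate", _histogram),
]


def process_column(values, cost_classes: Optional[Set[str]] = None):
    """Run stat funcs in order; None means every cost class runs."""
    acc = {}
    for sf in TOY_FUNCS:
        if cost_classes is not None and sf.cost not in cost_classes:
            continue
        acc[sf.name] = sf.func(values, acc)
    return acc


def process_column_scalars(values):
    # Scalars need nothing from aggregates, so the expensive funcs never run.
    return process_column(values, cost_classes={"scalar"})


def process_column_aggregates(values):
    # Aggregates need scalar inputs: run everything, then filter the response.
    full = process_column(values)
    aggregate_names = {sf.name for sf in TOY_FUNCS if sf.cost == "aggregate"}
    return {k: v for k, v in full.items() if k in aggregate_names}
```

In this toy version the asymmetry is visible: the scalars path filters at execution time, while the aggregates path filters only the output.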

Tests

  • 3 new in TestStatPipeline (pandas) — scalars-only filters histograms out, aggregates-only ships only aggregate keys, default runs all.
  • 3 new in TestCostClassFilter (xorq) — scalars-only emits strictly fewer SQL queries than full pipeline, scalars output omits histogram key, aggregates output omits scalar keys.
  • Full Python suite: 974 passed, 1 skipped (the pre-existing flaky MCP test).
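The "strictly fewer SQL queries" assertion can be sketched with a query-counting fake. Everything here is an assumption made for illustration: `CountingBackend`, `run_toy_pipeline`, and the SQL strings are hypothetical and do not reflect how the xorq tests are actually written.

```python
class CountingBackend:
    """Hypothetical fake backend that records every query it is asked to run."""

    def __init__(self):
        self.queries = []

    def execute(self, sql):
        self.queries.append(sql)
        return {}


def run_toy_pipeline(backend, columns, cost_classes=None):
    """Toy two-phase pipeline: one batched scalar query, then per-column queries."""
    # Phase 1: a single batched query covering the scalar stats of all columns.
    select_list = ", ".join(f"avg({c})" for c in columns)
    backend.execute(f"SELECT {select_list} FROM t")
    # Phase 2: one histogram query per column, skipped on the scalars-only path.
    if cost_classes is None or "aggregate" in cost_classes:
        for c in columns:
            backend.execute(f"SELECT {c}, count(*) FROM t GROUP BY {c}")
    return len(backend.queries)


def test_scalars_only_emits_fewer_queries():
    cols = ["a", "b", "c"]
    full = run_toy_pipeline(CountingBackend(), cols)
    scalars = run_toy_pipeline(CountingBackend(), cols, cost_classes={"scalar"})
    assert scalars < full
```

Counting queries rather than timing them keeps such a test deterministic, which matters when the saving being pinned is latency against a remote table.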

Depends on #808

This PR branches from feat/stat-cost-decorator and uses the StatFunc.cost field added there. Once #808 lands on main, this branch can be rebased onto main.

What's still ahead

  • Phase 3 — JS BuckarooStateOrchestrator class with the 2× adaptive debounce. Separate PR.
  • Phase 4 — wire the new server entry points into a compute_stat_group WS handler + flip the default client to use it.

🤖 Generated with Claude Code

Phase 2 of plans/js-driven-stat-debounce.md. Splits the stat pipeline
by cost class so the JS-driven progressive-stats router can run
cheap stats immediately on state_change and expensive stats after a
debounce.

Server-side changes:

  - ``StatPipeline.process_df(cost_classes=None)`` — new kwarg
    filtering which stat funcs execute. Default ``None`` = run all
    costs (back-compat for every existing caller).
  - ``StatPipeline.process_df_scalars(df)`` — convenience wrapper for
    ``cost_classes={"scalar"}``. Histograms etc. skipped entirely.
  - ``StatPipeline.process_df_aggregates(df)`` — runs full pipeline
    (aggregates depend on scalar inputs), filters the response to
    just aggregate-cost provides.
  - ``XorqStatPipeline.process_table(cost_classes=None)`` +
    ``process_table_scalars()`` + ``process_table_aggregates()``.

For xorq the scalars-only path skips the per-column query loop in
``process_table`` phase 2 — exactly the ~6.5 s of histogram queries
that dominates the boston state_change latency. Test
``test_scalars_only_skips_histogram_queries`` asserts strictly fewer
queries vs the full pipeline.

This commit ships the API surface; the WS handler that calls these
new entry points is the next PR (Phase 4 wiring). No behavior change
for any existing caller — all default to ``cost_classes=None``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added the triaged label (Reviewed and triaged) on May 21, 2026
paddymul added a commit that referenced this pull request May 21, 2026
…kips histogram on filt+clean

Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a
backend declare "don't run stat X when computing scope Y's SD". The
motivating case (and only override shipping here): ``XorqDataflow``
sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}`` — on xorq
the per-column histogram queries dominate state_change latency on
remote tables, and the user only renders the bare (raw) histogram in
``merged_sd``. Running the histogram on filt + clean too was wasted
work that didn't show up in the UI. Pandas / polars dataflows keep
the default empty dict, so their filt/clean histograms continue to
appear.
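A rough sketch of the class-attribute pattern follows. Only the attribute name ``skip_stats_by_scope`` and the xorq override value come from the commit message; the ``Toy*`` classes, ``skip_for``, ``stats_to_run``, and ``ALL_STATS`` are hypothetical names introduced for illustration.

```python
class ToyDataFlow:
    """Hypothetical base class: by default no scope skips any stat."""

    skip_stats_by_scope = {}

    def skip_for(self, scope):
        # Return a fresh set so callers cannot mutate the class-level default.
        return set(self.skip_stats_by_scope.get(scope, ()))


class ToyXorqDataflow(ToyDataFlow):
    # Override value quoted in the commit message: no histogram on filt or clean.
    skip_stats_by_scope = {"filt": {"histogram"}, "clean": {"histogram"}}


class ToyPolarsDataflow(ToyDataFlow):
    # Inherits the empty default, so every scope keeps its histogram.
    pass


ALL_STATS = ["mean", "max", "histogram"]  # illustrative stat names only


def stats_to_run(flow, scope):
    """List the stats a flow would compute for one scope."""
    skip = flow.skip_for(scope)
    return [name for name in ALL_STATS if name not in skip]
```

Declaring the skip list on the class keeps the decision with the backend: a new dataflow opts in by overriding one attribute, and the shared plumbing stays unchanged.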

## Plumbing

``skip_stat_names`` threads from the dataflow into the underlying
pipelines:

- ``_summary_sd`` (filt scope) — applies filt skip when
  ``quick_command_args`` is truthy (the cascade hasn't updated
  ``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes) — uses
  ``_effective_skip(scope, chains)`` which returns ``None`` when the
  scope is degenerate (filt == clean → no filter; clean == raw → no
  cleaning). Keeps the no-filter / no-cleaning case collapsing raw +
  clean + filt under one cache key so the lazy-postprocessor
  ``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget``
  still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache
  doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` /
  ``process_df_v1_compat``: skip at iteration (stat-func name match)
  plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers
  where one ``StatFunc`` provides many keys: ``DefaultSummaryStats__series``
  provides ``mean``, ``max``, ``std``, ...; we can't skip
  the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``:
  skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the
  output level since v1 ``AnalysisPipeline`` doesn't have per-stat
  filtering.
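The two-level skip (at iteration, then at output) can be sketched as below. This is an illustration under stated assumptions: ``run_stats``, ``summary_wrapper``, and ``histogram`` are hypothetical, and stat funcs are modelled as plain ``(name, callable)`` pairs rather than the real ``StatFunc`` objects.

```python
def summary_wrapper(values):
    """Hypothetical v1-style wrapper: one func that provides several keys."""
    return {
        "mean": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
    }


def histogram(values):
    """Hypothetical single-key stat func: value -> occurrence count."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return {"histogram": counts}


def run_stats(values, stat_funcs, skip_stat_names=None):
    skip = set(skip_stat_names or ())
    out = {}
    for name, func in stat_funcs:
        # Level 1: skip the whole func when its own name is in the skip set.
        if name in skip:
            continue
        out.update(func(values))
    # Level 2: strip individual keys that a multi-key wrapper still produced.
    return {key: value for key, value in out.items() if key not in skip}


FUNCS = [("summary_wrapper", summary_wrapper), ("histogram", histogram)]
```

Skipping ``histogram`` avoids the computation entirely, whereas skipping ``mean`` only removes the key: the wrapper still runs because its other keys are wanted.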

## Tests (failing-then-fix split per repo TDD policy)

Five new tests in the previous commit, all green now:

- ``test_skip_stats_by_scope_excludes_named_stat_from_filt`` — skip
  ``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present,
  other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean`` — skip on
  raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats`` — no
  override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean`` — pins
  XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram`` — Polars side still has
  empty skiplist.

## Regression

Full ``tests/unit/`` suite green (1049 passed, 6 skipped) — including
``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget`` which
caught the early version of the cache-key change that recomputed
raw+clean scopes whenever skip differed from filt.

Adapted from the working draft on smorg/post-785-playground
(``4949e56c``), stripped of references to the closed #787 spike
(``recompute_summary_sd`` / ``_defer_summary_sd``) and the
not-yet-merged #809 (``cost_classes``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
