
feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd#778

Closed
paddymul wants to merge 4 commits into main from feat/scoped-summary-stats


@paddymul
Collaborator

Summary

V1 of the scoped summary stats described in docs/plans/async-stats.md (forthcoming): merged_sd now carries both the pre-filter baseline (bare keys) and, when a search filter is active, the filtered view (filtered_* keys), so the frontend can render both side-by-side via the ? optional-pinned-row mechanism in #777.

  • New trait summary_sd_raw holds the bare-key sd. With no filter it aliases summary_sd (no extra compute). With a filter active it re-runs autocleaning with empty quick_command_args to materialize the unfiltered df, then computes stats on that.
  • _merged_sd wires summary_sd_raw into the bare-key layer and layers filtered_* keys from summary_sd on top when the filter is on.
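
The layering described in the two bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `merge_scoped_sd`, its arguments, and the sample stats are hypothetical, and the real `_merged_sd` may merge additional layers.

```python
def merge_scoped_sd(summary_sd_raw, summary_sd, filter_active):
    """Hypothetical helper: bare keys come from the raw (pre-filter) sd,
    and filtered_* keys are layered on top only when a filter is active."""
    # Copy each column's stats so the inputs are not mutated.
    merged = {col: dict(stats) for col, stats in summary_sd_raw.items()}
    if filter_active:
        for col, stats in summary_sd.items():
            target = merged.setdefault(col, {})
            for stat_name, value in stats.items():
                target[f"filtered_{stat_name}"] = value
    return merged


# Made-up stats for a single column "a".
raw_sd = {"a": {"mean": 5.0, "length": 10}}
filtered_sd = {"a": {"mean": 2.0, "length": 3}}

# With no filter, summary_sd aliases the raw sd, so only bare keys appear.
no_filter = merge_scoped_sd(raw_sd, raw_sd, filter_active=False)
with_filter = merge_scoped_sd(raw_sd, filtered_sd, filter_active=True)
```

With the filter on, `with_filter["a"]` carries `mean` and `length` from the raw sd alongside `filtered_mean` and `filtered_length` from the filtered one.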

Breaking change

merged_sd[col]["mean"] (and the other bare stat names) now refer to the pre-filter dataset. Callers needing the post-filter values should read filtered_mean etc. when a filter is active. No in-repo callers needed updating — existing tests construct dataflows without filters, where pre- and post-filter values coincide.
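
For callers affected by this change, one possible access pattern is to prefer the `filtered_*` key and fall back to the bare key when no filter is active. The helper name `post_filter_stat` is hypothetical and not part of this PR; it only assumes the key naming described above.

```python
def post_filter_stat(merged_sd, col, stat):
    """Hypothetical accessor: return the post-filter value of a stat.

    Reads filtered_<stat> when present (filter active); otherwise falls
    back to the bare key, where pre- and post-filter values coincide.
    """
    col_stats = merged_sd[col]
    return col_stats.get(f"filtered_{stat}", col_stats[stat])


# Made-up merged_sd shapes for illustration.
unfiltered_view = {"a": {"mean": 5.0}}
filtered_view = {"a": {"mean": 5.0, "filtered_mean": 2.0}}

mean_without_filter = post_filter_stat(unfiltered_view, "a", "mean")
mean_with_filter = post_filter_stat(filtered_view, "a", "mean")
```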

What's out of V1 (separate follow-ups)

  • cleaned_* scope (third scope, only meaningful when cleaning_method != ""). The current shape proves out raw + filtered first; the cleaned scope can slot in later via a third sd computed on the cleaned-but-unfiltered df.
  • Histogram bin-edge sharing across scopes; categorical↔numerical column-type handling. Tracked in the plan doc.
  • Async (asyncio.to_thread) compute and two-message WS protocol. V1 is sync.

Depends on

Test plan

  • 3 new failing-first tests in tests/unit/dataflow/scoped_summary_stats_test.py (commit cd2ee5b) — no-cleaning-no-filter baseline, filter activates filtered_* keys, bare keys reflect raw (pre-filter) dataset.
  • All 939 Python tests pass locally after the fix commit.
  • CI: verify the failing-tests commit fails and the fix commit passes.

🤖 Generated with Claude Code

paddymul and others added 3 commits May 20, 2026 14:35
… merged_sd

Three integration tests covering the dataflow-level shape of the new
scoped summary stats:

- no cleaning, no filter → only bare-key raw scope (baseline; passes
  today)
- filter active → bare-key raw + `filtered_*` scope
- bare `length` reflects the raw (pre-filter) dataset, not the
  post-everything view — deliberate breaking change to today's
  bare-name semantics

The second and third currently fail because today's dataflow runs the
stats pipeline against the post-everything `processed_df` and overwrites
the unfiltered baseline on each state change.

See docs/plans/async-stats.md (forthcoming) for the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… is active

Introduces the V1 scoped summary stats: when a search filter is active,
``merged_sd`` carries both the unfiltered baseline (bare keys) and the
filtered view (``filtered_*`` keys), so a frontend can render both side
by side via the ``?`` optional-pinned-row mechanism shipping in #777.

Mechanism:

- ``summary_sd_raw`` holds the bare-key (pre-filter) sd. When no filter
  is active it's an alias for ``summary_sd`` (no extra compute). When a
  filter is active it re-runs autocleaning with empty
  ``quick_command_args`` to materialize the unfiltered cleaned df and
  computes stats on that.
- ``_merged_sd`` now wires ``summary_sd_raw`` into the bare-key layer
  (replacing today's ``summary_sd``) and layers ``filtered_*`` keys on
  top from ``summary_sd`` when ``quick_command_args`` is non-empty.

Deliberate breaking change: ``merged_sd[col]["mean"]`` (and other bare
stat names) now refer to the pre-filter dataset, not the
post-everything view. Callers that need the post-filter values should
read ``filtered_mean`` etc. when a filter is active. Documented in
docs/plans/async-stats.md; no in-repo callers needed updating because
existing tests construct dataflows without filters (where pre- and
post-filter values coincide).

Cleaned-scope (``cleaned_*`` keys) is deferred to a follow-up — the
current shape proves out raw + filtered first, and the same pattern
extends to cleaned via a third sd computed on the cleaned-but-
unfiltered df.

Full Python suite passes (939 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tripped ruff F401 on the pre-push full-tree check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 20, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26186264748" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

… (one call per state change)

The prior shape of this PR added a `_raw_summary_sd` observer that
called `handle_ops_and_clean` a second time with empty
`quick_command_args` to materialize the unfiltered df. Combined with
#780's "one call per state change" invariant, this caused
`handle_ops_and_clean` to fire 2–3× per filter change (once from
`_operation_result` plus 1–2× from the observer cascade — the observer
watched both `summary_sd` and `quick_command_args`).

This commit folds the unfiltered-df materialization into
`PandasAutocleaning.handle_ops_and_clean` itself. The method now runs
the interpreter once with the full ops and once with cleaning-only ops
(when they differ), returning a 5-tuple:

    [cleaned_df, cleaning_sd, generated_code, final_ops,
     cleaned_df_unfiltered]

When no quick ops are present, `cleaned_df_unfiltered is cleaned_df`
(no second interpreter run, preserves the no-op short-circuit identity
invariant — see feedback_short_circuit_identity).
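
A rough sketch of the return shape and the identity short-circuit, under stated assumptions: `handle_ops_and_clean_sketch` and `run_ops` are hypothetical stand-ins that operate on plain lists rather than DataFrames, and "the ops differ" is simplified here to "there are quick ops".

```python
def run_ops(rows, ops):
    """Toy interpreter: each op maps a list of rows to a new list of rows."""
    result = list(rows)
    for op in ops:
        result = op(result)
    return result


def handle_ops_and_clean_sketch(rows, cleaning_ops, quick_ops):
    """Hypothetical stand-in for the 5-tuple shape described above."""
    final_ops = cleaning_ops + quick_ops
    cleaned = run_ops(rows, final_ops)
    if quick_ops:
        # Second interpreter run with cleaning-only ops: the unfiltered view.
        cleaned_unfiltered = run_ops(rows, cleaning_ops)
    else:
        # No quick ops: reuse the same object, so no second run happens.
        cleaned_unfiltered = cleaned
    cleaning_sd = {}      # placeholder; not modelled in this sketch
    generated_code = ""   # placeholder; not modelled in this sketch
    return [cleaned, cleaning_sd, generated_code, final_ops, cleaned_unfiltered]


def drop_none(rows):
    return [r for r in rows if r is not None]


def keep_big(rows):
    return [r for r in rows if r > 2]


rows = [1, None, 3, 4]
unfiltered_result = handle_ops_and_clean_sketch(rows, [drop_none], [])
filtered_result = handle_ops_and_clean_sketch(rows, [drop_none], [keep_big])
```

In the no-filter case, element 4 is the same object as element 0, which is what lets downstream code skip recomputation by identity check.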

Downstream:

- `cleaned_df_unfiltered` exposed as a property on the dataflow.
- `_raw_summary_sd` observer removed; raw-scope sd is now computed
  inside `_summary_sd`, reading from `cleaned[4]`. Post-processing is
  applied to the unfiltered cleaned df so the raw scope reflects the
  visible view, not just the cleaning output.
- Test unpacks of `handle_ops_and_clean`'s return updated to swallow
  the new element with `*_` (forward-compatible across any future
  trailing additions).

Net behaviour: `handle_ops_and_clean` fires exactly once per state
change, restoring #780's invariant. The xorq cost benefit is real —
filter changes now pay one round-trip instead of two.

Full Python suite: 939 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paddymul
Collaborator Author

Superseded by #785, which uses #783's keyed SD cache as the substrate instead of the parallel 5-tuple handle_ops_and_clean shape this PR took. Closing.

@paddymul paddymul closed this May 20, 2026