feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783

Merged: paddymul merged 2 commits into main from feat/sd-keyed-cache on May 20, 2026

@paddymul
Collaborator

Summary

This PR adds a keyed summary-stats cache. Three pointer traits
(raw_sd_key, clean_sd_key, filt_sd_key) each carry an opaque hash of
the op chain that produced that scope's SD, and a new
summary_stats_cache: Dict traitlet maps those hashes to parquet-b64 SD
blobs. The frontend reads <scope>_sd_key and looks the entry up in the
cache; Python is the sole writer of cache keys, so there's no
canonical-JSON contract to drift across the boundary.
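
As a minimal sketch of the read path described above (written in Python for illustration, although the real reader is the frontend), the lookup is a pointer dereference into a dict. The `state` dict, the key strings, and `lookup_sd_blob` are hypothetical stand-ins, not names from this PR; the real values are traitlets on the widget and real parquet-b64 blobs.

```python
import base64

# Hypothetical snapshot of the synced widget state. Key strings and blob
# contents are placeholders, not real hashes or parquet bytes.
state = {
    "raw_sd_key": "k_raw",
    "clean_sd_key": "k_raw",  # assumed: no cleaning ops, so clean points at raw's entry
    "filt_sd_key": "k_filt",
    "summary_stats_cache": {
        "k_raw": base64.b64encode(b"raw-sd-bytes").decode("ascii"),
        "k_filt": base64.b64encode(b"filt-sd-bytes").decode("ascii"),
    },
}


def lookup_sd_blob(state, scope):
    """Resolve <scope>_sd_key to its cached blob.

    The reader never recomputes a key; it only follows the pointer that
    Python wrote, which is why no shared hashing contract is needed.
    """
    key = state[f"{scope}_sd_key"]
    return state["summary_stats_cache"].get(key)


raw_blob = lookup_sd_blob(state, "raw")
clean_blob = lookup_sd_blob(state, "clean")
filt_blob = lookup_sd_blob(state, "filt")
```

Two scopes that point at the same key share one cache entry, so the blob is stored and transferred once.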

This is the substrate that makes the whole "stop recomputing SDs you
already have" line of work tractable. It sits on top of whatever
encoding sd_to_parquet_b64 produces: today's row format, or #782's
wide-typed format when that lands. The two compose.

The invariant that matters

A state change that doesn't move a scope's chain must not produce a
new SD computation for that scope. Concretely, flipping
quick_command_args keeps the raw and clean keys constant — those
entries are cache hits — and only the filt scope sees a new key.

That's the test (tests/unit/dataflow/sd_cache_test.py):

raw_before  == raw_after     # raw pointer unchanged
clean_before == clean_after  # clean pointer unchanged
filt_before != filt_after    # filt pointer moved
len(cache_after) == len(cache_before) + 1   # one new entry, not three
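
The pointer behaviour asserted above can be reproduced with a toy version of the two helpers. This is a sketch under stated assumptions, not the PR's implementation: the bodies of `hash_chain` and `split_chain_by_scope` are guesses, the `quick_command` marker on an op is invented for illustration, and `keys_for` is a hypothetical wrapper.

```python
import hashlib
import json


def hash_chain(chain):
    """Deterministic hash of an op chain (assumes ops are JSON-serializable)."""
    payload = json.dumps(chain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def split_chain_by_scope(operations):
    """Assumed rule: ops marked quick_command count only toward the filt scope."""
    clean = [op for op in operations if not op[0].get("quick_command")]
    return {"raw": [], "clean": clean, "filt": list(operations)}


def keys_for(operations):
    chains = split_chain_by_scope(operations)
    return {scope: hash_chain(chain) for scope, chain in chains.items()}


cleaning_op = [{"symbol": "dropna"}, "price"]
search_a = [{"symbol": "search", "quick_command": True}, "foo"]
search_b = [{"symbol": "search", "quick_command": True}, "bar"]

# Flipping the quick-command argument changes only the filt chain.
before = keys_for([cleaning_op, search_a])
after = keys_for([cleaning_op, search_b])
```

Because only Python serializes the chain, the choice of JSON here is an internal detail; the frontend sees the result as an opaque string.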

Why this lands cleanly with no perf regression

The naive shape — "compute SD for all three scopes from scratch every
state change" — would have tripled the analysis-pipeline work on
xorq backends (the test_widget_construction_query_count regression
test catches that).

The observer instead reuses self.summary_sd for the filt
scope (it's already computed by _summary_sd), so this lands
without adding a second pass through the analysis pipeline for the
scope the old flow was already covering. Raw and clean do a fresh SD
compute, but only on a cache miss and only when their hash differs
from filt's. A widget with no cleaning method and no filter does
exactly one SD compute in total, the same as before this PR. The
query-count test stays green.
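
A rough sketch of that observer logic, under stated assumptions: `update_sd_cache` and `compute_sd` are hypothetical names, and SDs are stood in for by plain strings. Only the skip-on-hit guard is taken from the diff quoted in the review further down.

```python
def update_sd_cache(cache, keys, filt_sd, compute_sd):
    """Return (new_cache, pointers) for one state change.

    filt_sd is the SD the existing flow has already computed, so storing
    it costs no extra pass through the analysis pipeline.
    """
    new_cache = dict(cache)
    new_cache.setdefault(keys["filt"], filt_sd)
    for scope in ("raw", "clean"):
        if keys[scope] in new_cache:
            continue  # cache hit, or this scope hashes the same as filt
        new_cache[keys[scope]] = compute_sd(scope)
    pointers = {f"{s}_sd_key": keys[s] for s in ("raw", "clean", "filt")}
    return new_cache, pointers


calls = []


def compute_sd(scope):
    calls.append(scope)  # record every fresh compute
    return f"sd-for-{scope}"


# No cleaning method and no filter: all three scopes share one hash.
same = {"raw": "h0", "clean": "h0", "filt": "h0"}
cache, pointers = update_sd_cache({}, same, "sd-filt", compute_sd)
calls_after_init = list(calls)

# A filter is applied: only the filt key moves.
moved = {"raw": "h0", "clean": "h0", "filt": "h1"}
cache2, pointers2 = update_sd_cache(cache, moved, "sd-filt-2", compute_sd)
```

In both calls `compute_sd` is never invoked: the only SD work is the one that produced `filt_sd`, and the second call adds exactly one cache entry.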

Why this isn't built on top of any other open PR

There are related in-flight PRs around scoped summary stats (#778 and
the smorgasbord stack) and SD encoding (#782). This branches cleanly
from main:

Test plan

  • 4/4 new tests in tests/unit/dataflow/sd_cache_test.py
  • Full Python suite (1015 passed, 1 unrelated MCP test deselected
    for empty static artifact in worktree)
  • test_widget_construction_query_count stays ≤6 queries
  • ruff check + paddy-format clean
  • CI green

🤖 Generated with Claude Code

paddymul and others added 2 commits May 20, 2026 17:21
Failing-tests commit — exercises the keyed-cache substrate that doesn't
exist yet on main. The implementation lands in the next commit.

Asserts: chain hash is deterministic, split_chain_by_scope partitions
quick-command ops from cleaning ops, widget init populates the cache
and pointer traits, and a quick_command_args flip does not move the
raw/clean pointers (cache hit for those scopes; only the filt scope's
chain has new bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three pointer traits — raw_sd_key, clean_sd_key, filt_sd_key — carry an
opaque hash of the op chain that produced each scope's SD. A new
``summary_stats_cache: Dict`` traitlet maps those hashes to the
parquet-b64 SD blob. The frontend reads <scope>_sd_key and looks the
entry up in the cache; Python is the sole writer of cache keys, so
there's no canonical-JSON contract to drift across the boundary.

The key invariant: a state change that doesn't move a scope's chain
must not produce a new SD computation. Concretely, flipping
``quick_command_args`` keeps the raw and clean keys constant — the
existing entries are cache hits — and only the filt scope sees a new
key and recomputes.

The observer reuses ``self.summary_sd`` for the filt scope (it's
already computed by ``_summary_sd``), so this lands without adding a
second pass through the analysis pipeline for the scope that the old
flow was already covering. Raw and clean scopes do new SD compute, but
only on cache miss and only when their hash differs from filt's — so a
widget with no cleaning method and no filter does exactly one SD
compute total, the same as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190778427" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ee861edb8

if self.processed_df is None:
    return
chains = split_chain_by_scope(self.operations)
keys = {scope: hash_chain(chain) for scope, chain in chains.items()}

P1 Badge Invalidate SD cache when source dataframe changes

The cache key is derived only from operations, so changing raw_df or sample_method with the same op chain reuses the old key and skips recomputation, leaving summary_stats_cache entries stale for the new dataset. In this flow, summary_sd is recomputed for the new data, but the if keys[scope] in new_cache guard prevents updating the cached blob, so consumers of raw_sd_key/clean_sd_key/filt_sd_key can read stats from the previous dataframe.
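
The scenario in this comment can be reproduced in isolation. The sketch below is a toy model, not the PR's code: `refresh` is a hypothetical function, the body of `hash_chain` is assumed, and SDs are plain strings. Only the shape of the guard mirrors the reviewed diff.

```python
import hashlib
import json


def hash_chain(chain):
    """Assumed implementation: hash of the JSON-serialized op chain only."""
    payload = json.dumps(chain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def refresh(cache, operations, sd_for_current_data):
    """Store the SD under the chain hash unless that key already exists."""
    new_cache = dict(cache)
    key = hash_chain(operations)
    if key not in new_cache:  # same guard shape as the reviewed lines
        new_cache[key] = sd_for_current_data
    return new_cache, key


ops = [[{"symbol": "dropna"}, "price"]]
cache, key_a = refresh({}, ops, "sd-of-dataframe-A")
# The source dataframe is swapped, but the op chain is unchanged.
cache, key_b = refresh(cache, ops, "sd-of-dataframe-B")
stale_value = cache[key_b]
```

Both calls produce the same key, so the second SD is never written and `stale_value` still describes dataframe A.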

Comment on lines +467 to +468
if keys[scope] in new_cache:
    continue

P2 Badge Include analysis class set in SD cache identity

When analysis_klasses changes, summary stats semantics change even if the op chain is identical, but the key still hashes only the chain; existing entries are treated as cache hits and never rewritten. That means adding/removing analyses can leave summary_stats_cache serving blobs computed with the old analysis set, despite observing analysis_klasses changes.
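
One possible way to address this, sketched here purely as an illustration and not as something this PR does, is to fold the names of the active analysis classes into the hashed payload. `cache_identity` and the class names `AnalysisA`, `AnalysisB`, and `AnalysisC` are hypothetical.

```python
import hashlib
import json


def cache_identity(chain, analysis_names):
    """Hash the op chain together with the sorted analysis class names."""
    payload = json.dumps(
        {"chain": chain, "analyses": sorted(analysis_names)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


chain = [[{"symbol": "dropna"}, "price"]]
key_basic = cache_identity(chain, ["AnalysisA", "AnalysisB"])
key_reordered = cache_identity(chain, ["AnalysisB", "AnalysisA"])
key_extended = cache_identity(chain, ["AnalysisA", "AnalysisB", "AnalysisC"])
```

Sorting the names treats the analysis set as order-independent, which is itself an assumption; if analysis order affects the resulting stats, the names would need to be hashed in order instead.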

Collaborator Author

File an issue about this, but it's out of scope for now.
