Skip to content

feat(jlisp): sd channel — SDResult return + sd-as-arg read-only view#755

Merged
paddymul merged 6 commits into
mainfrom
feat/jlisp-sd-channel
May 17, 2026
Merged

feat(jlisp): sd channel — SDResult return + sd-as-arg read-only view#755
paddymul merged 6 commits into
mainfrom
feat/jlisp-sd-channel

Conversation

@paddymul
Copy link
Copy Markdown
Collaborator

Summary

Extends the jlisp interpreter so a Command can interact with the running summary-stats dict (sd) alongside the dataframe. Supersedes #744 with a cleaner design.

Three transform shapes share one mutable sd dict in the lisp env and compose freely in the same pipeline:

  • bare df (legacy) — passed through unchanged. All current commands.
  • SDResult(df, sd_updates) — interpreter merges sd_updates into the running sd via apply-result! (a per-call closure over the mutable dict).
  • s('sd') in command_default — transform receives a MappingProxyType read-only view of the current sd. To write, return SDResult from the same call (read-then-write).

Design doc: docs/lisp-sd-channel-plan.md.

Key decisions (from grilling pass)

  • SDResult is @dataclass(frozen=True) in buckaroo/jlisp/configure_utils.py, re-exported from the four Command-defining modules.
  • Dispatch is isinstance(result, SDResult) — no tuple-shape heuristic. Future PivotResult etc. add cases.
  • sd-as-arg is contractually read-only; MappingProxyType raises TypeError on top-level mutation.
  • apply-result! is a per-call closure (not a registered primitive) so it has access to the mutable dict while only the proxy is bound under name sd.
  • initial_sd is deep-copied at the buckaroo_transform boundary; caller's nested dicts can't leak mutations.
  • handle_ops_and_clean passes cleaning_sd as initial_sd so sd-as-arg readers see autocleaning analysis state (orig_col_name, _type, etc.) alongside upstream op writes.
  • s('sd') lives at index 2 in command_default (right after s('df')); transform sig is transform(df, sd, ...).

Test plan

  • pytest tests/unit/jlisp/ tests/unit/dataflow/ tests/unit/commands/ — passes
  • Full unit suite (excluding contrib/file_cache): 819 passed, 7 skipped
  • CI green

Tests

3 interpreter-level tests in tests/unit/jlisp/test_sd_channel.py:

  • test_sdresult_merges_via_apply_result_closure
  • test_sd_arg_is_read_only_mappingproxy
  • test_sd_arg_sees_upstream_sdresult_mutations

1 integration test in tests/unit/dataflow/test_sd_channel_integration.py:

  • test_three_shapes_compose_through_handle_ops_and_clean

TDD: failing-tests commit (b4b15bfd) was visible failing on CI before the implementation commit (5b74824) landed.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25996720230

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25996720230

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.5.dev25996720230" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b74824b0a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

buckaroo_eval, raw_parse = make_interpreter(transform_lisp_primitives)

def buckaroo_transform(instructions, df):
def buckaroo_transform(instructions, df, initial_sd):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore transform compatibility for auto_type_df2

This required initial_sd argument leaves the existing auto_type_df2() helper in buckaroo/auto_clean/cleaning_commands.py still calling transform(full_ops, df) with the old two-argument shape, so any caller of that helper now gets a TypeError before auto-typing runs. I checked repo-wide references with rg "auto_type_df2|buckaroo_transform"; the helper is the in-repo call site that was not updated, and even after passing {} it would need to preserve its previous df-only return contract rather than leaking the new (df, sd) tuple.

Useful? React with 👍 / 👎.

return ret_val

sd_dict = copy.deepcopy(initial_sd)
sd_view = MappingProxyType(sd_dict)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deep-freeze sd before passing it to commands

Wrapping only the top-level dict with MappingProxyType does not make the nested per-column metadata read-only, so an sd-aware command can still mutate state with inputs like sd[col]["note"] = val and bypass the intended SDResult write path. Because sd is explicitly documented here as {col: {key: value}} and read-only for command authors, this scenario silently changes the live copied sd_dict instead of raising like the new tests expect for top-level writes.

Useful? React with 👍 / 👎.

paddymul added a commit that referenced this pull request May 17, 2026
The helper has been broken since Oct 2023 — get_auto_type_operations
gained `metadata_f` and `recommend_f` required args in 2325d10, but
auto_type_df2 still called it with a single argument. The signature
change in PR #755 (transform now takes initial_sd) stacks a second
TypeError on the same dead path.

No in-repo callers (zero grep hits across .py/.ipynb/.md/configs) and
the function has no tests or docs. Drop it along with the now-unused
cleaning_classes list and configure_buckaroo / get_auto_type_operations
imports; keep the SDResult re-export.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul and others added 6 commits May 17, 2026 12:43
Pins the contract for three jlisp transform return shapes (df-only,
SDResult-return, sd-as-arg) sharing one mutable sd dict in the lisp env.
Tests import SDResult from configure_utils and polars_commands; both
fail at collection until the implementation lands.

- tests/unit/jlisp/test_sd_channel.py — three interpreter-level tests:
  SDResult merge via apply-result!, sd-as-arg read-only proxy enforcement,
  sd-as-arg sees upstream SDResult mutations.
- tests/unit/dataflow/test_sd_channel_integration.py — one end-to-end
  test through handle_ops_and_clean covering all three shapes composing.
- docs/lisp-sd-channel-plan.md — design plan with decisions recorded
  from grilling pass (frozen dataclass, MappingProxyType view, per-call
  closure, deepcopy seed, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three transform shapes share one mutable sd dict in the lisp env:
  - bare df             : passed through unchanged (legacy)
  - SDResult(df, sd_updates) : interpreter merges sd_updates into the
                          running sd dict via apply-result! (a per-call
                          closure over the mutable dict)
  - s('sd') in command_default : transform receives a MappingProxyType
                          view; top-level mutation raises TypeError.
                          To write, return SDResult.

The two channels compose: a downstream sd-as-arg reader sees mutations
from upstream SDResult ops within the same pipeline. A single command
can do both (read via arg, write via SDResult).

SDResult is @DataClass(frozen=True) defined in configure_utils.py and
re-exported from the four Command-defining modules (all_transforms.py,
polars_commands.py, pandas_commands.py, auto_clean/cleaning_commands.py)
so authors import it alongside Command.

buckaroo_transform(instructions, df, initial_sd) deep-copies initial_sd
at the boundary and returns (df, sd). _run_df_interpreter mirrors the
signature; wrap_set_df threads each form through apply-result!.
handle_ops_and_clean passes cleaning_sd as initial_sd so sd-as-arg
readers see autocleaning analysis state alongside upstream op writes.

The to_py codegen interpreter binds {'df': 5, 'sd': {}} so sd-aware
commands' transform_to_py can be invoked; authors emit standalone
Python (typically df-only equivalent) since generated code has no sd.

Design plan in docs/lisp-sd-channel-plan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The empty-ops short-circuit exists to skip self.df_interpreter — which
internally does df.copy()/clone() and would churn df identity, firing
traitlets and causing a frontend resync of unchanged data. My initial
implementation defeated that on the sd side with copy.deepcopy(initial_sd);
fix is to return initial_sd as-is. Nothing mutated it (no ops ran), so
identity preservation is safe and matches the df side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two no-op short-circuits return df by reference rather than letting the
interpreter run and copy: one in _run_df_interpreter, one in
handle_ops_and_clean. The short-circuits are load-bearing for two
reasons, the second of which is non-obvious and easy to "clean up":

1. df.copy()/clone() churns DfTrait identity → traitlets fires →
   frontend resyncs unchanged data over the anywidget boundary.
2. During widget init, where the df/operations traits can be set in
   either order, creating fresh objects on the no-op path used to
   cascade into observer-chain infinite loops.

Comments capture both reasons so future reviewers don't strip the
short-circuit as redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The helper has been broken since Oct 2023 — get_auto_type_operations
gained `metadata_f` and `recommend_f` required args in 2325d10, but
auto_type_df2 still called it with a single argument. The signature
change in PR #755 (transform now takes initial_sd) stacks a second
TypeError on the same dead path.

No in-repo callers (zero grep hits across .py/.ipynb/.md/configs) and
the function has no tests or docs. Drop it along with the now-unused
cleaning_classes list and configure_buckaroo / get_auto_type_operations
imports; keep the SDResult re-export.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes the Python / Lint CI failure on this branch — paddy_format.py
--check flagged these four files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paddymul paddymul force-pushed the feat/jlisp-sd-channel branch from e114926 to 0ec5968 Compare May 17, 2026 16:43
@paddymul paddymul enabled auto-merge May 17, 2026 16:44
@paddymul paddymul added this pull request to the merge queue May 17, 2026
Merged via the queue into main with commit 08cc2ce May 17, 2026
27 checks passed
paddymul added a commit that referenced this pull request May 17, 2026
… channel

Mirrors #758 for the xorq backend. The xorq path doesn't go through
configure_buckaroo's lisp interpreter (ibis exprs can't .copy()), so it
can't reuse the SDResult machinery from #755 directly. Instead this
adds an analogous sd channel inside XorqAutocleaning:

- Handlers in _XORQ_OP_HANDLERS may now return either a bare expr
  (legacy) or (expr, sd_updates). _apply_xorq_ops accumulates the
  per-column sd entries across ops, merging col-by-col.
- handle_ops_and_clean runs the accumulated updates through
  _rekey_op_sd_to_internal (the same helper PandasAutocleaning uses
  since #758) so orig-named entries land on buckaroo's internal a/b/c
  letter keys and compose cleanly with the summary_sd that
  XorqDataflow._get_summary_sd produces (also keyed by letter).
- _xorq_search returns the filtered expr plus {col: {'highlight_phrase':
  [val]}} for every ibis-String column.

Uses highlight_phrase (list of literal needles) rather than
highlight_regex because ibis StringValue.contains is a literal
substring match — matching the filter semantics on the highlight side
avoids regex-metacharacter divergence.

Scope: only the search command is wired today. The sd channel itself
is generic — other ops can opt in by returning (expr, sd_updates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant