feat(polars-search): plumb search term to JS as highlight_regex via SDResult #758

Merged
paddymul merged 4 commits into main from feat/polars-search-highlight-sd-rebased on May 17, 2026

@paddymul
Collaborator

Summary

Polars Search.transform returns SDResult(filtered_df, sd_updates) so the search term flows into cleaning_sd as highlight_regex on every polars-String column. DefaultMainStyling.style_column reads highlight_phrase / highlight_regex / highlight_color off col_meta and threads them into the string displayer_args, where the JS side already renders matches as <mark>.
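
A minimal sketch of the flow described above, using plain Python containers instead of polars. The `SDResult` dataclass here is a hypothetical stand-in with the two fields named in the summary, `search_transform` is an illustrative name, and treating the search term as a regex for filtering is an assumption; the real implementations in buckaroo may differ.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SDResult:
    """Hypothetical stand-in: a transformed frame plus per-column sd updates."""
    df: dict
    sd_updates: dict = field(default_factory=dict)


def search_transform(df, term):
    """Filter rows by `term` and report a highlight hint per string column.

    `df` is a toy frame: column name -> list of values.
    """
    # Assumption: a column counts as "string" when every value is a str.
    string_cols = [c for c, vals in df.items()
                   if all(isinstance(v, str) for v in vals)]
    n_rows = len(next(iter(df.values()))) if df else 0
    pattern = re.compile(term)
    # Keep a row when any string column matches the search term.
    keep = [i for i in range(n_rows)
            if any(pattern.search(df[c][i]) for c in string_cols)]
    filtered = {c: [vals[i] for i in keep] for c, vals in df.items()}
    # The same term travels alongside the frame as column metadata.
    sd_updates = {c: {"highlight_regex": term} for c in string_cols}
    return SDResult(filtered, sd_updates)


result = search_transform(
    {"name": ["alice", "bob", "alfred"], "age": [30, 41, 25]}, "al")
print(result.df)
print(result.sd_updates)
```

The point of the second field is that the op does not need to know anything about styling: it only states which columns the term applies to, and downstream code decides how to render it.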

First concrete consumer of the SDResult channel from #755 (which superseded #744).

Replaces #745 — rebased onto main after #755 merged, with the obsolete (df, sd_updates) tuple commit dropped and Search ported to return SDResult instead.

Supporting changes

  • autocleaning.handle_ops_and_clean rekeys op-supplied sd entries from orig col names onto buckaroo's internal a/b/c letter keys, so they merge into the matching analysis entry instead of sitting alongside as orphans. Runs after make_origs (which uses the keys as column names on cleaned_df).
  • style_column falls back to obj when _type is absent, so a stray op-only entry (no analysis ran for that column) can't KeyError the styling pass.
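
The `style_column` behaviour in the second bullet can be sketched as follows. This is a simplified illustration, not the actual `DefaultMainStyling.style_column`: the return shape, the `"displayer"` key, and the `HIGHLIGHT_KEYS` constant are assumptions made for the example. Only the three highlight key names, `_type`, and the `obj` fallback come from the description above.

```python
HIGHLIGHT_KEYS = ("highlight_phrase", "highlight_regex", "highlight_color")


def style_column(col, col_meta):
    """Build a column config entry from the column's metadata dict."""
    # A stray op-only entry has no `_type`; fall back to "obj" instead of
    # raising KeyError.
    col_type = col_meta.get("_type", "obj")
    if col_type == "string":
        displayer_args = {"displayer": "string"}
        # Copy through only the highlight keys that are actually present.
        for key in HIGHLIGHT_KEYS:
            if key in col_meta:
                displayer_args[key] = col_meta[key]
    else:
        displayer_args = {"displayer": "obj"}
    return {"col_name": col, "displayer_args": displayer_args}


styled = style_column("a", {"_type": "string", "highlight_regex": "al"})
orphan = style_column("name", {"highlight_regex": "al"})  # no analysis ran
print(styled)
print(orphan)
```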

Test plan

  • pytest tests/unit/dataflow/ tests/unit/jlisp/ tests/unit/commands/polars_command_test.py — 147 pass
  • Full unit suite (excluding contrib/file_cache): 825 passed, 7 skipped
  • CI green

Tests added

  • test_sdresult_lands_in_cleaning_sd_through_handle_ops_and_clean — wiring of SDResult.sd_updates into cleaning_sd via handle_ops_and_clean
  • test_search_threads_highlight_regex_into_cleaning_sd_under_rename — Search contributes highlight_regex keyed by the renamed (a/b/c) column, not the orig name
  • test_default_main_styling_emits_highlight_regex_into_displayer_args — style_column copies highlight_regex into the string displayer_args
  • test_style_column_handles_col_meta_missing_type — fallback to obj when _type absent
  • test_search_op_delivers_highlight_regex_into_displayer_args — e2e through PolarsBuckarooInfiniteWidget into df_viewer_config.column_config
  • test_column_config_overrides_preserves_highlight_phrase_and_color — user-supplied overrides survive the merge

🤖 Generated with Claude Code

…DResult

Polars Search.transform now returns SDResult(filtered_df, sd_updates) so
the search term flows into cleaning_sd as `highlight_regex` on every
polars-String column. DefaultMainStyling.style_column reads
`highlight_phrase` / `highlight_regex` / `highlight_color` off col_meta
and threads them into the string displayer_args, where the JS-side
displayer already renders matches as <mark>.

First concrete consumer of the SDResult channel from #755.

Supporting changes:
- autocleaning.handle_ops_and_clean rekeys op-supplied sd entries from
  orig col names onto buckaroo's internal a/b/c letter keys, so they
  merge into the matching analysis entry instead of sitting alongside
  as orphans. Runs after make_origs (which uses the keys as column
  names on cleaned_df).
- style_column falls back to obj when `_type` is absent, so a stray
  op-only entry (no analysis ran for that column) can't KeyError the
  styling pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 17, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.0.dev25997510667

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.0.dev25997510667

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.0.dev25997510667" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de38f9c3f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread buckaroo/dataflow/autocleaning.py Outdated
rewrites = dict(old_col_new_col(cleaned_df))
out = {}
for col, kv in cleaning_sd.items():
    target = rewrites.get(col, col)

P2: Do not rekey existing analysis metadata

This pass rewrites every cleaning_sd key that happens to match an original column name, but the dict already contains analysis entries keyed by Buckaroo's internal names. With a frame like ['b', 'foo'] and any non-empty cleaning method, the analysis entry for internal column b (the second column) matches rewrites['b'] == 'a' here and gets merged into column a, so styling/metadata such as _type, orig_col_name, and highlights can be applied to the wrong column. The rekey needs to distinguish op-contributed original-name entries from existing rewritten-name analysis entries before applying old_col_new_col.

Collaborator Author

Good catch — fixed in f57289f. With cols ['b', 'foo'] and any non-empty cleaning method the analysis entry for internal b (orig foo) was being merged into a because rewrites['b']=='a'. Both pandas (analysis_management.py:42) and polars (polars_analysis_management.py:203) analysis pipelines set rewritten_col_name on every entry, and op-contributed SDResult entries don't — so the rekey now skips any entry carrying that marker. Failing repro test landed in 87419f2 first.
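
The collision and the marker-based fix can be reproduced with plain dicts. The sketch below is an illustration under assumptions, not the code from f57289f: the function names, the merge-by-`update` semantics, and the sample metadata values are hypothetical. The `rewritten_col_name` marker and the `['b', 'foo']` frame come from the thread above.

```python
def rekey_naive(cleaning_sd, rewrites):
    """Original behaviour: rewrite every key that matches an orig col name."""
    out = {}
    for col, kv in cleaning_sd.items():
        target = rewrites.get(col, col)
        out.setdefault(target, {}).update(kv)
    return out


def rekey_skip_marked(cleaning_sd, rewrites):
    """Fixed behaviour: entries carrying the marker are already internal."""
    out = {}
    for col, kv in cleaning_sd.items():
        if "rewritten_col_name" in kv:
            target = col                      # analysis entry, leave the key
        else:
            target = rewrites.get(col, col)   # op entry keyed by orig name
        out.setdefault(target, {}).update(kv)
    return out


# Frame columns ['b', 'foo'] map to internal keys a='b' and b='foo'.
rewrites = {"b": "a", "foo": "b"}
cleaning_sd = {
    "a": {"_type": "integer", "orig_col_name": "b", "rewritten_col_name": "a"},
    "b": {"_type": "string", "orig_col_name": "foo", "rewritten_col_name": "b"},
    "foo": {"highlight_regex": "x"},  # contributed by an op, orig-keyed
}

naive = rekey_naive(cleaning_sd, rewrites)
fixed = rekey_skip_marked(cleaning_sd, rewrites)
print(naive["a"])  # carries the second column's metadata
print(fixed["b"])  # analysis entry for 'foo' plus the highlight
```

In the naive pass the internal key `b` is mistaken for the original column `b`, so column `a` ends up describing `foo`. With the marker check, only the op-contributed `foo` entry is rekeyed, and it merges into the matching analysis entry.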

paddymul and others added 2 commits May 17, 2026 13:14
…ollide with letters

Failing test for the codex review on #758. With df cols ['b', 'foo'],
analysis assigns internal a='b' and b='foo'. _rekey_op_sd_to_internal
sees cleaning_sd['b'] (an analysis entry for orig 'foo') and rewrites
it to 'a' because rewrites['b']=='a' — merging the second column's
analysis metadata into the first column's entry and dropping the
second column's entry entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…name marker

Codex review on #758 found that _rekey_op_sd_to_internal corrupted
analysis metadata for frames where an orig col name collides with an
internal letter key. Example: cols ['b', 'foo'] → internal a='b' and
b='foo'. The pass would see cleaning_sd['b'] (the analysis entry for
orig 'foo'), look up rewrites['b']=='a', and merge it into 'a' — so
the first column ends up with the second column's _type, orig_col_name,
and any contributed highlights.

Both pandas and polars analysis pipelines set `rewritten_col_name` on
every entry (analysis_management.py:42, polars_analysis_management.py:203);
op-contributed SDResult entries do not. Skip rekeying for any entry
carrying the marker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paddymul paddymul enabled auto-merge May 17, 2026 17:18
@paddymul paddymul added this pull request to the merge queue May 17, 2026
Merged via the queue into main with commit 22fe196 May 17, 2026
26 of 27 checks passed
paddymul added a commit that referenced this pull request May 17, 2026
Fixes paddy-format --check failure on main (introduced via #758) that
was blocking CI on this branch. Single-file mechanical reformat — no
test behavior changes. Unblocks #759.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 17, 2026
…SDResult

Mirrors #758 for the pandas backend. Pandas `Search.transform` now
returns `SDResult(filtered_df, sd_updates)` so the search term flows
into `cleaning_sd` as `highlight_phrase` on every string/object column.
Together with the existing `style_column` reader (added in #758) the
phrase lands in the string `displayer_args`, where the JS-side
displayer already renders matches as `<mark>`.

Uses `highlight_phrase` (list of literal needles) rather than the
`highlight_regex` (single regex string) variant polars emits because
`search_df_str` uses `Series.str.find` — a literal substring match.
Matching the filter semantics on the highlight side avoids the case
where a search term containing regex metacharacters would filter on
literal text but try to highlight as a regex.
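
The divergence described above is easy to see with the standard library alone. This sketch uses Python's built-in `str.find` and `re.search` as stand-ins for the filter and highlight sides; the sample cells and variable names are made up for illustration.

```python
import re

term = "a.b"            # contains the regex metacharacter "."
cells = ["a.b", "axb"]

# Literal substring match, the semantics of the filter side.
literal_hits = [c for c in cells if c.find(term) != -1]

# Treating the same term as a regex: "." matches any character.
regex_hits = [c for c in cells if re.search(term, c)]

# Escaping the term restores literal semantics on a regex engine.
escaped_hits = [c for c in cells if re.search(re.escape(term), c)]

print(literal_hits)   # ['a.b']
print(regex_hits)     # ['a.b', 'axb']
print(escaped_hits)   # ['a.b']
```

Passing the term as a literal needle keeps the highlighted text identical to what the filter matched, without relying on escaping.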

The string-column detection mirrors `search_df_str`: union of
`select_dtypes("string")` and `select_dtypes("object")` columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 17, 2026
… channel

Mirrors #758 for the xorq backend. The xorq path doesn't go through
configure_buckaroo's lisp interpreter (ibis exprs can't .copy()), so it
can't reuse the SDResult machinery from #755 directly. Instead this
adds an analogous sd channel inside XorqAutocleaning:

- Handlers in _XORQ_OP_HANDLERS may now return either a bare expr
  (legacy) or (expr, sd_updates). _apply_xorq_ops accumulates the
  per-column sd entries across ops, merging col-by-col.
- handle_ops_and_clean runs the accumulated updates through
  _rekey_op_sd_to_internal (the same helper PandasAutocleaning uses
  since #758) so orig-named entries land on buckaroo's internal a/b/c
  letter keys and compose cleanly with the summary_sd that
  XorqDataflow._get_summary_sd produces (also keyed by letter).
- _xorq_search returns the filtered expr plus {col: {'highlight_phrase':
  [val]}} for every ibis-String column.

Uses highlight_phrase (list of literal needles) rather than
highlight_regex because ibis StringValue.contains is a literal
substring match — matching the filter semantics on the highlight side
avoids regex-metacharacter divergence.

Scope: only the search command is wired today. The sd channel itself
is generic — other ops can opt in by returning (expr, sd_updates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
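
The handler contract described in this commit message (bare expr or `(expr, sd_updates)`, merged col-by-col) can be sketched with toy data. Everything below is hypothetical: `apply_ops`, the handler names, and the list-of-dicts "expr" are stand-ins, not the real `_apply_xorq_ops` or ibis expressions.

```python
def search(rows, val):
    """Opt-in handler: returns the filtered rows plus sd updates."""
    first = rows[0] if rows else {}
    cols = [c for c in first if all(isinstance(r[c], str) for r in rows)]
    kept = [r for r in rows if any(val in r[c] for c in cols)]
    return kept, {c: {"highlight_phrase": [val]} for c in cols}


def head(rows, n):
    """Legacy handler: returns a bare result with no sd updates."""
    return rows[:n]


def apply_ops(expr, ops, handlers):
    """Run ops in order, accumulating per-column sd entries across ops."""
    sd = {}
    for name, arg in ops:
        result = handlers[name](expr, arg)
        # Toy exprs are lists, so a tuple unambiguously means (expr, updates).
        if isinstance(result, tuple):
            expr, updates = result
        else:
            expr, updates = result, {}
        for col, kv in updates.items():
            sd.setdefault(col, {}).update(kv)
    return expr, sd


rows = [
    {"name": "alice", "city": "paris"},
    {"name": "bob", "city": "oslo"},
    {"name": "carl", "city": "palermo"},
]
out_rows, sd = apply_ops(rows, [("search", "pa"), ("head", 1)],
                         {"search": search, "head": head})
print(out_rows)
print(sd)
```

Accepting both return shapes lets existing handlers stay untouched while new ones opt in to the sd channel one at a time.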