feat(polars-search): plumb search term to JS as highlight_regex via SDResult #758
…DResult

Polars Search.transform now returns SDResult(filtered_df, sd_updates) so the search term flows into cleaning_sd as `highlight_regex` on every polars-String column. DefaultMainStyling.style_column reads `highlight_phrase` / `highlight_regex` / `highlight_color` off col_meta and threads them into the string displayer_args, where the JS-side displayer already renders matches as <mark>.

First concrete consumer of the SDResult channel from #755.

Supporting changes:

- autocleaning.handle_ops_and_clean rekeys op-supplied sd entries from orig col names onto buckaroo's internal a/b/c letter keys, so they merge into the matching analysis entry instead of sitting alongside as orphans. Runs after make_origs (which uses the keys as column names on cleaned_df).
- style_column falls back to obj when `_type` is absent, so a stray op-only entry (no analysis ran for that column) can't KeyError the styling pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
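To make the data flow in this commit concrete, here is a minimal sketch of a search op that returns both the filtered frame and per-column sd updates. It is not buckaroo's code: `SDResultSketch` and `search_transform` are hypothetical stand-ins, the frame is a plain dict of column lists instead of a polars DataFrame, and the filter is assumed to be a regex match.

```python
import re
from typing import NamedTuple


class SDResultSketch(NamedTuple):
    """Hypothetical stand-in for SDResult: a frame plus per-column sd updates."""
    df: dict          # column name -> list of values (stand-in for a DataFrame)
    sd_updates: dict  # column name -> metadata contributed by the op


def search_transform(df, term):
    # Assumption: only columns whose values are all strings take part in search.
    str_cols = [c for c, vals in df.items() if all(isinstance(v, str) for v in vals)]
    n_rows = len(next(iter(df.values()), []))
    # Keep a row when any string column matches the term as a regex.
    keep = [i for i in range(n_rows) if any(re.search(term, df[c][i]) for c in str_cols)]
    filtered = {c: [vals[i] for i in keep] for c, vals in df.items()}
    # The same term is reported back so the styling pass can highlight matches.
    sd_updates = {c: {"highlight_regex": term} for c in str_cols}
    return SDResultSketch(filtered, sd_updates)


result = search_transform({"name": ["alice", "bob"], "age": [30, 25]}, "^al")
```

In this sketch `result.sd_updates` is keyed by the original column name, which is why the real pipeline needs the rekey step onto internal letter keys.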
📦 TestPyPI package published

    pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.0.dev25997510667

or with uv:

    uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.0.dev25997510667

MCP server for Claude Code

    claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.0.dev25997510667" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de38f9c3f3
    rewrites = dict(old_col_new_col(cleaned_df))
    out = {}
    for col, kv in cleaning_sd.items():
        target = rewrites.get(col, col)
Do not rekey existing analysis metadata
This pass rewrites every cleaning_sd key that happens to match an original column name, but the dict already contains analysis entries keyed by Buckaroo's internal names. With a frame like ['b', 'foo'] and any non-empty cleaning method, the analysis entry for internal column b (the second column) matches rewrites['b'] == 'a' here and gets merged into column a, so styling/metadata such as _type, orig_col_name, and highlights can be applied to the wrong column. The rekey needs to distinguish op-contributed original-name entries from existing rewritten-name analysis entries before applying old_col_new_col.
Good catch — fixed in f57289f. With cols ['b', 'foo'] and any non-empty cleaning method the analysis entry for internal b (orig foo) was being merged into a because rewrites['b']=='a'. Both pandas (analysis_management.py:42) and polars (polars_analysis_management.py:203) analysis pipelines set rewritten_col_name on every entry, and op-contributed SDResult entries don't — so the rekey now skips any entry carrying that marker. Failing repro test landed in 87419f2 first.
…ollide with letters

Failing test for the codex review on #758. With df cols ['b', 'foo'], analysis assigns internal a='b' and b='foo'. _rekey_op_sd_to_internal sees cleaning_sd['b'] (an analysis entry for orig 'foo') and rewrites it to 'a' because rewrites['b']=='a', merging the second column's analysis metadata into the first column's entry and dropping the second column's entry entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…name marker

Codex review on #758 found that _rekey_op_sd_to_internal corrupted analysis metadata for frames where an orig col name collides with an internal letter key. Example: cols ['b', 'foo'] → internal a='b' and b='foo'. The pass would see cleaning_sd['b'] (the analysis entry for orig 'foo'), look up rewrites['b']=='a', and merge it into 'a', so the first column ends up with the second column's _type, orig_col_name, and any contributed highlights.

Both pandas and polars analysis pipelines set `rewritten_col_name` on every entry (analysis_management.py:42, polars_analysis_management.py:203); op-contributed SDResult entries do not. Skip rekeying for any entry carrying the marker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
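The marker-based skip described in this commit can be sketched as below. This is an illustration under assumptions, not the code from the PR: `rekey_op_sd_to_internal` here takes the `rewrites` mapping directly (the real helper derives it from `cleaned_df`), and the merge order for conflicting keys is a guess.

```python
def rekey_op_sd_to_internal(cleaning_sd, rewrites):
    """Move op-contributed entries from orig col names onto internal keys.

    Entries carrying `rewritten_col_name` come from the analysis pipeline and
    are already keyed by internal name, so they are left where they are.
    """
    out = {}
    for col, kv in cleaning_sd.items():
        if "rewritten_col_name" in kv:
            target = col                     # analysis entry: do not rekey
        else:
            target = rewrites.get(col, col)  # op entry: orig name -> internal
        out.setdefault(target, {}).update(kv)
    return out


# Frame columns ['b', 'foo'] map to internal keys a and b.
rewrites = {"b": "a", "foo": "b"}
cleaning_sd = {
    "a": {"_type": "string", "orig_col_name": "b", "rewritten_col_name": "a"},
    "b": {"_type": "integer", "orig_col_name": "foo", "rewritten_col_name": "b"},
    "foo": {"highlight_regex": "needle"},  # op-contributed, keyed by orig name
}
rekeyed = rekey_op_sd_to_internal(cleaning_sd, rewrites)
```

Without the marker check, the analysis entry under internal key `b` would be looked up in `rewrites` as if it were the original column `b` and folded into `a`.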
…SDResult

Mirrors #758 for the pandas backend. Pandas `Search.transform` now returns `SDResult(filtered_df, sd_updates)` so the search term flows into `cleaning_sd` as `highlight_phrase` on every string/object column. Together with the existing `style_column` reader (added in #758) the phrase lands in the string `displayer_args`, where the JS-side displayer already renders matches as `<mark>`.

Uses `highlight_phrase` (list of literal needles) rather than the `highlight_regex` (single regex string) variant polars emits because `search_df_str` uses `Series.str.find`, a literal substring match. Matching the filter semantics on the highlight side avoids the case where a search term containing regex metacharacters would filter on literal text but try to highlight as a regex.

The string-column detection mirrors `search_df_str`: union of `select_dtypes("string")` and `select_dtypes("object")` columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
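The divergence this commit avoids is easy to reproduce with the standard library alone. The sketch below uses plain `str.find` and `re` rather than pandas or the JS displayer; `literal_spans` and `regex_spans` are illustrative helper names, not buckaroo functions.

```python
import re


def literal_spans(text, needle):
    """Spans where `needle` occurs as a literal substring."""
    if not needle:
        return []
    spans = []
    start = text.find(needle)
    while start != -1:
        spans.append((start, start + len(needle)))
        start = text.find(needle, start + len(needle))
    return spans


def regex_spans(text, pattern):
    """Spans where `pattern` matches when interpreted as a regex."""
    return [m.span() for m in re.finditer(pattern, text)]


term = "a.b"
text = "a.b and axb"

as_literal = literal_spans(text, term)          # [(0, 3)]
as_regex = regex_spans(text, term)              # [(0, 3), (8, 11)]: "." also matches "x"
as_escaped = regex_spans(text, re.escape(term)) # [(0, 3)]
```

A filter built on literal matching would keep rows for `a.b` only, while a regex highlighter fed the same raw term would also mark `axb`. Passing literal needles, or escaping the term, keeps the two sides in agreement.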
… channel

Mirrors #758 for the xorq backend. The xorq path doesn't go through configure_buckaroo's lisp interpreter (ibis exprs can't .copy()), so it can't reuse the SDResult machinery from #755 directly. Instead this adds an analogous sd channel inside XorqAutocleaning:

- Handlers in _XORQ_OP_HANDLERS may now return either a bare expr (legacy) or (expr, sd_updates). _apply_xorq_ops accumulates the per-column sd entries across ops, merging col-by-col.
- handle_ops_and_clean runs the accumulated updates through _rekey_op_sd_to_internal (the same helper PandasAutocleaning uses since #758) so orig-named entries land on buckaroo's internal a/b/c letter keys and compose cleanly with the summary_sd that XorqDataflow._get_summary_sd produces (also keyed by letter).
- _xorq_search returns the filtered expr plus {col: {'highlight_phrase': [val]}} for every ibis-String column. Uses highlight_phrase (list of literal needles) rather than highlight_regex because ibis StringValue.contains is a literal substring match; matching the filter semantics on the highlight side avoids regex-metacharacter divergence.

Scope: only the search command is wired today. The sd channel itself is generic: other ops can opt in by returning (expr, sd_updates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
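The handler contract described in this commit (bare expr or `(expr, sd_updates)`, merged col-by-col) can be sketched without ibis or xorq. Everything here is a stand-in: the "expr" is a list of row dicts, and `apply_ops`, `search`, and `head` are hypothetical names, not the functions in the PR.

```python
def apply_ops(expr, ops, handlers):
    """Run ops in order, accumulating per-column sd updates from opted-in handlers."""
    sd_updates = {}
    for op_name, args in ops:
        result = handlers[op_name](expr, *args)
        # Opted-in handlers return (expr, sd_updates); legacy ones return a bare expr.
        # Assumption: a bare expr is never itself a tuple.
        if isinstance(result, tuple):
            expr, op_sd = result
        else:
            expr, op_sd = result, {}
        for col, kv in op_sd.items():
            sd_updates.setdefault(col, {}).update(kv)  # merge col-by-col
    return expr, sd_updates


def search(rows, val):
    # Literal substring filter over string-valued columns, plus a literal highlight.
    str_cols = [c for c, v in rows[0].items() if isinstance(v, str)] if rows else []
    kept = [r for r in rows if any(val in r[c] for c in str_cols)]
    return kept, {c: {"highlight_phrase": [val]} for c in str_cols}


def head(rows, n):
    return rows[:n]  # legacy-style handler: bare result, no sd updates


rows = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "alina", "age": 41},
]
out, sd = apply_ops(rows, [("search", ("al",)), ("head", (1,))],
                    {"search": search, "head": head})
```

Because the updates are still keyed by original column name at this point, a rekey step onto internal keys would follow before they are merged with the summary sd.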
Summary

Polars `Search.transform` returns `SDResult(filtered_df, sd_updates)` so the search term flows into `cleaning_sd` as `highlight_regex` on every polars-String column. `DefaultMainStyling.style_column` reads `highlight_phrase` / `highlight_regex` / `highlight_color` off `col_meta` and threads them into the string `displayer_args`, where the JS side already renders matches as `<mark>`.

First concrete consumer of the SDResult channel from #755 (which superseded #744).

Replaces #745: rebased onto main after #755 merged, with the obsolete `(df, sd_updates)` tuple commit dropped and `Search` ported to return `SDResult` instead.

Supporting changes

- `autocleaning.handle_ops_and_clean` rekeys op-supplied sd entries from orig col names onto buckaroo's internal a/b/c letter keys, so they merge into the matching analysis entry instead of sitting alongside as orphans. Runs after `make_origs` (which uses the keys as column names on `cleaned_df`).
- `style_column` falls back to `obj` when `_type` is absent, so a stray op-only entry (no analysis ran for that column) can't `KeyError` the styling pass.

Test plan

- `pytest tests/unit/dataflow/ tests/unit/jlisp/ tests/unit/commands/polars_command_test.py`: 147 pass

Tests added

- `test_sdresult_lands_in_cleaning_sd_through_handle_ops_and_clean`: wiring of `SDResult.sd_updates` into `cleaning_sd` via `handle_ops_and_clean`
- `test_search_threads_highlight_regex_into_cleaning_sd_under_rename`: Search contributes `highlight_regex` keyed by the renamed (a/b/c) column, not the orig name
- `test_default_main_styling_emits_highlight_regex_into_displayer_args`: `style_column` copies `highlight_regex` into the string `displayer_args`
- `test_style_column_handles_col_meta_missing_type`: fallback to `obj` when `_type` absent
- `test_search_op_delivers_highlight_regex_into_displayer_args`: e2e through `PolarsBuckarooInfiniteWidget` into `df_viewer_config.column_config`
- `test_column_config_overrides_preserves_highlight_phrase_and_color`: user-supplied overrides survive the merge

🤖 Generated with Claude Code
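As a rough illustration of the `style_column` behaviour listed under Supporting changes, the sketch below copies the highlight keys into the string `displayer_args` and treats a missing `_type` as `obj`. The function name `sketch_style_column`, the returned dict shape, and the displayer names `"string"` and `"obj"` are assumptions made for the example, not taken from buckaroo's source.

```python
HIGHLIGHT_KEYS = ("highlight_phrase", "highlight_regex", "highlight_color")


def sketch_style_column(col, col_meta):
    # Fall back to "obj" when no analysis entry supplied `_type`, so an
    # op-only entry cannot raise KeyError here.
    col_type = col_meta.get("_type", "obj")
    if col_type == "string":
        displayer_args = {"displayer": "string"}
        # Thread through only the highlight keys that are actually present.
        for key in HIGHLIGHT_KEYS:
            if key in col_meta:
                displayer_args[key] = col_meta[key]
    else:
        displayer_args = {"displayer": "obj"}
    return {"col_name": col, "displayer_args": displayer_args}


styled = sketch_style_column(
    "a", {"_type": "string", "highlight_regex": "foo", "highlight_color": "yellow"}
)
orphan = sketch_style_column("b", {"highlight_regex": "foo"})  # no `_type`
```

In this sketch the orphan entry simply gets the generic displayer; whether highlights should still apply in that case is a design choice the sketch does not settle.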