feat(xorq): XorqBuckarooWidget — buckaroo against an ibis/xorq expression #703
Merged
📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25438596086

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25438596086

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.5.dev25438596086" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview · 🎨 Storybook preview
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 202c5e44e0
paddymul added a commit that referenced this pull request on May 6, 2026
Subprocess + meta-path blocker — importing ``buckaroo.xorq_buckaroo`` fails fast if polars (or any polars-only buckaroo module) gets pulled in transitively, and vice versa for the polars side. Catches the regression Codex flagged on PR #703 (NoCleaningConfPl import) before it reaches a ``buckaroo[xorq]`` install with no polars. Both extras are optional; cross-imports turn one extra into a hard requirement for the other and break the advertised install paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
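The meta-path blocker idea can be sketched with a custom finder on ``sys.meta_path`` — a minimal sketch, assuming a made-up forbidden module name; the real blocker in this commit may differ:

```python
import importlib.abc
import sys


class ForbiddenImportBlocker(importlib.abc.MetaPathFinder):
    """Fail fast if a forbidden top-level module is imported, even
    transitively: the finder runs before any other import machinery."""

    def __init__(self, forbidden: str):
        self.forbidden = forbidden

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.forbidden or fullname.startswith(self.forbidden + "."):
            raise ImportError(
                f"{self.forbidden} must not be imported in this environment"
            )
        return None  # not forbidden: defer to the normal finders


# 'polars_stub_example' is a hypothetical name for illustration only
blocker = ForbiddenImportBlocker("polars_stub_example")
sys.meta_path.insert(0, blocker)
```

Inserting at position 0 is what makes the block reliable: the finder sees every import, including ones buried inside a transitive dependency, before the default path finders get a chance to resolve them.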
paddymul commented on May 6, 2026
paddymul added a commit that referenced this pull request on May 6, 2026
Three checks were red, all from the last test push:
- JS / Build + Test, Build JS + Python Wheel, ReadTheDocs (all run
``tsc``): xorqWindow.dom.test.tsx used ``{ displayer: "float" }``
which is missing the required ``min_fraction_digits`` /
``max_fraction_digits`` from FloatDisplayerA. Add them.
- Python / Lint (ruff F841): the multi-stage step test bound
``df = pd.read_parquet(...)`` but never used it. Renamed to
``_bufs`` and dropped the call.
Verified locally: ruff clean, tsc -b passes, vite build succeeds,
36 python tests + 2 dom tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion

Stats push down via XorqDfStatsV2 (one batched aggregate + per-column histogram queries); only a display-sized sample lands in pandas. add_processing takes an expression-to-expression function so postprocessing also pushes down — filtered stats reflect the filtered expr without re-materialising.

- buckaroo/xorq_buckaroo.py: XorqBuckarooWidget plus XorqSampling (limit+execute), XorqAutocleaning (no-op; the lisp interpreter targets pandas), and XorqDataflow (count() for df_meta, rewrites SD keys orig→a/b/c since XorqStatPipeline keeps orig names but the frontend works in rewritten space).
- buckaroo/buckaroo_widget.py: dataflow_klass class attribute on BuckarooWidgetBase so the inner dataflow can be swapped without duplicating __init__. Pandas widget unchanged.
- tests/unit/test_xorq_buckaroo_widget.py: instantiation, df_meta from count(), column rewrite, push-down filter through add_processing, switch-back, error-frame fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
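The orig→a/b/c key rewrite can be sketched in plain Python — a hypothetical helper, assuming a purely positional mapping; ``XorqDataflow``'s actual rewrite may differ:

```python
def rewrite_sd_keys(summary_dict: dict) -> dict:
    """Rewrite summary-dict keys from original column names to the
    frontend's rewritten a, b, c ... space.

    Sketch only: assumes the rewritten names are assigned positionally
    ('a' for the first column, 'b' for the second, 'aa' after 'z').
    """
    rewritten = {}
    for i, (orig_col, stats) in enumerate(summary_dict.items()):
        # spreadsheet-style base-26 name: 0 -> 'a', 25 -> 'z', 26 -> 'aa'
        name, n = "", i
        while True:
            name = chr(ord("a") + n % 26) + name
            n = n // 26 - 1
            if n < 0:
                break
        rewritten[name] = stats
    return rewritten
```

Because stats keep their original names inside XorqStatPipeline, this rewrite is the single seam where the backend's naming meets the frontend's rewritten space.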
XorqBuckarooInfiniteWidget pushes pagination down to the backend. Each infinite_request slices via expr.limit(end - start, offset=start).execute(); sort fires expr.order_by(asc/desc).limit(...) on the backend. df_data_dict.main stays empty — pagination drives loading. The total length in each response is one expr.count().execute(); the small-window pandas frame is what crosses the parquet boundary.

Search is wired through quick_command_args. XorqSearch carries the 'search' lisp symbol and the frontend quick-args pattern, but the actual filter dispatch happens directly in XorqAutocleaning — the lisp interpreter built by configure_buckaroo bakes in df.copy()/df.clone(), neither of which works on immutable ibis exprs. _xorq_search builds an OR of contains() across all string columns; an empty value clears the filter. NoCleaningConfXorq replaces NoCleaningConfPl in the autocleaning config so command symbols are registered for the frontend.

Tests: 9 new (3 search behaviours, 6 infinite widget — slice, sort pushdown, second_request piggyback, search-then-paginate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
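The limit/offset pagination pattern can be sketched with a toy lazy expression standing in for a real ibis expr — a minimal sketch, not the widget's actual code:

```python
class FakeExpr:
    """Toy stand-in for a lazy ibis expression over an in-memory list.

    Nothing runs until .execute(); limit() just records the window.
    """

    def __init__(self, rows, n=None, offset=0):
        self._rows, self._n, self._offset = rows, n, offset

    def limit(self, n, offset=0):
        # immutable: return a new expr carrying the window
        return FakeExpr(self._rows, n, offset)

    def execute(self):
        # only the bounded window is ever materialised
        stop = None if self._n is None else self._offset + self._n
        return self._rows[self._offset : stop]


def handle_infinite_request(expr, start, end):
    # mirrors expr.limit(end - start, offset=start).execute()
    return expr.limit(end - start, offset=start).execute()


rows = list(range(1000))
window = handle_infinite_request(FakeExpr(rows), 300, 350)
```

The point the commit makes is visible in the sketch: a scroll-jump to [300, 350) never touches rows outside that window, because the slice parameters travel inside the expression rather than slicing an already-materialised frame.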
Run paddy-format on the two test files added on this branch — it collapses multiline literals with trailing commas to a single line. The CI lint job was failing for these two files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to address paddymul's note that it wasn't obvious that the expression stays lazy until a *bounded* execute on a small slice:

- Inline the slicing inside _handle_payload_args. Sort and slice flow through one branch; the explicit ``expr.limit(end - start, offset=start).execute()`` lives in _execute_window (renamed from _sliced_pandas) with a docstring spelling out that this is the only place the expr is materialised, and only over rows [start, end).
- Refuse a bare expr.execute() in XorqSampling.serialize_sample. Previously the ``serialize_limit == 0`` else branch would have done ``df_or_expr.execute()`` — unreachable in practice (the basic widget has serialize_limit=5_000; the infinite widget skips this path via skip_main_serial=True), but exactly the kind of thing the reviewer was guarding against. Now an unbounded serialize_limit raises RuntimeError instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
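The refusal described in the second bullet can be sketched as a free function — hypothetical signature; the real ``XorqSampling.serialize_sample`` may be shaped differently:

```python
def serialize_sample(df_or_expr, serialize_limit: int):
    """Sketch of the bounded-execute guard.

    An unbounded serialize_limit would imply a bare .execute() that
    materialises the whole expression, so refuse it loudly instead of
    silently pulling every row into pandas.
    """
    if serialize_limit <= 0:
        raise RuntimeError(
            "refusing unbounded expr.execute(); set a positive serialize_limit"
        )
    # only ever a bounded window crosses into pandas
    return df_or_expr.limit(serialize_limit).execute()
```

Raising rather than silently executing turns the "unreachable in practice" branch into an explicit invariant that a future refactor cannot reintroduce by accident.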
Mirror the polars infinite widget's wire path: arrow → parquet, no
pandas detour. The ibis branch of _window_to_parquet (renamed from
_execute_window) goes:
expr.select(rename).limit(end - start, offset=start)
.to_pyarrow() -> pa.Table -> pq.write_table
instead of `.execute()` (pandas DataFrame) followed by fastparquet.
The pandas branch (postprocessor materialised, error frame) still
flows through fastparquet — there's no expression to push down.
Side benefit of the arrow path: string columns travel as native
parquet UTF8 instead of fastparquet's JSON-encoded strings.
Tests around the infinite behaviour — went from 18 to 35:
- TestInfiniteWindowEdges: partial-at-end window, past-end empty,
index-column carries absolute offsets across pages.
- TestInfiniteSort: descending, by string column, with non-zero
offset, second_request skipped when sorted.
- TestInfinitePostprocessing: filter-returning processor stays
push-down; pandas-returning processor takes the in-process branch.
- TestInfiniteErrorPath: bad sort column emits error_info + length=0.
- TestLazyPostprocessor: postprocessor body runs without triggering
any materialisation (pandas or arrow); paginates with bounded plan.
- TestInfiniteBoundedExecution: the central PR-review invariant —
the only ops materialised are CountStar (aggregate) and Limit
(windowed slice), never InMemoryTable. Includes a no-pandas-
detour spy that confirms the row window goes through to_pyarrow,
never .execute().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two separate _handle_payload_args calls — [0, 50) then a jump to [300, 350) — confirm the infinite widget handles scroll-jumps without materialising rows outside either window. Each request emits exactly one CountStar (total) and one Limit (slice); InMemoryTable never appears in the spy log. This exercises the contract a frontend would hit when the user scrolls past the loaded buffer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The xorq infinite widget switched to arrow-direct serialization (expr.to_pyarrow() -> pyarrow.parquet.write_table) instead of pandas / fastparquet. The wire format changes — most notably string columns travel as native parquet UTF8 instead of fastparquet's JSON-encoded form — so we need a contract test that the JS frontend's hyparquet reader handles the new bytes.

Pattern (mirrors the existing summary_stats_parquet_b64 fixture):
- scripts/generate_xorq_window_fixture.py — emits the fixture by running XorqBuckarooInfiniteWidget._handle_payload_args on a small deterministic dataset and base64-encoding the parquet response.
- tests/unit/test_xorq_window_fixture.py — re-runs the generator in memory and diffs every field (including the parquet bytes) against the committed fixture, so drift between Python output and the JS fixture surfaces in CI.
- packages/.../xorqWindow.test.ts — uses hyparquet to decode the fixture and asserts: row count, schema in the rewritten 'a, b, index' space, strings come through native UTF8 (not JSON-quoted), and the index column carries absolute window offsets.

Run scripts/generate_xorq_window_fixture.py to refresh the fixture when the wire format intentionally changes; the drift test will tell you if you forgot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
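The drift check can be sketched as follows — hypothetical fixture shape and helper names, not the repository's actual generator:

```python
import base64


def make_fixture(payload: bytes, meta: dict) -> dict:
    # hypothetical fixture shape: parquet bytes base64-encoded
    # alongside whatever metadata fields the response carries
    return {"parquet_b64": base64.b64encode(payload).decode("ascii"), **meta}


def assert_no_drift(committed: dict, regenerated: dict) -> None:
    # diff every field, including the encoded parquet bytes, so any
    # wire-format change surfaces in CI rather than in the frontend
    for key in committed.keys() | regenerated.keys():
        if committed.get(key) != regenerated.get(key):
            raise AssertionError(f"fixture drift in field {key!r}")
```

The committed copy doubles as the JS test's input, so a single regeneration script keeps the Python producer and the hyparquet consumer looking at identical bytes.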
xorqWindow.test.ts proved the parquet bytes are decodable; this test
adds the next layer: the decoded rows actually reach AgGridReact's
rowData with the right values. End-to-end:
parquet bytes → hyparquet decode → DFViewerInfinite (Raw mode)
→ AgGridReact rowData
Mocks AgGridReact (the same pattern DFViewerInfinite.test.tsx uses;
real AG-Grid uses virtual scrolling + canvas, which jsdom can't render
cleanly) and inspects ``latestAgGridProps.gridOptions.rowData`` — the
surface where 'what will be rendered into the grid DOM' is captured.
Asserts:
- row count matches the request window;
- index column carries absolute window offsets;
- name strings come through native UTF8 (no JSON quotes — proves the
arrow-direct serialization survives the full pipeline);
- price column is numeric (not strings);
- column_defs' ``field`` keys match what's in rowData (otherwise
AG-Grid would render blank cells).
Sanitises BigInt → Number on decode, mirroring resolveDFData's
parseParquetRow. Hyparquet decodes parquet INT64 as JS BigInt, and
DFViewerInfinite's Raw path JSON-stringifies rowData for a memo
signature; without sanitising, the test crashes on a BigInt that
production never encounters (production uses DataSource mode).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end check that the arrow-direct serialization (Python) lines up
with hyparquet decoding (JS) and lands real values in real AG-Grid
cells. Loads a 5000-row XorqBuckarooInfiniteWidget into JupyterLab,
exercises three scenarios:
1. Initial render — row 0's str_col cell ("foo_10") is visible and
matches the predictable invariant
``str_col == "foo_{row_num + 10}"``.
2. Deep scroll — scrolls ~30000px, waits for the follow-up
infinite_request to round-trip, then reads every visible body row
(filtered to ``[row-index]`` so the pinned dtype row doesn't leak
in) and asserts the same invariant. The min visible row_num must be
> 500 — proves we genuinely scrolled rather than re-rendering the
top.
3. Jump back to top — after a deep scroll, scrollTop=0 and verify
row 0 is rendered again (cache hit or re-fetch, both fine).
Test setup follows the existing infinite-scroll-transcript.spec.ts
pattern: notebook lives in tests/integration_notebooks/, the runner
script copies it into the JupyterLab cwd, and the spec file is
matched by playwright.config.integration.ts. The script's
notebook→spec routing now sends test_xorq_infinite_scroll.ipynb to
xorq-infinite-scroll.spec.ts with a 60s timeout (xorq + DataFusion
startup is heavier than polars).
Verified locally: 3/3 sub-tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks through the regular and infinite-scroll variants against citibike parquet (100k rows) and a 1.2M-row NumPy-generated frame, plus a lazy postprocessing demo via expr.filter(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16711c9 to
28db376
Summary
``XorqBuckarooWidget(expr)`` — buckaroo over a xorq/ibis expression. Stats push down through ``XorqDfStatsV2`` (one batched aggregate + per-column histogram queries); only a display-sized sample lands in pandas via ``expr.limit(N).execute()``. ``add_processing(fn)`` takes an expression-to-expression function so postprocessing also pushes down — filtered stats reflect the filtered expression without re-materialising. ``BuckarooWidgetBase`` gains a ``dataflow_klass`` class attribute so the inner dataflow can be swapped without duplicating ``__init__``. Pandas widget unchanged.

What's new in buckaroo/xorq_buckaroo.py
- ``XorqSampling`` — ``serialize_sample`` does ``expr.limit().execute()``; pre-stats sampling is a no-op (the point is push-down).
- ``XorqAutocleaning`` — pass-through. The lisp interpreter targets pandas DataFrames; for ibis exprs we skip cleaning entirely.
- ``XorqDataflow`` — overrides ``populate_df_meta`` to use ``expr.count().execute()`` instead of ``len()``, and rewrites the summary-dict keys from original column names (what ``XorqStatPipeline`` produces) to the rewritten ``a, b, c`` space the frontend works in.
- ``XorqBuckarooWidget`` — wires the above together with ``XORQ_STATS_V2`` + the default styling klasses, and exposes ``add_processing(expr_func)``.

Test plan
tests/unit/test_xorq_buckaroo_widget.py covers instantiation, ``df_meta`` from ``count()``, column rewrite, push-down filter via ``add_processing``, switching back to no-op processing, and the error-frame fallback when a postprocessor raises.

``XorqBuckarooWidget(memtable).add_processing(lambda e: e.filter(e.qty > 2))`` → 4-row filtered display; stats min/max reflect the filtered expr (9.9 / 50.0 vs 7.4 / 50.0 unfiltered).

🤖 Generated with Claude Code
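The ``add_processing`` contract — an expression-to-expression function over an immutable expr — can be sketched without ibis; toy stand-ins and hypothetical data follow:

```python
class ToyExpr:
    """Stand-in for an immutable ibis expression over row dicts."""

    def __init__(self, rows, preds=()):
        self._rows, self._preds = rows, tuple(preds)

    def filter(self, pred):
        # returns a NEW expr; the original is never mutated
        return ToyExpr(self._rows, self._preds + (pred,))

    def execute(self):
        out = self._rows
        for p in self._preds:
            out = [r for r in out if p(r)]
        return out


class ToyWidget:
    def __init__(self, expr):
        self.expr = expr
        self.post = lambda e: e  # default: no-op expression-to-expression fn

    def add_processing(self, fn):
        # fn maps expr -> expr, so display AND stats both see the
        # filtered expression — nothing is re-materialised
        self.post = fn

    def displayed(self):
        return self.post(self.expr).execute()


rows = [{"qty": q, "price": p}
        for q, p in [(1, 7.4), (3, 9.9), (5, 50.0), (2, 8.0), (4, 12.0)]]
w = ToyWidget(ToyExpr(rows))
w.add_processing(lambda e: e.filter(lambda r: r["qty"] > 2))
```

Because the postprocessor returns a new expression rather than a frame, swapping it back to the no-op restores the unfiltered view with no state to undo — the "switch-back" behaviour the test plan covers.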