Fix BuckarooCompare for arbitrary join keys#589
Conversation
Extracts col_join_dfs into buckaroo/compare.py and fixes: - Hardcoded m_df['a'] membership logic replaced with pd.merge indicator - join_columns now consistently treated as list[str] end-to-end - Diff stats computed from key-aligned merged rows instead of positional - column_config_overrides applied per join column (not list as dict key) Closes #588 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90eb5eae4f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
buckaroo/compare.py
Outdated
|
|
||
| df2_suffix = "|df2" | ||
| for col in df1.columns: | ||
| if df2_suffix in col: |
There was a problem hiding this comment.
Handle non-string column labels in sentinel validation
The sentinel-name guard calls if df2_suffix in col on every column label, but pandas columns can be non-strings (for example integers or tuples). In those cases this expression raises a TypeError before the merge runs, so otherwise valid comparisons (including integer-labeled join keys) fail immediately. Because this function is now the reusable compare entrypoint, the sentinel check should only do substring matching for string labels (or safely stringify labels first).
Useful? React with 👍 / 👎.
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22417904063or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22417904063MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.12.dev22417904063" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90eb5eae4f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Guard sentinel check against non-string column labels (skip substring match for int/tuple column names) - Validate join key uniqueness to prevent cartesian explosion from duplicate keys - Use custom indicator column name (__buckaroo_merge) instead of default _merge to avoid conflicts with user data - Handle pandas coercing non-string column labels to strings after merge with suffixes - Add tests for all three review items Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
col_join_dfsfrom the marimo demo notebook intobuckaroo/compare.pyas a reusable, tested modulem_df['a']membership logic — now usespd.merge(indicator=True)to work with any join key namejoin_columnshandling — now alwayslist[str]end-to-end (no morejoin_columns[0]unwrap)column_config_overrides[join_columns](was passing list as dict key) — now applies config per join columnTest plan
tests/unit/compare_test.pycovering:ajoin keyaccount_id,as_of_date)join_columnsnormalizationCloses #588
🤖 Generated with Claude Code