Fix BuckarooCompare for arbitrary join keys #589

Merged
paddymul merged 3 commits into main from fix/compare-generic-join-keys on Feb 25, 2026
@paddymul (Collaborator)

Summary

  • Extracts col_join_dfs from the marimo demo notebook into buckaroo/compare.py as a reusable, tested module
  • Fixes hardcoded m_df['a'] membership logic — now uses pd.merge(indicator=True) to work with any join key name
  • Fixes inconsistent join_columns handling — now always list[str] end-to-end (no more join_columns[0] unwrap)
  • Computes diff stats from key-aligned merged rows instead of positional comparison
  • Fixes column_config_overrides[join_columns] (was passing list as dict key) — now applies config per join column
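
The membership and key-aligned diff points above can be illustrated with a minimal sketch. The helper names `normalize_join_columns` and `membership_merge` are hypothetical and are not taken from buckaroo/compare.py; only `pd.merge(indicator=True)` and the `|df2` suffix come from this PR, and the sample data is made up.

```python
import pandas as pd


def normalize_join_columns(join_columns):
    """Accept one column name or a sequence of names; always return list[str]."""
    if isinstance(join_columns, str):
        return [join_columns]
    return list(join_columns)


def membership_merge(df1, df2, join_columns, how="outer"):
    """Merge on arbitrary keys; the `_merge` column records row membership."""
    keys = normalize_join_columns(join_columns)
    return pd.merge(
        df1, df2, on=keys, how=how,
        suffixes=(None, "|df2"),  # leave df1 names as-is, tag df2 names
        indicator=True,           # adds `_merge`: left_only / right_only / both
    )


# Same keys in a different row order, plus one row unique to each side.
df1 = pd.DataFrame({"account_id": [1, 2, 3], "balance": [10, 20, 30]})
df2 = pd.DataFrame({"account_id": [3, 2, 4], "balance": [30, 25, 40]})

merged = membership_merge(df1, df2, "account_id")

# Diff stats are taken from rows present on both sides, aligned by key.
both = merged[merged["_merge"] == "both"]
changed = int((both["balance"] != both["balance|df2"]).sum())
print(merged)
print("changed balances among shared keys:", changed)
```

A positional comparison of the two `balance` columns would flag every row here, while the key-aligned comparison flags only the row whose value actually changed.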

Test plan

  • 9 new tests in tests/unit/compare_test.py covering:
    • Non-a join key
    • Multi-key joins (account_id, as_of_date)
    • Outer join membership (df1-only, df2-only, both)
    • Reordered rows (key-aligned diff correctness)
    • One-sided extra columns
    • String join_columns normalization
    • Sentinel column name rejection
    • Inner join
    • Null-heavy comparisons
  • All 9 tests pass locally
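
One reason null-heavy comparisons deserve a dedicated test is that `NaN != NaN` evaluates to True in pandas, so a naive inequality counts matching nulls as differences. The sketch below shows a null-aware count; `count_diffs` is a hypothetical helper, and how buckaroo/compare.py treats nulls is not shown in this PR page.

```python
import numpy as np
import pandas as pd


def count_diffs(left, right):
    """Count positions that differ, treating null-vs-null as equal."""
    both_null = left.isna() & right.isna()
    equal = (left == right) | both_null
    return int((~equal).sum())


left = pd.Series([1.0, np.nan, 3.0, np.nan])
right = pd.Series([1.0, np.nan, 4.0, 5.0])

naive = int((left != right).sum())  # also counts the NaN/NaN pair
aware = count_diffs(left, right)    # counts only real changes
print(naive, aware)
```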

Closes #588

🤖 Generated with Claude Code

Extracts col_join_dfs into buckaroo/compare.py and fixes:
- Hardcoded m_df['a'] membership logic replaced with pd.merge indicator
- join_columns now consistently treated as list[str] end-to-end
- Diff stats computed from key-aligned merged rows instead of positional
- column_config_overrides applied per join column (not list as dict key)

Closes #588

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
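
The last fix in this commit message can be reproduced in isolation: a Python list is unhashable, so using `join_columns` directly as a dictionary key raises a TypeError. In this sketch the contents of `key_config` are a made-up placeholder, not buckaroo's actual column config schema.

```python
join_columns = ["account_id", "as_of_date"]
key_config = {"note": "placeholder config for a join column"}  # hypothetical

column_config_overrides = {}

# Old pattern: the whole list is used as one dict key, which fails.
try:
    column_config_overrides[join_columns] = key_config
except TypeError as exc:
    error = str(exc)
    print("list as dict key failed:", error)

# Fixed pattern: one entry per join column.
for col in join_columns:
    column_config_overrides[col] = dict(key_config)

print(column_config_overrides)
```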
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90eb5eae4f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


df2_suffix = "|df2"
for col in df1.columns:
    if df2_suffix in col:

P2: Handle non-string column labels in sentinel validation

The sentinel-name guard calls if df2_suffix in col on every column label, but pandas columns can be non-strings (for example integers or tuples). In those cases this expression raises a TypeError before the merge runs, so otherwise valid comparisons (including integer-labeled join keys) fail immediately. Because this function is now the reusable compare entrypoint, the sentinel check should only do substring matching for string labels (or safely stringify labels first).

Useful? React with 👍 / 👎.
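
A minimal sketch of the guard this review asks for, assuming the check should simply skip labels that are not strings. `check_sentinel_free` is a hypothetical name; only the `|df2` suffix comes from the reviewed snippet.

```python
import pandas as pd

DF2_SUFFIX = "|df2"


def check_sentinel_free(df, suffix=DF2_SUFFIX):
    """Reject string column labels that already contain the df2 suffix."""
    for col in df.columns:
        # Substring matching is only meaningful for string labels;
        # `suffix in 0` would raise TypeError for an integer label.
        if isinstance(col, str) and suffix in col:
            raise ValueError(f"column {col!r} contains reserved suffix {suffix!r}")


int_labeled = pd.DataFrame({0: [1, 2], 1: [3, 4]})
check_sentinel_free(int_labeled)  # passes: integer labels are skipped

try:
    check_sentinel_free(pd.DataFrame({"price|df2": [1.0]}))
except ValueError as exc:
    rejected = str(exc)
    print(rejected)
```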

github-actions bot commented Feb 25, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22417904063

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22417904063

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.12.dev22417904063" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

@paddymul (Collaborator, Author)

@codex

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90eb5eae4f

- Guard sentinel check against non-string column labels (skip substring
  match for int/tuple column names)
- Validate join key uniqueness to prevent cartesian explosion from
  duplicate keys
- Use custom indicator column name (__buckaroo_merge) instead of default
  _merge to avoid conflicts with user data
- Handle pandas coercing non-string column labels to strings after merge
  with suffixes
- Add tests for all three review items

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
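
Two of the items in this commit message, key uniqueness validation and the custom indicator column, can be sketched as follows. The helper names `assert_unique_keys` and `safe_merge` are hypothetical; the indicator name `__buckaroo_merge` and the `|df2` suffix are the ones mentioned in this PR. Passing a string to `indicator` is standard `pd.merge` behavior.

```python
import pandas as pd

MERGE_INDICATOR = "__buckaroo_merge"


def assert_unique_keys(df, keys, label):
    """Fail early if any join key combination appears more than once."""
    dup_mask = df.duplicated(subset=keys, keep=False)
    if dup_mask.any():
        raise ValueError(
            f"{label} has {int(dup_mask.sum())} rows with duplicate join keys {keys}"
        )


def safe_merge(df1, df2, keys, how="outer"):
    """Merge only when both sides have unique keys, so rows cannot multiply."""
    assert_unique_keys(df1, keys, "df1")
    assert_unique_keys(df2, keys, "df2")
    return pd.merge(
        df1, df2, on=keys, how=how,
        suffixes=(None, "|df2"),
        indicator=MERGE_INDICATOR,  # custom name avoids clashing with `_merge`
    )


# User data that already has a `_merge` column still merges cleanly.
df1 = pd.DataFrame({"k": [1, 2], "_merge": ["x", "y"]})
df2 = pd.DataFrame({"k": [2, 3], "v": [5, 6]})
merged = safe_merge(df1, df2, ["k"])
print(merged)

# Duplicate keys are rejected before the merge runs.
try:
    safe_merge(pd.DataFrame({"k": [1, 1], "v": [1, 2]}), df2, ["k"])
except ValueError as exc:
    dup_error = str(exc)
    print(dup_error)
```

pandas can also enforce uniqueness itself through the `validate="one_to_one"` argument of `pd.merge`; an explicit check like the one above mainly allows a more specific error message.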
paddymul added this pull request to the merge queue on Feb 25, 2026
paddymul mentioned this pull request on Feb 25, 2026
Merged via the queue into main with commit 8e9e1ed on Feb 25, 2026
23 of 24 checks passed
Linked issue: "issues with the compare tool"
