New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] Table.join no longer respecting coalesce_keys
parameter (PyArrow 12)
#35389
Comments
coalesce_keys
parameter (Python, PyArrow 12)coalesce_keys
parameter (PyArrow 12)
@joocer thanks for the report! That was certainly not an intentional change in behaviour, but the implementation was refactored to use a different invocation of the C++ APIs (now relying on the Could you provide a small reproducible code example that illustrates the change in behaviour? |
No worries - here's a contrived snippet to demonstrate: import pyarrow
movie_vampires = pyarrow.Table.from_pydict(
{
"Movie": ["Twilight", "Interview with the Vampire", "Dracula", "Blade", "Underworld"],
"Vampire": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
}
)
actors = pyarrow.Table.from_pydict(
{
"Character": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
"Actor": ["Robert Pattinson", "Tom Cruise", "Gary Oldman", "Wesley Snipes", "Kate Beckinsale"],
}
)
movie_actors = movie_vampires.join(
actors,
keys=["Vampire"],
right_keys=["Character"],
join_type="inner",
coalesce_keys=False,
)
print(movie_actors) In pyarrow 11, this is the result (note the four columns):
in pyarrow 12 this is the result (note only three columns)
This output is from a box with Python 3.10.7 on Debian Buster x86 64bit, but it appears to happen on all the Python versions and OSes I've tried. |
Both pyarrow and R have been doing this on their own and so I think the motivation to add the fix to C++ directly was low. However, now that pyarrow is using C++ more directly, it sounds like we need it. There was a very old PR here: zagto#1 that should provide a rough approach. However, it is probably out of date. |
While I think it would certainly be nice to move this coalesce logic into C++, in this case it was just a small oversight in my refactor of the python bindings that caused it (I always the selection of columns to the join node, instead of only when |
…35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: #35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…tion (apache#35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: apache#35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…tion (apache#35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: apache#35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: #35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
Table.join has an attribute
coalesce_keys
, which the documentation says:In PyArrow v11, the columns used to perform the join were retained in the resultant table when this parameter was set to
False
. However, in v12, the column from the 'right' table (the one in the parameters) is omitted from the result.A review of the change log for v12 doesn't suggest this change in behaviour is intentional.
This change in behaviour was observed in the matrix regression testing for Opteryx, which has matrix regression testing across Mac, Linux and Windows, for Python versions 3.8, 3.9, 3.10 and 3.11, and for Left and Inner JOINs - all variations in the test matrix appear to have this same behaviour.
No error is observed, only a change in behaviour.
Component(s)
Python
The text was updated successfully, but these errors were encountered: