[Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12) #35389

joocer · 2023-05-02T17:03:28Z

Describe the bug, including details regarding any error messages, version, and platform.

Table.join has an attribute coalesce_keys, which the documentation says:

coalesce_keysbool, default True
If the duplicated keys should be omitted from one of the sides in the join result.

In PyArrow v11, the columns used to perform the join were retained in the resultant table when this parameter was set to False. However, in v12, the column from the 'right' table (the one in the parameters) is omitted from the result.

A review of the change log for v12 doesn't suggest this change in behaviour is intentional.

This change in behaviour was observed in the matrix regression testing for Opteryx, which has matrix regression testing across Mac, Linux and Windows, for Python versions 3.8, 3.9, 3.10 and 3.11, and for Left and Inner JOINs - all variations in the test matrix appear to have this same behaviour.

No error is observed, only a change in behaviour.

Component(s)

Python

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2023-05-03T12:27:29Z

@joocer thanks for the report! That was certainly not an intentional change in behaviour, but the implementation was refactored to use a different invocation of the C++ APIs (now relying on the pyarrow.acero Declaration bindings). So something might have gone wrong in this refactor (and if so, unfortunately this doesn't seem to have been covered properly by our tests).

Could you provide a small reproducible code example that illustrates the change in behaviour?

joocer · 2023-05-03T13:20:32Z

No worries - here's a contrived snippet to demonstrate:

import pyarrow

movie_vampires = pyarrow.Table.from_pydict(
    {
        "Movie": ["Twilight", "Interview with the Vampire", "Dracula", "Blade", "Underworld"],
        "Vampire": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
    }
)

actors = pyarrow.Table.from_pydict(
    {
        "Character": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
        "Actor": ["Robert Pattinson", "Tom Cruise", "Gary Oldman", "Wesley Snipes", "Kate Beckinsale"],
    }
)

movie_actors = movie_vampires.join(
    actors,
    keys=["Vampire"],
    right_keys=["Character"],
    join_type="inner",
    coalesce_keys=False,
)

print(movie_actors)

In pyarrow 11, this is the result (note the four columns):

Movie: string
Vampire: string
Character: string
Actor: string
----
Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Character: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]

in pyarrow 12 this is the result (note only three columns)

Movie: string
Vampire: string
Actor: string
----
Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]

This output is from a box with Python 3.10.7 on Debian Buster x86 64bit, but it appears to happen on all the Python versions and OSes I've tried.

westonpace · 2023-05-08T17:56:02Z

Both pyarrow and R have been doing this on their own and so I think the motivation to add the fix to C++ directly was low. However, now that pyarrow is using C++ more directly, it sounds like we need it. There was a very old PR here: zagto#1 that should provide a rough approach. However, it is probably out of date.

…tion

jorisvandenbossche · 2023-05-09T09:03:44Z

While I think it would certainly be nice to move this coalesce logic into C++, in this case it was just a small oversight in my refactor of the python bindings that caused it (I always the selection of columns to the join node, instead of only when coalesce_keys=True)
PR to fix this -> #35505

…35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: #35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

…tion (apache#35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: apache#35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

…35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: #35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

joocer added the Type: bug label May 2, 2023

github-actions bot added the Component: Python label May 2, 2023

joocer changed the title ~~Table.join no longer respecting coalesce_keys parameter (Python, PyArrow 12)~~ [Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) May 2, 2023

joocer mentioned this issue May 2, 2023

🪲 pyarrow 12 changed JOIN behavior mabel-dev/opteryx#1020

Closed

jorisvandenbossche added this to the 12.0.1 milestone May 3, 2023

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 9, 2023

apacheGH-35389: [Python] Fix coalesce_keys=False option in join opera…

9cd29f2

…tion

github-actions bot mentioned this issue May 9, 2023

GH-35389: [Python] Fix coalesce_keys=False option in join operation #35505

Merged

github-actions bot assigned jorisvandenbossche May 9, 2023

jorisvandenbossche closed this as completed in #35505 May 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12) #35389

[Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12) #35389

joocer commented May 2, 2023

jorisvandenbossche commented May 3, 2023

joocer commented May 3, 2023

westonpace commented May 8, 2023

jorisvandenbossche commented May 9, 2023

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

Comments

joocer commented May 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

jorisvandenbossche commented May 3, 2023

joocer commented May 3, 2023

westonpace commented May 8, 2023

jorisvandenbossche commented May 9, 2023

[Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12) #35389

[Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12) #35389