Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

Closed
joocer opened this issue May 2, 2023 · 4 comments · Fixed by #35505
Closed

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

joocer opened this issue May 2, 2023 · 4 comments · Fixed by #35505

Comments

@joocer
Copy link

joocer commented May 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Table.join has an attribute coalesce_keys, which the documentation says:

coalesce_keysbool, default True
If the duplicated keys should be omitted from one of the sides in the join result.

In PyArrow v11, the columns used to perform the join were retained in the resultant table when this parameter was set to False. However, in v12, the column from the 'right' table (the one in the parameters) is omitted from the result.

A review of the change log for v12 doesn't suggest this change in behaviour is intentional.

This change in behaviour was observed in the matrix regression testing for Opteryx, which has matrix regression testing across Mac, Linux and Windows, for Python versions 3.8, 3.9, 3.10 and 3.11, and for Left and Inner JOINs - all variations in the test matrix appear to have this same behaviour.

No error is observed, only a change in behaviour.

Component(s)

Python

@joocer joocer changed the title Table.join no longer respecting coalesce_keys parameter (Python, PyArrow 12) [Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) May 2, 2023
@jorisvandenbossche
Copy link
Member

@joocer thanks for the report! That was certainly not an intentional change in behaviour, but the implementation was refactored to use a different invocation of the C++ APIs (now relying on the pyarrow.acero Declaration bindings). So something might have gone wrong in this refactor (and if so, unfortunately this doesn't seem to have been covered properly by our tests).

Could you provide a small reproducible code example that illustrates the change in behaviour?

@jorisvandenbossche jorisvandenbossche added this to the 12.0.1 milestone May 3, 2023
@joocer
Copy link
Author

joocer commented May 3, 2023

No worries - here's a contrived snippet to demonstrate:

import pyarrow

movie_vampires = pyarrow.Table.from_pydict(
    {
        "Movie": ["Twilight", "Interview with the Vampire", "Dracula", "Blade", "Underworld"],
        "Vampire": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
    }
)

actors = pyarrow.Table.from_pydict(
    {
        "Character": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
        "Actor": ["Robert Pattinson", "Tom Cruise", "Gary Oldman", "Wesley Snipes", "Kate Beckinsale"],
    }
)

movie_actors = movie_vampires.join(
    actors,
    keys=["Vampire"],
    right_keys=["Character"],
    join_type="inner",
    coalesce_keys=False,
)

print(movie_actors)

In pyarrow 11, this is the result (note the four columns):

Movie: string
Vampire: string
Character: string
Actor: string
----
Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Character: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]

in pyarrow 12 this is the result (note only three columns)

Movie: string
Vampire: string
Actor: string
----
Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]

This output is from a box with Python 3.10.7 on Debian Buster x86 64bit, but it appears to happen on all the Python versions and OSes I've tried.

@westonpace
Copy link
Member

Both pyarrow and R have been doing this on their own and so I think the motivation to add the fix to C++ directly was low. However, now that pyarrow is using C++ more directly, it sounds like we need it. There was a very old PR here: zagto#1 that should provide a rough approach. However, it is probably out of date.

@jorisvandenbossche
Copy link
Member

While I think it would certainly be nice to move this coalesce logic into C++, in this case it was just a small oversight in my refactor of the python bindings that caused it (I always the selection of columns to the join node, instead of only when coalesce_keys=True)
PR to fix this -> #35505

jorisvandenbossche added a commit that referenced this issue May 11, 2023
…35505)

### Rationale for this change

During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. 
This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).

### Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

### Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0
* Closes: #35389

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…tion (apache#35505)

### Rationale for this change

During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. 
This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).

### Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

### Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0
* Closes: apache#35389

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…tion (apache#35505)

### Rationale for this change

During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. 
This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).

### Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

### Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0
* Closes: apache#35389

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
raulcd pushed a commit that referenced this issue May 30, 2023
…35505)

### Rationale for this change

During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. 
This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).

### Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

### Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0
* Closes: #35389

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants