GH-35389: [Python] Fix coalesce_keys=False option in join operation #35505

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented May 9, 2023

Rationale for this change

During the refactor of the join Cython code to use the new pyarrow.acero code, I accidentally ignored the coalesce_keys=False option.
This PR restores the previous behaviour: instead of passing a custom subset of column names to the HashJoinNode, it relies on the node's default behaviour of including all fields from the left and right data (depending on the join type).

Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0

@github-actions

github-actions bot commented May 9, 2023

⚠️ GitHub issue #35389 has been automatically assigned in GitHub to PR creator.

Comment on lines 54 to +61
decl = Declaration("scan", ScanNodeOptions(dataset, use_threads=use_threads))

# Get rid of special dataset columns
# "__fragment_index", "__batch_index", "__last_in_fragment", "__filename"
projections = [field(f) for f in dataset.schema.names]
decl = Declaration.from_sequence(
    [decl, Declaration("project", ProjectNodeOptions(projections))]
)
Member Author

@westonpace do you know if there is an easier way to do this? (there is not an option to just turn off adding those fields in the scanner to start with?)

Member

This was on my "new scan node" todo list but sadly that list has been stalled for a while now.

Member

@westonpace westonpace left a comment

For some reason, I thought coalescing a full outer join required an actual call to the coalesce function (e.g. with a project node). However, I don't see that here. Maybe you just aren't coalescing on a full outer join (which is probably fine)?

python/pyarrow/tests/test_exec_plan.py (review thread resolved)
@jorisvandenbossche
Member Author

There is such an actual coalesce call in the case of a full outer join, but that's not visible in the diff here (since this PR is not fixing that case):

Expression._call("coalesce", [

Member

@westonpace westonpace left a comment

Ah, thanks for the clarification. This looks good then.

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label May 9, 2023
@jorisvandenbossche jorisvandenbossche merged commit 9ef2f65 into apache:main May 11, 2023
15 of 16 checks passed
@jorisvandenbossche jorisvandenbossche deleted the gh-35389-regression-coalesce-keys branch May 11, 2023 09:00
@ursabot

ursabot commented May 12, 2023

Benchmark runs are scheduled for baseline = e5405e7 and contender = 9ef2f65. 9ef2f65 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️2.35% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.25% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9ef2f655 ec2-t3-xlarge-us-east-2
[Finished] 9ef2f655 test-mac-arm
[Finished] 9ef2f655 ursa-i9-9960x
[Finished] 9ef2f655 ursa-thinkcentre-m75q
[Finished] e5405e77 ec2-t3-xlarge-us-east-2
[Finished] e5405e77 test-mac-arm
[Finished] e5405e77 ursa-i9-9960x
[Finished] e5405e77 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented May 12, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
raulcd pushed a commit that referenced this pull request May 30, 2023
…35505)

### Rationale for this change

During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. 
This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).

### Are these changes tested?

Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.

### Are there any user-facing changes?

Fixes a regression in 12.0, restoring behaviour of 11.0
* Closes: #35389

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12)
3 participants