GH-35389: [Python] Fix coalesce_keys=False option in join operation #35505

jorisvandenbossche · 2023-05-09T08:58:37Z

Rationale for this change

While refactoring the Cython join code to use the new pyarrow.acero module, I accidentally dropped support for the coalesce_keys=False option.
This PR restores the previous behaviour: instead of passing a custom subset of column names to the HashJoinNode, it relies on the node's default behaviour of including all fields from the left and right data (depending on the join type).
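To make the option concrete, here is a minimal plain-Python sketch of what coalesce_keys changes in the result of a left outer join. It models the semantics only and is not how pyarrow or Acero implements joins; the helper `left_outer_join`, the column names `colA`/`colB` and the sample rows are all assumptions made for illustration.

```python
# Plain-Python model of the join semantics only (hypothetical helper,
# not pyarrow's implementation).
def left_outer_join(left, right, key, right_key, coalesce_keys=True):
    """Left outer join over lists of dict rows."""
    # Index the right rows by key value so matches can be looked up.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    # Non-key columns of the right side (taken from the first row, if any).
    right_cols = [c for c in (right[0] if right else {}) if c != right_key]

    out = []
    for lrow in left:
        # An unmatched left row still produces one output row.
        for rrow in index.get(lrow[key], [None]):
            row = dict(lrow)
            if not coalesce_keys:
                # Keep the right key as its own column; None when unmatched.
                row[right_key] = rrow[right_key] if rrow else None
            for c in right_cols:
                row[c] = rrow[c] if rrow else None
            out.append(row)
    return out


# Made-up sample data.
left = [{"colA": 1, "x": "a"}, {"colA": 2, "x": "b"}]
right = [{"colB": 2, "y": "B"}, {"colB": 3, "y": "C"}]

coalesced = left_outer_join(left, right, "colA", "colB")
kept = left_outer_join(left, right, "colA", "colB", coalesce_keys=False)
```

In this model, `coalesced` carries only the left key column, while `kept` also carries `colB`, which is None for the left row that has no match. That extra column is what the regression dropped.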

Are these changes tested?

The existing tests were expanded to properly cover the coalesce_keys=False option for all join types.

Are there any user-facing changes?

Fixes a regression in 12.0, restoring the behaviour of 11.0.

Closes: [Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

github-actions · 2023-05-09T08:59:03Z

Closes: [Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

github-actions · 2023-05-09T08:59:06Z

⚠️ GitHub issue #35389 has been automatically assigned in GitHub to PR creator.

jorisvandenbossche · 2023-05-09T09:01:24Z

python/pyarrow/acero.py

    decl = Declaration("scan", ScanNodeOptions(dataset, use_threads=use_threads))

+    # Get rid of special dataset columns
+    # "__fragment_index", "__batch_index", "__last_in_fragment", "__filename"
+    projections = [field(f) for f in dataset.schema.names]
+    decl = Declaration.from_sequence(
+        [decl, Declaration("project", ProjectNodeOptions(projections))]
+    )


@westonpace do you know if there is an easier way to do this? Is there no option to simply turn off adding those fields in the scanner in the first place?

This was on my "new scan node" todo list but sadly that list has been stalled for a while now.

westonpace

For some reason, I thought coalescing a full outer join required an actual call to the coalesce function (e.g. with a project node). However, I don't see that here. Maybe you just aren't coalescing on a full outer join (which is probably fine)?

python/pyarrow/tests/test_exec_plan.py

jorisvandenbossche · 2023-05-09T17:13:56Z

There is such an actual coalesce call in the case of a full outer join, but that's not visible in the diff here (since this PR is not fixing that case):

arrow/python/pyarrow/acero.py

Line 196 in c3acc91

Expression._call("coalesce", [
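As a rough illustration of why a full outer join needs an explicit coalesce step: either side's key can be null in the joined rows, so a single key column has to take the first non-null of the two. The sketch below models that in plain Python; the `coalesce` helper and the sample rows are assumptions for illustration, not the code in acero.py.

```python
# Plain-Python stand-in for the "coalesce" function (illustrative only).
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Made-up rows, shaped as a full outer join might emit them before coalescing.
joined = [
    {"colA": 1, "colB": None},   # row present only on the left
    {"colA": 2, "colB": 2},      # row matched on both sides
    {"colA": None, "colB": 3},   # row present only on the right
]

# One key column built from the two per-side key columns.
keys = [coalesce(row["colA"], row["colB"]) for row in joined]
```

For the other join types one side's key is always populated for every output row, so in this model simply dropping the other side's key column is enough.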

westonpace

Ah, thanks for the clarification. This looks good then.

ursabot · 2023-05-12T11:54:32Z

Benchmark runs are scheduled for baseline = e5405e7 and contender = 9ef2f65. 9ef2f65 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️2.35% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.25% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9ef2f655 ec2-t3-xlarge-us-east-2
[Finished] 9ef2f655 test-mac-arm
[Finished] 9ef2f655 ursa-i9-9960x
[Finished] 9ef2f655 ursa-thinkcentre-m75q
[Finished] e5405e77 ec2-t3-xlarge-us-east-2
[Finished] e5405e77 test-mac-arm
[Finished] e5405e77 ursa-i9-9960x
[Finished] e5405e77 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-05-12T11:55:41Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

jorisvandenbossche requested a review from AlenkaF as a code owner May 9, 2023 08:58

github-actions bot added the "Component: Python" and "awaiting committer review" labels May 9, 2023

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label May 9, 2023

jorisvandenbossche mentioned this pull request May 9, 2023

[Python] Table.join no longer respecting coalesce_keys parameter (PyArrow 12) #35389

Closed

westonpace reviewed May 9, 2023

westonpace approved these changes May 9, 2023

github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label May 9, 2023

jorisvandenbossche merged commit 9ef2f65 into apache:main May 11, 2023
15 of 16 checks passed

jorisvandenbossche deleted the gh-35389-regression-coalesce-keys branch May 11, 2023 09:00
