-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35389: [Python] Fix coalesce_keys=False option in join operation #35505
GH-35389: [Python] Fix coalesce_keys=False option in join operation #35505
Conversation
|
decl = Declaration("scan", ScanNodeOptions(dataset, use_threads=use_threads)) | ||
|
||
# Get rid of special dataset columns | ||
# "__fragment_index", "__batch_index", "__last_in_fragment", "__filename" | ||
projections = [field(f) for f in dataset.schema.names] | ||
decl = Declaration.from_sequence( | ||
[decl, Declaration("project", ProjectNodeOptions(projections))] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace do you know if there is an easier way to do this? (there is not an option to just turn off adding those fields in the scanner to start with?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was on my "new scan node" todo list but sadly that list has been stalled for a while now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, I thought coalescing a full outer join required an actual call to the coalesce
function (e.g. with a project node). However, I don't see that here. Maybe you just aren't coalescing on a full outer join (which is probably fine)?
There is such an actual coalesce call in the case of a full outer join, but that's not visible in the diff here (since this PR is not fixing that case): Line 196 in c3acc91
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the clarification. This looks good then.
Benchmark runs are scheduled for baseline = e5405e7 and contender = 9ef2f65. 9ef2f65 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
…tion (apache#35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: apache#35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…tion (apache#35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: apache#35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…35505) ### Rationale for this change During the refactor of the join cython code to use the new `pyarrow.acero` code, I accidentally ignored the `coalesce_keys=False` option. This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)). ### Are these changes tested? Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types. ### Are there any user-facing changes? Fixes a regression in 12.0, restoring behaviour of 11.0 * Closes: #35389 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
During the refactor of the join cython code to use the new
pyarrow.acero
code, I accidentally ignored thecoalesce_keys=False
option.This PR restores the previous behaviour (by not passing a custom subset of column names to the HashJoinNode, but relying on its default behaviour to include all fields from left and right data (depending on the join type)).
Are these changes tested?
Expanded the existing tests to now properly cover the coalesce_keys=False option for all join types.
Are there any user-facing changes?
Fixes a regression in 12.0, restoring behaviour of 11.0
coalesce_keys
parameter (PyArrow 12) #35389