Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44325][SQL] Use PartitionEvaluator API in SortMergeJoinExec #41884

Conversation

vinodkc
Copy link
Contributor

@vinodkc vinodkc commented Jul 6, 2023

What changes were proposed in this pull request?

SQL operator SortMergeJoinExec updated to use the PartitionEvaluator API to do execution.

Why are the changes needed?

To avoid the use of lambda during distributed execution.
Ref: SPARK-43061 for more details.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated 1 test case, once all the SQL operators are migrated, the flag spark.sql.execution.useTaskEvaluator will be enabled by default to avoid running the tests with and without this TaskEvaluator

@github-actions github-actions bot added the SQL label Jul 6, 2023
@vinodkc vinodkc changed the title [SPARK-44325][SQL]Use PartitionEvaluator API for SortMergeJoinExec [SPARK-44325][SQL] Use PartitionEvaluator API for SortMergeJoinExec Jul 7, 2023
@vinodkc vinodkc changed the title [SPARK-44325][SQL] Use PartitionEvaluator API for SortMergeJoinExec [SPARK-44325][SQL] Use PartitionEvaluator API in SortMergeJoinExec Jul 7, 2023
@vinodkc
Copy link
Contributor Author

vinodkc commented Jul 7, 2023

@vinodkc vinodkc force-pushed the br_refactorSortMergeJoinEvaluatorFactory branch from 406cb4b to 2fad7f4 Compare July 8, 2023 05:31
@vinodkc vinodkc force-pushed the br_refactorSortMergeJoinEvaluatorFactory branch 3 times, most recently from 9ca4b50 to 8717928 Compare July 8, 2023 21:21
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay but I'm just wondering is there the closure issue in SortMergeJoinExec so we need to change it to PartitionEvaluator? Seems I don't find it.

@vinodkc vinodkc force-pushed the br_refactorSortMergeJoinEvaluatorFactory branch from 8717928 to 80d960d Compare July 9, 2023 16:02
@vinodkc vinodkc force-pushed the br_refactorSortMergeJoinEvaluatorFactory branch from 80d960d to 79937b5 Compare July 10, 2023 15:20
@cloud-fan
Copy link
Contributor

@viirya I think the benefit is that, we make it more clear what gets serialized and sent to the executor side.

@beliefer
Copy link
Contributor

LGTM+1

@yaooqinn yaooqinn closed this in ce359bc Jul 12, 2023
@yaooqinn
Copy link
Member

thanks @vinodkc and @cloud-fan @viirya @beliefer. merged to master

} else {
left.execute().zipPartitions(right.execute()) { (leftIter, rightIter) =>
val evaluator = evaluatorFactory.createEvaluator()
evaluator.eval(0, leftIter, rightIter)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zipPartitionsWithIndex method is currently absent, hence 0 index is passed to evaluator.eval(0, ...)

Once spark.sql.execution.useTaskEvaluator is set to true by default, this block will not be executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use TaskContext.getPartitionId?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, it's different from index. Maybe we should just leave a note here about why we always use 0.

ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?

SQL operator `SortMergeJoinExec` updated to use the `PartitionEvaluator` API to do execution.

### Why are the changes needed?

To avoid the use of lambda during distributed execution.
Ref: SPARK-43061 for more details.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated 1 test case, once all the SQL operators are migrated, the flag `spark.sql.execution.useTaskEvaluator` will be enabled by default to avoid running the tests with and without this TaskEvaluator

Closes apache#41884 from vinodkc/br_refactorSortMergeJoinEvaluatorFactory.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants