[SPARK-43952][SQL][FOLLOWUP] Correct AQE cancel broadcast job tag #41979

Closed
wants to merge 2 commits into from

Conversation

ulysses-you
Contributor

@ulysses-you ulysses-you commented Jul 13, 2023

What changes were proposed in this pull request?

This PR changes cancelJobGroup to cancelJobsWithTag in AQE, so that a broadcast exchange can be cancelled correctly.

Because the broadcast job no longer sets a job group ID when it executes and is cancelled by its job tag instead, this PR adds jobTag to BroadcastExchangeLike.
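
To make the difference concrete, here is a minimal sketch of tag-based cancellation using the public SparkContext API (addJobTag, setInterruptOnCancel, cancelJobGroup, cancelJobsWithTag). It is not the code from this PR: the object name, the tag format and the sleeping job are assumptions made for illustration, and it assumes Spark 3.5 or later on the classpath.

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; names here are hypothetical, not taken from the PR.
object JobTagCancelSketch {

  /** Returns true if the tagged job failed after being cancelled by tag. */
  def run(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("job-tag-cancel-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical tag format; a tag must be non-empty and contain no commas.
    val tag = s"sketch-broadcast-${UUID.randomUUID()}"
    val cancelled = new AtomicBoolean(false)

    val worker = new Thread(() => {
      // Tags are attached per thread, so tag the thread that submits the job.
      sc.addJobTag(tag)
      sc.setInterruptOnCancel(true)
      try {
        sc.parallelize(1 to 2, 2).foreach(_ => Thread.sleep(30000))
      } catch {
        // The submitting thread sees the cancellation as a failed job.
        case _: Exception => cancelled.set(true)
      } finally {
        sc.removeJobTag(tag)
      }
    })
    worker.start()

    // Wait (up to 10 seconds) until the job is visible as active.
    val deadline = System.currentTimeMillis() + 10000
    while (sc.statusTracker.getActiveJobIds().isEmpty &&
        System.currentTimeMillis() < deadline) {
      Thread.sleep(100)
    }

    // No job group with this ID was ever set, so this matches nothing.
    sc.cancelJobGroup(tag)
    // This matches the job submitted by the tagged thread.
    sc.cancelJobsWithTag(tag)

    worker.join()
    spark.stop()
    cancelled.get()
  }

  def main(args: Array[String]): Unit =
    println(s"job cancelled by tag: ${run()}")
}
```

The sketch deliberately calls cancelJobGroup first: a job that was only tagged is not reachable through a group ID, which is the situation this PR addresses for the AQE cancel path.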

Why are the changes needed?

fix regression

Does this PR introduce any user-facing change?

no, not released yet

How was this patch tested?

Tested manually with the following query:

select * from t1 
join (select c1, java_method('java.lang.Thread', 'sleep', 5000l) from t2)t2 on t1.c1 = t2.c1
join (select c1, raise_error('force_fail') from t3)t3 on t1.c1 = t3.c1

before:
image

after:
image

@github-actions github-actions bot added the SQL label Jul 13, 2023
@ulysses-you
Contributor Author

@ulysses-you ulysses-you force-pushed the jobtag-followup branch 2 times, most recently from ee9c1c6 to afda62c Compare July 13, 2023 04:11
*/
def runId: UUID = UUID.randomUUID
@deprecated("Use jobTag", "3.5.0")
Member

Is this annotation necessary?

Contributor Author

The reason is that the semantics of runId changed slightly. Previously it represented the job group ID; now it is the unique ID embedded in the job tag. I'm fine with removing it if you think it is unnecessary.

Comment on lines 50 to 51
@transient
val runId: UUID = UUID.randomUUID
Contributor

Previously, @transient was on jobTag, which is derived deterministically from runId. Now that runId itself is transient, won't a new UUID.randomUUID be generated if this class is serialized and deserialized?
I don't have full context on the implications, so I defer to @HyukjinKwon and @cloud-fan.

Contributor Author

This is interesting. Let me remove it to be safe.
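
The serialization question above can be explored outside Spark with plain Java serialization. The following is a minimal sketch under stated assumptions: the three classes are hypothetical stand-ins, not Spark plan nodes, and it says nothing about how Spark actually copies or ships BroadcastExchangeExec. It only shows how a runId-like field behaves after a round trip when it is an eager @transient val, a @transient lazy val, or a plain val.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.util.UUID

// Hypothetical stand-ins for a node that carries a run ID.
class EagerTransientId extends Serializable {
  @transient val runId: UUID = UUID.randomUUID
}

class LazyTransientId extends Serializable {
  @transient lazy val runId: UUID = UUID.randomUUID
}

class PlainId extends Serializable {
  val runId: UUID = UUID.randomUUID
}

object TransientRunIdSketch {

  /** Serializes and deserializes a value with Java serialization. */
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val eager = new EagerTransientId
    val lzy = new LazyTransientId
    val plain = new PlainId
    val lazyBefore = lzy.runId // force the lazy val before serializing

    // The eager transient field is skipped and the constructor is not
    // re-run on deserialization, so the copy has no ID at all.
    println(s"eager transient is null: ${roundTrip(eager).runId == null}")
    // The lazy transient field is recomputed on first access in the copy.
    println(s"lazy transient differs:  ${roundTrip(lzy).runId != lazyBefore}")
    // A non-transient field keeps its original value.
    println(s"plain val preserved:     ${roundTrip(plain).runId == plain.runId}")
  }
}
```

In all three cases only the non-transient field keeps a stable ID across the round trip, which is consistent with the decision above to drop the annotation.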

@@ -250,7 +250,7 @@ case class BroadcastQueryStageExec(

   override def cancel(): Unit = {
     if (!broadcast.relationFuture.isDone) {
-      sparkContext.cancelJobGroup(broadcast.runId.toString)
+      sparkContext.cancelJobsWithTag(broadcast.jobTag)
Contributor

Hm, currently BroadcastExchangeExec.doExecuteBroadcast only calls cancelJobsWithTag when it catches a TimeoutException.
Should we cancel there for broader exceptions (even any Throwable)? Then broadcast.relationFuture.cancel(true) here would make the wait there throw, the cancellation would happen there, and this place would not have to cancel separately.

Contributor Author

When AQE materializes a broadcast relation, it does not call BroadcastExchangeExec.doExecuteBroadcast, so catching the exception in BroadcastExchangeExec.doExecuteBroadcast has no effect.

@ulysses-you
Contributor Author

cc @yaooqinn @juliuszsompolski @HyukjinKwon @cloud-fan any more comments? I think this should go to branch-3.5

@HyukjinKwon
Member

@ulysses-you wanna try merging this PR?

@ulysses-you
Contributor Author

@HyukjinKwon sure, let me try to merge it carefully

ulysses-you added a commit that referenced this pull request Jul 21, 2023
### What changes were proposed in this pull request?

This PR changes `cancelJobGroup` to `cancelJobsWithTag` in AQE, so that a broadcast exchange can be cancelled correctly.

Because the broadcast job no longer sets a job group ID when it executes and is cancelled by its job tag instead, this PR adds `jobTag` to `BroadcastExchangeLike`.

### Why are the changes needed?

fix regression

### Does this PR introduce _any_ user-facing change?

no, not released yet

### How was this patch tested?

Tested manually with the following query:

```sql
select * from t1
join (select c1, java_method('java.lang.Thread', 'sleep', 5000l) from t2)t2 on t1.c1 = t2.c1
join (select c1, raise_error('force_fail') from t3)t3 on t1.c1 = t3.c1
```

before:
<img width="1194" alt="image" src="https://github.com/apache/spark/assets/12025282/55d218da-7289-404a-b201-1ea9f4902026">

after:
<img width="1202" alt="image" src="https://github.com/apache/spark/assets/12025282/9b293d1f-01d6-43e2-9c1a-20540f58c3e5">

Closes #41979 from ulysses-you/jobtag-followup.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Xiduo You <ulyssesyou@apache.org>
(cherry picked from commit 99f9df5)
Signed-off-by: Xiduo You <ulyssesyou@apache.org>
@ulysses-you
Contributor Author

thank you all, merged to master/branch-3.5

@ulysses-you ulysses-you deleted the jobtag-followup branch July 21, 2023 02:43
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024