-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-50806][SQL] Support InputRDDCodegen interruption on task cancellation #49501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 looks like a magic number. So, we don't need to interrupt when the number of rows is less than 1000 because it will finish quickly in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it's by intuition. And yes we expect 1000 rows to finish quickly soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment that this test case assumption: this should be 100000(>1k) because we check it per 1k rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
@Ngone51 the new test failed |
|
@cloud-fan is there a way to configure the parallelize for |
|
Do you mean
|
|
Gentle ping, @Ngone51 ~ |
@dongjoon-hyun Thanks, this is what I'm looking for. Let me try it first. (The test generates 2 tasks locally but 3 in GithubAction. So I'd liket explictly set the partitions to fix the stability of the test.) |
|
Maybe, could you rebase this PR to |
|
@dongjoon-hyun Thanks, will updated. But this PR's test failure still exits. I'll take a closer look. 2025-01-23T05:14:27.8423862Z �[0m[�[0m�[0minfo�[0m] �[0m�[0m�[31m- SPARK-50806: HashAggregate codegen should be interrupted on task cancellation *** FAILED *** (1 minute, 5 seconds)�[0m�[0m
2025-01-23T05:14:27.8426704Z �[0m[�[0m�[0minfo�[0m] �[0m�[0m�[31m The code passed to eventually never returned normally. Attempted 3956 times over 1.0001835922166666 minutes. Last failure message: 3 did not equal 2. (HashAggregateCodegenInterruptionSuite.scala:91)�[0m�[0m |
|
Oh. Got it. |
|
The failure is not related to However, this is the local Spark logs, where the job The logs looks almost the same. It's confused why the job can't be found in Github Action's case. |
|
@cloud-fan @dongjoon-hyun I have updated the PR. Could you take another look? Thanks! |
| .doc("Whether to interrupt InputCodegenRDD on the task cancellation. Interrupt " + | ||
| "InputCodegenRDD would be useful to stop task and release resource quickly, " + | ||
| "and also reduce the task killing timeout failures by task reaper.") | ||
| .version("3.5.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update this value, @Ngone51 ? Maybe, 4.1.0 because this is an Improvement?
| .createWithDefault(1024) | ||
|
|
||
| val INPUT_CODEGEN_RDD_INTERRUPT_ON_CANCEL = | ||
| buildConf("spark.sql.codegen.inputCodegenRDD.interruptOnCancel") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a plan to introduce more configurations under spark.sql.codegen.inputCodegenRDD.* namespace? If not, maybe, we had better avoid introducing this new namespace, spark.sql.codegen.inputCodegenRDD.*.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 with two comments about (1) config version and (2) name. Thank you, @Ngone51 .
|
@dongjoon-hyun Thanks, addressed the comment. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR adds a new mutable state
taskInterruptedinInputRDDCodegen. The state indicates whether the task thread is interrupted or not.InputRDDCodegenwould check the state each time when it produces a record and stops immediately iftaskInterrupted=true.taskInterruptedwould be updated byTaskContext.isInterrupted()per 1000 output rows processed.Why are the changes needed?
This facilitates the resource release when there is a task killing request.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit test.
Was this patch authored or co-authored using generative AI tooling?
No.