[SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver #30481
Conversation
```scala
if (forceCancel.get()) {
  assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime < 1000)
} else {
  assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime >= 2900)
```
To allow for sleep inaccuracy, check 2.9s instead of 3s.
Hm, how about doing a rough comparison instead, like running time of forceCancel=false > running time of forceCancel=true? I'm a bit worried that this test can cause flakiness.
Without this PR, the running time of forceCancel=false can also be longer than the running time of forceCancel=true due to timing inaccuracy. How about checking 2s? I believe that is stable enough.
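The sleep-accuracy concern behind the loosened 2.9s bound can be seen with a plain timing check. This Python sketch (`run_task` is a made-up stand-in for the sleeping Spark task, not anything in the PR) shows why asserting against a slightly loosened lower bound is safer than an exact one:

```python
import time

def run_task(sleep_secs: float) -> float:
    """Simulate a task that sleeps, returning its measured elapsed time."""
    start = time.monotonic()
    time.sleep(sleep_secs)
    return time.monotonic() - start

elapsed = run_task(0.3)
# Exact comparisons against the nominal sleep duration are flaky: clock
# granularity and scheduling can push the measurement slightly off either way.
# Assert against a loosened lower bound instead, as the review suggests.
assert elapsed >= 0.29
```

The same reasoning applies to the 2900 ms bound in the test: the launch-to-end interval is measured by wall clocks the test does not control, so a small tolerance avoids spurious failures.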
Test build #131642 has finished for PR 30481 at commit
Kubernetes integration test starting
Kubernetes integration test status success
```scala
  "interrupted. When false, all the job of query will be cancelled but running task" +
  "will be remained until finished. Note that, this config must be set before query" +
  "otherwise it doesn't help.")
.version("3.1.0")
```
nit: 3.1.0 -> 3.2.0
Yeah, 3.1.0 has been cut.
Adding a new config for it looks fine.
Kubernetes integration test starting
Kubernetes integration test status success
```scala
} else {
  assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime >= 2000)
}
case _ =>
```
Could you add an assert here, too? It seems taskEnd.reason is assumed to always be TaskKilled?
Updated.
Test build #132336 has finished for PR 30481 at commit
Kubernetes integration test starting
Kubernetes integration test status success
```scala
val e2 = intercept[SQLException] {
  statement.execute("select java_method('java.lang.Thread', 'sleep', 3000L)")
}.getMessage
assert(e2.contains("Query timed out"))
```
The code in L105-111 and L113-118 seems to be almost the same, so could you write it like this?

```scala
Seq(true, false).foreach { forceCancel =>
  ....
}
```
Looks better!
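The suggested `Seq(true, false).foreach` refactor, one test body driven by both config values instead of two near-identical tests, looks like this as a Python sketch (`run_query_with_timeout` is a hypothetical stand-in for the duplicated test body, not code from the PR):

```python
def run_query_with_timeout(force_cancel: bool) -> str:
    """Hypothetical stand-in for the duplicated test bodies: pretend to run
    a statement that times out, and report which cancel mode was used."""
    mode = "interrupted" if force_cancel else "cancelled"
    return f"Query timed out ({mode})"

# One parameterized body, driven by both config values.
for force_cancel in (True, False):
    message = run_query_with_timeout(force_cancel)
    assert "Query timed out" in message
```

Parameterizing over the config value keeps the two code paths exercised while removing the copy-paste divergence risk the reviewer points out.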
```scala
.doc("When true, all the job of query will be cancelled and running tasks will be" +
  "interrupted. When false, all the job of query will be cancelled but running task" +
  "will be remained until finished. Note that, this config must be set before query" +
  "otherwise it doesn't help.")
```
Do we need to describe "Note that, this config must be set before query otherwise it doesn't help." for this config? I think the other SQL configs have the same restriction, too (users need to set a config before running a query).
Removed.
Test build #132354 has finished for PR 30481 at commit
Looks fine cc: @HyukjinKwon @wangyum @viirya
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132371 has finished for PR 30481 at commit
```scala
// @@ -937,6 +937,15 @@ object SQLConf {
    .timeConf(TimeUnit.SECONDS)
    .createWithDefault(0L)

  val THRIFTSERVER_FORCE_CANCEL =
    buildConf("spark.sql.thriftServer.forceCancel")
```
thriftServer? thriftserver?
- spark.sql.thriftserver.scheduler.pool
- spark.sql.thriftserver.ui.retainedStatements
- spark.sql.thriftserver.ui.retainedSessions
- spark.sql.thriftServer.queryTimeout
- spark.sql.thriftServer.incrementalCollect
Seems recent configs use thriftServer; thriftserver is the naming from early-version configs.
```scala
.doc("When true, all the job of query will be cancelled and running tasks will be" +
  "interrupted. When false, all the job of query will be cancelled but running task" +
  "will be remained until finished.")
```
Hmm, is this doc correct? What does it mean that all the jobs of the query will be cancelled when this is true?
Updated.
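On what "all the jobs of the query will be cancelled" means: one query can spawn several Spark jobs that share a job group, and cancelling the group marks every job in it, not just the one currently running. A small illustrative Python sketch (the registry and helper here are made up for illustration, not Spark's implementation):

```python
# A toy registry of jobs, keyed by job id; each job belongs to a job group.
active_jobs = {
    1: {"group": "query-42", "state": "running"},
    2: {"group": "query-42", "state": "pending"},
    3: {"group": "query-99", "state": "running"},
}

def cancel_job_group(group_id: str) -> int:
    """Cancel every not-yet-cancelled job in the group; return the count."""
    cancelled = 0
    for job in active_jobs.values():
        if job["group"] == group_id and job["state"] != "cancelled":
            job["state"] = "cancelled"
            cancelled += 1
    return cancelled

assert cancel_job_group("query-42") == 2     # both jobs of the query
assert active_jobs[3]["state"] == "running"  # other queries untouched
```

Whether the running tasks inside those cancelled jobs are also interrupted is exactly what the new config controls.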
@juliuszsompolski This disadvantage seems important. ThriftServer is a long-lived process; I believe it's better to give users a dynamic config while it is running rather than requiring a restart. Maybe we can create another topic about the core config. What do you think? @maropu @HyukjinKwon @viirya
@ulysses-you, is there any case we have to disable this configuration when
Logically, it affects two code places. But I'm OK to let it be bound to the query timeout, since that is the more useful case for users.
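The binding being discussed, where the timeout path cancels the jobs and, only when the flag is set, also interrupts the running task, might be sketched roughly like this in Python (`execute_with_timeout` and `task` are hypothetical stand-ins; a cooperative `Event` approximates JVM `Thread.interrupt`):

```python
import threading

def execute_with_timeout(work, timeout_secs: float, force_cancel: bool) -> str:
    """Run `work` in a thread; on timeout, cancel it, and optionally also
    interrupt the running task (approximated by a stop event it observes)."""
    stop = threading.Event()
    t = threading.Thread(target=work, args=(stop,), daemon=True)
    t.start()
    t.join(timeout_secs)
    if t.is_alive():
        if force_cancel:
            stop.set()          # analogue of interrupting the running task
            t.join()
        return "cancelled"      # the job is cancelled either way on timeout
    return "finished"

def task(stop: threading.Event) -> None:
    # A long-running task that polls for interruption between work units.
    while not stop.wait(0.01):
        pass

assert execute_with_timeout(task, 0.05, force_cancel=True) == "cancelled"
```

With `force_cancel=False` the call still reports "cancelled", but the worker thread keeps running, mirroring "running tasks will remain until finished".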
Okay, I'm fine having this configuration. We can have another configuration for the core side that sets this configuration together, but let's leave that for future work. Does it make sense @juliuszsompolski?
Another option might be to make it a SQL-specific configuration, since for core, users can explicitly enable this via the Scala API, but plain SQL does not have a way to do it. To be honest I'm not sure which way is best. Let me know if you have a different thought.
Thinking about it more, users of the Scala or Python API can just set it as a parameter on a job group and don't need a conf for that, so I think it makes sense to keep it as is, a Thriftserver conf, because setting it via a
@ulysses-you, can you address the rest of the comments and rebase? Looks like we can just go ahead.
Thank you @HyukjinKwon, I have updated the guide doc with 921f9cf. IMO it's better to remind users of this config when they set a query timeout. Please correct me if I missed something.
LGTM otherwise
Test build #132629 has finished for PR 30481 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132631 has finished for PR 30481 at commit
Merged to master.
Thanks for the discussion!
Kyuubi #451 (Support query auto timeout cancel on thriftserver):

Why are the changes needed?
Manually cherry-pick some Spark patches into Kyuubi:
1. Support query auto timeout cancel on thriftserver (apache/spark#29933)
2. Add config to control if cancel invoke interrupt task on thriftserver (apache/spark#30481)

To keep backward compatibility with earlier Spark versions, we hard-code the config key instead of referring to the Spark SQLConf. Note that the existing operation timeout (`kyuubi.operation.idle.timeout`) cancels operations whose client no longer accesses the engine. That said, if a query runs for a long time while the client is alive, the query would not be cancelled. The newly added config `spark.sql.thriftServer.queryTimeout` can handle this case.

How was this patch tested?
Add new test.

Closes #451 from ulysses-you/query-timeout.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
```scala
  "When false, all running tasks will remain until finished.")
.version("3.2.0")
.booleanConf
.createWithDefault(false)
```
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L134 and https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L270 unconditionally set interruptOnCancel=true for broadcasts and for streaming queries. https://issues.apache.org/jira/browse/HDFS-1208 remains open, but there have not been issues resulting from using true in those other places.
Given that, do we need to bother protecting it with a config for the Thriftserver? Or could this config at least be flipped to enabled by default? Having it default to false limits the default cancellation behaviour.
I'm fine with enabling it by default. In fact, we have enabled it internally and have not seen any issues. @HyukjinKwon what do you think?
Yeah, I am fine.
created #41047
Seems SPARK-34064 is needed to cancel broadcast jobs
…by default

What changes were proposed in this pull request?
This PR enables `spark.sql.thriftServer.interruptOnCancel` by default.

Why are the changes needed?
Address the comment #30481 (comment).

Does this PR introduce any user-facing change?
Yes.

How was this patch tested?
Pass CI.

Closes #41047 from ulysses-you/33526-F.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR adds a new config, `spark.sql.thriftServer.forceCancel`, to give users a way to interrupt tasks when cancelling a statement.

Why are the changes needed?
After #29933, we support cancelling a query on timeout, but the default behavior of `SparkContext.cancelJobGroup` won't interrupt tasks and just lets them finish by themselves. In some cases this is dangerous, e.g., with data skew or a heavy shuffle: a task can keep running for a long time after the cancel, and its resources are not released.

Does this PR introduce any user-facing change?
Yes, a new config.

How was this patch tested?
Added a test.
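The danger described above, a cancelled job whose tasks keep holding their slots, can be sketched in Python with cooperative events standing in for JVM thread interruption (all names here are illustrative, not Spark APIs):

```python
import threading
import time

def heavy_task(interrupted: threading.Event, done: threading.Event) -> None:
    """Stand-in for a skewed task: it checks for interruption between small
    units of work, mirroring interrupt-aware code on the JVM."""
    for _ in range(50):
        if interrupted.is_set():
            return            # interrupted: stop early, release resources
        time.sleep(0.01)      # one unit of work
    done.set()                # ran to completion despite any cancel

# Cancel WITHOUT interruption: the task keeps running until finished.
interrupted, done = threading.Event(), threading.Event()
t = threading.Thread(target=heavy_task, args=(interrupted, done))
t.start()
# (a cancel happens here, but no interrupt is delivered to the task)
t.join()
assert done.is_set()          # the task held its slot for the full duration

# Cancel WITH interruption (forceCancel=true): the task stops early.
interrupted, done = threading.Event(), threading.Event()
t = threading.Thread(target=heavy_task, args=(interrupted, done))
t.start()
interrupted.set()             # analogue of interruptOnCancel=true
t.join()
assert not done.is_set()      # the task exited before finishing its work
```

The first half is the skew scenario the PR description warns about: the job is "cancelled" yet its resources stay occupied until the task decides to finish on its own.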