Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver #30481

Closed
wants to merge 12 commits into from

Conversation

ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

This PR add a new config spark.sql.thriftServer.forceCancel to give user a way to interrupt task when cancel statement.

Why are the changes needed?

After #29933, we support cancel query if timeout, but the default behavior of SparkContext.cancelJobGroups won't interrupt task and just let task finish by itself. In some case it's dangerous, e.g., data skew or exists a heavily shuffle. A task will hold in a long time after do cancel and the resource will not release.

Does this PR introduce any user-facing change?

Yes, a new config.

How was this patch tested?

Add test.

if (forceCancel.get()) {
assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime < 1000)
} else {
assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime >= 2900)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid the sleep accuracy so check 2.9s instead of 3s.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, how about doing rough comparisons like running time of forceCancel=false > running time of forceCancel=true? I'm a bit worried that this test can cause test flakiness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this PR, the comparison running time of forceCancel=false also can langer than running time of forceCancel=true due to accuracy. How about check 2s ? I believe it is stable enough.

@github-actions github-actions bot added the SQL label Nov 24, 2020
@SparkQA
Copy link

SparkQA commented Nov 24, 2020

Test build #131642 has finished for PR 30481 at commit ba382a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36697/

@SparkQA
Copy link

SparkQA commented Dec 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36697/

"interrupted. When false, all the job of query will be cancelled but running task" +
"will be remained until finished. Note that, this config must be set before query" +
"otherwise it doesn't help.")
.version("3.1.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 3.1.0 -> 3.2.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, 3.1.0 has been cut.

@maropu
Copy link
Member

maropu commented Dec 7, 2020

Adding a new config for it looks fine.

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36936/

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36936/

} else {
assert(System.currentTimeMillis() - taskEnd.taskInfo.launchTime >= 2000)
}
case _ =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add assert here, too? It seems taskEnd.reason assumes to be always TaskKilled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Test build #132336 has finished for PR 30481 at commit d1ebbba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36954/

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36954/

val e2 = intercept[SQLException] {
statement.execute("select java_method('java.lang.Thread', 'sleep', 3000L)")
}.getMessage
assert(e2.contains("Query timed out"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codes in L105-111 and L113-118 seems to be almost the same, so could you write it like this?

Seq(true, false).foreach { forceCancel =>
  ....
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better !

.doc("When true, all the job of query will be cancelled and running tasks will be" +
"interrupted. When false, all the job of query will be cancelled but running task" +
"will be remained until finished. Note that, this config must be set before query" +
"otherwise it doesn't help.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to describe Note that, this config must be set before query otherwise it doesn't help. for this config? I think the other SQL configs have the same restriction, too (users need to set a config before running a query).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Test build #132354 has finished for PR 30481 at commit de89232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Dec 7, 2020

Looks fine cc: @HyukjinKwon @wangyum @viirya

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36971/

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36971/

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Test build #132371 has finished for PR 30481 at commit f8d6384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -937,6 +937,15 @@ object SQLConf {
.timeConf(TimeUnit.SECONDS)
.createWithDefault(0L)

val THRIFTSERVER_FORCE_CANCEL =
buildConf("spark.sql.thriftServer.forceCancel")
Copy link
Member

@viirya viirya Dec 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thriftServer? thriftserver?

spark.sql.thriftserver.scheduler.pool
spark.sql.thriftserver.ui.retainedStatements
spark.sql.thriftserver.ui.retainedSessions
spark.sql.thriftServer.queryTimeout
spark.sql.thriftServer.incrementalCollect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems recent config names thriftServer. thriftserver is early version config.

Comment on lines 942 to 944
.doc("When true, all the job of query will be cancelled and running tasks will be" +
"interrupted. When false, all the job of query will be cancelled but running task" +
"will be remained until finished.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is this doc correct? What it means all the job of query will be cancelled when this is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@ulysses-you
Copy link
Contributor Author

@juliuszsompolski This disadvantage seems important. ThrfitServer is a long lived process, I believe it's better to give user a dynamic config during it running rather than restart it. Maybe we can create another topic about the core config. What do you think about it ? @maropu @HyukjinKwon @viirya

@HyukjinKwon
Copy link
Member

@ulysses-you, is there any case we have to disable this configuration when spark.sql.thriftServer.queryTimeout is set? If no, let's remove the configuration in this PR, and set interruptOnCancel to true when spark.sql.thriftServer.queryTimeout is set.

spark.sql.thriftServer.queryTimeout is not out yet so I think it's fine to add this behaviour.

@ulysses-you
Copy link
Contributor Author

ulysses-you commented Dec 10, 2020

Logically, it affects 2 code place.

  1. cancel by query timeout.
  2. cancel by statement self.

But I'm ok to let it bind with query timeout since it's a more useful case for user.

@HyukjinKwon
Copy link
Member

Okay, I'm fine having this configuration. We can have another configuration for core side that sets this configuration together but let's leave it for a future work. Does it make sense @juliuszsompolski?

@HyukjinKwon
Copy link
Member

Another option might be to make it SQL specific configuration since for core they can explicitly enable this config via Scala API but plain SQL does not have a way to do it.

To be honest I'm not sure which way is the best. Let me know if you guys have a different thought.

@juliuszsompolski
Copy link
Contributor

Thinking more about it more that users of scala or python API can just set it as a parameter to a jobgroup and don't need a conf for that, I think it makes sense to keep it as is as a Thriftserver conf, because setting it via a SET is mainly useful to JDBC/ODBC SQL-only users.

@HyukjinKwon
Copy link
Member

@ulysses-you, can you address the rest of the comments and rebase? Looks we can just go ahead.

@ulysses-you
Copy link
Contributor Author

thank you @HyukjinKwon , I have updated the guide doc with 921f9cf. IMO it's better to remind user this config when they set query timeout.

Please correct me if missed something.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@SparkQA
Copy link

SparkQA commented Dec 11, 2020

Test build #132629 has finished for PR 30481 at commit 99ec864.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37233/

@SparkQA
Copy link

SparkQA commented Dec 11, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37233/

@SparkQA
Copy link

SparkQA commented Dec 11, 2020

Test build #132631 has finished for PR 30481 at commit 0466c0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@ulysses-you
Copy link
Contributor Author

thanks for the discussion !

@ulysses-you ulysses-you deleted the SPARK-33526 branch March 4, 2021 04:06
yaooqinn pushed a commit to apache/kyuubi that referenced this pull request Mar 26, 2021
![ulysses-you](https://badgen.net/badge/Hello/ulysses-you/green) [![Closes #451](https://badgen.net/badge/Preview/Closes%20%23451/blue)](https://github.com/yaooqinn/kyuubi/pull/451) ![200](https://badgen.net/badge/%2B/200/red) ![17](https://badgen.net/badge/-/17/green) ![27](https://badgen.net/badge/commits/27/yellow) ![Target Issue](https://badgen.net/badge/Missing/Target%20Issue/ff0000) ![Test Plan](https://badgen.net/badge/Missing/Test%20Plan/ff0000) ![Feature](https://badgen.net/badge/Label/Feature/) [<img width="16" alt="Powered by Pull Request Badge" src="https://user-images.githubusercontent.com/1393946/111216524-d2bb8e00-85d4-11eb-821b-ed4c00989c02.png">](https://pullrequestbadge.com/?utm_medium=github&utm_source=yaooqinn&utm_campaign=badge_info)<!-- PR-BADGE: PLEASE DO NOT REMOVE THIS COMMENT -->

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/yaooqinn/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Manual cherry-pick some Spark patch into Kyuubi.
1. [Support query auto timeout cancel on thriftserver](apache/spark#29933)
2. [Add config to control if cancel invoke interrupt task on thriftserver](apache/spark#30481)

In order to keep backward with early Spark version, we hard code the config key instead of refer to Spark SQLConf.

Note that, the exists timeout of operator (`kyuubi.operation.idle.timeout`) is to cancel that client has no access with engine. That said if a query run a long time and the client is alive, the query would not be cancelled. Then the new added config `spark.sql.thriftServer.queryTimeout` can handle this case.

### _How was this patch tested?_
Add new test.

Closes #451 from ulysses-you/query-timeout.

212f579 [ulysses-you] docs
9206538 [ulysses-you] empty flaky test
ddab9bf [ulysses-you] flaty test
1da02a0 [ulysses-you] flaty test
edfadf1 [ulysses-you] nit
3f9920b [ulysses-you] address comment
9492c48 [ulysses-you] correct timeout
5df997e [ulysses-you] nit
2124952 [ulysses-you] address comment
192fdcc [ulysses-you] fix tets
d684af6 [ulysses-you] global config
1d1adda [ulysses-you] empty
967a63e [ulysses-you] correct import
128948e [ulysses-you] add session conf in session
144d51b [ulysses-you] fix
a90248b [ulysses-you] unused import
c90386f [ulysses-you] timeout move to operation manager
d780965 [ulysses-you] update docs
a5f7138 [ulysses-you] fix test
f7c7308 [ulysses-you] config name
7f3fb3d [ulysses-you] change conf place
97a011e [ulysses-you] unnecessary change
0953a76 [ulysses-you] move test
38ac0c0 [ulysses-you] Merge branch 'master' of https://github.com/yaooqinn/kyuubi into query-timeout
71bea97 [ulysses-you] refector implementation
35ef6f9 [ulysses-you] update conf
0cad8e2 [ulysses-you] Support query auto timeout cancel on thriftserver

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit to apache/kyuubi that referenced this pull request Mar 26, 2021
![ulysses-you](https://badgen.net/badge/Hello/ulysses-you/green) [![Closes #451](https://badgen.net/badge/Preview/Closes%20%23451/blue)](https://github.com/yaooqinn/kyuubi/pull/451) ![200](https://badgen.net/badge/%2B/200/red) ![17](https://badgen.net/badge/-/17/green) ![27](https://badgen.net/badge/commits/27/yellow) ![Target Issue](https://badgen.net/badge/Missing/Target%20Issue/ff0000) ![Test Plan](https://badgen.net/badge/Missing/Test%20Plan/ff0000) ![Feature](https://badgen.net/badge/Label/Feature/) [<img width="16" alt="Powered by Pull Request Badge" src="https://user-images.githubusercontent.com/1393946/111216524-d2bb8e00-85d4-11eb-821b-ed4c00989c02.png">](https://pullrequestbadge.com/?utm_medium=github&utm_source=yaooqinn&utm_campaign=badge_info)<!-- PR-BADGE: PLEASE DO NOT REMOVE THIS COMMENT -->

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/yaooqinn/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Manual cherry-pick some Spark patch into Kyuubi.
1. [Support query auto timeout cancel on thriftserver](apache/spark#29933)
2. [Add config to control if cancel invoke interrupt task on thriftserver](apache/spark#30481)

In order to keep backward with early Spark version, we hard code the config key instead of refer to Spark SQLConf.

Note that, the exists timeout of operator (`kyuubi.operation.idle.timeout`) is to cancel that client has no access with engine. That said if a query run a long time and the client is alive, the query would not be cancelled. Then the new added config `spark.sql.thriftServer.queryTimeout` can handle this case.

### _How was this patch tested?_
Add new test.

Closes #451 from ulysses-you/query-timeout.

212f579 [ulysses-you] docs
9206538 [ulysses-you] empty flaky test
ddab9bf [ulysses-you] flaty test
1da02a0 [ulysses-you] flaty test
edfadf1 [ulysses-you] nit
3f9920b [ulysses-you] address comment
9492c48 [ulysses-you] correct timeout
5df997e [ulysses-you] nit
2124952 [ulysses-you] address comment
192fdcc [ulysses-you] fix tets
d684af6 [ulysses-you] global config
1d1adda [ulysses-you] empty
967a63e [ulysses-you] correct import
128948e [ulysses-you] add session conf in session
144d51b [ulysses-you] fix
a90248b [ulysses-you] unused import
c90386f [ulysses-you] timeout move to operation manager
d780965 [ulysses-you] update docs
a5f7138 [ulysses-you] fix test
f7c7308 [ulysses-you] config name
7f3fb3d [ulysses-you] change conf place
97a011e [ulysses-you] unnecessary change
0953a76 [ulysses-you] move test
38ac0c0 [ulysses-you] Merge branch 'master' of https://github.com/yaooqinn/kyuubi into query-timeout
71bea97 [ulysses-you] refector implementation
35ef6f9 [ulysses-you] update conf
0cad8e2 [ulysses-you] Support query auto timeout cancel on thriftserver

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit fecdba3)
Signed-off-by: Kent Yao <yao@apache.org>
"When false, all running tasks will remain until finished.")
.version("3.2.0")
.booleanConf
.createWithDefault(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L134 and https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L270 unconditionally set interruptOnCancel=true for broadcasts and for streaming queries. https://issues.apache.org/jira/browse/HDFS-1208 remains open, but it seems there's not been issues resulting from using true in the other places.

Given that, do we need to bother protecting it with a config for Thriftserver, or at least this config could be flipped to be enabled by default? Having it default to false limits the default cancellation behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine to enable it by default. In fact, we have enabled it internally and have not seen any issue. @HyukjinKwon what do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah,I am fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #41047

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems SPARK-34064 is needed to cancel broadcast jobs

dongjoon-hyun pushed a commit that referenced this pull request May 5, 2023
…by default

### What changes were proposed in this pull request?

This pr enables `spark.sql.thriftServer.interruptOnCancel` by default

### Why are the changes needed?

Address the comment #30481 (comment)

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

Pass CI

Closes #41047 from ulysses-you/33526-F.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
…by default

### What changes were proposed in this pull request?

This pr enables `spark.sql.thriftServer.interruptOnCancel` by default

### Why are the changes needed?

Address the comment apache#30481 (comment)

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

Pass CI

Closes apache#41047 from ulysses-you/33526-F.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
8 participants