Skip to content

Commit e8445b7

Browse files
committed
[KYUUBI #2102] Support to retry the internal thrift request call and add engine liveness probe to enable fast fail before retry
### _Why are the changes needed?_ To close #2102 Support to retry all the internal thrift request calls(except RenewDelegationToken now), and fast fail if the remote engine is not stable or not alive. In this PR, it supports engine liveness probe. If it is enabled, a companion thrift client will be created and open a liveness probe session when opening remote engine session. It will send some simple thrift request(GetInfo) to check whether the remote engine is alive, and fast fail before retry if remote engine is not connectable. #### Why not use the same thrift client to check engine liveness before retry? I tried that, but met `out of resp sequence` error. For example: 1. send getOperationStatus request 2. read time out 3. send GetInfoType request 4. receive getOperationStatus response (out of resp sequence) ### _How was this patch tested?_ - [x] Add some test cases that check the changes thoroughly including negative and positive cases if possible - [ ] Add screenshots for manual tests if appropriate - [x] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request Closes #2122 from turboFei/retry_rpc. Closes #2102 3926ba0 [Fei Wang] adress comments ade4ede [Fei Wang] add timeout 1b7a64f [Fei Wang] Only check remote engine alive before retry 98e03f8 [Fei Wang] refactor fac388c [Fei Wang] remove unused import 9c6d873 [Fei Wang] add ut 9b59565 [Fei Wang] Support to retry the thrift request and engine alive probe Authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: Fei Wang <fwang12@ebay.com>
1 parent d0c92ca commit e8445b7

File tree

5 files changed

+213
-59
lines changed

5 files changed

+213
-59
lines changed

docs/deployment/settings.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -301,8 +301,9 @@ Key | Default | Meaning | Type | Since
301301
<code>kyuubi.operation.plan.only.mode</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>NONE</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Whether to perform the statement in a PARSE, ANALYZE, OPTIMIZE, PHYSICAL, EXECUTION only way without executing the query. When it is NONE, the statement will be fully executed</div>|<div style='width: 30pt'>string</div>|<div style='width: 20pt'>1.4.0</div>
302302
<code>kyuubi.operation.query.timeout</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>&lt;undefined&gt;</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Timeout for query executions at server-side, take affect with client-side timeout(`java.sql.Statement.setQueryTimeout`) together, a running query will be cancelled automatically if timeout. It's off by default, which means only client-side take fully control whether the query should timeout or not. If set, client-side timeout capped at this point. To cancel the queries right away without waiting task to finish, consider enabling kyuubi.operation.interrupt.on.cancel together.</div>|<div style='width: 30pt'>duration</div>|<div style='width: 20pt'>1.2.0</div>
303303
<code>kyuubi.operation.scheduler.pool</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>&lt;undefined&gt;</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>The scheduler pool of job. Note that, this config should be used after change Spark config spark.scheduler.mode=FAIR.</div>|<div style='width: 30pt'>string</div>|<div style='width: 20pt'>1.1.1</div>
304-
<code>kyuubi.operation.status.polling.max.attempts</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>5</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Max attempts for long polling asynchronous running sql query's status on raw transport failures, e.g. TTransportException</div>|<div style='width: 30pt'>int</div>|<div style='width: 20pt'>1.4.0</div>
304+
<code>kyuubi.operation.status.polling.max.attempts</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>5</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>(deprecated) - Using kyuubi.operation.thrift.client.request.max.attempts instead</div>|<div style='width: 30pt'>int</div>|<div style='width: 20pt'>1.4.0</div>
305305
<code>kyuubi.operation.status.polling.timeout</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>PT5S</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Timeout(ms) for long polling asynchronous running sql query's status</div>|<div style='width: 30pt'>duration</div>|<div style='width: 20pt'>1.0.0</div>
306+
<code>kyuubi.operation.thrift.client.request.max.attempts</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>5</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Max attempts for operation thrift request call at server-side on raw transport failures, e.g. TTransportException</div>|<div style='width: 30pt'>int</div>|<div style='width: 20pt'>1.6.0</div>
306307

307308

308309
### Server
@@ -321,6 +322,9 @@ Key | Default | Meaning | Type | Since
321322
<code>kyuubi.session.conf.advisor</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>&lt;undefined&gt;</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>A config advisor plugin for Kyuubi Server. This plugin can provide some custom configs for different user or session configs and overwrite the session configs before open a new session. This config value should be a class which is a child of 'org.apache.kyuubi.plugin.SessionConfAdvisor' which has zero-arg constructor.</div>|<div style='width: 30pt'>string</div>|<div style='width: 20pt'>1.5.0</div>
322323
<code>kyuubi.session.conf.ignore.list</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'></div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>A comma separated list of ignored keys. If the client connection contains any of them, the key and the corresponding value will be removed silently during engine bootstrap and connection setup. Note that this rule is for server-side protection defined via administrators to prevent some essential configs from tampering but will not forbid users to set dynamic configurations via SET syntax.</div>|<div style='width: 30pt'>seq</div>|<div style='width: 20pt'>1.2.0</div>
323324
<code>kyuubi.session.conf.restrict.list</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'></div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>A comma separated list of restricted keys. If the client connection contains any of them, the connection will be rejected explicitly during engine bootstrap and connection setup. Note that this rule is for server-side protection defined via administrators to prevent some essential configs from tampering but will not forbid users to set dynamic configurations via SET syntax.</div>|<div style='width: 30pt'>seq</div>|<div style='width: 20pt'>1.2.0</div>
325+
<code>kyuubi.session.engine.alive.probe.enabled</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>false</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Whether to enable the engine alive probe, it true, we will create a companion thrift client that sends simple request to check whether the engine is keep alive.</div>|<div style='width: 30pt'>boolean</div>|<div style='width: 20pt'>1.6.0</div>
326+
<code>kyuubi.session.engine.alive.probe.interval</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>PT10S</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>The interval for engine alive probe.</div>|<div style='width: 30pt'>duration</div>|<div style='width: 20pt'>1.6.0</div>
327+
<code>kyuubi.session.engine.alive.timeout</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>PT2M</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>The timeout for engine alive. If there is no alive probe success in the last timeout window, the engine will be marked as no-alive.</div>|<div style='width: 30pt'>duration</div>|<div style='width: 20pt'>1.6.0</div>
324328
<code>kyuubi.session.engine.check.interval</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>PT1M</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>The check interval for engine timeout</div>|<div style='width: 30pt'>duration</div>|<div style='width: 20pt'>1.0.0</div>
325329
<code>kyuubi.session.engine.flink.main.resource</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>&lt;undefined&gt;</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>The package used to create Flink SQL engine remote job. If it is undefined, Kyuubi will use the default</div>|<div style='width: 30pt'>string</div>|<div style='width: 20pt'>1.4.0</div>
326330
<code>kyuubi.session.engine.flink.max.rows</code>|<div style='width: 65pt;word-wrap: break-word;white-space: normal'>1000000</div>|<div style='width: 170pt;word-wrap: break-word;white-space: normal'>Max rows of Flink query results. For batch queries, rows that exceeds the limit would be ignored. For streaming queries, the query would be canceled if the limit is reached.</div>|<div style='width: 30pt'>int</div>|<div style='width: 20pt'>1.5.0</div>

kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala

Lines changed: 34 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -668,6 +668,29 @@ object KyuubiConf {
668668
.timeConf
669669
.createWithDefault(Duration.ofSeconds(60).toMillis)
670670

671+
val ENGINE_ALIVE_PROBE_ENABLED: ConfigEntry[Boolean] =
672+
buildConf("session.engine.alive.probe.enabled")
673+
.doc("Whether to enable the engine alive probe, it true, we will create a companion thrift" +
674+
" client that sends simple request to check whether the engine is keep alive.")
675+
.version("1.6.0")
676+
.booleanConf
677+
.createWithDefault(false)
678+
679+
val ENGINE_ALIVE_PROBE_INTERVAL: ConfigEntry[Long] =
680+
buildConf("session.engine.alive.probe.interval")
681+
.doc("The interval for engine alive probe.")
682+
.version("1.6.0")
683+
.timeConf
684+
.createWithDefault(Duration.ofSeconds(10).toMillis)
685+
686+
val ENGINE_ALIVE_TIMEOUT: ConfigEntry[Long] =
687+
buildConf("session.engine.alive.timeout")
688+
.doc("The timeout for engine alive. If there is no alive probe success in the last timeout" +
689+
" window, the engine will be marked as no-alive.")
690+
.version("1.6.0")
691+
.timeConf
692+
.createWithDefault(Duration.ofSeconds(120).toMillis)
693+
671694
val ENGINE_INIT_TIMEOUT: ConfigEntry[Long] = buildConf("session.engine.initialize.timeout")
672695
.doc("Timeout for starting the background engine, e.g. SparkSQLEngine.")
673696
.version("1.0.0")
@@ -825,14 +848,23 @@ object KyuubiConf {
825848
.timeConf
826849
.createWithDefault(Duration.ofSeconds(5).toMillis)
827850

851+
@deprecated(s"using kyuubi.operation.thrift.client.request.max.attempts instead", "1.6.0")
828852
val OPERATION_STATUS_POLLING_MAX_ATTEMPTS: ConfigEntry[Int] =
829853
buildConf("operation.status.polling.max.attempts")
830-
.doc("Max attempts for long polling asynchronous running sql query's status on raw" +
831-
" transport failures, e.g. TTransportException")
854+
.doc(s"(deprecated) - Using kyuubi.operation.thrift.client.request.max.attempts instead")
832855
.version("1.4.0")
833856
.intConf
834857
.createWithDefault(5)
835858

859+
val OPERATION_THRIFT_CLIENT_REQUEST_MAX_ATTEMPTS: ConfigEntry[Int] =
860+
buildConf("operation.thrift.client.request.max.attempts")
861+
.doc("Max attempts for operation thrift request call at server-side on raw transport" +
862+
" failures, e.g. TTransportException")
863+
.version("1.6.0")
864+
.intConf
865+
.checkValue(_ > 0, "must be positive number")
866+
.createWithDefault(5)
867+
836868
val OPERATION_FORCE_CANCEL: ConfigEntry[Boolean] =
837869
buildConf("operation.interrupt.on.cancel")
838870
.doc("When true, all running tasks will be interrupted if one cancels a query. " +

0 commit comments

Comments
 (0)