Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TASK][EASY] Statements get blocked even with timeouts of connection, socket and query already set #5392

Closed
3 of 4 tasks
bowenliang123 opened this issue Oct 10, 2023 · 17 comments
Assignees
Labels
kind:bug This is a clearly a bug priority:major

Comments

@bowenliang123
Copy link
Contributor

bowenliang123 commented Oct 10, 2023

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

With the follow timeout set, the statement gets block and no proper response or connection closing.

Connection and socket timeouts are set via JDBC URL:
jdbc:kyuubi://kyuubi_server_ip:port/;retries=3;connectTimeout=300000;socketTimeout=1800000
Query Timeout is set via JDBC statement statement.setTimeout().

Connection Timeout: 30000 ms (or 30s)
Socket Timeout: 1800000ms (or 30min)
Query Timeout: 60*60 sec (or 1hr)

After waiting for ~ 3hr, the statement finally gets TIMEDOUT_STATE and closes the session.

Affects Version(s)

1.7.1 / master

Kyuubi Server Log Output

The insert statement was submitted at 06:22:24.

At first, the JDBC statement logs output correctly with repeated RUNNING_STATE.
At 06:29:24, the output get blocked, until over 2 hours later at 08:35:41.
Finally, at 09:16:03, with SQLTimeoutException thrown and failed the query.

[2023-10-10 06:22:24]-[INFO] [pool-140421-thread-1] 2023-10-10 06:22:22.032 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE

...
[2023-10-10 06:28:34]-[INFO] [pool-140421-thread-1] 2023-10-10 06:28:33.596 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE
[2023-10-10 06:28:49]-[INFO] [pool-140421-thread-1] 2023-10-10 06:28:46.013 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE
[2023-10-10 06:29:19]-[INFO] [pool-140421-thread-1] 2023-10-10 06:29:15.079 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE
[2023-10-10 06:29:24]-[INFO] [pool-140421-thread-1] 2023-10-10 06:29:20.728 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE
[2023-10-10 08:35:41]-[INFO] [pool-140421-thread-1] 2023-10-10 08:35:39.212 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in RUNNING_STATE
[2023-10-10 09:16:03]-[INFO] [pool-140421-thread-1] 2023-10-10 09:16:03.827 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[b0bf024e-c21e-48b8-8ac7-1702f731ba96] in TIMEDOUT_STATE
[2023-10-10 09:16:03]-[INFO] [pool-140421-thread-1] 2023-10-10 09:16:03.852 INFO org.apache.kyuubi.operation.ExecuteStatement: Processing gf_personas_plf's query[b0bf024e-c21e-48b8-8ac7-1702f731ba96]: RUNNING_STATE -> TIMEDOUT_STATE, time taken: 10429.681 seconds
[2023-10-10 09:16:03]-[FATAL] java.sql.SQLTimeoutException: Query timed out after 3600 seconds
	at org.apache.kyuubi.jdbc.hive.KyuubiStatement.waitForOperationToComplete(KyuubiStatement.java:411)
	at org.apache.kyuubi.jdbc.hive.KyuubiStatement.executeWithConfOverlay(KyuubiStatement.java:200)
	at org.apache.kyuubi.jdbc.hive.KyuubiStatement.execute(KyuubiStatement.java:194)
	at org.apache.kyuubi.jdbc.hive.KyuubiPreparedStatement.execute(KyuubiPreparedStatement.java:66)
	at cn.com.gf.bdp.horae.runner.engine.indexcalculate.connector.KyuubiConnector.executeSql(KyuubiConnector.java:157)
	at cn.com.gf.bdp.horae.runner.engine.indexcalculate.connector.KyuubiConnector.executeSql(KyuubiConnector.java:97)
	at cn.com.gf.bdp.horae.runner.engine.indexcalculate.KyuubiIndexCalculateEngine.execute(KyuubiIndexCalculateEngine.java:188)
	at cn.com.gf.bdp.horae.runner.SparkIndexTaskRunner.execute(SparkIndexTaskRunner.java:48)
	at cn.com.gf.bdp.horae.runner.AbstractTaskRunner.run(AbstractTaskRunner.java:215)
	at java.lang.Thread.run(Thread.java:748)

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
@bowenliang123 bowenliang123 added kind:bug This is a clearly a bug priority:major labels Oct 10, 2023
@bowenliang123
Copy link
Contributor Author

bowenliang123 commented Oct 10, 2023

It's suspicious that the loop inside ExecuteStatement for status checks,

  1. it tries to fetch query log first (fetchQueryLog)
  2. and then poll the operation state for query

In step one, it uses a KyuubiSyncThriftClient instance to connect and query from the engine instance.
The a KyuubiSyncThriftClient instance is created with no socket timeout (set to 0 manually),

KyuubiSessionImpl:

_client = KyuubiSyncThriftClient.createClient(user, passwd, host, port, sessionConf)

KyuubiSyncThriftClient:

val tProtocol = createTProtocol(user, passwd, host, port, 0, loginTimeout)

With no socket timeout set, it may block the timeout state checks and therefore the statement is not able to do the step 2 and jammed at step 1.

@bowenliang123
Copy link
Contributor Author

cc @turboFei @pan3793 @wForget

@pan3793
Copy link
Member

pan3793 commented Oct 10, 2023

Try setting kyuubi.session.engine.alive.probe.enabled=true, BTW, it was introduced in 1.6.0 but disabled by default, is it good to enable it in 1.8? @turboFei

@bowenliang123
Copy link
Contributor Author

The engine probe is a side-way approach for engine lifecycle.

May I create a PR to change the socket timeout from 0 to defined value falling back to query timeout ?

@pan3793
Copy link
Member

pan3793 commented Oct 10, 2023

timeout 0 with additional probe is by design, the thrift client lifecycle does not match the statement

@wForget
Copy link
Member

wForget commented Oct 10, 2023

Can you provide engine logs? Query timeout is triggered by the engine side.

@bowenliang123
Copy link
Contributor Author

Can you provide engine logs? Query timeout is triggered by the engine side.

Sorry, the engine log is not available.

@bowenliang123
Copy link
Contributor Author

timeout 0 with additional probe is by design, the thrift client lifecycle does not match the statement

OK. Is there anything we could do on the server side for a timeout guarantee?
e.g. 1. set to time out state if statement duration exceeded query timeout on server side.
2. rebuild the client instance

@wForget
Copy link
Member

wForget commented Oct 10, 2023

With no socket timeout set, it may block the timeout state checks and therefore the statement is not able to do the step 2 and jammed at step 1.

I think the case may not be due to this. If client socket is blocked, the operation should not receive TIMEDOUT_STATE.

@bowenliang123
Copy link
Contributor Author

Doubtfully have TIMEDOUT_STATE finally, the driver could be busy GC and not responding correctly. And maybe when it is back to normal, it sends the response.

@bowenliang123
Copy link
Contributor Author

I see not timeout limits on fetchQueryLog . hmm.

@wForget
Copy link
Member

wForget commented Oct 10, 2023

Doubtfully have TIMEDOUT_STATE finally, the driver could be busy GC and not responding correctly. And maybe when it is back to normal, it sends the response.

Full GC that lasts an hour seems impossible, if that happens the yarn will kill engine too.

@wForget
Copy link
Member

wForget commented Oct 10, 2023

I think we can also start queryTimeout monitor in the server.

@bowenliang123
Copy link
Contributor Author

timeout 0 with additional probe is by design, the thrift client lifecycle does not match the statement

Socket timeout is about maximum time to wait for the response only. A limited value is more close to the guarantee.
And we could also rebuild the client for the session if it's failed, as it's a mutable member of the session.

@bowenliang123
Copy link
Contributor Author

I think we can also start queryTimeout monitor in the server.

SGTM. Once the elapsed duration of the statement is greater tahn the query timeout , the statement should be marked as TIMED_OUT status.

@pan3793
Copy link
Member

pan3793 commented Oct 11, 2023

@bowenliang123 there is a chance that the fetch result may block for a long time, for example, when incremental collect is enabled, once the first partition of the result is available, the operation state will be marked as FINISHED, the subsequential partitions will be calculated lazily, which may cause the fetch result blocks for a long time.

And, query timeout should be a concept of operation scope, but the thrift client is session scope

bowenliang123 added a commit that referenced this issue Oct 18, 2023
…tement

### _Why are the changes needed?_

As reported in #5392, currently the server is unable to guarantee that the statement timed-out when the engine may have no proper response for the server's request therefore the query timeout does not work.

Introduce a server-side statement query timeout monitor, to ensure the time-out query statements are set to TIMEOUT state and help the JDBC client get out of the blocked status.

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/master/contributing/code/testing.html#running-tests) locally before make a pull request

### _Was this patch authored or co-authored using generative AI tooling?_

No.

Closes #5398 from bowenliang123/stmt-timeout.

Closes #5392

f5733b3 [Bowen Liang] use addTimeoutMonitor for server-side query timeout checks

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
(cherry picked from commit bdc28ac)
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
@bowenliang123
Copy link
Contributor Author

bowenliang123 commented Oct 18, 2023

I think we can also start queryTimeout monitor in the server.

The server-side timeout monitor for ExecuteStatement operation has been implemented in #5398 now.

@pan3793 pan3793 changed the title [Bug] Statements get blocked even with timeouts of connection, socket and query already set [TASK][EASY] Statements get blocked even with timeouts of connection, socket and query already set Nov 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug This is a clearly a bug priority:major
Projects
No open projects
Development

No branches or pull requests

3 participants