Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
Flink lookups intermittently time out while the Fluss cluster is being upgraded. A single timeout causes the Flink job to fail, and this cannot be avoided no matter how large table.exec.async-lookup.timeout is set.
The error is as follows:
java.lang.Exception: Could not complete the stream element: Record @ (undef) : +I(xxx)
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.completeExceptionally(AsyncWaitOperator.java:636)
at org.apache.flink.streaming.api.functions.async.AsyncFunction.timeout(AsyncFunction.java:97)
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.timerTriggered(AsyncWaitOperator.java:654)
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$registerTimeout$1(AsyncWaitOperator.java:649)
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.lambda$registerTimer$2(AsyncWaitOperator.java:433)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:2186)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$deferCallbackToMailbox$27(StreamTask.java:2177)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:101)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:414)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:383)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:368)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:1202)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:1146)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:976)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:955)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:768)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580)
at java.base/java.lang.Thread.run(Thread.java:991)
Caused by: java.util.concurrent.TimeoutException: Async function call has timed out.
... 19 more
The root cause is still unknown, but there are two likely possibilities:
- During upgrades, pods are recreated and their IP addresses change, which may cause metadata requests to take longer.
- The Netty connection timeout is set to 120 seconds (client.connect-timeout). If the client sends a request to an IP address that no longer exists, but to which it previously had an established connection, it may wait for the full 120 seconds before timing out.
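To make the second possibility concrete, here is a minimal Java sketch (Java 9+ for orTimeout and failedFuture). It is not the Fluss client API: the class LookupRetrySketch, the method lookupWithRetry, the address names and the 50 ms deadline are all hypothetical. It models a lookup sent to a stale address as a future that never completes, and shows that giving each attempt its own short deadline and falling back to the next address lets the lookup finish, whereas a single long outer timeout would only wait for the stalled request.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiFunction;

// Hypothetical illustration only; none of these names exist in Fluss or Flink.
final class LookupRetrySketch {

    /** Tries each address in order, bounding every attempt with its own deadline. */
    static <K, V> CompletableFuture<V> lookupWithRetry(
            K key,
            List<String> addresses,
            BiFunction<String, K, CompletableFuture<V>> lookup,
            long perAttemptTimeoutMs) {
        return attempt(key, addresses, 0, lookup, perAttemptTimeoutMs);
    }

    private static <K, V> CompletableFuture<V> attempt(
            K key,
            List<String> addresses,
            int index,
            BiFunction<String, K, CompletableFuture<V>> lookup,
            long timeoutMs) {
        if (index >= addresses.size()) {
            // Every candidate address failed or timed out.
            return CompletableFuture.failedFuture(
                    new TimeoutException("lookup failed on all " + addresses.size() + " addresses"));
        }
        return lookup.apply(addresses.get(index), key)
                // Fail this attempt after timeoutMs instead of waiting indefinitely.
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                // On any error (including the timeout), move on to the next address.
                .<CompletableFuture<V>>handle((value, error) -> {
                    if (error == null) {
                        return CompletableFuture.completedFuture(value);
                    }
                    return attempt(key, addresses, index + 1, lookup, timeoutMs);
                })
                .thenCompose(next -> next);
    }

    public static void main(String[] args) {
        // Assumed stand-in for a client: "stale-ip" never answers, "fresh-ip" answers at once.
        BiFunction<String, String, CompletableFuture<String>> fakeLookup = (address, key) ->
                address.equals("stale-ip")
                        ? new CompletableFuture<String>()
                        : CompletableFuture.completedFuture("row-for-" + key);

        String row = lookupWithRetry("k1", List.of("stale-ip", "fresh-ip"), fakeLookup, 50).join();
        System.out.println(row);
    }
}
```

Whether the real client behaves like the never-completing future above is exactly what still needs to be confirmed. The sketch only shows why raising table.exec.async-lookup.timeout would not help if it does.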
Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!