Flink lookup intermittently times out while the Fluss cluster is upgrading #2110

@swuferhong

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.8.0 (latest release)

Please describe the bug 🐞

Flink lookups intermittently time out while the Fluss cluster is being upgraded. A single timeout is enough to fail the Flink job, and the problem cannot be avoided no matter how large table.exec.async-lookup.timeout is set.

The error is as follows:

java.lang.Exception: Could not complete the stream element: Record @ (undef) : +I(xxx)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.completeExceptionally(AsyncWaitOperator.java:636)
	at org.apache.flink.streaming.api.functions.async.AsyncFunction.timeout(AsyncFunction.java:97)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.timerTriggered(AsyncWaitOperator.java:654)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$registerTimeout$1(AsyncWaitOperator.java:649)
	at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.lambda$registerTimer$2(AsyncWaitOperator.java:433)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(StreamTask.java:2186)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$deferCallbackToMailbox$27(StreamTask.java:2177)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:101)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:414)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:383)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:368)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:1202)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:1146)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:976)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:955)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:768)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580)
	at java.base/java.lang.Thread.run(Thread.java:991)
Caused by: java.util.concurrent.TimeoutException: Async function call has timed out.
	... 19 more
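
The symptom in the trace, a timer firing on a request that never received a result, would be consistent with a lookup future that is simply never completed. As a minimal sketch of that assumption only (this is not Fluss or Flink code; `LookupTimeoutSketch` and `timesOut` are hypothetical names), a plain `CompletableFuture` that nobody completes fails for every finite timeout, which would match the observation that raising table.exec.async-lookup.timeout does not help:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LookupTimeoutSketch {

    /**
     * Returns true if a future that is never completed fails with a
     * TimeoutException once the given timeout elapses.
     */
    public static boolean timesOut(long timeoutMillis) {
        // Hypothetical stand-in for a lookup whose response never arrives,
        // e.g. because the request went to a server that no longer exists.
        CompletableFuture<String> pendingLookup = new CompletableFuture<>();

        // orTimeout (Java 9+) completes the future exceptionally with a
        // TimeoutException if nothing else completes it in time.
        pendingLookup.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);

        try {
            pendingLookup.get();
            return false; // unreachable here: nobody ever completes the future
        } catch (ExecutionException e) {
            return e.getCause() instanceof TimeoutException;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A larger timeout only delays the failure; it does not prevent it.
        for (long timeout : new long[] {50, 200, 500}) {
            System.out.println("timeout=" + timeout + "ms -> timed out: " + timesOut(timeout));
        }
    }
}
```

If this reading is right, the fix would have to make the stuck request fail fast or be retried against fresh metadata, rather than extend the timeout.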

The root cause is still unknown, but there are two likely possibilities:

  1. During upgrades, pods are recreated and their IP addresses change, which may cause metadata requests to take longer.
  2. The Netty connection timeout (client.connect-timeout) is set to 120 seconds. If the client sends a request to an IP address that no longer exists, but to which it previously had an established connection, it may wait the full 120 seconds before timing out.
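
To put rough numbers on the second possibility, the following sketch compares the lookup timeout with the time spent waiting on dead addresses. The 120-second value comes from the issue; the 180-second lookup timeout, the number of back-to-back stale attempts, and the names `LookupBudgetSketch`, `worstCaseStall`, and `lookupWouldTimeOut` are assumptions for illustration, not measured behavior of the Fluss client:

```java
import java.time.Duration;

public class LookupBudgetSketch {

    /** Total wait if the client hits `staleAttempts` dead addresses back to back. */
    public static Duration worstCaseStall(Duration connectTimeout, int staleAttempts) {
        return connectTimeout.multipliedBy(staleAttempts);
    }

    /** True if the stall alone already uses up the whole async lookup timeout. */
    public static boolean lookupWouldTimeOut(
            Duration lookupTimeout, Duration connectTimeout, int staleAttempts) {
        return worstCaseStall(connectTimeout, staleAttempts).compareTo(lookupTimeout) >= 0;
    }

    public static void main(String[] args) {
        Duration connectTimeout = Duration.ofSeconds(120); // value quoted in the issue
        Duration lookupTimeout = Duration.ofSeconds(180);  // assumed job setting

        for (int attempts = 1; attempts <= 3; attempts++) {
            System.out.println(
                    attempts + " stale attempt(s): stall="
                            + worstCaseStall(connectTimeout, attempts).getSeconds()
                            + "s, lookup times out: "
                            + lookupWouldTimeOut(lookupTimeout, connectTimeout, attempts));
        }
    }
}
```

Under these assumptions a single stalled connect still fits within the lookup timeout, while two in a row would already exceed it.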

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
