Brief window of query failures immediately after a server pod finishes restarting and becomes healthy #18902

Description

@piby180

We regularly see a brief window of query failures immediately after a Pinot server pod becomes healthy following a restart. We observe no disruption while the pod is terminating or while it is in the 0/1 Running state; the failures occur only once the pod is marked ready and starts handling traffic again.

We have been seeing this issue for quite some time and would appreciate guidance on how to resolve it.

Error Details

The failures surface as QueryExecutionError exceptions. The stack trace shows a gRPC UNAVAILABLE status caused by a connection timeout after 30000 ms, which indicates that the query dispatcher cannot establish a connection to the newly restarted pod.

Full Java Stack Trace:

QueryExecutionError: Error dispatching query: 667717545000000055 to server: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local@{8421,8442}
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.processResults(QueryDispatcher.java:689)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:643)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:584)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:219)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:532)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
    at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientCallListenerImpl.onClose(ClientCallImpl.java:547)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
    at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:720)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: connection timed out after 30000 ms: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local/100.64.62.48:8421
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:615)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:840)
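
Since the root cause in the trace is a TCP connect timeout, a small probe can help measure how long the port stays unreachable after the pod reports ready. The following is a minimal sketch, not part of Pinot: the function names `can_connect` and `seconds_until_reachable` are hypothetical, and the host and port in the usage note are copied from the trace above. It checks only that a TCP connection can be opened, not that the gRPC service behind it is answering requests.

```python
import socket
import time


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the hostname on every call, so each
        # attempt uses whatever address DNS returns at that moment.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def seconds_until_reachable(host, port, max_wait_s=120.0, interval_s=1.0, timeout_s=2.0):
    """Poll host:port until a TCP connect succeeds.

    Returns the elapsed time in seconds, or None if max_wait_s passes first.
    """
    start = time.monotonic()
    while True:
        if can_connect(host, port, timeout_s):
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait_s:
            return None
        time.sleep(interval_s)
```

Run from a pod inside the cluster (for example, alongside a broker) right after the server pod turns ready, a call such as `seconds_until_reachable("pinot-server-8.pinot-server-headless.pinot.svc.cluster.local", 8421)` would give a rough measure of the gap between readiness and actual reachability.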

Environment Context

Pinot Version: 1.5.0
Kubernetes: Amazon EKS 1.34
