Brief window of query failures immediately after a server pod finishes restarting and becomes healthy

We regularly experience a brief window of query failures immediately after a Pinot server pod transitions to a healthy state following a restart. Notably, we observe no disruption during the initial termination phase or while the pod is in the `0/1 Running` state; the failures occur specifically as the pod becomes ready to handle traffic again.

We have been seeing this issue for quite some time and would appreciate guidance on how to resolve it.

### Error Details

The failures surface as `QueryExecutionError` exceptions. The root cause in the trace is a gRPC `UNAVAILABLE: io exception` wrapping a connection timeout after 30000 ms, which suggests that the query dispatcher cannot establish a connection to the newly restarted pod.

Full Java Stack Trace:

```
QueryExecutionError: Error dispatching query: 667717545000000055 to server: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local@{8421,8442}
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.processResults(QueryDispatcher.java:689)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:643)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:584)
    at org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:219)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:532)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
    at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientCallListenerImpl.onClose(ClientCallImpl.java:547)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
    at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:720)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: connection timed out after 30000 ms: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local/100.64.62.48:8421
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:615)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:840)
```
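To help separate a network-level problem from a gRPC-level one, a probe along the following lines could be run from a broker pod while a server restarts. This is only a minimal sketch: the class `PortProbe` and its `probe` method are hypothetical names and are not part of Pinot. It checks raw TCP reachability of the port seen in the trace (8421) and does not speak gRPC, so a successful connect only shows that the port accepts connections.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of Pinot.
public class PortProbe {

    /**
     * Attempts a plain TCP connect to host:port.
     * Returns the connect latency in milliseconds, or -1 if the connect
     * fails, times out, or the host name cannot be resolved.
     */
    public static long probe(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 2) {
            System.err.println("usage: PortProbe <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        // Probe once per second for a minute and log each result with a timestamp,
        // so the output can be lined up against the pod's readiness transition.
        for (int i = 0; i < 60; i++) {
            long latencyMs = probe(host, port, 2000);
            System.out.println(System.currentTimeMillis() + " " + host + ":" + port
                    + (latencyMs >= 0 ? " reachable in " + latencyMs + " ms" : " unreachable"));
            Thread.sleep(1000);
        }
    }
}
```

For example, `java PortProbe.java pinot-server-8.pinot-server-headless.pinot.svc.cluster.local 8421` prints one line per second for a minute. If the port is reachable throughout the failure window, the timeout is more likely tied to the connection the dispatcher is already holding than to the pod's network path.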

### Environment Context

- Pinot version: 1.5.0
- EKS version: 1.34
