We regularly experience a brief window of query failures immediately after a Pinot server pod transitions to a healthy state following a restart. Notably, we observe no disruption during the initial termination phase or while the pod is in the 0/1 Running state; the failures occur specifically as the pod becomes ready to handle traffic again.
We have been seeing this issue for quite some time and would appreciate guidance on how to resolve it.
Error Details
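The failures surface as QueryExecutionError exceptions. According to the trace below, the underlying cause is a gRPC UNAVAILABLE status wrapping a java.io.IOException: the connection to the restarted pod on port 8421 timed out after 30000 ms. In other words, the query dispatcher is unable to establish a connection to the newly restarted pod.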
Full Java Stack Trace:
QueryExecutionError: Error dispatching query: 667717545000000055 to server: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local@{8421,8442}
at org.apache.pinot.query.service.dispatch.QueryDispatcher.processResults(QueryDispatcher.java:689)
at org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:643)
at org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:584)
at org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:219)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:532)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
at io.grpc.internal.ClientCallImpl$ClientCallListenerImpl.onClose(ClientCallImpl.java:547)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:720)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: connection timed out after 30000 ms: pinot-server-8.pinot-server-headless.pinot.svc.cluster.local/100.64.62.48:8421
at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:615)
at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:840)
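To help narrow this down, a probe along the following lines could be run from a broker pod during a server restart, to record whether the server's port accepts plain TCP connections at the moment the pod reports ready. This is only a minimal sketch that has not been run against our cluster; the function names (probe_tcp, poll), the timeout, and the polling interval are illustrative assumptions, and only the host and port come from the stack trace above.

```python
import socket
import time


def probe_tcp(host, port, timeout_s=2.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers DNS failures, refused connections, and connect timeouts.
        return False, time.monotonic() - start, str(exc)


def poll(host, port, attempts=30, interval_s=1.0, timeout_s=2.0):
    """Probe repeatedly and print one timestamped line per attempt."""
    results = []
    for _ in range(attempts):
        ok, elapsed, err = probe_tcp(host, port, timeout_s)
        results.append((ok, elapsed, err))
        stamp = time.strftime("%H:%M:%S")
        status = "ok" if ok else "FAIL " + err
        print(f"{stamp} {host}:{port} {status} ({elapsed:.3f}s)")
        time.sleep(interval_s)
    return results
```

For example, calling poll("pinot-server-8.pinot-server-headless.pinot.svc.cluster.local", 8421) while the pod restarts would give a timeline that can be compared against the time the pod turns 1/1. If the probe still fails after the pod is ready, that would point at networking or DNS rather than at the dispatcher itself; if it succeeds while queries fail, a stale connection or address on the broker side would be a more likely suspect.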
Environment Context
Pinot Version: 1.5.0
EKS Version: 1.34