Core issue:
Pinot is unable to safely ingest or serve queries from the remaining replicas for a prolonged period, due to what appears to be retry logic that degrades controller functionality.
Background:
Recently we saw a controller struggling to process ZKEvents as fast as they were created. This began happening after a server failed to start due to a deadlock condition and was left in that state for a few days. Controller CPU was elevated during this period, and eventually the throughput of callbacks/events was too high for processing to keep up:

It looks like the slow event processing was due to resource starvation, with Helix's ZKEventThread presumably struggling to be scheduled. From our metrics, we see a huge increase in ZK transaction volume (metric is tx log size, which is flushed every 1h):

Looking at a snapshot of the cluster during this time, it seems likely that the transactions were under the dead server's MESSAGES znode:
(CONNECTED [localhost:55179]) /pinot/pinot-<redacted>/<redacted>-cluster/INSTANCES/eb92c571-ca4e-4035-8bf0-fc09a9c40e4b> stat MESSAGES
Stat(
czxid=0x20000098a
mzxid=0x20000098a
ctime=1731131435565
mtime=1731131435565
version=0
cversion=92797988
aversion=0
ephemeralOwner=0x0
dataLength=0
numChildren=3348
pzxid=0x30f152e04
)
For reference, other servers in this cluster have a cversion of roughly 200-300k. However, when looking at the messages themselves, I see that the message znodes are created and then left unmodified for a long time; it is not yet clear to me which child znodes are being modified.
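To narrow this down, a quick check is to stat each instance's MESSAGES znode and compare cversion (the total count of child creates/deletes) against numChildren; a huge cversion with relatively few children would point to heavy message churn on that instance. A minimal sketch using the plain ZooKeeper client; the connect string and INSTANCES path below are placeholders, not our real values:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class MessagesZnodeScan {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and cluster path; substitute the real values.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        String instancesPath = "/pinot/pinot-cluster/INSTANCES";

        for (String instance : zk.getChildren(instancesPath, false)) {
            String messagesPath = instancesPath + "/" + instance + "/MESSAGES";
            Stat stat = zk.exists(messagesPath, false);
            if (stat == null) {
                continue;
            }
            // cversion counts every child create/delete since the znode was created,
            // so a very large cversion with few children implies heavy message churn.
            System.out.printf("%s cversion=%d numChildren=%d%n",
                instance, stat.getCversion(), stat.getNumChildren());
        }
        zk.close();
    }
}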
Another way to phrase the issue: messages continue to generate load on the controller and ZK even after they have failed:

One note about cluster/table setup: we use minion for upsert compaction, which generates a lot more messages than is typical for a realtime table of this size.
Has anyone seen something similar? I haven't yet walked through the relevant Helix code. The end goal of raising this issue is to understand how we can prevent a dead server from causing such a large load increase on the controller and ZK.
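As a possible stopgap (not yet validated on our side), disabling the dead instance in Helix should stop the controller from generating new state-transition messages for it while the root cause is investigated. A minimal sketch using the standard Helix admin API; the ZK address and cluster name are placeholders, and the instance id is the dead server from the stat output above:

import org.apache.helix.manager.zk.ZKHelixAdmin;

public class DisableDeadInstance {
    public static void main(String[] args) {
        // Placeholder ZK address and cluster name; replace with the real values.
        String zkAddress = "localhost:2181";
        String clusterName = "pinot-cluster";
        String deadInstance = "eb92c571-ca4e-4035-8bf0-fc09a9c40e4b";

        ZKHelixAdmin admin = new ZKHelixAdmin(zkAddress);
        try {
            // Disabling the instance should stop the controller from scheduling new
            // state-transition messages to it while the deadlock is being debugged.
            admin.enableInstance(clusterName, deadInstance, false);
        } finally {
            admin.close();
        }
    }
}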