Core issue:
Pinot is unable to safely ingest or serve queries from the remaining replicas for a prolonged period, due to what appears to be retry logic that degrades controller functionality.
Background:
Recently we saw a controller struggling to process ZKEvents as fast as they were created. This began happening after a server failed to start due to a deadlock condition and was left in that state for a few days. Controller CPU was elevated during this period, and eventually the throughput of callbacks/events was too high for processing to keep up:

It looks like the slow event processing was due to resource starvation, with Helix's ZKEventThread presumably struggling to be scheduled. From our metrics, we see a huge increase in ZK transaction volume (metric is tx log size, which is flushed every 1h):

Looking at a snapshot of the cluster during this time, it seems likely that the transactions were under the dead server's MESSAGES znode:
(CONNECTED [localhost:55179]) /pinot/pinot-<redacted>/<redacted>-cluster/INSTANCES/eb92c571-ca4e-4035-8bf0-fc09a9c40e4b> stat MESSAGES
Stat(
czxid=0x20000098a
mzxid=0x20000098a
ctime=1731131435565
mtime=1731131435565
version=0
cversion=92797988
aversion=0
ephemeralOwner=0x0
dataLength=0
numChildren=3348
pzxid=0x30f152e04
)
For reference, other servers in this cluster have a cversion of roughly 200-300k. However, when looking at the messages themselves, I see that the message znodes are created and then left unmodified for a long time; it is not yet clear to me which child znodes are being modified.
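To narrow this down, a quick check is to stat each instance's MESSAGES znode and compare cversion (the total count of child creates/deletes) against numChildren; a huge cversion with relatively few children would point to heavy message churn on that instance. A minimal sketch using the plain ZooKeeper client; the connect string and INSTANCES path below are placeholders, not our real values:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class MessagesZnodeScan {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and cluster path; substitute the real values.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        String instancesPath = "/pinot/pinot-cluster/INSTANCES";

        for (String instance : zk.getChildren(instancesPath, false)) {
            String messagesPath = instancesPath + "/" + instance + "/MESSAGES";
            Stat stat = zk.exists(messagesPath, false);
            if (stat == null) {
                continue;
            }
            // cversion counts every child create/delete since the znode was created,
            // so a very large cversion with few children implies heavy message churn.
            System.out.printf("%s cversion=%d numChildren=%d%n",
                instance, stat.getCversion(), stat.getNumChildren());
        }
        zk.close();
    }
}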
Another way to phrase the issue: messages continue to generate load on the controller and ZK even after they have failed:

One note about cluster/table setup: we use minion for upsert compaction, which generates a lot more messages than is typical for a realtime table of this size.
Has anyone seen something similar? I haven't yet walked through the relevant Helix code. The end goal of raising this issue is to understand how we can prevent a dead server from causing such a large load increase on the controller and ZK.
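As a possible stopgap (not yet validated on our side), disabling the dead instance in Helix should stop the controller from generating new state-transition messages for it while the root cause is investigated. A minimal sketch using the standard Helix admin API; the ZK address and cluster name are placeholders, and the instance id is the dead server from the stat output above:

import org.apache.helix.manager.zk.ZKHelixAdmin;

public class DisableDeadInstance {
    public static void main(String[] args) {
        // Placeholder ZK address and cluster name; replace with the real values.
        String zkAddress = "localhost:2181";
        String clusterName = "pinot-cluster";
        String deadInstance = "eb92c571-ca4e-4035-8bf0-fc09a9c40e4b";

        ZKHelixAdmin admin = new ZKHelixAdmin(zkAddress);
        try {
            // Disabling the instance should stop the controller from scheduling new
            // state-transition messages to it while the deadlock is being debugged.
            admin.enableInstance(clusterName, deadInstance, false);
        } finally {
            admin.close();
        }
    }
}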