Count the connection failure as the condition of quarantine #4727

Merged: lhotari merged 4 commits into apache:master from zymap:count-connection-failure-as-quarantine-condition on Mar 19, 2026

@zymap commented on Mar 12, 2026


Motivation

Currently, the bookie client quarantine mechanism is triggered primarily by read and write error responses from Bookies. However, in multi-region deployments, a common failure mode is a network partition or DNS resolution failure at the region level.

In such scenarios:

  1. A Bookie remains registered in ZooKeeper (it can still heartbeat to its local ZK observer).
  2. The Client (Broker) cannot resolve the Bookie's IP or establish a TCP connection.
  3. The EnsemblePlacementPolicy (especially RegionAwareEnsemblePlacementPolicy) sees the Bookie as "Available" and repeatedly selects it to satisfy minRack or E/Qw constraints.
  4. The LedgerHandle fails to write because it cannot initialize a connection handle, triggering an Ensemble Change.
  5. Because the connection failure does not trigger a quarantine, the placement policy picks the same problematic Bookie again in the next iteration.

This creates an infinite Ensemble Change loop, causing the Ledger write to hang indefinitely and bloating the Ledger metadata in ZooKeeper with thousands of segments.
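
To make the loop concrete, here is a minimal, self-contained sketch of the scenario above. It is a toy model, not the code in this PR: `ConnectionFailureTracker`, `ensembleChangesUntilSuccess`, the threshold of 3, and the "pick the first non-quarantined bookie" rule are all assumptions made for illustration, and none of them are BookKeeper APIs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the ensemble-change loop; all names here are hypothetical. */
class QuarantineLoopSketch {

    /** Hypothetical tracker: counts connection failures and quarantines at a threshold. */
    static final class ConnectionFailureTracker {
        private final int threshold;
        private final Map<String, Integer> failures = new HashMap<>();
        private final Set<String> quarantined = new HashSet<>();

        ConnectionFailureTracker(int threshold) {
            this.threshold = threshold;
        }

        void recordConnectionFailure(String bookie) {
            int count = failures.merge(bookie, 1, Integer::sum);
            if (count >= threshold) {
                quarantined.add(bookie);
            }
        }

        boolean isQuarantined(String bookie) {
            return quarantined.contains(bookie);
        }
    }

    /**
     * Simulates repeated ensemble changes. Returns how many changes happened
     * before a reachable bookie was picked, or maxAttempts if that never happened.
     * A null tracker models the behavior without connection-failure quarantine.
     */
    static int ensembleChangesUntilSuccess(List<String> registered, Set<String> unreachable,
                                           ConnectionFailureTracker tracker, int maxAttempts) {
        for (int changes = 0; changes < maxAttempts; changes++) {
            // Simplified placement: the first registered bookie that is not quarantined.
            String picked = null;
            for (String bookie : registered) {
                if (tracker == null || !tracker.isQuarantined(bookie)) {
                    picked = bookie;
                    break;
                }
            }
            if (picked == null) {
                return maxAttempts; // every registered bookie is quarantined
            }
            if (!unreachable.contains(picked)) {
                return changes; // connection succeeds, the write can proceed
            }
            if (tracker != null) {
                tracker.recordConnectionFailure(picked); // count the connection failure
            }
        }
        return maxAttempts;
    }

    public static void main(String[] args) {
        List<String> registered = List.of("bookie-a", "bookie-b"); // both still registered
        Set<String> unreachable = Set.of("bookie-a");              // client cannot connect

        int without = ensembleChangesUntilSuccess(registered, unreachable, null, 1000);
        int withQuarantine = ensembleChangesUntilSuccess(
                registered, unreachable, new ConnectionFailureTracker(3), 1000);

        System.out.println("ensemble changes without quarantine: " + without);
        System.out.println("ensemble changes with quarantine: " + withQuarantine);
    }
}
```

In this toy model the run without a tracker exhausts all 1000 attempts, while the run with a threshold of 3 reaches the reachable bookie after 3 ensemble changes. The real placement policies are far more involved; the sketch only shows why counting connection failures breaks the loop.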

zymap added 2 commits March 12, 2026 14:27

@lhotari left a comment

Suggesting a new name for the config. It seems that `getBoolean` would throw an exception if the setting is missing and no default value is provided.
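
The concern can be illustrated with a small stand-in. `ToyConfig` below is a hypothetical class written for this sketch, not the BookKeeper or Commons Configuration implementation, and the key name is invented. It only mimics the behavior described in the comment, on the assumption that the single-argument lookup throws for a missing key while the two-argument overload falls back to the default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

class ConfigDefaultSketch {

    /** Invented key name, used only for this illustration. */
    static final String HYPOTHETICAL_KEY = "exampleQuarantineOnConnectionFailure";

    /** Hypothetical stand-in for a key/value configuration object. */
    static final class ToyConfig {
        private final Map<String, String> values = new HashMap<>();

        void setProperty(String key, String value) {
            values.put(key, value);
        }

        /** Throws when the key is missing (assumed behavior, per the review comment). */
        boolean getBoolean(String key) {
            String raw = values.get(key);
            if (raw == null) {
                throw new NoSuchElementException("missing key: " + key);
            }
            return Boolean.parseBoolean(raw);
        }

        /** Falls back to the supplied default when the key is missing. */
        boolean getBoolean(String key, boolean defaultValue) {
            String raw = values.get(key);
            return raw == null ? defaultValue : Boolean.parseBoolean(raw);
        }
    }

    public static void main(String[] args) {
        ToyConfig conf = new ToyConfig();

        // Safe: the setting is absent, so the default is used.
        System.out.println("with default: " + conf.getBoolean(HYPOTHETICAL_KEY, false));

        // Unsafe: the setting is absent and no default is given.
        try {
            conf.getBoolean(HYPOTHETICAL_KEY);
        } catch (NoSuchElementException e) {
            System.out.println("without default: " + e.getMessage());
        }
    }
}
```

Passing an explicit default in the getter keeps existing deployments working when the new setting is absent from their configuration files.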

Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com>
@lhotari lhotari merged commit 497aa4e into apache:master Mar 19, 2026
20 checks passed
@lhotari lhotari added this to the 4.18.0 milestone Mar 19, 2026
zymap added a commit that referenced this pull request Mar 20, 2026
* Count the connection failure as the condition of quarantine

* Add configuration to control the behavior

(cherry picked from commit 497aa4e)
@zymap zymap self-assigned this Mar 20, 2026