Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] Do not close the socket if lookup failed due to LockBusyException #21993

Merged

Conversation

BewareMyPower
Copy link
Contributor

Motivation

When a broker restarted, there is a case in
NamespaceService#findBrokerServiceUrl:

  1. ownershipCache.getOwnerAsync(bundle) got an empty data, then searchForCandidateBroker will be called
  2. The broker itself was elected as the candidate broker.
  3. Meanwhile, the other broker has acquired the distributed lock of the bundle, then ownershipCache.tryAcquiringOwnership will fail with
lookupFuture.completeExceptionally(new PulsarServerException(
        "Failed to acquire ownership for namespace bundle " + bundle, exception));

See apache/pulsar-client-cpp#390 for the real world case.

Then in TopicLookupBase#handleLookupError, this exception will be wrapped into a ServiceNotReady error to client.

This case happens very frequently in our production environment when a broker restarted. If there is a PulsarClient that has many producers or consumers, the connection will be closed, which results in many reconnections, which brings much pressure to the cluster.

Modifications

In handleLookupError, check the PulsarServerException and unwrap the CompletionException. If the unwrapped exception is MetadataStoreException, return the MetadataError to avoid closing the connection at client side.

Add testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle to simulate the case and verify the socket won't be closed.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…kBusyException

### Motivation

When a broker restarted, there is a case in
`NamespaceService#findBrokerServiceUrl`:
1. `ownershipCache.getOwnerAsync(bundle)` got an empty data, then
   `searchForCandidateBroker` will be called
2. The broker itself was elected as the candidate broker.
3. Meanwhile, the other broker has acquired the distributed lock of the
   bundle, then `ownershipCache.tryAcquiringOwnership` will fail with

```java
lookupFuture.completeExceptionally(new PulsarServerException(
        "Failed to acquire ownership for namespace bundle " + bundle, exception));
```

See apache/pulsar-client-cpp#390 for the real
world case.

Then in `TopicLookupBase#handleLookupError`, this exception will be
wrapped into a `ServiceNotReady` error to client.

This case happens very frequently in our production environment when a
broker restarted. If there is a `PulsarClient` that has many producers
or consumers, the connection will be closed, which results in many
reconnections, which brings much pressure to the cluster.

### Modifications

In `handleLookupError`, check the `PulsarServerException` and unwrap the
`CompletionException`. If the unwrapped exception is
`MetadataStoreException`, return the `MetadataError` to avoid closing
the connection at client side.

Add `testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle` to
simulate the case and verify the socket won't be closed.
@BewareMyPower BewareMyPower added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages area/broker release/3.0.3 release/3.1.3 release/3.2.1 labels Jan 30, 2024
@BewareMyPower BewareMyPower added this to the 3.3.0 milestone Jan 30, 2024
@BewareMyPower BewareMyPower self-assigned this Jan 30, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 30, 2024
@lhotari
Copy link
Member

lhotari commented Jan 30, 2024

@BewareMyPower Looks like a good proposal, I'll review later in more detail. One observation is that this problem is really common when the leader broker election is broken, like with the bug #21897 which was fixed by #21894 .

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1952f94) 73.63% compared to head (c4bd42f) 73.63%.
Report is 4 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21993      +/-   ##
============================================
- Coverage     73.63%   73.63%   -0.01%     
- Complexity    32473    32494      +21     
============================================
  Files          1863     1863              
  Lines        138786   138814      +28     
  Branches      15207    15216       +9     
============================================
+ Hits         102201   102216      +15     
- Misses        28687    28695       +8     
- Partials       7898     7903       +5     
Flag Coverage Δ
inttests 24.05% <0.00%> (-0.04%) ⬇️
systests 23.93% <0.00%> (-0.04%) ⬇️
unittests 72.91% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...g/apache/pulsar/broker/lookup/TopicLookupBase.java 66.29% <100.00%> (-4.17%) ⬇️

... and 58 files with indirect coverage changes

@Technoboy- Technoboy- merged commit e7c2a75 into apache:master Feb 2, 2024
54 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-service-not-ready branch February 2, 2024 02:42
Technoboy- pushed a commit that referenced this pull request Feb 5, 2024
BewareMyPower added a commit that referenced this pull request Feb 7, 2024
Technoboy- pushed a commit that referenced this pull request Feb 27, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker cherry-picked/branch-3.0 cherry-picked/branch-3.1 cherry-picked/branch-3.2 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.3 release/3.1.3 release/3.2.1 type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants