
[fix][broker] One topic can be closed multiple times concurrently #17524

Merged · 7 commits · Apr 29, 2024

Conversation

poorbarcode
Contributor

@poorbarcode poorbarcode commented Sep 7, 2022

Motivation

With the transaction feature, we send and receive messages, and at the same time execute admin API: unload namespace 1000 times. Then the problem occurs: the consumer can not receive any message, and there is no error log. After that we tried admin API: get topic stats, and the response showed that only producers are registered on the topic and no consumers, but the consumer state is Ready in the client. This means that the state of the consumer is inconsistent between the broker and the client.

Location problem

Then we found the problem: two PersistentTopic instances with the same name are registered at one broker node; the consumer is registered on one (aka topic-c) and the producer on the other (aka topic-p). At this point, when we send messages, the data flows like this:

client: producer sends a message

broker: handle cmd-send

broker: find the topic by name, it is "topic-p"

broker: find all subscriptions registered on "topic-p"

broker: found one subscription, but it has no consumers registered

broker: no need to send the message to the client

But the consumer is actually registered on the other topic, topic-c, so it can not receive any message.

Reproduce

How can two topics be registered at the same broker node?

Make transaction buffer recover, admin unload namespace, client create consumer, and client create producer execute at the same time; the process flows like this (at step 11, the problem occurs):

| Time | transaction buffer recover | admin unload namespace | client create consumer | client create producer |
|------|----------------------------|------------------------|------------------------|------------------------|
| 1 | TB recover | | | |
| 2 | TB recover failure | topic.unload | | |
| 3 | topic.close(false) | topic.close(true) | | |
| 4 | brokerService.topics.remove(topicName) | | | |
| 5 | remove topic finish | | lookup | |
| 6 | | | create topic-c | |
| 7 | | | consumer registered on topic-c | |
| 8 | | brokerService.topics.remove(topic) | | |
| 9 | | remove topic-c finish | | lookup |
| 10 | | | | create topic-p |
| 11 | | | | producer registered on topic-p |
  • Each column represents an individual process, e.g. client create consumer, client create producer.
  • Multiple processes run at the same time, and all of them affect brokerService.topics.
  • The Time column only indicates the order of the steps, not actual time.
  • The important steps are explained below:
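The end state of this interleaving can be replayed with a minimal sketch. A plain ConcurrentHashMap stands in for brokerService.topics (the real map holds futures of Topic instances), and strings stand in for the two PersistentTopic objects; all names here are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replay of steps 4-11 on one thread; "topic-c"/"topic-p"
// stand in for the two PersistentTopic instances.
class TopicRaceDemo {
    static final ConcurrentHashMap<String, String> topics = new ConcurrentHashMap<>();

    static String replay() {
        String name = "persistent://tenant/ns/tp";
        topics.remove(name);                      // step 4: first close() removes by key
        topics.put(name, "topic-c");              // step 6: consumer lookup re-creates the topic
        String consumerTopic = topics.get(name);  // step 7: consumer registers on topic-c
        topics.remove(name);                      // step 8: second close() removes by key,
                                                  // deleting topic-c instead of the old topic
        topics.put(name, "topic-p");              // step 10: producer lookup re-creates it again
        String producerTopic = topics.get(name);  // step 11: producer registers on topic-p
        return consumerTopic + "/" + producerTopic;
    }
}
```

The replay ends with the consumer holding topic-c while the map (and the producer) hold topic-p, which is exactly the split state described above.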

step-3

Even if the PersistentTopic property isClosingOrDeleting has already changed to true, close can still be executed one more time, see line-1247:

```java
public CompletableFuture<Void> close(boolean closeWithoutWaitingClientDisconnect) {
    CompletableFuture<Void> closeFuture = new CompletableFuture<>();
    lock.writeLock().lock();
    try {
        // closing managed-ledger waits until all producers/consumers/replicators get closed. Sometimes, broker
        // forcefully wants to close managed-ledger without waiting all resources to be closed.
        if (!isClosingOrDeleting || closeWithoutWaitingClientDisconnect) {
            fenceTopicToCloseOrDelete();
        } else {
            // ...
```
Whether close can proceed depends on two predicates: the topic is not already closing, or @param closeWithoutWaitingClientDisconnect is true. This means that the method topic.close can be executed reentrantly when @param closeWithoutWaitingClientDisconnect is true, and in the implementation of admin API: unload namespace the parameter closeWithoutWaitingClientDisconnect is exactly true.

```java
public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle, long timeout, TimeUnit timeoutUnit) {
    return unloadNamespaceBundle(bundle, timeout, timeoutUnit, true);
}
```
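A minimal reduction of the guard (the class and counter are hypothetical; only the predicate mirrors the excerpt above) shows that two calls with closeWithoutWaitingClientDisconnect=true both pass the fence, while repeated close(false) calls do not:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reduction of the close() guard in PersistentTopic;
// fenceCount records how many times the fence path actually runs.
class ReentrantCloseDemo {
    private boolean isClosingOrDeleting = false;
    final AtomicInteger fenceCount = new AtomicInteger();

    synchronized CompletableFuture<Void> close(boolean closeWithoutWaitingClientDisconnect) {
        if (!isClosingOrDeleting || closeWithoutWaitingClientDisconnect) {
            isClosingOrDeleting = true;
            fenceCount.incrementAndGet(); // fenceTopicToCloseOrDelete() in the real code
            // ... the topic is later removed from brokerService.topics
        }
        return CompletableFuture.completedFuture(null);
    }
}
```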

So when the transaction buffer recovery fails while admin API: unload namespace is executing, and the recovery failure happens before the unload, the topic will be removed from brokerService.topics twice.

step-4 / step-8

Because the current implementation of BrokerService.removeTopicFromCache uses map.remove(key) rather than map.remove(key, value), it can remove whatever value is currently in the map, even if it is not the desired one.
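The difference between the two forms is standard ConcurrentMap behavior, sketched here with strings in place of Topic instances (names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

// remove(key) deletes whatever mapping is present; remove(key, value)
// deletes it only if it still points at the expected value.
class RemoveSemanticsDemo {
    static boolean unconditionalRemoveDeletesNewValue() {
        ConcurrentHashMap<String, String> topics = new ConcurrentHashMap<>();
        topics.put("tp", "topic-c");   // a newer topic instance is already in the map
        topics.remove("tp");           // a stale closer still removes it
        return !topics.containsKey("tp");
    }

    static boolean conditionalRemoveKeepsNewValue() {
        ConcurrentHashMap<String, String> topics = new ConcurrentHashMap<>();
        topics.put("tp", "topic-c");
        topics.remove("tp", "old-topic"); // no-op: the mapping is not the stale instance
        return "topic-c".equals(topics.get("tp"));
    }
}
```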

To sum up, we should make these two changes:

  • Make the method topic.close non-reentrant, and also prevent reentrancy between topic.close and topic.delete.
  • Use map.remove(key, value) instead of map.remove(key) in the implementation of BrokerService.removeTopicFromCache. This change applies to both scenarios: topic.close and topic.delete.
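A sketch of the first change under illustrative names (this is not the PR's actual code): the first caller installs a shared close future, and any reentrant caller gets that same future back instead of re-running the close path:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical non-reentrant close: the close work runs at most once,
// and concurrent callers all observe the same in-flight future.
class NonReentrantCloseDemo {
    private final AtomicReference<CompletableFuture<Void>> closeFuture = new AtomicReference<>();
    final AtomicInteger closeRuns = new AtomicInteger();

    CompletableFuture<Void> close() {
        CompletableFuture<Void> candidate = new CompletableFuture<>();
        // compareAndExchange returns the witness value: null means we won the race
        CompletableFuture<Void> existing = closeFuture.compareAndExchange(null, candidate);
        if (existing != null) {
            return existing; // reentrant call: reuse the in-flight close
        }
        closeRuns.incrementAndGet(); // the real close/fence work would run here, once
        candidate.complete(null);
        return candidate;
    }
}
```

The same idea extends to topic.delete by having delete and close share the in-flight future.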

Modifications

Documentation

  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc
  • [ ] doc-complete

@poorbarcode
Contributor Author

This PR should merge into these branches:

  • branch-2.8
  • branch-2.9
  • branch-2.10
  • branch-2.11
  • master

@github-actions github-actions bot added the doc-not-needed label Sep 7, 2022
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@mattisonchao mattisonchao added the area/broker and type/bug labels Sep 20, 2022
@mattisonchao mattisonchao added this to the 2.12.0 milestone Sep 20, 2022
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Oct 21, 2022
@poorbarcode
Contributor Author

Since we will start the RC version of 3.0.0 on 2023-04-11, I will change the label/milestone of PRs that have not been merged.

  • The PR of type feature is deferred to 3.1.0
  • The PR of type fix is deferred to 3.0.1

So moving this PR to 3.0.1

@poorbarcode poorbarcode modified the milestones: 3.0.0, 3.1.0 Apr 10, 2023
@github-actions github-actions bot removed the Stale label Apr 11, 2023
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@lhotari
Member

lhotari commented Jul 5, 2023

@poorbarcode please rebase (or merge changes from master to) this PR

@Technoboy- Technoboy- modified the milestones: 3.1.0, 3.2.0 Jul 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@lhotari
Member

lhotari commented Dec 27, 2023

@poorbarcode please rebase this PR. I guess this is still relevant and in the same category as #20540 .

@poorbarcode poorbarcode force-pushed the fix/topic_repeat_registry_split_2 branch from 41010f3 to 9aee3df Compare January 9, 2024 15:43
@poorbarcode
Contributor Author

@lhotari

@poorbarcode please rebase this PR. I guess this is still relevant and in the same category as #20540 .

Rebased

Member

@lhotari lhotari left a comment


LGTM, good work!

@lhotari
Member

lhotari commented Apr 26, 2024

@poorbarcode there are some test failures, do you have a chance to check?

Contributor

@heesung-sn heesung-sn left a comment


Thanks for fixing this issue.

However, I think this Topic state management needs a serious refactoring.

I suggest defining TopicState and revisit topic state transitions in a state machine manner.

@lhotari lhotari changed the title [fix][broker]One topic can be close multiple times concurrently [fix][broker] One topic can be closed multiple times concurrently Apr 26, 2024
@poorbarcode
Contributor Author

However, I think this Topic state management needs a serious refactoring.

Agree with you

@lhotari
Member

lhotari commented Apr 26, 2024

However, I think this Topic state management needs a serious refactoring.

I suggest defining TopicState and revisit topic state transitions in a state machine manner.

@heesung-sn I agree that a state machine style would result in a more maintainable solution. We can handle that in a second step. There is urgency to address the long outstanding topic closing issues and this PR makes good progress in that area.

@lhotari
Member

lhotari commented Apr 26, 2024

@poorbarcode looks like OneWayReplicatorTest.testUnFenceTopicToReuse fails

@poorbarcode
Contributor Author

@poorbarcode looks like OneWayReplicatorTest.testUnFenceTopicToReuse fails

Sorry, I found a behavior change (before: the broker tries to unfence the topic for reuse when closing clients fails; after: this mechanism does not work), and it is difficult to fix gracefully. I will try to fix it tomorrow.

@poorbarcode poorbarcode force-pushed the fix/topic_repeat_registry_split_2 branch from 473bbd4 to 057fe1d Compare April 27, 2024 16:56
@poorbarcode
Contributor Author

@poorbarcode looks like OneWayReplicatorTest.testUnFenceTopicToReuse fails

Sorry, I found a behavior change (before: the broker tries to unfence the topic for reuse when closing clients fails; after: this mechanism does not work), and it is difficult to fix gracefully. I will try to fix it tomorrow.

Fixed. The code is ugly now, sorry. Please review it again. Thanks. @lhotari @heesung-sn

@codecov-commenter

Codecov Report

Attention: Patch coverage is 91.42857% with 6 lines in your changes missing coverage. Please review.

Project coverage is 73.93%. Comparing base (bbc6224) to head (057fe1d).
Report is 202 commits behind head on master.

Additional details and impacted files


```diff
@@             Coverage Diff              @@
##             master   #17524      +/-   ##
============================================
+ Coverage     73.57%   73.93%   +0.35%
- Complexity    32624    32640      +16
============================================
  Files          1877     1885       +8
  Lines        139502   140679    +1177
  Branches      15299    15465     +166
============================================
+ Hits         102638   104010    +1372
+ Misses        28908    28622     -286
- Partials       7956     8047      +91
```
| Flag | Coverage Δ |
|------|------------|
| inttests | 27.23% <67.14%> (+2.64%) ⬆️ |
| systests | 24.61% <52.85%> (+0.28%) ⬆️ |
| unittests | 73.21% <91.42%> (+0.37%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|-------|------------|
| ...java/org/apache/pulsar/common/util/FutureUtil.java | 77.04% <100.00%> (+2.50%) ⬆️ |
| ...sar/broker/service/persistent/PersistentTopic.java | 78.73% <89.65%> (+0.27%) ⬆️ |

... and 273 files with indirect coverage changes

@lhotari lhotari merged commit 93afd89 into apache:master Apr 29, 2024
49 of 50 checks passed
poorbarcode added a commit that referenced this pull request May 7, 2024
poorbarcode added a commit that referenced this pull request May 7, 2024
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request May 13, 2024
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request May 16, 2024