Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix the issue of topics possibly being deleted. #21704

Merged
merged 4 commits into from Dec 14, 2023

Conversation

crossoverJie
Copy link
Contributor

Fixes #21653

Motivation

Reproduce process:
#21653 (comment)

Modifications

The registerTopicPolicyListener() function was moved from PersistentTopic() to initTopicPolicy().

Flow diagram:

Current:
image

After:
image

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: crossoverJie#19

@@ -1632,6 +1631,11 @@ public CompletableFuture<Void> checkReplication() {
}

List<String> configuredClusters = topicPolicies.getReplicationClusters().get();
if (CollectionUtils.isEmpty(configuredClusters)) {
log.warn("[{}] No replication clusters configured", name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be an abnormal situation, please add some explanation here to indicate what happens here. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
In our production env, because configuredClusters is empty, it caused the Topic to be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referring to the above flow diagram, method onUpdate()->checkReplication() is called before method this.updateTopicPolicyByNamespacePolicy(policies).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe we can refine the log to let other users know about the possibility of a problem. :)

Copy link
Member

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Is it possible to add a test? Maybe running it many times can reproduce it, which is also acceptable. :)

@crossoverJie
Copy link
Contributor Author

Maybe running it many times can reproduce it

#21653 (comment)

image

Its difficult to reproduce with unit tests, manually thread.sleep() can reproduce it😂.

@codecov-commenter
Copy link

codecov-commenter commented Dec 11, 2023

Codecov Report

Merging #21704 (ca2d28c) into master (495b141) will increase coverage by 36.62%.
Report is 13 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #21704       +/-   ##
=============================================
+ Coverage     36.75%   73.37%   +36.62%     
- Complexity    12271    32767    +20496     
=============================================
  Files          1717     1893      +176     
  Lines        131197   140761     +9564     
  Branches      14339    15509     +1170     
=============================================
+ Hits          48220   103290    +55070     
+ Misses        76596    29360    -47236     
- Partials       6381     8111     +1730     
Flag Coverage Δ
inttests 24.11% <40.00%> (-0.01%) ⬇️
systests 24.65% <40.00%> (-0.13%) ⬇️
unittests 72.67% <100.00%> (+40.80%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...pulsar/broker/admin/impl/PersistentTopicsBase.java 64.89% <ø> (+54.59%) ⬆️
...sar/broker/service/persistent/PersistentTopic.java 78.82% <100.00%> (+26.74%) ⬆️

... and 1463 files with indirect coverage changes

@codelipenghui
Copy link
Contributor

@crossoverJie, can we just put a time-consuming task in the thread pool to simulate the sleep behavior?
Or maybe you can just push your code with the sleep and we can find a solution together.

It's a very critical fix, if don't have a test to protect the change, things will go wrong again in the future. Here is an example: #21686

cc @mattisonchao @Technoboy-

@Technoboy- Technoboy- merged commit 84ea1ca into apache:master Dec 14, 2023
48 checks passed
poorbarcode pushed a commit that referenced this pull request Dec 26, 2023
Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
poorbarcode pushed a commit that referenced this pull request Dec 26, 2023
Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
poorbarcode pushed a commit that referenced this pull request Dec 26, 2023
Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
poorbarcode pushed a commit that referenced this pull request Dec 26, 2023
Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jan 4, 2024
…#21704)

Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
(cherry picked from commit 49490b3)
@poorbarcode
Copy link
Contributor

poorbarcode commented Jan 5, 2024

Sorry, to clear the root cause of the issue, I left a comment:


The conditions-1 to encounter the issue:

  • You enabled replication on the namespace level and did not enable replication on the topic level.
  • You enabled topic level policies
  • The flowing events happen at the same time
    • Modify topic-level policies, such as pulsar-admin topics set-dispatch-rate. Note: Modifying namespace-level policies will not trigger the issue.
    • The target topic is loading now.

The conditions-2 to encounter the issue:

  • You enabled replication on the namespace level and did not enable replication on the topic level.
  • You enabled topic-level policies
  • Restart brokers
  • Note:
    • this case often happens when there is only 1 broker under your cluster,
    • the probability is rare if there are more than 1 brokers under your cluster. If you restart all brokers at the same time, the probability rises as only 1 broker.

Background & Summary

  1. checkCompaction will delete the topic if the attribute replicationClusters does not match the current cluster( in other words, the attribute replicationClusters does not contain broker.conf.cluster_name )
  2. topic level policy modification and namespace level policy modification will trigger a task checkCompaction.
  3. When a topic is loading, it is initialized by the following steps:
    a. Registered a listener to accept the topic level policy modification event
    b. Initialize the attribute replicationClusters
  4. TopicPoliciesService will read all topic policies and try to update topic-level policies for every topic under this broker after the Broker restart.

After section 3-a, the task checkCompaction may be triggered by a topic-level policy modification(see section 3), but at this time the replicationClustershas not been initialized. It causes thesection 2` to delete the topic.


Detail steps of the issue: conditions-1

time task: topic loading task: modifying topic policies
1 Registered the topic to accept new changes of Topic Policies; the attribute replicationClusters is empty now
2 Modify topic policies
3 Trigger a task checkReplication, and then it will try to delete the topic due to replicationClusters being empty
4 Initialize the attribute replicationClusters from Namespace level Replication settings

Detail steps of the issue: conditions-2

time task: topic loading task: TopicPoliciesService initialize after brokers restart
1 Registered the topic to accept new changes of Topic Policies; the attribute replicationClusters is empty now
2 Read all topic policies and try to update topic-level policies for every topic under this broker after the Broker restart
3 Trigger a task checkReplication, and then it will try to delete the topic due to replicationClusters being empty
4 Initialize the attribute replicationClusters from Namespace level Replication settings

When there is more than one broker under your cluster, the probability of shedding topics to a new broker(when it has just started right now) is low, so the probability is high if there is only 1 broker.


Workaround-1

Before you upgrade to the new version(which contains this fix), disable the feature topicLevelPoliciesEnabled, but it also makes all topic-level policies invalidate you set before.

Workaround-2

Before you upgrade to the new version(which contains this fix), do not restart brokers or modify any topic-level policies.

@crossoverJie
Copy link
Contributor Author

@poorbarcode
Thanks for your clarification.

But based on the background of Workaround-2, if the current cluster has the conditions to trigger this issue, you can refer to the operation here to upgrade; otherwise, it may still cause the topic to be deleted during the upgrade(or restart) process.

@poorbarcode
Copy link
Contributor

poorbarcode commented Jan 5, 2024

if the current cluster has the conditions to trigger this issue, you can refer to the #21653 (comment) to upgrade; otherwise, it may still cause the topic to be deleted during the upgrade(or restart) process.

Ah, I see. Pulsar reads all topic policies and calls topic.onUpdate for every topic when it starts. @crossoverJie Thanks ❤️

@crossoverJie I updated the comment above, please review it, thanks

@crossoverJie
Copy link
Contributor Author

@poorbarcode Great job, thanks for your supplement👍.

shibd added a commit that referenced this pull request Jan 8, 2024
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jan 8, 2024
…#21704)

Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
(cherry picked from commit 49490b3)
nodece pushed a commit to nodece/pulsar that referenced this pull request Feb 23, 2024
…#21704)

Co-authored-by: Jiwe Guo <technoboy@apache.org>
(cherry picked from commit 84ea1ca)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] After upgrading to the 3.0.1, the topic was deleted!
7 participants