Skip to content

Skip topics with remote replication producers in topic inactivity check#16622

Open
lhotari wants to merge 2 commits intoapache:masterfrom
lhotari:lh-skip-topics-with-remote-replication-producers-in-topic-inactivity-check
Open

Skip topics with remote replication producers in topic inactivity check#16622
lhotari wants to merge 2 commits intoapache:masterfrom
lhotari:lh-skip-topics-with-remote-replication-producers-in-topic-inactivity-check

Conversation

@lhotari
Copy link
Member

@lhotari lhotari commented Jul 15, 2022

Motivation

Modifications

  • skip deleting a topic that has connected remote replication producers.
  • move existing check to happen before replicators are closed.

Additional context

An alternative solution is to disable delete-while-inactive for namespaces that are replicated.

pulsar-admin namespaces set-inactive-topic-policies tenant/namespace --disable-delete-while-inactive

@lhotari lhotari added type/bug The PR fixed a bug or issue reported a bug area/broker area/geo-replication doc-not-needed Your PR changes do not impact docs release/2.10.2 labels Jul 15, 2022
@lhotari lhotari self-assigned this Jul 15, 2022
@mattisonchao
Copy link
Member

Question:

  1. Do you mean we will delete the topic at the second GC check?
  2. I'm not sure If it's a breaking change for the user. Maybe we need to notice this change?
  3. This will cause the broker to have a large number of unused producers, right?

@lhotari
Copy link
Member Author

lhotari commented Jul 16, 2022

Question:

  1. Do you mean we will delete the topic at the second GC check?

@mattisonchao
please see how the existing code works in

public void checkGC() {
if (!isDeleteWhileInactive()) {
// This topic is not included in GC
return;
}
InactiveTopicDeleteMode deleteMode =
topicPolicies.getInactiveTopicPolicies().get().getInactiveTopicDeleteMode();
int maxInactiveDurationInSec = topicPolicies.getInactiveTopicPolicies().get().getMaxInactiveDurationSeconds();
if (isActive(deleteMode)) {
lastActive = System.nanoTime();
} else if (System.nanoTime() - lastActive < TimeUnit.SECONDS.toNanos(maxInactiveDurationInSec)) {
// Gc interval did not expire yet
return;
} else if (shouldTopicBeRetained()) {
// Topic activity is still within the retention period
return;
} else {
CompletableFuture<Void> replCloseFuture = new CompletableFuture<>();
if (TopicName.get(topic).isGlobal()) {
// For global namespace, close repl producers first.
// Once all repl producers are closed, we can delete the topic,
// provided no remote producers connected to the broker.
if (log.isDebugEnabled()) {
log.debug("[{}] Global topic inactive for {} seconds, closing repl producers.", topic,
maxInactiveDurationInSec);
}
closeReplProducersIfNoBacklog().thenRun(() -> {
if (hasRemoteProducers()) {
if (log.isDebugEnabled()) {
log.debug("[{}] Global topic has connected remote producers. Not a candidate for GC",
topic);
}
replCloseFuture
.completeExceptionally(new TopicBusyException("Topic has connected remote producers"));
} else {
log.info("[{}] Global topic inactive for {} seconds, closed repl producers", topic,
maxInactiveDurationInSec);
replCloseFuture.complete(null);
}
}).exceptionally(e -> {
if (log.isDebugEnabled()) {
log.debug("[{}] Global topic has replication backlog. Not a candidate for GC", topic);
}
replCloseFuture.completeExceptionally(e.getCause());
return null;
});
} else {
replCloseFuture.complete(null);
}
. The existing logic closes replicators that contain no backlog. However if there are connected replication producers, the inactivity deletion will be skipped eventually and it's possible that some replicators were already closed and they won't be restored. This is why this PR is made since it's better to make the check as the first step instead of possibly closing some replicators and then terminating the inactivity deletion.

  1. I'm not sure If it's a breaking change for the user. Maybe we need to notice this change?

This PR doesn't introduce a breaking change.

  1. This will cause the broker to have a large number of unused producers, right?

No.

@mattisonchao
Copy link
Member

mattisonchao commented Jul 16, 2022

@lhotari
I got it, thanks for your explanation.

- currently there's a problem that the replicators get shutdown in the inactivity check
- this is an improved fix for the issue described in apache#11382
…call

- the test is flaky and this change could help resolve the issue
@lhotari lhotari force-pushed the lh-skip-topics-with-remote-replication-producers-in-topic-inactivity-check branch from 65e59c8 to 76b6eae Compare August 11, 2022 07:19
@lhotari
Copy link
Member Author

lhotari commented Aug 11, 2022

@merlimat @codelipenghui Please review

@lhotari lhotari requested a review from Technoboy- August 11, 2022 07:19
@Jason918
Copy link
Contributor

@merlimat @codelipenghui @Technoboy- Ping


// For global namespace, close repl producers first.
// Once all repl producers are closed, we can delete the topic,
// provided no remote producers connected to the broker.
Copy link
Member

@mattisonchao mattisonchao Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From this comment, it seems we have to close the repl producer first. Otherwise, in a two-way replication scenario. We can get to a "deadlock" and the topic is never deleted, even if it has no backlog.
I'm not sure if it's your expected behaviour or if I'm missing some replicator close logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have two clusters, cluster A and cluster B. They enable two-way replication. So, cluster A has remote producer B, and cluster B has remote producer A.
In the previous logic. when the replicator has no more backlog (Cluster A), we will close the replicator of Cluster A. And cluster B will have no remote producer A. In the next round of GC check, if cluster B also has no backlog. At this point, cluster B's replicator will be close and cluster A will also remove cluster B's remote producer. Then in the next new round of GC, we will clean up the topics on both clusters.
But after this PR is modified, if two clusters enable two-way replication, they hold each other's remote producers. At this point, the topic will enter a kind of circular chain, and the check GC will never delete the topic until a replicator is closed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I also have the same concern

And do we have a test for covering the inactive geo topic deletion?
The CI gets passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the test to cover this behaviour If we need it.

Copy link
Contributor

@codelipenghui codelipenghui Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the test to cover this behaviour If we need it.

+1, @mattisonchao

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to have a clear behavior or the behavior we expected for how inactive topic check work with geo-replication topic.

I think the change will lead to the inactive topic will always skips the topic enabled geo-replication.


// For global namespace, close repl producers first.
// Once all repl producers are closed, we can delete the topic,
// provided no remote producers connected to the broker.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I also have the same concern

And do we have a test for covering the inactive geo topic deletion?
The CI gets passed.

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot removed the doc-not-needed Your PR changes do not impact docs label Feb 1, 2023
@github-actions
Copy link

github-actions bot commented Feb 1, 2023

@lhotari Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@Technoboy- Technoboy- added this to the 3.2.0 milestone Jul 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@lhotari lhotari modified the milestones: 4.0.0, 4.1.0 Oct 14, 2024
@coderzc coderzc modified the milestones: 4.1.0, 4.2.0 Sep 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants