Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Delete topic timeout due to NPE #21595

Merged
merged 5 commits into from
Nov 21, 2023

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Nov 20, 2023

Motivation

Background:

  • BucketDelayedDeliveryTrackerFactory was added with PIP-195: New bucket based delayed message tracker
    • BucketDelayedDeliveryTrackerFactory will be initialized if setting config setDelayedDeliveryTrackerFactoryClassName to BucketDelayedDeliveryTrackerFactory
    • Pulsar will delete the delay message indexes bucket when a subscription is unsubscribing.

Issue:
There is an NPE that causes the Future of Delay message indexes bucket deletion to be no longer complete, which leads to the topic deletion timeout. You can reproduce this issue by the test testDeletePartitionedTopicIfCursorPropsEmpty and testDeleteTopicIfCursorPropsEmpty

Issue logs:

2023-11-16T17:53:00,154+0000 [delayer-53-1] WARN  org.apache.pulsar.client.admin.internal.BaseResource - [http://xxx:8080/admin/v2/persistent/public/default/tp1-partition-0?force=true&deleteSchema=true] Failed to perform http delete request: java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Read timeout
2023-11-16T17:53:00,154+0000 [delayer-53-1] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [11c16270-bb0f-42bf-9e53-f4a94c6e98c3] Failed to delete partition persistent://public/default/tp1-partition-0
org.apache.pulsar.client.admin.PulsarAdminException: java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Read timeout
	at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:300) ~[org.apache.pulsar-pulsar-client-admin-original-3.0.1.jar:3.0.1]
	at org.apache.pulsar.client.admin.internal.BaseResource$4.failed(BaseResource.java:237) ~[org.apache.pulsar-pulsar-client-admin-original-3.0.1.jar:3.0.1]
	at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:882) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:247) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:242) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.glassfish.jersey.client.ClientRuntime.access$100(ClientRuntime.java:62) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.glassfish.jersey.client.ClientRuntime$2.lambda$failure$1(ClientRuntime.java:178) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:288) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?]
	at org.glassfish.jersey.client.ClientRuntime$2.failure(ClientRuntime.java:178) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?]
	at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$apply$1(AsyncHttpConnector.java:227) ~[org.apache.pulsar-pulsar-client-admin-original-3.0.1.jar:3.0.1]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
	at org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$9(FutureUtil.java:236) ~[org.apache.pulsar-pulsar-common-3.0.1.jar:3.0.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Read timeout
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
	at java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1576) ~[?:?]
	at java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219) ~[?:?]
	... 10 more
Caused by: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Read timeout
	at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.retryOrTimeout(...)(Unknown Source) ~[org.apache.pulsar-pulsar-client-admin-original-3.0.1.jar:3.0.1]
2023-11-16T17:53:00,160+0000 [configuration-metadata-store-13-1] INFO  org.eclipse.jetty.server.RequestLog - 172.16.167.53 - - [16/Nov/2023:17:52:29 +0000] "DELETE /admin/v2/persistent/public/default/tp1/partitions?force=true HTTP/1.1" 500 2 "-" "PostmanRuntime/7.34.0" 30256
Screenshot 2023-11-20 at 19 14 44

Modifications

Fix the NPE.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 20, 2023
@poorbarcode poorbarcode self-assigned this Nov 20, 2023
@poorbarcode poorbarcode modified the milestones: 3.0, 3.2.0 Nov 20, 2023
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/3.1.2 release/3.0.3 labels Nov 20, 2023
@@ -35,6 +37,7 @@
import org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers;
import org.apache.pulsar.common.util.FutureUtil;

@Slf4j
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -85,6 +88,9 @@ public DelayedDeliveryTracker newTracker(PersistentDispatcherMultipleConsumers d
*/
public CompletableFuture<Void> cleanResidualSnapshots(ManagedCursor cursor) {
Map<String, String> cursorProperties = cursor.getCursorProperties();
if (MapUtils.isEmpty(cursorProperties)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder which place introduce the null cursorProperties. And if so, should all the place check cursorProperties is empty or not:)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder which place introduce the null cursorProperties. And if so, should all the place check cursorProperties is empty or not:)?

Good suggestion. Fixed

@codecov-commenter
Copy link

Codecov Report

Merging #21595 (e1071be) into master (cbdb19c) will decrease coverage by 0.02%.
Report is 4 commits behind head on master.
The diff coverage is 90.32%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21595      +/-   ##
============================================
- Coverage     73.27%   73.26%   -0.02%     
- Complexity    32694    32703       +9     
============================================
  Files          1892     1892              
  Lines        140707   140682      -25     
  Branches      15483    15493      +10     
============================================
- Hits         103098   103065      -33     
- Misses        29496    29508      +12     
+ Partials       8113     8109       -4     
Flag Coverage Δ
inttests 24.09% <12.90%> (-0.05%) ⬇️
systests 24.66% <12.90%> (+0.01%) ⬆️
unittests 72.59% <90.32%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 79.07% <100.00%> (-0.12%) ⬇️
...org/apache/pulsar/broker/ServiceConfiguration.java 99.38% <100.00%> (+<0.01%) ⬆️
...r/delayed/BucketDelayedDeliveryTrackerFactory.java 91.42% <100.00%> (+0.51%) ⬆️
...r/delayed/bucket/BucketDelayedDeliveryTracker.java 83.62% <100.00%> (-0.88%) ⬇️
...g/apache/pulsar/client/impl/RawBatchConverter.java 93.75% <87.50%> (+0.24%) ⬆️
...rg/apache/pulsar/compaction/TwoPhaseCompactor.java 73.83% <85.71%> (+0.15%) ⬆️

... and 71 files with indirect coverage changes

@poorbarcode poorbarcode merged commit b2f2b53 into apache:master Nov 21, 2023
48 checks passed
poorbarcode added a commit that referenced this pull request Nov 23, 2023
### Issue:
There is an NPE that causes the Future of Delay message indexes bucket deletion to be no longer complete, which leads to the topic deletion timeout. You can reproduce this issue by the test `testDeletePartitionedTopicIfCursorPropsEmpty` and `testDeleteTopicIfCursorPropsEmpty`

### Modifications
Fix the NPE.

(cherry picked from commit b2f2b53)
poorbarcode added a commit that referenced this pull request Nov 23, 2023
### Issue:
There is an NPE that causes the Future of Delay message indexes bucket deletion to be no longer complete, which leads to the topic deletion timeout. You can reproduce this issue by the test `testDeletePartitionedTopicIfCursorPropsEmpty` and `testDeleteTopicIfCursorPropsEmpty`

### Modifications
Fix the NPE.

(cherry picked from commit b2f2b53)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
### Issue:
There is an NPE that causes the Future of Delay message indexes bucket deletion to be no longer complete, which leads to the topic deletion timeout. You can reproduce this issue by the test `testDeletePartitionedTopicIfCursorPropsEmpty` and `testDeleteTopicIfCursorPropsEmpty`

### Modifications
Fix the NPE.

(cherry picked from commit b2f2b53)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
### Issue:
There is an NPE that causes the Future of Delay message indexes bucket deletion to be no longer complete, which leads to the topic deletion timeout. You can reproduce this issue by the test `testDeletePartitionedTopicIfCursorPropsEmpty` and `testDeleteTopicIfCursorPropsEmpty`

### Modifications
Fix the NPE.

(cherry picked from commit b2f2b53)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.3 release/3.1.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants