[pulsar-broker] Fix bug that message delivery stops after resetting cursor for failover subscription #5185

Merged
merged 2 commits into from
Sep 16, 2019

Conversation

massakam
Contributor

Motivation

Resetting the cursor for a subscription in Failover mode may cause message delivery to stop. This can be reproduced with the following procedure:

  1. Connect multiple consumers to a subscription in Failover mode
  2. Reset the subscription cursor to a past position
  3. Close some consumers
  4. The remaining consumers may not receive new messages from the topic

At this point, the subscription stats still report an already-closed consumer as the active consumer:

"subscriptions" : {
  "sub1" : {
    "msgRateOut" : 0.0,
    "msgThroughputOut" : 0.0,
    "msgRateRedeliver" : 0.0,
    "msgBacklog" : 57604,
    "blockedSubscriptionOnUnackedMsgs" : false,
    "unackedMessages" : 0,
    "type" : "Failover",
    "activeConsumerName" : "04b6c", // This consumer is already closed!
    "msgRateExpired" : 0.0,
    "consumers" : [ {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "consumerName" : "06317b",
      "availablePermits" : 564,
      "unackedMessages" : 0,
      "blockedConsumerOnUnackedMsgs" : false,
      "metadata" : { },
      "connectedSince" : "2019-09-11T18:56:25.413+09:00",
      "clientVersion" : "2.3.2",
      "address" : "/xxx.xxx.xxx.xxx:36968"
    }, {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "consumerName" : "37edc",
      "availablePermits" : 1000,
      "unackedMessages" : 0,
      "blockedConsumerOnUnackedMsgs" : false,
      "metadata" : { },
      "connectedSince" : "2019-09-11T18:56:27.77+09:00",
      "clientVersion" : "2.3.2",
      "address" : "/xxx.xxx.xxx.xxx:38392"
    }, {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "consumerName" : "822f0",
      "availablePermits" : 1000,
      "unackedMessages" : 0,
      "blockedConsumerOnUnackedMsgs" : false,
      "metadata" : { },
      "connectedSince" : "2019-09-11T18:56:27.769+09:00",
      "clientVersion" : "2.3.2",
      "address" : "/xxx.xxx.xxx.xxx:38380"
    }, {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "consumerName" : "b91282",
      "availablePermits" : 1000,
      "unackedMessages" : 0,
      "blockedConsumerOnUnackedMsgs" : false,
      "metadata" : { },
      "connectedSince" : "2019-09-11T18:56:25.413+09:00",
      "clientVersion" : "2.3.2",
      "address" : "/xxx.xxx.xxx.xxx:38408"
    } ]
  }
},
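The symptom in these stats is that activeConsumerName does not match any entry in consumers. A minimal sketch of that check, assuming the names have already been extracted from the stats JSON; the class and method names here are hypothetical and not part of the Pulsar API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pulsar.
class StaleActiveConsumerCheck {
    /** Returns true when the reported active consumer is not among the connected consumers. */
    static boolean isActiveConsumerStale(String activeConsumerName, List<String> connectedConsumerNames) {
        return activeConsumerName != null && !connectedConsumerNames.contains(activeConsumerName);
    }

    public static void main(String[] args) {
        // Consumer names copied from the stats output above.
        List<String> connected = Arrays.asList("06317b", "37edc", "822f0", "b91282");
        System.out.println(isActiveConsumerStale("04b6c", connected)); // prints: true
        System.out.println(isActiveConsumerStale("37edc", connected)); // prints: false
    }
}
```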

This is because AbstractDispatcherSingleActiveConsumer#closeFuture is not null, so pickAndScheduleActiveConsumer() is not called and the active consumer does not change.

if (closeFuture == null && !consumers.isEmpty()) {
    pickAndScheduleActiveConsumer();
    return;
}

closeFuture becomes non-null when disconnectAllConsumers() is called, and once a value has been assigned it is never reset to null.

public synchronized CompletableFuture<Void> disconnectAllConsumers() {
    closeFuture = new CompletableFuture<>();

disconnectAllConsumers() is called when unloading or deleting a topic, as well as when resetting the cursor.
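The sequence described above can be reproduced with a small stand-alone model. This is a hedged sketch, not the Pulsar source: the class StuckDispatcherModel is hypothetical, consumers are represented by plain names, and picking the first connected consumer as active is an assumption made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of the dispatcher state; not the actual Pulsar class.
class StuckDispatcherModel {
    private final List<String> consumers = new ArrayList<>();
    private String activeConsumer;
    private CompletableFuture<Void> closeFuture; // null until disconnectAllConsumers() runs

    synchronized void addConsumer(String name) {
        consumers.add(name);
        activeConsumer = consumers.get(0); // assumption: first connected consumer wins
    }

    synchronized void removeConsumer(String name) {
        consumers.remove(name);
        // Same guard as the quoted snippet: skipped once closeFuture has been set.
        if (closeFuture == null && !consumers.isEmpty()) {
            activeConsumer = consumers.get(0);
        }
    }

    synchronized CompletableFuture<Void> disconnectAllConsumers() {
        closeFuture = new CompletableFuture<>();
        // Model the disconnect by removing every consumer.
        for (String name : new ArrayList<>(consumers)) {
            removeConsumer(name);
        }
        closeFuture.complete(null);
        return closeFuture; // the field is never set back to null
    }

    synchronized String getActiveConsumer() {
        return activeConsumer;
    }

    public static void main(String[] args) {
        StuckDispatcherModel d = new StuckDispatcherModel();
        d.addConsumer("a");
        d.addConsumer("b");
        d.disconnectAllConsumers(); // e.g. triggered by a cursor reset
        d.addConsumer("a");         // consumers reconnect
        d.addConsumer("b");
        d.removeConsumer("a");      // the active consumer closes
        System.out.println(d.getActiveConsumer()); // prints: a (stale, "a" is gone)
    }
}
```

In this model the remaining consumer "b" is never promoted, which matches the stale activeConsumerName seen in the stats.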

Modifications

Added a resetCloseFuture() method to the Dispatcher classes that sets closeFuture back to null once the cursor reset has completed.
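A rough sketch of the effect of this change, using a hypothetical stand-alone class rather than the real Dispatcher classes. Only the resetCloseFuture() name and the closeFuture guard come from this PR; the list of consumer names and the "first connected consumer is active" rule are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of the fix; not the actual Pulsar classes.
class ResettableDispatcherModel {
    private final List<String> consumers = new ArrayList<>();
    private String activeConsumer;
    private CompletableFuture<Void> closeFuture;

    synchronized void addConsumer(String name) {
        consumers.add(name);
        activeConsumer = consumers.get(0); // assumption: first connected consumer wins
    }

    synchronized void removeConsumer(String name) {
        consumers.remove(name);
        // A new active consumer is picked only while closeFuture is null.
        if (closeFuture == null && !consumers.isEmpty()) {
            activeConsumer = consumers.get(0);
        }
    }

    synchronized CompletableFuture<Void> disconnectAllConsumers() {
        closeFuture = new CompletableFuture<>();
        for (String name : new ArrayList<>(consumers)) {
            removeConsumer(name);
        }
        closeFuture.complete(null);
        return closeFuture;
    }

    // The fix: called once the cursor reset has completed.
    synchronized void resetCloseFuture() {
        closeFuture = null;
    }

    synchronized String getActiveConsumer() {
        return activeConsumer;
    }

    public static void main(String[] args) {
        ResettableDispatcherModel d = new ResettableDispatcherModel();
        d.addConsumer("a");
        d.addConsumer("b");
        d.disconnectAllConsumers(); // cursor reset starts
        d.resetCloseFuture();       // cursor reset completed
        d.addConsumer("a");         // consumers reconnect
        d.addConsumer("b");
        d.removeConsumer("a");      // the active consumer closes
        System.out.println(d.getActiveConsumer()); // prints: b
    }
}
```

Resetting the field only after the cursor reset completes keeps the guard effective for topic unload and deletion, where the dispatcher is not expected to pick a new active consumer.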

@massakam massakam added the type/bug and area/broker labels Sep 12, 2019
@massakam massakam added this to the 2.4.2 milestone Sep 12, 2019
@massakam massakam self-assigned this Sep 12, 2019
@massakam
Contributor Author

rerun java8 tests
rerun integration tests

Contributor

@merlimat merlimat left a comment

👍


@sijie sijie merged commit 499069e into apache:master Sep 16, 2019
@massakam massakam deleted the fix-failover-bug branch September 17, 2019 01:48
wolfstudy pushed a commit that referenced this pull request Nov 20, 2019
…ursor for failover subscription (#5185)

(cherry picked from commit 499069e)