Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve] [ml] Persist mark deleted ops to ZK if create cursor ledger was failed #20935

Conversation

poorbarcode
Copy link
Contributor

Motivation

The progress Persist mark deleted position is like this:

  • persist to BK
  • if failed to persist to BK, try to persist to ZK

But in the current implementation: if the cursor ledger was created failed, Pulsar will not try to persist to ZK. It makes if the cursor ledger created fails, a lot of ack records can not be persisted, and we will get a lot of repeat consumption after the BK recover.

Modifications

Try to persist the mark deleted position to ZK if the cursor ledger was created failed

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Aug 6, 2023
@poorbarcode poorbarcode added the type/bug The PR fixed a bug or issue reported a bug label Aug 6, 2023
@poorbarcode poorbarcode added this to the 3.2.0 milestone Aug 6, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 6, 2023
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me.
Only add some minor comments to consolidate the test.

}
}, null);
}

void persistPositionMetaStore(MarkDeleteEntry mdEntry, final VoidCallback callback) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void persistPositionMetaStore(MarkDeleteEntry mdEntry, final VoidCallback callback) {
void persistPositionToMetaStore(MarkDeleteEntry mdEntry, final VoidCallback callback) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poorbarcode Please check here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

};

if (State.NoLedger.equals(STATE_UPDATER.get(this))) {
persistPositionMetaStore(mdEntry, cb);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the BookKeeper cluster runs into read-only mode and millions of consumers keep acknowledging the mark-delete position to brokers, we will persist the mark-delete position to the meta store without any throttle policy, will it bring high pressure or risk for the meta store? especially when we use Zookeeper as the Pulsar cluster's meta store service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the BookKeeper cluster runs into read-only mode and millions of consumers keep acknowledging the mark-delete position to brokers, we will persist the mark-delete position to the meta store without any throttle policy, will it bring high pressure or risk for the meta store? especially when we use Zookeeper as the Pulsar cluster's meta store service.

Some users may want to set BK to read-only, and stop sending messages but consume messages.

However, pulsar does not support this yet, because even if only the message is consumed, there is still some BK write operations, here are some examples:

If we want to support the above scenario "Make BK read-only and only consume messages", we should need a large design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will persist the mark-delete position to the meta store without any throttle policy

@hangc0276 @poorbarcode Actually, we have a throttling policy that can be configured by managedLedgerDefaultMarkDeleteRateLimit. For improving this case, it looks like we'd better have a separate one for the zookeeper. So that users can persist to the cursor ledger for each second but persist to zookeeper for every 30 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hangc0276 @poorbarcode Actually, we have a throttling policy that can be configured by managedLedgerDefaultMarkDeleteRateLimit. For improving this case, it looks like we'd better have a separate one for the zookeeper. So that users can persist to the cursor ledger for each second but persist to zookeeper for every 30 seconds.

Good idea. Since this solution should change broker.conf and PersistencePolicies, maybe a PIP is needed.

In the current PR, I've simpler this behavior to the following: persist to ZK only hits these two scenarios

  • the cursor ledger is not null, persist to ZK after persisting to BK failed(the original behavior before this PR)
  • the cursor ledger is null, and the cursor.markDeletedPosition==managedLedger.LAC. Since the ledger can not be created, no entry can be written now, so the variable managedLedger.LAC will not increase anymore. For per subscription, persist to ZK will only be executed once.

And in the future, we submit a PIP to add a rater limiter to further improve the behavior.

@codelipenghui @hangc0276 What do you think abort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looks good to me. And users can also change managedLedgerDefaultMarkDeleteRateLimit to reduce the impact.

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@Technoboy- Technoboy- closed this Aug 14, 2023
@Technoboy- Technoboy- reopened this Aug 14, 2023
@poorbarcode poorbarcode force-pushed the improve/create_cursor_ledger_fail_then_persist_to_ZK branch from 8460bb0 to 7c7e879 Compare August 14, 2023 03:32
@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

};

if (State.NoLedger.equals(STATE_UPDATER.get(this))) {
if (mdEntry.newPosition.equals(ledger.getLastConfirmedEntry())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the compare logic as we discussed to cover the case that LAC's entryId is -1 and add a test to cover it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@codelipenghui codelipenghui added the category/functionality Some functions are not working such as get errors label Aug 21, 2023
@poorbarcode poorbarcode force-pushed the improve/create_cursor_ledger_fail_then_persist_to_ZK branch from c6b4554 to ebb705a Compare August 22, 2023 05:04
@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@poorbarcode poorbarcode force-pushed the improve/create_cursor_ledger_fail_then_persist_to_ZK branch from 4dd88ef to 9461ef0 Compare August 30, 2023 14:50
@poorbarcode
Copy link
Contributor Author

Rebase master

@codecov-commenter
Copy link

Codecov Report

Merging #20935 (c29f906) into master (dab5b2f) will increase coverage by 0.11%.
Report is 4 commits behind head on master.
The diff coverage is 86.84%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #20935      +/-   ##
============================================
+ Coverage     73.03%   73.14%   +0.11%     
- Complexity    32363    32392      +29     
============================================
  Files          1887     1887              
  Lines        139860   139879      +19     
  Branches      15384    15389       +5     
============================================
+ Hits         102140   102316     +176     
+ Misses        29634    29477     -157     
  Partials       8086     8086              
Flag Coverage Δ
inttests 24.08% <5.26%> (?)
systests 25.11% <5.26%> (+<0.01%) ⬆️
unittests 72.44% <86.84%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.22% <83.33%> (-0.08%) ⬇️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 79.23% <88.46%> (+0.25%) ⬆️

... and 77 files with indirect coverage changes

@poorbarcode poorbarcode merged commit 843b830 into apache:master Aug 31, 2023
46 checks passed
poorbarcode added a commit that referenced this pull request Aug 31, 2023
… was failed (#20935)

The progress Persist mark deleted position is like this:
- persist to BK
- If failed to persist to BK, try to persist to ZK

But in the current implementation: if the cursor ledger was created failed, Pulsar will not try to persist to ZK. It makes if the cursor ledger created fails, a lot of ack records can not be persisted, and we will get a lot of repeat consumption after the BK recover.

Modifications: Try to persist the mark deleted position to ZK if the cursor ledger was created failed
(cherry picked from commit 843b830)
Technoboy- pushed a commit that referenced this pull request Sep 5, 2023
… was failed (#20935)

The progress Persist mark deleted position is like this:
- persist to BK
- If failed to persist to BK, try to persist to ZK

But in the current implementation: if the cursor ledger was created failed, Pulsar will not try to persist to ZK. It makes if the cursor ledger created fails, a lot of ack records can not be persisted, and we will get a lot of repeat consumption after the BK recover.

Modifications: Try to persist the mark deleted position to ZK if the cursor ledger was created failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/functionality Some functions are not working such as get errors cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs release/3.0.2 release/3.1.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants