Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce config to skip non-recoverable data-ledger #1046

Merged
merged 1 commit into from
Jan 24, 2018

Conversation

rdhabalia
Copy link
Contributor

Motivation

Recently, we have seen that due to unexpected ledger deletion at BK, Broker was seeing NoSuchEntryException/NoLedgerException while reading data-ledger.

22:14:22.833 [BookKeeperClientWorker-20-1] WARN  o.a.b.mledger.impl.OpReadEntry       - [prop1/global/ns/persistent/topic1][sub1] read failed from ledger at position:5422597082:9745 : No such ledger exists
22:14:33.806 [BookKeeperClientWorker-20-1] WARN  o.a.b.mledger.impl.OpReadEntry       - [prop1/global/ns/persistent/topic2][sub1] read failed from ledger at position:5422670450:5149 : No such entry
:

In this case, it requires manual cleanup to delete those ledgers from managed-ledger list and performing it for large number of topics across many brokers will be very complicated. In this case, those ledgers are not recoverable so, there should be a mechanism to skip non-recoverable data-ledgers from managed-ledger list.

Modifications

  • Introduced a dynamic configuration to skip non-recoverable data-ledger
  • Skip reading non-recoverable data-ledger

Result

Broker will be more resilient when it is not able to read non-recoverable data-ledger and it can avoid manual cleanup in cluster.

@rdhabalia rdhabalia added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages type/feature The PR added a new feature or issue requested a new feature labels Jan 9, 2018
@rdhabalia rdhabalia added this to the 1.22.0-incubating milestone Jan 9, 2018
@rdhabalia rdhabalia self-assigned this Jan 9, 2018
Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple of suggestions

@@ -51,6 +51,7 @@
private double throttleMarkDelete = 0;
private long retentionTimeMs = 0;
private long retentionSizeInMB = 0;
private boolean skipNonRecoverableLedger;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about naming it automaticallySkipNonRecoverableData?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or abbreviated into auto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will change it.

@@ -1956,7 +1960,7 @@ boolean ledgerExists(long ledgerId) {
return ledgers.get(ledgerId) != null;
}

long getNextValidLedger(long ledgerId) {
Long getNextValidLedger(long ledgerId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why changing type? Do we need null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, because long was causing NPE if we try to get next ledger after very last ledger in the list. at that time ledgers.ceilingKey(ledgerId + 1) returns null and it gives NPE if we try to cast it in long so, changed the data type to Long and handling at caller.

switch (rc) {
case BKException.Code.NoSuchLedgerExistsException:
case BKException.Code.ReadException:
case BKException.Code.LedgerRecoveryException:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be temporary, if multiple bookies are down and recovery is now failing, though it could succeed later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may make sense to consider as isBkErrorNotRecoverable for ManagedCursor and not for ManagedLedger.? I think we can create two separate methods isBkErrorNotRecoverable for ManagedCursor and ManagedLedger as they may need different flexibility level?

public static boolean isBkErrorNotRecoverable(int rc) {
switch (rc) {
case BKException.Code.NoSuchLedgerExistsException:
case BKException.Code.ReadException:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I'm not 100% sure is non-recoverable, we need to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think we can remove it. let me verify in which case we can see ReadException and we can remove it if we don't need it.

@rdhabalia
Copy link
Contributor Author

@merlimat addressed comments.

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@apache apache deleted a comment from merlimat Jan 23, 2018
@rdhabalia rdhabalia force-pushed the ml_recover branch 2 times, most recently from 334fcd2 to 2dbda65 Compare January 24, 2018 18:59
@rdhabalia rdhabalia merged commit f5a2ec7 into apache:master Jan 24, 2018
@rdhabalia rdhabalia deleted the ml_recover branch July 25, 2019 18:12
sijie pushed a commit that referenced this pull request Aug 14, 2019
…or (#4818)

### Motivation

In #1046, we have added a flag (`autoSkipNonRecoverableData`) and mechanism to recover cursor if ledger data is deleted. However, expiery-monitor doesn't use that flag and it gets stuck when it finds non-recoverable ml-error while cleaning up expired message.

### Modification
Expiry-monitor can skip non-recoverable managed-ledger exception (eg: data/ledger doesn't exist anymore) when `autoSkipNonRecoverableData` flag is enabled.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages type/feature The PR added a new feature or issue requested a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants