[pulsar-broker] Fix: handle topic loading failure due to broken schema ledger #9212

Merged
rdhabalia merged 1 commit into apache:master from schema_fail
Feb 7, 2021

rdhabalia
Contributor

Motivation

Sometimes a schema ledger gets deleted but the broker fails to clean up the schema locator in ZooKeeper, which leaves the topic unavailable and inaccessible in the broker. This mainly happens when the broker tries to delete an inactive topic and a user then reconnects a producer to that topic: the producer cannot access the topic and fails with the errors shown below.
In this case, if the ledger no longer exists in storage, the error is non-recoverable, so the broker should be resilient: it should clean up the broken schema locator and allow the topic to load again.

Client error

2020-12-24 07:27:32,221 [Executor task launch worker for task 81] ERROR org.apache.spark.executor.Executor  - Exception in task 18.0 in stage 2.0 (TID 81)
org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: java.io.IOException: No such ledger exists -  ledger=123456890 - operation=Failed to open ledger
        at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:719)
        at org.apache.pulsar.client.impl.ProducerBuilderImpl.create(ProducerBuilderImpl.java:93)
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: java.io.IOException: No such ledger exists -  ledger=123456890 - operation=Failed to open ledger
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at org.apache.pulsar.client.impl.ProducerBuilderImpl.create(ProducerBuilderImpl.java:91)
        ... 14 more

Broker error

01:33:48.674 [ForkJoinPool.commonPool-worker-9] WARN  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://my-prop/cluster/ns/t1] Inactive topic deletion failed
java.util.concurrent.CompletionException: java.io.IOException: No such ledger exists -  ledger=123456790 - operation=Failed to open ledger
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1063) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$openLedger$39(BookkeeperSchemaStorage.java:567) ~[pulsar-broker.jar:]
        at org.apache.bookkeeper.client.LedgerOpenOp.openComplete(LedgerOpenOp.java:232) ~[bookkeeper-server-4.9.4.8-yahoo.jar:4.9.4.8-yahoo]
        at org.apache.bookkeeper.client.LedgerOpenOp.lambda$initiate$0(LedgerOpenOp.java:117) ~[bookkeeper-server-4.9.4.8-yahoo.jar:4.9.4.8-yahoo]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
        at org.apache.bookkeeper.meta.AbstractZkLedgerManager$3.processResult(AbstractZkLedgerManager.java:396) ~[bookkeeper-server-4.9.4.8-yahoo.jar:4.9.4.8-yahoo]
        at org.apache.bookkeeper.zookeeper.ZooKeeperClient$19$1.processResult(ZooKeeperClient.java:994) ~[bookkeeper-server-4.9.4.8-yahoo.jar:4.9.4.8-yahoo]
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:575) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
Caused by: java.io.IOException: No such ledger exists -  ledger=571076459 - operation=Failed to open ledger
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:656) ~[pulsar-broker.jar:]
        ... 11 more

Modification

  • Handle a broken schema storage locator: clean it up and load the topic successfully.
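
The recovery idea described above can be sketched roughly as follows. This is not the code from this PR: the class, the `LedgerMissingException` type, and the two in-memory maps are hypothetical stand-ins for the schema locator in ZooKeeper and the ledgers in BookKeeper. It only illustrates the control flow: a missing ledger is treated as non-recoverable, the locator is dropped, and the load completes instead of failing.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;

public class BrokenLocatorRecoverySketch {

    /** Hypothetical marker for "the ledger is gone and will not come back". */
    static class LedgerMissingException extends IOException {
        LedgerMissingException(long ledgerId) {
            super("No such ledger exists - ledger=" + ledgerId);
        }
    }

    // Hypothetical stand-in for the metadata store: schemaId -> ledger id.
    final Map<String, Long> locators = new ConcurrentHashMap<>();
    // Hypothetical stand-in for ledger storage: ledger id -> schema payload.
    final Map<Long, String> ledgers = new ConcurrentHashMap<>();

    /** Reads a ledger, failing the future if the ledger does not exist. */
    CompletableFuture<String> readLedger(long ledgerId) {
        String data = ledgers.get(ledgerId);
        if (data == null) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(new LedgerMissingException(ledgerId));
            return failed;
        }
        return CompletableFuture.completedFuture(data);
    }

    /** Loads the schema; a locator that points at a missing ledger is removed. */
    CompletableFuture<Optional<String>> loadSchema(String schemaId) {
        Long ledgerId = locators.get(schemaId);
        if (ledgerId == null) {
            return CompletableFuture.completedFuture(Optional.empty());
        }
        return readLedger(ledgerId).handle((data, error) -> {
            if (error == null) {
                return Optional.of(data);
            }
            Throwable cause = error instanceof CompletionException ? error.getCause() : error;
            if (cause instanceof LedgerMissingException) {
                locators.remove(schemaId); // non-recoverable: drop the broken locator
                return Optional.<String>empty();
            }
            throw new CompletionException(cause); // any other error still fails the load
        });
    }

    /** Returns true if a broken locator is cleaned up and the load still completes. */
    static boolean recovers() {
        BrokenLocatorRecoverySketch sketch = new BrokenLocatorRecoverySketch();
        sketch.locators.put("t1", 42L); // ledger 42 was never stored, i.e. it is "deleted"
        Optional<String> schema = sketch.loadSchema("t1").join();
        return !schema.isPresent() && !sketch.locators.containsKey("t1");
    }

    public static void main(String[] args) {
        System.out.println("recovered: " + recovers());
    }
}
```

Only the missing-ledger case is swallowed here; every other failure is rethrown so that transient errors do not silently delete schema metadata.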

@rdhabalia
Contributor Author

/pulsarbot run-failure-checks

Contributor

@eolivelli eolivelli left a comment

Lgtm

I left one nit about a little typo in a test

Awesome work

producer.close();

String key = TopicName.get(fqtnOne).getSchemaName();
BookkeeperSchemaStorage schemaStrogate = (BookkeeperSchemaStorage) pulsar.getSchemaStorage();
Contributor

Typo: strogate?

Contributor Author

fixed it.

@@ -41,6 +41,8 @@

CompletableFuture<SchemaVersion> deleteSchemaStorage(String schemaId);

CompletableFuture<SchemaVersion> deleteSchemaStorage(String schemaId, boolean forcefully);
Member

Mark this as a default method?

Contributor Author

That will be tricky because deleteSchemaStorage(schemaId) calls deleteSchemaStorage(schemaId, false), and we can't define the default behavior of deleteSchemaStorage(schemaId, false) by calling the existing deleteSchemaStorage(schemaId), because that creates a cyclic call.
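
The cyclic call mentioned here can be reproduced with a small sketch. The `Storage` interface and `DelegatingStorage` class below are hypothetical and use `String` in place of `SchemaVersion`; they are not the real Pulsar types. The two-argument method is given a default that falls back to the one-argument method, while the implementation's one-argument method delegates to the two-argument one, so the calls bounce back and forth until the stack overflows.

```java
import java.util.concurrent.CompletableFuture;

public class DefaultMethodCycleSketch {

    /** Hypothetical stand-in for the schema storage interface under review. */
    interface Storage {
        CompletableFuture<String> deleteSchemaStorage(String schemaId);

        // Tempting default: fall back to the existing single-argument method.
        default CompletableFuture<String> deleteSchemaStorage(String schemaId, boolean forcefully) {
            return deleteSchemaStorage(schemaId);
        }
    }

    /** Implementation whose single-argument method delegates to the two-argument one. */
    static class DelegatingStorage implements Storage {
        @Override
        public CompletableFuture<String> deleteSchemaStorage(String schemaId) {
            return deleteSchemaStorage(schemaId, false); // resolves to the default above
        }
    }

    /** Returns true if the call recursed until the stack overflowed. */
    static boolean overflows(Storage storage) {
        try {
            storage.deleteSchemaStorage("my-schema", false);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("cyclic: " + overflows(new DelegatingStorage()));
    }
}
```

Keeping the two-argument method abstract, as in the diff, forces every implementation to provide the real behavior and avoids this loop.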

@@ -32,6 +32,8 @@

CompletableFuture<List<CompletableFuture<StoredSchema>>> getAll(String key);

CompletableFuture<SchemaVersion> delete(String key, boolean forcefully);
Member

make this a default method?

trimDeletedSchemaAndGetList(list);
// clean up the broken schema from zk
deleteSchemaStorage(schemaId, true).handle((sv, th) -> {
log.info("Deletion of {} {}", schemaId,
Member

What does a user need to do when they read this message? Especially if it is "Deletion of ... failed", it might be worth adding more details for the "failed" case.

Contributor Author

sure, added.
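
One way to make the "failed" case more informative is sketched below. The message wording and the `describe` and `logOutcome` helpers are hypothetical and are not the text that was actually added in this PR. The sketch relies only on `CompletableFuture.handle`, which receives either the result or the error, and builds a different message for each outcome.

```java
import java.util.concurrent.CompletableFuture;

public class DeletionLogSketch {

    /** Builds the log line for the outcome of a locator deletion (hypothetical wording). */
    static String describe(String schemaId, Throwable error) {
        if (error == null) {
            return "Deletion of broken schema locator " + schemaId + " succeeded";
        }
        return "Deletion of broken schema locator " + schemaId + " failed: " + error.getMessage()
                + "; topic loading may fail again until the locator is cleaned up";
    }

    /** Maps the deletion future to a message, whether it succeeded or failed. */
    static CompletableFuture<String> logOutcome(String schemaId, CompletableFuture<Void> deletion) {
        return deletion.handle((ignored, error) -> describe(schemaId, error));
    }

    public static void main(String[] args) {
        CompletableFuture<Void> ok = CompletableFuture.completedFuture(null);
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("zk down"));

        System.out.println(logOutcome("my-schema", ok).join());
        System.out.println(logOutcome("my-schema", failed).join());
    }
}
```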

@rdhabalia
Contributor Author

/pulsarbot run-failure-checks

@rdhabalia
Contributor Author

/pulsarbot run-failure-checks

…a ledger

add more error log

fix list assignment
@rdhabalia rdhabalia merged commit 3d5d6f6 into apache:master Feb 7, 2021
@rdhabalia rdhabalia deleted the schema_fail branch February 7, 2021 04:58
@codelipenghui codelipenghui added the cherry-picked/branch-2.7 Archived: 2.7 is end of life label Feb 18, 2021
codelipenghui pushed a commit that referenced this pull request Feb 18, 2021
…a ledger (#9212)

add more error log

fix list assignment

(cherry picked from commit 3d5d6f6)
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021