Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix broker is not able to load topic with broken schema ledger #19882

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rdhabalia
Copy link
Contributor

@rdhabalia rdhabalia commented Mar 21, 2023

Motivation

If due to any operation reason if schema ledger metadata gets deleted then broker is not able to handle that non-recoverable error and not able to load the topic. So, if one of the schema ledger is broken/deleted or ledger metadata is missing , then broker logs below exception because ledger without metadata gives NoSuchLedgerExistsOnMetadataServerException which is not considered as non-recoverable error.

2023-03-21T10:40:17,930-0700 [main-EventThread] ERROR org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [prop1/ns1/t2] Put schema failed
java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: No such ledger exists on Metadata Server -  ledger=8 - operation=Failed to open ledger
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1097) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl.lambda$trimDeletedSchemaAndGetList$30(SchemaRegistryServiceImpl.java:553) ~[pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$openLedger$41(BookkeeperSchemaStorage.java:603) ~[pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.bookkeeper.client.LedgerOpenOp.openComplete(LedgerOpenOp.java:267) ~[bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.client.LedgerOpenOp.lambda$initiate$0(LedgerOpenOp.java:118) ~[bookkeeper-server-4.15.4.jar:4.15.4]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:974) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.metadata.bookkeeper.PulsarLedgerManager.lambda$readLedgerMetadata$2(PulsarLedgerManager.java:208) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:718) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetResult(ZKMetadataStore.java:269) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$5(ZKMetadataStore.java:219) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:489) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]

Modifications

Handle NoSuchLedgerExistsOnMetadataServerException non-recoverable error and successfully load such topic with broken schema ledger.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@poorbarcode
Copy link
Contributor

/pulsarbot rerun-failure-checks

@cbornet cbornet closed this Apr 11, 2023
@cbornet cbornet reopened this Apr 11, 2023
@cbornet cbornet modified the milestones: 3.0.0, 3.1.0 Apr 11, 2023
@Denovo1998
Copy link
Contributor

The PR of type fix should be deferred to 3.0.1.

@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2023

Codecov Report

Merging #19882 (1bd2776) into master (1e44ba1) will increase coverage by 45.06%.
The diff coverage is 74.72%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #19882       +/-   ##
=============================================
+ Coverage     24.42%   69.48%   +45.06%     
- Complexity      291     3652     +3361     
=============================================
  Files          1603     1873      +270     
  Lines        124343   152550    +28207     
  Branches      13571    18653     +5082     
=============================================
+ Hits          30369   106004    +75635     
+ Misses        89490    37612    -51878     
- Partials       4484     8934     +4450     
Flag Coverage Δ
inttests 25.77% <23.07%> (+1.35%) ⬆️
systests 26.66% <21.97%> (?)
unittests 71.75% <82.27%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...extensions/reporter/TopBundleLoadDataReporter.java 71.64% <0.00%> (+71.64%) ⬆️
...broker/service/schema/BookkeeperSchemaStorage.java 78.72% <ø> (+17.32%) ⬆️
...nsions/policies/AntiAffinityGroupPolicyHelper.java 37.77% <38.46%> (+37.77%) ⬆️
...metadata/coordination/impl/LeaderElectionImpl.java 73.14% <66.66%> (+20.24%) ⬆️
...ar/metadata/coordination/impl/LockManagerImpl.java 89.47% <71.42%> (+31.57%) ⬆️
...java/org/apache/pulsar/common/util/FutureUtil.java 80.55% <76.92%> (+50.02%) ⬆️
...dbalance/extensions/scheduler/TransferShedder.java 66.40% <80.00%> (+66.40%) ⬆️
...ker/loadbalance/extensions/models/TopKBundles.java 90.78% <87.17%> (+90.78%) ⬆️
...org/apache/pulsar/broker/ServiceConfiguration.java 98.66% <100.00%> (+0.76%) ⬆️

... and 1583 files with indirect coverage changes

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions:

  1. Should we make it will only work if enabled autoSkipNonRecoverableData?
  2. This PR is the same as [improve] [broker] make consumer and producer can work even if schema ledger lost #18010. I do not expect to use 18010 instead current one, but suggestion copying the test which In [improve] [broker] make consumer and producer can work even if schema ledger lost #18010 into the current PR, and then I can close PR [improve] [broker] make consumer and producer can work even if schema ledger lost #18010

@Denovo1998
Copy link
Contributor

On the first question, I asked the same question in PR 18010. Whether it should be at the time of enable autoSkipNonRecoverableData, to get NoSuchLedgerExistsOnMetadataServerException as a recoverable error.
#18010 (comment)

@codelipenghui
Copy link
Contributor

@rdhabalia Could you please take a look at @poorbarcode 's comment?

@github-actions
Copy link

github-actions bot commented Jun 8, 2023

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jun 8, 2023
@Technoboy- Technoboy- modified the milestones: 3.1.0, 3.2.0 Jul 31, 2023
@Technoboy- Technoboy- removed this from the 3.2.0 milestone Dec 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants