Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix producer stuck issue due to NPE thrown when creating a new ledger #7401

Merged
merged 1 commit into from
Jun 30, 2020

Conversation

sijie
Copy link
Member

@sijie sijie commented Jun 30, 2020

Motivation

NPE can be thrown when creating a ledger because the network address is unresolvable. If NPE is thrown before adding the timeout task, the timeout mechanism doesn't work.

Network address unresolvable is commonly seen in the Kubernetes environment. It can happen when a bookie pod or a worker node restarts.

Changes

This pull request does the followings:

  1. Catch the NPE when creating a new ledger.
  2. When the timeout task is triggered, always execute the callback. It is totally fine because we already have the logic to ensure the callback is triggered only once.
  3. Add a mechanism to detect the CreatingLedger state is not moving.

@sijie sijie added type/bug The PR fixed a bug or issue reported a bug area/broker release/2.5.3 release/2.6.1 labels Jun 30, 2020
@sijie sijie added this to the 2.7.0 milestone Jun 30, 2020
@sijie sijie self-assigned this Jun 30, 2020
digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
} catch (Throwable cause) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try-catch here is good in any case, though we should also ensure that BK client is handling DNS errors by triggering the callback instead of exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix was already made in the bookkeeper client. We don't release the bookkeeper client yet.

}
cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change intentional?

Shouldn't the callback only be triggered if (!ledgerCreated.get()) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional - when the timeout task is triggered, always execute the callback. It is totally fine because we already have the logic to ensure the callback is triggered only once.

This change ensures all the logic is executed in a central place instead of spreading across multiple places and can potentially make code maintenance much harder.

@merlimat
Copy link
Contributor

@sijie One more thing. There are a couple more calls to create ledgers, we should then do try/catch on all of them:

$ git grep '\.asyncCreateLedger' | grep -v test
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java:        ledger.asyncCreateLedger(bookkeeper, config, digestType, (rc, lh, ctx) -> {
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java:            bookKeeper.asyncCreateLedger(config.getEnsembleSize(), config.getWriteQuorumSize(), config.getAckQuorumSize(),
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/schema/BookkeeperSchemaStorage.java:        bookKeeper.asyncCreateLedger(
pulsar-broker/src/main/java/org/apache/pulsar/compaction/TwoPhaseCompactor.java:        bk.asyncCreateLedger(conf.getManagedLedgerDefaultEnsembleSize(),

@merlimat
Copy link
Contributor

Anyway, we can do in separate PR. But my next question is: can the NPE also happen on asyncAddEntry() calls?

@sijie
Copy link
Member Author

sijie commented Jun 30, 2020

@merlimat the DNS resolution problem only happens when creating a ledger. So it doesn't happen on adding entries.

I think we should release a bug fix release on bookkeeper soon.

@merlimat merlimat merged commit 86e2610 into apache:master Jun 30, 2020
codelipenghui pushed a commit to streamnative/pulsar-archived that referenced this pull request Jul 14, 2020
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Jul 21, 2020
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Jul 24, 2020
wolfstudy pushed a commit that referenced this pull request Jul 29, 2020
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker release/2.5.3 release/2.6.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants