[Managed Ledger] Resolved race by fixing order of adding OpAddEntry to pendingAddEntries #10758

devinbost · 2021-05-31T21:27:52Z

It looks like it's possible for pendingAddEntries to have an OpAddEntry instance that hasn't had a ledgerId set before checkAddTimeout() is called.
We add the OpAddEntry to pendingAddEntries here:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

Line 716 in a223cc2

pendingAddEntries.add(addOperation);

and set the ledgerId later on OpAddEntry in that method:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

Line 760 in a223cc2

addOperation.setLedger(currentLedger);

If checkAddTimeout() is called before the ledgerId is set, the ledgerId will show as -1 in this log message:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

Line 3678 in a223cc2

log.error("Failed to add entry for ledger {} in time-out {} sec",

If that's still -1 when the handler reaches ledgerClosed(..), we may not close the ledger that was timing out.
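To make the window concrete, here is a minimal sketch of the ordering described above. It is not the Pulsar code: SketchOp, AddBeforePrepareSketch, and their methods are hypothetical stand-ins, and the timeout check is simulated with a direct call between the two steps instead of a real timer thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for OpAddEntry; the real class carries far more state.
class SketchOp {
    private volatile long ledgerId = -1L; // -1 means no ledger attached yet

    void setLedgerId(long ledgerId) {
        this.ledgerId = ledgerId;
    }

    long getLedgerId() {
        return ledgerId;
    }
}

class AddBeforePrepareSketch {
    private final Queue<SketchOp> pending = new ConcurrentLinkedQueue<>();

    // Stand-in for the timeout check: reads the ledger id of the oldest pending op.
    long ledgerIdSeenByTimeoutCheck() {
        SketchOp op = pending.peek();
        return op == null ? Long.MIN_VALUE : op.getLedgerId();
    }

    // Publishes the op first and attaches the ledger afterwards.
    // The call in between simulates the timeout check firing inside that window.
    long addThenSetLedger(long ledgerId) {
        SketchOp op = new SketchOp();
        pending.add(op);
        long seen = ledgerIdSeenByTimeoutCheck();
        op.setLedgerId(ledgerId);
        return seen;
    }

    public static void main(String[] args) {
        AddBeforePrepareSketch sketch = new AddBeforePrepareSketch();
        // Prints "timeout check saw ledger -1"
        System.out.println("timeout check saw ledger " + sketch.addThenSetLedger(42L));
    }
}
```

In this simplified model, the check observes -1 even though the op is later given ledger 42, which is the symptom described in the log message above.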

devinbost · 2021-05-31T21:28:09Z

@lhotari thoughts?

Anonymitaet · 2021-06-01T00:23:54Z

@devinbost thanks for your contribution. For this PR, do we need to update docs?

devinbost · 2021-06-01T03:40:27Z

@Anonymitaet Thanks for asking. I don't think that will be necessary for this PR.

devinbost · 2021-06-01T05:50:57Z

@merlimat ?

eolivelli · 2021-06-01T11:25:26Z

I believe that we should not add the operation to the list until it is fully prepared.

devinbost · 2021-06-18T23:34:50Z

@eolivelli I moved the pendingAddEntries.add(addOperation) calls to after we're done making changes on the addOperation in the different code branches.
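As a rough illustration of the "prepare first, publish last" ordering, the sketch below uses hypothetical stand-ins (PreparedOp, PrepareBeforeAddSketch) and not the actual ManagedLedgerImpl code. It assumes the pending collection is a thread-safe queue such as ConcurrentLinkedQueue, so that everything written to the op before add(..) is visible to a thread that later reads the op from the queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for OpAddEntry.
class PreparedOp {
    private volatile long ledgerId = -1L; // -1 means no ledger attached yet

    void setLedgerId(long ledgerId) {
        this.ledgerId = ledgerId;
    }

    long getLedgerId() {
        return ledgerId;
    }
}

class PrepareBeforeAddSketch {
    private final Queue<PreparedOp> pending = new ConcurrentLinkedQueue<>();

    void addEntry(long ledgerId) {
        PreparedOp op = new PreparedOp();
        op.setLedgerId(ledgerId); // finish all changes on the op first
        pending.add(op);          // only then make it visible to other threads
    }

    // Stand-in for the timeout check: counts pending ops that still have no ledger.
    long countOpsWithoutLedger() {
        long count = 0;
        for (PreparedOp op : pending) {
            if (op.getLedgerId() == -1L) {
                count++;
            }
        }
        return count;
    }

    // Runs a writer alongside a concurrent checker and returns how many
    // times the checker observed an op without a ledger.
    static long run(int ops) {
        PrepareBeforeAddSketch sketch = new PrepareBeforeAddSketch();
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicLong sightings = new AtomicLong();
        Thread checker = new Thread(() -> {
            while (!done.get()) {
                sightings.addAndGet(sketch.countOpsWithoutLedger());
            }
            sightings.addAndGet(sketch.countOpsWithoutLedger()); // final pass
        });
        checker.start();
        for (int i = 0; i < ops; i++) {
            sketch.addEntry(i);
        }
        done.set(true);
        try {
            checker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sightings.get();
    }

    public static void main(String[] args) {
        System.out.println("ops seen without a ledger: " + run(1000));
    }
}
```

With this ordering the checker should never observe -1, because an op only becomes reachable through the queue after its ledger id has been written.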

What happens if this block in ManagedLedgerImpl.createComplete(..) runs before the thread processing pendingAddEntries has had a chance to process all the operations (for example, if there was backpressure)?

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

Line 1403 in 380cf92

            if (pendingAddEntries.isEmpty()) {
                STATE_UPDATER.set(this, State.ClosedLedger);
            } else {
                STATE_UPDATER.set(this, State.WriteFailed);
            }

Seems like a race... The state could be set to State.WriteFailed when in reality the addOperation just hasn't been processed yet. If the state is set to WriteFailed, what happens to the pending operations? Do they get automatically reprocessed? I wonder if there could be an edge case there.
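The concern can be reduced to a small model of the branch quoted above. This is only a sketch: CreateCompleteStateSketch and stateFor are hypothetical names, the enum carries just the two states mentioned here, and nothing below says how the real code recovers from WriteFailed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class CreateCompleteStateSketch {
    // Only the two states named in the quoted branch; the real enum has more.
    enum State { ClosedLedger, WriteFailed }

    // Mirrors the quoted decision: the state depends only on whether
    // the pending queue is empty at the moment the branch runs.
    static State stateFor(Queue<?> pendingAddEntries) {
        return pendingAddEntries.isEmpty() ? State.ClosedLedger : State.WriteFailed;
    }

    public static void main(String[] args) {
        Queue<Object> pending = new ArrayDeque<>();
        System.out.println(stateFor(pending)); // prints ClosedLedger

        // An op that was queued but simply not processed yet is
        // indistinguishable from one whose write actually failed.
        pending.add(new Object());
        System.out.println(stateFor(pending)); // prints WriteFailed
    }
}
```

In this model, a queued-but-unprocessed operation is enough to select WriteFailed, which is the edge case the question above is asking about.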

devinbost · 2021-06-24T22:14:26Z

/pulsarbot run-failure-checks

devinbost · 2021-06-24T23:29:00Z

I think we have more flaky tests.

Error: Tests run: 6, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 14.843 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error: testReacquireLeadershipAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest) Time elapsed: 4.231 s <<< FAILURE!
java.lang.AssertionError: expected [null] but found [NoLeader]

Another one that failed earlier was:

ManagedLedgerTest.testExpiredLedgerDeletionAfterManagedLedgerRestart()

but that test passed for me locally.

@lhotari FYI.

devinbost · 2021-06-24T23:29:10Z

/pulsarbot run-failure-checks

devinbost · 2021-06-25T01:30:02Z

@eolivelli PTAL. All the tests are passing.

devinbost · 2021-06-29T20:43:59Z

@sijie ?

devinbost · 2021-06-29T20:45:52Z

or @codelipenghui

devinbost · 2021-06-29T21:14:10Z

Due to this race, if the ledger isn't attached to the addOperation by the time the op is picked up from pendingAddEntries, then we can't run asyncAddEntry(..) on the op. But, at that point, I'd expect an NPE to be thrown.

Adding to pendingAddEntries after finishing changes on addOperation

codelipenghui · 2022-03-04T06:59:52Z

The PR had no activity for 30 days; marking it with the Stale label.

github-actions · 2022-03-04T07:00:10Z

@devinbost: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

tisonkun · 2022-12-10T07:17:13Z

Closed as stale and conflict. Please rebase and resubmit the patch if it's still relevant.

There seems to be no consensus here. The PR page is always under high traffic. According to GitHub data, the Pulsar community opens and closes about 300+ PRs per month, respectively.

I suggest you start a thread on the dev@ mailing list to reach a consensus first.

devinbost force-pushed the fix_pendingAddEntries_race branch from a223cc2 to 6583c24 Compare June 18, 2021 22:52

devinbost changed the title ~~Synchronized checkAddTimeout due to race on pendingAddEntries~~ Resolved race by fixing order of adding OpAddEntry to pendingAddEntries Jun 25, 2021

devinbost changed the title ~~Resolved race by fixing order of adding OpAddEntry to pendingAddEntries~~ [Managed Ledger] Resolved race by fixing order of adding OpAddEntry to pendingAddEntries Jun 25, 2021

Synchronized checkAddTimeout due to race on pendingAddEntries

b546920

Adding to pendingAddEntries after finishing changes on addOperation

devinbost force-pushed the fix_pendingAddEntries_race branch from 380cf92 to b546920 Compare July 5, 2021 21:03

codelipenghui added the lifecycle/stale label Mar 4, 2022

github-actions bot assigned devinbost Mar 4, 2022

github-actions bot added the doc-label-missing label Mar 4, 2022

tisonkun closed this Dec 10, 2022

5 participants