Skip to content

Conversation

@devinbost
Copy link
Contributor

It looks like it's possible for pendingAddEntries to have an OpAddEntry instance that hasn't had a ledgerId set before checkAddTimeout() is called.
We add the OpAddEntry to pendingAddEntries here:


and set the ledgerId later on OpAddEntry in that method:

If checkAddTimeout() is called before the ledgerId is set, the ledgerId will show as -1 (

log.error("Failed to add entry for ledger {} in time-out {} sec",
), and if that's still -1 when the handler reaches ledgerClosed(..), we may not close the ledger that was timing out.

@devinbost
Copy link
Contributor Author

@lhotari thoughts?

@Anonymitaet
Copy link
Member

@devinbost thanks for your contribution. For this PR, do we need to update docs?

@devinbost
Copy link
Contributor Author

@Anonymitaet Thanks for asking. I don't think that will be necessary for this PR.

@devinbost
Copy link
Contributor Author

@merlimat ?

@eolivelli
Copy link
Contributor

I believe that we should not add the operation to the list until it is fully prepared

@devinbost devinbost force-pushed the fix_pendingAddEntries_race branch from a223cc2 to 6583c24 Compare June 18, 2021 22:52
@devinbost
Copy link
Contributor Author

@eolivelli I moved the pendingAddEntries.add(addOperation) calls to after we're done making changes on the addOperation in the different code branches.

What happens if this block in [ManagedLedgerImpl].createComplete(..) is called before the thread processing pendingAddEntries has had a chance to process all the operations (for example, if there was backpressure)?

                STATE_UPDATER.set(this, State.ClosedLedger);
            } else {
                STATE_UPDATER.set(this, State.WriteFailed);
            }

(

)

Seems like a race... The state could be set to State.WriteFailed when in reality the addOperation just hasn't been processed yet. If they're set to WriteFailed, what happens? Do they get automatically reprocessed? I wonder if there could be an edge case there.

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

I think we have more flaky tests.

Error: Tests run: 6, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 14.843 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error: testReacquireLeadershipAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest) Time elapsed: 4.231 s <<< FAILURE!
java.lang.AssertionError: expected [null] but found [NoLeader]

Another one that failed earlier was:

ManagedLedgerTest.testExpiredLedgerDeletionAfterManagedLedgerRestart()

but that test passed for me locally.

@lhotari FYI.

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

@eolivelli PTAL. All the tests are passing.

@devinbost devinbost changed the title Synchronized checkAddTimeout due to race on pendingAddEntries Resolved race by fixing order of adding OpAddEntry to pendingAddEntries Jun 25, 2021
@devinbost devinbost changed the title Resolved race by fixing order of adding OpAddEntry to pendingAddEntries [Managed Ledger] Resolved race by fixing order of adding OpAddEntry to pendingAddEntries Jun 25, 2021
@devinbost
Copy link
Contributor Author

@sijie ?

@devinbost
Copy link
Contributor Author

or @codelipenghui

@devinbost
Copy link
Contributor Author

Due to this race, if the ledger isn't attached to the addOperation by the time the op is picked up from pendingAddEntries, then we can't run asyncAddEntry(..) on the op. But, at that point, I'd expect an NPE to be thrown.

Adding to pendingAddEntries after finishing changes on addOperation
@codelipenghui
Copy link
Contributor

The pr had no activity for 30 days, mark with Stale label.

@github-actions
Copy link

github-actions bot commented Mar 4, 2022

@devinbost:Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@tisonkun
Copy link
Member

Closed as stale and conflict. Please rebase and resubmit the patch if it's still relevant.

It seems no consensus here. PR page is always under high traffic. From GitHub data, the Pulsar community opens and closes about 300+ PR per month correspondingly.

I suggest you start a thread on the dev@ mailing list to reach a consensus first.

@tisonkun tisonkun closed this Dec 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants