
KAFKA-9777; Remove txn purgatory to fix race condition on txn completion #8389

Merged
merged 3 commits into apache:trunk on Mar 31, 2020

Conversation

@hachikuji (Contributor) commented Mar 30, 2020

This patch addresses a locking issue with DelayedTxnMarker completion. Because of the reliance on the shared read lock in TransactionStateManager and the deadlock avoidance algorithm in DelayedOperation, we cannot guarantee that a call to checkAndComplete will offer an opportunity to complete the delayed operation. This patch removes the reliance on this lock in two ways:

  1. We replace the transaction marker purgatory with a map of transactions with pending markers (see the sketch after this list). We were not using purgatory expiration anyway, so this avoids the locking issue and simplifies usage.
  2. We were also relying on the read lock for the DelayedProduce completion when calling ReplicaManager.appendRecords. As far as I can tell, this was not necessary. The lock order is always 1) state read/write lock, 2) txn metadata locks. Since we only call appendRecords while holding the read lock, a deadlock does not seem possible.
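
As an illustration of change (1), here is a minimal sketch of the purgatory-free approach: a concurrent map keyed by transactional id tracks transactions whose markers are still in flight, and completion happens directly in the marker-write callback rather than through a purgatory checkAndComplete call. The names used here (TransactionMarkerTracker, PendingCompleteTxn, addPending, completeIfPending) are illustrative, not the identifiers in the actual patch.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the state tracked per transaction with markers in flight.
case class PendingCompleteTxn(transactionalId: String, coordinatorEpoch: Int)

class TransactionMarkerTracker {
  // Transactions whose markers have been sent but whose writes have not yet completed.
  private val transactionsWithPendingMarkers =
    new ConcurrentHashMap[String, PendingCompleteTxn]()

  // Register the transaction before writing its markers to the data partitions.
  def addPending(txn: PendingCompleteTxn): Unit =
    transactionsWithPendingMarkers.put(txn.transactionalId, txn)

  // When the last marker write succeeds, remove the entry and complete the
  // transaction directly -- no purgatory checkAndComplete call, and therefore
  // no dependence on DelayedOperation's tryLock-based deadlock avoidance.
  def completeIfPending(transactionalId: String)(complete: PendingCompleteTxn => Unit): Unit = {
    val pending = transactionsWithPendingMarkers.remove(transactionalId)
    if (pending != null)
      complete(pending)
  }
}
```

Because completion is driven by the marker-write callback looking up this map, it no longer depends on being able to acquire the coordinator's shared read lock inside a purgatory completion attempt, which is what the description above identifies as the source of the race.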

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@junrao (Contributor) left a comment

@hachikuji : Thanks for the PR. LGTM. Just a minor comment below.

@hachikuji (Contributor, Author) commented

@junrao Thanks for reviewing. I addressed the review comment and added a test case that reproduces the problem consistently on trunk.

@guozhangwang (Contributor) left a comment

Just a meta comment: could we augment GroupCoordinatorConcurrencyTest with ReplicaFetchRequest to cover the lock issue?

@hachikuji (Contributor, Author) commented Mar 31, 2020

The one failure is KAFKA-9783. I will go ahead and merge.

@hachikuji merged commit 75e8ee1 into apache:trunk on Mar 31, 2020
@guozhangwang (Contributor) left a comment

Sorry for getting to this PR a bit late; I made a pass on it and LGTM too.
