
KAFKA-9777; Remove txn purgatory to fix race condition on txn completion #8389

Merged
merged 3 commits into apache:trunk on Mar 31, 2020

Conversation

@hachikuji (Contributor) commented Mar 30, 2020

This patch addresses a locking issue with DelayedTxnMarker completion. Because of the reliance on the shared read lock in TransactionStateManager and the deadlock avoidance algorithm in DelayedOperation, we cannot guarantee that a call to checkAndComplete will offer an opportunity to complete the delayed operation. This patch removes the reliance on this lock in two ways:

  1. We replace the transaction marker purgatory with a map of transactions with pending markers (see the sketch after this list). We were not using purgatory expiration anyway, so this avoids the locking issue and simplifies usage.
  2. We were also relying on the read lock for the DelayedProduce completion when calling ReplicaManager.appendRecords. As far as I can tell, this was not necessary. The lock order is always 1) state read/write lock, 2) txn metadata locks. Since we only call appendRecords while holding the read lock, a deadlock does not seem possible.
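
As an illustration of change (1), here is a minimal sketch of the purgatory-free approach: a concurrent map keyed by transactional id tracks transactions whose markers are still in flight, and completion happens directly in the marker-write callback rather than through a purgatory checkAndComplete call. The names used here (TransactionMarkerTracker, PendingCompleteTxn, addPending, completeIfPending) are illustrative, not the identifiers in the actual patch.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the state tracked per transaction with markers in flight.
case class PendingCompleteTxn(transactionalId: String, coordinatorEpoch: Int)

class TransactionMarkerTracker {
  // Transactions whose markers have been sent but whose writes have not yet completed.
  private val transactionsWithPendingMarkers =
    new ConcurrentHashMap[String, PendingCompleteTxn]()

  // Register the transaction before writing its markers to the data partitions.
  def addPending(txn: PendingCompleteTxn): Unit =
    transactionsWithPendingMarkers.put(txn.transactionalId, txn)

  // When the last marker write succeeds, remove the entry and complete the
  // transaction directly -- no purgatory checkAndComplete call, and therefore
  // no dependence on DelayedOperation's tryLock-based deadlock avoidance.
  def completeIfPending(transactionalId: String)(complete: PendingCompleteTxn => Unit): Unit = {
    val pending = transactionsWithPendingMarkers.remove(transactionalId)
    if (pending != null)
      complete(pending)
  }
}
```

Because completion is driven by the marker-write callback looking up this map, it no longer depends on being able to acquire the coordinator's shared read lock inside a purgatory completion attempt, which is what the description above identifies as the source of the race.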

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@junrao (Contributor) left a comment

@hachikuji : Thanks for the PR. LGTM. Just a minor comment below.

@hachikuji (Contributor, Author) commented

@junrao Thanks for reviewing. I addressed the review comment and added a test case that reproduces the problem consistently on trunk.

@guozhangwang (Contributor) left a comment

Just a meta comment: could we augment GroupCoordinatorConcurrencyTest with ReplicaFetchRequest to cover the lock issue?

@hachikuji (Contributor, Author) commented Mar 31, 2020

The one failure is KAFKA-9783. I will go ahead and merge.

@hachikuji merged commit 75e8ee1 into apache:trunk on Mar 31, 2020
@guozhangwang (Contributor) left a comment

Sorry for getting to this PR a bit late; I made a pass on it and LGTM too.
