Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-9844; Maximum number of members within a group is not always enforced due to a race condition in join group #8454

Merged
merged 5 commits into from
Apr 23, 2020

Conversation

dajac
Copy link
Contributor

@dajac dajac commented Apr 9, 2020

This patch fixes a race condition in the join group request handling which sometimes results in not enforcing the maximum number of members allowed in a group. The JIRA provides an example: https://issues.apache.org/jira/browse/KAFKA-9844

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

…forced due to a race condition in join group
@dajac
Copy link
Contributor Author

dajac commented Apr 9, 2020

I think that this is the root cause of https://issues.apache.org/jira/browse/KAFKA-7965.

@hachikuji
Copy link
Contributor

ok to test

Copy link
Contributor

@abbccdda abbccdda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Although I don't think this alone helps close the gap for the flaky test itself as I have also seen cases where we reject more members than necessary. Could we get a unit test for the patch?

@hachikuji
Copy link
Contributor

@abbccdda Yeah, there is a separate problem discussed in more detail here: #8437.

@hachikuji
Copy link
Contributor

Thanks @dajac. This fix LGTM. It might be worth taking a look at GroupCoordinatorConcurrencyTest to see if there is a straightforward way to build a test which hits this problem.

@dajac
Copy link
Contributor Author

dajac commented Apr 13, 2020

@abbccdda @hachikuji I have added a test in GroupCoordinatorConcurrencyTest which reproduce the bug. It was pretty straightforward. The test fails all time without the fix.

@hachikuji
Copy link
Contributor

ok to test

@hachikuji
Copy link
Contributor

retest this please

1 similar comment
@hachikuji
Copy link
Contributor

retest this please

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test case. Left just one minor comment.

*/
def getGroup(groupId: String): Option[GroupMetadata] = {
Option(groupMetadataCache.get(groupId))
def getGroup(groupId: String, createIfNotExist: Boolean = false): Option[GroupMetadata] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional arguments are best avoided. Maybe we could split this into two methods getOrCreateGroup and getGroup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I went with the following method getOrMaybeCreateGroup which still takes the createIfNotExist argument in order to keep the calling side concise. It avoids having to switch between the two on the calling side which makes the error handling a little easier to follow. This is perhaps what you had in mind already.

@hachikuji
Copy link
Contributor

retest this please

@hachikuji
Copy link
Contributor

ok to test

@hachikuji
Copy link
Contributor

retest this please

@dajac
Copy link
Contributor Author

dajac commented Apr 22, 2020

@hachikuji Could we get this one merged?

@hachikuji
Copy link
Contributor

retest this please

@hachikuji hachikuji merged commit 85e81d4 into apache:trunk Apr 23, 2020
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 29, 2020
…t-for-generated-requests

* apache-github/trunk: (366 commits)
  MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982)
  MINOR: document how to escape json parameters to ducktape tests (apache#8546)
  KAFKA-9885; Evict last members of a group when the maximum allowed is reached (apache#8525)
  KAFKA-9866: Avoid election for topics where preferred leader is not in ISR (apache#8524)
  KAFKA-9839; Broker should accept control requests with newer broker epoch (apache#8509)
  KAKFA-9612: Add an option to kafka-configs.sh to add configs from a prop file (KIP-574)
  MINOR: Partition is under reassignment when adding and removing (apache#8364)
  MINOR: reduce allocations in log start and recovery checkpoints (apache#8467)
  MINOR: Remove unused foreign-key join class (apache#8547)
  HOTFIX: Fix broker bounce system tests (apache#8532)
  KAFKA-9704: Fix the issue z/OS won't let us resize file when mmap. (apache#8224)
  KAFKA-8639: Replace AddPartitionsToTxn with Automated Protocol  (apache#8326)
  MINOR: equals() should compare all fields for generated classes (apache#8539)
  KAFKA-9844; Fix race condition which allows more than maximum number of members(apache#8454)
  KAFKA-9823: Remember the sent generation for the coordinator request (apache#8445)
  KAFKA-9883: Add better error message when REST API forwards a request and leader is not known (apache#8536)
  KAFKA-9907: Switch default build to Scala 2.13 (apache#8537)
  MINOR: Some html fixes in Streams DSL documentation (apache#8503)
  MINOR: Enable fatal warnings with scala 2.13 (apache#8429)
  KAFKA-9852: Change the max duration that calls to the buffer pool can block from 2000ms to 10ms to reduce overall test runtime (apache#8464)
  ...
@dajac dajac deleted the KAFKA-9844 branch October 6, 2020 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants