Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-8715: fix timestamp bug for static member.id generation #7116

Merged
merged 3 commits into from Jul 26, 2019

Conversation

@abbccdda
Copy link
Contributor

commented Jul 25, 2019

The bug is that we accidentally used the loaded timestamp for the group instead of the real current time. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)
@abbccdda

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2019

@@ -873,6 +873,22 @@ class GroupCoordinatorTest {
assertEquals(Errors.FENCED_INSTANCE_ID, invalidHeartbeatResult)
}

@Test
def shouldGetDifferentStaticMemberIdBasedOnTimestamp() {

This comment has been minimized.

Copy link
@guozhangwang

guozhangwang Jul 25, 2019

Contributor

Does this test expose the original bug? I thought the bug would only surface if we have a loaded group metadata with old versioned format; otherwise the time stays the same but not None.

In this test MockTimer does not advance after the staticMembersJoinAndRebalance call, so the joinTimeAsStr would be the same?

I think we need to explicitly call timer.advanceClock() within the loop as well.

This comment has been minimized.

Copy link
@abbccdda

abbccdda Jul 25, 2019

Author Contributor

We do have that within staticMembersJoinAndRebalance

This comment has been minimized.

Copy link
@guozhangwang

guozhangwang Jul 25, 2019

Contributor

Yes but I'm thinking that we should also advance time within the forloop right? Otherwise the timestamp would still be static.

@guozhangwang
Copy link
Contributor

left a comment

Thanks for the quick fix, I have a question about the test coverage.

@hachikuji
Copy link
Contributor

left a comment

@abbccdda Thanks for the fix. I think this bug has both a direct and indirect impact. The direct impact is the NPE. The indirect impact is that the memberId is not being uniquely generated. This seems to have an impact for fencing? I think I'm wondering whether we should use UUID.randomUUID() rather than timestamp, similar to the dynamic path.

staticMembersJoinAndRebalance(leaderInstanceId, followerInstanceId)

val timeAdvance = 1
for (_ <- 1 to 20) {

This comment has been minimized.

Copy link
@hachikuji

hachikuji Jul 25, 2019

Contributor

20 seems like overkill. Maybe two times is enough.

This comment has been minimized.

Copy link
@abbccdda

abbccdda Jul 25, 2019

Author Contributor

Yea, will reduce it to 5.

@guozhangwang

This comment has been minimized.

Copy link
Contributor

commented Jul 25, 2019

@abbccdda Thanks for the fix. I think this bug has both a direct and indirect impact. The direct impact is the NPE. The indirect impact is that the memberId is not being uniquely generated. This seems to have an impact for fencing? I think I'm wondering whether we should use UUID.randomUUID() rather than timestamp, similar to the dynamic path.

Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk is not worth dropping the debugging efficiency?

@abbccdda

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2019

I still feel timestamp is useful for debugging like Guozhang suggested. For uniqueness I'm fine with both.

@hachikuji

This comment has been minimized.

Copy link
Contributor

commented Jul 25, 2019

Ok, we can stick with timestamp.

@guozhangwang

This comment has been minimized.

Copy link
Contributor

commented Jul 25, 2019

LGTM.

@ijuma

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2019

Hmm, the reasoning for using a timestamp seems a bit odd. Are we sure about this?

@abbccdda

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

Yea, I think we discussed a couple of times during past reviews, and felt that a timestamp suffix could help debug when a static member actually rejoins the group.

@abbccdda

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

get 1/3 build with jdk 8 unknown failure

retest this please

@ijuma

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2019

We can use a time-based UUID or you could add the timestamp in addition to the UUID. Relying on a timestamp for uniqueness is an anti-pattern. In particular, the following statement is weird:

"Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk"

We know things like this happen, it's not theoretical.

@guozhangwang

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2019

I'm fine with a instanceId + UUID + timestamp format than instanceId + timestamp -- I do feel maintaining the timestamp would be convenient for debugging.

@hachikuji

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2019

I prefer instanceId + UUID for consistency. I think the debugging benefit of having a timestamp is marginal.

@abbccdda

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

Per offline discussion, we shall use UUID for consistency. Thank you all!

@hachikuji hachikuji merged commit 79341a1 into apache:trunk Jul 26, 2019

2 of 3 checks passed

JDK 8 and Scala 2.11 FAILURE 11692 tests run, 67 skipped, 1 failed.
Details
JDK 11 and Scala 2.12 SUCCESS 11692 tests run, 67 skipped, 0 failed.
Details
JDK 11 and Scala 2.13 SUCCESS 11692 tests run, 67 skipped, 0 failed.
Details
hachikuji added a commit that referenced this pull request Jul 26, 2019
KAFKA-8715; Fix buggy reliance on state timestamp in static member.id…
… generation (#7116)

The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
ijuma added a commit to confluentinc/kafka that referenced this pull request Aug 4, 2019
Merge remote-tracking branch 'apache-github/2.3' into ccs-2.3
* apache-github/2.3:
  MINOR: Avoid dividing by zero (apache#7143)
  KAFKA-8731: InMemorySessionStore throws NullPointerException on startup (apache#7132)
  KAFKA-8715; Fix buggy reliance on state timestamp in static member.id generation (apache#7116)
  KAFKA-8678; Fix leave group protocol bug in throttling and error response (apache#7101)
xiowu0 added a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
[LI-CHERRY-PICK] [c29380f] KAFKA-8715; Fix buggy reliance on state ti…
…mestamp in static member.id generation (apache#7116)

TICKET = KAFKA-8715
LI_DESCRIPTION =
EXIT_CRITERIA = HASH [c29380f]
ORIGINAL_DESCRIPTION =

The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
(cherry picked from commit c29380f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.