
KAFKA-2720: expire group metadata when all offsets have expired #1427

Closed
wants to merge 7 commits

Conversation

hachikuji
Contributor

No description provided.

@hachikuji
Contributor Author

cc @onurkaraman @guozhangwang

This patch moves group metadata removal to the periodic offset expiration thread so that group metadata (including the generation) is preserved until all offsets for the group have expired. Since the last member may leave the group before its offsets have expired, this patch changes the behavior to increment the generation and essentially write an otherwise empty group metadata object to the log. The main idea is to avoid resetting the generation to 0 until it is safe to do so. This also helps users, who currently have no easy way to tell whether a group simply doesn't exist or is dead with possibly some offset state left behind.

A couple of notes on the current patch:

  1. I've added a new Empty state to the coordinator's group state machine. This may be unnecessary, since it seems Stable could work as well, but I slightly preferred to keep the logic for this case separate. It also gets around the need for a direct transition from PreparingRebalance to Stable, which should not normally be possible. Instead, we allow PreparingRebalance to transition to Empty when there are no members left in the group.
  2. I've done a little bit of additional cleanup. The most notable change is the removal of the offset expiration lock in GroupMetadataManager, which seemed unnecessary with a small change to the offset loading logic. Rather than loading offsets directly into the cache, I first stage them in a local collection as is done with the group metadata. This should prevent the expiration thread from seeing stale values, but let me know if I've missed something.
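
A minimal sketch of the staged-loading idea from point 2, for illustration only; the names here (loadOffsetsStaged, offsetsCache, GroupTopicPartition, and the OffsetAndMetadata fields) are assumptions and not the exact code in this patch:

```scala
import scala.collection.mutable

case class GroupTopicPartition(group: String, topic: String, partition: Int)
case class OffsetAndMetadata(offset: Long, metadata: String, commitTimestamp: Long)

// Hypothetical loading helper: stage offsets in a local collection first, then
// install them into the shared cache in one step, so a concurrently running
// expiration thread never observes a partially loaded (stale) view.
def loadOffsetsStaged(records: Iterable[(GroupTopicPartition, OffsetAndMetadata)],
                      offsetsCache: mutable.Map[GroupTopicPartition, OffsetAndMetadata]): Unit = {
  val loadedOffsets = mutable.Map.empty[GroupTopicPartition, OffsetAndMetadata]
  records.foreach { case (key, value) => loadedOffsets.put(key, value) }

  // Publish the fully staged offsets at once.
  offsetsCache ++= loadedOffsets
}
```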

delayedGroupStore.foreach(groupManager.store)
// for deadlock if the callback is invoked holding other locks (e.g. the replica
// state change lock)
delayedGroupStore.map(groupManager.store)
Contributor

Why did you change from foreach to map? It doesn't seem like the return value is being used.

Contributor Author

Yeah, you're right. I had briefly considered refactoring the delayed store API, and this may have been a side effect of manually reverting.

@guozhangwang
Contributor

Just browsed through the code change. Regarding 2), I am still not clear how it prevents concurrent access to the offsetCache. Let's discuss offline.

@guozhangwang
Contributor

Specifically, could this issue be triggered even without log compaction stopping? (See the second incident in the blog post below.)

https://engineering.linkedin.com/blog/2016/05/kafkaesque-days-at-linkedin--part-1

@hachikuji
Contributor Author

hachikuji commented May 25, 2016

Summary of brief offline discussion with @guozhangwang: I think we agree that concurrent loading and expiration is safe with this change, but we are trying to find a way to also solve the problem of expiration with a concurrent offset commit which is unprotected with and without this patch.

@onurkaraman
Contributor

hey @hachikuji I'll try to get to this patch after I finish up testing the topic indexing idea you mentioned in the "[DISCUSS] scalability limits in the coordinator" thread.

@guozhangwang
Contributor

Also pinging @jjkoshy for review, since you fixed the earlier race condition around offset expiration and added the lock; we are trying to safely remove that lock and, at the same time, fix the race condition between offset expiration and offset commit as well.

* transition: last offsets removed in periodic expiration task => Dead
* join group from a new member => PreparingRebalance
*/
private[coordinator] case object Empty extends GroupState { val state: Byte = 5 }
Contributor

We should also document the state transitions from other states into the Empty state.
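
Purely as an illustration of this suggestion (not the final code), the Empty state's doc comment could also list the inbound transition described in this PR, i.e. PreparingRebalance moving to Empty once the last member has left:

```scala
/**
 * Group has no more members, but lingers until all of its offsets have expired.
 *
 * transition: last offsets removed in periodic expiration task => Dead
 *             join group from a new member => PreparingRebalance
 *
 * transition into Empty: all members leave the group during PreparingRebalance
 */
private[coordinator] case object Empty extends GroupState { val state: Byte = 5 }
```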

@hachikuji
Contributor Author

@onurkaraman @guozhangwang Ready for another look. The main change I made is moving the offset cache of each group into GroupMetadata. For groups which only use offset commits, the group will stay in the Empty state until all offsets for the group expire. The change helps to unify expiration logic and addresses the race condition between offset commits and offset expiration, which could previously result in an inconsistent cache. Now offset update/expiration is protected with the group lock.
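
A minimal sketch of the locking scheme described above, with hypothetical class and field names rather than the exact patch code; the point is simply that offset commits and offset expiration both synchronize on the owning group:

```scala
import scala.collection.mutable

case class TopicPartition(topic: String, partition: Int)
case class OffsetAndMetadata(offset: Long, metadata: String, expirationTimestamp: Long)

class GroupMetadata(val groupId: String) {
  // Per-group offset cache, guarded by the group's own lock.
  private val offsets = mutable.Map.empty[TopicPartition, OffsetAndMetadata]

  def completeOffsetCommit(tp: TopicPartition, offset: OffsetAndMetadata): Unit = this.synchronized {
    offsets.put(tp, offset)
  }

  def removeExpiredOffsets(now: Long): Map[TopicPartition, OffsetAndMetadata] = this.synchronized {
    val expired = offsets.filter { case (_, o) => o.expirationTimestamp <= now }.toMap
    expired.keys.foreach(offsets.remove)
    expired
  }
}
```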

@guozhangwang
Contributor

Thanks @hachikuji , will review.

The Jenkins failure looks familiar to me. I think we already have a JIRA for it?

@hachikuji
Contributor Author

@guozhangwang Yeah, looks like an instance of KAFKA-3155.

}
)

def currentGroups(): Iterable[GroupMetadata] = groupsCache.values
def start(enableExpiration: Boolean = true) {
Contributor

This function name is a bit awkward to me:

  1. it is only triggered in GroupCoordinator.startup, and enableExpiration is always passed (i.e. default value is never used?).

  2. it only takes effect if the passed enableExpiration is true.

Could we rename it to startMetadataCleanupInBackground with no parameters, and let the coordinator call it only if enableExpiration is true?

Contributor Author

Ack. I changed the name to enableMetadataExpiration.

@guozhangwang
Contributor

The updated state machine diagram looks reasonable to me, and I will let @onurkaraman take another look at it. One general comment about the naming of removeGroup, removeGroupForPartition, onGroupLoaded, and onGroupUnloaded: could we differentiate them better, for example as deleteGroup, evictGroupsFromCacheForPartition, onGroupLoadedToCache, and onGroupEvictedFromCache? And could we also add some comments on when they can be triggered?

@hachikuji
Contributor Author

@guozhangwang @ijuma @Ishiihara @onurkaraman: I'm removing the WIP tag on this patch. I've cleaned up a few small problems and added additional test cases. Take a look when you have a chance.

hachikuji changed the title from "KAFKA-2720 [WIP]: expire group metadata when all offsets have expired" to "KAFKA-2720: expire group metadata when all offsets have expired" on Jun 13, 2016
@guozhangwang
Contributor

@hachikuji All my comments are from the previous pass and I do not have further ones; it would be OK to go ahead with it once others have made their own passes.

newGauge("NumOffsets",
new Gauge[Int] {
def value = offsetsCache.size
def value = groupMetadataCache.values.map(group => {
Contributor

This worries me a bit. Every time we compute this metric, we need to hold onto many locks.

Contributor Author

Bugs me as well. I debated whether it was worthwhile keeping a separate atomic counter to track the size of the offsets or changing to a concurrent map, but it seemed like premature optimization. If you have any ideas, I'd love to hear them.
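
For reference, a sketch of the atomic-counter alternative mentioned above (purely illustrative, not part of this patch), which would let the NumOffsets gauge read its value without taking any group locks:

```scala
import java.util.concurrent.atomic.AtomicInteger

class OffsetMetrics {
  private val numOffsets = new AtomicInteger(0)

  // Call sites that add or remove a committed offset would update the counter.
  def offsetAdded(): Unit = numOffsets.incrementAndGet()
  def offsetRemoved(): Unit = numOffsets.decrementAndGet()

  // The gauge can then read the value lock-free.
  def value: Int = numOffsets.get()
}
```

The trade-off, as noted, is extra bookkeeping at every commit and expiration site versus holding many group locks on each metric read.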

@hachikuji
Contributor Author

FYI: There are some system test failures that I'm looking into.

@hachikuji
Contributor Author

Found the problem. Here's a nearly clean run: http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2016-06-15--001.1465997374--hachikuji--KAFKA-2720--9ffe4d0/. The two failures appear to be unrelated.

@guozhangwang
Contributor

LGTM. Merged to trunk.

asfgit closed this in 8c55167 on Jun 16, 2016
/**
* When this broker becomes a follower for an offsets topic partition clear out the cache for groups that belong to
* that partition.
* @param offsetsPartition Groups belonging to this partition of the offsets topic will be deleted from the cache.
*
* @param offsetsPartition Groups belonging to this partition of the offsets topic will be deleted from the cache.
Contributor

Looks like an unintentional change, fix it in a follow-up? cc @hachikuji

@hachikuji
Contributor Author

Thanks @ijuma. I'll prepare a follow-up.
