KAFKA-2720: expire group metadata when all offsets have expired #1427
Conversation
This patch moves group metadata removal to the periodic offset expiration thread so that group metadata (including the generation) is preserved until all offsets for the group have expired. Since the last member may leave the group before offsets have expired, this patch changes the behavior to increment the generation and essentially write an otherwise empty group metadata object to the log. The main idea is to avoid resetting the generation to 0 until it is safe to do so. This also helps users who currently have no easy way to tell whether the group simply doesn't exist or whether it is dead with possibly some offset state left behind. A couple of notes on the current patch:
delayedGroupStore.foreach(groupManager.store)
// for deadlock if the callback is invoked holding other locks (e.g. the replica
// state change lock)
delayedGroupStore.map(groupManager.store)
Why did you change from foreach to map? It doesn't seem like the return value is being used.
Yeah, you're right. I briefly had considered refactoring the delayed store api and this may have been a side effect of manually reverting.
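For readers following along, the distinction is small but real: on an Option, both foreach and map apply the function to the contained value, but map also wraps the (here useless) result in a new Option. A minimal sketch with hypothetical names standing in for the coordinator's delayed-store callback (not the actual Kafka code):

```scala
// Hypothetical stand-in for the coordinator's delayed store; not Kafka code.
case class DelayedStore(group: String)

var stored: List[String] = Nil
def store(d: DelayedStore): Unit = stored = d.group :: stored

val delayedGroupStore: Option[DelayedStore] = Some(DelayedStore("my-group"))

// Idiomatic: run the callback purely for its side effect, returns Unit.
delayedGroupStore.foreach(store)

// Same side effect, but allocates a discarded Option[Unit].
val ignored: Option[Unit] = delayedGroupStore.map(store)
```

Both lines store the group once; the only difference is the throwaway Option[Unit] allocation, which is why foreach is the idiomatic choice here.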
Just browsed through the code change. Regarding 2), I am still not clear how it prevents concurrent access on the
Specifically, could this issue be triggered even without log compaction stopping? (See the second incident in the following blog post: https://engineering.linkedin.com/blog/2016/05/kafkaesque-days-at-linkedin--part-1)
Summary of brief offline discussion with @guozhangwang: I think we agree that concurrent loading and expiration is safe with this change, but we are trying to find a way to also solve the problem of expiration with a concurrent offset commit, which is unprotected with and without this patch.
Hey @hachikuji, I'll try to get to this patch after I finish up testing the topic indexing idea you mentioned in the "[DISCUSS] scalability limits in the coordinator" thread.
Also ping @jjkoshy for review as well, since you fixed the earlier race condition around offset expiration and added the lock; we are trying to safely remove the lock and, at the same time, fix the race condition between offset expiration and offset commit as well.
 * transition: last offsets removed in periodic expiration task => Dead
 *             join group from a new member => PreparingRebalance
 */
private[coordinator] case object Empty extends GroupState { val state: Byte = 5 }
We should also document state transitions from other states to EMPTY state.
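One plausible shape for that documentation, sketched from the transitions quoted above (the state names come from the snippet; the transition table itself is illustrative, not the actual coordinator code):

```scala
// Illustrative only: which states may transition INTO Empty, mirroring the
// doc-comment style of the snippet above. The real rules live in the
// coordinator's GroupState definitions.
sealed trait GroupState { val state: Byte }
case object PreparingRebalance extends GroupState { val state: Byte = 1 }
case object Dead extends GroupState { val state: Byte = 4 }
case object Empty extends GroupState { val state: Byte = 5 }

// e.g. Empty is reached when the last member leaves while offsets are still
// retained, so its valid previous states would include PreparingRebalance.
val validPreviousStates: Map[GroupState, Set[GroupState]] =
  Map(Empty -> Set(PreparingRebalance))
```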
@onurkaraman @guozhangwang Ready for another look. The main change I made is moving the offset cache of each group into
Thanks @hachikuji, will review. The Jenkins failure looks familiar to me. I think we already have a JIRA for it?
@guozhangwang Yeah, looks like an instance of KAFKA-3155.
def currentGroups(): Iterable[GroupMetadata] = groupsCache.values
def start(enableExpiration: Boolean = true) {
This function name is a bit awkward to me:
- it is only triggered in GroupCoordinator.startup, and enableExpiration is always passed (i.e. the default value is never used?).
- it only takes effect if the passed enableExpiration is true.

Could we rename it to startMetadataCleanupInBackground with no parameters, and let the coordinator call it only if enableExpiration is true?
Ack. I changed the name to enableMetadataExpiration.
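As a rough sketch of the agreed shape (the method name comes from the thread; the scheduler wiring and class name below are assumptions, not the actual GroupMetadataManager):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Assumed wiring: the caller (the coordinator) decides whether expiration is
// enabled, so the method itself takes no flag.
class MetadataManagerSketch(cleanupIntervalMs: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  val cleanupRuns = new AtomicInteger(0)

  def enableMetadataExpiration(): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = cleanupGroupMetadata() },
      0L, cleanupIntervalMs, TimeUnit.MILLISECONDS)

  // placeholder for: expire offsets, tombstone groups whose offsets are gone
  private def cleanupGroupMetadata(): Unit = cleanupRuns.incrementAndGet()

  def shutdown(): Unit = scheduler.shutdownNow()
}
```

The design point from the review survives the rename: the conditional lives in the coordinator's startup path, and the manager's method is unconditional.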
The updated state machine diagram looks reasonable to me, and I will let @onurkaraman take another check on it. One general comment about naming of
@guozhangwang @ijuma @Ishiihara @onurkaraman: I'm removing the WIP tag on this patch. I've cleaned up a few small problems and added additional test cases. Take a look when you have a chance.
@hachikuji All my comments are from the previous pass and I do not have further ones; it would be OK to go ahead with this if others have made their own passes.
newGauge("NumOffsets",
  new Gauge[Int] {
    def value = offsetsCache.size
    def value = groupMetadataCache.values.map(group => {
This worries me a bit. Every time we compute this metric, we need to hold onto many locks.
Bugs me as well. I debated whether it was worthwhile keeping a separate atomic counter to track the size of the offsets or changing to a concurrent map, but it seemed like premature optimization. If you have any ideas, I'd love to hear them.
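For the record, the atomic-counter alternative mentioned above could look roughly like this (illustrative names, not the actual GroupMetadataManager; it trades the lock-heavy traversal on every gauge read for a little bookkeeping on the write paths):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative sketch: keep the offset count up to date incrementally so a
// NumOffsets-style gauge read is lock-free and O(1).
class OffsetCounter {
  private val numOffsets = new AtomicInteger(0)

  // call when an offset commit inserts a key that was not present before
  def onNewOffsetKey(): Unit = numOffsets.incrementAndGet()

  // call from the expiration task with the number of entries removed
  def onOffsetsRemoved(removed: Int): Unit = numOffsets.addAndGet(-removed)

  def value: Int = numOffsets.get
}
```

The cost is that every code path that adds or removes offset entries must remember to update the counter, which is exactly the kind of coupling the review is weighing against the per-read locking.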
FYI: There are some system test failures that I'm looking into.
Found the problem. Here's a nearly clean run: http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2016-06-15--001.1465997374--hachikuji--KAFKA-2720--9ffe4d0/. The two failures appear to be unrelated.
LGTM. Merged to trunk. |
/**
 * When this broker becomes a follower for an offsets topic partition clear out the cache for groups that belong to
 * that partition.
 * @param offsetsPartition Groups belonging to this partition of the offsets topic will be deleted from the cache.
 *
 * @param offsetsPartition Groups belonging to this partition of the offsets topic will be deleted from the cache.
Looks like an unintentional change, fix it in a follow-up? cc @hachikuji
Thanks @ijuma. I'll prepare a follow-up.