KAFKA-12468, KAFKA-13659, KAFKA-12566: Fix MM2 causing negative downstream lag #13178
After #13181 is merged, I'll rebase and remove my fairness patch.
Thanks @gharris1727. I had to read a few parts of this several times over to grok things, but your description and the changes here make sense now.
I believe that the proposed solution is reasonable since the current approach is based on a flawed assumption and we have seen several real-life cases where users are affected negatively by this. Eliminating reliance on this assumption will eliminate cases of negative consumer lag and potential data loss by downstream consumers, at the cost of making duplicate message consumption on failover to downstream clusters more likely. I think this is a tradeoff worth making.
It does seem that there's a chance this could lead to a regression in behavior in cases where the assumption about offsets of upstream records aligning with offsets of downstream records (or rather, about the delta between the two) holds. I notice that we don't do a read-to-end of the offset syncs topic in MirrorCheckpointTask before we begin syncing consumer group offsets; instead, we begin reading that topic from the beginning. This may cause us to sync offsets based on stale checkpoints if there are later checkpoints available in the topic that we haven't consumed yet. Do you think it might make sense to add a read-to-end for the offset syncs topic before we begin syncing consumer group offsets in the checkpoint connector? (If so, this should probably be handled in a follow-up PR; I only bring it up now because it seems like the impact of syncing from stale checkpoints may be exacerbated by this change.)
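For illustration, a read-to-end could look roughly like the following sketch (a hypothetical helper, not the connector's actual code): it snapshots the end offsets of the offset syncs topic and polls until the consumer's position reaches them.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch: block until `consumer` has read to the end of its
// assigned partitions of the offset syncs topic, so that offset translation
// only begins once the in-memory sync state reflects the latest checkpoints.
public final class OffsetSyncsReadToEnd {

    public static void readToEnd(Consumer<byte[], byte[]> consumer,
                                 Collection<TopicPartition> assignedPartitions) {
        // Snapshot the current end offsets; anything appended later is
        // picked up by the normal consume loop.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignedPartitions);
        while (!caughtUp(consumer, endOffsets)) {
            // Records returned here would be applied to the OffsetSyncStore.
            consumer.poll(Duration.ofMillis(100));
        }
    }

    private static boolean caughtUp(Consumer<byte[], byte[]> consumer,
                                    Map<TopicPartition, Long> endOffsets) {
        return endOffsets.entrySet().stream()
                .allMatch(entry -> consumer.position(entry.getKey()) >= entry.getValue());
    }
}
```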
Oh, and the Jenkins build seems to be consistently failing on the …
Ouch, yeah, that is certainly an issue that gets worse with my change.
It looks like this was failing due to a typo in my offset.flush.interval.ms overrides. This should be fixed now.
And funnily enough, someone's already filed a ticket for that exact issue! 🎉

Out of an abundance of caution, what do you think about targeting your …? This has the potential to be a fairly large change in behavior, and I'd like to do everything we can to minimize the chances that it breaks users' setups. Ensuring that this PR is merged if and only if a fix for KAFKA-13659 is merged would help on that front.
@mimaison This is a moderately large change in behavior and, if possible, it'd be nice to get another set of eyes on it before merging. We don't need another reviewer for the PR changes (although comments are always welcome); instead, I'd just like confirmation that this change is safe to make as a bug fix.

TL;DR: If an upstream consumer group is ahead of the upstream offset for the latest-emitted checkpoint, we will only sync offsets for that consumer group to the downstream cluster based on the offset pair for that checkpoint, instead of adding the delta of (upstream offset for consumer group - upstream offset in checkpoint), since there is no guarantee that that delta will be accurate in cases where the upstream topic is compacted, has transaction markers, or has some records filtered out via SMT.
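As a rough illustration of that TL;DR, here is a minimal sketch with illustrative names (`CheckpointTranslationSketch`, `OffsetSync`, `latestSyncs` are stand-ins, not the PR's actual OffsetSyncStore code); the old and new behavior differ only in the last step, and the exact off-by-one handling at the sync point is an assumption of this sketch:

```java
import java.util.Map;
import java.util.OptionalLong;

import org.apache.kafka.common.TopicPartition;

// Minimal sketch of the translation change described above.
final class CheckpointTranslationSketch {

    record OffsetSync(long upstreamOffset, long downstreamOffset) { }

    private final Map<TopicPartition, OffsetSync> latestSyncs;

    CheckpointTranslationSketch(Map<TopicPartition, OffsetSync> latestSyncs) {
        this.latestSyncs = latestSyncs;
    }

    OptionalLong translate(TopicPartition upstreamPartition, long groupOffset) {
        OffsetSync sync = latestSyncs.get(upstreamPartition);
        if (sync == null || groupOffset < sync.upstreamOffset()) {
            return OptionalLong.empty(); // no usable sync for this group offset
        }
        // Old behavior: sync.downstreamOffset() + (groupOffset - sync.upstreamOffset()),
        // which assumes upstream and downstream offsets advance in lockstep, and can
        // point past the downstream log end, producing negative lag.
        // New behavior: never advance past the offset pair in the sync itself.
        long step = groupOffset > sync.upstreamOffset() ? 1 : 0;
        return OptionalLong.of(sync.downstreamOffset() + step);
    }
}
```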
LGTM, thanks Greg! This was much trickier than expected but I'm really happy with the quality of this fix and am optimistic it's going to make our users much happier.
Will merge pending CI build.
Hmmm... there appear to be some integration test failures. I've reproduced some of them locally too, which makes flakiness an unlikely cause. Can you look into the integration test failures and see if we can get a green run before merging this?
Unfortunately those test failures only appear in the EOS test and appear to be caused by EOS mode. I added a tweak to firePendingOffsetSyncs that drains the offset syncs map on each commit. Do you think we can use the same blocking drain for EOS and normal mode, or should this behavior only be enabled for EOS mode?
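For concreteness, the drain-on-commit idea looks roughly like this hypothetical sketch (`PendingOffsetSyncs`, `OffsetSync`, and `stage` are illustrative stand-ins, not MM2's actual internals):

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch of draining all pending offset syncs on commit,
// instead of sending at most a bounded number per commit interval.
final class PendingOffsetSyncs {

    record OffsetSync(long upstreamOffset, long downstreamOffset) { }

    private final Map<TopicPartition, OffsetSync> pending = new LinkedHashMap<>();

    synchronized void stage(TopicPartition partition, OffsetSync sync) {
        // A newer sync for the same partition replaces the older one.
        pending.put(partition, sync);
    }

    // Called from the task's commit path: take everything that is queued,
    // so no sync is left unsent when the commit completes.
    synchronized Map<TopicPartition, OffsetSync> drain() {
        Map<TopicPartition, OffsetSync> drained = new LinkedHashMap<>(pending);
        pending.clear();
        return drained;
    }
}
```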
Hmmm... we make an effort to periodically invoke …
I've split the periodic commit fix into a separate PR (#13262) as it is not related to the changes here, only the test changes. This PR depends on that change landing first, and I'll update this branch once the other PR is merged.
…over minimum offsets: this de-flakes the MirrorConnectorsIntegrationExactlyOnce test that had this condition exit too early, causing a later assertion to fail.
Test failures are all unrelated. LGTM!
…tream lag, syncing stale offsets, and flaky integration tests (#13178)

KAFKA-12468: Fix negative lag on down consumer groups synced by MirrorMaker 2
KAFKA-13659: Stop syncing consumer groups with stale offsets in MirrorMaker 2
KAFKA-12566: Fix flaky MirrorMaker 2 integration tests

Reviewers: Chris Egerton <chrise@aiven.io>
Hi @gharris1727, if I understand this PR correctly, this will (almost always) cause duplication of data. Our problem is this: …

In step 3, these offsets will be different: specifically, the offset from the old cluster will be the last message the service managed to commit an offset for, and the new topic will have as its offset the value …. Thus, when we restart the service (and make it consume from the new topic), it will re-process all messages from …. This behaviour was verified by looking at the output of the ….
Hi @gkousouris Thanks for asking!
Your understanding of the offset translation (post-KAFKA-12468) is correct, and I would expect re-processing of messages downstream after a fail-over.

I also understand that this doesn't satisfy "exactly once semantics" for some definition, because it allows for re-delivery of the same message to the same "application" when that application uses multiple Kafka clusters. MirrorMaker2 currently provides "exactly once semantics" for replicating data, but not for offsets. I believe this is captured by the "MirrorSourceConnector" declaring that it supports EOS while the "MirrorCheckpointConnector" does not. This means that when you replicate a topic with EOS mode and use read_committed on the downstream topic from the beginning, EOS would mean that you read each record present in the upstream topic exactly once. When you instead start reading at the committed downstream offset, you may have records delivered downstream that have already been committed upstream. This is not just caused by the offset translation that this PR implements; it's a limitation of the asynchronous offset translation that MirrorMaker2 uses. Consider this sequence:

1. An application consumes and commits offset X on the upstream cluster.
2. MirrorMaker2 syncs the downstream consumer group based on an earlier checkpoint, corresponding to some upstream offset Y < X.
3. The application fails over to the downstream cluster and resumes from the translated offset, re-reading the records between Y and X that were already processed upstream.
Thanks for doing your due diligence on the claims of "exactly once semantics", and I hope that you can still make MirrorMaker2 work for your use-case. I suspect that EOS semantics across multiple Kafka clusters is a much larger effort than just changing the offset translation logic :) If you have a Jira account, please consider opening a ticket about this shortcoming. Thanks!
Thanks a lot for your reply! I will look into creating a Jira account and creating a ticket for this. I should have mentioned that we were planning on using MirrorMaker to migrate the topic a service reads from, from cluster A to cluster B. So we would not be limited by the asynchronous offset translation that MirrorMaker uses, since we would be: …

We would have hoped that the offset would be translated exactly at some point, which would let us seamlessly start consuming from the same point at which the service was last stopped. MirrorMaker seems like a great fit for our use-case, but this might be a bit of a blocker. Using the old offset translation version before this PR could perhaps work if we were to disable EOS (to get rid of the transactional messages). Otherwise, the only solution I can think of is the hacky approach of reading the offsets and trying to decipher what message to read on the application side, which seems brittle. Would you perhaps recommend a different approach to avoid re-processing a message twice?
@gkousouris Thanks for sharing your use-case. I think you are right to look towards MM2 for this sort of translation, and I think it's unfortunate that it isn't straightforward. The current offset translation doesn't "converge" for consumer groups which are inactive (due to memory limitations in the OffsetSyncStore), but for a single-shot migration use-case, that's not good enough. Are you able to stop the producers to the upstream topic and let the consumers commit offsets at the end of the topic before performing the migration? If you set offset.lag.max very low, MM2 should be able to translate offsets at the end of the topic.
Yeah, if you want to get a 100% precise translation away from the end of the topic and don't want to modify MM2, you're going to need to "synchronize" the two topics and figure out which messages line up. Between offset.lag.max, the syncs topic throttle semaphore, and the OffsetSyncStore, a lot of intermediate offset syncs get discarded and the precision of the translation decreases significantly. If you let MirrorMaker2 perform a rough translation that you later refine with a custom script, you probably only need to compare a few hundred record checksums for each topic-partition-consumer-group; this would also allow you to compensate for the skipped offsets that EOS mode produces. I think you could make such a script reliable enough for a one-off migration, with some manual spot-checking to make sure it doesn't do anything too incorrect (see the sketch below).

If you're willing to hack on the MirrorMaker connectors, you could disable the throttling semaphore and the offset.lag.max parameter, and implement a full-depth OffsetSyncStore to get perfect translation. I don't think we could add those to mainline MM2 without a configuration, but you are certainly welcome to temporarily fork MM2 to get the job done.
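A sketch of that refinement idea (entirely hypothetical: plain consumers, `OffsetRefiner` and `refine` are made-up names, and it assumes replication preserved record bytes and that values are unique enough within the scan window):

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

// Hypothetical one-off migration helper: given MM2's rough downstream offset,
// scan forward for the downstream record whose bytes match the upstream
// record at the group's committed offset.
final class OffsetRefiner {

    static long refine(Consumer<byte[], byte[]> upstream, TopicPartition upstreamPartition, long committedOffset,
                       Consumer<byte[], byte[]> downstream, TopicPartition downstreamPartition, long roughOffset) {
        byte[] target = valueAt(upstream, upstreamPartition, committedOffset);
        downstream.assign(List.of(downstreamPartition));
        // Start a little before the rough guess; skipped offsets (e.g. from
        // transaction markers) mean the true position may be earlier.
        downstream.seek(downstreamPartition, Math.max(0, roughOffset - 1000));
        while (true) { // a real script would bound this scan and fail loudly
            for (ConsumerRecord<byte[], byte[]> record : downstream.poll(Duration.ofSeconds(1))) {
                if (Arrays.equals(record.value(), target)) {
                    return record.offset(); // commit this downstream offset for the group
                }
            }
        }
    }

    private static byte[] valueAt(Consumer<byte[], byte[]> consumer, TopicPartition partition, long offset) {
        consumer.assign(List.of(partition));
        consumer.seek(partition, offset);
        while (true) {
            for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                return record.value();
            }
        }
    }
}
```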
The primary issue being addressed here is the incorrect translation of offsets, the title issue KAFKA-12468.
Additionally, this PR addresses KAFKA-13659 by preventing restarted MirrorCheckpointTasks from emitting checkpoints before they have read to the end of the offset syncs topic.
This PR also stabilizes the MirrorMaker2 integration tests, which were too flaky to properly verify this fix.
The MM2 KIP does not discuss the offset translation mechanism in detail, so I'll summarize the mechanism as it currently exists on trunk:

1. MirrorSourceTask replicates records from an upstream topic and periodically emits offset syncs, each pairing the upstream offset of a replicated record with its downstream offset.
2. MirrorCheckpointTask reads the offset syncs topic and keeps the latest sync for each replicated topic-partition in an OffsetSyncStore.
3. MirrorCheckpointTask periodically reads the committed offsets of consumer groups on the upstream cluster.
4. For each group, it looks up the latest offset sync for the topic-partition.
5. It translates the group's upstream offset to (sync downstream offset) + (group upstream offset - sync upstream offset), and writes the result as a checkpoint for the downstream group.

Step (5) is correct when assuming that every offset from the source topic has already been reproduced in the downstream topic. However, this assumption is violated when offsets are not present, which can happen for a variety of reasons, including:

1. The upstream topic contains transaction markers, which occupy offsets that are never replicated.
2. The upstream topic has been compacted, so some offsets no longer correspond to records.
3. Records are dropped by an SMT and never written downstream.
4. Replication is lagging, so records after the latest offset sync have not yet been written to the downstream topic.
In any of these conditions, an upstream offset may be translated to a downstream offset which is beyond the corresponding record in the downstream topic. Consider the following concrete example of situation (4) resulting in negative lag:

1. The upstream topic `A` has 1000 records, all with contiguous offsets.
2. Consumer group `cg` is at the end of the log, offset 1000.
3. MM2 replicates part of the topic to `target.A`, and writes offset-syncs correlating (`A`, 500) with (`target.A`, 500).
4. MM2 reads `cg` offset 1000, translates the offset to 500 + (1000-500) = 1000, and writes to `target.cg`.
5. Someone monitors the `target.cg` offset for `target.A` and observes that the group offset is 1000, the topic end offset is 500, and the lag is -500.

And the following concrete example of situation (1) resulting in undelivered data:
1. The upstream topic `A` has 1000 records, all emitted with a transactional producer.
2. Consumer group `cg` is in the middle of the topic, at offset 1000.
3. MM2 replicates the topic to `target.A`, and writes offset-syncs correlating (`A`, 500) with (`target.A`, 250), in addition to other offset-syncs.
4. MM2 reads `cg` offset 1000, translates the offset to 250 + (1000 - 500) = 750, and writes to `target.cg`.
5. A failover migrates `cg` to `target.cg`, and someone notices that the `cg` application read records 0-500, the `target.cg` application read 750-1000, but no consumer ever received offsets 500-750.

This PR adds a test that replicates transactional data, as in situation (1). It asserts that whenever an offset is translated, it does not pass the end of the downstream topic, and cannot cause negative lag (the shape of this assertion is sketched below). In addition, the tests are strengthened to require the offset syncs to be emitted up to the end of the topic, requiring a fix for the offset-syncs topic starvation issue. This also exposed a number of mistakes and flakiness in the existing tests, so this PR also stabilizes the tests to make them useful for validating the negative offsets fix.
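The shape of that assertion, roughly (a hypothetical helper, not the actual code in the integration tests):

```java
final class TranslationInvariant {

    // Hypothetical shape of the invariant the new test enforces: a translated
    // consumer group offset must never point past the downstream log end
    // offset, since that shows up as negative lag and can skip records on
    // failover.
    static void assertTranslatedOffsetWithinLog(long translatedOffset, long downstreamLogEndOffset) {
        if (translatedOffset > downstreamLogEndOffset) {
            throw new AssertionError("Translated offset " + translatedOffset
                    + " is past the downstream log end offset " + downstreamLogEndOffset
                    + " (lag would be " + (downstreamLogEndOffset - translatedOffset) + ")");
        }
    }
}
```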