Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-15202: Fix MM2 offset translation when syncs are variably spaced #14156

Merged

Conversation

gharris1727
Copy link
Contributor

The new OffsetSyncStore historical translation cache clears more syncs than necessary when the gap between syncs is variable. This is a problem with the replacement promotion logic, which only covered the base case when promoting an one index to the immediately following index.

This has the effect that if a sync fails to satisfy an invariant for the following index, then it is discarded immediately, even if the value would satisfy the invariants at a different index. In particular, invariant B which enforces a maximum distance between two syncs gets more permissive as the index in the array increases, so a sync which does not satisfy invariant B at index 1 may satisfy it at index j > 1.

Instead of the greedy discarding algorithm, the replacement promotion logic should keep a separate index into the potential replacements from the original array, and delay discarding a sync until it can be determined that the sync is not valid at any place in the array. In particular, syncs are certainly not worth keeping if they are duplicates of other syncs, or if they fail invariant C. Invariant C becomes more strict as the index in the array increases, so a sync which does not satisfy invariant C at index 1 will certainly never satisfy it at index j > 1.

In order to verify the changes, new tests which use Random to generate gaps between syncs, and generalize the test for maintaining unique syncs to an arbitrary stream of gaps. The different tests cover:

  1. Constant spacing
  2. Uniform random spacing between 0 and N
  3. Uniform random spacing between M and N with M > 0
  4. Bimodal spacing at maxOffsetLag and 2x maxOffsetLag

The new algorithm is an extension of the existing one, so all of the consistent-spacing tests have the exact same behavior.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@monish-byte
Copy link

hey @gharris1727 , can you please guide me how can I start contributing to this project as I am new to open source contribution but I have a strong knowledge of Java. Thank you.

@gharris1727
Copy link
Contributor Author

Hi @monish-byte ! Thanks for considering contributing to Apache Kafka.

You can find the contributing guide here: https://kafka.apache.org/contributing.html
It is a good idea to join the dev mailing list: https://kafka.apache.org/contact.html and ask for access to the JIRA and Confluence. From there, you can ask more questions about contributing, submit your own issues, and work on issues that have already been reported.

I look forward to seeing you on the mailing list!

@monish-byte
Copy link

hey @gharris1727 I have joined the mailing list. I am eager to know what should I do next.

@gharris1727
Copy link
Contributor Author

@monish-byte This specific PR thread isn't the right place to discuss this, so please in the future post your general questions on the dev mailing list.

Your next steps should be:

ask [on the mailing list] for access to the JIRA and Confluence. From there, you can ask more questions [on the mailing list] about contributing, submit your own issues, and work on issues that have already been reported.

Copy link
Contributor

@C0urante C0urante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Greg, the description in the PR and Jira ticket, along with comments in the code changes, all really helped to make this easier to review. I've left some minor suggestions but none that are blocker-worthy; LGTM!

@@ -227,7 +227,7 @@ private void clearSyncArray(OffsetSync[] syncs, OffsetSync offsetSync) {
}
}

private void updateSyncArray(OffsetSync[] syncs, OffsetSync offsetSync) {
private void updateSyncArray(OffsetSync[] syncs, OffsetSync[] original, OffsetSync offsetSync) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: feels a little strange that we're passing in a to-be-mutated sync array. Any reason not to alter updateSyncArray to only take in the original sync array and the new sync, and construct and return the new sync array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is already the signature of the caller, so this would amount to inlining this function into updateExistingSyncs to combine the two.

I think that this function is already large enough, and doesn't also need to be concerned with copying the array or logging the result. I also think it makes batching updates simpler: The caller can manage the lifetimes of the arrays and their mutability, while the inner function can be concerned with just applying the updates.

Also this code was doing that already, here and in the clearSyncArray function.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Signed-off-by: Greg Harris <greg.harris@aiven.io>
@kirktrue
Copy link
Contributor

@gharris1727 Thanks for your work on this! Hopefully it will help KAFKA-15197 too!

@gharris1727
Copy link
Contributor Author

gharris1727 commented Aug 10, 2023

I saw one of the CI builds with testOffsetTranslationBehindReplicationFlow failing, so I think there is still another source of flakiness somewhere, probably #13911. I'll close the fix JIRA but leave the flaky JIRA open.

@gharris1727
Copy link
Contributor Author

The latest CI run has some unrelated flakiness, and tests pass locally.

@gharris1727 gharris1727 merged commit 0ee2664 into apache:trunk Aug 10, 2023
1 check failed
gharris1727 added a commit that referenced this pull request Aug 10, 2023
gharris1727 added a commit that referenced this pull request Aug 10, 2023
gharris1727 added a commit that referenced this pull request Aug 10, 2023
jeqo pushed a commit to aiven/kafka that referenced this pull request Aug 15, 2023
jeqo pushed a commit to jeqo/kafka that referenced this pull request Aug 15, 2023
jeqo pushed a commit to jeqo/kafka that referenced this pull request Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants