Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-12508: Disable KIP-557 #10397

Merged
merged 1 commit into from
Mar 25, 2021
Merged

Conversation

vvcephei
Copy link
Contributor

A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.
@vvcephei
Copy link
Contributor Author

Hey @ableegoldman @cadonna, can you spare a quick review here?

All I did was revert the KTableSourceProcessor to its state before the KIP-557 implementation. I just ignored the unit test, and reverted the FK Join test (it had become possible to simplify it after KIP-557).

Copy link
Contributor

@ableegoldman ableegoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although it may be good to consider replacing the ignored testKTableSourceEmitOnChange with a test that verifies emit-on-change is not enabled.

Also, I believe KIP-557 applied emit-on-change semantics to a few other operators. I don't remember them all of the top of my head but can you confirm that none of the others are impacted by the reported bugs?

@vvcephei
Copy link
Contributor Author

Thanks, @ableegoldman !

As far as I know this was the only operator we updated to KIP-557.

This is the only commit I find referencing the KIP: f54cece

@ableegoldman
Copy link
Contributor

Ah ok, I saw the KIP had included "any aggregation or repartition operation" in addition to the source KTable but I guess this just hasn't been implemented yet. Looks like Richard never got around to making sub-tasks for the other operators so I filed tickets and linked them all in the KIP so it's easier to understand what is/isn't in which versions.

Thanks for doing this PR!

@vvcephei
Copy link
Contributor Author

Woah, thanks for that, @ableegoldman !

vvcephei pushed a commit to vvcephei/kafka that referenced this pull request Mar 25, 2021
While running tests for apache#10397, I got a test timeout under Java 8.

I ran it locally via `./gradlew clean -PscalaVersion=2.12 :clients:unitTest --profile --no-daemon --continue -PtestLoggingEvents=started,passed,skipped,failed -PignoreFailures=true -PmaxTestRetries=1 -PmaxTestRetryFailures=5` (copied from the Jenkins log) and was able to determine that the hanging test is:

org.apache.kafka.clients.admin.KafkaAdminClientTest#testClientSideTimeoutAfterFailureToReceiveResponse

It's odd, but it hangs most times on my branch, and I haven't seen it hang on trunk, despite the fact that my PR doesn't touch the client or core code at all.

Some debugging reveals that when the client is hanging, it's because the listTopics request is still sitting in its pendingRequests queue, and if I understand the test setup correctly, it would never be completed, since we will never advance time or queue up a metadata response for it.

I figure a reasonable blanket response to this is just to make sure that the test harness will close the admin client eagerly instead of lazily.
@vvcephei
Copy link
Contributor Author

I looked into the Java 8 timeout. It should not be related, as it's caused by the KafkaAdminClientTest hanging.

I submitted #10404 to fix that test.

@ableegoldman
Copy link
Contributor

Nice test debugging 😄

Just to confirm, you're going to backport this to 2.7 & 2.6 right?

@vvcephei
Copy link
Contributor Author

Thanks, @ableegoldman !

Yes, I plan to do it right now.

@vvcephei vvcephei merged commit 9ef52dd into apache:trunk Mar 25, 2021
@vvcephei vvcephei deleted the disable-kip-557 branch March 25, 2021 19:42
vvcephei added a commit that referenced this pull request Mar 25, 2021
A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
vvcephei added a commit that referenced this pull request Mar 25, 2021
A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
vvcephei added a commit that referenced this pull request Mar 25, 2021
A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
lct45 pushed a commit to confluentinc/kafka that referenced this pull request Mar 27, 2021
A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
Terrdi pushed a commit to Terrdi/kafka that referenced this pull request Apr 1, 2021
A major issue has been raised that this implementation of
emit-on-change is vulnerable to a number of data-loss bugs
in the presence of recovery with dirty state under at-least-once
semantics. This should be fixed in the future when we implement
a way to avoid or clean up the dirty state under at-least-once,
at which point it will be safe to re-introduce KIP-557 and
complete it.

Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants