New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort #20233
[FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort #20233
Conversation
9b618c5
to
3f65f08
Compare
...src/test/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorTest.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
Show resolved
Hide resolved
e6483d7
to
8eaa749
Compare
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, please see a couple of my comments.
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...me/src/test/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImplTest.java
Outdated
Show resolved
Hide resolved
6fb5bec
to
b1562ef
Compare
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...me/src/test/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImplTest.java
Outdated
Show resolved
Hide resolved
...me/src/test/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImplTest.java
Outdated
Show resolved
Hide resolved
Thanks @1996fanrui, I think the code looks mostly good. I've started a benchmark request to check the impact of replacing http://codespeed.dak8s.net:8080/job/flink-benchmark-request/171/ Also FYI @rkhachatryan, I've already reviewed this code, but you might be interested in this issue. |
b1562ef
to
8e1f7a6
Compare
Checkpoint is a relatively low-frequency behavior. My understanding is that this change should have little or no impact on data processing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpoint is a relatively low-frequency behavior. My understanding is that this change should have little or no impact on data processing.
Checkpointing yes, but you have added isCheckpointSubsumedOrAborted
call for every addInputData()
call for example, which might cause some performance regression
I've checked the benchmark results and didn't notice anything suspicious, however our benchmark coverage around checkpointing is pretty scarce. I have left a couple of more comments however.
...untime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
Outdated
Show resolved
Hide resolved
...me/src/test/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImplTest.java
Outdated
Show resolved
Hide resolved
8e1f7a6
to
bfeff4b
Compare
bfeff4b
to
bb204e9
Compare
Hi @pnowojski , I updated this PR based on the first version. Please help take a look in your free time, thanks a lot~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @1996fanrui for the update. I think this version looks much simpler compared to the previous. And sorry again for the detour.
I've left a couple of smaller comments.
...src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
Show resolved
Hide resolved
...src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
Outdated
Show resolved
Hide resolved
bb204e9
to
46c9b8d
Compare
Hi @pnowojski , thanks for your review, I have addressed all comments. |
Thanks @1996fanrui for the update LGTM % last final minor issue that I think came back after reverting to the earlier version: |
46c9b8d
to
1b6f5c7
Compare
...src/test/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorTest.java
Outdated
Show resolved
Hide resolved
…ot fail after checkpoint abort
1b6f5c7
to
52f3fdb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's wait for green azure
What is the purpose of the change
Simplify the ChannelStateWriterImpl due to concurrent unaligned checkpoint isn't supported.
Brief change log
ChannelStateWriter#start
.Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation