Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort #20233

Conversation

1996fanrui
Copy link
Member

@1996fanrui 1996fanrui commented Jul 9, 2022

What is the purpose of the change

Simplify the ChannelStateWriterImpl due to concurrent unaligned checkpoint isn't supported.

Brief change log

  • Simplify the ChannelStateWriterImpl due to concurrent unaligned checkpoint isn't supported.
  • If the checkpointId is aborted, the checkpoint will be ignored within ChannelStateWriter#start.

Verifying this change

This change added tests and can be verified as follows:

  • ChannelStateWriterImplTest#testStartMultipleTimes()
  • ChannelStateWriterImplTest#testBuffersRecycledOnAbortedCheckpoint()
  • ChannelStateWriterImplTest#testOldCheckpointIsAborted()

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not documented

@flinkbot
Copy link
Collaborator

flinkbot commented Jul 9, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch 2 times, most recently from 9b618c5 to 3f65f08 Compare July 9, 2022 15:53
@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch 4 times, most recently from e6483d7 to 8eaa749 Compare July 17, 2022 13:05
@1996fanrui 1996fanrui changed the title [FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort [FLINK-28474][checkpoint] Simplify the ChannelStateWriterImpl due to concurrent unaligned checkpoint isn't supported Jul 18, 2022
Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, please see a couple of my comments.

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch 4 times, most recently from 6fb5bec to b1562ef Compare July 19, 2022 11:05
@pnowojski
Copy link
Contributor

pnowojski commented Jul 20, 2022

Thanks @1996fanrui, I think the code looks mostly good. I've started a benchmark request to check the impact of replacing ConcurrentHashMap with a lock. I don't expect it to be an issue, but let's check that.

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/171/

Also FYI @rkhachatryan, I've already reviewed this code, but you might be interested in this issue.

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from b1562ef to 8e1f7a6 Compare July 20, 2022 09:41
@1996fanrui
Copy link
Member Author

Thanks @1996fanrui, I think the code looks mostly good. I've started a benchmark request to check the impact of replacing ConcurrentHashMap with a lock. I don't expect it to be an issue, but let's check that.

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/171/

Also FYI @rkhachatryan, I've already reviewed this code, but you might be interested in this issue.

Checkpoint is a relatively low-frequency behavior. My understanding is that this change should have little or no impact on data processing.

Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpoint is a relatively low-frequency behavior. My understanding is that this change should have little or no impact on data processing.

Checkpointing yes, but you have added isCheckpointSubsumedOrAborted call for every addInputData() call for example, which might cause some performance regression

I've checked the benchmark results and didn't notice anything suspicious, however our benchmark coverage around checkpointing is pretty scarce. I have left a couple of more comments however.

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from 8e1f7a6 to bfeff4b Compare July 27, 2022 12:16
@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from bfeff4b to bb204e9 Compare October 5, 2022 09:09
@1996fanrui
Copy link
Member Author

Hi @pnowojski , I updated this PR based on the first version. Please help take a look in your free time, thanks a lot~

@1996fanrui 1996fanrui changed the title [FLINK-28474][checkpoint] Simplify the ChannelStateWriterImpl due to concurrent unaligned checkpoint isn't supported [FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort Oct 5, 2022
Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @1996fanrui for the update. I think this version looks much simpler compared to the previous. And sorry again for the detour.

I've left a couple of smaller comments.

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from bb204e9 to 46c9b8d Compare October 6, 2022 11:24
@1996fanrui
Copy link
Member Author

Thanks @1996fanrui for the update. I think this version looks much simpler compared to the previous. And sorry again for the detour.

I've left a couple of smaller comments.

Hi @pnowojski , thanks for your review, I have addressed all comments.

@pnowojski
Copy link
Contributor

Thanks @1996fanrui for the update LGTM % last final minor issue that I think came back after reverting to the earlier version:
#20233 (comment)

@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from 46c9b8d to 1b6f5c7 Compare October 6, 2022 11:50
@1996fanrui 1996fanrui force-pushed the 28474/channel-satte-result-not-fail-after-abort branch from 1b6f5c7 to 52f3fdb Compare October 6, 2022 13:39
Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for green azure

@pnowojski pnowojski merged commit a611912 into apache:master Oct 7, 2022
@1996fanrui 1996fanrui deleted the 28474/channel-satte-result-not-fail-after-abort branch October 7, 2022 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants