Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-34210][Runtime/Checkpointing] Fix DefaultExecutionGraphBuilder#isCheckpointingEnabled return wrong value when checkpoint disabled #24173

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mayuehappy
Copy link
Contributor

…#isCheckpointingEnabled return wrong value when checkpoint disabled

What is the purpose of the change

Fix DefaultExecutionGraphBuilder#isCheckpointingEnabled return wrong value when checkpoint disabled

Brief change log

reusing the result of jobGraph.isCheckpointingEnabled() when Calling DefaultExecutionGraphBuilder#isCheckpointingEnabled

Verifying this change

This change is a trivial rework

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: ( no )
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: ( no )

Documentation

  • Does this pull request introduce a new feature? (no)

@mayuehappy mayuehappy force-pushed the fix-enable-checkpoint branch 2 times, most recently from 528f202 to d8ac577 Compare January 23, 2024 09:58
@flinkbot
Copy link
Collaborator

flinkbot commented Jan 23, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@masteryhx masteryhx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr.
LGTM.

@masteryhx
Copy link
Contributor

@flinkbot run azure

@masteryhx
Copy link
Contributor

@flinkbot run azure

…#isCheckpointingEnabled return wrong value when checkpoint disabled
@mayuehappy
Copy link
Contributor Author

@flinkbot run azure

@mayuehappy
Copy link
Contributor Author

@masteryhx
I found some unit tests failed after change. https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56811&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8
That's because some test set the CheckpointInterval as Long.MAX_VALUE but as expect the Checkpoint can trigger Normally .I will add another fix commit

…eckpointCoordinatorConfiguration.DISABLED_CHECKPOINT_INTERVAL when enableCheckpointing
@mayuehappy
Copy link
Contributor Author

@flinkbot run azure

@JunRuiLee
Copy link
Contributor

@mayuehappy I noticed that the creation of the CheckpointCoordinator currently depends on the isCheckpointingEnabled condition. This implies that if we disable checkpointing, the entire coordinator won't get created. Are we certain this won't pose a problem? For instance, could this also prevent triggering savepoints?

cc @masteryhx

Copy link
Contributor

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the current status of this PR is approved, but the PR skips the creation of the checkpoint coordinator when disabling the checkpoint scheduling, we might need some double confirmation here. Therefore, I'm changing the status to request changes before further clarification.

@mayuehappy
Copy link
Contributor Author

mayuehappy commented Jan 24, 2024

@JunRuiLee Thanks for joining the disscussion .

the PR skips the creation of the checkpoint coordinator when disabling the checkpoint scheduling

Yes, actually I think that's also why we made this change. Because Flink does not require the creation of components such as CheckpointCoordinator or CheckpointIDCounter in many scenarios, such as OLAP or Flink Batch. In these scenarios, I believe it is reasonable not to create a CheckpointCoordinator.

For instance, could this also prevent triggering savepoints?

Yes. If disabling Checkpoint, the savepoint cannot be triggered after Change. But I think this issue is indeed worth discussing. In my opinion, if the user disable Checkpoint, does it mean that the user can accept state loss. I can't imagine a scenario where users don't need Checkpoint but need Savepoint. So i thinks it's fine here or we need notice the user savepoint can only be triggered while Checkpoint is enabled .
WDYT

@WencongLiu
Copy link
Contributor

WencongLiu commented Jan 24, 2024

@mayuehappy The key point is that if the users set the job with streaming execution mode and set the checkpoint interval with Long.MAX_VALUE or a small value less than 10, the savepoint will be not supported at all.
image

@JunRuiLee
Copy link
Contributor

@JunRuiLee Thanks for joining the disscussion .

the PR skips the creation of the checkpoint coordinator when disabling the checkpoint scheduling

Yes, actually I think that's also why we made this change. Because Flink does not require the creation of components such as CheckpointCoordinator or CheckpointIDCounter in many scenarios, such as OLAP or Flink Batch. In these scenarios, I believe it is reasonable not to create a CheckpointCoordinator.

For instance, could this also prevent triggering savepoints?

Yes. If disabling Checkpoint, the savepoint cannot be triggered after Change. But I think this issue is indeed worth discussing. In my opinion, if the user disable Checkpoint, does it mean that the user can accept state loss. I can't imagine a scenario where users don't need Checkpoint but need Savepoint. So i thinks it's fine here or we need notice the user savepoint can only be triggered while Checkpoint is enabled . WDYT

In my opinion, disabling checkpoints does not necessarily equate to disabling savepoints or other functionalities of the Checkpoint Coordinator, such as restoring from state. Some users may prefer to avoid the automatic triggering of checkpoints, yet still retain the option to initiate savepoints manually when needed.

@mayuehappy
Copy link
Contributor Author

@JunRuiLee @WencongLiu Thanks for replying

In my opinion, disabling checkpoints does not necessarily equate to disabling savepoints or other functionalities of the Checkpoint Coordinator, such as restoring from state. Some users may prefer to avoid the automatic triggering of checkpoints, yet still retain the option to initiate savepoints manually when needed.

Maybe that makes sense, although I haven't used it this way.
Let me explain again ,my original starting point for creating this PR is that for scenarios that do not require Checkpoint/Savepoint/Restore at all. We can avoid the creation of CheckpointCoordinator or related components, that's all

@WencongLiu
Copy link
Contributor

Thanks @mayuehappy . In adaptive batch scenarios, the creation of a CheckpointCoordinator is definitely not necessary. In other scenarios, we may need to further sort out whether we need the CheckpointCoordinator carefully.

Although Flink's execution modes are currently limited to streaming and batch, the definitions between these two modes may become more blurred in the future with the progress of unified stream and batch processing. In this background, retaining the CheckpointCoordinator at least has no downside.

@masteryhx
Copy link
Contributor

Thanks @JunRuiLee and @WencongLiu pointing out this.
@mayuehappy After disscussed offline with @JunRuiLee , I agree that we could not break the change at least for some potential users relying on this (e.g. triggering savepoint with disabled checkpoint).
Of course, it's still strange that we still need to start some heavyweight components e.g. CheckpointCoordinator even if users want to disable checkpoint completely. And it's confused that we have different judgement about disabling checkpoint in the code path.
So I'd like to hold on this until we found a better solution for this.

@X-czh
Copy link
Contributor

X-czh commented Jan 25, 2024

Not supporting savepoint on checkpoint disabled is indeed a breaking change from my side. Usu. jobs without checkpoints don't need savepoints as well, but users or platforms may already rely on the stop-with-savepoint feature for gracefully stopping a Flink job no matter it enables checkpoint or not. It's worth further discussion, but maybe a FLIP is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants