Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-6755][CLI] Support manual checkpoints triggering #23679

Merged
merged 2 commits into from
Nov 29, 2023

Conversation

Zakelly
Copy link
Contributor

@Zakelly Zakelly commented Nov 7, 2023

What is the purpose of the change

Since FLINK-24280, we have supported manual checkpoint triggering from REST API. This PR wire command-line interface and rest endpoint, supporting manual checkpoints triggering from CLI.

Brief change log

  • Add an action "checkpoint" in CLI, and related parameters parsing logic.
  • Wire this action with rest endpoint.
  • Documentation about this feature.

Verifying this change

This change added tests and can be verified as follows:

  • Unit tests for newly added checkpoint cli action
  • Manually verified triggering checkpoint using command line interface on sample wordcout job deployed on a yarn cluster.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 7, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@Zakelly
Copy link
Contributor Author

Zakelly commented Nov 8, 2023

Hi, @pnowojski @masteryhx Would you please take a look? Thanks!

Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Although I'm not the best person to review CLI changes, the change looks mostly good to me. I've left a couple of minor comments.

Comment on lines +171 to +172
If you want to trigger a full checkpoint while the job periodically triggering incremental checkpoints,
please use the `--full` option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rephrase it to sth like:

By default this command triggers the same type of checkpoint that is configured for the given job. If you want override this behaviour, you can force either full or incremental checkpoint to be performed via --full or --incremental flags respectively.
(code example)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some digging, I found when a job is configured to run with rocksdb and incremental checkpoint disabled, the --incremental flag could not take effect. That is because the RocksNativeFullSnapshotStrategy does not take CheckpointOptions into account when taking async snapshots. So since only --full flag works when incremental checkpoint enabled, I only describe this usage here. But I'm ok with your phrasing.

Also, we should reconsider the incremental file-sharing chain in this senario. How does a manually triggered incremental checkpoint reuse the files from previous full checkpoints, which exist in chk-x instead of shared directory? I think we need a more flexible checkpoint directory layout. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh, ok. Then yes, this description works better.

But in that case, what happens when user selects --incremental flag with configured HashMapStateBackend or full RocksDB? In that case the CLI command should ideally fail (exception probably would have to be thrown from somewhere around CheckpointCoordinator?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User selects --incremental flag in this case will trigger a full checkpoint. This behavior align with the REST API. I think a possible way is to remove the --incremental flag (or just remove it from the help messages and documents). After we really support triggering a incremental checkpoint with configured full checkpointing state backend, we add this flag back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, let's remove --incremental flag from the CLI. Adding some runtime check somewhere would make also sense anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll create another ticket adding some runtime check and remove the INCREMENTAL flag from REST API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post a JIRA ticket here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, here: https://issues.apache.org/jira/browse/FLINK-33498 . I will start working on this recently.

Comment on lines +169 to +170
If you want to trigger a full checkpoint while the job periodically triggering incremental checkpoints,
please use the `--full` option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you squash the latest two fixup commits into the first two commits:

[[FLINK-6755] Support manual checkpoints triggering from CLI](https://github.com/apache/flink/pull/23679/commits/c563fd15191137aafc548b0a19fc09a28833aff5)

@[Zakelly](https://github.com/Zakelly/flink/commits?author=Zakelly)
Zakelly committed 2 days ago
[[FLINK-6755] Docs about manual checkpoints triggering from CLI](https://github.com/apache/flink/pull/23679/commits/bd404c8dad2c78dc5106fde39e3f81a689b90c85)

? LGTM otherwise :)

@Zakelly
Copy link
Contributor Author

Zakelly commented Nov 10, 2023

Can you squash the latest two fixup commits into the first two commits:

[[FLINK-6755] Support manual checkpoints triggering from CLI](https://github.com/apache/flink/pull/23679/commits/c563fd15191137aafc548b0a19fc09a28833aff5)

@[Zakelly](https://github.com/Zakelly/flink/commits?author=Zakelly)
Zakelly committed 2 days ago
[[FLINK-6755] Docs about manual checkpoints triggering from CLI](https://github.com/apache/flink/pull/23679/commits/bd404c8dad2c78dc5106fde39e3f81a689b90c85)

? LGTM otherwise :)

Already done. Many thanks!

Copy link
Contributor

@JingGe JingGe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pnowojski pnowojski merged commit 7265438 into apache:master Nov 29, 2023
@pnowojski
Copy link
Contributor

Thanks! Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants