Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-4754] [checkpoints] Make number of retained checkpoints user configurable #3374

Closed
wants to merge 2 commits into from

Conversation

tony810430
Copy link
Contributor

I add CheckpointConfig.setMaxNumberOfCheckpointsToRetain to expose user the configuration for number of retained checkpoints, and update the constructor of JobSnapshottingSettings to pass the value to CheckpointRecoveryFactory.

However, I didn't make this value lazily in the checkpoint store implementations.
It is useful to make it lazily if there is a need to reconfigure it during job is running, but I think that should be another issue.

@StephanEwen
Copy link
Contributor

Hi @tony810430 thank you for the pull request!
The code looks good.

My feeling is, though, that the number of checkpoints to retain is something that we want rather in the configuration of the JobManager, than in the programs snapshot settings.

Think of it like that: There are often two roles, developer and ops.

  • The developer writes the streaming program, and sets the values for the snapshot settings, like what checkpoint interval would work well for the application etc.
  • The ops person writes the cluster's config and is concerned with running the job reliably.

Having multiple retained checkpoints is something that concerns more the ops person.

What do you think?

@tony810430
Copy link
Contributor Author

Hi @StephanEwen thanks for your feedback.

I totally agree your opinion. I will make this setting be configured in FlinkConfiguration and be passed to CheckpointRecoveryFactory.

Besides, I have some questions for the following implementations.

  • Should the verification of this setting throw exception or make it be the default value when the setting is set with illegal value like -1.
  • Where should I verify this setting If we need to throw the exception? I couldn't find the validation util for FlinkConfiguration, so I don't know where I should place my code.
  • Should I remain the setting to provide flexibility for developers or just make it be the ops' responsibility only?

Looking forward to having your opinion. Thank you.

@StephanEwen
Copy link
Contributor

I think having it only in the configuration is probably fine. I think we do not need both paths here.
Illegal values are probably checked when creating the recovery factory.

It would be nice to have a "configuration validator" early, but we do not have something like that currently.

@tony810430
Copy link
Contributor Author

Hi @StephanEwen

Thanks for your comment and I make some change on this PR. I would appreciate it if you have time to review it.

@StephanEwen
Copy link
Contributor

Looks good in general. I would suggest two improvements:

  1. Migrate the config parameter to CoreOptions
  2. Add a test

@tony810430
Copy link
Contributor Author

Hi @StephanEwen

Thank you for the review. I have rebased and done those improvements in your suggestion.

@StephanEwen
Copy link
Contributor

Looks good, thanks.
Merging this...

@StephanEwen
Copy link
Contributor

@tony810430 I am adding small followups. Most notably, I renamed the config parameter to state.checkpoints.num-retained. In my experience, shorter parameter names are better (less typos, easier to remember for users, etc...)

Please comment if you have objections.

@tony810430
Copy link
Contributor Author

@StephanEwen I think state.checkpoints.max-num-retained is more clear to me.

@tony810430
Copy link
Contributor Author

@StephanEwen state.checkpoints.num-retained is also good to me. Choose what you think is proper to you. Thank you.

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Mar 16, 2017
@asfgit asfgit closed this in b46f5e0 Mar 16, 2017
@tony810430 tony810430 deleted the FLINK-4754 branch March 17, 2017 08:44
stefanobortoli pushed a commit to huawei-flink/flink that referenced this pull request Mar 20, 2017
p16i pushed a commit to p16i/flink that referenced this pull request Apr 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants