-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-13884] Set default delay for restart strategies to 1s #9637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 64fbd2a (Wed Oct 16 08:15:29 UTC 2019) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
| */ | ||
| public class NoOrFixedIfCheckpointingEnabledRestartStrategyFactory extends RestartStrategyFactory { | ||
| private static final long DEFAULT_RESTART_DELAY = 0; | ||
| private static final long DEFAULT_RESTART_DELAY = Duration.ofSeconds(1L).toMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make this dependent on the default delay for the FixedDelayStrategy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to parse the string into a long. I would like to avoid to do this in a static block because if this fails, then the JVM process will fail to start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we could parse it when creating the NoOrFixedDelayRestartStrategy in the create method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you ultimately decide against it? @tillrohrmann
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can change it once we have replaced Scala Duration parsing logic. I didn't want to further spread the usage of the Scala classes in the flink-runtime module.
| config.setInteger(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_MAX_FAILURES_PER_INTERVAL, 1); | ||
| config.setString(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL, "1 second"); | ||
| config.setString(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_DELAY, "100 ms"); | ||
| config.setString(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_DELAY, "0 s"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we reducing the delay here? This doesn't seem strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true. It speeds up the test though. I'll add it as a separate hotfix commit.
a4d0425 to
6e6addf
Compare
|
I've updated this PR to update the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good news that the Flink framework now can get rid of scala Duration.
Maybe we should open a JIRA for duration parsing refactoring?
This seems not to a very small change and it's better other developers can be aware of this change.
I also find some previous parsing logics are not refactored( by search scala.concurrent.duration.Duration in java code), including
{MesosEntrypointUtils, TaskManagerConfiguration, TableConfigUtils}.
(ProcessShutDownThread does not parse duration strings but it is referencing the scala Duration. Since it is not used anymore, I opened a issue FLINK-14000 to remove it.)
| .text( | ||
| "Time interval for measuring failure rate if %s has been set to %s. " + | ||
| "It can be specified using Scala's %s notation: \"1 min\", \"20 s\"", | ||
| "It can be specified using the notation: \"1 min\", \"20 s\", \"100 ms\"", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change fewer time units are supported.
Previously supported time units are
{
DAYS -> "d day",
HOURS -> "h hour",
MINUTES -> "min minute",
SECONDS -> "s sec second",
MILLISECONDS -> "ms milli millisecond",
MICROSECONDS -> "µs micro microsecond",
NANOSECONDS -> "ns nano nanosecond"
}
Current TimeUtils only supports "h", "min", "s" and "ms".
To avoid break existing job, we may need to support those time units supported by scala FiniteDuration.
And maybe we also need a doc to specify supported time units and link this description to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a valid point. Though, for the affected durations in this commits it probably does not make sense to specify these in days, microseconds, or nanoseconds. I would say that the changes in this commit are out of scope for a hotfix commit. Strictly speaking the changes in this commit even break backwards compatibility.
Edit: Now I understand the full extent of the issue that @zhuzhurk raised. There are many more notations that we do not support anymore (not only units). Scala's Duration also supported specifications such as "5 minutes", which TimeUtils does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll revert these commits. Somehow the urge to get rid of Scala dependencies just overcame me....
| public static FailureRateRestartStrategyFactory createFactory(Configuration configuration) throws Exception { | ||
| int maxFailuresPerInterval = configuration.getInteger(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_MAX_FAILURES_PER_INTERVAL); | ||
| String failuresIntervalString = configuration.getString(RestartStrategyOptions.RESTART_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL); | ||
| String timeoutString = configuration.getString(AkkaOptions.WATCH_HEARTBEAT_INTERVAL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not proper to change the default value of restart delay in the "Factor duration parsing..." commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid point. What is the justification for not falling back to WATCH_HEARTBEAT_INTERVAL anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change should not be part of the parsing commit. I'll change it. WATCH_HEARTBEAT_INTERVAL is no longer used and will be deprecated in #9553. Moreover, I think it is better to separate configurations and try to make things as explicit as possible without much magic happening underneath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
All Travis stages are currently failing. Example: |
6e6addf to
9b959c0
Compare
|
The build breaks due to the default values of for "restart-strategy.failure-rate.delay" and "restart-strategy.fixed-delay.delay" are not updated in generated docs. I think we can merge this PR once the docs are regenerated. |
…e to 0 s This speeds up the SimpleRecoveryFailureRateStrategyITBase.
This commit sets the default delay for all restart strategies to 1s in order to prevent restart storms from happening.
9b959c0 to
64fbd2a
Compare
|
Thanks for the pointer @zhuzhurk. I've regenerated the documentation and pushed an update. |
|
Thanks for the review everyone. Merging this PR now. |
What is the purpose of the change
This PR is based on #9562.
This commit sets the default delay for all restart strategies to 1s in order
to prevent restart storms from happening.
cc @zentol @zhuzhurk
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes)Documentation