Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-32895][Scheduler] Introduce the max attempts for Exponential Delay Restart Strategy #23247

Merged

Conversation

1996fanrui
Copy link
Member

@1996fanrui 1996fanrui commented Aug 21, 2023

What is the purpose of the change

See FLIP-364: Improve the exponential-delay restart-strategy . This PR includes the subtask2 of FLIP-364.

Brief change log

  • [FLINK-32895][Scheduler][hotfix] Generate latest strategy string for ExponentialDelayRestartBackoffTimeStrategy due to some information will be updated, such as: currentBackoffMS and lastFailureTimestamp
  • [FLINK-32895][Scheduler] Introduce the max attempts for Exponential Delay Restart Strategy
    • Add the option: restart-strategy.exponential-delay.attempts-before-reset-backoff
  • [FLINK-32895][Scheduler][refactor] Remove some unused throws

Verifying this change

  • Improve some tests in ExponentialDelayRestartBackoffTimeStrategyTest.
  • Change the testAlwaysRestart to testMaxAttempts in ExponentialDelayRestartBackoffTimeStrategyTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? Java doc and config doc are updated, and updated a little official web doc related to restart-strategy.exponential-delay.attempts-before-reset-backoff.

@flinkbot
Copy link
Collaborator

flinkbot commented Aug 21, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch 7 times, most recently from 7272d54 to 92401d9 Compare December 4, 2023 09:12
@1996fanrui 1996fanrui marked this pull request as ready for review December 4, 2023 09:19
@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch from 92401d9 to c0de3c8 Compare December 4, 2023 11:22
Copy link
Member Author

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @RocMarshal @qinf , would you mind helping take a look this PR in your free time? thanks~

Copy link
Contributor

@qinf qinf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1996fanrui Thank you for your PR. I have a few comments, please take a look.

final ExponentialDelayRestartBackoffTimeStrategy restartStrategy =
new ExponentialDelayRestartBackoffTimeStrategy(
new ManualClock(), 1L, 3L, 2.0, 4L, 0.25);
new ManualClock(), 1L, 3L, 2.0, 4L, 0.25, maxAttempts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of ConfigOption restart-strategy.exponential-delay.backoff-multiplier has been set to 1.2. Is it necessary to update backoffMultiplier to 1.2?

final long initialBackoffMS = 1L;
final long resetBackoffThresholdMS = 8L;
final ManualClock clock = new ManualClock();

final ExponentialDelayRestartBackoffTimeStrategy restartStrategy =
new ExponentialDelayRestartBackoffTimeStrategy(
clock, initialBackoffMS, 5L, 2.0, resetBackoffThresholdMS, 0.25);
clock, initialBackoffMS, 5L, 2.0, resetBackoffThresholdMS, 0.25, 100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help explain why set the attemptsBeforeResetBackoff to 100 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have strong reason, I just set it to a big value to ensure this test doesn't reach this limitation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update attemptsBeforeResetBackoff to Integer.MAX_VALUE for all tests not related to maxAttempts.

Integer.MAX_VALUE makes it easier for other developers to know that attemptsBeforeResetBackoff does not take effect.

Comment on lines +119 to +120
exponentialDelayConfig.getJitterFactor(),
RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_ATTEMPTS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ExponentialDelayRestartStrategyConfiguration contains other parameters needed to create ExponentialDelayRestartBackoffTimeStrategyFactory. Is it necessary to add a parameter attemptsBeforeResetBackoff to ExponentialDelayRestartStrategyConfiguration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do it because ExponentialDelayRestartStrategyConfiguration is deprecated since 1.19. The RestartStrategies has been deprecated, and ExponentialDelayRestartStrategyConfiguration is the inner class of it.

So the new option won't be supportted in the deprecated api.

@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch from c0de3c8 to 5374606 Compare December 7, 2023 05:19
Copy link
Member Author

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zhuzhurk , would you mind helping double check this PR in your free time? Thanks~

This PR is the subtask1 and subtask2 of improving FLIP-364: Exponential Delay Restart Strategy.

Copy link
Contributor

@RocMarshal RocMarshal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @1996fanrui for the contribution~
LGTM +1.

Description.builder()
.text(
"The number of times that Flink retries the execution before the job is declared as failed "
+ "before reset the backoff to its initial value if %s has been set to %s.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset -> resetting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RocMarshal for the review! Updated~

@@ -184,7 +184,7 @@ public class RestartStrategyOptions {
public static final ConfigOption<Double> RESTART_STRATEGY_EXPONENTIAL_DELAY_BACKOFF_MULTIPLIER =
ConfigOptions.key("restart-strategy.exponential-delay.backoff-multiplier")
.doubleType()
.defaultValue(2.0)
.defaultValue(1.2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we set this to 1.5 to make it a bit less aggressive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mxm for the feedback!

1.5 is fine for me, and I'd like to cc @zhuzhurk who propose change these 2 default value as well!

Max and Mason have a little feedback after voting in the user mail list, and Max and I had a offline discussion yesterday. Max think the 1.2 is a little small or aggressive(delay time is too short). Here is the reason:

  • Every time the job restarts, it will make a bunch of calls to the Kubernetes API, e.g. read/write to config maps, create task managers.
  • When his producation had the default fixed-delay(1s) restart strategy turned on. A Kubernetes cluster became instable.

Following is the relationship between restart-attempts and retry-delay-time:

  • The delay-time will reach 1 min after 12 attempts when backoff-multiplier is 1.5
  • The delay-time will reach 1 min after 24 attempts when backoff-multiplier is 1.2

Hey @zhuzhurk , what do you think about setting1.5 as the default value? If you agree it, I will update the FLIP and feedback it to the user mail list, and go ahead this PR.

image

Copy link
Contributor

@zhuzhurk zhuzhurk Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.5 sounds good to me. Yet I think we need to send the updates to the FLIP discussion ML and open another vote for it, in case it turned out to be a surprise for those who have reviewed and voted on the FLIP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuzhurk for the quick feedback!

I have updated it in the dev&user mail list. Let us wait for more feedback for a while. Also, I pause FLINK-33736 first. Therefore, I will remove the first commit from this PR and continue working on the rest of the JIRA for FLIP-364.

[1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value related commit is migrated to a new separate PR, we can follow it there, thanks~

Also, if this PR(FLINK-32895) no other comments within 72 hours, I will merge it, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for considering our suggestion!

@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch from d6bdb4a to 3189fc0 Compare December 12, 2023 03:41
…g for ExponentialDelayRestartBackoffTimeStrategy due to some information will be updated
@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch from 3189fc0 to f91f422 Compare December 12, 2023 07:30
Copy link
Contributor

@zhuzhurk zhuzhurk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR! @1996fanrui
I just have one minor comment for it.

Description.builder()
.text(
"The number of times that Flink retries the execution before the job is declared as failed "
+ "before resetting the backoff to its initial value if %s has been set to %s.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe separate the long statement to make it easier for reading? e.g.

The number of times that Flink retries the execution before failing the job if %s has been set to %s. The number will be reset once the backoff is reset to its initial value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuzhurk for the great suggestion! It's easier to understand~

I updated the original commit directly due to this change is easy enough, and the change can be viewed by this link[1].

[1] https://github.com/apache/flink/compare/f91f422f3350b622715d140a1255746aafb6a1e7..08efd3f4e062490ac2431408bcaf3d079cf1e6bb

@1996fanrui 1996fanrui force-pushed the 32895/exponnential-restart-max-attemepts branch from f91f422 to 08efd3f Compare December 12, 2023 09:23
Copy link
Contributor

@zhuzhurk zhuzhurk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@1996fanrui
Copy link
Member Author

Thanks everyone for the review, merging~

@1996fanrui 1996fanrui merged commit 3d4d396 into apache:master Dec 15, 2023
@1996fanrui 1996fanrui deleted the 32895/exponnential-restart-max-attemepts branch December 15, 2023 04:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants