Contributor

@eaglewatcherwb eaglewatcherwb commented Jun 12, 2019

What is the purpose of the change

Implement FixedDelayRestartBackoffTimeStrategy.

Brief change log

  • Implement FixedDelayRestartBackoffTimeStrategy.
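
For readers unfamiliar with the idea, a fixed-delay restart backoff strategy can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name `FixedDelaySketch` and its method names are assumptions chosen for the example.

```java
/**
 * Illustrative sketch only: a hypothetical class, not the code from this PR.
 * It permits a bounded number of restarts with a constant delay between them.
 */
final class FixedDelaySketch {
    private final int maxAttempts;     // maximum number of restarts allowed
    private final long backoffTimeMs;  // constant delay before each restart, in ms
    private int failureCount;          // failures recorded so far

    FixedDelaySketch(int maxAttempts, long backoffTimeMs) {
        if (maxAttempts < 0 || backoffTimeMs < 0) {
            throw new IllegalArgumentException("attempts and backoff time must be non-negative");
        }
        this.maxAttempts = maxAttempts;
        this.backoffTimeMs = backoffTimeMs;
    }

    /** Records one failure; the counter saturates instead of overflowing. */
    void notifyFailure() {
        if (failureCount < Integer.MAX_VALUE) {
            failureCount++;
        }
    }

    /** Returns true while the recorded failures do not exceed the allowed restarts. */
    boolean canRestart() {
        return failureCount <= maxAttempts;
    }

    /** Returns the same delay for every restart. */
    long getBackoffTimeMs() {
        return backoffTimeMs;
    }
}
```

In this sketch, with `maxAttempts = 1` the first failure may still be restarted and the second may not.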

Verifying this change

This change added tests and can be verified as follows:

  • Added unit test in FixedDelayRestartBackoffTimeStrategyTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not documented)

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for opening this PR @eaglewatcherwb. I had three minor comments which we should address before merging this PR.

@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.fixed-delay.backoff-time")
    .defaultValue("1 min")

Contributor

I would be in favor of specifying the backoff time in milliseconds in order to eventually get rid of Scala's FiniteDuration.

Contributor Author

OK

    RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_ATTEMPTS);
this.backoffTimeMS = getIntervalMSFromConfiguration(
    configuration,
    RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME);

Contributor

I would prefer to move the parsing of the configuration out of the factory into the factory method.

Contributor Author

OK
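
One way to read the suggestion above is that the constructor receives values that are already parsed, and a static factory method is the only place that touches the configuration. The following is a rough sketch under that reading, not the PR's code: the class name `FixedDelayStrategySketch`, the method `fromConfiguration`, and the use of a plain `Map` in place of Flink's `Configuration` are all assumptions, and the fallback values are placeholders picked for the example. Only the two configuration keys are taken from the thread.

```java
import java.util.Map;

/** Hypothetical sketch: configuration is parsed in the factory method only. */
final class FixedDelayStrategySketch {
    // Key names as quoted in the review thread.
    static final String ATTEMPTS_KEY = "restart-backoff-time-strategy.fixed-delay.attempts";
    static final String BACKOFF_KEY = "restart-backoff-time-strategy.fixed-delay.backoff-time";

    final int maxAttempts;
    final long backoffTimeMs;

    /** The constructor takes parsed values and knows nothing about configuration. */
    FixedDelayStrategySketch(int maxAttempts, long backoffTimeMs) {
        this.maxAttempts = maxAttempts;
        this.backoffTimeMs = backoffTimeMs;
    }

    /**
     * Reads both settings from the given map (a stand-in for a real
     * configuration object) and builds the strategy. The backoff time is a
     * plain number of milliseconds, so no duration parsing is needed.
     * The fallbacks "1" and "0" are placeholders for this example.
     */
    static FixedDelayStrategySketch fromConfiguration(Map<String, String> config) {
        int attempts = Integer.parseInt(config.getOrDefault(ATTEMPTS_KEY, "1"));
        long backoffMs = Long.parseLong(config.getOrDefault(BACKOFF_KEY, "0"));
        return new FixedDelayStrategySketch(attempts, backoffMs);
    }
}
```

Keeping the constructor free of parsing also makes the class easy to build directly in a unit test.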

private long getIntervalMSFromConfiguration(Configuration configuration, ConfigOption<String> configOption) {
    Duration interval = Duration.apply(configuration.getString(configOption));
    return Time.milliseconds(interval.toMillis()).toMilliseconds();
}

Contributor

Duplicate logic wrt the FailureRateRestartBackoffTimeStrategy

Contributor Author

OK

Contributor

@zhuzhurk zhuzhurk left a comment

Thanks Bo for the PR. I have a few minor comments.

@PublicEvolving
public static final ConfigOption<Integer> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_ATTEMPTS = ConfigOptions
    .key("restart-backoff-time-strategy.fixed-delay.attempts")
    .defaultValue(1)

Contributor

IMO, a large number would be better here, e.g. Integer.MAX_VALUE.
Most of our Flink streaming customers would like the job to retry more times before it fails, especially when the cluster is not very stable.

Contributor Author

The legacy default value of restart-strategy.fixed-delay.attempts is 1, or Integer.MAX_VALUE if activated by checkpointing. Let's make it Integer.MAX_VALUE.

@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.fixed-delay.backoff-time")
    .defaultValue("1 min")

Contributor

1 min seems a bit too long here.
The legacy FixedDelayRestartStrategy takes 0 as the default value.
In our production practice we usually set it to 10s.

Contributor Author

OK, 10s is good.

Contributor

I would actually be in favor of 0 because a delay is not strictly needed anymore with proper support for queued scheduling.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for addressing my comments @eaglewatcherwb and the review @zhuzhurk. I had two last comments and then we can merge this PR.

<td><h5>restart-backoff-time-strategy.fixed-delay.backoff-time</h5></td>
<td style="word-wrap: break-word;">"1 min"</td>
<td>Backoff time between two consecutive restart attempts.</td>
</tr>

Contributor

Needs to be regenerated

Contributor Author

OK

@PublicEvolving
public static final ConfigOption<Long> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.fixed-delay.backoff-time")
    .defaultValue(10_000L)

Contributor

Let's set the default value to 0. With queued scheduling there is no longer a strict need for a delay.

Contributor Author

OK

* use ManualClock in unit test and add configuration description

This closes apache#8573.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for addressing my comments @eaglewatcherwb. LGTM. Merging this PR.

@tillrohrmann tillrohrmann force-pushed the FLINK-12669-fixed-delay-strategy branch from 5b922dc to 2d993b2 Compare June 13, 2019 15:01
@eaglewatcherwb eaglewatcherwb deleted the FLINK-12669-fixed-delay-strategy branch June 14, 2019 07:40
sjwiesman pushed a commit to sjwiesman/flink that referenced this pull request Jun 26, 2019