
[FLINK-30593][autoscaler] Improve restart time tracking #735

Merged: 4 commits into apache:main on Jan 10, 2024

Conversation

afedulov (Contributor):
This PR contains the following improvements to the restart tracking logic:

  • Adds more debug logs
  • Stores restart Duration directly instead of the endTime Instant
  • Fixes a bug that makes restart duration tracking dependent on whether metrics are considered fully collected
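
The second bullet can be illustrated with a small sketch. Note that the class and method names below are simplified stand-ins, not the actual flink-kubernetes-operator code: the idea is that storing the measured Duration computes the value once, at completion time, instead of leaving an endTime Instant to be re-read and subtracted later.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Simplified stand-in for the tracking record described above: it stores
// the measured restart Duration directly once the rescale completes,
// rather than an endTime Instant that would need to be subtracted later.
public class ScalingRecord {
    private final Instant startTime;
    private Duration restartDuration; // null until the restart completes

    public ScalingRecord(Instant startTime) {
        this.startTime = startTime;
    }

    // Records the duration exactly once and reports whether anything
    // changed, so the caller knows it must persist the updated state.
    public boolean recordRestartDuration(Instant now) {
        if (restartDuration == null) {
            restartDuration = Duration.between(startTime, now);
            return true;
        }
        return false;
    }

    public Optional<Duration> getRestartDuration() {
        return Optional.ofNullable(restartDuration);
    }
}
```

The boolean return mirrors the pattern visible in the diff below: the caller only writes the tracking state back to the store when something actually changed.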

mxm (Contributor) left a comment:

Nice fixes. LGTM. Just a minor suggestion.

Comment on lines 188 to 189:

    if (ctx.getJobStatus() == JobStatus.RUNNING) {
-       if (scalingTracking.setEndTimeIfTrackedAndParallelismMatches(
+       if (scalingTracking.recordRestartDurationIfTrackedAndParallelismMatches(
                now, jobTopology, scalingHistory)) {
            stateStore.storeScalingTracking(ctx, scalingTracking);
        }
    }
mxm (Contributor), Dec 18, 2023:

We don't need the RUNNING job state check. This block can be reduced to:

          if (scalingTracking.recordRestartDurationIfTrackedAndParallelismMatches(
                  now, jobTopology, scalingHistory)) {
              stateStore.storeScalingTracking(ctx, scalingTracking);
          }

The reason is that this method only gets called when the job is in running state (see line 99). Enforcing a RUNNING state has always been a precondition for executing the autoscaling logic.
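
The precondition argument can be sketched as follows (illustrative names and types only, not the operator's actual code): the status check happens once at the entry point, so everything nested inside it may assume the job is RUNNING.

```java
// Illustrative sketch of the precondition described above: the entry
// point checks for RUNNING once, so the blocks inside it need no second
// status check. Names and types here are placeholders, not operator code.
public class RescaleGate {
    enum JobStatus { CREATED, RUNNING, FAILED }

    // Returns true if the rescale logic ran (i.e. the job was RUNNING).
    static boolean runRescaleLogic(JobStatus status, Runnable trackingUpdate) {
        if (status != JobStatus.RUNNING) {
            return false; // precondition enforced once, at the entry point
        }
        // From here on, RUNNING is guaranteed; a nested check is redundant.
        trackingUpdate.run();
        return true;
    }
}
```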

afedulov (Contributor, Author):

Good catch, thanks.

mxm (Contributor) commented on Dec 18, 2023:

This needs a rebase. I'll run the tests afterwards.

Comment on lines 188 to 189 (the same suggested change as above):
Contributor:

can you please extract this logic into a method to keep the flow simpler?

Contributor:

It could even be part of stateStore.storeScalingTracking(ctx, scalingTracking).

Contributor:

I was going to recommend this as well. But these are only three lines of code (after removing the unneeded RUNNING condition) and the resulting method signature would be quite big. I don't think we need to block the PR on this refactoring.

afedulov (Contributor, Author), Jan 9, 2024:

> It could even be part of stateStore.storeScalingTracking(ctx, scalingTracking).

I removed the redundant RUNNING check, as Max recommended, so it looks more straightforward now. Pushing this call down into storeScalingTracking would make it harder to reason about, since it is key that runRescaleLogic is only executed when the job is in the RUNNING state and the transition is therefore considered complete. It also does not seem right to bundle logic specific to this concrete situation into KubernetesAutoScalerStateStore, which acts more as a simple persistence layer. Hope this is fine by you.

Alexander Fedulov added 4 commits on January 9, 2024 at 19:28:
…tion to record ScalingTracking endTime

This change is required because without it, the restart duration won't be recorded until the metrics are considered to be fully collected.
@afedulov
Copy link
Contributor Author

afedulov commented Jan 9, 2024

@mxm I addressed the comments and rebased, could you please kick off the tests again?

mxm (Contributor) commented on Jan 9, 2024:

Sure, they should be running now.

mxm (Contributor) left a comment:

Thanks Alex! Looks good. Are you ok with merging @gyfora?

gyfora (Contributor) left a comment:

🚢

mxm merged commit f6496f5 into apache:main on Jan 10, 2024. 119 checks passed.
afedulov (Contributor, Author):

@mxm @gyfora
Thanks!
