
[FLINK-30593][autoscaler] Improve restart time tracking #735

Merged: 4 commits into apache:main on Jan 10, 2024

Conversation

afedulov (Contributor):
This PR contains the following improvements to the restart tracking logic:

  • Adds more debug logs
  • Stores restart Duration directly instead of the endTime Instant
  • Fixes a bug that makes restart duration tracking dependent on whether metrics are considered fully collected
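
The second bullet can be illustrated with a small sketch. Note that the class and method names below are simplified stand-ins, not the actual flink-kubernetes-operator code: the idea is that storing the measured Duration computes the value once, at completion time, instead of leaving an endTime Instant to be re-read and subtracted later.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Simplified stand-in for the tracking record described above: it stores
// the measured restart Duration directly once the rescale completes,
// rather than an endTime Instant that would need to be subtracted later.
public class ScalingRecord {
    private final Instant startTime;
    private Duration restartDuration; // null until the restart completes

    public ScalingRecord(Instant startTime) {
        this.startTime = startTime;
    }

    // Records the duration exactly once and reports whether anything
    // changed, so the caller knows it must persist the updated state.
    public boolean recordRestartDuration(Instant now) {
        if (restartDuration == null) {
            restartDuration = Duration.between(startTime, now);
            return true;
        }
        return false;
    }

    public Optional<Duration> getRestartDuration() {
        return Optional.ofNullable(restartDuration);
    }
}
```

The boolean return mirrors the pattern visible in the diff below: the caller only writes the tracking state back to the store when something actually changed.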

mxm (Contributor) left a comment:

Nice fixes. LGTM. Just a minor suggestion.

Comment on lines 188 to 189:

    if (ctx.getJobStatus() == JobStatus.RUNNING) {
-       if (scalingTracking.setEndTimeIfTrackedAndParallelismMatches(
+       if (scalingTracking.recordRestartDurationIfTrackedAndParallelismMatches(
                now, jobTopology, scalingHistory)) {
            stateStore.storeScalingTracking(ctx, scalingTracking);
        }
    }
mxm (Contributor), Dec 18, 2023:

We don't need the RUNNING job state check. This block can be reduced to:

          if (scalingTracking.recordRestartDurationIfTrackedAndParallelismMatches(
                  now, jobTopology, scalingHistory)) {
              stateStore.storeScalingTracking(ctx, scalingTracking);
          }

The reason is that this method only gets called when the job is in running state (see line 99). Enforcing a RUNNING state has always been a precondition for executing the autoscaling logic.
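
The precondition argument can be sketched as follows (illustrative names and types only, not the operator's actual code): the status check happens once at the entry point, so everything nested inside it may assume the job is RUNNING.

```java
// Illustrative sketch of the precondition described above: the entry
// point checks for RUNNING once, so the blocks inside it need no second
// status check. Names and types here are placeholders, not operator code.
public class RescaleGate {
    enum JobStatus { CREATED, RUNNING, FAILED }

    // Returns true if the rescale logic ran (i.e. the job was RUNNING).
    static boolean runRescaleLogic(JobStatus status, Runnable trackingUpdate) {
        if (status != JobStatus.RUNNING) {
            return false; // precondition enforced once, at the entry point
        }
        // From here on, RUNNING is guaranteed; a nested check is redundant.
        trackingUpdate.run();
        return true;
    }
}
```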

afedulov (Contributor, Author):

Good catch, thanks.

mxm (Contributor) commented on Dec 18, 2023:

This needs a rebase. I'll run the tests afterwards.

Comment on lines 188 to 189 (the same suggested change as above):
Contributor:

can you please extract this logic into a method to keep the flow simpler?

Contributor:

It could even be part of stateStore.storeScalingTracking(ctx, scalingTracking).

Contributor:

I was going to recommend this as well. But these are only three lines of code (after removing the unneeded RUNNING condition) and the resulting method signature would be quite big. I don't think we need to block the PR on this refactoring.

afedulov (Contributor, Author), Jan 9, 2024:

> It could even be part of stateStore.storeScalingTracking(ctx, scalingTracking).

I removed the redundant RUNNING check, as Max recommended, so it looks more straightforward now. Pushing this call down into storeScalingTracking would make it harder to reason about, since it is key that runRescaleLogic is only executed when the job is in the RUNNING state and the transition is therefore considered complete. It also does not seem right to bundle logic specific to this concrete situation into KubernetesAutoScalerStateStore, which acts more as a simple persistence layer. Hope this is fine by you.

Alexander Fedulov added 4 commits on January 9, 2024 at 19:28:
…tion to record ScalingTracking endTime

This change is required because without it, the restart duration won't be recorded until the metrics are considered to be fully collected.
@afedulov
Copy link
Contributor Author

afedulov commented Jan 9, 2024

@mxm I addressed the comments and rebased, could you please kick off the tests again?

mxm (Contributor) commented on Jan 9, 2024:

Sure, they should be running now.

mxm (Contributor) left a comment:

Thanks Alex! Looks good. Are you ok with merging @gyfora?

gyfora (Contributor) left a comment:

🚢

mxm merged commit f6496f5 into apache:main on Jan 10, 2024. 119 checks passed.
afedulov (Contributor, Author):

@mxm @gyfora
Thanks!
