
Conversation

@clarax (Contributor) commented Oct 18, 2023

What is the purpose of the change

This implements the scaling report comparison algorithm for event suppression using hashcodes of the parallelism maps. Originally I used the full string of the report message, which contains metrics that fluctuate with the current load. As long as the parallelism map doesn't change, we don't need to generate new events within the defined interval.

Brief change log

  • Save the hashcode of the parallelism map to the metadata labels
  • Use the hashcode to compare two scaling reports for advice (a minimal sketch follows this list)
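
A minimal sketch of the comparison idea, assuming hypothetical names (the label key PARALLELISM_HASH_LABEL and the helper parallelismChanged are illustrative, not the actual constants in this PR):

import java.util.Map;

class ParallelismHashSketch {

    // Hypothetical label key; the real constant name in the operator may differ.
    static final String PARALLELISM_HASH_LABEL = "autoscaler/parallelism-hash";

    // Returns true when the proposed parallelism map differs from the one recorded
    // on the previous event's labels, i.e. a new scaling event should be emitted.
    static boolean parallelismChanged(
            Map<String, Integer> proposedParallelism, Map<String, String> previousEventLabels) {
        // Map#hashCode is defined over the entry set, so it is independent of iteration
        // order; hashing only the parallelism map means load-dependent metrics in the
        // report message text cannot defeat deduplication.
        var hash = Integer.toString(proposedParallelism.hashCode());
        return !hash.equals(previousEventLabels.get(PARALLELISM_HASH_LABEL));
    }
}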

Verifying this change

  • Updated and added unit tests.
  • Verified in integration test env.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): No
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: No
  • Core observer or reconciler logic that is regularly executed: No

Documentation

  • Does this pull request introduce a new feature? No
  • If yes, how is the feature documented? N/A

@clarax clarax closed this Oct 18, 2023
@gyfora (Contributor) commented Oct 19, 2023

Sorry @clarax, I think I gave confusing feedback and didn't express myself clearly. I was thinking of something like this:

interface AutoScalerEventHandler {

    void handleGenericEvent(Ctx, Type, Reason, Message, Key, Interval);

    default void handleScalingEvent(Ctx, Map<JobVertexId, ScalingSummary> summaries, boolean scalingEnabled, Interval) {
        // Provide a default implementation without proper deduplication
        handleGenericEvent(..., toMessage(summaries), interval);
    }
}

Then for Kubernetes we have the custom label logic encapsulated:

class KubernetesAutoScalerEventHandler {

    ...

    @Override
    void handleScalingEvent(Ctx, Map<JobVertexId, ScalingSummary> summaries, boolean scalingEnabled, Interval) {
        var labels = createLabels(summaries);
        // getOldEvent, compare labels; whatever we need to do here does not leak to ScalingExecutor
    }
}

This way the ScalingExecutor simply calls:

eventHandler.handleScalingEvent(...)

And we could remove the predicate/label logic that really does not belong there, as it is very Kubernetes specific, and move it to the KubernetesAutoScalerEventHandler implementation.

What do you think?

@clarax (Contributor, Author) commented Oct 20, 2023

Refactored as @gyfora suggested.

@gyfora (Contributor) left a comment

I think it looks really good now; we have a much nicer separation of responsibilities in the different components. Please verify the latest version manually (locally or something) if you can, @clarax.

@1996fanrui What do you think about the interface change?

@1996fanrui 1996fanrui self-requested a review October 20, 2023 07:51
@gyfora (Contributor) left a comment

Another thing we have been discussing with @1996fanrui:

The interval config should probably be renamed from scaling.report.interval to scaling.event.interval; this way we can use it generally in the future for autoscaler-triggered events.

We should also make sure that the simple handleEvent method respects the interval if specified, and we should probably use the interval for ineffective scaling events as well. I know that some of these changes are not directly related to this PR, but it may be better to clean them up so we leave things in a good state afterwards.
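
A rough sketch of what "respects the interval" could look like on the generic event path; the in-memory timestamp store and method names below are illustrative only, and the real handler may instead read the timestamp from the existing Kubernetes event:

import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class EventIntervalSketch {

    // Hypothetical in-memory store of the last emission time per event key.
    private final Map<String, Instant> lastEmitted = new ConcurrentHashMap<>();
    private final Clock clock = Clock.systemUTC();

    // Returns true if no interval is configured, no event with this key was emitted yet,
    // or the configured interval has elapsed since the last emission.
    boolean shouldEmit(String eventKey, Duration interval) {
        if (interval == null) {
            return true;
        }
        var now = clock.instant();
        var last = lastEmitted.get(eventKey);
        if (last == null || Duration.between(last, now).compareTo(interval) >= 0) {
            lastEmitted.put(eventKey, now);
            return true;
        }
        return false;
    }
}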

@clarax (Contributor, Author) commented Oct 24, 2023

Another thing we have been discussing with @1996fanrui:

The interval config should probably be renamed from scaling.report.interval to scaling.event.interval; this way we can use it generally in the future for autoscaler-triggered events.

We should also make sure that the simple handleEvent method respects the interval if specified, and we should probably use the interval for ineffective scaling events as well. I know that some of these changes are not directly related to this PR, but it may be better to clean them up so we leave things in a good state afterwards.

Resolved all requested changes.

@1996fanrui (Member)
Hi, the CI fails; please run mvn clean install -DskipTests -Pgenerate-docs again, thanks.

https://github.com/apache/flink-kubernetes-operator/actions/runs/6621898931/job/17987047236?pr=685#step:5:19405

@gyfora (Contributor) left a comment

The PR looks great @clarax , thank you!

I found 2 very minor things (see comments); otherwise it is good to go.

@gyfora gyfora merged commit faaff56 into apache:main Oct 24, 2023