Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-34655] Copy IOMetricsInfo to flink-autoscaler-standalone module #797

Merged
merged 1 commit into from
Mar 14, 2024

Conversation

1996fanrui
Copy link
Member

@1996fanrui 1996fanrui commented Mar 12, 2024

flink-ubernetes-operator is committed to supporting the latest 4 flink minor versions, and autoscaler is a part of flink-ubernetes-operator. Currently, the latest 4 flink minor versions are 1.15, 1.16, 1.17 and 1.18.

But autoscaler doesn't work for flink 1.15.

Root cause:

  • FLINK-28310 added some properties in IOMetricsInfo in flink-1.16
  • IOMetricsInfo is a part of JobDetailsInfo
  • JobDetailsInfo is necessary for autoscaler [1]
  • flink's RestClient doesn't allow miss any property during deserializing the json

That means that the RestClient after 1.15 cannot fetch JobDetailsInfo for 1.15 jobs.

How to fix it properly?

  • [FLINK-34655] Copy IOMetricsInfo to flink-autoscaler-standalone module
  • Removing them after 1.15 are not supported

[1]

getJobDetailsInfo(ctx, conf.get(AutoScalerOptions.FLINK_CLIENT_TIMEOUT));

[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-401%3A+REST+API+JSON+response+deserialization+unknown+field+tolerance

@gyfora
Copy link
Contributor

gyfora commented Mar 12, 2024

Instead of a new rest client you could also do what is done in the Kubernetes-operator module, see:
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/runtime/rest/messages/job/metrics/IOMetricsInfo.java

By shadowing the class under the same package this will hide the incompatible implementation coming from flink

@gyfora
Copy link
Contributor

gyfora commented Mar 12, 2024

But still it will not make it work fully for 1.15 as the aggregated metrics need to be back ported on the Flink side (similar to other features required for the autoscaler)

@1996fanrui
Copy link
Member Author

Thanks @gyfora for the valuable feedback!

Instead of a new rest client you could also do what is done in the Kubernetes-operator module, see: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/runtime/rest/messages/job/metrics/IOMetricsInfo.java

By shadowing the class under the same package this will hide the incompatible implementation coming from flink

If so, autoscaler works with 1.15 when users uses the flink-kubernetes-operator, right?

I'm using the autoscaler Standalone, and I found 1.15 job doesn't work. In the short term, could I move the IOMetricsInfo to autoscaler module? IIUC, it will let both of flink-kubernetes-operator and autoscaler Standalone work.

In the long term, should we maintain the IOMetricsInfo class in the flink-kubernetes-operator repo or we disable DeserializationFeature.FAIL_ON_NULL_FOR_PRIMITIVES in the flink repo?

Looking forward to your opinion, and I'm happy to fix them.

But still it will not make it work fully for 1.15 as the aggregated metrics need to be back ported on the Flink side (similar to other features required for the autoscaler)

Thanks for the reminder, we have a lot of jobs run with 1.15 version, so we are backing port them. Of course, we (our internal platform) only analyse them and don't scale them when the version before 1.18.

@gyfora
Copy link
Contributor

gyfora commented Mar 12, 2024

Thanks @gyfora for the valuable feedback!

Instead of a new rest client you could also do what is done in the Kubernetes-operator module, see: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/runtime/rest/messages/job/metrics/IOMetricsInfo.java
By shadowing the class under the same package this will hide the incompatible implementation coming from flink

If so, autoscaler works with 1.15 when users uses the flink-kubernetes-operator, right?

I'm using the autoscaler Standalone, and I found 1.15 job doesn't work. In the short term, could I move the IOMetricsInfo to autoscaler module? IIUC, it will let both of flink-kubernetes-operator and autoscaler Standalone work.

In the long term, should we maintain the IOMetricsInfo class in the flink-kubernetes-operator repo or we disable DeserializationFeature.FAIL_ON_NULL_FOR_PRIMITIVES in the flink repo?

Looking forward to your opinion, and I'm happy to fix them.

But still it will not make it work fully for 1.15 as the aggregated metrics need to be back ported on the Flink side (similar to other features required for the autoscaler)

Thanks for the reminder, we have a lot of jobs run with 1.15 version, so we are backing port them. Of course, we (our internal platform) only analyse them and don't scale them when the version before 1.18.

Not sure if we need to have the IOMetricsInfo override in both packages to be safe actually. I think it has to be copied.

I can personally confirm that given the aggregated metrics back ported to 1.15 (easy to do) + the IOMetricsInfo, the autoscaler works with 1.15

@1996fanrui 1996fanrui force-pushed the 34655/json-deserialize branch 2 times, most recently from 3e06a99 to dc64eee Compare March 13, 2024 08:33
@1996fanrui 1996fanrui changed the title [FLINK-34655] Autoscaler doesn't work for flink 1.15 [FLINK-34655] Copy IOMetricsInfo to flink-autoscaler-standalone module Mar 13, 2024
@1996fanrui 1996fanrui requested a review from gyfora March 13, 2024 08:49
Copy link
Contributor

@gyfora gyfora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, if you tested it and works 🚢

Copy link
Member Author

@1996fanrui 1996fanrui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gyfora and @mxm for the review!

looks good, if you tested it and works 🚢

I have tested locally, the main branch doesn't work for 1.15 job. And it works with current PR. Merging~

@1996fanrui 1996fanrui merged commit ab41083 into apache:main Mar 14, 2024
131 checks passed
@1996fanrui 1996fanrui deleted the 34655/json-deserialize branch March 14, 2024 02:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants