Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-27925] [kubernetes]Performance optimization when watch tm pod and list pod. #21527

Merged
merged 2 commits into from
Jun 12, 2023

Conversation

Mrart
Copy link
Contributor

@Mrart Mrart commented Dec 19, 2022

What is the purpose of the change

In the case of large-scale start and stop jobs, constantly reading data from etcd can cause a bottleneck in etcd performance. We can increase resourceversion=0 in watch to reduce data read from etcd.

Brief change log

    • Add withResourceVersion("0") in Fabric8FlinkKubeClient#watchPodsAndDoCallback

Verifying this change

Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): ( no)
  • The serializers: (no / don't)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no )
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Dec 19, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@wangyang0918
Copy link
Contributor

We need to add a test to guard this behavior.

@Mrart
Copy link
Contributor Author

Mrart commented Jan 2, 2023

@wangyang0918 Added a unit test to test that we can use withResourceversion("0 ") to listen on pods events when pod resourceversion=5668.5668 is a random resource version.

@Mrart Mrart changed the title [FLINK-27925] watch tm pod performance optimization [FLINK-27925] [kubernetes]watch tm pod performance optimization Jan 4, 2023
@Mrart Mrart changed the title [FLINK-27925] [kubernetes]watch tm pod performance optimization [FLINK-27925] [kubernetes]Performance optimization when watch tm pod and list pod. Jan 4, 2023
@Mrart
Copy link
Contributor Author

Mrart commented Jan 6, 2023

@flinkbot run azure

Copy link
Contributor

@huwh huwh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mrart prepare for this pr, I have left some comments, please take a look.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mrart for this optimization and thanks @huwh for the review!

Apart from a suggestion for refactoring, my opinion is basically in line with Weihua. Please take a look~

@reswqa
Copy link
Member

reswqa commented May 29, 2023

It seems that some conflicts need to be resolved.

@Mrart
Copy link
Contributor Author

Mrart commented May 30, 2023

@flinkbot run azure

@Mrart Mrart force-pushed the 27925 branch 3 times, most recently from 39b203f to 783fcaf Compare May 30, 2023 17:49
@Mrart Mrart force-pushed the 27925 branch 3 times, most recently from b001725 to 2708fe2 Compare June 7, 2023 07:30
Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, I have no other comments. Let's wait and see if @huwh has any other concerns.

Copy link
Contributor

@huwh huwh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mrart and @reswqa for the update. The PR looks much better now. LGTM

@reswqa
Copy link
Member

reswqa commented Jun 7, 2023

@Mrart It seems that the e2e test case is failed, can you rebase on latest master branch to see if it is fixed.

@Mrart
Copy link
Contributor Author

Mrart commented Jun 11, 2023

@Mrart It seems that the e2e test case is failed, can you rebase on latest master branch to see if it is fixed.

Fixed

@reswqa reswqa merged commit 025a95b into apache:master Jun 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants