[FLINK-26074] Improve FlameGraphs scalability for high parallelism jobs #19228

Merged
merged 1 commit into from May 25, 2022

afedulov
Contributor

What is the purpose of the change

The FlameGraph feature added in FLINK-13550 issues one RPC call per subtask. This may cause performance problems for jobs with high parallelism and a large number of subtasks. This PR improves sampling by grouping thread sampling requests and issuing only one call per TaskManager, covering all of the relevant threads on that TaskManager at once.
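
The grouping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `GroupedSamplingSketch`, the `Subtask` type, and the `groupByTaskManager` method are hypothetical names, and plain strings stand in for whatever identifiers Flink actually uses for TaskManagers and subtasks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the Flink code base.
class GroupedSamplingSketch {

    /** Stand-in for one subtask: the TaskManager hosting it and an id for the subtask. */
    static final class Subtask {
        final String taskManagerId;
        final String subtaskId;

        Subtask(String taskManagerId, String subtaskId) {
            this.taskManagerId = taskManagerId;
            this.subtaskId = subtaskId;
        }
    }

    /** Groups subtask ids by TaskManager so that each TaskManager can be sent one request. */
    static Map<String, List<String>> groupByTaskManager(List<Subtask> subtasks) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Subtask subtask : subtasks) {
            grouped.computeIfAbsent(subtask.taskManagerId, id -> new ArrayList<>())
                    .add(subtask.subtaskId);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Subtask> subtasks = new ArrayList<>();
        subtasks.add(new Subtask("tm-1", "map-0"));
        subtasks.add(new Subtask("tm-2", "map-1"));
        subtasks.add(new Subtask("tm-1", "sink-0"));

        // One entry per TaskManager; each entry lists every subtask to sample there.
        Map<String, List<String>> requests = groupByTaskManager(subtasks);
        System.out.println("subtasks: " + subtasks.size() + ", requests: " + requests.size());
    }
}
```

With this grouping, the number of sampling requests grows with the number of TaskManagers rather than with the number of subtasks.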

Verifying this change

Verified manually and by the existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Mar 24, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

Contributor

@AHeise AHeise left a comment

Looks like a fantastic addition. We should run some manual benchmark with and without this fix though.
I'd remove the use of Set in various places.

@afedulov
Contributor Author

@flinkbot run azure

Contributor

@AHeise AHeise left a comment

Looks much clearer now with the ImmutableSet. If you don't like it, there is still the option of creating a small record type around it.

@afedulov
Contributor Author

@AHeise could you take another look? There are only a couple of minor points remaining.

@fapaul
Contributor

fapaul commented May 19, 2022

I think all comments have been addressed and overall the code looks fine. Please rebase your branch, since a lot of tests are currently failing on the CI of this branch.

> We should run some manual benchmark with and without this fix though.

@afedulov Did you run the benchmarks?

@afedulov force-pushed the FLINK-26074-scale-FG branch 3 times, most recently from fe56326 to 0f0d121, on May 23, 2022 16:09

@afedulov
Contributor Author

@flinkbot run azure

Contributor

@fapaul fapaul left a comment

LGTM, except that one logger is unused.

@afedulov
Contributor Author

afedulov commented May 24, 2022

Thanks @fapaul, I have addressed the comments. I also did a final manual test on a cluster started from flink-dist.
