New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-26074] Improve FlameGraphs scalability for high parallelism jobs #19228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a fantastic addition. We should run some manual benchmark with and without this fix though.
I'd remove the use of Set
at various places.
flink-runtime/src/main/java/org/apache/flink/runtime/messages/ThreadInfoSample.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/messages/ThreadInfoSample.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/ThreadInfoSampleService.java
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/ThreadInfoSampleService.java
Show resolved
Hide resolved
...e/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexThreadInfoStats.java
Outdated
Show resolved
Hide resolved
@flinkbot run azure |
40ccea6
to
bbb3576
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much clearer now with the ImmutableSet
. If you don't like it, then there is still the option with creating a small record type around it.
...st/java/org/apache/flink/runtime/webmonitor/threadinfo/ThreadInfoRequestCoordinatorTest.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/IdleTestTask.java
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/messages/ThreadInfoSample.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/messages/ThreadInfoSample.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexThreadInfoTracker.java
Show resolved
Hide resolved
...src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexThreadInfoTracker.java
Show resolved
Hide resolved
...st/java/org/apache/flink/runtime/webmonitor/threadinfo/ThreadInfoRequestCoordinatorTest.java
Outdated
Show resolved
Hide resolved
7acc24f
to
908d208
Compare
@AHeise could you take another look? There are only a couple of minor points remaining. |
I think all comments have been addressed and overall the code looks fine. Please rebase your branch since a lot of tests are currently failing on the CI of this branch
@afedulov Did you run the benchmarks? |
fe56326
to
0f0d121
Compare
@flinkbot run azure |
0f0d121
to
ff5a8fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % one logger is unused
flink-runtime/src/main/java/org/apache/flink/runtime/messages/ThreadInfoSample.java
Outdated
Show resolved
Hide resolved
...runtime/src/test/java/org/apache/flink/runtime/taskexecutor/ThreadInfoSampleServiceTest.java
Outdated
Show resolved
Hide resolved
ff5a8fb
to
6901eab
Compare
Thanks @fapaul, addressed the comments. I also did a final manual test on the cluster started from flink-dist. |
6901eab
to
8480a20
Compare
What is the purpose of the change
The FlameGraph feature added in FLINK-13550 issues 1 RPC call per subtask. This may cause performance problems for jobs with high paralleism and a lot of subtasks. This PR improves sampling by grouping thread sampling request and issuing only one call per TaskManager for all the relevant threads at once.
Verifying this change
Verified manually and by the existing tests.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Documentation