[FLINK-31650][metrics][rest] Remove transient metrics for subtasks in terminal state #23447
@X-czh
Hi @JunRuiLee, thanks for reviewing. +1 for the need to preserve some non-transient metrics like numRecordsIn and numRecordsOut after vertices reach a terminal state. I'll redesign this to remove only the backpressure-related metrics mentioned in this issue.
I previously thought it was fine to completely remove the metrics of terminal vertices, as the core I/O metrics displayed on the UI for terminal vertices are actually accessed via the metrics stored in
Hi @JunRuiLee. It's been a while. I've updated the PR to remove the transient metrics (idle/busy/backpressured time) for terminal subtasks only. Could you take a look when you are free? Many thanks in advance~
Hi @X-czh, thanks for the update. It looks good to me. However, I'd like to hear @wanglijie95's opinion on these changes. @wanglijie95, what do you think?
@wanglijie95 A gentle reminder~ Could you take a look when you have time?
Hi @X-czh. My apologies for the oversight. After a more thorough review, I've noticed an issue with this PR, so I am updating the PR status to 'request changes'.
subtaskMetricStore.retainAttempts(
        attempts.getCurrentAttempts()));
subtaskMetricStore -> {
    subtaskMetricStore.retainAttempts(
Since the metrics for subtasks also exist in the taskMetricsStore in the form of taskInfo.subtaskIndex + "." + name, it is also necessary to clean up the transient metrics stored in the taskMetricsStore. Otherwise, it could lead to inconsistent behaviors that may confuse users, as depicted in the screenshot below.
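To make the duplication concrete, here is a minimal sketch of a store that keeps each subtask metric in two places and removes it from both. It is not Flink's actual MetricStore: the class name `TaskMetricStoreSketch` and the methods `add` and `removeSubtaskMetrics` are hypothetical, and only the `subtaskIndex + "." + name` key format is taken from the comment above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for a task-level metric store; NOT Flink's actual MetricStore. */
final class TaskMetricStoreSketch {
    // Task-level view: keys have the form "<subtaskIndex>.<metricName>".
    final Map<String, String> metrics = new HashMap<>();
    // Subtask-level view: subtask index -> (metricName -> value).
    final Map<Integer, Map<String, String>> subtasks = new HashMap<>();

    /** Stores one subtask metric in both views (the duplication under discussion). */
    void add(int subtaskIndex, String name, String value) {
        subtasks.computeIfAbsent(subtaskIndex, k -> new HashMap<>()).put(name, value);
        metrics.put(subtaskIndex + "." + name, value);
    }

    /** Removes the given metric names for one subtask from both views. */
    void removeSubtaskMetrics(int subtaskIndex, Set<String> names) {
        Map<String, String> subtask = subtasks.get(subtaskIndex);
        if (subtask == null) {
            return; // Unknown subtask: nothing to clean up.
        }
        subtask.keySet().removeAll(names);
        for (String name : names) {
            metrics.remove(subtaskIndex + "." + name);
        }
    }
}
```

If only the subtask-level map were cleaned, a lookup of `0.busyTimeMsPerSecond` in the task-level map would still return the stale value, which is the inconsistency described here.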
Good catch! But I'm wondering whether this is a good place to do it. It complicates the code and makes maintenance more difficult. Since the duplication was introduced to work around the fact that WebInterface task metric queries currently do not account for subtasks, how about leaving it unremoved here for now and creating a new JIRA on updating the WebInterface to account for subtasks? I'd be willing to help with that as well. cc @JunRuiLee @wanglijie95
Thanks for the clarification, @X-czh.
I'm not quite certain I understand your concern.
In my opinion, this issue is unrelated to the web interface and has more to do with the inconsistency in the MetricStore, because the WebUI also gets its data from the MetricStore. Specifically, the metrics in the subtaskMetricsStore are being removed, while the metrics in the taskMetricsStore are not removed along with them, which could be confusing for users.
Based on your changes, you can perform the following test:
For a jobVertex that has already finished, you can use the JobVertexMetricsHandler to retrieve subtask metrics like below:
http://localhost:8081/jobs//vertices//metrics?get=0.backPressuredTimeMsPerSecond,0.busyTimeMsPerSecond
Then, compare the results with the SubtaskMetricsHandler:
http://localhost:8081/jobs//vertices//subtasks/0/metrics?get=backPressuredTimeMsPerSecond,busyTimeMsPerSecond
The results from these two endpoints are different. In my local test, the results are as shown in the attached image. I prefer that cleaning up should be done simultaneously for both, WDYT?
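The two queries above differ only in where the subtask index goes: as a prefix on each metric name, or as a path segment. A small sketch that builds both URLs side by side is shown below. The class `MetricsQueryUrls` and its methods are hypothetical helpers, and the job and vertex IDs are parameters because they are elided in the URLs above.

```java
import java.util.StringJoiner;

/** Hypothetical helper that builds the two query URLs compared above. */
final class MetricsQueryUrls {
    /** Vertex-level query: each metric name is prefixed with "<subtaskIndex>.". */
    static String vertexLevel(
            String baseUrl, String jobId, String vertexId, int subtaskIndex, String... metricNames) {
        StringJoiner names = new StringJoiner(",");
        for (String name : metricNames) {
            names.add(subtaskIndex + "." + name);
        }
        return baseUrl + "/jobs/" + jobId + "/vertices/" + vertexId + "/metrics?get=" + names;
    }

    /** Subtask-level query: the subtask index is part of the path, names are unprefixed. */
    static String subtaskLevel(
            String baseUrl, String jobId, String vertexId, int subtaskIndex, String... metricNames) {
        return baseUrl + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/" + subtaskIndex + "/metrics?get=" + String.join(",", metricNames);
    }
}
```

Fetching both URLs for a finished jobVertex and comparing the returned values is the check described here: after the cleanup, neither response should still report the transient metrics.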
Hi @JunRuiLee, thanks for the clarification. I get your point; my concern is actually about the maintainability of the code in the future. When there are both insertion and deletion operations, the duplication of task-level metrics makes it more difficult to maintain consistency (as is the case here), so I think it would be better to address the duplication issue in the future.
I'll clean up both places together here first, and create a new issue for the optimization later. WDYT?
Cleaning up both places together here first sounds good to me. Please feel free to go ahead~
Updated, PTAL when you are free, thx~
Thanks for creating this PR, @X-czh. I left two minor comments, PTAL.
flink-runtime/src/main/java/org/apache/flink/runtime/messages/webmonitor/JobDetails.java
...-runtime/src/main/java/org/apache/flink/runtime/rest/handler/legacy/metrics/MetricStore.java
@X-czh , Thanks for addressing my comments. Everything looks great except for a minor comment, PTAL.
if (subtasks.containsKey(subtaskIndex)) {
    // Remove in both places as task metrics are duplicated in task metric store and
    // subtask metric store for metrics query on WebInterface.
    metrics.keySet()
I prefer not specifically mentioning 'query on WebInterface' since the metric store also exposes metrics to users through the REST API.
Addressed, thx
@X-czh , Thanks for the quick fix. Looks good to me! Approved.
Thanks for the update, @X-czh. LGTM. Could you squash the commits and rebase your code onto the latest master?
@wanglijie95 Squashed and rebased, thanks!
… terminal state This closes #23447
@X-czh Could you prepare a backport PR for branch release-1.17? There are conflicts when I try to cherry-pick these changes to release-1.17.
Sure, no problem
Hi @X-czh, a gentle reminder: we need a PR for release-1.17~
OK, I'll prepare it tonight
… terminal state This closes apache#23447 (cherry picked from commit dd02828)
Thanks @X-czh, and thanks @JunRuiLee for the review. I will close this issue.
What is the purpose of the change
This pull request cleans up transient metrics (idle/busy/backpressured time) for terminal subtasks to avoid confusion caused by outdated metrics. For example, a FINISHED task may have its last reported 100% busy time retained in the metric store and shown on the UI, which is misleading because the task is no longer running.
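As a rough illustration of the intended behavior (not the code in this PR), the sketch below drops only the transient metrics once a subtask is in a terminal state and keeps everything else. The class `TransientMetricCleanup`, its `State` enum, and the method `visibleMetrics` are hypothetical; the metric names are assumed from the idle/busy/backpressured metrics discussed in this PR and may not match the exact set handled by the real change.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; NOT the implementation in this PR. */
final class TransientMetricCleanup {
    // Assumed names for the transient (idle/busy/backpressured time) metrics.
    static final Set<String> TRANSIENT_METRICS =
            Set.of("idleTimeMsPerSecond", "busyTimeMsPerSecond", "backPressuredTimeMsPerSecond");

    /** Simplified execution states; only RUNNING is treated as non-terminal here. */
    enum State {
        RUNNING, FINISHED, CANCELED, FAILED;

        boolean isTerminal() {
            return this != RUNNING;
        }
    }

    /** Returns a copy of the metrics, without transient entries if the state is terminal. */
    static Map<String, String> visibleMetrics(State state, Map<String, String> metrics) {
        Map<String, String> result = new HashMap<>(metrics);
        if (state.isTerminal()) {
            result.keySet().removeAll(TRANSIENT_METRICS);
        }
        return result;
    }
}
```

With this rule, a FINISHED subtask no longer reports a stale busy time, while counters such as numRecordsIn remain available.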
Brief change log
Removes transient metrics (idle/busy/backpressured time) for terminal subtasks in MetricStore.
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation