Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-31650][metrics][rest] Remove transient metrics for subtasks in terminal state #23447

Closed
wants to merge 1 commit into from

Conversation

X-czh
Copy link
Contributor

@X-czh X-czh commented Sep 20, 2023

What is the purpose of the change

This pull request cleanups transient metrics (idle/busy/backpressured time) for terminal subtasks to avoid confusion caused by outdated metrics. For example, a FINISHED task may have its last updated 100 % busy time metrics retained in the metric store and shown on the UI, which is obviously unreasonable.

Brief change log

Removes transient metrics (idle/busy/backpressured time) for terminal subtasks in MetricStore.

Verifying this change

  • Added UT to test that the busy time metric is removed for terminal subtasks.
  • Tested with a real job with a bounded source and verified on UI that the metrics were unavailable after the bounded source finished:
    888709fa-4eca-4e83-9374-9383e4e44b95

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Sep 20, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@JunRuiLee
Copy link
Contributor

@X-czh
Hi Zhanghao, thanks for your effort.
After a quick review of your PR, if I understand correctly, your changes will remove all metrics from the operators in the terminal state.
I think this change is not entirely reliable because some non-transient metrics like numRecordsIn and numRecordsOut should not be removed just although the task state becomes finished. I would prefer to remove only those metrics that may cause confused when the tasks in the terminal state, and the decision on which metrics are unnecessary for terminal state tasks should be determined through community discussion.
However, to keep it simple for this PR, I could also accept the removal of only the backpressure-related metrics mentioned in this issue.
Both approaches are acceptable to me. WDYT?

@X-czh
Copy link
Contributor Author

X-czh commented Sep 22, 2023

Hi @JunRuiLee, thanks for reviewing. +1 for the need to preserve some non-transient metrics like numRecordsIn and numRecordsOut after vertices reaching terminal state. I'll redesign to remove only the backpressure-related metrics mentioned in this issue

@X-czh
Copy link
Contributor Author

X-czh commented Sep 22, 2023

I previously thought it fine to completely remove metrics of terminal vertices, as the core I/O metrics displayed on the UI of terminal vertices are actually accessed via the metrics stored in AccessExecution and would not be affected. But after some offline discussion with my colleagues, I realized some users may rely on it. So +1 for your suggestion.

@X-czh
Copy link
Contributor Author

X-czh commented Nov 25, 2023

Hi @JunRuiLee. It's been a while, I've updated the PR to remove the transient metrics (idle/busy/backpressured time) for terminal subtasks only. Could you help take a review when you are free? Many thanks in advance~

Copy link
Contributor

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @X-czh , thanks for the update. It looks good to me. However, I'd like to hear @wanglijie95 's opinion on these changes. @wanglijie95 What do you think?

@X-czh X-czh changed the title [FLINK-31650][metrics][rest] Cleanup terminal subtask attempt metrics to avoid confusion caused by outdated metrics [FLINK-31650][metrics][rest] Remove transient metrics for subtasks in terminal state Dec 18, 2023
@X-czh
Copy link
Contributor Author

X-czh commented Dec 18, 2023

@wanglijie95 Kindly remind~ Could you help take a look when you have time?

Copy link
Contributor

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @X-czh. My apologies for the oversight. After a more thorough review, I've noticed an issue with this pr. So I am updating the PR status to 'requests changes'.

subtaskMetricStore.retainAttempts(
attempts.getCurrentAttempts()));
subtaskMetricStore -> {
subtaskMetricStore.retainAttempts(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the metrics for subtasks also exist in the taskMetricsStore in the form of taskInfo.subtaskIndex + "." + name, it is also necessary to clean up the transient metrics stored in the taskMetricsStore. Otherwise, it could lead to inconsistent behaviors that may confuse users, as depicted in the screenshot below.
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! But I'm wondering if this is a good approach to do so here. It complicates the code and makes maintenance more difficult. Since the duplication was introduced to overcome the issue that WebInterface task metric queries currently do not account for subtasks, how about leaving it unremoved here for now and create a new JIRA on updating the WebInterface to account for subtasks? I'd be willing to help with that as well. cc @JunRuiLee @wanglijie95

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarification @X-czh .

I'm not quite certain I understand your concern.
In my opinion, this issue is unrelated to the web interface and more related to the inconsistency in the MetricStore because the WebUI is also get data from MetricStore. Specifically, the metrics in the subtaskMetricsStore are being removed, while the metrics in the taskMetricsStore are not synchronously removed, which could be confusing for users.

Based on your changes, you can perform the following test:
For a jobVertex that has already finished, you can use the JobVertexMetricsHandler to retrieve subtask metrics like below:
http://localhost:8081/jobs//vertices//metrics?get=0.backPressuredTimeMsPerSecond,0.busyTimeMsPerSecond
Then, compare the results with the SubtaskMetricsHandler:
http://localhost:8081/jobs//vertices//subtasks/0/metrics?get=backPressuredTimeMsPerSecond,busyTimeMsPerSecond

The results from these two endpoints are different. In my local test, the results are as shown in the attached image. I prefer that cleaning up should be done simultaneously for both, WDYT?

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JunRuiLee, thanks for the clarification. I get your point and my concern is actual about maintainability of the code in the future. When there're both insertion and deletion operations, the duplication of task-level metrics actually makes it more difficult to maintain consistency (as is the case here), so I think it would be better to optimize the duplication issue here in the future.

I'll clean up simultaneously for both cases first here, and create a new issue for the optimization later. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up simultaneously for both cases first here sounds good to me, Please feel free and go ahead~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL when you are free, thx~

Copy link
Contributor

@wanglijie95 wanglijie95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @X-czh , I left two minor comments, PTAL.

Copy link
Contributor

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@X-czh , Thanks for addressing my comments. Everything looks great except for a minor comment, PTAL.

if (subtasks.containsKey(subtaskIndex)) {
// Remove in both places as task metrics are duplicated in task metric store and
// subtask metric store for metrics query on WebInterface.
metrics.keySet()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not specifically mentioning 'query on WebInterface' since the metric store also exposes metrics to users through the REST API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed, thx

Copy link
Contributor

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@X-czh , Thanks for the quick fix. Looks good to me! Approved.

Copy link
Contributor

@wanglijie95 wanglijie95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for update @X-czh , LGTM. Could you squash the commits and rebase your code based on the latest master?

@X-czh
Copy link
Contributor Author

X-czh commented Dec 19, 2023

@wanglijie95 Squashed and rebased, thanks!

wanglijie95 pushed a commit that referenced this pull request Dec 19, 2023
@wanglijie95
Copy link
Contributor

wanglijie95 commented Dec 19, 2023

@X-czh Could you prepare a backport PR for branch release-1.17 ? It has conflicts when I try to cherry-pick the changes here to release-1.17

@X-czh
Copy link
Contributor Author

X-czh commented Dec 20, 2023

Sure, no problem

@wanglijie95
Copy link
Contributor

Hi @X-czh, kindly remind again, we need a PR for release-1.17 ~

@X-czh
Copy link
Contributor Author

X-czh commented Dec 25, 2023

OK, I'll prepare it tonight

X-czh added a commit to X-czh/flink that referenced this pull request Dec 25, 2023
… terminal state

This closes apache#23447

(cherry picked from commit dd02828)
wanglijie95 pushed a commit that referenced this pull request Dec 26, 2023
… terminal state (#23988)

This closes #23447

(cherry picked from commit dd02828)
@wanglijie95
Copy link
Contributor

Thanks @X-czh , and thanks for the review of @JunRuiLee. I will close this issue.

@X-czh X-czh deleted the FLINK-31650 branch May 27, 2024 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants