New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-6165] Flink portable metrics: get ptransform from MonitoringInfo, not stage name #7971
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Some minor comments. Did you check that the metric names are correct for regular and portable pipelines? There are some users which rely on them being consistent. We should definitely create a JIRA issue that can be part of the release notes if the format changed.
@@ -133,6 +133,8 @@ public void testSavepointRestoreLegacy() throws Exception { | |||
runSavepointAndRestore(false); | |||
} | |||
|
|||
// TODO(ryan): make these fail when an exception is thrown (in the runner, I | |||
// think?), instead of just timing out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the metrics? Have you seen this test timing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not directly related, but when working on metrics PVR tests, if something throws in the runner harness (I think), that manifests as a hang and timeout of this test case, without a good stack trace.
Not essential to add this, ofc, but it's not the first time I've experienced it while doing unrelated Flink work, so thought I'd try to raise awareness.
// in the operator name which is passed to Flink's MetricGroup to which | ||
// the metric with the following name will be added. | ||
return metricName.getNamespace() + METRIC_KEY_SEPARATOR + metricName.getName(); | ||
return String.join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check how the metric names look in Flink? I think we changed this some time ago because the metric names contained duplicate strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not. I'll do that and report back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you had a chance to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just to confirm, including the step name here will duplicate it because it is already used as the metric group name for the Flink operator which calls out to the Metrics code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comments, I'll follow up on one of them.
I also simplified some of the FlinkMetricContainer.updateMetric helpers.
@@ -133,6 +133,8 @@ public void testSavepointRestoreLegacy() throws Exception { | |||
runSavepointAndRestore(false); | |||
} | |||
|
|||
// TODO(ryan): make these fail when an exception is thrown (in the runner, I | |||
// think?), instead of just timing out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not directly related, but when working on metrics PVR tests, if something throws in the runner harness (I think), that manifests as a hang and timeout of this test case, without a good stack trace.
Not essential to add this, ofc, but it's not the first time I've experienced it while doing unrelated Flink work, so thought I'd try to raise awareness.
// in the operator name which is passed to Flink's MetricGroup to which | ||
// the metric with the following name will be added. | ||
return metricName.getNamespace() + METRIC_KEY_SEPARATOR + metricName.getName(); | ||
return String.join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not. I'll do that and report back.
Run Java PreCommit |
Run Python PreCommit |
(Python PreCommit was flaky; |
Run Java PreCommit |
|
This looks fine to me. It's quite cool TBH : D |
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This fixes a problem with portable Flink metrics that dates back to #7183: the runner's "step name" is just the name of an executable stage, not the ptransform that metrics are actually part of (which come over the fn API in MIs' "ptransform" label).
Metrics weren't properly tagged with the ptransform they came from until #7624; it wasn't possible to do the right thing before that.
Versions of these Flink-specific changes exist in both #7915 and #7934, so I'm seeing about factoring them out here; if this is straightforward and can go in first, I can simplify those others by rebasing them on top of this.
R: @mxm, @ajamato
CC: @robertwb
Post-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.