
[BEAM-6165] Flink portable metrics: get ptransform from MonitoringInfo, not stage name #7971

Closed
ryan-williams wants to merge 6 commits

Conversation

@ryan-williams (Contributor) commented on Mar 1, 2019

This fixes a problem with portable Flink metrics that dates back to #7183: the runner's "step name" is just the name of an executable stage, not the ptransform the metrics actually belong to (which comes over the Fn API in the MonitoringInfos' "ptransform" label).

Metrics weren't properly tagged with the ptransform they came from until #7624; it wasn't possible to do the right thing before that.
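
For illustration, here is a minimal sketch of the idea in Java: attribute a metric to the ptransform named in the MonitoringInfo's labels, and only fall back to the stage name if that label is missing. The label key, the plain `Map` representation, and the fallback are assumptions for this sketch, not code from this PR.

```java
import java.util.Map;

class MonitoringInfoSteps {
  // Assumed label key; the real constant is defined alongside the MonitoringInfo proto.
  private static final String PTRANSFORM_LABEL = "PTRANSFORM";

  /**
   * Resolves the step a metric should be attributed to: prefer the ptransform
   * reported over the Fn API, and fall back to the executable stage name.
   */
  static String resolveStep(Map<String, String> monitoringInfoLabels, String stageName) {
    return monitoringInfoLabels.getOrDefault(PTRANSFORM_LABEL, stageName);
  }
}
```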

Versions of these Flink-specific changes exist in both #7915 and #7934, so I'm looking at factoring them out here; if this is straightforward and can go in first, I can simplify those PRs by rebasing them on top of this.

R: @mxm, @ajamato
CC: @robertwb

Post-Commit Tests Status (on master branch)

(Build-status badge grid for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners omitted.)

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@mxm (Contributor) left a comment


Thanks for the PR! Some minor comments. Did you check that the metric names are correct for regular and portable pipelines? There are some users who rely on them being consistent. We should definitely create a JIRA issue that can be part of the release notes if the format changed.

CC @tweise @mwylde

@@ -133,6 +133,8 @@ public void testSavepointRestoreLegacy() throws Exception {
runSavepointAndRestore(false);
}

// TODO(ryan): make these fail when an exception is thrown (in the runner, I
// think?), instead of just timing out
Contributor:

Is this related to the metrics? Have you seen this test timing out?

Contributor Author:

It's not directly related, but when working on metrics PVR tests, if something throws in the runner harness (I think), that manifests as a hang and timeout of this test case, without a good stack trace.

Not essential to add this, ofc, but it's not the first time I've experienced it while doing unrelated Flink work, so thought I'd try to raise awareness.

// in the operator name which is passed to Flink's MetricGroup to which
// the metric with the following name will be added.
return metricName.getNamespace() + METRIC_KEY_SEPARATOR + metricName.getName();
return String.join(
Contributor:

Did you check how the metric names look in Flink? I think we changed this some time ago because the metric names contained duplicate strings.

Contributor Author:

I've not. I'll do that and report back.

Contributor:

Have you had a chance to check?


Contributor:

So just to confirm: including the step name here would duplicate it, because the step name is already used as the metric group name of the Flink operator that calls into this metrics code.
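
To make the duplication concrete, here is a hedged sketch (the helper class and separator value are illustrative, not from this PR's diff): the operator's MetricGroup already carries the step name in its scope, so the key registered against it only needs the namespace and the metric name.

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

class BeamMetricNames {
  // Assumed separator value; the PR references a METRIC_KEY_SEPARATOR constant.
  private static final String METRIC_KEY_SEPARATOR = ".";

  /**
   * Registers a counter on the operator's MetricGroup. Flink already scopes
   * the group by the operator (step) name, so the key built here deliberately
   * omits the step name to avoid duplicating it in the final metric identifier.
   */
  static Counter registerCounter(MetricGroup operatorGroup, String namespace, String name) {
    return operatorGroup.counter(namespace + METRIC_KEY_SEPARATOR + name);
  }
}
```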

@ryan-williams (Contributor Author) left a comment

Addressed comments, I'll follow up on one of them.

I also simplified some of the FlinkMetricContainer.updateMetric helpers.


@ryan-williams

Run Java PreCommit

@ryan-williams

Run Python PreCommit

@ryan-williams

(Python PreCommit was flaky; :beam-sdks-python:testPy3Gcp failed (scan))

@ryan-williams

Run Java PreCommit

@ryan-williams

:beam-runners-flink-1.6:test failed previously (scan) but I think it was a flake unrelated to this PR.

@ajamato commented on Mar 13, 2019

@pabloem

@pabloem (Member) commented on Mar 18, 2019

This looks fine to me. It's quite cool TBH : D
Max seems to be asking the right questions. I leave the rest of the review to him. I agree that it's important to have consistent metric naming (perhaps add a test?); I leave that to you.
Thanks!

stale bot commented on Jun 4, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 4, 2019
stale bot commented on Jun 11, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Jun 11, 2019