
[BEAM-6165] Flink portable metrics: get ptransform from MonitoringInfo, not stage name #7971

Closed
ryan-williams wants to merge 6 commits

Conversation

@ryan-williams (Contributor) commented on Mar 1, 2019

This fixes a problem with portable Flink metrics that dates back to #7183: the runner's "step name" is just the name of an executable stage, not the ptransform the metrics actually belong to (which comes over the Fn API in the MonitoringInfos' "ptransform" label).

Metrics weren't properly tagged with the ptransform they came from until #7624; it wasn't possible to do the right thing before that.
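
For illustration, here is a minimal sketch of the idea in Java: attribute a metric to the ptransform named in the MonitoringInfo's labels, and only fall back to the stage name if that label is missing. The label key, the plain `Map` representation, and the fallback are assumptions for this sketch, not code from this PR.

```java
import java.util.Map;

class MonitoringInfoSteps {
  // Assumed label key; the real constant is defined alongside the MonitoringInfo proto.
  private static final String PTRANSFORM_LABEL = "PTRANSFORM";

  /**
   * Resolves the step a metric should be attributed to: prefer the ptransform
   * reported over the Fn API, and fall back to the executable stage name.
   */
  static String resolveStep(Map<String, String> monitoringInfoLabels, String stageName) {
    return monitoringInfoLabels.getOrDefault(PTRANSFORM_LABEL, stageName);
  }
}
```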

Versions of these Flink-specific changes exist in both #7915 and #7934, so I'm looking at factoring them out here; if this is straightforward and can go in first, I can simplify those PRs by rebasing them on top of this.

R: @mxm, @ajamato
CC: @robertwb

Post-Commit Tests Status (on master branch)

(Build-status badge grid for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners omitted.)

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@mxm (Contributor) left a comment


Thanks for the PR! Some minor comments. Did you check that the metric names are correct for regular and portable pipelines? There are some users who rely on them being consistent. We should definitely create a JIRA issue that can be part of the release notes if the format changed.

CC @tweise @mwylde

@@ -133,6 +133,8 @@ public void testSavepointRestoreLegacy() throws Exception {
runSavepointAndRestore(false);
}

// TODO(ryan): make these fail when an exception is thrown (in the runner, I
// think?), instead of just timing out
Contributor:

Is this related to the metrics? Have you seen this test timing out?

Contributor Author:

It's not directly related, but when working on metrics PVR tests, if something throws in the runner harness (I think), that manifests as a hang and timeout of this test case, without a good stack trace.

Not essential to add this, ofc, but it's not the first time I've experienced it while doing unrelated Flink work, so thought I'd try to raise awareness.

// in the operator name which is passed to Flink's MetricGroup to which
// the metric with the following name will be added.
return metricName.getNamespace() + METRIC_KEY_SEPARATOR + metricName.getName();
return String.join(
Contributor:

Did you check how the metric names look in Flink? I think we changed this some time ago because the metric names contained duplicate strings.

Contributor Author:

I've not. I'll do that and report back.

Contributor:

Have you had a chance to check?


Contributor:

So just to confirm: including the step name here would duplicate it, because the step name is already used as the metric group name of the Flink operator that calls into this metrics code.
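
To make the duplication concrete, here is a hedged sketch (the helper class and separator value are illustrative, not from this PR's diff): the operator's MetricGroup already carries the step name in its scope, so the key registered against it only needs the namespace and the metric name.

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

class BeamMetricNames {
  // Assumed separator value; the PR references a METRIC_KEY_SEPARATOR constant.
  private static final String METRIC_KEY_SEPARATOR = ".";

  /**
   * Registers a counter on the operator's MetricGroup. Flink already scopes
   * the group by the operator (step) name, so the key built here deliberately
   * omits the step name to avoid duplicating it in the final metric identifier.
   */
  static Counter registerCounter(MetricGroup operatorGroup, String namespace, String name) {
    return operatorGroup.counter(namespace + METRIC_KEY_SEPARATOR + name);
  }
}
```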

@ryan-williams (Contributor Author) left a comment

Addressed comments, I'll follow up on one of them.

I also simplified some of the FlinkMetricContainer.updateMetric helpers.


@ryan-williams

Run Java PreCommit

@ryan-williams

Run Python PreCommit

@ryan-williams

(Python PreCommit was flaky; :beam-sdks-python:testPy3Gcp failed (scan))

@ryan-williams

Run Java PreCommit

@ryan-williams

:beam-runners-flink-1.6:test failed previously (scan) but I think it was a flake unrelated to this PR.

@ajamato commented on Mar 13, 2019

@pabloem

@pabloem (Member) commented on Mar 18, 2019

This looks fine to me. It's quite cool TBH : D
Max seems to be asking the right questions. I leave the rest of the review to him. I agree that it's important to have consistent metric naming (perhaps add a test?); I leave that to you.
Thanks!

stale bot commented on Jun 4, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 4, 2019
stale bot commented on Jun 11, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Jun 11, 2019