[BEAM-4776] Add metrics support to Java PortableRunner #10105

mwalenia · 2019-11-14T11:29:57Z

This PR adds conversion of portable MonitoringInfos to MetricResults in Java's PortableRunner.

R: @lgajowy @mxm @angoenka @iemejia
Can you take a look, guys? Thanks!

mwalenia · 2019-11-15T09:20:35Z

Run Java PreCommit

mxm

Thanks @mwalenia. Looks good. Do we have any integration tests that we can enable to test this end to end?

mwalenia · 2019-11-15T13:20:37Z

@mxm I don't think so; we don't run portable e2e tests in Java yet.

mxm · 2019-11-15T14:32:26Z

I think we do, through the Portable ValidatesRunner tests. We might want to enable these:

beam/runners/flink/job-server/flink_job_server.gradle

Line 142 in 780ef7a

excludeCategories 'org.apache.beam.sdk.testing.UsesAttemptedMetrics'

mwalenia · 2019-11-15T14:54:11Z

You're right, thanks for pointing it out. I'll delete the exclusions and run the ValidatesRunner.

mwalenia · 2019-11-15T14:57:35Z

Run Java Flink PortableValidatesRunner Batch

mwalenia · 2019-11-18T19:33:04Z

@mxm I'm going to fix these failures and get back to you. Thanks again for pointing out those tests :)

mxm · 2019-11-18T19:34:21Z

Sounds good. Thanks!

mxm · 2019-11-18T19:34:28Z

Run Java Flink PortableValidatesRunner Streaming

mwalenia · 2019-11-20T19:07:04Z

Run Java Flink PortableValidatesRunner Streaming

mwalenia · 2019-11-20T19:07:08Z

Run Java Flink PortableValidatesRunner Batch

mwalenia · 2019-11-20T19:08:24Z

@mxm I excluded the tests for committed metrics, as they are not supported.
I also excluded the gauge metric tests, since it seems gauges aren't supported on portable Flink either: I checked the accumulators and found no trace of gauges. Do you know anything about it?
Thanks!

mwalenia · 2019-11-20T21:12:13Z

Run Java PreCommit

mwalenia · 2019-11-20T22:35:08Z

Run Java PreCommit

mwalenia · 2019-11-20T22:35:25Z

Run Java Flink PortableValidatesRunner Batch

mwalenia · 2019-11-20T22:35:29Z

Run Java Flink PortableValidatesRunner Streaming

mxm · 2019-11-21T10:11:50Z

Gauges should be supported. I'm using them on a production system. Beam by default doesn't expose any gauges though, so you might have to add some manually.

mwalenia · 2019-11-21T17:11:03Z

How would I go about that? I'm not sure how exposing the metrics is done. Can you point me in the right direction?

mwalenia · 2019-11-21T18:10:56Z

@echauchot Hi, I've stumbled upon a MetricsPusherTest failure in this PR.
I know why it happens:

The runner reports more than just the user metric defined in the test. TestMetricsSink returns only the first metric in the list to the test. Since there's no guarantee that this is the user metric, the assertion is likely to pick up the wrong value and fail.

Do you think this is a good reason to make the test account for such a situation?

I hope you're the person to reach out to in this case - MetricsPusherTest seems to be your creation :)
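
A minimal sketch of the order-independent lookup that would avoid this failure, written in plain Java rather than against Beam's classes. `ReportedCounter` and `CounterLookup` are hypothetical names used only for illustration; they are not part of Beam or of this PR.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for one reported counter; not Beam's MetricResult. */
final class ReportedCounter {
  final String namespace;
  final String name;
  final long attempted;

  ReportedCounter(String namespace, String name, long attempted) {
    this.namespace = namespace;
    this.name = name;
    this.attempted = attempted;
  }
}

/** Hypothetical helper a test sink could use instead of returning the first metric. */
final class CounterLookup {
  private CounterLookup() {}

  /** Finds a counter by namespace and name, regardless of its position in the list. */
  static Optional<ReportedCounter> find(
      List<ReportedCounter> counters, String namespace, String name) {
    return counters.stream()
        .filter(c -> c.namespace.equals(namespace) && c.name.equals(name))
        .findFirst();
  }
}
```

With a lookup like this, the assertion no longer depends on whether the runner happens to list system metrics before the user metric.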

iemejia · 2019-11-22T08:43:26Z

runners/flink/job-server/flink_job_server.gradle

@@ -139,12 +139,9 @@ def portableValidatesRunnerTask(String name, Boolean streaming) {
      includeCategories 'org.apache.beam.sdk.testing.ValidatesRunner'
      excludeCategories 'org.apache.beam.sdk.testing.FlattenWithHeterogeneousCoders'
      excludeCategories 'org.apache.beam.sdk.testing.LargeKeys$Above100MB'
-      excludeCategories 'org.apache.beam.sdk.testing.UsesAttemptedMetrics'


Not a blocker for this PR, but out of curiosity: do these tests pass when enabled in the Portable Spark Runner? If so, it would be a good idea to enable them there too; if not, the errors should be reported so they can be fixed.

I'm not sure. I can create a PR to check this, that's a topic worth investigating.

It's here: #10198

Hmm, now that I think of it, you probably wanted to check the impact of my changes on the Spark runner, right?

@iemejia The tests fail when enabled on the Portable Spark runner. I'd have to investigate further to pinpoint the failing areas.

mwalenia · 2019-11-22T17:30:04Z

@mxm How can I go about manually adding gauges? Does that mean changing the FlinkRunner to publish gauge metrics?

mwalenia · 2019-11-22T21:42:13Z

Run Java Spark PortableValidatesRunner Batch

mxm · 2019-11-25T15:55:49Z

Gauges are reported here:

beam/runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

Line 149 in 885ecbf

private void updateGauge(Iterable<MetricResult<GaugeResult>> gauges) {

Also they are added to the accumulator here:

beam/runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

Line 106 in 885ecbf

MetricResults metricResults = asAttemptedOnlyMetricResults(metricsAccumulator.getLocalValue());

I don't know why the tests are not passing, but we can also fix gauges in a follow-up.

echauchot

Comment about enhancement of MetricsPusher

echauchot · 2019-11-26T14:07:27Z

runners/portability/java/src/main/java/org/apache/beam/runners/portability/PortableMetrics.java

+  private Iterable<MetricResult<DistributionResult>> distributions;
+  private Iterable<MetricResult<GaugeResult>> gauges;
+
+  private PortableMetrics(


@mwalenia to answer your question, I am indeed the right person for MetricsPusher-related questions.
Regarding MetricsPusher, the problem goes beyond the test itself. For now, the whole MetricsPusher feature reports only user metrics (which is why the test sink tailored for it reads only user metrics). But the aim since the beginning of the architectural design (essentially pull vs. push) was to allow system metrics to be supported in the future. Here is the design doc I wrote at the time: https://s.apache.org/runner_independent_metrics_extraction
Long story short, the right thing to do, IMHO, is to enhance MetricsPusher to support system metrics as well and, of course, update the test and sink.

mwalenia · 2019-11-27T11:18:10Z

@mxm you're right, it will be simpler to figure out gauges in another PR.

mwalenia · 2019-11-27T14:49:54Z

runners/core-java/src/test/java/org/apache/beam/runners/core/metrics/MetricsPusherTest.java

+
+  @Category({ValidatesRunner.class, UsesAttemptedMetrics.class, UsesCounterMetrics.class})
+  @Test
+  public void pushesSystemMetrics() throws InterruptedException {


@echauchot I added a test that checks whether system metrics are supported by MetricsPusher. It seems that they work :)

I also fixed the TestMetricsSink to account for this fact.

mwalenia · 2019-11-27T14:56:57Z

Run Java PreCommit

mwalenia · 2019-11-28T11:44:40Z

@mxm It seems that there is no support for gauges in portability - I didn't find a proper MonitoringInfo type in metrics.proto.

mwalenia · 2019-11-28T12:33:53Z

Run Java PreCommit

mwalenia · 2019-11-28T13:38:44Z

@mxm the tests are green :) I think we need to take the gauge issue elsewhere, as it seems the gauges aren't portable at all.
If everything looks good to you, let me know, I'll clean up the commits and get them ready for merging.

mxm · 2019-11-28T14:08:14Z

Gauges are portable. The type is beam:metrics:latest_int_64. We can take care of the gauge tests separately from this PR.
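
To illustrate what a "latest" metric type implies, here is a minimal sketch in plain Java. It assumes that combining two gauge reports keeps the one with the more recent timestamp; `LatestLong` is a hypothetical class for illustration and not Beam's gauge implementation.

```java
/** Hypothetical latest-value cell: a long value plus the time it was reported. */
final class LatestLong {
  final long value;
  final long timestampMillis;

  LatestLong(long value, long timestampMillis) {
    this.value = value;
    this.timestampMillis = timestampMillis;
  }

  /**
   * Combines two reports by keeping the one with the later timestamp.
   * On a tie the first argument is kept (an arbitrary choice in this sketch).
   */
  static LatestLong merge(LatestLong a, LatestLong b) {
    return b.timestampMillis > a.timestampMillis ? b : a;
  }
}
```

Unlike a counter, the merged value here is not a sum: an older report is simply discarded, so a gauge test has to assert on the most recent value rather than on an accumulated total.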

mxm · 2019-11-28T14:08:34Z

Could you squash the commits?

iemejia · 2019-11-28T15:58:08Z

For the Portable Spark runner, the issue tracking the passing of metrics from the SDK harness to Spark is https://issues.apache.org/jira/browse/BEAM-7219
It's great that this is now done on the Portable Runner side; it will allow Nexmark to be run too. Great work, @mwalenia!

mwalenia · 2019-11-29T09:06:35Z

@mxm the commits are squashed.
As for gauges, my bad - I meant user gauge metrics.
Thanks!

kennknowles · 2019-12-02T21:22:19Z

This has broken the Flink runner, it seems: https://issues.apache.org/jira/browse/BEAM-8869

It is also failing in some of Google's internal testing. I am still investigating that but will try to summarize and repro externally.

mxm · 2019-12-03T11:44:12Z

runners/core-java/src/test/java/org/apache/beam/runners/core/metrics/MetricsPusherTest.java

+    pipeline.run();
+    // give metrics pusher time to push
+    Thread.sleep(
+        (pipeline.getOptions().as(MetricsOptions.class).getMetricsPushPeriod() + 1L) * 1000);


We should probably lower this interval and build in retry logic. Otherwise this is prone to breaking.
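
A minimal sketch of the retry idea, in plain Java: poll a condition at a short interval until it holds or a deadline passes, instead of sleeping once for the full push period. `Poller` and `waitUntil` are hypothetical names, and the timeout and interval values a test would pass are assumptions.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not part of Beam or of this PR. */
final class Poller {
  private Poller() {}

  /**
   * Re-checks the condition every pollMillis until it holds or timeoutMillis elapses.
   * Returns true as soon as the condition holds, false on timeout or interruption.
   */
  static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test could then assert on the result of `waitUntil` with a condition that checks whether the sink has received the expected metric, which returns early on fast machines and tolerates slow ones up to the timeout.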

mwalenia force-pushed the BEAM-4776-metrics-portableRunner-java branch from 9f683e7 to 5d0dd59 Compare November 14, 2019 14:23

mwalenia force-pushed the BEAM-4776-metrics-portableRunner-java branch from e5c819f to 7517573 Compare November 28, 2019 11:43

[BEAM-4776] Add metrics support to Java PortableRunner

4575e1c

mwalenia force-pushed the BEAM-4776-metrics-portableRunner-java branch from 7517573 to 4575e1c Compare November 29, 2019 09:05

mxm merged commit 1f64ba3 into apache:master Nov 29, 2019

mwalenia deleted the BEAM-4776-metrics-portableRunner-java branch January 24, 2020 10:41