[BEAM-4775] Clean up metric protos; support integer distributions, gauges #7876

ryan-williams · 2019-02-18T23:26:55Z

(factored out of #7823)

Drops extrema protos, adds gauges.
Drops non-integer metric proto types

Also starts tying together MetricName and MonitoringInfoMetricName:

the former is used for "user" metrics (defined by {ptransform,namespace,name}), the latter for "system" metrics ({urn,labels})
user-metrics are also representable in {urn,labels} form, but as of this PR the Java SDK intercepts attempts to create user-metrics that way (in MonitoringInfoMetricName.create) and creates a MetricName AutoValue instance instead, which preserves existing behavior most straightforwardly.

Post-Commit Tests Status (on master branch)

Lang	Apex	Dataflow	Flink	Gearpump	Samza	Spark
Go	---	---	---	---	---	---
Java
Python	---			---	---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

ryan-williams · 2019-02-19T15:04:21Z

Run Java PreCommit

ajamato · 2019-02-19T22:15:40Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-  double sum = 2;
-  double min = 3;
-  double max = 4;
+message IntGaugeData {


Is there a use case for this at the moment? Are you implementing these right now?

Yea, in #7823 I support all the types here (counters, distributions, gauges) everywhere in Java; I think Python already supports them as well, but need to dig in to that further.

ajamato · 2019-02-19T22:18:16Z

runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/LabeledMetrics.java

@@ -29,7 +30,7 @@
  /**
   * Create a metric that can be incremented and decremented, and is aggregated by taking the sum.
   */
-  public static Counter counter(MonitoringInfoMetricName metricName) {
+  public static Counter counter(MetricName metricName) {


Let's keep it as MonitoringInfoMetricName, since this is meant to be an implementation only used by SDK and RunnerHarness authors. For creating system style metrics with URN+labels,

Quick explanation for why this changed:

Before this PR, MetricName.named allowed creating (user-)metrics from {namespace,name}, while MonitoringInfoMetricName.named created metrics from {URN, labels}.

The latter can create user- or system-metrics, but naively creating MonitoringInfoMetricNames for user-metrics is dangerous: they don't hash/equal the "same" metric created via MetricName.named, which I ran into in tests.

Meanwhile, user code wants to make metrics from arbitrary MonitoringInfos, so a single API for that makes sense, but it should return a MetricName (AutoValue) for user-URNs, and a MonitoringInfoMetricName otherwise; I've changed the MonitoringInfoMetricName constructor to do that in this PR. That means that it returns a MetricName, not a MonitoringInfoMetricName.

Ultimately (#7823), I merge these two into something that looks like MonitoringInfoMetricName but is called MetricName. In the meantime, there's little cost to using MetricName as the super-type of both.

So, re: your comment here, I could create an additional MonitoringInfoMetricName constructor that's only to be used for system metric URNs, and have ElementCountFnDataReceiver call that here instead, and then pass the MonitoringInfoMetricName it got to LabeledMetrics.Counter here, but I don't think that is worth it.

That new constructor would feel unsafe if it didn't validate that the passed URN was in fact a system-metric, and that would start to be a lot of overhead, when LabeledMetrics.Counter (and other callers) don't actually care whether they are dealing with a system- or user-metric.

So that's the whole story 😄 ! lmk what you think.

ajamato · 2019-02-19T22:39:35Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-  oneof distribution {
-    IntDistributionData int_distribution_data = 1;
-    DoubleDistributionData double_distribution_data = 2;
+    int64 counter = 1;


There was long debate when, and the existing protos were what we could agree on when we introduced this design. While I am fine with this change you are proposing personally. I'd like you to get buy in on the dev list.

https://s.apache.org/beam-fn-api-metrics

https://lists.apache.org/list.html?dev@beam.apache.org:2018-2

I think most of it was in the comments of the doc. So unfortunately, I cannot point you toward too much.

I don't have strong opinion about removing CounterData, but I'm concerned that we make counter int64. In my opinion, double will be much more flexible.

Thanks for the info and pointers!

I was just going off of what seemed to be already supported in the Java and Python SDKs.

In particular, existing Java metrics are all in terms of long, and I thought I discussed keeping that and dropping these other types with someone (@robertwb?) in the course of planning this work.

I'll read those links and we can discuss further.

I don't remember discussing this, it must have been with someone else.

I agree that supporting double counters (and especially distributions) would be useful. Also, we haven't really nailed down what the global semantics are for gauge metrics.

My position was, and remains, that reifying the exact set of supported metric types via a set of nested set of protos is not a stable, long-term solution. It took a long time to come up with this list, and here we are 6 months later changing it. (And change will become harder the more we actually use these protos everywhere.) Instead, data should be either (1) a bytes field or (2) something like

message data {
repeated int64 int_data = 1;
repeated double double_data = 2;
repeated string string_data = 3;
...?
}

with the type specifying its encoding into this data attribute. We can then support everything from simple counters to complex distributions without changing the protos, only introducing new types, where the set of types are defined by URNs, and runners simply pass through types they do not understand. This is consistent with the rest of the way the FnAPI works as well (e.g. coders, transforms, ...)

Thanks Robert. It seems we're stuck between 3 data models:

existing protos

{int,double,string} counters

{int,double} x {distribution,extrema}

unused in SDKs

deprecated protos

int64-only {counter, distribution, gauge}

supported in Java, Python SDKs

opaque bytes-payloads

Robert's suggestion above

not implemented anywhere afaik

In this PR, I changed (1)'s protos to match the SDK support that is modeled on (2), which is probably wrong!

(3) sgtm but seems like it is not the existing consensus / spec.

I think the larger goal (sending metrics over the job API) requires converting between SDK and proto metrics structures (in both directions), so they need to match.

Maybe there is a way I can side-step that, or maybe we can just get to the real long-term solution here.

@robertwb I'd like to push back a bit about a significant change, where we use opaque bytes payloads.

I am okay with removing the layers in this proto and using the design Ryan has proposed here. But not an extensive rewrite at this stage of the game, which will slow down progress significantly.

We have protos for the MonitoringInfoTable. This solution is the extensible format, where you can store basically anything and the producer and consumer of this proto can be the only ones who need to understand it (and it could be passed through the RunnerHarness)

beam/model/fn-execution/src/main/proto/beam_fn_api.proto

Line 583 in 197a068

message MonitoringTableData {

But we chose to have separate protos for the common well known metric formats (Metrics being a well defined concept shared by many systems, i.e. a timeseries of data): counter, gauge, distribution, etc.

Additionally, if you wish to modify these formats significantly at a future stage, you can always

add to these protos without deleting

Change the URN version number of a metric

Upgrade all SDKs to package the metric with a format for the previous and new version number of the URN

Upgrade all Runners that wish to support it.

Upgrading the URN version is the safe way to do this without introducing a breaking change, as consumers of MonitoringInfos can freely chose which URNs+versions they support.

ajamato · 2019-02-19T22:42:24Z

...s/core-java/src/main/java/org/apache/beam/runners/core/metrics/MonitoringInfoMetricName.java

-    builder.append(this.urn.toString());
-    builder.append(" ");
-    builder.append(this.labels.toString());
+    if (labels.containsKey(PCOLLECTION_LABEL)) {


just iterate over the keyas and values and add them to the return string, instead of inspecting the labels for specific keys.

Then there is no need to update this code for new labels.

I wrote things that way in several places initially, but ran into interesting problems that we'll need to discuss eventually, so I'll describe them below.

Here's what happens in this PR if I naively append the label kv-pairs: ryan-williams@5576286.

Note that Flink receives a metric with the unseemly name step.PTRANSFORM.step.PCOLLECTION.pcoll.beam.metric.element_count.v1 (which goes e.g. to the Flink web UI, which we've taken pains to make more human-readable).

The first step there is added by MetricKey.toString, which calls this MetricName.toString routine, expecting info about the metric's "name" (URN) only.

MetricKey vs MetricName

The issue, In my view, is that MonitoringInfoMetricName is actually more of a MetricKey implementation:

MetricKey ≈ MonitoringInfoMetricName ({URN, labels})

stepName ≈ [labels map with just a PTRANSFORM entry]

MetricName ≈ URN

MetricResult ≈ MonitoringInfo ({key, metric value})

I ultimately lay things out in this manner in #7823, folding MIMN into MetricKey.

Special-casing single-label ptransform/pcollection-scoped metrics

A bigger issue is that I think runners and endpoints are going to want to hard-code [the only 2 cases that we will have for the foreseeable future]: <ptransform>.<namespace>.<name> and <pcollection>.<urn> (<ptransform>.<urn> folds into this nicely as well).

They'll do this for backwards-compat reasons, and/or to make things look nice in e.g. web UIs. I've touched code like this in at least the Samza, Flink, and Spark runners (and that MetricsHttpSink we were looking at).

I think it's great that the wire format is extensible to many+new labels, but ended up feeling that the SDK should provide the APIs that deal specifically with the shapes all calling code will care about atm.

Sorry that's a lot! I don't care much what we do in this case, but I think we'll have some harder similar choices shortly 😲 lmk what you think.

ajamato · 2019-02-19T22:50:46Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

  /**
   * Update this container with metrics from the passed {@link MonitoringInfo}s, and send updates
   * along to Flink's internal metrics framework.
   */
-  public void updateMetrics(String stepName, List<MonitoringInfo> monitoringInfos) {
-    MetricsContainer metricsContainer = getMetricsContainer(stepName);
+  public void updateMetrics(List<MonitoringInfo> monitoringInfos) {


Question about the overall design here. Much of this class doesn't seem flink specific. If its just aggregating the values in memory, perhaps it should just aggregate the MonitoringInfos. Then have some method to extract them in a flink specific format (possibly in another class)

Interesting point. To paraphrase: ingesting MonitoringInfos into a MetricContainerStepMap (which other runners also use) will be a common path, and code for that should get factored out.

I experimented with keeping fn-api metrics natively in MonitoringInfos in MCSM, alongside traditional Java-SDK-formatted ones, but ended up needing to do a lot of other cleanup in there (#7890) before I could see straight 😄.

ajamato · 2019-02-19T22:50:47Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

-              ExtremaData extremaData = metric.getExtremaData();
-              LOG.warn("Extrema metric unsupported: {}", extremaData);
-            }
+          if (!monitoringInfo.hasMetric()) {


Whitelisting the URNs you support would be the preferred method of doing this.

What you've done here will allow various SDK system metrics which get added later on, without the devs of this runner getting any sort of heads up. The intention of the design is to allow the runner to whitelist the specific URNs it wishes to support. You may just wish to support the user metric prefixes. Doing that would drop the element count, execution time, etc. metrics.

Interesting. It seems in this case like it's easier and better to just handle all the metrics that come through conforming to the standard format? Or maybe I'm not understanding what you mean.

ajamato · 2019-02-19T22:54:38Z

@Ardagan You might want to take a look at this as well. To see the proto changes.

ryan-williams

phew, there's a lot here, thanks for the comments

ryan-williams · 2019-02-20T05:36:45Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-  double sum = 2;
-  double min = 3;
-  double max = 4;
+message IntGaugeData {


Yea, in #7823 I support all the types here (counters, distributions, gauges) everywhere in Java; I think Python already supports them as well, but need to dig in to that further.

ryan-williams · 2019-02-20T06:07:21Z

runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/LabeledMetrics.java

@@ -29,7 +30,7 @@
  /**
   * Create a metric that can be incremented and decremented, and is aggregated by taking the sum.
   */
-  public static Counter counter(MonitoringInfoMetricName metricName) {
+  public static Counter counter(MetricName metricName) {


Quick explanation for why this changed:

Before this PR, MetricName.named allowed creating (user-)metrics from {namespace,name}, while MonitoringInfoMetricName.named created metrics from {URN, labels}.

The latter can create user- or system-metrics, but naively creating MonitoringInfoMetricNames for user-metrics is dangerous: they don't hash/equal the "same" metric created via MetricName.named, which I ran into in tests.

Meanwhile, user code wants to make metrics from arbitrary MonitoringInfos, so a single API for that makes sense, but it should return a MetricName (AutoValue) for user-URNs, and a MonitoringInfoMetricName otherwise; I've changed the MonitoringInfoMetricName constructor to do that in this PR. That means that it returns a MetricName, not a MonitoringInfoMetricName.

Ultimately (#7823), I merge these two into something that looks like MonitoringInfoMetricName but is called MetricName. In the meantime, there's little cost to using MetricName as the super-type of both.

So, re: your comment here, I could create an additional MonitoringInfoMetricName constructor that's only to be used for system metric URNs, and have ElementCountFnDataReceiver call that here instead, and then pass the MonitoringInfoMetricName it got to LabeledMetrics.Counter here, but I don't think that is worth it.

That new constructor would feel unsafe if it didn't validate that the passed URN was in fact a system-metric, and that would start to be a lot of overhead, when LabeledMetrics.Counter (and other callers) don't actually care whether they are dealing with a system- or user-metric.

So that's the whole story 😄 ! lmk what you think.

ryan-williams · 2019-02-20T06:09:26Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-  oneof distribution {
-    IntDistributionData int_distribution_data = 1;
-    DoubleDistributionData double_distribution_data = 2;
+    int64 counter = 1;


Thanks for the info and pointers!

I was just going off of what seemed to be already supported in the Java and Python SDKs.

In particular, existing Java metrics are all in terms of long, and I thought I discussed keeping that and dropping these other types with someone (@robertwb?) in the course of planning this work.

I'll read those links and we can discuss further.

ryan-williams · 2019-02-20T07:18:26Z

...s/core-java/src/main/java/org/apache/beam/runners/core/metrics/MonitoringInfoMetricName.java

-    builder.append(this.urn.toString());
-    builder.append(" ");
-    builder.append(this.labels.toString());
+    if (labels.containsKey(PCOLLECTION_LABEL)) {


I wrote things that way in several places initially, but ran into interesting problems that we'll need to discuss eventually, so I'll describe them below.

Here's what happens in this PR if I naively append the label kv-pairs: ryan-williams@5576286.

Note that Flink receives a metric with the unseemly name step.PTRANSFORM.step.PCOLLECTION.pcoll.beam.metric.element_count.v1 (which goes e.g. to the Flink web UI, which we've taken pains to make more human-readable).

The first step there is added by MetricKey.toString, which calls this MetricName.toString routine, expecting info about the metric's "name" (URN) only.

MetricKey vs MetricName

The issue, In my view, is that MonitoringInfoMetricName is actually more of a MetricKey implementation:

MetricKey ≈ MonitoringInfoMetricName ({URN, labels})

stepName ≈ [labels map with just a PTRANSFORM entry]

MetricName ≈ URN

MetricResult ≈ MonitoringInfo ({key, metric value})

I ultimately lay things out in this manner in #7823, folding MIMN into MetricKey.

Special-casing single-label ptransform/pcollection-scoped metrics

A bigger issue is that I think runners and endpoints are going to want to hard-code [the only 2 cases that we will have for the foreseeable future]: <ptransform>.<namespace>.<name> and <pcollection>.<urn> (<ptransform>.<urn> folds into this nicely as well).

They'll do this for backwards-compat reasons, and/or to make things look nice in e.g. web UIs. I've touched code like this in at least the Samza, Flink, and Spark runners (and that MetricsHttpSink we were looking at).

I think it's great that the wire format is extensible to many+new labels, but ended up feeling that the SDK should provide the APIs that deal specifically with the shapes all calling code will care about atm.

Sorry that's a lot! I don't care much what we do in this case, but I think we'll have some harder similar choices shortly 😲 lmk what you think.

ryan-williams · 2019-02-20T07:38:49Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

  /**
   * Update this container with metrics from the passed {@link MonitoringInfo}s, and send updates
   * along to Flink's internal metrics framework.
   */
-  public void updateMetrics(String stepName, List<MonitoringInfo> monitoringInfos) {
-    MetricsContainer metricsContainer = getMetricsContainer(stepName);
+  public void updateMetrics(List<MonitoringInfo> monitoringInfos) {


Interesting point. To paraphrase: ingesting MonitoringInfos into a MetricContainerStepMap (which other runners also use) will be a common path, and code for that should get factored out.

I experimented with keeping fn-api metrics natively in MonitoringInfos in MCSM, alongside traditional Java-SDK-formatted ones, but ended up needing to do a lot of other cleanup in there (#7890) before I could see straight 😄.

ryan-williams · 2019-02-20T07:40:05Z

runners/flink/src/main/java/org/apache/beam/runners/flink/metrics/FlinkMetricContainer.java

-              ExtremaData extremaData = metric.getExtremaData();
-              LOG.warn("Extrema metric unsupported: {}", extremaData);
-            }
+          if (!monitoringInfo.hasMetric()) {


Interesting. It seems in this case like it's easier and better to just handle all the metrics that come through conforming to the standard format? Or maybe I'm not understanding what you mean.

ajamato · 2019-02-20T17:46:23Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-  double max = 4;
+message IntGaugeData {
+  int64 value = 1;
+  int64 timestamp_ms = 2;


Can we keep this on the top level of the MonitoringInfo

ajamato · 2019-02-20T22:22:50Z

model/fn-execution/src/main/proto/beam_fn_api.proto

-    DoubleDistributionData double_distribution_data = 2;
+    int64 counter = 1;
+    IntDistributionData distribution = 2;
+    IntGaugeData gauge = 3;


The one thing worth pointing out here, is that we never introduced this before because this only holds an int64, which is the same as a counter, which holds an int64 (Note you don't need the timestamp inside IntGaugeData), as its already in the MonitoringInfo proto.

So we didn't include both as separate fields in the oneof. You would just specify which way it is aggregated using the MonitoringInfo type field.

ryan-williams · 2019-02-21T04:42:33Z

Thanks for the feedback, these discussions were very useful for me!

As I mentioned here, these proto changes are basically backwards from the direction they are supposed to be evolving in.

I'm going to close this out; I think I see how to move the other #7823-associated PRs forward orthogonally, and probably more simply for not needing to deal with new proto structures.

ryan-williams mentioned this pull request Feb 18, 2019

[DO NOT MERGE] [BEAM-4775] Second take on portable metrics over the job-server API #7823

Closed

ryan-williams added 3 commits February 19, 2019 00:32

Clean up metric protos; support integer distributions, gauges

86534b7

SimpleMonitoringInfoBuilder: support distributions, gauges

b2f6bc0

allow custom delimiter in Metric{Key,Name}.toString

55eadb2

ryan-williams force-pushed the mp branch 2 times, most recently from ccf32fc to 7366084 Compare February 19, 2019 09:33

ajamato reviewed Feb 19, 2019

View reviewed changes

ryan-williams force-pushed the mp branch from 7366084 to 6266c75 Compare February 20, 2019 04:41

ryan-williams added 2 commits February 20, 2019 05:01

unify MetricName creation for user- vs system-metrics

29d3f48

get flink metrics working with new protos

3477683

ryan-williams force-pushed the mp branch from 6266c75 to 3477683 Compare February 20, 2019 05:03

ryan-williams commented Feb 20, 2019

View reviewed changes

ajamato reviewed Feb 20, 2019

View reviewed changes

ryan-williams closed this Feb 21, 2019

ryan-williams mentioned this pull request Feb 23, 2019

[BEAM-4775] add GetJobMetrics RPC #7934

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BEAM-4775] Clean up metric protos; support integer distributions, gauges #7876

[BEAM-4775] Clean up metric protos; support integer distributions, gauges #7876

ryan-williams commented Feb 18, 2019 •

edited

Loading

ryan-williams commented Feb 19, 2019

ajamato Feb 19, 2019

ryan-williams Feb 20, 2019

ajamato Feb 19, 2019

ryan-williams Feb 20, 2019

ajamato Feb 19, 2019

Ardagan Feb 19, 2019

ryan-williams Feb 20, 2019

robertwb Feb 20, 2019

ryan-williams Feb 20, 2019

ajamato Feb 20, 2019

ajamato Feb 19, 2019

ryan-williams Feb 20, 2019

ajamato Feb 19, 2019

ryan-williams Feb 20, 2019

ajamato Feb 19, 2019

ryan-williams Feb 20, 2019

ajamato commented Feb 19, 2019

ryan-williams left a comment

ryan-williams Feb 20, 2019

ryan-williams Feb 20, 2019

ryan-williams Feb 20, 2019

ryan-williams Feb 20, 2019

ryan-williams Feb 20, 2019

ryan-williams Feb 20, 2019

ajamato Feb 20, 2019

ajamato Feb 20, 2019

ryan-williams commented Feb 21, 2019

[BEAM-4775] Clean up metric protos; support integer distributions, gauges #7876

[BEAM-4775] Clean up metric protos; support integer distributions, gauges #7876

Conversation

ryan-williams commented Feb 18, 2019 • edited Loading

Post-Commit Tests Status (on master branch)

ryan-williams commented Feb 19, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MetricKey vs MetricName

Special-casing single-label ptransform/pcollection-scoped metrics

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajamato commented Feb 19, 2019

ryan-williams left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MetricKey vs MetricName

Special-casing single-label ptransform/pcollection-scoped metrics

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryan-williams commented Feb 21, 2019

ryan-williams commented Feb 18, 2019 •

edited

Loading