[FLINK-4389] Expose metrics to WebFrontend #2363

zentol · 2016-08-12T12:43:20Z

This PR exposes metrics to the Webfrontend, as proposed in FLIP-7.

This PR builds on top of #2300, meaning that 2866f56 is not part of this PR.

I've split the implementation into 5 commits that implement

the generation of a separate scope string for the WebInterface
the MetricQueryService, a separate actor running on all Job-/TaskManagers whose main purpose is to create and return a dump of the metrics when queried to do so
the MetricStore, a nested data structure used in the WebInterface to store transmitted metrics
the MetricFetcher, which is used by the WebInterface to fetch metrics from Job-/TaskManagers
various MetricsHandler classes, which handle REST calls requesting specific metrics

MetricQueryService

The MetricQueryService is an actor running inside the MetricRegistry. It acts like an unscheduled reporter that only produces a report when queried from the outside. The MetricRegistry notifies it of added/removed metrics. The MetricFetcher sends report requests to the JM/TM, which forwards them to the MetricQueryService; the MetricQueryService then answers the MetricFetcher directly.

The report is one big Object[], which contains for each metric

the type of the metric, encoded as a byte (so that we know how many values are transmitted)
the fully qualified metric name (based on the separate format)
the value(s) of the metric (turned into Strings for Gauges)
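As an illustration of the layout described above, here is a minimal sketch of writing and reading such a flat report. The type tags, the class name and the helper methods are assumptions made for this example and are not taken from the PR; only counters and gauges are covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a flat Object[] report; tag values and names are assumptions.
final class FlatReportSketch {
    static final byte TYPE_COUNTER = 0; // followed by the name and one Long
    static final byte TYPE_GAUGE = 1;   // followed by the name and one String

    private final List<Object> entries = new ArrayList<>();

    void addCounter(String name, long count) {
        entries.add(TYPE_COUNTER);
        entries.add(name);
        entries.add(count);
    }

    void addGauge(String name, Object value) {
        entries.add(TYPE_GAUGE);
        entries.add(name);
        entries.add(String.valueOf(value)); // gauge values are turned into Strings
    }

    Object[] toReport() {
        return entries.toArray();
    }

    // Walks the report; the type byte tells the reader how many values follow the name.
    static List<String> describe(Object[] report) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < report.length) {
            byte type = (Byte) report[i++];
            String name = (String) report[i++];
            switch (type) {
                case TYPE_COUNTER:
                    out.add(name + "=" + (Long) report[i++]);
                    break;
                case TYPE_GAUGE:
                    out.add(name + "=" + (String) report[i++]);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown metric type: " + type);
            }
        }
        return out;
    }
}
```

The reader depends entirely on the type byte and on the position of each element, so a writer and reader that disagree on the layout would misinterpret every entry that follows.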

MetricStore

The MetricStore is a relatively simple nested data-structure that contains one HashMap<String, Object> for every JM/TM/job/task. Received metrics are added to these HashMaps based on the format string. There is only a single MetricStore instance in the WebInterface.
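A rough sketch of what such a nested structure could look like, assuming one map per JobManager, TaskManager, job and task. The class, field and method names here are hypothetical and simplified; they are not the actual MetricStore API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified store: one HashMap per JobManager / TaskManager / job / task.
final class NestedMetricStoreSketch {
    final Map<String, Object> jobManager = new HashMap<>();
    final Map<String, Map<String, Object>> taskManagers = new HashMap<>();
    final Map<String, JobStore> jobs = new HashMap<>();

    static final class JobStore {
        final Map<String, Object> metrics = new HashMap<>();
        final Map<String, Map<String, Object>> tasks = new HashMap<>();
    }

    void addTaskManagerMetric(String tmId, String name, Object value) {
        taskManagers.computeIfAbsent(tmId, k -> new HashMap<>()).put(name, value);
    }

    void addTaskMetric(String jobId, String taskId, String name, Object value) {
        jobs.computeIfAbsent(jobId, k -> new JobStore())
            .tasks.computeIfAbsent(taskId, k -> new HashMap<>())
            .put(name, value);
    }

    // Cleanup: drop everything that belongs to jobs that are no longer known.
    void retainJobs(Collection<String> activeJobIds) {
        jobs.keySet().retainAll(activeJobIds);
    }
}
```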

MetricFetcher

The MetricFetcher initiates the transfer and cleanup of metrics. It contains the MetricStore instance, which is accessed by the MetricsHandlers. Fetching is only done when a handler asks for it, with a minimum interval of 10 seconds between updates. As such, no fetching is done if the metrics are not accessed via REST calls.

The fetching procedure can be summed up in pseudo-code as follows:

fetch():
    askJobManagerForJobDetails()
        => retain all metrics belonging to the given jobs
    askJobManagerForMetrics()
        => add received metrics to MetricStore
    askJobManagerForRegisteredTaskManagers()
        => retain all metrics belonging to registered task managers
        => for each TaskManager:
            askTaskManagerForMetrics()
                => add received metrics to MetricStore
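
The on-demand, rate-limited triggering of fetch() could look roughly like the sketch below. The class name, the injected clock and the Runnable standing in for fetch() are assumptions made for illustration; the actual MetricFetcher talks to actors asynchronously, which is not modelled here.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: handlers call update(), which fetches at most once per 10 seconds.
final class ThrottledFetcherSketch {
    static final long MIN_INTERVAL_MS = 10_000L;

    private final LongSupplier clock; // injected so the interval can be tested
    private final Runnable fetch;     // stands in for the fetch() procedure above
    private long lastUpdate;
    private boolean fetchedOnce;

    ThrottledFetcherSketch(LongSupplier clock, Runnable fetch) {
        this.clock = clock;
        this.fetch = fetch;
    }

    // Returns true if a fetch was triggered, false if the last one is recent enough.
    synchronized boolean update() {
        long now = clock.getAsLong();
        if (fetchedOnce && now - lastUpdate < MIN_INTERVAL_MS) {
            return false;
        }
        fetchedOnce = true;
        lastUpdate = now;
        fetch.run();
        return true;
    }
}
```

Since update() is only ever called from a handler, nothing is fetched while nobody is querying the REST endpoints.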

MetricsHandler

The MetricsHandlers deal with two requests:

getAllAvailableMetrics - any REST request that does not have a get query parameter is treated as a request for all available metrics for a given JM/TM/job/task, denoted by the REST path. The reply will be a JSON array, for example: [{"id":"metric_1"},{"id":"metric_2"}]
getMetricValues - the Webfrontend can request the values for several metrics by passing a comma-separated list of metric IDs as the get query parameter. The reply will be a JSON array of id:value pairs, for example: [{"id":"metric_1", "value":"4"}] or an empty string if an error occurred.
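
The two request types can be sketched as follows. This is a simplified illustration using plain string building: the class and method names are hypothetical, unknown IDs are silently skipped (an assumption), and IDs and values are assumed not to need JSON escaping.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the two replies; not the actual handler code.
final class MetricsResponseSketch {
    // No "get" parameter: list the ID of every available metric.
    static String availableMetrics(Map<String, String> metrics) {
        StringJoiner json = new StringJoiner(",", "[", "]");
        for (String id : metrics.keySet()) {
            json.add("{\"id\":\"" + id + "\"}");
        }
        return json.toString();
    }

    // "get=a,b": return id/value pairs for the requested IDs that are known.
    static String metricValues(Map<String, String> metrics, String getParameter) {
        StringJoiner json = new StringJoiner(",", "[", "]");
        for (String id : getParameter.split(",")) {
            String value = metrics.get(id);
            if (value != null) {
                json.add("{\"id\":\"" + id + "\",\"value\":\"" + value + "\"}");
            }
        }
        return json.toString();
    }

    // Dispatches on the presence of the "get" query parameter.
    static String handle(Map<String, String> metrics, String getParameter) {
        return getParameter == null
            ? availableMetrics(metrics)
            : metricValues(metrics, getParameter);
    }
}
```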

tillrohrmann · 2016-08-18T14:03:36Z

The test case TaskManagerComponentsStartupShutdownTest.testComponentsStartupShutdown fails on Travis.

zentol · 2016-08-18T15:09:44Z

Fixed the failing test.

tillrohrmann · 2016-08-18T15:16:34Z

...me-web/src/main/java/org/apache/flink/runtime/webmonitor/metrics/AbstractMetricsHandler.java

+ * Abstract request handler that returns a list of all available metrics or the values for a set of metrics.
+ *
+ * If the query parameters do not contain a "get" parameter the list of all metrics is returned.
+ * {@code {"available": [ { "name" : "X", "id" : "X" } ] } }


"name" : "X" won't be written, will it?

That javadoc is a bit outdated, it should be {@code [ { "id" : "X" } ] }

tillrohrmann · 2016-08-18T17:11:47Z

Thanks for your contribution @zentol. I've gone over the code and made some inline comments. My main concern/question is actually the representation of the metrics' type and hierarchy information. I think that encoding it in a string and then re-parsing it on the receiver side to reconstruct the information is rather fragile and error-prone, especially wrt maintainability. Maybe you can give me some background on why you decided to do it this way.

Apart from that, I think the code contains many tests, which I really like :-)

zentol · 2016-08-19T12:35:16Z

@tillrohrmann I've addressed most of your comments. Excluded are calling checkNotNull inside formatScope and the serializer/serialization format.

tillrohrmann · 2016-08-22T09:36:35Z

But we still send metric data as strings encoded over the wire and have no checks that the histogram field order is actually correct, right?

zentol · 2016-08-22T09:55:03Z

only Gauge values are sent as strings.

tillrohrmann · 2016-08-22T10:03:11Z

Sorry, I meant that the hierarchy information is still encoded in a string and then re-parsed. Furthermore, the histogram data is sent as an object array without any information about the field orderings.

zentol · 2016-08-22T10:12:00Z

well...currently that is still done. Whether it will be done once this is merged is up in the air.

tillrohrmann · 2016-08-22T11:26:07Z

I think this should be addressed (either way) before merging this PR.

zentol · 2016-08-22T13:27:29Z

Regarding hierarchy: I'm close to being done with a container for the scope information.

tillrohrmann · 2016-08-23T16:03:04Z

Great to hear @zentol 👍

zentol · 2016-08-25T13:16:47Z

I've updated and rebased the PR.

The scope information is now stored in a QueryScopeInfo inside the MetricGroups, for which sub-classes like TaskQueryScopeInfo exist. They contain fields for specific values, like the job ID, and the remaining scope not covered by these fields.

Metrics, or rather their scope, name and value(s), are now serialized with a new MetricDumpSerializer to a byte[] using a Data-/ByteArrayOutputStream.

On the other end we have the MetricDumpDeserializer, which deserializes the metrics to a List<MetricDump>. A MetricDump is a container for the metric value, its name and the QueryScopeInfo. There are sub-classes for each metric type, like CounterDump.

Neither the MetricQueryService nor MetricFetcher know anything about the serialized format, just that it's a byte array.
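
To illustrate the general idea, here is a minimal round-trip sketch for counters only, using Data-/ByteArray streams. The byte layout, the class names and the plain String scope are assumptions made for this example and do not reflect the actual MetricDumpSerializer format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: counters are written as (scope, name, count) in a fixed field order.
final class CounterDumpCodecSketch {
    static final class CounterEntry {
        final String scope;
        final String name;
        final long count;

        CounterEntry(String scope, String name, long count) {
            this.scope = scope;
            this.name = name;
            this.count = count;
        }
    }

    static byte[] serialize(List<CounterEntry> counters) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(counters.size());
            for (CounterEntry c : counters) {
                out.writeUTF(c.scope);
                out.writeUTF(c.name);
                out.writeLong(c.count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Must read the fields in exactly the order serialize() wrote them.
    static List<CounterEntry> deserialize(byte[] data) {
        List<CounterEntry> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String scope = in.readUTF();
                String name = in.readUTF();
                long count = in.readLong();
                result.add(new CounterEntry(scope, name, count));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Callers on both sides only ever see the byte[], so the layout can change without touching the code that transports it.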

There is no encoding for field orderings, but there are tests that verify that the fields are assigned correctly. If a developer were to change the order of fields a test would fail, and the only way for this to make it into master would be if a) the test is simply changed to give a green light and b) it isn't noticed in the review, at which point all bets are off anyway. So I decided to keep it a bit simpler.

The MetricStore#addMetric() method has now become a bit smarter in regards to handling Histograms. With all values being contained in a HistogramDump we now only have to analyze the scope once.

tillrohrmann · 2016-09-05T14:10:50Z

flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/metrics/MetricStore.java

+			}
+
+			switch (info.getCategory()) {
+				case INFO_CATEGORY_JM:


What's the benefit of having an explicit type field over using instanceof? I think encoding the type via the actual type has the advantage that you don't mix up classes with wrong category types.

eh, seemed like the proper way of handling it. Also, (up to) 4 comparisons vs a jump.

That is true. Performance-wise it is the more efficient way to execute it, no doubt. I was just wondering whether this is not a case of premature optimization with the price of harder maintainability.

On the other hand, it does not seem too overly complicated to be not maintainable. With that in mind, my other comments are mainly obsolete.
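
For readers following this exchange, the two dispatch styles being compared look roughly like this. The sketch is simplified and hypothetical: only two categories are shown, and all class names, the INFO_CATEGORY_TM constant and the taskManagerId field are made up for illustration; only getCategory() and INFO_CATEGORY_JM appear in the diff above.

```java
// Hypothetical sketch comparing an explicit category field with instanceof checks.
final class ScopeDispatchSketch {
    static final byte INFO_CATEGORY_JM = 0;
    static final byte INFO_CATEGORY_TM = 1;

    abstract static class ScopeInfo {
        abstract byte getCategory();
    }

    static final class JmScopeInfo extends ScopeInfo {
        byte getCategory() { return INFO_CATEGORY_JM; }
    }

    static final class TmScopeInfo extends ScopeInfo {
        final String taskManagerId;
        TmScopeInfo(String taskManagerId) { this.taskManagerId = taskManagerId; }
        byte getCategory() { return INFO_CATEGORY_TM; }
    }

    // Style 1: switch on the category, then cast; relies on category and class agreeing.
    static String byCategory(ScopeInfo info) {
        switch (info.getCategory()) {
            case INFO_CATEGORY_JM:
                return "jobmanager";
            case INFO_CATEGORY_TM:
                return "taskmanager:" + ((TmScopeInfo) info).taskManagerId;
            default:
                throw new IllegalArgumentException("Unknown category: " + info.getCategory());
        }
    }

    // Style 2: the class itself is the category, so the two cannot get out of sync.
    static String byType(ScopeInfo info) {
        if (info instanceof JmScopeInfo) {
            return "jobmanager";
        }
        if (info instanceof TmScopeInfo) {
            return "taskmanager:" + ((TmScopeInfo) info).taskManagerId;
        }
        throw new IllegalArgumentException("Unknown type: " + info.getClass());
    }
}
```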

tillrohrmann · 2016-09-05T14:24:51Z

I think the changes look good. Thanks for your work @zentol :-) I only had a minor question whether we can substitute the explicit category information by the type information of the metric dumps and the QueryScopeInfo instances (not for serialization but in the MetricStore).

zentol · 2016-09-08T09:36:45Z

I'll address the checkNotNull/comment formatting while merging, which I'm doing now. Thank you for looking over it again @tillrohrmann .

This closes apache#2363

tillrohrmann reviewed Aug 18, 2016

zentol force-pushed the 4389_metrics_exposed branch 2 times, most recently from 986127f to 0122061 Compare August 25, 2016 13:16

zentol force-pushed the 4389_metrics_exposed branch from b4ffee2 to 202c823 Compare September 5, 2016 13:15

tillrohrmann reviewed Sep 5, 2016

zentol force-pushed the 4389_metrics_exposed branch from af21eb4 to e62a02b Compare September 8, 2016 10:01

zentol added a commit to zentol/flink that referenced this pull request Sep 8, 2016

[FLINK-4389] Expose metrics to WebFrontend

db1b3a3

This closes apache#2363

zentol force-pushed the 4389_metrics_exposed branch from 1f0c779 to 26f91a4 Compare September 8, 2016 13:46

zentol added 2 commits September 15, 2016 10:06

[FLINK-4389] Expose metrics to WebFrontend

c0c6e88

This closes apache#2363

[hotfix] [metrics] Add missing @internal annotations

2363530

zentol force-pushed the 4389_metrics_exposed branch from 282122d to 2363530 Compare September 15, 2016 08:07

zentol added 2 commits September 15, 2016 13:15

+

f83f2cc

++

a91831d

asfgit closed this in 70704de Sep 15, 2016

zentol deleted the 4389_metrics_exposed branch September 15, 2016 17:45

liuyuzhong pushed a commit to liuyuzhong/flink that referenced this pull request Dec 5, 2016

[FLINK-4389] Expose metrics to WebFrontend

3b1ccab

This closes apache#2363

rmetzger added component=Runtime/WebFrontend component=Runtime/Metrics labels Mar 14, 2019
