Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-4389] Expose metrics to WebFrontend #2363

Closed
wants to merge 4 commits into from

Conversation

zentol
Copy link
Contributor

@zentol zentol commented Aug 12, 2016

This PR exposes metrics to the Webfrontend, as proposed in FLIP-7.

This PR builds on-top of #2300, meaning that 2866f56 is not part of the PR.

I've split the implementation into 5 commits that implement

  • the generation of a separate scope string for the WebInterface
  • the MetricQueryService, a separate actor running on all Job-/TaskManagers whose main purpose is to create and return a dump of the metrics when queried to do so
  • the MetricStore, a nested data structure used in the WebInterface to store transmitted metrics
  • the MetricFetcher, which is used by the WebInterface to fetch metrics from Job-/TaskManagers
  • various MetricsHandler classes, which handle REST calls requesting specific metrics

MetricQueryService

The MetricQueryService is an actor running inside the MetricRegistry acting like an unscheduled reporter that is queried from the outside for a report. The MetricRegistry notifies it of added/removed metrics whereas the MetricFetcher sends report requests to the JM/TM which are then forwarded to the MetricQueryService, which answers directly to the MetricFetcher.

The report is one big Object[], which contains for each metric

  1. the type of the metric, encoded as a byte (so that we know how many values are transmitted)
  2. the fully qualified metric name (based on the separate format)
  3. the value(s) of the metric (turned into Strings for Gauges)

MetricStore

The MetricStore is a relatively simple nested data-structure that contains one HashMap<String, Object> for every JM/TM/job/task. Received metrics are added to these HashMaps based on the format string. There is only a single MetricStore instance in the WebInterface.

MetricFetcher

The MetricFetcher initiates the transfer and cleanup of metrics. It contains the MetricStore instance, which is accessed by MetricHandlers. The fetching is only done when a handler asks for it, with a minimum duration of 10 seconds between updates. As such no fetching will be done if the metrics are not accessed with REST calls.

The fetching procedure can be summed up in pseudo-code as following:

fetch():
    askJobManagerForJobDetails()
        => retain all metrics belonging to the given jobs
    askJobManagerForMetrics()
        => add received metrics to MetricStore
    askJobManagerForRegisteredTaskManagers()
        => retain all metrics belonging to registered task managers
        => for each TaskManager:
            askTaskManagerForMetrics()
                => add received metrics to MetricStore

MetricsHandler

The MetricsHandlers deal with two requests:

  • getAllAvailableMetrics - any REST request that does not have a get query parameter is treated as a request for all available metrics for a given JM/TM/job/task, denoted by the REST path. The reply will be a JSON array, for example: [{"id":"metric_1"},{"id":"metric_2"}]
  • getMetricValues - the Webfrontend can request the values for several metrics by passing a comma-separated list of metric id's as the get query parameter. The reply will be a JSON array of id:value pairs, for example: [{"id":"metric_1", "value":"4"}] or an empty string if an error occurred.

@tillrohrmann
Copy link
Contributor

The test case TaskManagerComponentsStartupShutdownTest.testComponentsStartupShutdown fails on Travis.

@zentol
Copy link
Contributor Author

zentol commented Aug 18, 2016

Fixed the failing test.

* Abstract request handler that returns a list of all available metrics or the values for a set of metrics.
*
* If the query parameters do not contain a "get" parameter the list of all metrics is returned.
* {@code {"available": [ { "name" : "X", "id" : "X" } ] } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"name" : "X" won't be written, will it?

Copy link
Contributor Author

@zentol zentol Aug 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That javadoc is a bit outdated, it should be {@code [ { "id" : "X" } ] }

@tillrohrmann
Copy link
Contributor

Thanks for your contribution @zentol. I've gone over the code and made some inline comments. My main concern/question is actually the representation of metric's type and hierarchy information. I think that encoding it in a string and then re-parsing it on the receiver side to reconstruct the information is rather fragile and error-prone especially wrt maintainability. Maybe you can give me some background why you decided to do it so.

Apart from that, I think the code contains many tests, which I really like :-)

@zentol
Copy link
Contributor Author

zentol commented Aug 19, 2016

@tillrohrmann I've addressed most of your comments. Excluded are calling checkNotNull inside formatScope and the serializer/serialization format.

@tillrohrmann
Copy link
Contributor

But we still send metric data as strings encoded over the wire and have no checks that the histogram field order is actually correct, right?

@zentol
Copy link
Contributor Author

zentol commented Aug 22, 2016

only Gauge values are sent as strings.

@tillrohrmann
Copy link
Contributor

Sorry, I meant that the hierarchy information is still encoded in a string and then re-parsed. Furthermore, the histogram data is sent as an object array without any information about the field orderings.

@zentol
Copy link
Contributor Author

zentol commented Aug 22, 2016

well...currently that is still done. Whether it will be done once this is merged is up in the air.

@tillrohrmann
Copy link
Contributor

I think this should be addressed (either way) before merging this PR.

@zentol
Copy link
Contributor Author

zentol commented Aug 22, 2016

Regarding hierarchy: I'm close to being done with a container for the scope information.

@tillrohrmann
Copy link
Contributor

Great to hear @zentol 👍

@zentol zentol force-pushed the 4389_metrics_exposed branch 2 times, most recently from 986127f to 0122061 Compare August 25, 2016 13:16
@zentol
Copy link
Contributor Author

zentol commented Aug 25, 2016

I've updated and rebased the PR.

The scope information is now stored in a QueryScopeInfo inside the MetricGroups, for which sub-classes like TaskQueryScopeInfo exist. They contain fields for specific values, like the job ID, and the remaining scope not covered by these fields.

Metrics, or rather their scope, name and value(s), are now serialized with a new MetricDumpSerializer to a byte[] using a Data-/ByteArrayOutputStream.

On the other end we have the MetricDumpDeserializer which deserializes the metrics to a List<MetricDump>. A MetricDumpis a container for the metric value, it's name the QueryScopeInfo. There are sub-classes for each metric type, like CounterDump.

Neither the MetricQueryService nor MetricFetcher know anything about the serialized format, just that it's a byte array.

There is no encoding for field orderings but tests that verify that the fields are assigned correctly. If a developer were to change the order of fields a test would fail, and the only way for this to make it into master would be if a) the test is simply changed to give a green light and b) it isn't noticed in the review, at which point all bets are of anyway. So i decided to keep it a bit simpler.

The MetricStore#addMetric() method has now become a bit smarter in regards to handling Histograms. With all values being contained in a HistogramDump we now only have to analyze the scope once.

}

switch (info.getCategory()) {
case INFO_CATEGORY_JM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of having an explicit type field over using instanceof? I think encoding the type via the actual type has the advantage that you don't mix up classes with wrong category types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, seemed like the proper way of handling it. Also, (up to) 4 comparisons vs a jump.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. Performance-wise it is the more efficient way to execute it, no doubt. I was just wondering whether this is not a case of premature optimization with the price of harder maintainability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, it does not seem too overly complicated to be not maintainable. With that in mind, my other comments are mainly obsolete.

@tillrohrmann
Copy link
Contributor

I think the changes look good. Thanks for your work @zentol :-) I only had a minor question whether we can substitute the explicit category information by the type information of the metric dumps and the QueryScopeInfo instances (not for serialization but in the MetricStore).

@zentol
Copy link
Contributor Author

zentol commented Sep 8, 2016

I'll address the checkNotNull/comment formatting while merging, which I'm doing now. Thank you for looking over it again @tillrohrmann .

zentol added a commit to zentol/flink that referenced this pull request Sep 8, 2016
@asfgit asfgit closed this in 70704de Sep 15, 2016
@zentol zentol deleted the 4389_metrics_exposed branch September 15, 2016 17:45
liuyuzhong pushed a commit to liuyuzhong/flink that referenced this pull request Dec 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants