Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-6221] Add PrometheusReporter #3833

Closed
wants to merge 8 commits into from

Conversation

mbode
Copy link
Contributor

@mbode mbode commented May 6, 2017

No description provided.

@zentol
Copy link
Contributor

zentol commented May 6, 2017

I've glanced over this for now but it looks pretty good.

If possible though i would like to not use the DropwizardExports class, for 2 reasons:

  1. We now introduce this odd pattern of extending ScheduledDropwizardReporter without actually being scheduled nor creating an actual Dropwizard reporter.

  2. I've looked into the source code and it has the typical Dropwizard reporter problem where absolutely nothing gets cached and effectively constant objects are re-created again and again. We can make this way more efficient.

@mbode
Copy link
Contributor Author

mbode commented May 6, 2017

Thanks for looking at it so quickly! I somewhat had the same instinct as far as your first point is concerned and thought about pulling out a DropwizardReporter without Scheduling but decided against it to not have to touch too many places. I like your suggestion of converting metrics directly without using Dropwizard as an intermediate step and am going to try that.

Edit: Now that I think about it, this might also be advantageous as one could use Prometheus' multi-dimensional data model more flexibly that way.

@hadronzoo
Copy link
Contributor

@mbode thanks for working on this!

One thing that I've found useful when exporting Flink's statsd metrics to Prometheus is to make several of the metric fields tags: like job_name, task_name, operator_name, etc. This statsd-exporter mapping has tags that have worked well for me. I'm not tagging host names or IP addresses because Prometheus's Kubernetes support does that already, but that could be useful for people running standalone clusters.

@mbode
Copy link
Contributor Author

mbode commented May 13, 2017

@hadronzoo I tried to keep it as general as possible by exporting all variables available to the metric group as labels. I am not sure if this might lead to label overuse at some point, did you ever run into difficulties in that context? Unfortunately I do not have a lot of experience running Prometheus in a production environment yet.

@zentol
Copy link
Contributor

zentol commented May 13, 2017

As it stands a metric can have at most 10 labels, and there aren't any current plans to extend this set; according to the docs that's still ok. If this does indeed become a problem we can add a switch to rely on the scope formats instead, giving the user control as to how many labels there are.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot more than the previous version.

I haven't found anything major so far; but it would be good to shade the dependencies and get rid of guava as we don't actually use it much.


@PublicEvolving
public class PrometheusReporter implements MetricReporter {
private static final Logger log = LoggerFactory.getLogger(PrometheusReporter.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by convention this should be upper-case.

</dependency>

<dependency>
<groupId>com.google.guava</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid guava unless absolutely necessary, and virtually all modules that do depend on it shade it away.

public class PrometheusReporter implements MetricReporter {
private static final Logger log = LoggerFactory.getLogger(PrometheusReporter.class);

static final String ARG_PORT = "port";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation here in inconsistent with the rest of the code base.

private static final char SCOPE_SEPARATOR = '_';

private PrometheusEndpoint prometheusEndpoint;
private Map<String, Collector> collectorsByMetricName = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be final.

try {
prometheusEndpoint.start(NanoHTTPD.SOCKET_READ_TIMEOUT, true);
} catch (IOException e) {
log.error("Could not start PrometheusEndpoint on port " + port, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to throw an exception here despite the current interface. Otherwise the reporter will still be notified of added/removed metrics, even though there's no benefit to it.

} else if (metric instanceof Meter) {
collector = createCounter((Meter) metric, validMetricName, metricIdentifier, dimensionKeys, dimensionValues);
} else {
log.error("Cannot add unknown metric type: {}. This indicates that the metric type is not supported by this reporter.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention classes that are part of the metric system never log something as errors but only warnings, the reason being that errors are something that interfere with the job being executed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed it. The MetricRegistry has a few error logs in it as well, logging things that probably don't interfere with job execution either (for example in the constructor, where it tries to open reporters). Do you want me to change them to warn as well?

Copy link
Contributor

@zentol zentol May 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be done in a separate issue.

.setChild(new io.prometheus.client.Counter.Child() {
@Override
public double get() {
return (double) meter.getCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also export the rate of the meter as a prometheus gauge.

Personally i think we can drop this method completely; I'm not too big of a fan of exporting the count of a meter; if that would be of interest there should always be a separate counter for it. (and as far as Flink is concerned, there is)

return HOST_NAME + SCOPE_SEPARATOR + "taskmanager" + SCOPE_SEPARATOR + TASK_MANAGER + SCOPE_SEPARATOR + metricName;
}

private static MetricRegistry prepareMetricRegistry() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a bit overkill for a one-liner that is called once.

@Rule
public ExpectedException thrown = ExpectedException.none();

private MetricRegistry registry = prepareMetricRegistry();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be final.

this.histogram = histogram;
this.metricName = metricName;
this.metricIdentifier = metricIdentifier;
labelNamesWithQuantile = ImmutableList.<String>builder().addAll(labelNames).add("quantile").build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be prefixed with this as well for consistency.

mbode added a commit to mbode/flink that referenced this pull request May 14, 2017
…ements.

- Fix code style
- Only log warn's in metric system, no errors
- Avoid using guava
- Throw exception on failed startup
- Export rate for meters instead of count
- Add application prefix to exported metrics
@mbode
Copy link
Contributor Author

mbode commented May 14, 2017

Okay, guava is not used anymore. I am not sure about the shading part. Would you expect either prometheus clients or nanohttpd to be used in Flink user code? Or are there other advantages to shading? If so, could you point me to a module I could copy the "Flink way of shading" from?

}

private Collector createGauge(final Meter meter, final String name, final String identifier, final List<String> labelNames, final List<String> labelValues) {
return newGauge(name + "_rate", identifier, labelNames, labelValues, new io.prometheus.client.Gauge.Child() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meter metrics should contain "rate" in their name already, so we don't have to append "_rate" here.

@zentol
Copy link
Contributor

zentol commented May 14, 2017

Yu can find an example on how to shade here: https://github.com/apache/flink/blob/master/flink-metrics/flink-metrics-datadog/pom.xml.

Shading dependencies in reporters/connectors has become a safety-precaution form our side.

It is not that unlikely that user-code contains the same dependencies. For one user-code also includes other reporters, so by the very existence of this reporter there is a precedent :) Besides that something that pops up from time to time on the mailing lists is users talking to REST endpoints in their functions/source/sinks, which may also rely on http-related dependencies.

mbode added a commit to mbode/flink that referenced this pull request May 14, 2017
- Shade dependencies
- Do not append "_rate" to meters
- Add information about exported labels to docs
mbode added a commit to mbode/flink that referenced this pull request May 14, 2017
- Shade dependencies
- Do not append "_rate" to meters
- Add information about exported labels to docs
@mbode
Copy link
Contributor Author

mbode commented May 18, 2017

@zentol Would you mind checking that I got the shading right?

@zentol
Copy link
Contributor

zentol commented May 19, 2017

I'll take a look later today.

@zentol
Copy link
Contributor

zentol commented May 19, 2017

Shading looks correct; dependencies are included and the correct jar is included in flink-dist.

<parent>
<groupId>org.apache.flink</groupId>
<artifactId>flink-metrics</artifactId>
<version>1.3-SNAPSHOT</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's outdated by now. Sorry :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 😃

mbode added 6 commits June 1, 2017 18:12
…ements.

- Fix code style
- Only log warn's in metric system, no errors
- Avoid using guava
- Throw exception on failed startup
- Export rate for meters instead of count
- Add application prefix to exported metrics
- Shade dependencies
- Do not append "_rate" to meters
- Add information about exported labels to docs
@mbode
Copy link
Contributor Author

mbode commented Jun 2, 2017

Oh, I broke the stricter checkstyle. To me, it feels a bit weird to have to put javadoc on tests, is that intended?

@zentol
Copy link
Contributor

zentol commented Jun 2, 2017

I think it is intended. The majority of tests can be (and are) commented with "Tests for the {@link }.", which is totally OK because it tells us that this is the general test battery for that class and not something else. These are also easy to add so there's not a real overhead in doing so.

But this doesn't apply to all of them; and you only really notice the missing javadoc when you stumble on a test and you spend 10 minutes trying to figure out what it is doing because the method names aren't helping at all.

The widely-accepted notion that the name of a class and test method are supposed to be enough documentation for test is quite questionable IMO; and by enforcing a javadoc on the class we give a little nudge to make the developer hopefully think about whether something needs to be documented or not.

@zentol
Copy link
Contributor

zentol commented Jun 2, 2017

Another thing to consider is that this rule also covers the myriad of testing utilities that so far did not have to be documented.

@mbode
Copy link
Contributor Author

mbode commented Jun 9, 2017

I see, makes sense. Finally got around to fixing the dependency and getting a Green Travis.

@zentol
Copy link
Contributor

zentol commented Jun 26, 2017

Finally found time to try this out, and it works great, merging.

zentol pushed a commit to zentol/flink that referenced this pull request Jun 26, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 27, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 27, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 28, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 28, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 28, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 29, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 29, 2017
zentol pushed a commit to zentol/flink that referenced this pull request Jun 30, 2017
@asfgit asfgit closed this in bb9505d Jul 1, 2017
@mbode mbode deleted the PrometheusReporter branch August 24, 2017 13:54
skidder pushed a commit to muxinc/flink that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants