-
Notifications
You must be signed in to change notification settings - Fork 13.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-6221] Add PrometheusReporter #3833
Conversation
I've glanced over this for now but it looks pretty good. If possible though i would like to not use the DropwizardExports class, for 2 reasons:
|
Thanks for looking at it so quickly! I somewhat had the same instinct as far as your first point is concerned and thought about pulling out a Edit: Now that I think about it, this might also be advantageous as one could use Prometheus' multi-dimensional data model more flexibly that way. |
@mbode thanks for working on this! One thing that I've found useful when exporting Flink's statsd metrics to Prometheus is to make several of the metric fields tags: like |
@hadronzoo I tried to keep it as general as possible by exporting all variables available to the metric group as labels. I am not sure if this might lead to label overuse at some point, did you ever run into difficulties in that context? Unfortunately I do not have a lot of experience running Prometheus in a production environment yet. |
As it stands a metric can have at most 10 labels, and there aren't any current plans to extend this set; according to the docs that's still ok. If this does indeed become a problem we can add a switch to rely on the scope formats instead, giving the user control as to how many labels there are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this a lot more than the previous version.
I haven't found anything major so far; but it would be good to shade the dependencies and get rid of guava as we don't actually use it much.
|
||
@PublicEvolving | ||
public class PrometheusReporter implements MetricReporter { | ||
private static final Logger log = LoggerFactory.getLogger(PrometheusReporter.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by convention this should be upper-case.
</dependency> | ||
|
||
<dependency> | ||
<groupId>com.google.guava</groupId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to avoid guava unless absolutely necessary, and virtually all modules that do depend on it shade it away.
public class PrometheusReporter implements MetricReporter { | ||
private static final Logger log = LoggerFactory.getLogger(PrometheusReporter.class); | ||
|
||
static final String ARG_PORT = "port"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation here in inconsistent with the rest of the code base.
private static final char SCOPE_SEPARATOR = '_'; | ||
|
||
private PrometheusEndpoint prometheusEndpoint; | ||
private Map<String, Collector> collectorsByMetricName = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be final.
try { | ||
prometheusEndpoint.start(NanoHTTPD.SOCKET_READ_TIMEOUT, true); | ||
} catch (IOException e) { | ||
log.error("Could not start PrometheusEndpoint on port " + port, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to throw an exception here despite the current interface. Otherwise the reporter will still be notified of added/removed metrics, even though there's no benefit to it.
} else if (metric instanceof Meter) { | ||
collector = createCounter((Meter) metric, validMetricName, metricIdentifier, dimensionKeys, dimensionValues); | ||
} else { | ||
log.error("Cannot add unknown metric type: {}. This indicates that the metric type is not supported by this reporter.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention classes that are part of the metric system never log something as errors but only warnings, the reason being that errors are something that interfere with the job being executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed it. The MetricRegistry
has a few error
logs in it as well, logging things that probably don't interfere with job execution either (for example in the constructor, where it tries to open reporters). Do you want me to change them to warn
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be done in a separate issue.
.setChild(new io.prometheus.client.Counter.Child() { | ||
@Override | ||
public double get() { | ||
return (double) meter.getCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also export the rate of the meter as a prometheus gauge.
Personally i think we can drop this method completely; I'm not too big of a fan of exporting the count of a meter; if that would be of interest there should always be a separate counter for it. (and as far as Flink is concerned, there is)
return HOST_NAME + SCOPE_SEPARATOR + "taskmanager" + SCOPE_SEPARATOR + TASK_MANAGER + SCOPE_SEPARATOR + metricName; | ||
} | ||
|
||
private static MetricRegistry prepareMetricRegistry() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is a bit overkill for a one-liner that is called once.
@Rule | ||
public ExpectedException thrown = ExpectedException.none(); | ||
|
||
private MetricRegistry registry = prepareMetricRegistry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be final.
this.histogram = histogram; | ||
this.metricName = metricName; | ||
this.metricIdentifier = metricIdentifier; | ||
labelNamesWithQuantile = ImmutableList.<String>builder().addAll(labelNames).add("quantile").build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be prefixed with this
as well for consistency.
…ements. - Fix code style - Only log warn's in metric system, no errors - Avoid using guava - Throw exception on failed startup - Export rate for meters instead of count - Add application prefix to exported metrics
Okay, guava is not used anymore. I am not sure about the shading part. Would you expect either prometheus clients or nanohttpd to be used in Flink user code? Or are there other advantages to shading? If so, could you point me to a module I could copy the "Flink way of shading" from? |
} | ||
|
||
private Collector createGauge(final Meter meter, final String name, final String identifier, final List<String> labelNames, final List<String> labelValues) { | ||
return newGauge(name + "_rate", identifier, labelNames, labelValues, new io.prometheus.client.Gauge.Child() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meter metrics should contain "rate" in their name already, so we don't have to append "_rate" here.
Yu can find an example on how to shade here: https://github.com/apache/flink/blob/master/flink-metrics/flink-metrics-datadog/pom.xml. Shading dependencies in reporters/connectors has become a safety-precaution form our side. It is not that unlikely that user-code contains the same dependencies. For one user-code also includes other reporters, so by the very existence of this reporter there is a precedent :) Besides that something that pops up from time to time on the mailing lists is users talking to REST endpoints in their functions/source/sinks, which may also rely on http-related dependencies. |
- Shade dependencies - Do not append "_rate" to meters - Add information about exported labels to docs
- Shade dependencies - Do not append "_rate" to meters - Add information about exported labels to docs
@zentol Would you mind checking that I got the shading right? |
I'll take a look later today. |
Shading looks correct; dependencies are included and the correct jar is included in flink-dist. |
<parent> | ||
<groupId>org.apache.flink</groupId> | ||
<artifactId>flink-metrics</artifactId> | ||
<version>1.3-SNAPSHOT</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's outdated by now. Sorry :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 😃
…ough ScheduledDropwizardReporter.
…ements. - Fix code style - Only log warn's in metric system, no errors - Avoid using guava - Throw exception on failed startup - Export rate for meters instead of count - Add application prefix to exported metrics
- Shade dependencies - Do not append "_rate" to meters - Add information about exported labels to docs
Oh, I broke the stricter checkstyle. To me, it feels a bit weird to have to put javadoc on tests, is that intended? |
I think it is intended. The majority of tests can be (and are) commented with "Tests for the {@link }.", which is totally OK because it tells us that this is the general test battery for that class and not something else. These are also easy to add so there's not a real overhead in doing so. But this doesn't apply to all of them; and you only really notice the missing javadoc when you stumble on a test and you spend 10 minutes trying to figure out what it is doing because the method names aren't helping at all. The widely-accepted notion that the name of a class and test method are supposed to be enough documentation for test is quite questionable IMO; and by enforcing a javadoc on the class we give a little nudge to make the developer hopefully think about whether something needs to be documented or not. |
Another thing to consider is that this rule also covers the myriad of testing utilities that so far did not have to be documented. |
I see, makes sense. Finally got around to fixing the dependency and getting a Green Travis. |
Finally found time to try this out, and it works great, merging. |
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
This closes apache#3833.
No description provided.