
[SPARK-22343][core] Add support for publishing Spark metrics into Prometheus #19775

Closed

Conversation

@matyix (Member) commented Nov 17, 2017

What changes were proposed in this pull request?

This PR was originally submitted to the Spark on K8S fork, but @erikerlandson and @foxish advised resubmitting it upstream. The K8S-specific items were removed and the PR was reworked for the Apache version.

This PR publishes Spark metrics into Prometheus, as highlighted in the JIRA. It implements a metrics sink that pushes Spark metrics to Prometheus via the Prometheus Pushgateway. Spark's metrics are based on Dropwizard, a format that Prometheus does not support natively, so they are converted using DropwizardExports prior to being pushed to the Pushgateway.

Also, the default Prometheus Pushgateway client API does not support metric timestamps, so the client API has been enhanced to enrich metrics data with timestamps.
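For reference, a minimal sketch of what a timestamp-enriched sample looks like in the Prometheus text exposition format (the metric and label names here are illustrative, not taken from the PR's code):

```java
public class TimestampedSample {
    // The Prometheus text format allows an optional trailing timestamp
    // (milliseconds since epoch) after the sample value on each line.
    static String formatLine(String metric, String label, String labelValue,
                             double value, long timestampMs) {
        return String.format("%s{%s=\"%s\"} %s %d",
                metric, label, labelValue, value, timestampMs);
    }

    public static void main(String[] args) {
        // prints: jvm_heap_used{instance="driver"} 123.0 1518100000000
        System.out.println(formatLine("jvm_heap_used", "instance", "driver",
                123.0, 1518100000000L));
    }
}
```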

How was this patch tested?

This PR does not affect the existing code base or alter existing functionality. Nevertheless, all unit and integration tests were executed. The setup was also deployed and monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1).

Manual testing was done by deploying a Spark cluster, a Prometheus server and a Pushgateway, then running SparkPi.
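A sink like this would typically be enabled through conf/metrics.properties. The keys below are an illustrative guess at the shape of such a configuration, not the PR's exact property names:

```properties
# Enable the Prometheus sink for all instances (driver, executors, ...).
# Class name and property keys are illustrative.
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
# How often to push metrics, in the given unit.
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
```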

@matyix changed the title to "[SPARK-22343][core] Add support for publishing Spark metrics into Prometheus" Nov 17, 2017

@erikerlandson (Contributor) commented Nov 17, 2017

@matyix thanks for re-submitting!

@jerryshao (Contributor) commented Nov 24, 2017

Do we have to put this in Spark? Is it a necessary part of k8s? I think if we pull in that PR (#11994), then this can stay out of Spark as a package. Even without #11994, I believe users can still add their own metrics source/sink via the exposed SparkEnv/MetricsSystem. My concern is that this unnecessarily increases the code base of Spark core.

@matyix (Member, Author) commented Nov 24, 2017

@jerryshao this PR is not Kubernetes specific; it's an extension of the Spark metrics system, which is already part of core. We could externalize it if PR #11994 above ever gets merged; refactoring and externalizing it afterwards (as for the other metrics subsystems) is not much work. Although I submitted this PR first on the K8S fork, this feature could benefit everyone using Prometheus, which is becoming the de facto monitoring solution.

@felixcheung (Member) commented Nov 24, 2017

I agree this is useful to have, but @jerryshao is probably right: it is likely better to add extensibility to the metrics system.

@jerryshao I'll review your PR
@matyix could you also review #11994 to see if it would let you build everything you have here to connect to Prometheus, but external to Spark and on top of #11994? I think your feedback will be very valuable.

We can then come back to this PR as needed.

@matyix (Member, Author) commented Nov 27, 2017

Hello @felixcheung @jerryshao

PR #11994 generally looks good for adding extensibility to the metrics system. This PR (Prometheus) works with the changes proposed in PR #11994 out of the box, so no modification is needed on our side.

As Prometheus is becoming a widely used monitoring system, I personally see a lot of value in including this in the Spark code base, alongside the current metrics sinks. I see PR #11994 rather as a solution for those who build custom, specialised metrics sinks.

@AmplabJenkins commented Jan 18, 2018

Can one of the admins verify this patch?

@matyix (Member, Author) commented Feb 8, 2018

Hello @erikerlandson @felixcheung @jerryshao - any feedback on this PR? Shall I close it and stop worrying about it being merged upstream? We've been using this in production for the last 3 months, and it's a bit awkward that our CI/CD system needs to patch the upstream version all the time, but we can live with that (since it's automated). Please advise. Happy to help get it merged, or eventually just close it.

* This uses the PUT HTTP method.
* @deprecated use {@link #push(CollectorRegistry, String, Map)}
*/
@Deprecated

@smurakozi (Contributor) commented Feb 8, 2018

This is a new class, why should we include these deprecated methods?

@stoader commented Feb 9, 2018

Removed deprecated methods.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFormatWithTimestamp {

@smurakozi (Contributor) commented Feb 8, 2018

No doc.

@stoader
delete(job, Collections.singletonMap("instance", instance));
}

void doRequest(CollectorRegistry registry, String job, Map<String,

@smurakozi (Contributor) commented Feb 8, 2018

It should be private.

@stoader commented Feb 9, 2018

Changed it to private.


private static StringBuilder jsonMessageLogBuilder = new StringBuilder();

public static void write004(Writer writer,

@smurakozi (Contributor) commented Feb 8, 2018

No doc

@stoader
/* See http://prometheus.io/docs/instrumenting/exposition_formats/
* for the output format specification. */
while(mfs.hasMoreElements()) {
Collector.MetricFamilySamples metricFamilySamples = mfs.nextElement();

@smurakozi (Contributor) commented Feb 8, 2018

I think for(Collector.MetricFamilySamples s: Collections.list(mfs)) { would be nicer.
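As an aside, the suggested pattern works because java.util.Collections.list drains an Enumeration into a List, enabling an enhanced for loop. A self-contained illustration (using plain strings rather than Collector.MetricFamilySamples):

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Vector;

public class EnumerationLoop {
    // Collections.list(...) copies the remaining elements of an Enumeration
    // into an ArrayList, which can then be iterated with for-each.
    static List<String> drain(Enumeration<String> e) {
        return Collections.list(e);
    }

    public static void main(String[] args) {
        Enumeration<String> mfs = new Vector<>(List.of("counter", "gauge")).elements();
        for (String name : drain(mfs)) {   // instead of while (mfs.hasMoreElements()) { ... }
            System.out.println(name);
        }
    }
}
```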

@smurakozi (Contributor) commented Feb 8, 2018

Also, method body is not indented well.

@stoader commented Feb 9, 2018

This class has been refactored.

case _ => Map("role" -> role)
}


@smurakozi (Contributor) commented Feb 8, 2018

Nit: extra line

@stoader commented Feb 9, 2018

Empty line removed.


override def start(): Unit = {
sparkMetricExports.register(pushRegistry)

@smurakozi (Contributor) commented Feb 8, 2018

Nit: extra line

@stoader commented Feb 9, 2018

Empty line removed.

private[spark] class PrometheusSink(
val property: Properties,
val registry: MetricRegistry,
securityMgr: SecurityManager)

@smurakozi (Contributor) commented Feb 8, 2018

securityMgr is never used

@stoader commented Feb 9, 2018

The parameter list for a Sink is imposed by MetricsSystem, which instantiates the configured sinks (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L199). PrometheusSink doesn't need SecurityManager, which is why securityMgr is not used (similar to CsvSink and ConsoleSink).
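To illustrate why the constructor shape matters: MetricsSystem looks up each configured sink class by name and invokes a fixed-arity constructor reflectively, so every sink must accept all three parameters even if it ignores one. A self-contained sketch with stand-in types (MetricRegistry and SecurityMgr here are hypothetical placeholders, not Spark's classes):

```java
import java.util.Properties;

public class SinkLoading {
    // Stand-ins for Spark's MetricRegistry and SecurityManager types.
    static class MetricRegistry {}
    static class SecurityMgr {}

    public static class ConsoleLikeSink {
        final Properties props;
        final MetricRegistry registry;

        // The third argument is accepted to satisfy the expected constructor
        // signature but never used, as with CsvSink and ConsoleSink.
        public ConsoleLikeSink(Properties props, MetricRegistry registry,
                               SecurityMgr unusedSecurityMgr) {
            this.props = props;
            this.registry = registry;
        }
    }

    // Mirrors how a metrics system could instantiate a configured sink:
    // resolve the class by name and call the three-argument constructor.
    static ConsoleLikeSink load(String className, Properties p,
                                MetricRegistry r, SecurityMgr s) throws Exception {
        return (ConsoleLikeSink) Class.forName(className)
                .getConstructor(Properties.class, MetricRegistry.class, SecurityMgr.class)
                .newInstance(p, r, s);
    }
}
```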

} catch (Exception ex) {
logger.error("Sending metrics failed due to: ", ex);
}

@smurakozi (Contributor) commented Feb 8, 2018

Nit: extra line

@stoader commented Feb 9, 2018

Removed extra line.

}

finally {
connection.disconnect();

@smurakozi (Contributor) commented Feb 8, 2018

connection can be null if new URL(url).openConnection() at line 272 threw an exception.

@stoader commented Feb 9, 2018

The new URL(url).openConnection() call is outside the try block, so if it throws an exception, doRequest() exits before reaching finally { connection.disconnect(); }.
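In other words, the control flow looks like the sketch below (types and names are illustrative, not the PR's actual code): because the connection is opened before the try block, a failure to open never reaches the finally clause, so disconnect() is only ever called on a non-null connection.

```java
import java.io.IOException;

public class RequestSketch {
    interface Connection { void disconnect(); }

    static Connection open(boolean fail) throws IOException {
        if (fail) throw new IOException("cannot open");
        return () -> { };
    }

    static String doRequest(boolean failToOpen) throws IOException {
        // Opened OUTSIDE try/finally, mirroring the code under review:
        // if open() throws, we exit immediately and finally never runs.
        Connection connection = open(failToOpen);
        try {
            return "ok";
        } finally {
            connection.disconnect(); // only reached when open() succeeded
        }
    }
}
```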

@jerryshao (Contributor) commented Feb 9, 2018

My original intention was to expose a MetricsSystem-related interface in #11994, so that users could leverage it to build their own metrics sink/source outside of Spark. Unfortunately I'm stuck on #11994, but I still think it is better to leave this as a package outside of Spark; pulling in too many dependencies for non-core functionality seems unreasonable (just my thoughts).

@vanzin (Contributor) commented Feb 9, 2018

Regardless of the discussion about whether this should live in Spark, does it need to live in core?

Can it be kept in a separate module like the Ganglia stuff (even though that one is for licensing reasons)?

@stoader commented Feb 12, 2018

@smurakozi thank you for reviewing. The PR has been updated based on your comments.

@erikerlandson (Contributor) commented Feb 14, 2018

I agree with @jerryshao that adding new deps to core isn't ideal (and also that having #11994 would be really nice).
New deps on a sub-project seem more palatable, but I'm interested in what other Sparkers think.

@erikerlandson (Contributor) commented Feb 14, 2018

Although this is not kube-specific, kubernetes deployment is a major prometheus use case. Has it been tested in a kube environment?

@stoader


@GaalDornick commented Mar 8, 2018

So, where did we land in the discussion on this PR? Is this change getting in or not?

@matyix (Member, Author) commented Mar 8, 2018

@GaalDornick @erikerlandson @jerryshao @felixcheung et al.

We gave up on this. We have made the requested changes several times, and I am not willing to put more time into this and get in the middle of a debate that is not my concern. The Spark monitoring architecture is what it currently is, and we made the PR align with the existing architecture of the sinks and metrics subsystem. The debate is now not about whether this is good or needed, but about whether it should be part of Spark core, whether it should be pluggable, whether the whole metrics subsystem should be refactored, etc. Most likely this will still be the case later; once these questions are nailed down or agreed by all parties, I can rework and resend the PR.

Anyway, we (and our customers) have been using this in production for months. We have externalized it into a separate jar which we put on the classpath, so it does not need to be part of Spark (though I believe it should be, as Prometheus is one of the best open source monitoring frameworks).

Should anybody need help using this sink, drop me a mail at janos@banzaicloud.com; happy to help if you are interested in using Prometheus with Spark. We do some advanced scenarios with this sink and the code is all open sourced: you can read more in Monitoring Spark with Prometheus and Federated monitoring of multiple Spark clusters.

Thanks for all the support.
Janos

@matyix closed this Mar 8, 2018

@matyix (Member, Author) commented Mar 10, 2018

For those who are still interested in using Prometheus, you can get the standalone package and source code here: https://github.com/banzaicloud/spark-metrics. Happy monitoring; try to catch issues early and avoid those PagerDuty notifications :).

@andrusha (Contributor) commented Apr 24, 2018

You can also try using https://github.com/andrusha/dropwizard-prometheus, which implements pull instead of push.

@lony commented May 16, 2018

@andrusha Do you have a tutorial on how to set this up? From my understanding, if I have multiple executors, pulling is harder, as Prometheus has to know all the hostnames. Am I right or wrong?

@andrusha (Contributor) commented May 24, 2018

@mageru commented Aug 3, 2018

Did we really miss out on Prometheus metric functionality because people couldn't just be cool about it?
