
Add deregistering of metrics #929

Closed
wants to merge 2 commits

Conversation

mr-celo
Contributor

@mr-celo mr-celo commented May 2, 2024

Upon job end, the driver may call a method to both clean the tracking hashmap and deregister the source from the MetricsSystem.

The new method will remove all metrics belonging to a given job UUID.

@mr-celo mr-celo requested a review from a team as a code owner May 2, 2024 13:46
@mr-celo mr-celo requested a review from kornelione May 2, 2024 13:46
@github-actions github-actions bot requested a review from dmivankov May 2, 2024 13:46
```scala
def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
  val key = s"$metricNamespace.$jobId"
  val removed =
    metricsMap.keys().asScala.filter(k => k.startsWith(key)).map(k => (k, metricsMap.remove(k)))
```
Contributor Author
This tuple is created just for logging purposes. I intend to remove the logging (and simplify this) after a bit of validation.
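For reference, a minimal sketch of what the simplified version (without the logging tuple) could look like. Note that `StubSource`, `StubMetricsSystem`, and the `ConcurrentHashMap`-backed `metricsMap` below are stand-ins invented for illustration, not the real Spark types:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Stand-ins for Spark's Source and MetricsSystem, for illustration only.
final case class StubSource(name: String)
final class StubMetricsSystem {
  var removed: List[StubSource] = Nil
  def removeSource(s: StubSource): Unit = removed = s :: removed
}

object MetricsCleanup {
  val metricsMap = new ConcurrentHashMap[String, StubSource]()
  val metricsSystem = new StubMetricsSystem

  // Simplified version without the logging tuple: remove every tracked metric
  // whose key starts with "<namespace>.<jobId>" and deregister its source.
  def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
    val prefix = s"$metricNamespace.$jobId"
    metricsMap.keys().asScala
      .filter(_.startsWith(prefix))
      .foreach { key =>
        // remove() can race with another remover, so guard against null.
        Option(metricsMap.remove(key)).foreach(metricsSystem.removeSource)
      }
  }
}
```

ConcurrentHashMap enumerations are weakly consistent, so removing entries while iterating the key view is safe here.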

```scala
 * This method will deregister the metric from Spark's org.apache.spark.metrics.MetricsSystem
 * and stop tracking that it was published.
 */
def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
```
Contributor
Suggested change:

```diff
-def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
+def removeJobMetrics(metricNamespace: String): Unit = {
```

I think that in the datasource and in this class, metricNamespace is actually the metricsPrefix Spark parameter and doesn't need jobId appended. A further hint is that jobId wasn't mentioned in this class before.

Contributor Author
Yeah, this is not super clear, and it was a bit of a pain to stitch together (the JDBC sink adds a bunch of parameters, and it was hard to follow where each came from).
Essentially, it is true that from the point of view of the datasource there is no jobId. But metricNamespace already contains the jobId when the method is called, so from the point of view of the driver, it does have it.

But this is a very good point. There is no reason to expose this to the datasource. I will remove it here and let the backend do its thing with the knowledge of what the driver does. Will do 👍

https://github.com/cognitedata/jetfire-backend/blob/93088e6bfc03784137e41aea20a63ff9be89405d/driver/src/main/scala/com/cognite/jetfire/driver/TransformService.scala#L112
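To make the division of responsibility concrete, here is a hypothetical sketch of the driver-side call site: the driver bakes the job UUID into the namespace, so the datasource-facing API only needs the single namespace parameter. The `"transformations"` prefix value is an assumption for illustration, not taken from the actual configuration:

```scala
import java.util.UUID

// Hypothetical driver-side setup: the driver builds the namespace with the
// job UUID embedded, so the datasource never sees a separate jobId.
val metricsPrefix = "transformations" // assumed value of the Spark metricsPrefix setting
val jobId = UUID.randomUUID().toString
val jobNamespace = s"$metricsPrefix.$jobId"

// On job end, the driver would call the single-parameter form from the
// suggestion, e.g. removeJobMetrics(jobNamespace), which prefix-matches
// all tracked metric keys for this job.
```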

```scala
  .foreach({
    case (key, source) => {
      logger.info(s"Deleting and removing counter for $key")
      metricsSystem.removeSource(source)
```
Contributor
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Dashboard/Spark_dropwizard_metrics_info.md

> Spark is instrumented with the Dropwizard/Codahale metrics library. Several components of Spark are instrumented; notably for this work, several metrics generated from the Spark driver and executor components can be instrumented.
>
> An important architectural detail is that the metrics are sent directly from the sources to the sink. This is important when running on a cluster, as each executor communicates directly with the sink, for better scalability.

So it seems that when we call this from the driver, it will not clean up metrics on the executors, but I didn't verify what actually happens when there is a driver and separate executors. It would also be good to double-check whether all registered metrics continue to be exported. I just found https://github.com/cognitedata/spark-jdbc/blob/master/src/main/scala/org/apache/spark/metrics/sink/JdbcSink.scala#L89, which seems to try pruning them a little bit (but only the counter type).

Contributor Author
I don't understand why it wouldn't remove them from the executors based on this.
Is it because the instance that removes the sources and the counters will be different in the executor? Meaning that none will be found in the map, and therefore none will be removed as a source?

Contributor
I suspect that even though the metrics registry is placed in SparkEnv, it might still be isolated per JVM.
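A toy illustration of that per-JVM isolation hypothesis (not real Spark code; `JvmLocalRegistry` is an invented stand-in): if each JVM holds its own registry instance, removing a source on the driver cannot affect the copy registered inside an executor JVM.

```scala
import scala.collection.mutable

// Stand-in for a per-JVM metrics registry; each JVM would own one instance.
final class JvmLocalRegistry {
  private val sources = mutable.Map.empty[String, String]
  def register(name: String, kind: String): Unit = sources(name) = kind
  def removeSource(name: String): Unit = sources -= name
  def names: Set[String] = sources.keySet.toSet
}

val driverMetrics = new JvmLocalRegistry   // lives in the driver JVM
val executorMetrics = new JvmLocalRegistry // lives in a separate executor JVM

driverMetrics.register("ns.job1.rows", "counter")
executorMetrics.register("ns.job1.rows", "counter")

// Driver-side cleanup only touches the driver's registry; the executor's
// copy of the metric is untouched because removal does not propagate.
driverMetrics.removeSource("ns.job1.rows")
```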


codecov bot commented May 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.01%. Comparing base (0b3eb30) to head (c0f0ba9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #929      +/-   ##
==========================================
+ Coverage   80.96%   81.01%   +0.05%     
==========================================
  Files          46       46              
  Lines        3057     3066       +9     
  Branches      123      120       -3     
==========================================
+ Hits         2475     2484       +9     
  Misses        582      582              
Files Coverage Δ
...rc/main/scala/cognite/spark/v1/MetricsSource.scala 100.00% <100.00%> (ø)


github-actions bot commented May 6, 2024

This pull request seems a bit stale.. Shall we invite more to the party?

@mr-celo mr-celo closed this Jun 11, 2024
@mr-celo mr-celo deleted the celo/CDF-21249/deregister-metrics branch June 11, 2024 12:08