
Add deregistering of metrics #929

Closed
wants to merge 2 commits

Conversation

mr-celo
Contributor

@mr-celo mr-celo commented May 2, 2024

Upon job end, the driver may call a method to both clean the tracking hashmap and deregister the source from the MetricsSystem.

The new method will remove all metrics belonging to a given job UUID.

@mr-celo mr-celo requested a review from a team as a code owner May 2, 2024 13:46
@mr-celo mr-celo requested a review from kornelione May 2, 2024 13:46
@github-actions github-actions bot requested a review from dmivankov May 2, 2024 13:46
```scala
def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
  val key = s"$metricNamespace.$jobId"
  val removed =
    metricsMap.keys().asScala.filter(k => k.startsWith(key)).map(k => (k, metricsMap.remove(k)))
```
Contributor Author
This tuple is created just for logging purposes. I intend to remove the logging (and simplify this) after a bit of validation.
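For reference, a minimal sketch of what the simplified version (without the logging tuple) could look like. Note that `StubSource`, `StubMetricsSystem`, and the `ConcurrentHashMap`-backed `metricsMap` below are stand-ins invented for illustration, not the real Spark types:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Stand-ins for Spark's Source and MetricsSystem, for illustration only.
final case class StubSource(name: String)
final class StubMetricsSystem {
  var removed: List[StubSource] = Nil
  def removeSource(s: StubSource): Unit = removed = s :: removed
}

object MetricsCleanup {
  val metricsMap = new ConcurrentHashMap[String, StubSource]()
  val metricsSystem = new StubMetricsSystem

  // Simplified version without the logging tuple: remove every tracked metric
  // whose key starts with "<namespace>.<jobId>" and deregister its source.
  def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
    val prefix = s"$metricNamespace.$jobId"
    metricsMap.keys().asScala
      .filter(_.startsWith(prefix))
      .foreach { key =>
        // remove() can race with another remover, so guard against null.
        Option(metricsMap.remove(key)).foreach(metricsSystem.removeSource)
      }
  }
}
```

ConcurrentHashMap enumerations are weakly consistent, so removing entries while iterating the key view is safe here.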

```scala
 * This method will deregister the metric from Spark's org.apache.spark.metrics.MetricsSystem
 * and stop tracking that it was published.
 */
def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
```
Contributor
Suggested change:

```diff
-def removeJobMetrics(metricNamespace: String, jobId: String): Unit = {
+def removeJobMetrics(metricNamespace: String): Unit = {
```

I think that in the datasource and in this class, metricNamespace is actually the metricsPrefix Spark parameter and doesn't need jobId appended. A further hint is that jobId wasn't mentioned in this class before.

Contributor Author
Yeah, this is not super clear, and it was a bit of a pain to stitch together (the JDBC sink adds a bunch of parameters, and it was hard to follow where each came from).
Essentially, it is true that from the point of view of the datasource there is no jobId. But metricNamespace already contains the jobId when the method is called, so from the point of view of the driver, it does have it.

But this is a very good point. There is no reason to expose this to the datasource. I will remove it here and let the backend do its thing with the knowledge of what the driver does. Will do 👍

https://github.com/cognitedata/jetfire-backend/blob/93088e6bfc03784137e41aea20a63ff9be89405d/driver/src/main/scala/com/cognite/jetfire/driver/TransformService.scala#L112
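To make the division of responsibility concrete, here is a hypothetical sketch of the driver-side call site: the driver bakes the job UUID into the namespace, so the datasource-facing API only needs the single namespace parameter. The `"transformations"` prefix value is an assumption for illustration, not taken from the actual configuration:

```scala
import java.util.UUID

// Hypothetical driver-side setup: the driver builds the namespace with the
// job UUID embedded, so the datasource never sees a separate jobId.
val metricsPrefix = "transformations" // assumed value of the Spark metricsPrefix setting
val jobId = UUID.randomUUID().toString
val jobNamespace = s"$metricsPrefix.$jobId"

// On job end, the driver would call the single-parameter form from the
// suggestion, e.g. removeJobMetrics(jobNamespace), which prefix-matches
// all tracked metric keys for this job.
```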

```scala
  .foreach({
    case (key, source) => {
      logger.info(s"Deleting and removing counter for $key")
      metricsSystem.removeSource(source)
```
Contributor
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Dashboard/Spark_dropwizard_metrics_info.md

> Spark is instrumented with the Dropwizard/Codahale metrics library. Several components of Spark are instrumented; notably for this work, several metrics generated from the Spark driver and executor components can be instrumented.
>
> An important architectural detail is that the metrics are sent directly from the sources to the sink. This is important when running on a cluster, as each executor communicates directly with the sink, for better scalability.

So it seems that when we call this from the driver, it will not clean up metrics on the executors, but I didn't verify what actually happens when there is a driver and separate executors. It would also be good to double-check whether all registered metrics continue to be exported. I just found https://github.com/cognitedata/spark-jdbc/blob/master/src/main/scala/org/apache/spark/metrics/sink/JdbcSink.scala#L89, which seems to try pruning them a little bit (but only the counter type).

Contributor Author
I don't understand why it wouldn't remove them from the executors based on this.
Is it because the instance that removes the sources and the counters will be different in the executor? Meaning that none will be found in the map, and therefore none will be removed as a source?

Contributor
I suspect that even though the metrics registry is placed in SparkEnv, it might still be isolated per JVM.
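A toy illustration of that per-JVM isolation hypothesis (not real Spark code; `JvmLocalRegistry` is an invented stand-in): if each JVM holds its own registry instance, removing a source on the driver cannot affect the copy registered inside an executor JVM.

```scala
import scala.collection.mutable

// Stand-in for a per-JVM metrics registry; each JVM would own one instance.
final class JvmLocalRegistry {
  private val sources = mutable.Map.empty[String, String]
  def register(name: String, kind: String): Unit = sources(name) = kind
  def removeSource(name: String): Unit = sources -= name
  def names: Set[String] = sources.keySet.toSet
}

val driverMetrics = new JvmLocalRegistry   // lives in the driver JVM
val executorMetrics = new JvmLocalRegistry // lives in a separate executor JVM

driverMetrics.register("ns.job1.rows", "counter")
executorMetrics.register("ns.job1.rows", "counter")

// Driver-side cleanup only touches the driver's registry; the executor's
// copy of the metric is untouched because removal does not propagate.
driverMetrics.removeSource("ns.job1.rows")
```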


codecov bot commented May 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.01%. Comparing base (0b3eb30) to head (c0f0ba9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #929      +/-   ##
==========================================
+ Coverage   80.96%   81.01%   +0.05%     
==========================================
  Files          46       46              
  Lines        3057     3066       +9     
  Branches      123      120       -3     
==========================================
+ Hits         2475     2484       +9     
  Misses        582      582              
Files Coverage Δ
...rc/main/scala/cognite/spark/v1/MetricsSource.scala 100.00% <100.00%> (ø)


github-actions bot commented May 6, 2024

This pull request seems a bit stale.. Shall we invite more to the party?

@mr-celo mr-celo closed this Jun 11, 2024
@mr-celo mr-celo deleted the celo/CDF-21249/deregister-metrics branch June 11, 2024 12:08