[SPARK-31711][CORE] Register the executor source with the metrics system when running in local mode. #28528
Conversation
ok to test
@@ -803,6 +803,13 @@ package object config {
      .booleanConf
      .createWithDefault(true)

+  private[spark] val METRICS_EXECUTOR_SOURCE_ENABLED =
+    ConfigBuilder("spark.metrics.executorSource.enabled")
Hi, @LucaCanali. This is a completely orthogonal topic from the PR title, "Register the executor source with the metrics system when running in local mode". Could you make a separate PR for this first?
Technically, spark.metrics.executorSource.enabled=false doesn't make sense to me. Do you need to disable this?
Thanks @dongjoon-hyun for the quick reaction on this. My reasoning for adding the possibility of turning the executor source off in this PR was that I thought it could help adoption, in the sense that it could be a safety net for those who may be impacted by the addition of this list of new driver metrics (in local mode). Also, it can be seen as a "symmetric partner" of spark.metrics.executorMetricsSource.enabled. I actually have no use case that needs to turn the executor source metrics off, so I can for sure split this part out.
Never mind. Can we reuse spark.metrics.executorMetricsSource.enabled in the local mode?
ExecutorMetricsSource is a new one in 3.0, but ExecutorSource is a very old one, dating back to Spark 1.x. If we don't have a use case, let's not add this configuration.
OK.
Test build #122624 has finished for PR 28528 at commit
a6e23f1 to 52c22ce
Test build #122631 has finished for PR 28528 at commit
FYI, the dependency failure will be fixed by the following.
Retest this please.
Test build #122651 has finished for PR 28528 at commit
Can we test this again, please?
Sorry for being late.
Retest this please.
Test build #123102 has finished for PR 28528 at commit
@@ -121,7 +121,7 @@ private[spark] class Executor(
   // create. The map key is a task id.
   private val taskReaperForTask: HashMap[Long, TaskReaper] = HashMap[Long, TaskReaper]()

-  val executorMetricsSource =
+  private val executorMetricsSource =
This seems to be irrelevant to this PR.
  }
Please remove this.
@@ -134,8 +134,11 @@ private[spark] class Executor(
     env.metricsSystem.registerSource(new JVMCPUSource())
     executorMetricsSource.foreach(_.register(env.metricsSystem))
     env.metricsSystem.registerSource(env.blockManager.shuffleMetricsSource)
+  } else {
+    Executor.executorSource = executorSource
What happens if we call env.metricsSystem.registerSource(executorSource) here?
Good question, this is the actual pain point this PR tries to solve: one cannot simply register metrics on env here when running in local mode, because the appId would not yet be available, so the resulting output would be missing a key piece of information (the appId). That's why I propose to register the metrics in the SparkContext, together with other driver metrics. As you can see, other metrics namespaces use a similar strategy.
To get around the issue, for local mode, I propose the workaround of storing the executorSource in the companion object so it can be read later; I hope this is acceptable.
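The stash-then-register handoff described here can be sketched without Spark at all. All names below (MetricsRegistry, Worker, sourceForLocalModeOnly) are illustrative stand-ins for this comment only, not Spark's actual API:

```scala
// Spark-free sketch of the workaround: a class created before the metrics
// system knows the appId stashes its source in the companion object, and the
// later-initialized driver side registers it once the appId is known.
class MetricsRegistry(appId: String) {
  private var names: List[String] = Nil
  // Register a source under the app-scoped namespace, e.g. "appId.executor".
  def register(source: String): Unit = names = s"$appId.$source" :: names
  def registered: List[String] = names.reverse
}

class Worker {
  private val source = "executor"
  // Local mode: the registry is not usable yet (no appId), so stash the
  // source in the companion object for the driver to register later.
  Worker.sourceForLocalModeOnly = Some(source)
}

object Worker {
  var sourceForLocalModeOnly: Option[String] = None
}

// Later, once the appId is known (SparkContext init in the real code):
new Worker()
val registry = new MetricsRegistry("local-1589450000000")
Worker.sourceForLocalModeOnly.foreach(registry.register)
println(registry.registered)  // List(local-1589450000000.executor)
```

The cost of this pattern, as noted later in this thread, is mutable global state that outlives the instance that set it.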
I can think of other ways to do this more cleanly but it would be quite a bit more code change. I do think we should explicitly make a comment about the appId so that someone who comes and looks at this later realizes that
core/src/test/scala/org/apache/spark/metrics/source/SourceConfigSuite.scala
@LucaCanali . Could you add more concrete examples into the PR description for the following claim, please?
I have added to the PR description some additional context and a short explanation of why I think Spark users can find this useful.
Test build #123122 has finished for PR 28528 at commit
Retest this please.
Test build #124309 has finished for PR 28528 at commit
It looks like there was a problem with the Jenkins build system?
e2ebe65 to 70d55ed
Test build #126177 has finished for PR 28528 at commit
Hi @dongjoon-hyun, do you have further comments or suggestions for improvements on this?
70d55ed to ae1b206
@@ -1153,6 +1153,11 @@ This is the component with the largest amount of instrumented metrics
 - namespace=JVMCPU
   - jvmCpuTime

+- namespace=executor
this looks the same as the ExecutorMetrics but I think they are actually different in that this doesn't give you the JVM metrics - correct? Perhaps we need to update the one below.
I agree that the naming could be improved; in particular, metrics under namespace=executor and namespace=ExecutorMetrics are similar in scope. However, the implementation goes via quite different paths: [[ExecutorSource]] vs. [[ExecutorMetricsSource]]. Merging the two could be the subject of future refactoring.
Just to clarify: metrics in the namespace=ExecutorMetrics are already available in local mode. The goal of this PR is to make metrics in the namespace="executor" also available in local mode.
@@ -623,6 +623,9 @@ class SparkContext(config: SparkConf) extends Logging {

   // Post init
   _taskScheduler.postStartHook()
+  if (isLocal) {
+    _env.metricsSystem.registerSource(Executor.executorSource)
So I haven't gone and looked, but if I don't configure the metrics system files, does this cause any overhead?
I have not directly measured it, but I'd say the overhead from computing the metrics in the executor namespace is small. Also, we are talking about a change that only affects Spark running in local mode.
However, one important point related to the user impact of this PR is a consequence of the fact that by default metrics are sunk using the MetricsServlet, which attaches to the WebUI. This is because of the default values "*.sink.servlet.class" = "org.apache.spark.metrics.sink.MetricsServlet" and "*.sink.servlet.path" = "/metrics/json" (see https://spark.apache.org/docs/latest/monitoring.html#metrics). This means that executor metrics in local mode will be visible under the WebUI too (http://localhost:4040/metrics/json/).
If we want to be extra cautious about this, we can introduce a config (this has been discussed above in this PR, and the discussion there pointed to not having the extra config).
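Besides the default servlet sink, a user could also route these metrics to files. A minimal metrics.properties sketch (sink property names as documented in the Spark monitoring guide and metrics.properties.template; the period and directory values here are arbitrary examples):

```properties
# Keep the default MetricsServlet sink (WebUI at /metrics/json) and
# additionally write all sources, including the executor source, to CSV files.
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```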
@@ -135,6 +135,8 @@ private[spark] class Executor(
     env.metricsSystem.registerSource(new JVMCPUSource())
     executorMetricsSource.foreach(_.register(env.metricsSystem))
     env.metricsSystem.registerSource(env.blockManager.shuffleMetricsSource)
+  } else {
+    Executor.executorSource = executorSource
We now have 2 executorSource variables, one in the object and one in the class. That seems a bit weird/confusing. I see the reason why, but perhaps we can rename the object one to have LocalModeOnly in the name, or something similar, so it is not accidentally used in non-local mode.
Test build #128652 has finished for PR 28528 at commit
Test build #128650 has finished for PR 28528 at commit
… object following review.
Test build #128813 has finished for PR 28528 at commit
@tgraves, @dongjoon-hyun, thanks for the reviews and comments. Do you think this is getting ready, or are more work and changes needed?
Looks fine to me
Jenkins, test this please
Test build #129351 has started for PR 28528 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130008 has finished for PR 28528 at commit
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130617 has finished for PR 28528 at commit
sorry for my delay on this, I'm going to merge
merged to master, thanks @LucaCanali
Thank you @tgravescs
@@ -979,4 +984,7 @@ private[spark] object Executor {
   // task is fully deserialized. When possible, the TaskContext.getLocalProperty call should be
   // used instead.
   val taskDeserializationProps: ThreadLocal[Properties] = new ThreadLocal[Properties]

+  // Used to store executorSource, for local mode only
+  var executorSourceLocalModeOnly: ExecutorSource = null
Nit: Looks like this one is never cleaned up. It would be great to avoid using a global executorSourceLocalModeOnly to save the state of a specific executor. Can we move this to SparkEnv so that the state of one test won't leak into other tests?
Thanks @zsxwing , I'll have a look at it.
What changes were proposed in this pull request?
This PR proposes to register the executor source with the Spark metrics system when running in local mode.
Why are the changes needed?
The Apache Spark metrics system provides many useful insights on the Spark workload.
In particular, the executor source metrics provide detailed info, including the number of active tasks, I/O metrics, and several task metrics details. The executor source metrics, contrary to other sources (for example the ExecutorMetrics source), are not available when running in local mode.
Having executor metrics in local mode can be useful when testing and troubleshooting Spark workloads in a development environment. The metrics can be fed to a dashboard to see the evolution of resource usage and can be used to troubleshoot performance, as in this example.
Currently users will have to deploy on a cluster to be able to collect executor source metrics, while the possibility of having them in local mode is handy for testing.
Does this PR introduce any user-facing change?
How was this patch tested?
SourceConfigSuite