[SPARK-10620][WIP] Migrate TaskMetrics to accumulators #10717
Conversation
There are a bunch of `decrementX` methods that were never used. Also, a few `setX` methods could easily have been `incrementX` instead. The latter change is more in line with accumulators.
This commit uses the existing PEAK_EXECUTION_MEMORY mechanism to bring a few other fields in TaskMetrics to use accumulators.
This commit ports ShuffleReadMetrics to use accumulators, preserving as much of the existing semantics as possible. It also introduces a nicer way to organize all the internal accumulators by namespacing them.
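The namespacing of internal accumulators mentioned here can be sketched roughly as follows. This is an illustrative sketch only; the constant names and the exact `internal.metrics.` prefix are assumptions, not necessarily what the patch uses:

```scala
// Sketch: grouping internal accumulator names under a common namespace
// so that, e.g., all shuffle-read metrics share one prefix.
// (Names are illustrative, not the patch's actual constants.)
object InternalAccumulator {
  val METRICS_PREFIX = "internal.metrics."
  val SHUFFLE_READ_PREFIX = METRICS_PREFIX + "shuffleRead."

  object shuffleRead {
    val REMOTE_BLOCKS_FETCHED = SHUFFLE_READ_PREFIX + "remoteBlocksFetched"
    val LOCAL_BLOCKS_FETCHED  = SHUFFLE_READ_PREFIX + "localBlocksFetched"
    val REMOTE_BYTES_READ     = SHUFFLE_READ_PREFIX + "remoteBytesRead"
  }
}
```

Grouping names this way lets the driver filter or aggregate a whole family of metrics by prefix instead of special-casing each field.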
This commit was a little tricky because it removed the bytes-read callback from TaskMetrics and related classes. It does change behavior: we now update the number of bytes read periodically (every 1000 records) instead of on every executor heartbeat. The advantage here is code simplicity.
Tests are still failing as of this commit. E.g. SortShuffleSuite.
Tests were previously failing because we end up double counting metrics in local mode. This is because each TaskContext shares the same list of accumulators, so they end up updating the metrics on top of each other. The fix is to ensure TaskContext clears any existing values on the accumulators before passing them on.
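The double-counting problem and its fix can be illustrated with a toy accumulator (all names here are hypothetical; this is not Spark's actual accumulator API):

```scala
// Toy accumulator illustrating why accumulators shared across tasks in
// local mode must be cleared before each task reuses them.
class Accum(var value: Long = 0L) {
  def add(v: Long): Unit = value += v
  def reset(): Unit = value = 0L
}

// All "tasks" in local mode share the same accumulator instance.
val shared = new Accum()

def runTask(recordsRead: Long): Long = {
  shared.reset()          // the fix: clear any existing value first
  shared.add(recordsRead)
  shared.value            // without reset(), task 2 would report task 1's
}                         // count on top of its own
```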
The exception was harmless because it didn't actually fail the test. However, the test harness was actually badly written. We used to always assume that the first job will have an ID of 0, but there could very well be other tests sharing the same SparkContext. This is now fixed and we no longer see the exception. As of this commit, all known test failures have been fixed. I'm sure there will be more...
Instead of passing in a callback, TaskMetrics can simply return the accumulator values directly, since it already holds them.
…-accums

Conflicts:
  core/src/main/scala/org/apache/spark/TaskContextImpl.scala
  sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala
Instead, send only accumulator updates. As of this commit TaskMetrics is only used as a syntactic sugar on the executor side to modify accumulator values by names. Now we no longer send the same thing in two different codepaths. Now that we never send TaskMetrics from executors to the driver, we also never send accumulators that way. Then we can revert some of the accumulator changes.
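The "syntactic sugar" role described here can be sketched very loosely as follows. The class name, metric names, and methods are made up for illustration; this is not the patch's actual API:

```scala
// Minimal sketch: TaskMetrics as a thin wrapper that increments
// accumulators looked up by name, and exposes their values as the only
// payload sent to the driver. (All names are hypothetical.)
class LongAccum { var value = 0L; def add(v: Long): Unit = value += v }

class TaskMetricsSketch(private val accums: Map[String, LongAccum]) {
  private def accum(name: String): LongAccum =
    accums.getOrElse(name, sys.error(s"unknown metric: $name"))

  def incRecordsRead(v: Long): Unit = accum("input.recordsRead").add(v)

  // Executors send only these updates; there is no separate
  // TaskMetrics payload on the wire.
  def accumulatorUpdates(): Map[String, Long] =
    accums.map { case (name, a) => name -> a.value }
}
```

With this shape, the driver reconstructs metrics from accumulator updates alone, which is what makes the second code path unnecessary.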
In the previous commit, we made accumulator communication one-way again, which is the same as before this patch, so we restored all the semantics involved in serializing accumulators as before. Note: tests are still failing because of a duplicate accumulator name in some SQL things. Run `DataFrameCallbackSuite` for more detail.
Currently we still get values for tasks that fail. We should keep this semantics in the new accumulator updates as well.
There are a few places where we passed in empty internal accumulators to TaskContextImpl, so the TaskMetrics creation would fail. These are now fixed.
Before this commit the SQL UI would not display any accumulators. This is because it is powered by the SQLListener, which reads accumulators from TaskMetrics. However, we did not update the accumulator values before posting the TaskMetrics, so the UI never saw the updates from the tasks. This commit also fixes a few related test failures.
Now internal accumulators no longer need to have unique names. This was an unnecessary hack for the SQL accumulators that can be reverted through some clean ups.
for readability.
A few bugs:
1. In `Executor.scala`, we updated TaskMetrics after collecting the accumulator values. We should do it in the other order.
2. The test utility method for verifying whether peak execution memory is set imposed this requirement on every single job run in the test body. This does not hold for SQL's external sort, because one of the jobs does a sample and so does not update peak execution memory.
3. We were getting accumulators from executors that were not registered on the driver. Not exactly sure what the cause is, but it could very well have to do with GC on the driver, since we use weak references there. We shouldn't crash the scheduler if this happens.
Such that downstream listeners can access their values. This commit also generalizes the internal accumulator type from Long to anything, since we need to store the read and write methods of InputMetrics and OutputMetrics respectively.
This fixes a bug where when we reconstruct TaskMetrics we just pass in mutable accumulators, such that when new tasks come in they change the values of the old tasks. A more subtle bug here is that we were passing in the accumulated values instead of the local task values. Both are now fixed. TODO: write a test for all of these please.
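The distinction between accumulated values and per-task local values can be illustrated with a toy accumulator. All names here are hypothetical and not Spark's actual `Accumulable` API:

```scala
// Toy accumulator separating the driver-side accumulated total from the
// current task's local contribution. Reconstructing a task's metrics
// should use localValue, not the running total across tasks.
class Accum2 {
  private var total = 0L   // accumulated across all tasks (driver side)
  private var local = 0L   // this task's contribution only
  def add(v: Long): Unit = { local += v; total += v }
  def localValue: Long = local
  def value: Long = total
  def resetLocal(): Unit = local = 0L  // called between tasks
}
```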
The fake accumulator values should no longer all be Longs. Ugh.
Test build #49482 has started for PR 10717 at commit
Test build #49480 has started for PR 10717 at commit
…-accums

Conflicts:
  core/src/test/scala/org/apache/spark/storage/StorageStatusListenerSuite.scala
  core/src/test/scala/org/apache/spark/ui/storage/StorageTabSuite.scala
  core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
Force-pushed from 23af334 to b58f2e6 (compare)
retest this please
This patch is really difficult to review since it combines simple refactoring with more subtle logic. Is it possible to break it up? If there were patches that did simple cleanup, those could get reviewed and merged very quickly.
Yeah, it's still WIP, but once it's in a more ready state I'll try to break it up.
Test build #49485 has finished for PR 10717 at commit
Test build #49509 has finished for PR 10717 at commit
Test build #49601 has finished for PR 10717 at commit
OK, I broke this down into smaller PRs:
- #10810 - move classes to their own files for readability (MERGED)
This is a small step in implementing SPARK-10620, which migrates `TaskMetrics` to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just moves classes to their own files to avoid having single monolithic ones that contain 10 different classes. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Closes #10810 from andrewor14/move-things.
By the end of all the smaller patches there will be too much conflict to resolve, so I'm closing this.
Note to self: DO NOT DELETE THIS BRANCH!!
This is a small step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just renames 3 fields for consistency. Today we have:

```
inputMetrics.recordsRead
outputMetrics.bytesWritten
shuffleReadMetrics.localBlocksFetched
...
shuffleWriteMetrics.shuffleRecordsWritten
shuffleWriteMetrics.shuffleBytesWritten
shuffleWriteMetrics.shuffleWriteTime
```

The shuffle write ones are kind of redundant. We can drop the `shuffle` part in the method names. I added backward compatible (but deprecated) methods with the old names.

Parent PR: #10717

Author: Andrew Or <andrew@databricks.com>

Closes #10811 from andrewor14/rename-things.
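The backward-compatible rename pattern described above can be sketched like this. The class name, field, and `since` version are illustrative assumptions, not the patch's exact code:

```scala
// Sketch: rename a metric while keeping a deprecated alias with the old
// name, so existing callers keep compiling. (Names are illustrative.)
class ShuffleWriteMetricsSketch {
  private var _bytesWritten = 0L

  // New, shorter name: the enclosing class already says "shuffle write".
  def bytesWritten: Long = _bytesWritten
  def incBytesWritten(v: Long): Unit = _bytesWritten += v

  // Old name preserved as a deprecated forwarder.
  @deprecated("use bytesWritten instead", "2.0.0")
  def shuffleBytesWritten: Long = bytesWritten
}
```

Callers using the old name get a deprecation warning but identical behavior, which lets the rename ship without a breaking change.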
This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug. Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them. Parent PR: apache#10717 Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes apache#10815 from andrewor14/get-or-create-metrics.
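The get-or-create semantics described here can be sketched as follows; class and method names are hypothetical, not the patch's actual API:

```scala
// Sketch: a metrics group is created lazily exactly once and then
// reused, so a later caller cannot accidentally overwrite values that
// an earlier caller already set. (All names are illustrative.)
class Metrics { var recordsRead = 0L }

class TaskMetricsLike {
  private var _inputMetrics: Option[Metrics] = None

  def registerInputMetrics(): Metrics = _inputMetrics match {
    case Some(m) => m               // reuse: never replace an existing group
    case None =>
      val m = new Metrics
      _inputMetrics = Some(m)
      m
  }
}
```

Because every code path goes through the same get-or-create method, it no longer matters who increments the metrics first; the aggregated values cannot be silently reset.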
The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:

**SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.

**SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.

While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.

Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.

Author: Andrew Or <andrew@databricks.com>

Closes #10835 from andrewor14/task-metrics-use-accums.
There exist two mechanisms to pass metrics from executors to drivers: accumulators and `TaskMetrics`. Currently we send both to the driver using two separate code paths. This is an unnecessary maintenance burden and makes the code more difficult to follow.

This patch proposes that we send only accumulator updates to the driver. Additionally, it reimplements `TaskMetrics` using accumulators such that the new `TaskMetrics` serves mainly as syntactic sugar to increment and access the values of the underlying accumulators. It migrates the rest of the metrics to adopt the code path already used by the existing `PEAK_EXECUTION_MEMORY`.

While an effort has been made to preserve as much of the public API as possible, there were a few known breaking `@DeveloperApi` changes that would be very awkward to maintain. These are:

- `TaskMetrics#hostname` field was removed; the event log was the only consumer
- `ExceptionFailure#taskMetrics` field was replaced with `accumUpdates`
- `SparkListenerExecutorMetricsUpdate#taskMetrics` field was replaced with `accumUpdates`
- `AccumulableInfo#update` changed from `Option[String]` to `Option[Any]`
- `AccumulableInfo#value` changed from `String` to `Option[Any]`

The following event log elements are changed:

- `AccumulableInfo` "Update" field changed from `Option[String]` to `Option[Any]`
- `AccumulableInfo` "Value" field changed from `String` to `Option[Any]`
This is WIP because I would like to add tests for some of the intricate cases that I ran into while implementing this.
I broke this down into several smaller PRs:
- `ShuffleWriteMetrics`