-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12896] [WIP] Send only accumulator updates to driver, not TaskMetrics #10857
Conversation
commit 269031f Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 16:33:12 2016 -0800 Remove unused method commit c04b5df Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 16:13:08 2016 -0800 Review comments commit d2e4e23 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 14:42:19 2016 -0800 One more commit 202d48e Merge: e99b9af 4f11e3f Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 14:27:47 2016 -0800 Merge branch 'master' of github.com:apache/spark into get-or-create-metrics commit e99b9af Merge: 34c7ce5 b8cb548 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 13:56:41 2016 -0800 Merge branch 'master' of github.com:apache/spark into get-or-create-metrics Conflicts: core/src/main/scala/org/apache/spark/CacheManager.scala core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala core/src/test/scala/org/apache/spark/CacheManagerSuite.scala commit 34c7ce5 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 12:46:42 2016 -0800 Hide updatedBlocks commit ad094f0 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 12:30:59 2016 -0800 Clean up JsonProtocol This commit collapsed 10 methods into 2. The 8 that were inlined were only used in 1 place each, and the body of each was quite small. The additional level of abstraction did not add much value and made the code verbose. commit 0785984 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 12:20:28 2016 -0800 Replace set with register JsonProtocol remains the only place where we still call set on each of the *Metrics classes. commit b9d7fbf Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 12:10:17 2016 -0800 Clean up places where we set OutputMetrics Note: there's one remaining place, which is JsonProtocol. commit 62c96e1 Author: Andrew Or <andrew@databricks.com> Date: Mon Jan 18 11:50:04 2016 -0800 Add register* methods (get or create)
Previously we would always zero out an accumulator when we deserialize it. Certainly we don't want to do that on the driver. The changes in this commit are temporary and will be reverted in SPARK-12896.
Tests don't pass yet, obviously...
+ fix tests, which are still failing
... by setting it to zero on the executors, always.
…e-accums Conflicts: core/src/main/scala/org/apache/spark/CacheManager.scala core/src/main/scala/org/apache/spark/executor/Executor.scala core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala core/src/main/scala/org/apache/spark/util/JsonProtocol.scala core/src/test/scala/org/apache/spark/ui/jobs/JobProgressListenerSuite.scala core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
Also add some useful error printing if things don't match.
We never updated the original ShuffleReadMetrics accumulators; we were updating a copy.
… into dont-send-task-metrics
As of this commit, we currently only send accumulator updates from the executors to the driver. Note that executors still send back TaskMetrics to the driver when they heartbeat. This will be resolved in a future commit. There are a number of API breaking changes in this commit. These will be detailed in a higher level audit of all public APIs. Tests do not compile yet.
These changes were introduced because SPARK-12895 makes us send accumulators both ways between drivers and executors. This is no longer true as of the previous commit, so we can remove these TODOs and the code associated with them.
Test build #49835 has finished for PR 10857 at commit
|
Test build #49842 has finished for PR 10857 at commit
|
While I realize that this has the potential to create merge conflicts, it would be super helpful if you could squash all of the commits which came from #10835 into a single commit here in order to make it easier to spot the delta from that PR to this one. |
Seq[String]( | ||
output.WRITE_METHOD, | ||
output.BYTES_WRITTEN, | ||
output.RECORDS_WRITTEN).map(create) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will count on failures, not consistent than SQL ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll add a comment.
I just merged this into #10835 |
This patch builds on top of #10835, so the actual diff should be much smaller. This would complete SPARK-10620, which replaces much of the logic in TaskMetrics with accumulators. This was once part of #10717. It was split into its own patch so it's more reviewable.
The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This is because TaskMetrics can be represented as a collection of accumulators (as we have done so in #10835) and we can just reconstruct them on the driver based on the accumulator updates.
While an effort has been made to preserve as much of the public API as possible, there were a few known breaking @DeveloperAPI changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
Lastly, this is still WIP because more tests are pending.