[SPARK-20659][Core] Removing sc.getExecutorStorageStatus and making StorageStatus private #20546
Conversation
If this change goes into the 2.3 branch then MimaExcludes.scala should be changed accordingly.

This won't go into 2.3. Also, please don't copy & paste the bug title in your PR. Explain what you're doing instead. The current title does not explain what the change does.

Test build #87216 has finished for PR 20546 at commit
This is in the right direction, but I think it's worth trying to rewrite getRDDStorageInfo using the data from the status store instead. That might allow more code to go away.
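For illustration, a hedged sketch of reading the same data from the status store (sc.statusStore is private[spark], so this only compiles from Spark's own code; rddList and the v1.RDDStorageInfo field names are assumptions based on the status store's REST API types):

    // Sketch: cached-RDD storage info taken from the app status store
    // instead of StorageStatus.
    val cachedRdds = sc.statusStore.rddList(cachedOnly = true)
    cachedRdds.foreach { info =>
      println(s"RDD ${info.id} '${info.name}': " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"memory=${info.memoryUsed}B, disk=${info.diskUsed}B")
    }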
    getRDDStorageInfo(_ => true)
  }

  private[spark] def getRDDStorageInfo(filter: RDD[_] => Boolean): Array[RDDInfo] = {
There's a single call to this method outside of tests, in RDD.toDebugString. That to me makes it another candidate to go away and be replaced with information from the AppStatusStore. Then maybe you can remove more code from StorageStatus.
Have you taken a look at that?
I have found something, and I am not sure whether it is a bug, or where to look regarding its correction:
Using rddStorageInfo.numCachedPartitions gives back a different value than the rddInfo.numCachedPartitions computed/updated by the old storage utils.
This is why I changed the assert in org.apache.spark.repl.SingletonReplSuite at "replicating blocks of object with class defined in repl".
Would it be a good idea to open a new Jira issue if the bug is in the existing rddStorageInfo.numCachedPartitions calculation?
If those values differ then it's probably a bug in the new code. Or maybe a bug in the old code, although that's less likely. It would be good to investigate why they differ.
The old code considered the replication factor. I have created a separate Jira issue: https://issues.apache.org/jira/browse/SPARK-23394.
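To illustrate the difference with hypothetical numbers (this is an interpretation of the replication issue described above, not code from the patch):

    // Sketch: why the two counts diverge for a replicated storage level.
    val numPartitions = 10
    val replication = 2 // e.g. StorageLevel.MEMORY_ONLY_2
    // Old path: summing rddBlocksById(...).size over all executors counts
    // every replica of every cached block.
    val oldStyleCount = numPartitions * replication // 20
    // New path: numCachedPartitions counts each distinct cached partition
    // once, regardless of replication.
    val newStyleCount = numPartitions // 10
    assert(oldStyleCount == replication * newStyleCount)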
  @DeveloperApi
  @deprecated("This class may be removed or made private in a future release.", "2.2.0")
  class StorageStatus(
  private [spark] class StorageStatus(
nit: no space after private
   *
   * We store RDD blocks and non-RDD blocks separately to allow quick retrievals of RDD blocks.
   * These collections should only be mutated through the add/update/removeBlock methods.
   * These collections should only be mutated through the addBlock method.
I think this is pretty out of date now. I don't see any calls to addBlock outside of this class.
    val data = sc.parallelize(1 to 1000, 10)
    val cachedData = data.persist(storageLevel)
    assert(cachedData.count === 1000)
    assert(sc.getExecutorStorageStatus.map(_.rddBlocksById(cachedData.id).size).sum ===
You could replace these with code based on sc.statusStore.
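A hedged sketch of what that replacement could look like (the expected count of 10 matches the test's partition count above; rddList and numCachedPartitions are assumptions based on the AppStatusStore / v1.RDDStorageInfo API):

    // Sketch: the same assertion expressed against the app status store.
    // Note: numCachedPartitions counts distinct partitions, not replicas,
    // so for replicated storage levels it differs from the old block sum.
    val rddInfo = sc.statusStore.rddList().find(_.id == cachedData.id)
    assert(rddInfo.map(_.numCachedPartitions).getOrElse(0) === 10)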
Using sc.statusStore here would also hit the bug I mentioned above (the rddStorageInfo.numCachedPartitions difference). In many cases the testCaching method is called several times, which is why I left this untouched.
   * we submit a request to kill them. This must be called before each kill request.
   */
  private def syncExecutors(sc: SparkContext): Unit = {
    val driverExecutors = sc.getExecutorStorageStatus
You could replace this with code based on sc.statusStore.
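A hedged sketch of that replacement (executorList(activeOnly) is assumed from the AppStatusStore API; whether the driver's own entry needs filtering depends on what the test expects):

    // Sketch: executor ids known to the driver, read from the app status
    // store instead of the block manager's StorageStatus list.
    val driverExecutors = sc.statusStore.executorList(activeOnly = true)
      .map(_.id)
      .filter(_ != "driver") // drop the driver's own entry if present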
I have tried to use "sc.statusStore.executorList(true)" instead of sc.env.blockManager.master.getStorageStatus but the test failed.
Failed how? The list kept by the block manager and by the status store should be the same, so if they differ, there's a problem somewhere.
As only registered executors can be killed, this part synchronises the executors known to the master and to the driver (executors missing from the driver were registered with some mock data). The reason behind this was performance:
I tried changing the code to wait for the executors instead, and it got very slow (one test took up to 50 seconds).
Test build #87232 has finished for PR 20546 at commit

Test build #87265 has finished for PR 20546 at commit

Test build #87266 has finished for PR 20546 at commit

Test build #87336 has finished for PR 20546 at commit

Test build #87345 has finished for PR 20546 at commit

Test build #87361 has finished for PR 20546 at commit

retest this please

Test build #87378 has finished for PR 20546 at commit

LGTM, merging to master. I filed SPARK-23411 to deprecate the other API I missed before.
  int port();
  long cacheSize();
  int numRunningTasks();
  long usedOnHeapStorageMemory();
Maybe I'm missing something here, but do we already have a real use case for the added memory metrics here?
This information was exposed by the public method being removed by this PR, so it makes sense to add these so that people still have a way to get that data.
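For illustration, a hedged sketch of how an application could read those values through the public status tracker once this change is in (the accessor names come from this PR's diff; sc.statusTracker.getExecutorInfos is the existing public entry point):

    // Sketch: the new executor memory metrics via SparkStatusTracker,
    // replacing the removed sc.getExecutorStorageStatus.
    sc.statusTracker.getExecutorInfos.foreach { exec =>
      println(s"${exec.host()}:${exec.port()} on-heap storage used/total = " +
        s"${exec.usedOnHeapStorageMemory()}/${exec.totalOnHeapStorageMemory()}")
    }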
If you are referring to StorageStatus.onHeapMemUsed / offHeapMemUsed / onHeapMemRemaining / offHeapMemRemaining, then it makes great sense. But I still don't quite get a few things:
- Why do we use a different name format, when we could have kept the same names as in StorageStatus?
- Is this information all we need to add to SparkExecutorInfo? After this change, how do we expose disk usage information?
- That follows the names in the public MemoryMetrics class from the REST API.
- We could add that, just as we could add a whole lot of other things. At some point we should look at exposing the REST API types directly through SparkStatusTracker instead of having these mirror types.
I'm not disagreeing with you on making such changes, but I'm also worried that users could have to change their code a lot because of the changes we made. If you don't mind, may I submit a follow-up PR to minimize the gap between SparkExecutorInfo and StorageStatus?
Sure. But users put themselves at that kind of risk by using @DeveloperApi methods, especially ones that have been deprecated.
Agreed, the changes should be a minor issue. Thanks for the explanations!
What changes were proposed in this pull request?
In this PR, StorageStatus is made private and simplified a bit; moreover, the SparkContext.getExecutorStorageStatus method is removed. The reason for keeping StorageStatus at all is that it is still used by SparkContext.getRDDStorageInfo.
As a replacement for SparkContext.getExecutorStorageStatus, executor infos are extended with additional memory metrics: usedOnHeapStorageMemory, usedOffHeapStorageMemory, totalOnHeapStorageMemory and totalOffHeapStorageMemory.
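A hedged migration sketch for user code (the "before" line uses the removed API; memUsed is assumed to cover both on- and off-heap storage memory, which is how the new accessors are combined below):

    // Before (removed by this PR):
    //   val used = sc.getExecutorStorageStatus.map(_.memUsed).sum
    // After, via the public status tracker and the new metrics:
    val used = sc.statusTracker.getExecutorInfos
      .map(e => e.usedOnHeapStorageMemory() + e.usedOffHeapStorageMemory())
      .sum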
How was this patch tested?
By running existing unit tests.