
[SPARK-12062] [CORE] Change Master to asynchronously rebuild UI when application completes #10284

Closed

Conversation

BryanCutler
Member

This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked, allowing new workers to register/remove even if the event log history is very large and takes a long time to rebuild.
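At a high level, the approach looks roughly like the sketch below. This is a paraphrase, not the exact patched code: it assumes the surrounding `Master` class members (`appIdToUI`, `self`, and a synchronous `rebuildSparkUI` helper), with only `asyncRebuildSparkUI` and `AttachCompletedRebuildUI(appId: String)` confirmed by the discussion and test output in this thread.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch of the async rebuild: replay the event log on a separate thread,
// then notify the Master (via a self-sent RPC message) to attach the
// finished UI, so the RPC thread stays free to register/remove workers.
private def asyncRebuildSparkUI(app: ApplicationInfo): Future[Option[SparkUI]] = {
  val appId = app.id
  val rebuild = Future { rebuildSparkUI(app) }   // the slow event-log replay
  rebuild.onSuccess { case Some(ui) =>
    appIdToUI.put(appId, ui)                     // see the ConcurrentHashMap discussion below
    self.send(AttachCompletedRebuildUI(appId))   // attach on the Master's own thread
  }
  rebuild
}
```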

@BryanCutler
Member Author

I tested this by making an artificially large event log file, then stopping the worker and making sure it could re-register before the UI rebuild completed. This worked fine for me, at least on a local cluster.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47630 has finished for PR 10284 at commit 456c806.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@BryanCutler
Member Author

Jenkins, retest this please

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47631 has finished for PR 10284 at commit 456c806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

```diff
@@ -78,7 +88,7 @@ private[deploy] class Master(
   private val addressToApp = new HashMap[RpcAddress, ApplicationInfo]
   private val completedApps = new ArrayBuffer[ApplicationInfo]
   private var nextAppNumber = 0
-  private val appIdToUI = new HashMap[String, SparkUI]
+  private val appIdToUI = new ConcurrentHashMap[String, SparkUI]
```
Contributor

This doesn't have to change if you move more things under `case AttachCompletedRebuildUI`, because everything is synchronous there.

Member Author

I was hesitant to do this because it would mean adding the SparkUI to a message and sending the object through the RPC layer, and I wasn't sure whether it would be copied or serialized there. If the event logs are large, that would be a lot of copying. Is this not a problem since the message is only being sent to itself?

Contributor

I see, that's a valid point

Contributor

We should at least add a comment explaining why this needs to be a ConcurrentHashMap.
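For example, the suggested comment might read along these lines (illustrative wording only, not the merged text):

```scala
import java.util.concurrent.ConcurrentHashMap

// This must be a ConcurrentHashMap: with the async rebuild, the future's
// completion callback writes the rebuilt SparkUI from a worker thread
// while the Master's RPC thread reads and removes entries.
private val appIdToUI = new ConcurrentHashMap[String, SparkUI]
```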

@andrewor14
Contributor

@BryanCutler thanks for fixing this. Even though the bigger feature will eventually be removed, that will only happen in future branches, not branch-1.6, which can still use this fix. I think the high-level approach is sound. I made some suggestions for simplifying the code a little.

@BryanCutler
Member Author

Thanks for the quick feedback @andrewor14! I simplified things with your suggestions. I'm still looking into removing the ConcurrentHashMap, which I think would require passing the SparkUI along with the app id in the RPC message.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47701 has finished for PR 10284 at commit 90402d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@BryanCutler
Member Author

Hi @andrewor14, it looks like the default RPC NettyRpcEnv will not serialize a local message, so if I were to send the message `AttachCompletedRebuildUI(appId: String, ui: SparkUI)`, it should be fine.

However, from what I can tell, Akka will serialize the message, so if someone configured the master RPC to use AkkaRpcEnv and created a large event log, I think trying to serialize such a large object would cause issues.

I'd be happy to take on the task of removing all of this in SPARK-12299, since I'm pretty familiar with the code now. That way I can make sure that if the ConcurrentHashMap is used, it will be properly reverted.
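For reference, the local-send short-circuit being described works roughly as follows. This is a loose paraphrase of NettyRpcEnv's behavior, not the actual source, and the helper names below are illustrative:

```scala
// Paraphrase of NettyRpcEnv's send path (illustrative only):
def send(message: RequestMessage): Unit = {
  if (message.receiver.address == localAddress) {
    // Same JVM: the message object is handed to the local dispatcher by
    // reference, so a large SparkUI payload is never serialized or copied.
    dispatcher.postOneWayMessage(message)
  } else {
    // Remote endpoint: the message must be serialized before going on the wire.
    postToOutbox(message.receiver, serialize(message))
  }
}
```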

@andrewor14
Contributor

LGTM. This is mergeable as is, though it would be best if we documented why a ConcurrentHashMap is used.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47755 has finished for PR 10284 at commit 04501a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@andrewor14
Contributor

Merging into master and 1.6. Thanks for your work @BryanCutler. Feel free to start work on SPARK-12299 any time to remove all the work done here. :) That one I'll only merge in master.

asfgit pushed a commit that referenced this pull request Dec 16, 2015
… completes

This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked, allowing new workers to register/remove even if the event log history is very large and takes a long time to rebuild.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062.

(cherry picked from commit c5b6b39)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in c5b6b39 Dec 16, 2015
@tedyu
Contributor

tedyu commented Dec 19, 2015

I think the following exception, seen in a unit test run, was related to this PR:

```
[info] - Simple replay (70 milliseconds)
java.lang.NullPointerException
    at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
    at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
```

@andrewor14
Contributor

@tedyu would you mind pointing me to the Jenkins page?

@tedyu
Contributor

tedyu commented Dec 21, 2015

@andrewor14
Contributor

I've opened #10417 to fix it.

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 21, 2015
asfgit pushed a commit that referenced this pull request Dec 21, 2015
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull

This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (and doesn't actually fail them).

Tested locally to verify that the NPE is gone.

Author: Andrew Or <andrew@databricks.com>

Closes #10417 from andrewor14/fix-harmless-npe.

(cherry picked from commit d655d37)
Signed-off-by: Andrew Or <andrew@databricks.com>
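To illustrate the shape of the race: the rebuild future's completion callback runs on a worker thread and can fire when the Master state it touches is not yet (or no longer) available, as in test runs that replay event logs without a live Master. A guard of roughly the following form avoids the NPE; this is a hypothetical rendering, not the merged change, for which see #10417:

```scala
// Hypothetical guard against the callback race (see #10417 for the real fix):
rebuild.onSuccess { case Some(ui) =>
  appIdToUI.put(appId, ui)
  // `self` may be unavailable when no live Master endpoint exists, e.g. in
  // test suites that replay event logs; skip the attach message in that case.
  Option(self).foreach(_.send(AttachCompletedRebuildUI(appId)))
}
```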
@BryanCutler BryanCutler deleted the async-MasterUI-SPARK-12062 branch January 3, 2017 22:29