
[SPARK-12062] [CORE] Change Master to asynchronously rebuild UI when application completes #10284

Closed

Conversation

BryanCutler
Member

This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked, allowing new workers to register/remove even if the event log history is very large and takes a long time to rebuild.
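At a high level, the approach looks roughly like the sketch below. This is a paraphrase, not the exact patched code: it assumes the surrounding `Master` class members (`appIdToUI`, `self`, and a synchronous `rebuildSparkUI` helper), with only `asyncRebuildSparkUI` and `AttachCompletedRebuildUI(appId: String)` confirmed by the discussion and test output in this thread.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch of the async rebuild: replay the event log on a separate thread,
// then notify the Master (via a self-sent RPC message) to attach the
// finished UI, so the RPC thread stays free to register/remove workers.
private def asyncRebuildSparkUI(app: ApplicationInfo): Future[Option[SparkUI]] = {
  val appId = app.id
  val rebuild = Future { rebuildSparkUI(app) }   // the slow event-log replay
  rebuild.onSuccess { case Some(ui) =>
    appIdToUI.put(appId, ui)                     // see the ConcurrentHashMap discussion below
    self.send(AttachCompletedRebuildUI(appId))   // attach on the Master's own thread
  }
  rebuild
}
```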

@BryanCutler
Member Author

I tested this by making an artificially large event log file, then stopping the worker and making sure it could re-register before the UI rebuild completed. This worked fine for me, at least on a local cluster.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47630 has finished for PR 10284 at commit 456c806.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@BryanCutler
Member Author

Jenkins, retest this please

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47631 has finished for PR 10284 at commit 456c806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

```diff
@@ -78,7 +88,7 @@ private[deploy] class Master(
   private val addressToApp = new HashMap[RpcAddress, ApplicationInfo]
   private val completedApps = new ArrayBuffer[ApplicationInfo]
   private var nextAppNumber = 0
-  private val appIdToUI = new HashMap[String, SparkUI]
+  private val appIdToUI = new ConcurrentHashMap[String, SparkUI]
```
Contributor

This doesn't have to change if you move more things under `case AttachCompletedRebuildUI`, because everything is synchronous there.

Member Author

I was hesitant to do this because it would mean adding the SparkUI to a message and sending the object through the RPC layer, and I wasn't sure whether it would be copied or serialized there. If the event logs are large, that would be a lot of copying. Is this not a problem since the message is only being sent to itself?

Contributor

I see, that's a valid point

Contributor

We should at least add a comment explaining why this needs to be a ConcurrentHashMap.
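For example, the suggested comment might read along these lines (illustrative wording only, not the merged text):

```scala
import java.util.concurrent.ConcurrentHashMap

// This must be a ConcurrentHashMap: with the async rebuild, the future's
// completion callback writes the rebuilt SparkUI from a worker thread
// while the Master's RPC thread reads and removes entries.
private val appIdToUI = new ConcurrentHashMap[String, SparkUI]
```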

@andrewor14
Contributor

@BryanCutler thanks for fixing this. Even though the bigger feature will eventually be removed, that will only happen in future branches, not branch-1.6, which can still use this fix. I think the high-level approach is sound. I made some suggestions for simplifying the code a little.

@BryanCutler
Member Author

Thanks for the quick feedback @andrewor14! I simplified things with your suggestions. I'm still looking into removing the ConcurrentHashMap, which I think would require passing the SparkUI along with the app id in the RPC message.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47701 has finished for PR 10284 at commit 90402d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@BryanCutler
Member Author

Hi @andrewor14, it looks like the default RPC NettyRpcEnv will not serialize a local message, so if I were to send the message `AttachCompletedRebuildUI(appId: String, ui: SparkUI)`, it should be fine.

However, from what I can tell, Akka will serialize the message, so if someone configured the master RPC to use AkkaRpcEnv and created a large event log, I think trying to serialize such a large object would cause issues.

I'd be happy to take on the task of removing all of this in SPARK-12299, since I'm pretty familiar with the code now. That way I can make sure that if the ConcurrentHashMap is used, it will be properly reverted.
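For reference, the local-send short-circuit being described works roughly as follows. This is a loose paraphrase of NettyRpcEnv's behavior, not the actual source, and the helper names below are illustrative:

```scala
// Paraphrase of NettyRpcEnv's send path (illustrative only):
def send(message: RequestMessage): Unit = {
  if (message.receiver.address == localAddress) {
    // Same JVM: the message object is handed to the local dispatcher by
    // reference, so a large SparkUI payload is never serialized or copied.
    dispatcher.postOneWayMessage(message)
  } else {
    // Remote endpoint: the message must be serialized before going on the wire.
    postToOutbox(message.receiver, serialize(message))
  }
}
```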

@andrewor14
Contributor

LGTM. This is mergeable as is, though it would be best if we documented why a ConcurrentHashMap is used.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47755 has finished for PR 10284 at commit 04501a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AttachCompletedRebuildUI(appId: String)

@andrewor14
Contributor

Merging into master and 1.6. Thanks for your work @BryanCutler. Feel free to start work on SPARK-12299 any time to remove all the work done here. :) That one I'll only merge in master.

asfgit pushed a commit that referenced this pull request Dec 16, 2015
… completes

This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked, allowing new workers to register/remove even if the event log history is very large and takes a long time to rebuild.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062.

(cherry picked from commit c5b6b39)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in c5b6b39 Dec 16, 2015
@tedyu
Contributor

tedyu commented Dec 19, 2015

I think the following exception, seen in a unit test run, was related to this PR:

```
[info] - Simple replay (70 milliseconds)
java.lang.NullPointerException
    at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
    at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
    at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
```

@andrewor14
Contributor

@tedyu would you mind pointing me to the Jenkins page?

@tedyu
Contributor

tedyu commented Dec 21, 2015

@andrewor14
Contributor

I've opened #10417 to fix it.

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 21, 2015
asfgit pushed a commit that referenced this pull request Dec 21, 2015
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull

This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (and doesn't actually fail them).

Tested locally to verify that the NPE is gone.

Author: Andrew Or <andrew@databricks.com>

Closes #10417 from andrewor14/fix-harmless-npe.

(cherry picked from commit d655d37)
Signed-off-by: Andrew Or <andrew@databricks.com>
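To illustrate the shape of the race: the rebuild future's completion callback runs on a worker thread and can fire when the Master state it touches is not yet (or no longer) available, as in test runs that replay event logs without a live Master. A guard of roughly the following form avoids the NPE; this is a hypothetical rendering, not the merged change, for which see #10417:

```scala
// Hypothetical guard against the callback race (see #10417 for the real fix):
rebuild.onSuccess { case Some(ui) =>
  appIdToUI.put(appId, ui)
  // `self` may be unavailable when no live Master endpoint exists, e.g. in
  // test suites that replay event logs; skip the attach message in that case.
  Option(self).foreach(_.send(AttachCompletedRebuildUI(appId)))
}
```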
@BryanCutler BryanCutler deleted the async-MasterUI-SPARK-12062 branch January 3, 2017 22:29