[SPARK-11801][CORE] Notify driver when OOM is thrown before executor … #9866

Closed
wants to merge 1 commit into from

@vundela vundela commented Nov 20, 2015

…JVM is killed

This fix tries to make sure that a task that catches an OOM reports its status to the driver, so that the driver logs contain enough information to explain why tasks or the executor were lost. The fix does the following:

  1. Registers a shutdown hook for the executor, which does the following:
    a) Synchronizes with the OOM handler thread. (The assumption is that the OOM thread is still running and acquires the lock before the shutdown hook thread does. I considered introducing a delay, but in several runs with the fix I never hit that situation.)
    b) Kills all remaining tasks running in the current container. (I thought it would be good to clean up the tasks properly, so that they don't do any work that might throw unwanted errors or exceptions.)
    c) Sleeps for some time so that the OOM handler's status is flushed to the driver. (Without the sleep, the status message is lost.)

  2. Adds a separate handler for OOM, so that we can send a proper message to the driver.
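
The sequence above can be sketched roughly as follows. This is a simplified illustration using only plain JVM APIs, not the code in this patch: `OomShutdownSketch`, `sentToDriver`, `handleOom`, `shutdownSequence`, and `flushDelayMs` are all hypothetical names, the queue stands in for status updates to the driver, and the delay value is an assumption.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object OomShutdownSketch {
  // Hypothetical stand-in for status updates sent to the driver.
  val sentToDriver = new ConcurrentLinkedQueue[String]()

  // Lock shared by the OOM handler and the shutdown hook.
  private val oomLock = new Object

  // Assumed flush delay; the real value would need tuning.
  val flushDelayMs = 100L

  /** Called by the task thread that caught the OutOfMemoryError (item 2). */
  def handleOom(taskId: Long): Unit = oomLock.synchronized {
    sentToDriver.add(s"task $taskId: TaskOutOfMemory")
  }

  /** Body of the shutdown hook: steps a), b), and c) of item 1. */
  def shutdownSequence(remainingTasks: Seq[Long]): Unit = {
    // a) Block until an OOM handler that already holds the lock is done.
    //    Nothing here forces the handler to acquire the lock first.
    oomLock.synchronized {
      // b) Mark the remaining tasks as killed.
      remainingTasks.foreach(id => sentToDriver.add(s"task $id: TaskKilled"))
    }
    // c) Give the status messages time to be flushed.
    Thread.sleep(flushDelayMs)
  }

  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = {
        shutdownSequence(Seq(7L))
        println(sentToDriver)
      }
    }))
    handleOom(6L) // simulate a task that caught an OOM before JVM exit
  }
}
```

Note that the sketch only guarantees ordering if the OOM handler takes the lock before the hook runs, which is the same assumption stated in a) above.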

vundela commented Nov 20, 2015

Here is a snippet of the messages in the driver logs:

15/11/19 16:31:23 INFO YarnAllocator: Canceling requests for 1 executor containers
15/11/19 16:31:23 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, vsr-4.vpc.cloudera.com): TaskOutOfMemory (task caught OutOfMemoryError)
15/11/19 16:31:23 INFO TaskSetManager: Starting task 6.1 in stage 0.0 (TID 14, vsr-4.vpc.cloudera.com, partition 6,PROCESS_LOCAL, 1983 bytes)
15/11/19 16:31:23 INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 15, vsr-4.vpc.cloudera.com, partition 14,PROCESS_LOCAL, 1987 bytes)
15/11/19 16:31:23 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, vsr-4.vpc.cloudera.com): TaskKilled (killed intentionally)
15/11/19 16:31:24 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. vsr-4.vpc.cloudera.com:41617
15/11/19 16:31:24 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. vsr-4.vpc.cloudera.com:41617
15/11/19 16:31:24 ERROR YarnClusterScheduler: Lost executor 2 on vsr-4.vpc.cloudera.com: remote Rpc client disassociated

@@ -24,6 +24,7 @@ import java.nio.ByteBuffer
import java.util.concurrent.{ConcurrentHashMap, TimeUnit}

import scala.collection.JavaConverters._
import scala.collection.JavaConversions._
JavaConversions is discouraged (it's hard for the reader to see where the conversion is happening). Stick to JavaConverters and explicitly call asScala / asJava.
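
For illustration, an explicit conversion could look something like the sketch below. `ConvertersExample`, `runningTasks`, and `taskNames` are hypothetical names used only for this example, not identifiers from the patch.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object ConvertersExample {
  // Hypothetical stand-in for a Java map of running tasks keyed by task ID.
  val runningTasks = new ConcurrentHashMap[Long, String]()

  // The .asScala call marks exactly where the Java collection is wrapped.
  def taskNames(): Seq[String] =
    runningTasks.values().asScala.toSeq.sorted

  def main(args: Array[String]): Unit = {
    runningTasks.put(6L, "task 6.0")
    runningTasks.put(7L, "task 7.0")
    println(taskNames().mkString(", "))
  }
}
```

With JavaConversions the same code would compile without `.asScala`, which is what makes the conversion hard to spot when reading.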

squito commented Nov 30, 2015

Hi @vundela, I left a few comments on the code, but there are some larger design issues that I think need to be discussed first. I'd like to move that discussion to the JIRA (so it's better archived) before moving forward with this.

@AmplabJenkins

Can one of the admins verify this patch?
