Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15606][core] Use non-blocking removeExecutor call to avoid deadlocks #13355

Closed
wants to merge 7 commits into from
Closed

[SPARK-15606][core] Use non-blocking removeExecutor call to avoid deadlocks #13355

wants to merge 7 commits into from

Conversation

robbinspg
Copy link
Member

What changes were proposed in this pull request?

Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores

How was this patch tested?

Spark test builds

@robbinspg robbinspg changed the title Use a minimum of 3 dispatcher threads to avoid deadlocks [SPARK-15606][core] Use a minimum of 3 dispatcher threads to avoid deadlocks May 27, 2016
@robbinspg
Copy link
Member Author

Although this patch resolves this particular issue I would echo the comment in #11728 by @zsxwing

{quote}
However, the root cause is there are blocking calls in the event loops but not enough threads. This could happen in other places (such as netty, akka). Ideally, we should avoid blocking calls in all event loops. However, it's hard to figure out all of them in the huge code bases :(
{quote}

@srowen
Copy link
Member

srowen commented May 27, 2016

I think that's fine. CC @zsxwing @yonran

@srowen
Copy link
Member

srowen commented May 27, 2016

Jenkins add to whitelist

@srowen
Copy link
Member

srowen commented May 27, 2016

Jenkins test this please

@SparkQA
Copy link

SparkQA commented May 27, 2016

Test build #59494 has finished for PR 13355 at commit 139e875.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member

zsxwing commented May 27, 2016

This configuration is for people setting a custom thread number for their special environments. I don't want to keep increasing this number when someone complains the default thread number is too small.

Instead of this fix, I prefer to fix the real issue you mentioned in JIRA. You can just make removeExecutor in #13055 asynchronous.

@robbinspg
Copy link
Member Author

agreed. I'll take a look.

@robbinspg
Copy link
Member Author

@zsxwing Do you mean change BlockManagerMaster.removeExecutor to send the message using send (fire and forget) rather than askWithRetry?

@zsxwing
Copy link
Member

zsxwing commented May 31, 2016

@zsxwing Do you mean change BlockManagerMaster.removeExecutor to send the message using send (fire and forget) rather than askWithRetry?

Yes. You probably need to add a new removeExecutor method.

@robbinspg
Copy link
Member Author

OK, that's what I tried but it threw up some errors in some other tests which I'm investigating.

@zsxwing
Copy link
Member

zsxwing commented May 31, 2016

OK, that's what I tried but it threw up some errors in some other tests which I'm investigating.

Ah, it should be ask instead of send. You can just ignore the returned Future from ask

@robbinspg
Copy link
Member Author

reverted original fix and replaced with using non-blocking call in BlockManagerMaster.removeExecutor.

Also added a new test suite to run Distributed suite forcing the number of dispatcher threads to 2. This suite will fail without the fix.

@SparkQA
Copy link

SparkQA commented May 31, 2016

Test build #59686 has finished for PR 13355 at commit c991d70.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -38,7 +38,8 @@ class BlockManagerMaster(

/** Remove a dead executor from the driver endpoint. This is only called on the driver side. */
def removeExecutor(execId: String) {
tell(RemoveExecutor(execId))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used by other places. It's better to add a new method instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so I've added a new removeExecutorAsync method to minimise side effects

@SparkQA
Copy link

SparkQA commented Jun 1, 2016

Test build #59726 has finished for PR 13355 at commit 6a95c50.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 1, 2016

Test build #59728 has finished for PR 13355 at commit eb900d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@robbinspg robbinspg changed the title [SPARK-15606][core] Use a minimum of 3 dispatcher threads to avoid deadlocks [SPARK-15606][core] Use non-blocking removeExecutor call to avoid deadlocks Jun 1, 2016
@@ -0,0 +1,42 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this file? This is one of the slowest tests. It's not worth to run it for this issue.

@zsxwing
Copy link
Member

zsxwing commented Jun 1, 2016

@robbinspg LGTM after you remove DistributedSuiteMinTHreads.

@robbinspg
Copy link
Member Author

Test suite removed

@SparkQA
Copy link

SparkQA commented Jun 1, 2016

Test build #59762 has finished for PR 13355 at commit 4d3fc6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@robbinspg
Copy link
Member Author

@zsxwing ok to merge now?

@zsxwing
Copy link
Member

zsxwing commented Jun 2, 2016

LGTM. Thanks, merging to master and 2.0

@asfgit asfgit closed this in 7c07d17 Jun 2, 2016
asfgit pushed a commit that referenced this pull request Jun 2, 2016
…dlocks

## What changes were proposed in this pull request?
Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores

## How was this patch tested?

Spark test builds

Author: Pete Robbins <robbinspg@gmail.com>

Closes #13355 from robbinspg/SPARK-13906.

(cherry picked from commit 7c07d17)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
zzcclp added a commit to zzcclp/spark that referenced this pull request Jun 3, 2016
@robbinspg robbinspg deleted the SPARK-13906 branch June 16, 2016 12:54
asfgit pushed a commit that referenced this pull request Jun 21, 2016
…dlocks

## What changes were proposed in this pull request?
Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores

## How was this patch tested?

Spark test builds

Author: Pete Robbins <robbinspg@gmail.com>

Closes #13355 from robbinspg/SPARK-13906.
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jun 22, 2016
…dlocks

Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores

Spark test builds

Author: Pete Robbins <robbinspg@gmail.com>

Closes apache#13355 from robbinspg/SPARK-13906.

(cherry picked from commit d98fb19)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants