Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26265][Core] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager #23272

Closed
wants to merge 5 commits into from

Conversation

Projects
None yet
5 participants
@viirya
Copy link
Contributor

commented Dec 10, 2018

What changes were proposed in this pull request?

In BytesToBytesMap.MapIterator.advanceToNextPage, We will first lock this MapIterator and then TaskMemoryManager when going to free a memory page by calling freePage. At the same time, it is possibly that another memory consumer first locks TaskMemoryManager and then this MapIterator when it acquires memory and causes spilling on this MapIterator.

So it ends with the MapIterator object holds lock to the MapIterator object and waits for lock on TaskMemoryManager, and the other consumer holds lock to TaskMemoryManager and waits for lock on the MapIterator object.

To avoid deadlock here, this patch proposes to keep reference to the page to free and free it after releasing the lock of MapIterator.

How was this patch tested?

Added test and manually test by running the test 100 times to make sure there is no deadlock.

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 10, 2018

@@ -667,4 +668,53 @@ public void testPeakMemoryUsed() {
}
}

@Test
public void avoidDeadlock() throws InterruptedException {

This comment has been minimized.

Copy link
@dongjoon-hyun

dongjoon-hyun Dec 10, 2018

Member

Hi, @viirya . Since this test case reproduces Deadlock situation, we need a timeout logic. Otherwise, it will hang (instead of failures) when we hit this issue later.

This comment has been minimized.

Copy link
@viirya

viirya Dec 10, 2018

Author Contributor

I've tried several ways to set a timeout logic, but don't work. The deadlock always hangs the test and timeout logic.

@cloud-fan

This comment has been minimized.

Copy link
Contributor

commented Dec 10, 2018

have you seen any bug report caused by this dead lock?

@SparkQA

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test build #99900 has finished for PR 23272 at commit 25e8e06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 10, 2018

have you seen any bug report caused by this dead lock?

The original reporter of the JIRA ticket SPARK-26265 has hit with this bug in their workload.

assertFalse(iter.hasNext());
} finally {
map.free();
thread.join();

This comment has been minimized.

Copy link
@cloud-fan

cloud-fan Dec 10, 2018

Contributor

Is this line where the test hangs without the fix?

This comment has been minimized.

Copy link
@viirya

viirya Dec 10, 2018

Author Contributor

When without this line, the test still hangs. The test thread hangs on the deadlock with the other thread of running memoryConsumer.

This comment has been minimized.

Copy link
@viirya

viirya Dec 10, 2018

Author Contributor

This line just makes sure memoryConsumer to end and free acquired memory.

@SparkQA

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test build #99914 has finished for PR 23272 at commit 4c621d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

This comment has been minimized.

Copy link

commented Dec 10, 2018

Test build #99915 has finished for PR 23272 at commit 9d52320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018


try {
int i;
for (i = 0; i < 1024; i++) {

This comment has been minimized.

Copy link
@dongjoon-hyun

dongjoon-hyun Dec 11, 2018

Member

Let's use for (int i = 0; ... here and line 708 because int i is not referenced outside of for loop.
Never mind. I found that this is the convention in this test suite.

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018

@cloud-fan

This comment has been minimized.

Copy link
Contributor

commented Dec 11, 2018

LGTM. last question: does the test always reproduce the bug? Or it has some randomness?

@SparkQA

This comment has been minimized.

Copy link

commented Dec 11, 2018

Test build #99956 has finished for PR 23272 at commit 0405527.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018

last question: does the test always reproduce the bug? Or it has some randomness?

If without the change, as I tried it locally 10 times, the test can reproduce the bug 10 times. But I'm not sure if it is 100% to reproduce the bug. I think we can't always to reproduce a deadlock like this.

@cloud-fan

This comment has been minimized.

Copy link
Contributor

commented Dec 11, 2018

thanks, merging to master!

Can you send a new PR for 2.4 without the synchronized move around?

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018

Can you send a new PR for 2.4 without the synchronized move around?

Ok. Thanks.

@asfgit asfgit closed this in a3bbca9 Dec 11, 2018

@SparkQA

This comment has been minimized.

Copy link

commented Dec 11, 2018

Test build #99966 has finished for PR 23272 at commit 0849083.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Dec 11, 2018

I think the failed tests are unrelated. cc @cloud-fan

asfgit pushed a commit that referenced this pull request Dec 15, 2018

[SPARK-26265][CORE][FOLLOWUP] Put freePage into a finally block
## What changes were proposed in this pull request?

Based on the [comment](#23272 (comment)), it seems to be better to put `freePage` into a `finally` block. This patch as a follow-up to do so.

## How was this patch tested?

Existing tests.

Closes #23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 1b604c1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

asfgit pushed a commit that referenced this pull request Dec 15, 2018

[SPARK-26265][CORE][FOLLOWUP] Put freePage into a finally block
## What changes were proposed in this pull request?

Based on the [comment](#23272 (comment)), it seems to be better to put `freePage` into a `finally` block. This patch as a follow-up to do so.

## How was this patch tested?

Existing tests.

Closes #23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

holdenk added a commit to holdenk/spark that referenced this pull request Jan 5, 2019

[SPARK-26265][CORE][FOLLOWUP] Put freePage into a finally block
## What changes were proposed in this pull request?

Based on the [comment](apache#23272 (comment)), it seems to be better to put `freePage` into a `finally` block. This patch as a follow-up to do so.

## How was this patch tested?

Existing tests.

Closes apache#23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-26265][CORE] Fix deadlock in BytesToBytesMap.MapIterator when …
…locking both BytesToBytesMap.MapIterator and TaskMemoryManager

## What changes were proposed in this pull request?

In `BytesToBytesMap.MapIterator.advanceToNextPage`, We will first lock this `MapIterator` and then `TaskMemoryManager` when going to free a memory page by calling `freePage`. At the same time, it is possibly that another memory consumer first locks `TaskMemoryManager` and then this `MapIterator` when it acquires memory and causes spilling on this `MapIterator`.

So it ends with the `MapIterator` object holds lock to the `MapIterator` object and waits for lock on `TaskMemoryManager`, and the other consumer holds lock to `TaskMemoryManager` and waits for lock on the `MapIterator` object.

To avoid deadlock here, this patch proposes to keep reference to the page to free and free it after releasing the lock of `MapIterator`.

## How was this patch tested?

Added test and manually test by running the test 100 times to make sure there is no deadlock.

Closes apache#23272 from viirya/SPARK-26265.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-26265][CORE][FOLLOWUP] Put freePage into a finally block
## What changes were proposed in this pull request?

Based on the [comment](apache#23272 (comment)), it seems to be better to put `freePage` into a `finally` block. This patch as a follow-up to do so.

## How was this patch tested?

Existing tests.

Closes apache#23294 from viirya/SPARK-26265-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.