
[SPARK-21907][CORE] oom during spill #19181

Closed
wants to merge 13 commits

Conversation

eyalfa

@eyalfa eyalfa commented Sep 10, 2017

What changes were proposed in this pull request?

  1. a test reproducing SPARK-21907
  2. a fix for the root cause of the issue.

`org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset`, which may trigger another spill. When this happens, the `array` member has already been de-allocated but is still referenced by the code, causing the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`.
This patch introduces a reproduction in a test case and a fix: the fix simply sets the in-memory sorter's `array` member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.
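The failure mode and the fix can be sketched in a few lines of plain Java. This is a hypothetical, heavily simplified model — the page map stands in for `TaskMemoryManager`, and `reset`/`nestedSpill` stand in for the sorter's methods; none of these names are Spark's real API:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of SPARK-21907. All names here are illustrative.
public final class Spark21907Sketch {
    // page id -> backing storage; freed pages are removed from the map
    private static final Map<Integer, long[]> pages = new HashMap<>();
    private static final int ARRAY_PAGE = 0;

    // TaskMemoryManager.getPage analogue: returns null once the page is freed
    private static long[] getPage(int id) { return pages.get(id); }

    // The nested spill walks the sorter's array; before the fix it
    // dereferences a page that was already freed, producing an NPE.
    private static void nestedSpill(Integer arrayRef) {
        if (arrayRef != null) {
            long unused = getPage(arrayRef)[0]; // NPE if the page was freed
        }
    }

    // reset() as in the patch: returns true if the nested spill is harmless.
    static boolean reset(boolean applyFix) {
        pages.put(ARRAY_PAGE, new long[]{42L});
        Integer arrayRef = ARRAY_PAGE;    // the in-mem sorter's 'array' field
        if (applyFix) {
            arrayRef = null;              // the fix: drop the reference first
        }
        pages.remove(ARRAY_PAGE);         // freeArray(array)
        try {
            nestedSpill(arrayRef);        // re-allocation may trigger a spill
            return true;
        } catch (NullPointerException npe) {
            return false;                 // the crash reported in SPARK-21907
        }
    }

    public static void main(String[] args) {
        if (reset(false)) throw new AssertionError("expected an NPE without the fix");
        if (!reset(true)) throw new AssertionError("the fix should make the nested spill harmless");
        System.out.println("ok");
    }
}
```

The point of the sketch: clearing the reference before freeing means a spill triggered from inside `reset` simply sees no array to touch.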

How was this patch tested?

introduced a new test case: org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill.

@eyalfa
Author

eyalfa commented Sep 10, 2017

@ericl , @davies ,
you guys seem to be the last ones to have edited this area of the code; I'd appreciate it if you could take a look.

@hvanhovell
Contributor

ok to test

@SparkQA

SparkQA commented Sep 12, 2017

Test build #81672 has finished for PR 19181 at commit ae7fbc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -170,6 +170,10 @@ public void free() {
public void reset() {
if (consumer != null) {
consumer.freeArray(array);
array = LongArray.empty;
Author

@hvanhovell ,
I'm starting to have second thoughts about the special empty instance here; I'm afraid the nested call might trigger freeArray or something similar on it.
Perhaps using null is a better option here?

@@ -85,7 +85,7 @@
private final LinkedList<UnsafeSorterSpillWriter> spillWriters = new LinkedList<>();

// These variables are reset after spilling:
@Nullable private volatile UnsafeInMemorySorter inMemSorter;
@VisibleForTesting @Nullable volatile UnsafeInMemorySorter inMemSorter;
Contributor

This is slightly hair raising. Can you try to find a different - less tricky path - to test this?

Author

@eyalfa eyalfa Sep 13, 2017

@hvanhovell, is it acceptable to introduce a method indicating whether the next insert is going to spill? We could make it @VisibleForTesting and prefix it with something like testOnly.

array = null;
}
}

public void reset() {
if (consumer != null) {
consumer.freeArray(array);
array = null;
Contributor

Please document why this is needed.

@hvanhovell
Copy link
Contributor

cc @juliuszsompolski can you take a look, or perhaps take this for a test drive?

// and ended up with a failed assertion.
// we also expect the location of the OOM to be org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset
memoryManager.markConseqOOM(2);
exceptions.expect(OutOfMemoryError.class);
Contributor

Why not just catch the error? Seems a bit easier.

Author

I'm used to scalatest's Matchers, things could be so much easier if this test was written in Scala...
Will rewrite...

@SparkQA

SparkQA commented Sep 13, 2017

Test build #81734 has finished for PR 19181 at commit 2fd8e91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81765 has finished for PR 19181 at commit 57f20b7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eyalfa
Author

eyalfa commented Sep 14, 2017

@hvanhovell, @juliuszsompolski,
the failure seems unrelated to my work; can you please request a retest?

@hvanhovell
Contributor

@eyalfa you should be able to retrigger tests yourself: jenkins retest this please

@hvanhovell
Contributor

retest this please

@juliuszsompolski
Contributor

LGTM.
It'd be nicer if there were an invariant that a nested spill simply can never happen, i.e. no place on a code path that already spills can itself spill.
But this fix does seem to render nested spill harmless in this particular case.

@hvanhovell In my original use case I hit this bug after hours of running on a big cluster, so reproducing it that way is quite impractical for me. I'll see if I observe it again.

@eyalfa
Author

eyalfa commented Sep 14, 2017

@juliuszsompolski,
if you comment out the few added lines in the reset() method, you'll see that the test fails with a stack trace very similar to the one you pasted in JIRA.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81782 has finished for PR 19181 at commit 57f20b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski
Contributor

@eyalfa Thanks. I agree that this is a good fix to this issue and lgtm.
I'm just worried that there are more lurking cases where a nested spill can trigger and cause something unexpected, and that finding and reproducing such other similar issues may be difficult.

@eyalfa
Author

eyalfa commented Sep 15, 2017 via email

@Test
public void testOOMDuringSpill() throws Exception {
final UnsafeExternalSorter sorter = newSorter();
for (int i = 0; sorter.hasSpaceForAnotherRecord(); ++i) {
Contributor

Will this make the test case take quite a long time to finish?

Author

according to Jenkins: 7ms, so I guess we're ok then 😎

@eyalfa
Author

eyalfa commented Sep 20, 2017

@hvanhovell , PTAL
please let me know if there's anything that requires fixing here.

@SparkQA

SparkQA commented Sep 20, 2017

Test build #81998 has finished for PR 19181 at commit 48e947d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eyalfa
Author

eyalfa commented Sep 26, 2017

@hvanhovell ?

@@ -85,7 +85,7 @@
private final LinkedList<UnsafeSorterSpillWriter> spillWriters = new LinkedList<>();

// These variables are reset after spilling:
@Nullable private volatile UnsafeInMemorySorter inMemSorter;
private @Nullable volatile UnsafeInMemorySorter inMemSorter;
Contributor

nit: unnecessary change.

@@ -162,14 +162,25 @@ private int getUsableCapacity() {
*/
public void free() {
if (consumer != null) {
consumer.freeArray(array);
if (null != array) {
Contributor

nit: RHS literal (array != null)

array = null;
}
}

public void reset() {
if (consumer != null) {
consumer.freeArray(array);
// this is needed to prevent a 'nested' spill,
Contributor

nit: it doesn't prevent a nested spill, it only renders it harmless
remove this line - the rest of the comment is true.

}
// todo: this might actually not be zero if pageSize is somehow configured differently,
// so we actually have to compute the expected spill size according to the configured page size
assertEquals(0, sorter.getSpillSize());
Contributor

If this might actually not be zero, maybe don't test this assertion?

Author

Well, it's a tricky one...
As far as I can track the construction and logic of the relevant classes, and given that we're sorting Ints here, we're not supposed to spill before expandPointersArray requests additional memory; it is, however, very difficult and 'involved' to actually check this during the test.
I can set the memoryManager to choke before the first loop; this way, if my assumption breaks and the sorter attempts to allocate memory, we'd get an OOM sooner than expected, effectively failing the test.

@juliuszsompolski, do you find this approach better? It'd still need a comment describing the assumption about memory usage by the sorter.
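The injection idea can be sketched as a tiny fake memory manager that fails the next N allocations. This is loosely modeled on the PR's markConseqOOM hook (Spark's real TestMemoryManager is a Scala class); the class and method names below are illustrative only:

```java
// Minimal sketch of consecutive-OOM injection for tests; names illustrative.
public final class FakeMemoryManager {
    private int consecutiveOOM = 0;

    // Make the next n allocation attempts fail with OutOfMemoryError.
    public void markConsecutiveOOM(int n) { consecutiveOOM = n; }

    // Allocation path: throws while the injected-OOM budget lasts.
    public long acquireExecutionMemory(long required) {
        if (consecutiveOOM > 0) {
            consecutiveOOM -= 1;
            throw new OutOfMemoryError("injected OOM for testing");
        }
        return required;
    }

    public static void main(String[] args) {
        FakeMemoryManager m = new FakeMemoryManager();
        m.markConsecutiveOOM(2);
        int ooms = 0;
        for (int i = 0; i < 3; i++) {
            try {
                m.acquireExecutionMemory(64);
            } catch (OutOfMemoryError e) {
                ooms++;
            }
        }
        if (ooms != 2) throw new AssertionError("expected exactly 2 injected OOMs, got " + ooms);
        System.out.println("ok");
    }
}
```

Arming the manager before the loop means any allocation the test did not anticipate surfaces as an early OOM, failing the test instead of silently passing.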

insertNumber(sorter, 1024);
}
// we expect an OutOfMemoryError here, anything else (i.e the original NPE is a failure)
catch( OutOfMemoryError oom ){
Contributor

nit: ws: catch (OutOfMemoryError oom) {

try {
sorter.reset();
} catch( OutOfMemoryError oom ) {
//as expected
Contributor

nit: ws: // as expected

}
// this currently fails on NPE at org.apache.spark.memory.MemoryConsumer.freeArray(MemoryConsumer.java:108)
sorter.free();
//simulate a 'back to back' free.
Contributor

nit: ws: // simulate ...

@@ -58,11 +58,15 @@ class TestMemoryManager(conf: SparkConf)

override def maxOffHeapStorageMemory: Long = 0L

private var oomOnce = false
private var conseqOOM = 0
Contributor

nit: conseq -> consequent

markConseqOOM(1)
}

def markConseqOOM( n : Int) : Unit = {
Contributor

nit: markConsequentOOM

Contributor

nit: ws: (n: Int): Unit

import org.hamcrest.Matchers;
import org.hamcrest.TypeSafeMatcher;
import org.junit.Rule;
import org.junit.rules.ExpectedException;
Contributor

nit: I think you don't use most of these imports anymore.


@Test
public void freeAfterOOM() {
final TestMemoryManager testMemoryManager = new TestMemoryManager(new SparkConf().set("spark.memory.offHeap.enabled", "false"));
Member

nit: Better to wrap this line, since it is too long.

@SparkQA

SparkQA commented Oct 1, 2017

Test build #82372 has finished for PR 19181 at commit e3da029.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski
Contributor

Looks good to me.
What do you think @hvanhovell ?

insertNumber(sorter, 1024);
}
// we expect an OutOfMemoryError here, anything else (i.e the original NPE is a failure)
catch (OutOfMemoryError oom ){
Contributor

@hvanhovell hvanhovell Oct 10, 2017

Nit weird spacing: catch (OutOfMemoryError oom) {

expectedOOM = oom;
}

assertNotNull("expected OutOfMmoryError but it seems operation surprisingly succeeded"
Contributor

NIT: just move this into the catch. Just fail if you reach this point, or fail in the try statement.

try {
sorter.reset();
} catch (OutOfMemoryError oom) {
// as expected
Contributor

Make sure you fail on an unexpected success
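The pattern the reviewer is asking for — fail on an unexpected success, not just swallow the expected error — can be sketched without JUnit; the helper names below are illustrative, not part of the PR:

```java
// Sketch of the "fail on unexpected success" pattern: the operation must
// throw OutOfMemoryError; completing normally fails the test.
public final class ExpectOOMSketch {
    static void expectOOM(Runnable op) {
        try {
            op.run();
        } catch (OutOfMemoryError expected) {
            return; // this is the error we wanted
        }
        throw new AssertionError("expected OutOfMemoryError, but the call succeeded");
    }

    // true if op triggered the expected OOM, false on unexpected success
    static boolean oomAsExpected(Runnable op) {
        try {
            expectOOM(op);
            return true;
        } catch (AssertionError unexpectedSuccess) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!oomAsExpected(() -> { throw new OutOfMemoryError("injected"); }))
            throw new AssertionError();
        if (oomAsExpected(() -> { /* completes normally */ }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

An empty catch block alone would let a silently succeeding `sorter.reset()` pass the test; the explicit throw after the try closes that hole.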

@Test
public void freeAfterOOM() {
final SparkConf sparkConf =
new SparkConf()
Contributor

NIT style

@Test
public void testOOMDuringSpill() throws Exception {
final UnsafeExternalSorter sorter = newSorter();
// we assume that given default configuration, the size of
Contributor

NIT NIT NIT NIT NIT NIT This comment is pretty good, but it is kind of hard to read because of the weird white space usage. Can you just make lines of 100 characters

Contributor

@hvanhovell hvanhovell left a comment

I have a few minor comments, other than that LGTM.

@hvanhovell
Contributor

I will merge this when it passes tests.

@eyalfa
Author

eyalfa commented Oct 10, 2017

@hvanhovell , thanks 👍

@SparkQA

SparkQA commented Oct 10, 2017

Test build #82594 has finished for PR 19181 at commit 6b901ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Merging to master. Thanks!

Can you create a backport for Spark-2.2?

@asfgit asfgit closed this in 2028e5a Oct 10, 2017
eyalfa pushed a commit to eyalfa/spark that referenced this pull request Oct 12, 2017
1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907)
2. a fix for the root cause of the issue.

`org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill,
when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`.
This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.

introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`.

Author: Eyal Farago <eyal@nrgene.com>

Closes apache#19181 from eyalfa/SPARK-21907__oom_during_spill.
asfgit pushed a commit that referenced this pull request Oct 12, 2017
back-port #19181 to branch-2.2.

## What changes were proposed in this pull request?
1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907)
2. a fix for the root cause of the issue.

`org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill,
when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`.
This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.

## How was this patch tested?
introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`.

Author: Eyal Farago <eyal@nrgene.com>

Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.