-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21907][CORE] oom during spill #19181
Conversation
…e test by requiring OOM exception thrown from the reset() method.
ok to test |
Test build #81672 has finished for PR 19181 at commit
|
@@ -170,6 +170,10 @@ public void free() { | |||
public void reset() { | |||
if (consumer != null) { | |||
consumer.freeArray(array); | |||
array = LongArray.empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell ,
I'm starting to have second thoughts about the special empty
instance here, I'm afraid the the nested call might trigger freeArray
or something similar on it.
perhaps using null is a better option here?
…ySorter that demonstrates an NPE when free() is called after an OOM.
…ee when called after an OOM.
@@ -85,7 +85,7 @@ | |||
private final LinkedList<UnsafeSorterSpillWriter> spillWriters = new LinkedList<>(); | |||
|
|||
// These variables are reset after spilling: | |||
@Nullable private volatile UnsafeInMemorySorter inMemSorter; | |||
@VisibleForTesting @Nullable volatile UnsafeInMemorySorter inMemSorter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly hair raising. Can you try to find a different - less tricky path - to test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell , is it acceptable to introduce a method indicating if the next inset is going to spill? Can make this @VisibleForTest and prefixed with something like testOnly
array = null; | ||
} | ||
} | ||
|
||
public void reset() { | ||
if (consumer != null) { | ||
consumer.freeArray(array); | ||
array = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document why this is needed.
cc @juliuszsompolski can you take a look, or perhaps take this for a test drive? |
// and ended up with a failed assertion. | ||
// we also expect the location of the OOM to be org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset | ||
memoryManager.markConseqOOM(2); | ||
exceptions.expect(OutOfMemoryError.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just catch the error? Seems a bit easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to scalatest's Matchers, things could be so much easier if this test was written in Scala...
Will rewrite...
Test build #81734 has finished for PR 19181 at commit
|
…isibleForTesting
…plfy exception related test.
Test build #81765 has finished for PR 19181 at commit
|
@hvanhovell, @juliuszsompolski, |
@eyalfa you should be able to retrigger tests yourself: |
retest this please |
LGTM. @hvanhovell In my original use case I hit this bug after hours of running on a big cluster, so reproducing it on that case is quite unpractical for me. I'll see if I'll observe it again. |
@juliuszsompolski, |
Test build #81782 has finished for PR 19181 at commit
|
@eyalfa Thanks. I agree that this is a good fix to this issue and lgtm. |
I agree, but 'it is what it is'😎
We can probably come up with some mechanism that detects such scenarios and
avoids invoking the spill method on an object whose already 'on the stack',
I think this requires more involvement from the committers, perhaps in a
separate Jira?
…On Sep 15, 2017 14:27, "Juliusz Sompolski" ***@***.***> wrote:
@eyalfa <https://github.com/eyalfa> Thanks. I agree that this is a good
fix to this issue and lgtm.
I'm just worried that there are more lurking cases where a nested spill
can trigger and cause something unexpected, and that finding and
reproducing such other similar issues may be difficult.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19181 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABFFOSIoc8O29CLcpfn2hPWJV_5neQjqks5sil8WgaJpZM4PSVvS>
.
|
@Test | ||
public void testOOMDuringSpill() throws Exception { | ||
final UnsafeExternalSorter sorter = newSorter(); | ||
for (int i = 0; sorter.hasSpaceForAnotherRecord(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this make the test case takes quite a long time to finish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to Jenkins: 7ms, I guess we're ok than 😎
@hvanhovell , PTAL |
Test build #81998 has finished for PR 19181 at commit
|
@@ -85,7 +85,7 @@ | |||
private final LinkedList<UnsafeSorterSpillWriter> spillWriters = new LinkedList<>(); | |||
|
|||
// These variables are reset after spilling: | |||
@Nullable private volatile UnsafeInMemorySorter inMemSorter; | |||
private @Nullable volatile UnsafeInMemorySorter inMemSorter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary change.
@@ -162,14 +162,25 @@ private int getUsableCapacity() { | |||
*/ | |||
public void free() { | |||
if (consumer != null) { | |||
consumer.freeArray(array); | |||
if (null != array) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RHS literal (array != null)
array = null; | ||
} | ||
} | ||
|
||
public void reset() { | ||
if (consumer != null) { | ||
consumer.freeArray(array); | ||
// this is needed to prevent a 'nested' spill, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it doesn't prevent a nested spill, it only renders it harmless
remove this line - the rest of the comment is true.
} | ||
// todo: this might actually not be zero if pageSize is somehow configured differently, | ||
// so we actually have to compute the expected spill size according to the configured page size | ||
assertEquals(0, sorter.getSpillSize()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this might actually not be zero, maybe don't test this assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, its a tricky one...
as far as I can track the construction and logic of the relevant classes here, and given that we're sorting Ints here, we're not supposed to spill before expandPointersArray requests additional memory, it is however very difficult and 'involved' to actually check this during test.
I can set the memoryManager to choke before the first loop, this way if my assumption breaks and the sorter attempts to allocate memory we'd get an OOM sooner than expected, effectively failing the test.
@juliuszsompolski , do you find this approach better? it'd still need a comment describing the assumption about memory usage by the sorter.
insertNumber(sorter, 1024); | ||
} | ||
// we expect an OutOfMemoryError here, anything else (i.e the original NPE is a failure) | ||
catch( OutOfMemoryError oom ){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ws: catch (OutOfMemoryError oom) {
try { | ||
sorter.reset(); | ||
} catch( OutOfMemoryError oom ) { | ||
//as expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ws: // as expected
} | ||
// this currently fails on NPE at org.apache.spark.memory.MemoryConsumer.freeArray(MemoryConsumer.java:108) | ||
sorter.free(); | ||
//simulate a 'back to back' free. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ws: // simulate ...
@@ -58,11 +58,15 @@ class TestMemoryManager(conf: SparkConf) | |||
|
|||
override def maxOffHeapStorageMemory: Long = 0L | |||
|
|||
private var oomOnce = false | |||
private var conseqOOM = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: conseq -> consequent
markConseqOOM(1) | ||
} | ||
|
||
def markConseqOOM( n : Int) : Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: markConsequentOOM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ws: (n: Int): Unit
import org.hamcrest.Matchers; | ||
import org.hamcrest.TypeSafeMatcher; | ||
import org.junit.Rule; | ||
import org.junit.rules.ExpectedException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think you don't use most of these imports anymore.
|
||
@Test | ||
public void freeAfterOOM() { | ||
final TestMemoryManager testMemoryManager = new TestMemoryManager(new SparkConf().set("spark.memory.offHeap.enabled", "false")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is it better to wrap this line since this is too long.
Test build #82372 has finished for PR 19181 at commit
|
Looks good to me. |
insertNumber(sorter, 1024); | ||
} | ||
// we expect an OutOfMemoryError here, anything else (i.e the original NPE is a failure) | ||
catch (OutOfMemoryError oom ){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit weird spacing: catch (OutOfMemoryError oom) {
expectedOOM = oom; | ||
} | ||
|
||
assertNotNull("expected OutOfMmoryError but it seems operation surprisingly succeeded" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: just move this into the catch. Just fail if you reach this point, or fail in the try statement.
try { | ||
sorter.reset(); | ||
} catch (OutOfMemoryError oom) { | ||
// as expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure you fail on an unexpected success
@Test | ||
public void freeAfterOOM() { | ||
final SparkConf sparkConf = | ||
new SparkConf() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT style
@Test | ||
public void testOOMDuringSpill() throws Exception { | ||
final UnsafeExternalSorter sorter = newSorter(); | ||
// we assume that given default configuration, the size of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT NIT NIT NIT NIT NIT This comment is pretty good, but it is kind of hard to read because of the weird white space usage. Can you just make lines of 100 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few minor comments, other than that LGTM.
I will merge this when it passes tests. |
@hvanhovell , thanks 👍 |
Test build #82594 has finished for PR 19181 at commit
|
Merging to master. Thanks! Can you create a backport for Spark-2.2? |
1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill, when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes apache#19181 from eyalfa/SPARK-21907__oom_during_spill.
back-port #19181 to branch-2.2. ## What changes were proposed in this pull request? 1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill, when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. ## How was this patch tested? introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.
back-port apache#19181 to branch-2.2. ## What changes were proposed in this pull request? 1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill, when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. ## How was this patch tested? introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes apache#19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.
back-port apache#19181 to branch-2.2. 1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill, when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes apache#19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.
What changes were proposed in this pull request?
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill
callsorg.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset
which may trigger another spill,when this happens the
array
member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE inorg.apache.spark.memory.TaskMemoryManager.getPage
.This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.
How was this patch tested?
introduced a new test case:
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill
.