-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore #11791
Conversation
Test build #53456 has finished for PR 11791 at commit
|
7dc3623
to
5489748
Compare
Test build #53496 has finished for PR 11791 at commit
|
/cc @rxin @andrewor14, this is the next most important patch to review towards off-heap caching. After these changes get in, we'll be able to use off-heap memory for the unroll memory in off-heap caching, greatly simplifying things. Without this change, the on-heap unroll array needs to be accounted properly even if the final cache destination is off-heap, making the caching more OOM-prone and complicating the accounting logic (since it then becomes different between the two modes). |
Still WIP? |
This is no longer WIP and should be ready for review now. |
@@ -129,10 +136,9 @@ private[spark] class MemoryStore( | |||
* iterator or call `close()` on it in order to free the storage memory consumed by the | |||
* partially-unrolled block. | |||
*/ | |||
private[storage] def putIterator( | |||
private[storage] def putIteratorAsValues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case it isn't obvious from the diff, the main change in this file is to split putIterator
into two separate methods, putIteratorAsValues
and putIteratorAsBytes
.
It's possible that there's some opportunity to reduce code duplication here, but unless we can come up with an obvious and simple approach I would prefer to defer cleanup to followup patches.
Test build #53700 has finished for PR 11791 at commit
|
Jenkins, retest this please. |
Test build #53704 has finished for PR 11791 at commit
|
if (keepUnrolling) { | ||
unrollMemoryUsedByThisBlock += amountToRequest | ||
} | ||
unrollMemoryUsedByThisBlock += amountToRequest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont understand why you add this twice in some cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think this case is a mistake which might have been introduced while repairing a merge conflict. We should only increment this if keepUnrolling == true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has now been fixed.
Test build #53742 has finished for PR 11791 at commit
|
Test build #53741 has finished for PR 11791 at commit
|
Doing a bit of refactoring now in order to make it easier to write proper unit tests for this. Therefore I'd hold off on the final review pass here for a little bit and review my SPARK-3000 patch instead. |
Test build #53957 has finished for PR 11791 at commit
|
Alright, just added a few more tests to |
Test build #54057 has finished for PR 11791 at commit
|
Jenkins, retest this please. |
Test build #54070 has finished for PR 11791 at commit
|
|
||
// Make sure that we have enough memory to store the block. By this point, it is possible that | ||
// the block's actual memory usage has exceeded the unroll memory by a small amount, so we | ||
// perform one final call to attempt to allocate additional memory if necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because of the call to close? That can use more memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
Jenkins, retest this please. |
LGTM |
} else { | ||
iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about my previous comment about duplicate code, never mind. It can't actually be abstracted cleanly.
Looks good. |
Test build #54094 has finished for PR 11791 at commit
|
Merging to master. |
When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code will unroll the entire iterator as Java objects in memory, then will turn around and serialize an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements.
Instead, I think that we should incrementally serialize blocks while unrolling them.
A downside to incremental serialization is the fact that we will need to deserialize the partially-unrolled data in case there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory efficiency improvements will outweigh any performance losses as a result of extra serialization in that hopefully-rare case.