[SPARK-25422][CORE] Don't memory map blocks streamed to disk. #22511
Conversation
After data has been streamed to disk, the buffers are inserted into the memory store in some cases (e.g., with broadcast blocks). But the broadcast code also disposes of those buffers when the data has been read, to ensure we don't leave mapped buffers using up memory; that dispose then leaves garbage data in the memory store.
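For readers unfamiliar with the JVM buffer semantics involved: disposing is only meaningful for direct or memory-mapped buffers (for heap buffers it is a no-op), which is why disposing a mapped buffer that is still registered in the memory store leaves a garbage entry behind. A minimal Java sketch of that distinction, with `wouldDispose` as a hypothetical stand-in for the dispose logic, not Spark's actual API:

```java
import java.nio.ByteBuffer;

public class DisposeSketch {
    // Hypothetical stand-in for dispose behavior: only direct/mapped
    // buffers have off-heap memory to free eagerly; heap buffers are
    // simply left to the garbage collector.
    static boolean wouldDispose(ByteBuffer buf) {
        return buf.isDirect();
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(16);         // backed by a byte[]
        ByteBuffer direct = ByteBuffer.allocateDirect(16); // off-heap, like an mmap'd file
        System.out.println(wouldDispose(heap));   // false: dispose is a no-op
        System.out.println(wouldDispose(direct)); // true: memory is freed eagerly
    }
}
```

If a disposed mapped buffer is still referenced by the memory store, any later read of that entry sees unmapped (garbage) memory, which is the failure mode this PR avoids.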
Test build #96394 has finished for PR 22511 at commit
retest this please
cc @jiangxb1987
this seems like a big change, will we hit perf regression?
is this a long-standing bug?
Test build #96408 has finished for PR 22511 at commit
Not vs. 2.3. It only affects things when stream-to-disk is enabled, and when it is enabled for reading remote cached blocks, this is actually going back to the old behavior. See https://github.com/apache/spark/pull/19476/files#diff-2b643ea78c1add0381754b1f47eec132R692 -- this change will make things slower, but only vs. other changes that exist in 2.4 alone. There are TODOs in the code for ways to improve this further, but those should not go into 2.4.
the change here is not fixing a long-standing bug, it's just updating new changes for 2.4. However, I'm really wondering why TorrentBroadcast calls dispose on the blocks. For regular buffers it's a no-op, so it hasn't mattered, but I can't come up with a reason why you would want to dispose those blocks. Secondly, there seems to be an implicit assumption that you never add memory-mapped byte buffers to the MemoryStore. Maybe that makes sense ... it's kind of messing with the Memory / Disk management Spark has. But the MemoryStore never checks that you don't add a mapped buffer, so you'll just get weird behavior like this later on. Seems there should be a check at the very least, to avoid this kind of issue. As neither of those things is new to 2.4, I don't think we should deal w/ them here. The major motivation for memory-mapping the file was not broadcast blocks, it was reading large cached blocks. But it actually makes more sense to change the interfaces in BlockManager to allow us to just get the managedBuffer, instead of a ChunkedByteBuffer (that's this TODO: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L728)
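To illustrate the check suggested above, here is a hypothetical guard (not actual MemoryStore code): rejecting mapped buffers at insertion time would surface the mistake immediately, instead of as garbage data much later.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemoryStoreGuard {
    // Hypothetical guard sketching the suggested check: refuse to cache a
    // memory-mapped buffer, since a later dispose() on it would leave a
    // garbage entry behind in the store.
    static void checkNotMapped(ByteBuffer buf) {
        if (buf instanceof MappedByteBuffer) {
            throw new IllegalArgumentException(
                "memory-mapped buffers must not be inserted into the memory store");
        }
    }

    public static void main(String[] args) throws IOException {
        checkNotMapped(ByteBuffer.allocate(8)); // heap buffer: fine
        System.out.println("heap buffer accepted");

        // Map a small temp file to get a real MappedByteBuffer.
        Path tmp = Files.createTempFile("block", ".bin");
        Files.write(tmp, new byte[8]);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, 8);
            try {
                checkNotMapped(mapped);
            } catch (IllegalArgumentException e) {
                System.out.println("mapped buffer rejected");
            }
        }
        Files.delete(tmp);
    }
}
```

This is just a sketch of the review suggestion; where such a check would actually live in Spark's MemoryStore is left open in the discussion.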
Test build #4346 has started for PR 22511 at commit
Retest this please.
Also cc @zsxwing @JoshRosen
Test build #96497 has finished for PR 22511 at commit
Retest this please
Test build #96512 has finished for PR 22511 at commit
Retest this please.
LGTM. I went back and took a look at the related changes, and agree with Imran that this is basically the same thing that 2.3 did; so no perf regression, just higher memory usage than in the mmap case (which didn't exist before anyway).
@squito This PR is directly heading to
Test build #96526 has finished for PR 22511 at commit
The analysis makes sense to me. The thing I'm not sure about is: how can we hit it? The "fetch block to temp file" code path is only enabled for big blocks (> 2GB).
a possible approach: can we just not dispose the data in
The failing test cases "with replication as stream" turned on fetch to disk for all data:
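In other words, which path a fetch takes depends only on a size threshold, and the failing replication tests force that threshold low so every block streams to disk. A hypothetical sketch of that decision (names invented for illustration, not Spark's actual code):

```java
public class FetchPath {
    // Hypothetical model of the decision discussed above: blocks larger
    // than the in-memory threshold are streamed to a temp file on disk.
    static String fetchPath(long blockSize, long maxInMemSize) {
        return blockSize > maxInMemSize ? "stream-to-disk" : "in-memory";
    }

    public static void main(String[] args) {
        // Normally only very large blocks take the disk path...
        System.out.println(fetchPath(4L * 1024 * 1024 * 1024, Integer.MAX_VALUE));
        // ...but the replication-as-stream tests force the threshold so low
        // that even tiny blocks stream to disk, exposing the dispose bug.
        System.out.println(fetchPath(1024, 1));
    }
}
```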
yes I considered this, but I don't feel confident about making that change for 2.4. I need to spend some more time understanding that (seems it came from SPARK-19556 / b56ad2b). I think this change is the right one for the moment.
Yes, good point. Sorry, I opened this against 2.4 just for testing, in case the errors were more likely in 2.4 for some reason. Closing this and opened #22546
I think the deal with the dispose in TorrentBroadcast is that it's definitely needed in the local read case, but may need adjustments in the remote read case. The local read case (
BTW it can be argued that you don't need a dispose in that code path since
How was this patch tested?
Ran the old failing test in a loop. Full tests on Jenkins.