Limit number of blocks buffered in memory during fetch #570
Conversation
Thanks for the pull request! We'll review and test this and hopefully get it merged in.
I haven't looked at the patch in detail yet, but I can confirm this is an issue. I uploaded a 2GB file, and then watched memory usage grow as I downloaded it with curl and the `--limit-rate` flag:

```
s3cmd mb s3://large
s3cmd put -P ~/Desktop/2GB s3://large
curl --limit-rate 128 localhost:8080/large/2GB > /dev/null
```
@arekinath I've been testing your patch, and everything seems to be working as expected. I have a couple comments:
I've not been able to reproduce this. If I remove the
Interesting. I did find this a bit hard to follow in code, and in general I'd prefer to use the same backpressure mechanism the PUT fsm does, but your rationale is compelling. Maybe I'll cook up a patch without the timeout and see if I can find a performance difference.
@reiddraper I ran some more controlled tests in our environment here, and I can reproduce a slight advantage in peak RSS. If you want a version without the timeout step, try something like https://gist.github.com/arekinath/5667290 on top of this patch.
Ok cool, +1.
I tried this, and I'm seeing an even bigger throughput difference than you saw, namely:
Granted, this is just laptop micro-benchmarking. Curiously, I also noticed that both on develop and in the no-timeout case, there is a big drop in performance right at the beginning of the download. I'm going to try and collect some data about this and get some graphs going. tl;dr: let's remove the explicit GC call, and then I think I'm +1 on this.
I've not been able to reproduce the performance drop-off on a real cluster. I still see a 28% speed reduction using the no-timeout patch, so I think it's definitely worth keeping the code as-is. I'm also noticing that changing the
This avoids excessive RAM usage when fetching very large files when the user's network link is much slower than the intra-cluster network links.
@reiddraper I've amended it to drop the explicit GC call.
+1
Limit number of blocks buffered in memory during fetch
We've been trying out Riak CS for storing large downloads for our users (usually 1-4GB files) and finding that our nodes were randomly falling over!
After some investigation, we found that the Riak CS processes were dying after hitting their zones' swap memory caps (this is on Solaris), and that this was very easily reproduced simply by downloading a 4GB file from multiple clients at once.
What's happening is that the `riak_cs_get_fsm` is managing to fetch most of the blocks for these huge files very quickly, and then they sit in RAM (inside that process's `got_block` orddict) for ages while waiting to be written out to the client. This seems to be due to the fact that our intra-cluster links are much faster than the links to the clients downloading files. This is fine if the file is small, but on a 4GB file we are finding that that one dictionary can expand to 3 or 3.5GB in RAM for every single client trying to download it. This is a really big problem.

I've put together a general idea of a solution here, which I've tested and is working fine for us. In `riak_cs_get_fsm:waiting_chunks/2`, instead of unconditionally fetching the next block from the queue as soon as the last one is received, I check whether we already have "too many" blocks in the orddict (the definition of "too many" is controlled by a new config tweakable). If we do, then no new block is fetched at that point. Then, in the handler for the `get_next_chunk` message, after sending out the block, we check if we are below the no-fetch threshold. If we are, we re-enter the `waiting_chunks` state with a 0 timeout. Then in the `timeout` message handler we start any fetch operations that we can.

The reason why I defer the call to `read_chunks/1` in the `get_next_chunk` handler using a `gen_fsm` timeout is to avoid any unnecessary delay when there are lots of `get_next_chunk` messages queued: fetching new chunks will wait until the current messages in the queue have all been handled. I found this to give better performance than calling `read_chunks/1` straight away.

As an aside, in the `timeout` handler I have added a manual call to `erlang:garbage_collect/0`, because I was finding that the GC would not run often enough on the `riak_cs_get_fsm` processes without it when they got very busy. This is quite common with processes that deal with relatively few messages/reductions that each move a lot of data (exactly what the `get_fsm` does).
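To illustrate the buffering rule, here's a minimal sketch in Python rather than the Erlang of the actual patch. The names (`BlockBuffer`, `maybe_fetch`, `send_next`, `max_buffered`) are hypothetical stand-ins for the FSM state, the config tweakable, and the `waiting_chunks`/`get_next_chunk` handlers; the point is only that fetching stops at the threshold and resumes as the client drains blocks, so peak buffering stays bounded regardless of file size:

```python
from collections import deque

# Hypothetical model (not Riak CS code) of the rule described above:
# at most `max_buffered` fetched blocks may sit in memory at once.

class BlockBuffer:
    def __init__(self, total_blocks, max_buffered):
        self.pending = deque(range(total_blocks))  # block ids not yet fetched
        self.buffered = {}                         # block id -> payload (the "orddict")
        self.next_to_send = 0                      # client consumes blocks in order
        self.max_buffered = max_buffered

    def maybe_fetch(self):
        # Fetch further blocks only while we are below the threshold;
        # otherwise apply backpressure and wait for the client to drain.
        while self.pending and len(self.buffered) < self.max_buffered:
            n = self.pending.popleft()
            self.buffered[n] = b"block-%d" % n  # stand-in for a Riak block fetch

    def send_next(self):
        # Deliver the next in-order block to the client, then check whether
        # we dropped below the threshold and may fetch again (in the FSM
        # this re-check is deferred via the 0 timeout).
        payload = self.buffered.pop(self.next_to_send)
        self.next_to_send += 1
        self.maybe_fetch()
        return payload

# Simulate a slow client downloading a 10-block object with a cap of 3.
buf = BlockBuffer(total_blocks=10, max_buffered=3)
buf.maybe_fetch()
peak = len(buf.buffered)
while buf.next_to_send < 10:
    buf.send_next()
    peak = max(peak, len(buf.buffered))
print(peak)  # peak buffering never exceeds max_buffered
```

Without the threshold check in `maybe_fetch`, the loop would pull all ten blocks into `buffered` immediately, which is the unbounded-orddict behaviour the patch fixes.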