Limit number of blocks buffered in memory during fetch #570

Merged

Conversation

arekinath
Contributor

We've been trying out Riak CS for storing large downloads for our users (usually 1-4GB files) and finding that our nodes were randomly falling over!

After some investigation, we found that the Riak CS processes were dying after hitting their zones' swap memory caps (this is on Solaris), and that this was very easily reproduced simply by downloading a 4GB file from multiple clients at once.

What's happening is that the riak_cs_get_fsm is managing to fetch most of the blocks for these huge files very very quickly, and then they sit in RAM (inside that process' got_block orddict) for ages while waiting to be written out to the client. This seems to be due to the fact that our intra-cluster links are much much faster than the links to the clients downloading files. This is fine if the file is small, but on a 4GB file we are finding that that one dictionary can expand to 3 or 3.5GB in RAM for every single client trying to download it. This is a really big problem.

I've put together a general idea of a solution here, which I've tested and is working fine for us. In riak_cs_get_fsm:waiting_chunks/2, instead of unconditionally fetching the next block if one is on the queue as soon as the last one is received, I check whether we already have "too many" blocks in the orddict (the definition of "too many" is controlled by a new config tweakable). If we do, then no new block is fetched at that point. Then, in the handler for the get_next_chunk message, after sending out the block, we check if we are below the no-fetch threshold. If we are, we re-enter the waiting_chunks state with a 0 timeout. Then in the timeout message handler we start any fetch operations that we can.
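For anyone skimming, the shape of that check is roughly as follows. This is only an illustrative sketch: the got_blocks state field, the helper names, and the fetch_buffer_size app.config key are placeholder names rather than what the actual patch uses; only read_chunks/1 and the orddict come from the description above.

%% Hypothetical default for the buffering threshold
-define(DEFAULT_FETCH_BUFFER, 32).

%% True once "too many" blocks are sitting in the orddict awaiting the client.
too_many_blocks_buffered(GotBlocks) ->
    Max = case application:get_env(riak_cs, fetch_buffer_size) of
              {ok, N} when is_integer(N) -> N;
              _ -> ?DEFAULT_FETCH_BUFFER
          end,
    orddict:size(GotBlocks) >= Max.

%% Called from waiting_chunks/2 when a block arrives: only ask for the
%% next block if we are still under the threshold.
maybe_read_chunks(State = #state{got_blocks = GotBlocks}) ->
    case too_many_blocks_buffered(GotBlocks) of
        true  -> State;              %% back off until the client drains some
        false -> read_chunks(State)  %% existing fetch path
    end.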

The reason why I defer the call to read_chunks/1 in the get_next_chunk handler using a gen_fsm timeout is to avoid any unnecessary delay when there are lots of get_next_chunk messages queued. Fetching new chunks will wait until the current messages in the queue have all been handled. I found this to give better performance than calling read_chunks/1 straight away.
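Concretely, the deferral looks something like this (again a sketch: pop_next_block/1, the got_blocks field, and the exact clause shapes are placeholders; only read_chunks/1 and the waiting_chunks state name come from the description above):

%% Sync handler (StateName/3): reply with the block and, if we are below
%% the threshold, re-enter waiting_chunks with a 0 timeout instead of
%% calling read_chunks/1 inline.
waiting_chunks(get_next_chunk, _From, State0) ->
    {Block, State} = pop_next_block(State0),   %% hypothetical helper
    case too_many_blocks_buffered(State#state.got_blocks) of
        false -> {reply, {chunk, Block}, waiting_chunks, State, 0};
        true  -> {reply, {chunk, Block}, waiting_chunks, State}
    end.

%% Async handler (StateName/2): the 0 timeout only fires after every
%% get_next_chunk message already in the mailbox has been handled, so
%% this is where new block fetches actually get started.
waiting_chunks(timeout, State) ->
    {next_state, waiting_chunks, read_chunks(State)}.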

As an aside, in the timeout handler, I have added a manual call to erlang:garbage_collect/0, because I was finding that the GC would not run often enough on the riak_cs_get_fsms without it when they got very busy. This is quite common with processes that handle relatively few messages/reductions that each move a lot of data (which is exactly what the get_fsm does).

@kellymclaughlin
Contributor

Thanks for the pull request! We'll review and test this and hopefully get it merged in.

@reiddraper
Contributor

I haven't looked at the patch in detail yet, but I can confirm this is an issue. I uploaded a 2GB file, and then watched memory usage grow as I downloaded it with curl and the --limit-rate option. Something like this:

s3cmd mb s3://large
s3cmd put -P ~/Desktop/2GB s3://large
curl --limit-rate 128 localhost:8080/large/2GB > /dev/null

@ghost ghost assigned reiddraper May 28, 2013
@reiddraper
Contributor

@arekinath I've been testing your patch, and everything seems to be working as expected. I have a couple comments:

As an aside, in the timeout handler, I have added a manual call to erlang:garbage_collect/0, because I was finding that the GC would not run often enough on the riak_cs_get_fsms without it when they got very busy.

I've not been able to reproduce this. If I remove the erlang:garbage_collect() call, I do not see memory usage growing. This could obviously be an environment issue, but I'm curious if you could double-check. If we do indeed need this call, I'd be inclined to wrap it in a helper function that would let us turn it into a no-op via an app.config option.
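Something along these lines would do it (the fsm_manual_gc key is invented here purely for illustration):

%% No-op unless explicitly enabled in app.config.
maybe_garbage_collect() ->
    case application:get_env(riak_cs, fsm_manual_gc) of
        {ok, true} -> erlang:garbage_collect();
        _          -> ok
    end.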

The reason why I defer the call to read_chunks/1 in the get_next_chunk handler using a gen_fsm timeout is to avoid any unnecessary delay when there are lots of get_next_chunk messages queued. Fetching new chunks will wait until the current messages in the queue have all been handled. I found this to give better performance than calling read_chunks/1 straight away.

Interesting. I did find this a bit hard to follow in code, and in general I'd prefer to use the same backpressure mechanism the PUT fsm does, but your rationale is compelling. Maybe I'll cook up a patch without the timeout and see if I can find a performance difference.

@arekinath
Contributor Author

@reiddraper I ran some more controlled tests in our environment here, and I can reproduce a slight advantage in peak RSS with the erlang:garbage_collect/0 call (usually 5-10MB per client). It's nothing like what I thought I saw before, though (which was probably just a lack of controls in my testing while I was hacking away, so sorry about that). I think dropping the garbage_collect call is probably a good idea.

If you want a version without the timeout step, try something like https://gist.github.com/arekinath/5667290 on top of this patch (maybe without the garbage_collect call). I've done a bit of benchmarking with and without the timeout deferment here, and I get a 5-10% throughput increase by having it in. If you don't think that's worth it for making the code a bit weird compared to the other FSMs, I can redo the pull request without it.

@reiddraper
Contributor

I think dropping the garbage_collect call is probably a good idea.

Ok cool, +1.

If you want a version without the timeout step, try something like https://gist.github.com/arekinath/5667290 on top of this patch (maybe without the garbage_collect call). I've done a bit of benchmarking with and without the timeout deferment here, and I get a 5-10% throughput increase by having it in. If you don't think that's worth it for making the code a bit weird compared to the other FSMs, I can redo the pull request without it.

I tried this, and I'm seeing an even bigger throughput difference than you saw, namely:

  • this PR: roughly equivalent to the HEAD of the develop branch
  • no timeout (patch applied from the gist): 28% slower for a 2GB file

Granted, this is just laptop micro-benchmarking. Curiously, I also noticed that both on develop and the no-timeout case, there is a big drop in performance right at the beginning of the download. I'm going to try and collect some data about this and get some graphs going.

tl;dr: let's remove the explicit GC call, and then I think I'm +1 on this.

@reiddraper
Contributor

Granted, this is just laptop micro-benchmarking. Curiously, I also noticed that both on develop and the no-timeout case, there is a big drop in performance right at the beginning of the download. I'm going to try and collect some data about this and get some graphs going.

I've not been able to reproduce the performance drop-off on a real cluster. I still see a 28% speed reduction using the no-timeout patch, so I think it's definitely worth keeping the code as-is.

I'm also noticing that changing the fetch_concurrency isn't improving performance (on any branch) as much as I would have suspected, so that's another area to investigate (in another issue). So, I'm still +1 pending the removal of the erlang:garbage_collect() call. Nice work, much appreciated!

This avoids excessive RAM usage when fetching very
large files over user network links that are much slower
than the intra-cluster network links.
@arekinath
Contributor Author

@reiddraper I've amended it to drop the garbage_collect call and re-pushed it.

@reiddraper
Contributor

+1

reiddraper added a commit that referenced this pull request May 30, 2013
Limit number of blocks buffered in memory during fetch
@reiddraper reiddraper merged commit 7920f5b into basho:develop May 30, 2013
@arekinath arekinath deleted the bugfix/excessive_fetch_ram_usage branch June 5, 2013 12:20