I'm marking this as non-critical since we have rarely seen problems with it in practice, but in theory this is a critical bug.
What happens is that if, because of some error, a Redis client (especially a Pub/Sub client, or a slave) is not able to consume the output produced by the server fast enough, the output buffer for that client will grow and grow, to the point where it could crash the server with an out-of-memory condition.
Should we simply close the connection once a given limit is reached?
Should Pub/Sub handle this differently, sending warning messages to the client as it nears the limit?
Additional point: also close slave (and monitor) connections if the output buffer gets too big.
The semantics are:
1) Close the connection if the client stays over the soft-limit for the specified amount of seconds.
2) Close the connection ASAP once the client reaches the hard-limit.
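As a sketch, the two rules above can be modeled per client like this (pure-Python illustration of the policy; the class and method names are mine, not Redis internals):

```python
import time

class OutputBufferLimiter:
    """Models the proposed per-client policy: close immediately when the
    hard limit is hit, or when the soft limit persists for too long."""

    def __init__(self, hard_limit, soft_limit, soft_seconds):
        self.hard_limit = hard_limit
        self.soft_limit = soft_limit
        self.soft_seconds = soft_seconds
        self.soft_since = None  # when the client first exceeded the soft limit

    def should_close(self, buffer_bytes, now=None):
        now = time.time() if now is None else now
        # Rule 2: over the hard limit -> close ASAP.
        if buffer_bytes >= self.hard_limit:
            return True
        # Rule 1: soft limit must persist for soft_seconds before closing.
        if buffer_bytes >= self.soft_limit:
            if self.soft_since is None:
                self.soft_since = now
            return now - self.soft_since >= self.soft_seconds
        self.soft_since = None  # dropped back under the soft limit: reset
        return False
```

A buffer that briefly spikes over the soft limit and drains again is left alone; only a sustained soft-limit violation, or any hard-limit violation, closes the connection.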
The class argument tells Redis which clients are affected by the limit; it selects one of the three client classes (normal clients, slaves, and Pub/Sub clients).
It will be possible to use the max-client-output-buffer statement multiple times to configure the limits for each of the three classes of clients.
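For illustration, the configuration could look something like the following (this directive eventually shipped in later Redis versions as client-output-buffer-limit <class> <hard-limit> <soft-limit> <soft-seconds>; the exact names and values below are indicative, not a spec):

```
# Normal clients: no limit (0 disables a limit).
client-output-buffer-limit normal 0 0 0
# Slaves: close at 256mb immediately, or above 64mb for 60 seconds.
client-output-buffer-limit slave 256mb 64mb 60
# Pub/Sub clients get a tighter budget.
client-output-buffer-limit pubsub 32mb 8mb 60
```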
It looks like this is affecting us -- we've been having some short Trello outages where more or less simultaneously, all of our processes show this error, then die:
Error: Error: ERR command not allowed when used memory > 'maxmemory'
at Command.callback (/home/trellis/trellis/node_modules/redis/index.js:774:27)
at RedisClient.return_error (/home/trellis/trellis/node_modules/redis/index.js:382:25)
at RedisReplyParser.<anonymous> (/home/trellis/trellis/node_modules/redis/index.js:78:14)
at RedisReplyParser.emit (events.js:64:17)
at RedisClient.on_data (/home/trellis/trellis/node_modules/redis/index.js:358:27)
at Socket.<anonymous> (/home/trellis/trellis/node_modules/redis/index.js:93:14)
at Socket.emit (events.js:64:17)
at Socket._onReadable (net.js:672:14)
Redis is serving as a data structure cache and a pubsub server for us.
We run with
Monitoring Redis around one of these outages, we can see
It seems like what is happening is that one of our clients temporarily stops receiving pubsub, which causes its output list to fill up; that consumes a bunch of memory, driving all evictable DB keys out and then causing errors on write.
I can simulate this in test by having one healthy Redis client and one client occupied so that it cannot get free to receive pubsub, then throwing a bunch of data at a pubsub channel the occupied client is subscribed to.
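That failure mode is easy to model in-process, without a server: fan each published message out to a per-subscriber queue, drain only the healthy subscriber, and watch the occupied subscriber's backlog grow unboundedly (all names here are illustrative; this is not redis-py API, just a model of the server-side queues):

```python
from collections import deque

def publish(subscribers, message):
    """Server side of Pub/Sub: append the message to every subscriber's
    output queue, regardless of whether that subscriber is reading."""
    for queue in subscribers.values():
        queue.append(message)

subscribers = {"healthy": deque(), "occupied": deque()}

for _ in range(10_000):
    publish(subscribers, b"x" * 100)       # ~100 bytes per message
    while subscribers["healthy"]:          # the healthy client keeps up
        subscribers["healthy"].popleft()
    # the occupied client never reads, so its queue only grows

backlog = sum(len(m) for m in subscribers["occupied"])
print(f"healthy backlog: {len(subscribers['healthy'])} messages")
print(f"occupied backlog: {backlog} bytes")  # 1,000,000 bytes and climbing
```

With no limit on the queue, the occupied client's backlog is bounded only by how long it stays busy, which is exactly the memory-growth problem described above.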
While we're looking to mitigate and/or fix this on the client side, we'd love to fix it on the server, too. It sounds like you already have an idea for how you'd like to see it done. Can you offer guidance (or, of course, a patch, if you know exactly how you want to do it)?
Thanks! Redis has been a really great tool for Trello.
We discovered that actually the client that was getting the long output_list was making a lot of requests as part of a job that fired every 5 minutes, and that this always generated a long output_list for that client. This was fine, and was consumed in a few seconds unless the server was close to the configured maxmemory. In that case, the queue stayed large as the server evicted all of its keys. It seems like there was some sort of sympathetic response, where the server had queues building up because it could not evict keys quickly enough to keep up. Does that make sense?
I've seen this problem when a slow MONITOR client caused the server's memory to grow rapidly, and we had to shut the server down and point all clients at its slave (so lucky that we had a slave for that server). The instance is running redis-2.0.4, so we have no way to identify or close the MONITOR client on the server side.
The length of client/slave's reply list and query buffer should be limited, and slow clients should be closed.
This sounds great to me. This has caused some production outages for us.
The proposed new feature is now implemented in the 'limits' branch: unstable...limits
Feedback/testing welcome. I'm going to write tests for this code today.
That's great! I'll check it out - it looks way more sophisticated than what I had in mind, which I guess should not be a surprise.
Neat. This looks perfect for preventing the problems we were seeing; the first few we saw would have been caught by the defaults (they were because of non-responsive slaves), and we'll configure our normal clients with a limit as well - those event-driven servers are awesome at requesting too many things at once.
@brettkiefer very cool to know this will work for you! Thanks for the ACK
Tests added, merged into unstable, closing. Thank you.