
Redis should close the connection when the client output buffer is too big #91

Closed
antirez opened this Issue September 20, 2011 · 9 comments

4 participants

Salvatore Sanfilippo brettkiefer jokea Matt Ranney
Salvatore Sanfilippo
Owner

I'm marking this as non-critical since we have rarely seen problems with it in practice, but in theory this is actually a critical bug.

What happens is that if, due to an error, a Redis client (especially a Pub/Sub client, or a slave) is not able to consume the output produced by the server fast enough, the output buffer for that client will grow more and more, to the point that it could crash the server by running it out of memory.

Once a given limit is reached, should we simply close the connection?

Should Pub/Sub handle this in a different way, sending warning messages to the client when we are near the limit?

Additional points: also close slave (and MONITOR) connections if the output buffer gets too big.

Plan

  • Add max-client-output-buffer <class> <hard-limit-bytes> <soft-limit-bytes> <seconds>

The semantics are:

1) Close the connection if the client stays over the soft limit for the specified number of seconds.
2) Close the connection ASAP once the client reaches the hard limit.

The class argument tells Redis which clients are affected by the limit, and can be:

  • pubsub (Clients subscribed to one or more Pub/Sub channels)
  • slave (Slave instances or clients running the MONITOR command)
  • standard (Normal clients)

It will be possible to use the max-client-output-buffer statement multiple times to configure the limits for the three different classes of clients.
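
For concreteness, here is a sketch of how this could look in redis.conf, using the directive name and argument order from the plan above. The size and time values are purely illustrative, not proposed defaults, and the mb shorthand assumes the directive accepts the same memory units as maxmemory:

# max-client-output-buffer <class> <hard-limit-bytes> <soft-limit-bytes> <seconds>
max-client-output-buffer pubsub 32mb 8mb 60
max-client-output-buffer slave 256mb 64mb 60
max-client-output-buffer standard 64mb 16mb 120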

brettkiefer

It looks like this is affecting us -- we've been having some short Trello outages where, more or less simultaneously, all of our processes show this error and then die:

Error: Error: ERR command not allowed when used memory > 'maxmemory'
at Command.callback (/home/trellis/trellis/node_modules/redis/index.js:774:27)
at RedisClient.return_error (/home/trellis/trellis/node_modules/redis/index.js:382:25)
at RedisReplyParser.<anonymous> (/home/trellis/trellis/node_modules/redis/index.js:78:14)
at RedisReplyParser.emit (events.js:64:17)
at RedisReplyParser.send_error (/home/trellis/trellis/node_modules/redis/lib/parser/javascript.js:265:14)
at RedisReplyParser.execute (/home/trellis/trellis/node_modules/redis/lib/parser/javascript.js:124:22)
at RedisClient.on_data (/home/trellis/trellis/node_modules/redis/index.js:358:27)
at Socket.<anonymous> (/home/trellis/trellis/node_modules/redis/index.js:93:14)
at Socket.emit (events.js:64:17)
at Socket._onReadable (net.js:672:14)

Redis is serving as a data structure cache and a pubsub server for us.
We run with
maxmemory 2G
maxmemory-policy allkeys-lru

Monitoring Redis around one of these outages, we can see
pubsub_channels:9665
client_longest_output_list:4490
client_biggest_input_buf:497
blocked_clients:0
used_memory:2020084624

It seems like what is happening is that one of our clients is temporarily not receiving pubsub, and that causes its output list to fill up, which then consumes a bunch of memory, driving all evictable DB keys out, then causing errors on write.
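
For anyone trying to catch this as it develops, a crude way to poll those fields from a shell (this assumes redis-cli can reach the instance on its default port; the field names are the ones shown above):

while true; do
  redis-cli info | grep -E 'used_memory:|client_longest_output_list|pubsub_channels|blocked_clients'
  sleep 5
done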

I can simulate this in test by having one healthy Redis client and one client occupied so that it cannot get free to receive pubsub, then throwing a bunch of data at a pubsub channel the occupied client is subscribed to.
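
Roughly, that test looks like this as a shell sketch (the channel name and payload are made up, and SIGSTOP stands in for the "occupied" client that has stopped reading its socket):

redis-cli subscribe busychannel > /dev/null &
SUB_PID=$!
kill -STOP "$SUB_PID"    # freeze the subscriber so it stops draining its connection
for i in $(seq 1 100000); do
  redis-cli publish busychannel "payload-$i" > /dev/null
done
kill -CONT "$SUB_PID"    # or just kill it, once client_longest_output_list has grown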

While we're looking to mitigate and/or fix this on the client side, we'd love to fix it on the server, too. It sounds like you already have an idea for how you'd like to see it done. Can you offer guidance (or, of course, a patch, if you know exactly how you want to do it)?

Thanks! Redis has been a really great tool for Trello.

brettkiefer

We discovered that the client that was getting the long output_list was actually making a lot of requests as part of a job that fired every 5 minutes, and that this always generated a long output_list for that client. That was fine, and the list was consumed in a few seconds unless the server was close to the configured maxmemory. In that case, the queue stayed large as the server evicted all of its keys. It seems like there was some sort of sympathetic response, where the server's queues kept building up because it could not evict keys quickly enough to keep up. Does that make sense?

jokea

I've seen this problem when a slow MONITOR client caused the server's memory to grow rapidly, and we had to shut the server down and point all clients at its slave (so lucky that we had a slave for that server). The instance was running redis-2.0.4, so we had no way to identify or close the MONITOR client on the server side.

The length of client/slave's reply list and query buffer should be limited, and slow clients should be closed.

Matt Ranney

This sounds great to me. This has caused some production outages for us.

Salvatore Sanfilippo
Owner

The proposed new feature is now implemented in the 'limits' branch: unstable...limits

Feedback/testing welcome. I'm going to write tests for this code today.
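
For anyone who wants to try it, something along these lines should work (the branch name is the one above; the paths assume the usual source layout of the repo):

git clone https://github.com/antirez/redis.git
cd redis
git checkout limits
make
./src/redis-server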

brettkiefer

That's great! I'll check it out - it looks way more sophisticated than what I had in mind, which I guess should not be a surprise.

brettkiefer

Neat. This looks perfect for preventing the problems we were seeing; the first few we saw would have been caught by the defaults (they were caused by non-responsive slaves), and we'll configure our normal clients with a limit as well - those event-driven servers are awesome at requesting too many things at once.

Salvatore Sanfilippo
Owner

@brettkiefer very cool to know this will work for you! Thanks for the ACK

Salvatore Sanfilippo
Owner

Tests added, merged into unstable, closing. Thank you.

Salvatore Sanfilippo antirez closed this January 25, 2012