
redis-server crash w/ stack trace... #141

Closed
jzawodn opened this Issue · 53 comments

6 participants

Jeremy Zawodny Salvatore Sanfilippo jokea Baptiste Courtois anydot Pieter Noordhuis
Jeremy Zawodny

I haven't quite updated to 2.4.0 yet but was running 2.3.10 in production and lost one earlier.

[17466] 15 Oct 07:05:25 # ======= Ooops! Redis 2.3.10 got signal: -11- =======
[17466] 15 Oct 07:05:25 # redis_version:2.3.10
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:17466
uptime_in_seconds:1353855
uptime_in_days:15
lru_clock:1845328
used_cpu_sys:8926.23
used_cpu_user:12859.53
used_cpu_sys_children:70.14
used_cpu_user_children:22.71
connected_clients:1
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:18446744071562068051
blocked_clients:0
used_memory:8472278192
used_memory_human:7.89G
used_memory_rss:6308675584
used_memory_peak:8472286720
used_memory_peak_human:7.89G
mem_fragmentation_ratio:0.74
mem_allocator:jemalloc-2.2.1
loading:0
aof_enabled:0
changes_since_last_save:210
bgsave_in_progress:0
last_save_time:1318611938
bgrewriteaof_in_progress:0
total_connections_received:91901
total_commands_processed:284065681
expired_keys:0
evicted_keys:0
keyspace_hits:97704116
keyspace_misses:28609668
hash_max_zipmap_entries:512
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:242
[17466] 15 Oct 07:05:25 # /lib64/libc.so.6(memcpy+0xe1) [0x7fd179c63e21]
[17466] 15 Oct 07:05:25 # /lib64/libc.so.6(memcpy+0xe1) [0x7fd179c63e21]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server(sdscatlen+0x54) [0x412ce4]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server(readQueryFromClient+0x33) [0x418f93]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server(aeProcessEvents+0x16e) [0x40d13e]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server(aeMain+0x2e) [0x40d35e]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server(main+0xf9) [0x412529]
[17466] 15 Oct 07:05:25 # /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fd179c02586]
[17466] 15 Oct 07:05:25 # /usr/bin/redis-server [0x40c649]

Salvatore Sanfilippo
Owner

Thank you Jeremy,

please, if it is easy for you to obtain, could you send the exact git commit that you were using, and if possible even the redis-server executable itself? Thanks!

Jeremy Zawodny

Argh. I can get the binary, yes. But apparently there's a step in the build process I've been using (only when building from the git repo) that removes the git info along the way. That's fixed now, but I can't say exactly which 2.3.10 this was.

http://jeremy.zawodny.com/tmp/redis-server

That's the binary that was running.

Salvatore Sanfilippo
Owner

Thank you Jeremy, the binary will be helpful.
The bug is an extremely difficult one to track, as it happens in a place where the generated stack trace is basically the same for every kind of command received, so we have very little state information.

However, at first glance it appears to be due to a corruption in the client structure output buffer.
I'm pretty sure I never saw something like that in 2.2, so I'll try to diff the two branches to check what we changed in the buffer handling, and how this may happen.

Please could you tell me if that instance is using blocking operations? In general a list of commands and features used would be extremely useful given that we have almost no other state.

Thanks!
Salvatore

Jeremy Zawodny

Interestingly, I found another instance that had crashed as well. The trace is below.

Other noteworthy item: this has only happened on slaves so far. Slaves that, typically, get no read traffic. They're effectively used as hot-standby instances in case their master fails.

I'm inclined to deploy 2.4.1 today to see what that looks like, unless you'd prefer I keep running 2.3.10 to help track this down.

Trace:

[2618] 15 Oct 09:03:51 # ======= Ooops! Redis 2.3.10 got signal: -11- =======
[2618] 15 Oct 09:03:51 # redis_version:2.3.10
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:2618
uptime_in_seconds:749340
uptime_in_days:8
lru_clock:1846039
used_cpu_sys:3233.74
used_cpu_user:2282.14
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
connected_clients:1
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:18446744071562068008
blocked_clients:0
used_memory:7875902744
used_memory_human:7.34G
used_memory_rss:5669847040
used_memory_peak:7875911272
used_memory_peak_human:7.34G
mem_fragmentation_ratio:0.72
mem_allocator:jemalloc-2.2.1
loading:0
aof_enabled:0
changes_since_last_save:122674687
bgsave_in_progress:0
last_save_time:1317945291
bgrewriteaof_in_progress:0
total_connections_received:38705
total_commands_processed:120850097
expired_keys:25
evicted_keys:0
keyspace_hits:39590911
keyspace_misses:13297103
hash_max_zipmap_entries:512
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
[2618] 15 Oct 09:03:51 # /lib64/libc.so.6(memcpy+0xe1) [0x7f08a2197e21]
[2618] 15 Oct 09:03:51 # /lib64/libc.so.6(memcpy+0xe1) [0x7f08a2197e21]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server(sdscatlen+0x54) [0x412ce4]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server(readQueryFromClient+0x33) [0x418f93]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server(aeProcessEvents+0x16e) [0x40d13e]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server(aeMain+0x2e) [0x40d35e]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server(main+0xf9) [0x412529]
[2618] 15 Oct 09:03:51 # /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f08a2136586]
[2618] 15 Oct 09:03:51 # /usr/bin/redis-server [0x40c649]

Salvatore Sanfilippo
Owner

That's interesting Jeremy! Not only do slaves not get read queries, they also have a single client for a long time.
And this is happening on the client structure. This is definitely more info that can help trace the bug.

Thanks! Now I've enough information to start a serious investigation tomorrow.

More news soon.

Salvatore Sanfilippo
Owner

p.s. interesting how apparently after the bug we can't see the full INFO output.
This should be fixed in 2.4 if I remember correctly, so upgrading is worth it even to get full stack traces ;)

Another interesting thing:

client_biggest_input_buf:18446744071562068008

Very similar to the previous instance. This huge number makes me think there is an overflow issue there, like subtracting something bigger than the current value, leading to this huge result (it wraps around from 0 to the other end, I guess).

This is definitely another big hint.

Jeremy Zawodny

Interestingly, there is a third instance in our environment that died in the same fashion. So whatever happened has at least been reproduced 3 times in our environment. We've tightened up our monitoring so that we'll see these more quickly in the future as well.

To put the 3 in perspective, we have a total of 80 nodes, 40 in each data center. There are 20 hosts involved and each one runs 4 instances (2 masters and 2 slaves).

In the meantime, I'm staying at 2.3.10 but building a 2.4.x package that's ready to deploy in case it makes sense to upgrade some or all of the instances.

And so far this only affects slaves, as I mentioned before.

Jeremy Zawodny

Your earlier question about operations we commonly use. I can say that we use a lot of ZSETs (most of them are small and stay in the compact representation). If you have a way I can privately send you some data, I've been re-testing AOF on one of our nodes and could give you a representative sample of a .aof file. That might be useful (or not). Feel free to email me directly (Jeremy@Zawodny.com) if you'd like and I can make it available.

Salvatore Sanfilippo
Owner

Hello Jeremy, thank you for your help.

I analyzed the issue and this is my first guess:

1) For some reason the slave client gets into a "blocking" state, like when you use BLPOP or similar, so the input is no longer processed.
2) Because of (1) the input buffer reaches 2 GB in size, which is the limit for an sds string, and Redis segfaults.

Fortunately, thanks to your big Redis installation I think we can verify whether the above is true.
It probably takes some time for this to happen, since accumulating 2 GB of writes is not a small amount.
So my guess is that, if I'm right, checking the Redis slaves with INFO you'll discover some slave with a client_biggest_input_buf that is several megabytes in size (even if it is just 500k, there is something wrong).
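
For context, the 2 GB limit mentioned in point 2 comes from the sds string header used by the Redis 2.x line, which stores its counters as plain signed ints (a paraphrased sketch, not the verbatim header):

struct sdshdr {
    int len;     /* bytes currently used in buf */
    int free;    /* bytes still unused in buf */
    char buf[];  /* the string data follows the header */
};
/* With a signed 32-bit len, a buffer approaching 2GB wraps the counter negative. */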

If we find such an instance we can start speculating why it is blocked, and the simplest way to check this is perhaps:

CLIENT LIST

That will provide us with info about the state of the client.

Thank you,
Salvatore

Salvatore Sanfilippo
Owner

p.s. all the above seems somewhat strange as you don't use blocking commands AFAIK, so perhaps I'm completely wrong, but currently it's my best guess... so apologies in advance if, after the check, you don't find anything strange about the input buffers... ;)

Salvatore Sanfilippo
Owner

Another small remark: that strange value we see in INFO for this huge buffer is not random; it is about 2^31 smaller than 2^64. That's why I was thinking of an overflow or something similar...
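
A quick back-of-the-envelope check of that guess (illustrative C, not Redis code): a length that wrapped just past the 2^31 boundary and is then reinterpreted as an unsigned 64-bit value prints exactly the kind of number seen in the INFO output above.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* A 32-bit length that overflowed to a large negative value... */
    int32_t wrapped = -2147483565;               /* roughly -2^31 */
    /* ...then widened and printed as an unsigned 64-bit counter. */
    uint64_t shown = (uint64_t)(int64_t)wrapped;
    printf("%llu\n", (unsigned long long)shown); /* 18446744071562068051 */
    return 0;
}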

Jeremy Zawodny

Hmm. OK, I'll do some more poking around and see what I can find. In fact, I can add that field as a metric to our monitoring system easily so we'll know which servers are acting up (and when).

Salvatore Sanfilippo
Owner

Thanks Jeremy, if none of your slaves currently has a high value I think my hypothesis should be considered invalid and we can start again from scratch...

I'm trying to understand if I can add a few asserts in Redis 2.4.2 that may help us with this issue.

Jeremy Zawodny

I'm deploying some new monitoring code now so we can collect and watch this value. If they're increasing on any of our slaves, I should know soon...

Salvatore Sanfilippo
Owner

Great! Thanks.

jokea

From the info of the first crashed instance, I think it's a master, not a slave:
connected_clients:1^M
connected_slaves:1^M

Jeremy Zawodny

Good eye!

Actually it's a slave of a master in a remote colo. But it also has a replica in the local colo. So technically it's both.

Does that make sense?

jokea

If it also acts as a slave, "connected_clients:" should be 2 instead of 1, unless the
master <-> slave link was down when the crash happened.

This may not be related to the crash though.

Salvatore Sanfilippo
Owner

oh that's interesting. Jeremy: were the other crashing instances slaves and masters at the same time too?

p.s. I guess so far none of your instances has a client with a big input buffer, otherwise your monitoring system would have alerted you. So probably my first guess is wrong.

Thanks for the help

Jeremy Zawodny

I'll have a look in the morning to see if the others were masters and slaves--I'm just heading to bed.

Our system has seen some non-zero buffers reported, but they're occasional and only around 8KB in size when we see them.

Salvatore Sanfilippo
Owner

Our system has seen some non-zero buffers reported, but they're occasional and only around 8KB in size when we see them.

Thanks, it's time to look at this issue from another point of view then :) It is hardly what I suggested before.
So it looks more like a plain corruption-style problem. I hope it is not still present in 2.4, even though the version you are using is pretty new.

Today I'll set up some tests to try to simulate the issue; maybe we'll be lucky and it will be reproducible. Thanks!

Salvatore Sanfilippo
Owner

Hello Jeremy! I guess this never happened again since you upgraded, as I see no news here; asking just for an ACK :) Thank you and have a nice day.

Jeremy Zawodny

Actually, we had some of these happen a couple days ago. :-(

But this time our monitoring system recorded the problem for us:

http://dl.dropbox.com/u/6763272/redis-output-buff.png

We're not sure what triggered that yet, but it definitely makes me suspect some odd network condition.

Meanwhile, all crashed nodes are running 2.4.2 now.

Salvatore Sanfilippo
Owner

Hi Jeremy! It seems like the mystery is no longer a mystery :)

Facts:

  • The graph you provided is proof that my earlier conjecture was right: this is an overflow of an sds buffer reaching 2GB in size.
  • The overflowing sds buffer is the output buffer, so the crashing Redis instance is not able to send data, or not able to send it at the same speed at which the data should be received by the client.

When can Redis not send data to clients? There are multiple possibilities:

  • There is a networking problem.
  • A client simply crashed or is blocked in a deadlock of some kind, without closing the connection, while it should still be receiving data continuously (for instance because it is subscribed to a Pub/Sub channel). Such a client can be a Redis slave, which is a special kind of client that continuously receives writes.
  • Because of a bug we removed the write handler from the client file descriptor. This never happened before, but in theory it is a possible cause.

So Redis keeps writing to the client buffer, but no data is removed from it, because either the client is not reading the data, or the network silently discards our packets without replying with a TCP reset packet that would close our connection. At some point the sds length overflows and the server crashes.

Why is this particularly likely to happen with slaves? Because we only have timeout detection on the slave side. The master does not notice when a slave is no longer available. We introduced pings and the like only from the point of view of the slave, since it was much more important for the slave to detect that it was no longer receiving fresh data from the master, so it could close the old connection and reconnect.

So what to do?

The first thing to do to fix this bug, or at least its effect, is to close the connection when the input buffer of a client reaches a given large size, like 1GB or so. This will prevent the crash. I'll commit such a change by Monday at the latest and will release a new version of Redis stable.

But it is possible to argue that this only fixes the effect and not the cause, and that a proper fix would also require timeout detection on the master -> slave link from the point of view of the master, and not just of the slave.
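
A minimal sketch of the cut-off described above, placed where readQueryFromClient() grows the query buffer; the constant, the log text and the exact placement are illustrative rather than the committed change:

/* Sketch only: drop clients whose accumulated query buffer passes a hard cap. */
#define QUERYBUF_HARD_LIMIT (1024LL*1024*1024)   /* 1GB, illustrative constant */

if (sdslen(c->querybuf) > QUERYBUF_HARD_LIMIT) {
    redisLog(REDIS_WARNING,
        "Closing client that reached max query buffer length (qbuf=%zu)",
        sdslen(c->querybuf));
    freeClient(c);
    return;
}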

Jeremy: if we are near a solution it is thanks to your help. Thank you!
I really don't like having open bugs of this kind... and often the only way we can get enough context to understand what is happening is the information that the user experiencing the bug can collect and provide.
For instance, in this case your mention of a possible network issue was very important to ring the right bells...

Baptiste Courtois

Hi Antirez and Jeremy!

Today we faced the same problem on one slave in our prod environment, only a few minutes after it started:

[14532] 21 Nov 11:49:40 # ======= Ooops! Redis 2.4.2 got signal: -11- =======
[14532] 21 Nov 11:49:40 # redis_version:2.4.2
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:14532
uptime_in_seconds:2385
uptime_in_days:0
lru_clock:67042
used_cpu_sys:127.59
used_cpu_user:44.75
used_cpu_sys_children:4.84
used_cpu_user_children:1.77
connected_clients:20
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:18446744071562068563
blocked_clients:0
used_memory:31861605448
used_memory_human:29.67G
used_memory_rss:30233288704
used_memory_peak:31861605448
used_memory_peak_human:29.67G
mem_fragmentation_ratio:0.95
mem_allocator:jemalloc-2.2.1
loading:0
aof_enabled:0
changes_since_last_save:60
bgsave_in_progress:0
last_save_time:1321873795
bgrewriteaof_in_progress:0
total_connections_received:119
total_commands_processed:237
expired_keys:0
evicted_keys:0
keyspace_hits:278
keyspace_misses:23
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:3429217
vm_enabled:0
role:slave
master_host: 
master_port:6379
master_
[14532] 21 Nov 11:49:40 # /lib/libc.so.6(memcpy+0xe1) [0x2aaaab3e7fb1]
[14532] 21 Nov 11:49:40 # /lib/libc.so.6(memcpy+0xe1) [0x2aaaab3e7fb1]
[14532] 21 Nov 11:49:40 # redis-server(sdscatlen+0x54) [0x412fe4]
[14532] 21 Nov 11:49:40 # redis-server(readQueryFromClient+0x33) [0x419293]
[14532] 21 Nov 11:49:40 # redis-server(aeProcessEvents+0x16e) [0x40d35e]
[14532] 21 Nov 11:49:40 # redis-server(aeMain+0x2e) [0x40d57e]
[14532] 21 Nov 11:49:40 # redis-server(main+0xf9) [0x4127d9]
[14532] 21 Nov 11:49:40 # /lib/libc.so.6(__libc_start_main+0xe6) [0x2aaaab3895a6]
[14532] 21 Nov 11:49:40 # redis-server [0x40c869]

As the crash log shows:

  • We are using Redis 2.4.2.
  • This instance was only a slave (it might have received some "SET" commands from some clients).
  • The input buffer seems to be huge too ^^

I don't know if it can help you, but unlike Jeremy we are not using ZSETs, and we didn't have any monitoring tools on that slave.
Do not hesitate to ask me if you have any questions about our redis setup.

Regards,

Baptiste.

Salvatore Sanfilippo
Owner

Thanks, this is definitely the same bug, but the fact that it is just a slave makes things a bit different.
Also it does not help that Redis truncates the INFO output in stack traces... (fixing that as well; it's already fixed in unstable).

Btw the changes I'm doing should provide a lot more background when the input or output buffer overflows... doing the changes right now. I'll update this issue soon.

Btw, in my reasoning earlier in this thread I completely missed that even in Jeremy's case the overflow is happening in the INPUT buffer of the client and not in the output buffer; this changes a few things...

More news soon.

anydot

What about using TCP keepalive on the master<->slave link? If for some reason the link goes idle or the network drops packets from the master, TCP keepalive will kick in and after some time will close the link, without any additional code in Redis other than enabling keepalive mode on the socket.
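
A sketch of the suggestion above, assuming a plain POSIX socket for the replication link; the Linux-specific TCP_KEEPIDLE/KEEPINTVL/KEEPCNT tuning is optional and the values are placeholders:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable kernel keepalive probes on the master<->slave socket so a dead
 * peer is eventually detected and the connection closed. */
static int enableKeepAlive(int fd) {
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes)) == -1)
        return -1;
#ifdef TCP_KEEPIDLE
    int idle = 60, intvl = 10, cnt = 6;          /* placeholder tuning */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
#endif
    return 0;
}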

Salvatore Sanfilippo
Owner

@anydot: yes this may work, but apparently the problem here is different... but I'll keep that in mind when fixing the replication timeout issue. Thanks. Next message with updates on the current issue in a second.

Salvatore Sanfilippo
Owner

I just committed a number of changes to Redis unstable, improving the CLIENT LIST output and closing clients whose query buffer grows over 1GB in size. Redis detects the issue and uses the CLIENT LIST format to log the offending client, in a way similar to the following:

[42159] 21 Nov 16:15:49 # Closing client that reached max query buffer length: addr=127.0.0.1:54530 fd=5 idle=0 flags=N db=0 sub=0 psub=0 qbuf=1064960000 obl=0 oll=0 events=r

This can provide further information about what's happening with this bug, and hopefully it will also prevent the Redis instance from crashing. The bug is still there, but we get both more reliability and more information.

I'm backporting these changes into 2.4 to release 2.4.3.

Thanks for your help,
Salvatore

Salvatore Sanfilippo
Owner

Jeremy: quick question, I promise: what is the unit of measurement in the x-axis of the graph you provided? Thanks.

Pieter Noordhuis
Collaborator

Salvatore: why don't we remove the read event listener when the client becomes blocked? I think the only way the input buffer can grow too large is by continuing to read from the socket. Since we execute all commands that exist in the input buffer on every tick, it can only contain data when (1) there is a partial -- not fully parsed -- command in the buffer, or when (2) the client is blocked by a blocking pop. In the case of (1), we know that the parser will abort and close the connection when the protocol contains a string that is larger than 512MB, which happens sooner than aborting on the 1024MB limit you set on the input buffer. This only leaves case (2), where the client is blocked. We already have block/unblock routines in place, so we only need to remove/add the read event listener from the event loop in those routines and let TCP take care of the rest. This will also let the client implicitly know that it is sending commands too fast, because they are simply not drained from its write buffer.

Let me know if I'm missing something here.

Cheers,
Pieter
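
A rough sketch of the add/remove read handler approach described above, using the Redis event loop API; the block/unblock function names are illustrative and the real blocking routines carry more bookkeeping:

/* Stop reading from a client while it is blocked (e.g. by BLPOP) and let
 * TCP back-pressure hold further commands on the client side. */
void blockClientSketch(redisClient *c) {
    aeDeleteFileEvent(server.el, c->fd, AE_READABLE);
    /* ...existing blocking bookkeeping... */
}

void unblockClientSketch(redisClient *c) {
    /* ...existing unblocking bookkeeping... */
    aeCreateFileEvent(server.el, c->fd, AE_READABLE, readQueryFromClient, c);
}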

Salvatore Sanfilippo
Owner

Hi Pieter, removing the handler can be a solution but I prefer an explicit limit for a few reasons:

  • If a client just writes to Redis something like "AAAAAAAAAAAAAAAAAAAAA...." (a lot of bytes), there is no code path trapping this at all, and the query buffer will grow as big as possible. In general, putting maximum-length logic inside the parser for every case is not simpler or more effective than putting a limit in the read handler.
  • I want this problem to be detected and logged. This bug itself makes it clear that it is much better to see strange stuff in the log.
  • Your assumption that the only way to get a huge buffer is via blocking operations is not so obvious: for instance, Jeremy apparently does not use blocking ops, and still the bug is there. Whatever the actual cause of this bug, when writing the above comment you did not notice that there was a code path (filling with "AAA...") that has the effect of filling the buffer: all evidence that having the limit early in the code path is better than assuming the only way it can happen is via blocking.

One last point: even considering your arguments, I can't see how the solution I'm using is bad in a way that would require switching to an alternative.

Thanks for the suggestion!
Salvatore

Pieter Noordhuis
Collaborator

Yep, you're right. Doing this check once at a lower level, rather than multiple times for every different parser and condition, is definitely safer. Still, the add/remove read handler is something to keep in mind. It may be helpful elsewhere, for example when clients use the right protocol but queue up countless commands after a blocking operation. Rather than closing the connection after the input buffer reaches 1GB, we may want to avoid the input buffer getting this large altogether. This is different from this issue, however, so I'll just keep it in mind ;-)

Cheers,
Pieter

Jeremy Zawodny

@antirez the units on my graph were just hours of the day. So the issue started around 3:30am and instances died between 8-9am.

Salvatore Sanfilippo
Owner

@pietern: yes, it is an option, but my fear is that you then don't have an explicit trace in the log of what is happening. On the other hand, the 1GB limit can already be too big for small instances or instances near OOM, so it can indeed be useful.

@jzawodn: thanks for the info.

I just released Redis 2.4.3 backporting the above fixes; this still does not fix the issue, but it both mitigates it and will give us more clues about the cause. Thanks for all your help.

Jeremy Zawodny

@antirez great, I'll aim to deploy 2.4.3 on Tuesday. Thanks so much for tracking this down!

Salvatore Sanfilippo
Owner

@jzawodn: thank you Jeremy. Unfortunately the root cause of the bug is still unknown, so for now the effect when it happens is that Redis keeps running, but a client gets disconnected and a warning line is logged in the Redis log file. If you happen to see an instance growing the buffer towards 1GB in the monitoring system, could you please collect the log line that the instance will generate when the 1GB figure is reached? It contains very valuable information for tracking down the root cause of the issue. In the meantime I'll experiment more with replication, simulating a link issue, in the hope I'll be lucky enough to reproduce it. Thanks again!

Jeremy Zawodny

@antirez I understand. I'm going to set an alert in our monitoring system that will notify us when this happens and I will include the log(s) on this issue. Thanks!

Salvatore Sanfilippo
Owner

Thank you :)

jokea

We've encountered this bug today with redis-2.4.4. It happened when we attached 8 slaves to a master at the same time: two of them stopped processing commands from the master, and their query buffer kept growing and is now over 500MB:
...
addr=XXXX:pppp fd=5 idle=0 flags=M db=0 sub=0 psub=0 qbuf=533951418 obl=0 oll=0 events=r cmd=NULL
...

jokea

I managed to attach gdb to the slave process and below is the contents of server.master:
(gdb) p *server.master
$6 = {fd = 5, db = 0x2ab0443ef300, dictid = 0, querybuf = 0x2aabbac00008 "¬렬 argc = 0,
argv = 0x0, cmd = 0x0, lastcmd = 0x0, reqtype = 1, multibulklen = 0, bulklen = -1,
reply = 0x2ab043809220, sentlen = 0, lastinteraction = 1325063811, flags = 2,
slaveseldb = 0, authenticated = 1, replstate = 0, repldbfd = 0, repldboff = 0,
repldbsize = 0, mstate = {commands = 0x0, count = 0}, bpop = {keys = 0x0, count = 0,
timeout = 0, target = 0x0}, io_keys = 0x2ab043809250, watched_keys = 0x2ab043809280,
pubsub_channels = 0x2ab04380b940, pubsub_patterns = 0x2ab0438092b0, bufpos = 0,
buf = '\0' }

(gdb) x /50sb server.master->querybuf
0x2aabbac00008: "¬
0x2aabbac0000b: "\005sr"
0x2aabbac0000f: "\036xxxxxxx.sysxxxxxx.model.Notice[f�¿\003"
0x2aabbac00038: "\006J"
0x2aabbac0003b: "\ncreateTimeJ"
0x2aabbac00048: "\002idL"
0x2aabbac0004d: "\006ddddeyt"
0x2aabbac00056: "\022Lxxxx/lang/String;L"
0x2aabbac0006b: "\bauthorIdq"
0x2aabbac00076: "~"
0x2aabbac00078: "\001L"
0x2aabbac0007b: "\acondddtq"
0x2aabbac00085: "~"
0x2aabbac00087: "\001L"
0x2aabbac0008a: "\005tabcdq"
0x2aabbac00092: "~"
0x2aabbac00094: "\001xp"
0x2aabbac00098: ""
0x2aabbac00099: "\0014~㼜n\001t^⦜t\003t"
0x2aabbac000a9: "\n1111111111t"
0x2aabbac000b6: "\n1111111111t\002(彜210蚲01\227嚲06¾ﻜ214你䚲34¨微群 \"蚲25¿汜231罜216颜23733¢\"( http://xxxx111111111 ) 蚲07\214䚲17\221表暲32\204微䚲15\232 @蚲25 @蚲25¿汜231罜216颜23733¢ 䚲10 蚲31¤乜206⚲00\202 䚲10 蚲31¤微䚲15\232䚲06\205容ﻜ232大䚲17\221麻寜206ﻜ214主嚲11\223嚲10\217ᚲ00\234大仜227鸢...
0x2aabbac0017e: "»寜206ᚲ00\235嚲30¯䚲34¨䚲33½柜207麻寜206暲32\204䚲37º砜200䷜212滜224䚲14\226皲00\214嚲35¥ﻜ214䚲17ª嚲34\211䚲20\203⚲00\201碰⚲00\201嚲35 ⚲00\201䚲20¬⚲00\201皲03¡⚲00\201䚲15´嚲34\211䚲07 东216涵暲33\226乜206嚲11\200嚲34\211主洜201麻寜206暲25ª䚲36\213⚲00\202筜200䚲15\225暲32\204暲16©沜225ﻜ214䚲17\230䚲14\226暲32\204暲25ª䚲36\213ﻜ214䚲17ª䚲34¨大䚲17\221游嚲10\217ᚲ00\234大仜227砮..
---Type to continue, or q to quit---
0x2aabbac00246: "º»寜206ᚲ00\235⚲00\202䚲05¨3D游嚲10\217嚲25\210嚲36\234ﻜ214孜214罜216暲32\204游嚲10\217暲24»蚲35¢ﻜ214纜235佳暲32\204游嚲10\217享䚲17\227ﻜ214辜230嚲34\211嚲04\217嚲03³䷜21510°暲32\204嚲03\212䚲26\234大夜226ﻜ214䷜200䚲10\207䚲17ª䚲34¨ᚲ00\234大䚲17\221游嚲10\217ᚲ00\235⚲00\202http://xxxxdddddsadf"
0x2aabbac002ee: "=你䚲34¨蚲25 \"蚲25¿汜231罜216颜237䚲33¢\" 䚲06\205䚲17\221表暲32\20432\20䚲15\232被䚲10 蚲31¤乜206x\r\n3\r\n$3\r\nSET\r\n$20\r\nn_111111111111111111\r\n$551\r\n¬
0x2aabbac00360: "\005sr"
0x2aabbac00364: "\036xxxxxxx.sysxxxxxx.xxxxx.xxtice[f�¿\003"
0x2aabbac0038d: "\006J"
0x2aabbac00390: "\nabcadexxmeJ"
0x2aabbac0039d: "\002idL"
0x2aabbac003a2: "\006aasdfyt"
0x2aabbac003ab: "\022Lasdfffadtring;L"
0x2aabbac003c0: "\bxxxdhorIdq"
0x2aabbac003cb: "~"
0x2aabbac003cd: "\001L"
0x2aabbac003d0: "\axdfddntq"
0x2aabbac003da: "~"
0x2aabbac003dc: "\001L"
0x2aabbac003df: "\005asdffq"
0x2aabbac003e7: "~"
0x2aabbac003e9: "\001xp"
0x2aabbac003ed: ""
0x2aabbac003ee: "\0014~㵹\001t^갰6¦H\002t"
0x2aabbac003fe: "\n3027345222t"
0x2aabbac0040b: "\n1900639991t\001\033彜210蚲01\227嚲06¾ﻜ214你䚲34¨微群 \"䚲05¨暲20\203ᚲ35¤嚲21\204影模暲11¹婜222伜223䚲34\210\"( http://abcdefgad ) 蚲07\214䚲17\221表暲32\204讜204论2\204讜暲20\206䚲21\230 @大山暲13¬家嚲33\235嚲26\231 䚲10 蚲31¤中--Type to continue, or q to quit---
\206⚲00\202 䚲10 蚲31¤讜204论䚲06\205容ﻜ232@模暲11¹䚲20\215嚲30\237蚲33\206堮..
0x2aabbac004d3: "»\223䚲17· 暲32\204微群 \"模暲11¹嚲30\216嚲30\237蚲33\206纜223䚲17·\" 嚲14䷜215蚲24\231暲32\204 http://xxxx/SGD4Pt 嚲16¨皲15\220大家东237嚲35¥暲34\213暲34\213~t"
0x2aabbac00536: "L你䚲34¨微群 \"䚲05¨暲20\203ᚲ35¤嚲21\204影模暲11¹婜222伜223䚲34\210\" 䚲06\205䚲17\221表暲32\204讜204论被䚲10 蚲31¤乜206x\r\n
3\r\n$3\r\nSET\r\n$20\r\nn_111111111111111112\r\n$357\r\n¬
0x2aabbac005b7: "\005sr"
0x2aabbac005bb: "\036xxxxxxx.sysxxxxxx.mxxxl.xxxice[f�¿\003"
0x2aabbac005e4: "\006J"
0x2aabbac005e7: "\nabcdefgadJ"

The querybuf starts exactly at the "value" field of a SET command; the "*3\r\n$3\r\nSET\r\n$xx\r\nKEY\r\n$xx\r\n" header is lost.

Another interesting thing to notice is that the log shows that the length of bgrewritebuf after bgrewrite process terminated is 432 bytes:
...
[6588] 27 Dec 17:41:44 * Background AOF rewrite terminated with success
[6588] 27 Dec 17:41:44 * Parent diff successfully flushed to the rewritten AOF (432 bytes)
[6588] 27 Dec 17:41:44 * Background AOF rewrite successful
...
However, the last 432 bytes in the AOF are not a valid command; they are just part of a value. And since we only use db 13, the next command appended to the new AOF should be a SELECT command, which is the case in the AOF of a good slave.

The above situation is exactly the same on both slaves.

We have also seen another problem: if we attach 8 slaves to a master at the same time, some of them end up with a different number of keys than the master. However, if we do it one by one (the master dumps for every slave), they all get synced with the master.

Salvatore Sanfilippo
Owner

Thank you a ton. My guess is that the different keys you see on the slave are the same bug, just with a less radical effect: the protocol desync only causes a few commands to be missed instead of creating a condition where no further commands are processed. Investigating right now... we have more info now.

Salvatore Sanfilippo
Owner

I think I found the bug:

        if (ln) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            listRelease(c->reply);
            c->reply = listDup(slave->reply);
            c->reply_bytes = slave->reply_bytes;
            c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE,"Waiting for end of BGSAVE for SYNC");

Here, when a second slave attaches and there is already a "good" BGSAVE in progress to feed the first slave, we duplicate the output buffer, but we don't duplicate the static buffer that was introduced at some point in Redis.
This means that the second (and any successive) slave that attaches while a BGSAVE for the first slave is already in progress can easily get a truncated output buffer: it gets everything except the part held in the static buffer.

The fix is trivial but before fixing it I'm writing a test that is able to reproduce this issue.

Jeremy and Jokea: since we delayed so much in creating a Redis t-shirt, I'm switching to the "Redis Moka award". You just won two Italian Bialetti moka coffee machines, together with four other Redis users who also helped significantly in making Redis better; more details on my blog in the next few hours.

Thank you!

Salvatore

anydot
Salvatore Sanfilippo
Owner

@anydot yes, this is interesting... it is not clear why the client parsing code could end up in this state, but it will probably be possible to understand it since we have a dump of what was in the querybuf... in theory it is just a matter of feeding it the same data.

There are already known patterns causing this kind of issue, as described in this thread, like the never-ending stream of "A", but apparently this is happening also when there are newlines in the stream. Something to investigate as well. +1.

p.s. of course you are one of the six winners of the Moka award ;)

anydot
jokea

@anydot There is a "\r\n" in the querybuf before the next valid SET command, as you can see in my previous post.

@anydot @antirez The first 1MB dump of querybuf has been sent to you.

jokea

I think it's a problem with strstr(3), which caused the slave to stop processing the querybuf:

int processInlineBuffer(redisClient *c) {
    char *newline = strstr(c->querybuf,"\r\n");

    /* Nothing to do without a \r\n */
    if (newline == NULL)
        return REDIS_ERR;
    ...

The querybuf contains a '\0' before the "\r\n" (you can verify this with the dump file I've sent you; strlen(qbuf) returns 2).
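
A toy demonstration of the effect described above (illustrative bytes, not the real query buffer contents): an embedded NUL makes strstr() stop early, so the "\r\n" further along is never found and the parser keeps waiting for more data.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* Binary payload with a NUL byte before the "\r\n". */
    char querybuf[] = { (char)0xac, (char)0xed, '\0', 'S', 'E', 'T', '\r', '\n', '\0' };
    printf("strlen(querybuf) = %zu\n", strlen(querybuf));                  /* 2  */
    printf("\\r\\n found: %s\n", strstr(querybuf, "\r\n") ? "yes" : "no"); /* no */
    return 0;
}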

anydot

Yep, you are right. I didn't have the code around me yesterday. Anyway, it is still the same error -- Redis waits for the old protocol on the slave link due to a protocol desync. So we have a third error there: in processInlineBuffer, Redis should check whether there is a \0 before the end of the sds buffer and return an error in that case, or use a mem- family function to handle the old protocol. But because the old protocol is deprecated anyway (and probably only used for admin/maintenance via direct telnet), disallowing \0 is by far the easier solution.

Salvatore Sanfilippo
Owner

Indeed, but this again only fixes part of the problem, since the infinite stream of "A" will still have the same effect. Probably a simpler solution is to set a hard size limit on the inline protocol and/or on the size of the (*|$) stage of the normal protocol.
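
A sketch of the hard-limit idea for the inline / bulk-count stage; the constant and the exact error handling are illustrative, not the committed protection:

/* Inside processInlineBuffer()-style code: if no newline has shown up yet
 * and the buffer is already suspiciously large, treat it as a protocol
 * error instead of waiting (and growing) forever. */
#define INLINE_MAX_SIZE (64*1024)   /* illustrative bound */

char *newline = strstr(c->querybuf, "\r\n");
if (newline == NULL) {
    if (sdslen(c->querybuf) > INLINE_MAX_SIZE) {
        addReplyError(c, "Protocol error: too big inline request");
        c->flags |= REDIS_CLOSE_AFTER_REPLY;
    }
    return REDIS_ERR;
}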

Btw I agree that we should fix those problems as well, so I'll fix all this (with tests) before closing the issue.

I think that after the release of 2.6 RC1, and later 2.6, Redis should enter a "stop the world" stage where all the source code is covered with tests before moving forward.

Salvatore Sanfilippo
Owner

I confirm I can reproduce the broken replication easily using the following procedure:

  • Start a Redis master at port 6379
  • Target it with load using the command ./redis-benchmark -t set -n 100000000 -r 100000
  • Use the following script (see end of the message) to start N slaves while there is load in the master.
  • Enjoy the difference in DBSIZE figures.
#!/bin/bash
SBIN=/Users/antirez/hack/redis/src/redis-server
CBIN=/Users/antirez/hack/redis/src/redis-cli

$CBIN flushall
sleep 2

for PORT in 6380 6381 6382 6383 6384
do
    mkdir /tmp/$PORT 2> /dev/null
    rm -f /tmp/$PORT/dump.rdb
    $SBIN --dir /tmp/$PORT --port $PORT --slaveof 127.0.0.1 6379 &
done

read foo

for PORT in 6380 6381 6382 6383 6384
do
    $CBIN -p $PORT dbsize
done

read foo

for PORT in 6380 6381 6382 6383 6384
do
    $CBIN -p $PORT shutdown
done

Now I'll turn this into a real (failing) test, and then I'll implement the fix (that's like two lines of code).
When implementing the fix I'll check whether there is some other state to set in the client structure when a different output buffer / list is set.

Salvatore Sanfilippo
Owner

Since the fix will take some days to reach a stable release, here is the diff to apply to get this fixed manually:

diff --git a/src/replication.c b/src/replication.c
index e08517e..df9320c 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -124,6 +124,8 @@ void syncCommand(redisClient *c) {
              * another slave. Set the right state, and copy the buffer. */
             listRelease(c->reply);
             c->reply = listDup(slave->reply);
+            memcpy(c->buf,slave->buf,slave->bufpos);
+            c->bufpos = slave->bufpos;
             c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
             redisLog(REDIS_NOTICE,"Waiting for end of BGSAVE for SYNC");
         } else {

I verified that this appears to work, synchronizing all the slaves correctly.

Salvatore Sanfilippo (antirez) referenced this issue from commit 1824e3a:
"Fixed replication when multiple slaves are attaching at the same time. The output buffer was not copied correctly between slaves. This fixes issue #141."

Salvatore Sanfilippo (antirez) referenced this issue from commit 11e0c4c:
"Protections against protocol desyncs, leading to infinite query buffer growing, due to nul-terms in specific bytes of the request or indefinitely long multi bulk or bulk count strings without newlines. This bug is related to Issue #141 as well."

Salvatore Sanfilippo (antirez) referenced this issue from commit 414c3de:
"Regression test for the main problem causing issue #141. Minor changes/fixes/additions to the test suite itself needed to write the test."

Salvatore Sanfilippo (antirez) referenced this issue from commit a5045d5:
"Fixed replication when multiple slaves are attaching at the same time. The output buffer was not copied correctly between slaves. This fixes issue #141."

Salvatore Sanfilippo (antirez) referenced this issue from commit e7b85b3:
"Protections against protocol desyncs, leading to infinite query buffer growing, due to nul-terms in specific bytes of the request or indefinitely long multi bulk or bulk count strings without newlines. This bug is related to Issue #141 as well."

Salvatore Sanfilippo (antirez) referenced this issue from commit 2c2b615:
"Regression test for the main problem causing issue #141. Minor changes/fixes/additions to the test suite itself needed to write the test."
Salvatore Sanfilippo
Owner

Issue 141 is now completely fixed (and even more important, completely understood!) in both 2.4 and unstable branches.

I'm writing a blog post about this issue, but here is a short description of what happened and why.

Issue 141 was experienced by users in two ways:

  • Slave instances crashing because of a query-buffer length overflow while receiving data from the client structure representing the master instance to which the slave is attached.
  • Slaves ending up with a few different keys, or missing keys, after the synchronization with the master.

Now we know that these issues only happen when there are multiple slaves attached to a master, which in case of network issues can even be the same slave attached multiple times from the point of view of the master.

Issue 141 was caused by two completely unrelated bugs (apparently unrelated, at least, since combined they lead to the crash experienced in this issue -- fortunately, since the problem under the hood was even worse):

First bug

A protocol desync was possible in three different cases. First, if the query started with "\x00" there was an "inline protocol" desync. Note that the inline protocol in Redis >= 2.4 is not the old inline protocol; it is just a simplified protocol used so that Redis can accept hand-written queries over telnet.

The second protocol desync was caused by queries starting with "*\x00" and the third by queries starting with "$\x00", in the context of the multi bulk count and bulk count respectively. All three bugs are related to the fact that Redis was expecting to find, via the strstr() or strchr() functions, a newline that never arrived, so a client writing these bad sequences was able to overflow the query buffer in the client structure. Now there is an explicit limit in all three conditions.

Second bug

In the replication process, when the first slave arrives Redis starts generating an RDB file. If a second slave arrives (or a third, and so forth) while there is already an RDB save in progress for an already-synchronizing slave, Redis avoids generating another RDB file and simply duplicates the output buffer of the first slave (containing the difference between the RDB we are saving and the new writes we received) into the output buffer of the new slave.

This copy between output buffers was broken: when we introduced the static buffer in the client structure to improve performance, this place was not updated (for reasons that I'll explain in the blog post), so the copy was performed only on the list of objects. As a result, when multiple slaves attached at the same time, only the first was handled correctly. All the successive slaves getting a copy of the output buffer were at risk of receiving a corrupted output buffer, with part of the "chat" missing.

Conclusion

When there were no \x00 bytes in the data transmitted from the client to the server, only the second bug was in effect, causing bad synchronizations with missing keys or corrupted data in successive slaves. When there were also \x00 bytes, a protocol desync was very likely to happen (the inline one, that is, a query starting with "\x00"), causing the query buffer overflow and also completely blocking the replication process for days, like it was happening to Jeremy.

Ok more details in the blog post as I think we can learn a few things from this bug.

Thanks again, and finally CLOSING THIS BUG! :)

Salvatore Sanfilippo antirez closed this