
Redis 3.2.0 crashed by signal: 11 #3607

Closed
racielrod opened this issue Nov 14, 2016 · 10 comments


racielrod commented Nov 14, 2016

From the logs:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
981:M 13 Nov 22:04:13.019 # Redis 3.2.0 crashed by signal: 11
981:M 13 Nov 22:04:13.019 # Crashed running the instuction at: 0x7f1eb624ea44
981:M 13 Nov 22:04:13.020 # Accessing address: 0x3735010200
981:M 13 Nov 22:04:13.020 # Failed assertion: (:0)

------ STACK TRACE ------
EIP:
/lib/x86_64-linux-gnu/libc.so.6(+0x153a44)[0x7f1eb624ea44]

Backtrace:
/usr/local/bin/redis-server(logStackTrace+0x29)[0x45dfc9]
/usr/local/bin/redis-server(sigsegvHandler+0xaa)[0x45e4fa]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f1eb64d0330]
/lib/x86_64-linux-gnu/libc.so.6(+0x153a44)[0x7f1eb624ea44]
/usr/local/bin/redis-server(clusterLoadConfig+0xb9)[0x463229]
/usr/local/bin/redis-server(clusterInit+0xfd)[0x464d6d]
/usr/local/bin/redis-server(initServer+0x40c)[0x42adec]
/usr/local/bin/redis-server(main+0x48a)[0x41e89a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1eb611cf45]
/usr/local/bin/redis-server[0x41ea92]

------ INFO OUTPUT ------
945:M 13 Nov 22:24:16.653 * Increased maximum number of open files to 10032 (it was originally set to 1024).

# Server

redis_version:3.2.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:13911a99b348671e
redis_mode:cluster
os:Linux 4.2.0-41-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.4
process_id:971
run_id:b7391f7ede1a59304f9069c85200573e8d5bd721
tcp_port:6379
uptime_in_seconds:6694
uptime_in_days:0
hz:10
lru_clock:2706637
executable:/usr/local/bin/redis-server
config_file:/etc/redis/6379.conf

This happened after the 3 Hyper-V hosts were reset by mistake. There are 6 nodes in total, distributed across 3 physical servers.

The rest of the nodes are fine. I can't get this one to work after this issue.

Any ideas on how I can recover from this?
I can set up another slave and join it to the cluster, but ideally I would like to reuse the one I'm not able to start.


antirez commented Nov 14, 2016

Hello, could you please share with me the nodes.conf file that caused the crash?


antirez commented Nov 14, 2016

Also, if possible, please update to the latest 3.2.x; it is able to report more info during crashes. Thanks.

racielrod commented

Thanks for looking at this so quickly.
Please find attached the nodes.conf (renamed to nodes.png) for the node that is crashing.
Note that the crashing node is 192.168.10.38.

Additional info:

  • It looks like the 3 masters were started first and the slaves were started about 30 seconds later.
  • The slave 192.168.10.39 never took over as a master after 192.168.10.38 crashed and failed to start.
  • I had to run a "cluster failover force" on 192.168.10.39 to make the cluster operational.

Not sure whether the info above adds any value, but I figured it wouldn't hurt to mention these facts.
Should I update the failing node to 3.2.5? Would it be able to work with the rest of the nodes if they are running 3.2.0?

My main goal is to restore the cluster to be fully operational with the minimum impact possible during the day. I can update all the nodes tonight.

Thanks again!
(attachment: nodes.png)


antirez commented Nov 14, 2016

Hello, I've a fever so I'm not really able to analyze the situation, but I have an idea of what is going wrong here. For now I hope this helps you: to restart, try removing the final line in the nodes.conf file, which is just a long run of zero bytes if you open it with vim/emacs/whatever. After removing that strange final line, the cluster should start.


antirez commented Nov 14, 2016

Btw, the original bug causing this is related to the way the cluster configuration file is generated with the help of truncate. I'll investigate and fix it as soon as my fever is gone and I'm back at the PC.


antirez commented Nov 14, 2016

P.S. Thanks a lot for your help.

racielrod commented

I took the long route: wiped Redis from the crashing node and re-installed it. I removed the node from the cluster and added it back with redis-trib.rb add-node --slave.
I was planning to update the production cluster to 3.2.5, but will hold off until the fix for this particular issue is released.

Thank you!


antirez commented Nov 16, 2016

Thanks @racielrod. A very important point is the following: is what happened to your VMs similar to a sudden power outage? The zero-padding of the file could be the result of the file's metadata not being flushed. Btw, I pushed a patch that ignores zero-padding when loading the file, so this should be avoided next time; however, other corruptions are possible, so I'm also exploring the idea of changing the implementation to use rename instead of write + truncate.


racielrod commented Nov 16, 2016

@antirez it was a very similar scenario. Someone took all of those hosts down for maintenance at once during a maintenance window.
We now make sure we fail over manually and do maintenance one host at a time, to avoid this issue in the future.
I'm glad this mistake helped find an edge case here.


antirez commented Jan 26, 2017

Thanks, I opened an issue about switching to rename(). The code was also modified when this was reported so that it no longer crashes on trailing zeroes.

@antirez antirez closed this as completed Jan 26, 2017