Corrupted RedisCluster with data loss #3161

Closed
Spikhalskiy opened this issue Apr 7, 2016 · 0 comments

We are running a Redis Cluster (3.0.7) on 3 physical nodes. We have 9 master processes with 2 slaves each (18 slaves), for a total of 27 processes. We ran into a situation where 2 of the 3 physical nodes crashed, resulting in an unexpected loss of 1/9 of the data in the cluster.
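
For reference, a minimal sketch of how we capture a state dump like the one below and summarize roles per physical host (assuming redis-cli and standard shell tools; 10.201.12.215:9007 is the node flagged "myself" in the dump):

# Raw cluster view as seen from one node (this is what produced the listing below)
redis-cli -h 10.201.12.215 -p 9007 cluster nodes

# Per-host role breakdown; nodes flagged as failed are counted by their last known role
redis-cli -h 10.201.12.215 -p 9007 cluster nodes \
  | awk '{split($2, a, ":"); print a[1], ($3 ~ /master/ ? "master" : "slave")}' \
  | sort | uniq -c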

Here's the state of our cluster during the crash, with only one physical node left alive:

93b592b931cce71951c249147dc0e517ac128c87 10.201.12.200:9008 slave,fail 211413386844d7179fc2c3e2c9d4a9642e56735c 1460045369231 1460045361711 69 connected
c074ad5f3d7f21b17bd13e595b0579bb2e51c433 10.201.12.200:9003 slave,fail 736f91905c41c63b78f18c8818c28ad5478c70ba 1460045369732 1460045363716 66 connected
43a8b99f49b8553dd7ff7878b190bd5d78cd305a 10.201.12.200:9000 slave,fail 436eeb52747c640acc9aaaabfec74e386dbbffd5 1460045365220 1460045357709 70 connected
436eeb52747c640acc9aaaabfec74e386dbbffd5 10.201.12.215:9004 master - 0 1460052115201 70 connected 9102-10922
1d919bef205427a941e83c2003b1d9ccc062bf5c 10.201.12.215:9005 slave 574497673a49bfc628583a639cfbbafbfc9fcf8d 0 1460052112197 73 connected
4620a84394beb08e3a1b1d6d79a5c6161ada0671 10.201.12.214:9003 slave,fail 436eeb52747c640acc9aaaabfec74e386dbbffd5 1460050359343 1460050351783 70 connected
8d6e8cb03b52980cf84078f5de6394a687ed466b 10.201.12.215:9001 master - 0 1460052112198 76 connected 5461-7281
574497673a49bfc628583a639cfbbafbfc9fcf8d 10.201.12.215:9003 master - 0 1460052112099 73 connected 3641-5460
c08a837c604138ba66f7378d54da05315f596c94 10.201.12.200:9006 slave,fail 019f5615309b20ab880f36c7cd75501f4ea218d6 1460045366723 1460045359205 75 connected
72bb8521dd580bfc2d33f03d6b7fbc432633f1a5 10.201.12.200:9007 slave,fail 8d6e8cb03b52980cf84078f5de6394a687ed466b 1460045367724 1460045361210 76 connected
7c77a0656b2fdf94e6421945093a14a8d10595a5 10.201.12.214:9000 slave,fail 736f91905c41c63b78f18c8818c28ad5478c70ba 1460050355994 1460050348411 66 connected
736f91905c41c63b78f18c8818c28ad5478c70ba 10.201.12.215:9006 master - 0 1460052116201 66 connected 1820-3640
3705315bd4b599f8e2b3e74268827612855d7408 10.201.12.200:9001 slave,fail 6b9083214aa562c813bdc6ded8205915c2e87263 1460045369231 1460045361711 71 connected
8bec043121d340c7d78c43137eaee39a7554df19 10.201.12.214:9008 slave,fail 211413386844d7179fc2c3e2c9d4a9642e56735c 1460050353245 1460050347222 69 connected
4e94841bd9945e2b3c09cf8f8ab2c91127dc14f0 10.201.12.200:9004 slave,fail 574497673a49bfc628583a639cfbbafbfc9fcf8d 1460045365721 1460045358205 73 connected
98a9f0f68d81224cc815f1c56c3e027b71eac950 10.201.12.214:9005 slave,fail 411fad74d3df7432ace1d519042a578244099a36 1460050354387 1460050348608 72 connected
411fad74d3df7432ace1d519042a578244099a36 10.201.12.214:9002 master,fail - 1460050356643 1460050350306 72 connected 14564-16383
7ba757f11e8e9452de1f4dbc09b91d7de4173bc0 10.201.12.214:9006 slave,fail 019f5615309b20ab880f36c7cd75501f4ea218d6 1460050359720 1460050352453 75 connected
019f5615309b20ab880f36c7cd75501f4ea218d6 10.201.12.215:9000 master - 0 1460052109681 75 connected 0-1819
353d95c2b851be78f96297c727491de77b96300d 10.201.12.214:9004 slave,fail 93421aea13f1c20804a65db50b5a5c166538664d 1460050356843 1460050349288 74 connected
93421aea13f1c20804a65db50b5a5c166538664d 10.201.12.215:9008 master - 0 1460052112698 74 connected 12743-14563
211413386844d7179fc2c3e2c9d4a9642e56735c 10.201.12.215:9002 master - 0 1460052115101 69 connected 10923-12742
95f3e3ee1ceed706e5b04ac46f3a3fe6813008c8 10.201.12.200:9002 slave,fail 411fad74d3df7432ace1d519042a578244099a36 1460045369732 1460045362213 72 connected
4b4a3b969e20de630a34334473f0146f11626ff9 10.201.12.200:9005 slave,fail 93421aea13f1c20804a65db50b5a5c166538664d 1460045370736 1460045364317 74 connected
4fb0e435aee6d2d21953467c1e45d01a2869a651 10.201.12.214:9007 slave,fail 8d6e8cb03b52980cf84078f5de6394a687ed466b 1460050354186 1460050346682 76 connected
6b9083214aa562c813bdc6ded8205915c2e87263 10.201.12.214:9001 master,fail - 1460050360232 1460050352646 71 connected
9e650277f280f07f03bb4d797dcfb795b37e952d 10.201.12.215:9007 myself,master - 0 0 77 connected 7282-9101

We expected that the last physical node left alive would have 9 master processes containing all of the data in the cluster. We did indeed see 9 processes, but instead of 9 masters, this last node left alive had a slave process for one of the master processes (so: 8 masters, 1 slave).

1d919bef205427a941e83c2003b1d9ccc062bf5c 10.201.12.215:9005 slave 574497673a49bfc628583a639cfbbafbfc9fcf8d 0 1460052112197 73 connected
574497673a49bfc628583a639cfbbafbfc9fcf8d 10.201.12.215:9003 master - 0 1460052112099 73 connected 3641-5460
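
This can be cross-checked from the surviving node with redis-cli and grep (a sketch; addresses are taken from the dump above):

# Processes on the surviving host still in a master role (returns 8 here)
redis-cli -h 10.201.12.215 -p 9007 cluster nodes | grep 10.201.12.215 | grep -c master

# Masters flagged as failed; in the dump above this returns 10.201.12.214:9001 and
# 10.201.12.214:9002, and the latter is still listed as the owner of slots 14564-16383
# (1820 of 16384 slots, i.e. roughly 1/9 of the slot space)
redis-cli -h 10.201.12.215 -p 9007 cluster nodes | grep 'master,fail'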

So it looks like we lost 1/9th of the data in our cluster.

What could have gone wrong? Is it possible we did not configure something correctly, and that we can avoid such a situation in the future with a config change? Is it possible that there is a bug in the cluster algorithm?
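
For reference, these are the cluster-related settings we would double-check on our side; the commands below are just an illustration, not our actual values:

# Time before a node is considered failing
redis-cli -h 10.201.12.215 -p 9007 config get cluster-node-timeout
# How stale a slave's data may be before it stops trying to fail over
redis-cli -h 10.201.12.215 -p 9007 config get cluster-slave-validity-factor
# Whether the cluster stops serving requests when some slots are uncovered
redis-cli -h 10.201.12.215 -p 9007 config get cluster-require-full-coverage
# Minimum number of slaves a master keeps before donating one for replica migration
redis-cli -h 10.201.12.215 -p 9007 config get cluster-migration-barrier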

There is one more possibility. Right now all slots are distributed among 8 masters. Is it possible that Redis regrouped the 9 masters into 8, moved the data to those 8 masters, and simply turned the remaining process into a slave of one of them? It looks strange, but it is a possibility under which Redis could actually be fine in the flow described above.
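
A sketch of how that hypothesis could be checked, assuming nothing has changed since the dump above; it just compares slot coverage and per-process key counts:

# Does the cluster consider all 16384 slots assigned and serviceable?
redis-cli -h 10.201.12.215 -p 9007 cluster info | grep -E 'cluster_(state|slots_assigned|slots_ok|slots_fail|size)'

# Key counts per surviving process; if the data had really been regrouped onto
# 8 masters, the totals should roughly match what we had before the crash
for port in 9000 9001 9002 9003 9004 9005 9006 9007 9008; do
  echo -n "10.201.12.215:$port "
  redis-cli -h 10.201.12.215 -p $port dbsize
done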

Spikhalskiy changed the title from Corrupting cus to Corrupted cluster with data loss on Apr 7, 2016
Spikhalskiy changed the title from Corrupted cluster with data loss to Corrupted RedisCluster with data loss on Apr 7, 2016