Random and small clusterdown #2180

atlantis3001 · 2014-12-02T09:24:03Z

Hello,
I use Redis Cluster and I am seeing some errors:

# /usr/local/bin/redis-server -v
Redis server v=2.9.101 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=36dadd96225ec7c3

4107:M 02 Dec 07:32:57.062 * 10 changes in 300 seconds. Saving...
4107:M 02 Dec 07:32:57.076 * Background saving started by pid 138527
138527:C 02 Dec 07:32:59.105 * DB saved on disk
138527:C 02 Dec 07:32:59.113 * RDB: 0 MB of memory used by copy-on-write
4107:M 02 Dec 07:32:59.181 * Background saving terminated with success
4107:M 02 Dec 07:36:39.716 # Cluster state changed: fail
4107:M 02 Dec 07:36:40.216 # Cluster state changed: ok
4107:M 02 Dec 07:38:00.033 * 10 changes in 300 seconds. Saving...
4107:M 02 Dec 07:38:00.047 * Background saving started by pid 138583
138583:C 02 Dec 07:38:01.941 * DB saved on disk
138583:C 02 Dec 07:38:01.950 * RDB: 0 MB of memory used by copy-on-write
4107:M 02 Dec 07:38:02.051 * Background saving terminated with success
4107:M 02 Dec 07:43:03.015 * 10 changes in 300 seconds. Saving...
4107:M 02 Dec 07:43:03.028 * Background saving started by pid 138640
138640:C 02 Dec 07:43:04.919 * DB saved on disk
138640:C 02 Dec 07:43:04.928 * RDB: 0 MB of memory used by copy-on-write
4107:M 02 Dec 07:43:05.033 * Background saving terminated with success
4107:M 02 Dec 07:46:14.373 # Cluster state changed: fail
4107:M 02 Dec 07:46:14.873 # Cluster state changed: ok
4107:M 02 Dec 07:47:06.487 # Cluster state changed: fail
4107:M 02 Dec 07:47:07.087 # Cluster state changed: ok
4107:M 02 Dec 07:48:06.100 * 10 changes in 300 seconds. Saving...
4107:M 02 Dec 07:48:06.115 * Background saving started by pid 138698
138698:C 02 Dec 07:48:08.165 * DB saved on disk
138698:C 02 Dec 07:48:08.173 * RDB: 0 MB of memory used by copy-on-write
4107:M 02 Dec 07:48:08.219 * Background saving terminated with success

This is my topology

# /usr/local/bin/redis-cli -c -p 7001 cluster nodes
2fd06a10e539be049759e65dbe8cfa13b62bb11a 10.1.0.141:7103 slave c357db3020d8b8f753a043f6346578629f4975e3 0 1417517259801 31 connected
c13dbfb0f0e5541cf40ae23419c56d0113465c91 10.1.0.141:7101 slave 8fbf69a69c9b3644c7ef08252eb96fbeb3d41e0f 0 1417517259801 29 connected
0fcd1a7d909496e58114ba065d2fabf5fa814a37 10.1.0.142:7102 slave 96a4f973c9c173f1f9305ea33e14b3deed80c0d3 0 1417517259802 37 connected
9f8aeaad2f99e942208e65fb6159f568910d18b0 10.1.0.143:7102 slave caa475624dcc60b385021ae642c5a84914ad9b75 0 1417517259802 42 connected
40f158b0d545521356e06fde379b90a1d71bc260 10.1.0.143:7002 master - 0 1417517259802 30 connected 10923-12742
968043a75545109e5cd4afe0c3fc0712bddecf34 10.1.0.141:7003 master - 0 1417517259801 9 connected 14563-16383
e21946bee1839fefdec6111c511a131c9e1760b8 10.1.0.143:7103 slave c993961f8acbdcbd1adf478299143c84533ea331 0 1417517259802 43 connected
a32a645aa6fc253ba8be2c78cddc70abc4cc7a05 10.1.0.142:7001 master - 0 1417517259802 41 connected 0-1819
c993961f8acbdcbd1adf478299143c84533ea331 10.1.0.142:7003 master - 0 1417517259802 43 connected 5461-7280
caa475624dcc60b385021ae642c5a84914ad9b75 10.1.0.142:7002 master - 0 1417517259802 42 connected 1820-3639
5ce7d47fd8114f90d88e7644211896e88530cdc4 10.1.0.141:7102 slave 40f158b0d545521356e06fde379b90a1d71bc260 0 1417517259802 30 connected
96a4f973c9c173f1f9305ea33e14b3deed80c0d3 10.1.0.141:7002 master - 0 1417517259802 37 connected 9101-10922
8fbf69a69c9b3644c7ef08252eb96fbeb3d41e0f 10.1.0.143:7001 master - 0 1417517259802 29 connected 7281-9100
c357db3020d8b8f753a043f6346578629f4975e3 10.1.0.143:7003 master - 0 1417517259802 31 connected 12743-14562
c3b9a2c34f48a2d56bb0e29ce415a2b085d1a310 10.1.0.142:7101 slave 8e5bcd18054d1211c7b71221506892daccd22055 0 1417517259802 25 connected
89dabed32190ab456cb669e7cede900a1bcc74ac 10.1.0.143:7101 slave a32a645aa6fc253ba8be2c78cddc70abc4cc7a05 0 1417517259802 41 connected
8e5bcd18054d1211c7b71221506892daccd22055 10.1.0.141:7001 myself,master - 0 0 25 connected 3640-5460
ba98065c4a7854ae078ca194fd44ee494a363c35 10.1.0.142:7103 slave 968043a75545109e5cd4afe0c3fc0712bddecf34 0 1417517259802 18 connected

Do you have any idea where this error comes from?

Regards


mattsta · 2014-12-03T22:04:50Z

Given your log with only this:

4107:M 02 Dec 07:36:39.716 # Cluster state changed: fail
4107:M 02 Dec 07:36:40.216 # Cluster state changed: ok

There's no way to tell what actually happened. Were connections between servers going down? Each of your transitions between fail and ok lasts less than 1 second.

For more details, you'd need to increase the log level.
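To check the "less than 1 second" observation against the full log, the fail/ok pairs can be timed with a short script. This is a minimal sketch: the function name `fail_windows` is hypothetical, the log lines are copied from the report above, and the year is an assumption taken from the issue date because the log itself does not print one.

```python
from datetime import datetime

# "Cluster state changed" lines copied from the log in the original report.
LOG = """\
4107:M 02 Dec 07:36:39.716 # Cluster state changed: fail
4107:M 02 Dec 07:36:40.216 # Cluster state changed: ok
4107:M 02 Dec 07:46:14.373 # Cluster state changed: fail
4107:M 02 Dec 07:46:14.873 # Cluster state changed: ok
4107:M 02 Dec 07:47:06.487 # Cluster state changed: fail
4107:M 02 Dec 07:47:07.087 # Cluster state changed: ok
"""

MARKER = "# Cluster state changed: "
ASSUMED_YEAR = "2014"  # assumption: the log has no year, so use the issue date


def fail_windows(log_text):
    """Return a list of (fail_time, seconds_until_ok) for each fail/ok pair."""
    windows = []
    fail_at = None
    for line in log_text.splitlines():
        if MARKER not in line:
            continue  # skip RDB save messages and anything else
        prefix, state = line.split(MARKER, 1)
        # prefix looks like "4107:M 02 Dec 07:36:39.716 "; drop the pid:role field.
        stamp = prefix.split(" ", 1)[1].strip()
        when = datetime.strptime(ASSUMED_YEAR + " " + stamp, "%Y %d %b %H:%M:%S.%f")
        state = state.strip()
        if state == "fail":
            fail_at = when
        elif state == "ok" and fail_at is not None:
            windows.append((fail_at, (when - fail_at).total_seconds()))
            fail_at = None
    return windows


for start, seconds in fail_windows(LOG):
    print(start.strftime("%H:%M:%S"), f"{seconds:.3f}s")
```

Windows this short and this regular hint at a timing threshold rather than a real outage, but the log alone cannot confirm that.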

antirez · 2014-12-19T09:22:08Z

Hello, this looks like a node-timeout configuration which is too short for the latency of the instances/network. Could you please provide the output of CONFIG GET cluster*? Thanks.

atlantis3001 · 2014-12-19T09:55:32Z

Hello antirez,
I'm really sorry, I forgot to close this topic.
I found the problem and you are right: I first created and tested the cluster on virtual servers with a very small timeout, and when I moved it to real servers I forgot to increase the timeout.
Thank you for your help and your work.
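For readers who hit the same problem, the fix described here (raising `cluster-node-timeout`) could be scripted roughly as follows. This is only a sketch under stated assumptions: `raise_node_timeout` and `FakeNode` are hypothetical names, the millisecond values are illustrative and not taken from this thread, and `client` is assumed to expose `config_get(pattern)` returning a dict and `config_set(name, value)`, which is the shape of a redis-py `redis.Redis` connection. The fake node only makes the example runnable without a server.

```python
from fnmatch import fnmatch

SETTING = "cluster-node-timeout"


def raise_node_timeout(client, minimum_ms):
    """Raise cluster-node-timeout on one node if it is below minimum_ms.

    Returns (old_ms, new_ms). The value is left untouched when it is
    already at or above minimum_ms.
    """
    current = int(client.config_get(SETTING)[SETTING])
    if current >= minimum_ms:
        return current, current
    client.config_set(SETTING, minimum_ms)
    return current, minimum_ms


class FakeNode:
    """In-memory stand-in for one Redis node (hypothetical, for illustration)."""

    def __init__(self, timeout_ms):
        self.settings = {SETTING: str(timeout_ms)}

    def config_get(self, pattern):
        # Mimic glob-style matching such as "cluster*".
        return {k: v for k, v in self.settings.items() if fnmatch(k, pattern)}

    def config_set(self, name, value):
        self.settings[name] = str(value)


# Illustrative values only: a node left at a test-sized timeout.
node = FakeNode(timeout_ms=500)
print(raise_node_timeout(node, minimum_ms=15000))
```

In a real cluster the same change would have to be applied to every node, and a value set with CONFIG SET also needs to be written to the configuration file to survive a restart.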

antirez · 2014-12-19T10:10:37Z

Thanks for replying! Have a nice day.

antirez added the WAITING-OP-REPLY label Dec 19, 2014

atlantis3001 closed this as completed Dec 19, 2014
