Unexpected resync with master #828

jokea · 2012-12-14T03:15:14Z

Redis slave will resync with its master if it has been assigned to an unreachable host as its master before.

How to reproduce:
(1). find a host that will take some long time before tcp connection fails:

# date; telnet 10.68.6.165 6379; date
Fri Dec 14 10:52:45 CST 2012
Trying 10.68.6.165...
telnet: connect to address 10.68.6.165: Connection timed out
telnet: Unable to connect to remote host: Connection timed out
Fri Dec 14 10:55:54 CST 2012

it takes telnet more than 3 minutes to detect connection timeout.

(2). start 2 servers with port 6379 and 6380:

# ./redis-server --port 6379
# ./redis-server --port 6380

(3). assign the unreachable host to be the master of redis running on port 6379:

redis 127.0.0.1:6379> slaveof 10.68.6.165 6379

(4). wait 60 seconds before redis detect timeout, (this step is not necessary, just to confirm that it takes some time before a connect error happens),

(5). assign the server running on port 6380 as the new master:

redis 127.0.0.1:6379> slaveof 127.0.0.1 6380

(6). after 2 minutes, timeout detected for the connection in step 3, and redis resync with its new master.

log below:

Slave:

[5379] 14 Dec 10:38:39.470 # Server started, Redis version 2.6.7
[5379] 14 Dec 10:38:39.470 * DB loaded from disk: 0.000 seconds
[5379] 14 Dec 10:38:39.470 * The server is now ready to accept connections on port 6379
[5379] 14 Dec 10:39:13.553 * SLAVE OF 10.68.6.165:6379 enabled (user request)
[5379] 14 Dec 10:39:14.520 * Connecting to MASTER...
[5379] 14 Dec 10:39:14.520 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:15.538 # Timeout connecting to the MASTER...
[5379] 14 Dec 10:40:15.538 * Connecting to MASTER...
[5379] 14 Dec 10:40:15.538 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:24.811 * SLAVE OF 127.0.0.1:6380 enabled (user request)
[5379] 14 Dec 10:40:25.546 * Connecting to MASTER...
[5379] 14 Dec 10:40:25.546 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:25.546 * Non blocking connect for SYNC fired the event.
[5379] 14 Dec 10:40:25.546 * Master replied to PING, replication can continue...
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: Loading DB in memory
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: Finished with success
[5379] 14 Dec 10:43:24.526 # Error condition on socket for SYNC: Connection timed out
[5379] 14 Dec 10:43:24.632 * Connecting to MASTER...
[5379] 14 Dec 10:43:24.632 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:43:24.632 * Non blocking connect for SYNC fired the event.
[5379] 14 Dec 10:43:24.632 * Master replied to PING, replication can continue...
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: Loading DB in memory
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: Finished with success

Master:

[5383] 14 Dec 10:38:53.870 # Server started, Redis version 2.6.7
[5383] 14 Dec 10:38:53.871 * DB loaded from disk: 0.000 seconds
[5383] 14 Dec 10:38:53.871 * The server is now ready to accept connections on port 6380
[5383] 14 Dec 10:40:25.547 * Slave ask for synchronization
[5383] 14 Dec 10:40:25.547 * Starting BGSAVE for SYNC
[5383] 14 Dec 10:40:25.547 * Background saving started by pid 6153
[6153] 14 Dec 10:40:25.548 * DB saved on disk
[6153] 14 Dec 10:40:25.548 * RDB: 0 MB of memory used by copy-on-write
[5383] 14 Dec 10:40:25.549 * Background saving terminated with success
[5383] 14 Dec 10:40:25.549 * Synchronization with slave succeeded
[5383] 14 Dec 10:43:24.632 * Slave ask for synchronization
[5383] 14 Dec 10:43:24.632 * Starting BGSAVE for SYNC
[5383] 14 Dec 10:43:24.633 * Background saving started by pid 7175
[7175] 14 Dec 10:43:24.633 * DB saved on disk
[7175] 14 Dec 10:43:24.634 * RDB: 0 MB of memory used by copy-on-write
[5383] 14 Dec 10:43:24.635 * Background saving terminated with success
[5383] 14 Dec 10:43:24.635 * Synchronization with slave succeeded

antirez · 2012-12-14T12:30:58Z

Cool work @jokea, thank you... I'll work on this today.

jokea · 2012-12-18T02:35:48Z

More test on this issue:

redis 127.0.0.1:6379> slaveof 10.68.6.165 6379       // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6380      // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6381      // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6382      // unreachable
OK
redis 127.0.0.1:6379> slaveof 127.0.0.1 6380       // the right one

Wait for all connection timeout and here's the info on master:

redis 127.0.0.1:6380> info Replication
# Replication
role:master
connected_slaves:5
slave0:127.0.0.1,6379,online
slave1:127.0.0.1,6379,online
slave2:127.0.0.1,6379,online
slave3:127.0.0.1,6379,online
slave4:127.0.0.1,6379,online
redis 127.0.0.1:6380> client list
addr=127.0.0.1:60663 fd=5 age=345 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:60678 fd=6 age=331 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=client
addr=127.0.0.1:60815 fd=7 age=209 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46090 fd=8 age=181 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46098 fd=9 age=174 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46104 fd=10 age=169 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync

info on slave:

redis 127.0.0.1:6379> info replication
# Replication
role:slave
master_host:127.0.0.1
master_port:6380
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_priority:100
slave_read_only:1
connected_slaves:0
redis 127.0.0.1:6379> client list
addr=127.0.0.1:6380 fd=10 age=384 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=5 age=248 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=6 age=220 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=7 age=213 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=8 age=208 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:53536 fd=9 age=16 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=client

and an INCR against master gets executed 5 times in slave:

redis 127.0.0.1:6380> incr foo
(integer) 1
------------------------------------------
redis 127.0.0.1:6379> get foo
"5"

jokea · 2012-12-18T02:42:15Z

I think it's the same problem with issue #680

srned · 2012-12-18T04:56:04Z

hey.. I think the bug is in replication.c, slaveofCommand(). The fix is to undo connect with master when slave of command is issued. Tested against unstable branch.

731         if (server.repl_state == REDIS_REPL_TRANSFER)
732             replicationAbortSyncTransfer();
733         else if (server.repl_state == REDIS_REPL_CONNECTING ||
734                  server.repl_state == REDIS_REPL_RECEIVE_PONG)
735             undoConnectWithMaster();

Issue #828 shows how Redis was not correctly undoing a non-blocking connection attempt with the previous master when the master was set to a new address using the SLAVEOF command. This was also a result of lack of refactoring, so now there is a function to cancel the non blocking handshake with the master. The new function is now used when SLAVEOF NO ONE is called or when SLAVEOF is used to set the master to a different address.

antirez · 2013-01-18T11:03:11Z

Closing since the fix was merged in all the branches.

Issue redis#828 shows how Redis was not correctly undoing a non-blocking connection attempt with the previous master when the master was set to a new address using the SLAVEOF command. This was also a result of lack of refactoring, so now there is a function to cancel the non blocking handshake with the master. The new function is now used when SLAVEOF NO ONE is called or when SLAVEOF is used to set the master to a different address.

ghost assigned antirez Dec 14, 2012

charsyam mentioned this issue Dec 20, 2012

Unexpected resync with master #841

Closed

antirez closed this as completed Jan 18, 2013

robscary mentioned this issue Oct 17, 2014

REDIS Master shows same Slave twice #680

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected resync with master #828

Unexpected resync with master #828

jokea commented Dec 14, 2012

antirez commented Dec 14, 2012

jokea commented Dec 18, 2012

jokea commented Dec 18, 2012

srned commented Dec 18, 2012

antirez commented Jan 18, 2013

Unexpected resync with master #828

Unexpected resync with master #828

Comments

jokea commented Dec 14, 2012

antirez commented Dec 14, 2012

jokea commented Dec 18, 2012

jokea commented Dec 18, 2012

srned commented Dec 18, 2012

antirez commented Jan 18, 2013