Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected resync with master #828

Closed
jokea opened this issue Dec 14, 2012 · 5 comments
Closed

Unexpected resync with master #828

jokea opened this issue Dec 14, 2012 · 5 comments
Assignees

Comments

@jokea
Copy link
Contributor

jokea commented Dec 14, 2012

Redis slave will resync with its master if it has been assigned to an unreachable host as its master before.

How to reproduce:
(1). find a host that will take some long time before tcp connection fails:

# date; telnet 10.68.6.165 6379; date
Fri Dec 14 10:52:45 CST 2012
Trying 10.68.6.165...
telnet: connect to address 10.68.6.165: Connection timed out
telnet: Unable to connect to remote host: Connection timed out
Fri Dec 14 10:55:54 CST 2012

it takes telnet more than 3 minutes to detect connection timeout.

(2). start 2 servers with port 6379 and 6380:

# ./redis-server --port 6379
# ./redis-server --port 6380

(3). assign the unreachable host to be the master of redis running on port 6379:

redis 127.0.0.1:6379> slaveof 10.68.6.165 6379

(4). wait 60 seconds before redis detect timeout, (this step is not necessary, just to confirm that it takes some time before a connect error happens),

(5). assign the server running on port 6380 as the new master:

redis 127.0.0.1:6379> slaveof 127.0.0.1 6380

(6). after 2 minutes, timeout detected for the connection in step 3, and redis resync with its new master.

log below:

Slave:

[5379] 14 Dec 10:38:39.470 # Server started, Redis version 2.6.7
[5379] 14 Dec 10:38:39.470 * DB loaded from disk: 0.000 seconds
[5379] 14 Dec 10:38:39.470 * The server is now ready to accept connections on port 6379
[5379] 14 Dec 10:39:13.553 * SLAVE OF 10.68.6.165:6379 enabled (user request)
[5379] 14 Dec 10:39:14.520 * Connecting to MASTER...
[5379] 14 Dec 10:39:14.520 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:15.538 # Timeout connecting to the MASTER...
[5379] 14 Dec 10:40:15.538 * Connecting to MASTER...
[5379] 14 Dec 10:40:15.538 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:24.811 * SLAVE OF 127.0.0.1:6380 enabled (user request)
[5379] 14 Dec 10:40:25.546 * Connecting to MASTER...
[5379] 14 Dec 10:40:25.546 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:40:25.546 * Non blocking connect for SYNC fired the event.
[5379] 14 Dec 10:40:25.546 * Master replied to PING, replication can continue...
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: Loading DB in memory
[5379] 14 Dec 10:40:25.549 * MASTER <-> SLAVE sync: Finished with success
[5379] 14 Dec 10:43:24.526 # Error condition on socket for SYNC: Connection timed out
[5379] 14 Dec 10:43:24.632 * Connecting to MASTER...
[5379] 14 Dec 10:43:24.632 * MASTER <-> SLAVE sync started
[5379] 14 Dec 10:43:24.632 * Non blocking connect for SYNC fired the event.
[5379] 14 Dec 10:43:24.632 * Master replied to PING, replication can continue...
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: Loading DB in memory
[5379] 14 Dec 10:43:24.635 * MASTER <-> SLAVE sync: Finished with success

Master:

[5383] 14 Dec 10:38:53.870 # Server started, Redis version 2.6.7
[5383] 14 Dec 10:38:53.871 * DB loaded from disk: 0.000 seconds
[5383] 14 Dec 10:38:53.871 * The server is now ready to accept connections on port 6380
[5383] 14 Dec 10:40:25.547 * Slave ask for synchronization
[5383] 14 Dec 10:40:25.547 * Starting BGSAVE for SYNC
[5383] 14 Dec 10:40:25.547 * Background saving started by pid 6153
[6153] 14 Dec 10:40:25.548 * DB saved on disk
[6153] 14 Dec 10:40:25.548 * RDB: 0 MB of memory used by copy-on-write
[5383] 14 Dec 10:40:25.549 * Background saving terminated with success
[5383] 14 Dec 10:40:25.549 * Synchronization with slave succeeded
[5383] 14 Dec 10:43:24.632 * Slave ask for synchronization
[5383] 14 Dec 10:43:24.632 * Starting BGSAVE for SYNC
[5383] 14 Dec 10:43:24.633 * Background saving started by pid 7175
[7175] 14 Dec 10:43:24.633 * DB saved on disk
[7175] 14 Dec 10:43:24.634 * RDB: 0 MB of memory used by copy-on-write
[5383] 14 Dec 10:43:24.635 * Background saving terminated with success
[5383] 14 Dec 10:43:24.635 * Synchronization with slave succeeded
@ghost ghost assigned antirez Dec 14, 2012
@antirez
Copy link
Contributor

antirez commented Dec 14, 2012

Cool work @jokea, thank you... I'll work on this today.

@jokea
Copy link
Contributor Author

jokea commented Dec 18, 2012

More test on this issue:

redis 127.0.0.1:6379> slaveof 10.68.6.165 6379       // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6380      // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6381      // unreachable
OK
redis 127.0.0.1:6379> slaveof 10.68.6.165 6382      // unreachable
OK
redis 127.0.0.1:6379> slaveof 127.0.0.1 6380       // the right one

Wait for all connection timeout and here's the info on master:

redis 127.0.0.1:6380> info Replication
# Replication
role:master
connected_slaves:5
slave0:127.0.0.1,6379,online
slave1:127.0.0.1,6379,online
slave2:127.0.0.1,6379,online
slave3:127.0.0.1,6379,online
slave4:127.0.0.1,6379,online
redis 127.0.0.1:6380> client list
addr=127.0.0.1:60663 fd=5 age=345 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:60678 fd=6 age=331 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=client
addr=127.0.0.1:60815 fd=7 age=209 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46090 fd=8 age=181 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46098 fd=9 age=174 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync
addr=127.0.0.1:46104 fd=10 age=169 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sync

info on slave:

redis 127.0.0.1:6379> info replication
# Replication
role:slave
master_host:127.0.0.1
master_port:6380
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_priority:100
slave_read_only:1
connected_slaves:0
redis 127.0.0.1:6379> client list
addr=127.0.0.1:6380 fd=10 age=384 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=5 age=248 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=6 age=220 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=7 age=213 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:6380 fd=8 age=208 idle=9 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
addr=127.0.0.1:53536 fd=9 age=16 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=client

and an INCR against master gets executed 5 times in slave:

redis 127.0.0.1:6380> incr foo
(integer) 1
------------------------------------------
redis 127.0.0.1:6379> get foo
"5"

@jokea
Copy link
Contributor Author

jokea commented Dec 18, 2012

I think it's the same problem with issue #680

@srned
Copy link

srned commented Dec 18, 2012

hey.. I think the bug is in replication.c, slaveofCommand(). The fix is to undo connect with master when slave of command is issued. Tested against unstable branch.

731         if (server.repl_state == REDIS_REPL_TRANSFER)
732             replicationAbortSyncTransfer();
733         else if (server.repl_state == REDIS_REPL_CONNECTING ||
734                  server.repl_state == REDIS_REPL_RECEIVE_PONG)
735             undoConnectWithMaster();

antirez added a commit that referenced this issue Jan 14, 2013
Issue #828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.

This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
antirez added a commit that referenced this issue Jan 15, 2013
Issue #828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.

This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
antirez added a commit that referenced this issue Jan 15, 2013
Issue #828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.

This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
antirez added a commit that referenced this issue Jan 15, 2013
Issue #828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.

This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
@antirez
Copy link
Contributor

antirez commented Jan 18, 2013

Closing since the fix was merged in all the branches.

@antirez antirez closed this as completed Jan 18, 2013
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Aug 29, 2016
Issue redis#828 shows how Redis was not correctly undoing a non-blocking
connection attempt with the previous master when the master was set to a
new address using the SLAVEOF command.

This was also a result of lack of refactoring, so now there is a
function to cancel the non blocking handshake with the master.
The new function is now used when SLAVEOF NO ONE is called or when
SLAVEOF is used to set the master to a different address.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants