
Replication inconsistent issue #2694

Closed
GeorgeBJ opened this issue Jul 24, 2015 · 4 comments

Comments

@GeorgeBJ

Redis versions 2.8 and 3.0.3

  1. Initially there is a standalone instance A and a chain B (master) -> C (slave) -> D (slave).
  2. In A execute "set a 1"; in B execute "set a 2". Now key a has value 2 in B, C and D.
  3. In D execute "multi", "client kill :", "debug sleep 60", "exec" to make D attempt a psync after step 4.
  4. Make B a slave of A with the slaveof command.
  5. Wait for D to reconnect with C.

Expected result: the value of a in D is 1
Actual result: the value of a in D is 2

@GeorgeBJ
Author

I think in step 2 C should reset its backlog, so that D can only full sync with C.
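The failure mode GeorgeBJ describes can be sketched outside of Redis. Below is a minimal Python model (all names invented; this is not Redis source code) of an intermediate slave like C that keeps its run id and replication backlog across its own full sync, so a chained slave's partial resync is wrongly accepted against stale data:

```python
# Toy model of the bug: C full-syncs with a new master dataset but does NOT
# reset its backlog or run id, so D's PSYNC <runid> <offset> still "continues".

class Sub:  # stands in for C, the intermediate slave
    def __init__(self):
        self.runid = "c-runid"
        self.backlog = b""   # replication stream seen by chained slaves
        self.offset = 0
        self.data = {}

    def feed(self, cmd_bytes, key, val):
        # A write propagated from the master: apply it and append to backlog.
        self.data[key] = val
        self.backlog += cmd_bytes
        self.offset += len(cmd_bytes)

    def full_sync_from_master(self, new_data):
        # BUG (pre-fix behavior): dataset replaced, backlog/runid untouched.
        self.data = dict(new_data)

    def psync(self, runid, offset):
        # Partial resync is accepted if runid matches and offset is in range.
        if runid == self.runid and offset <= self.offset:
            return ("CONTINUE", self.backlog[offset:])
        return ("FULLRESYNC", dict(self.data))

c = Sub()
c.feed(b"set a 2\r\n", "a", 2)        # propagated from B; D applied it too
d_offset, d_data = c.offset, {"a": 2}  # D's view before disconnecting

c.full_sync_from_master({"a": 1})      # B resynced from A, so C now has a=1
reply, payload = c.psync("c-runid", d_offset)
print(reply, payload)                  # CONTINUE b'' -> D keeps stale a=2
print("D still has:", d_data)
```

With the stale backlog accepted, D receives an empty "CONTINUE" and never learns that C's dataset was replaced, which matches the observed result (a=2 in D).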

@antirez

antirez commented Jul 28, 2015

Thanks for submitting. I think I found the cause of this issue; working on a fix right now.

@antirez

antirez commented Jul 28, 2015

It will probably never be useful again, but since I wrote it, we can use it to better document the bug for the future. Here is the script to reproduce it easily:

#!/bin/bash
mkdir -p /tmp/a; rm -rf /tmp/a/*
mkdir -p /tmp/b; rm -rf /tmp/b/*
mkdir -p /tmp/c; rm -rf /tmp/c/*
mkdir -p /tmp/d; rm -rf /tmp/d/*
A=8888
B=8889
C=8810
D=8811
BIN=~/hack/redis/src/redis-server
$BIN --logfile /tmp/a/redis.log --port $A &
$BIN --logfile /tmp/b/redis.log --port $B &
$BIN --logfile /tmp/c/redis.log --port $C &
$BIN --logfile /tmp/d/redis.log --port $D &

sleep 2
redis-cli -p $A SLAVEOF NO ONE
redis-cli -p $B SLAVEOF NO ONE
redis-cli -p $C SLAVEOF NO ONE
redis-cli -p $D SLAVEOF NO ONE

redis-cli -p $A FLUSHALL
redis-cli -p $B FLUSHALL
redis-cli -p $C FLUSHALL
redis-cli -p $D FLUSHALL

# Set up the chain B <- C <- D (A stays standalone for now)
redis-cli -p $C SLAVEOF 127.0.0.1 $B
redis-cli -p $D SLAVEOF 127.0.0.1 $C

# Write the two keys
redis-cli -p $A set a 1
redis-cli -p $B set a 2
sleep 2

# Set up the SLEEP & RECONNECT condition for D: kill D's link to its master
# while D sleeps, so D attempts a PSYNC afterwards. The hardcoded client id
# (5 here) is the master link as shown by the CLIENT LIST output above.
redis-cli -p $D client list
(echo -e "multi\nclient kill id 5\ndebug sleep 5\nexec\n" | redis-cli -p $D) &

# Make B slave of A
sleep 1
redis-cli -p $B SLAVEOF 127.0.0.1 $A

redis-cli -p $A ping
redis-cli -p $B ping
redis-cli -p $C ping
redis-cli -p $D ping

# Fetch the value
sleep 6
echo "The following value should be 1 but is 2 because of the bug:"
redis-cli -p $D get a

# Kill servers
redis-cli -p $A SHUTDOWN NOSAVE
redis-cli -p $B SHUTDOWN NOSAVE
redis-cli -p $C SHUTDOWN NOSAVE
redis-cli -p $D SHUTDOWN NOSAVE

@antirez

antirez commented Jul 28, 2015

I wrote a first patch, then realized that this bug is just a manifestation of a deeper problem. The Redis replication code used to do two things:

  1. When a slave lost the connection with its master, it disconnected the chained slaves ASAP. This is not needed, since after a successful PSYNC with the master the chained slaves can continue and don't need to resync in turn.
  2. However, after a failed PSYNC the replication backlog was not reset, so a chained slave was able to PSYNC successfully even if the instance had performed a full sync with its master and now contained an entirely different data set.

So I'm writing a different fix that forces a full SYNC of the connected slaves only when the slave itself has to full SYNC with its master.
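The fixed behavior described above can be sketched with the same toy model (invented names; this is not the actual patch): when the intermediate slave is forced to full SYNC with its master, it invalidates its own replication state, so chained slaves cannot partially resync against the replaced dataset:

```python
# Toy model of the fix: on a full sync with the master, the intermediate
# slave resets its backlog and takes a new run id, so any PSYNC from a
# chained slave carrying the old run id falls back to a full resync.

class SubFixed:
    def __init__(self):
        self.runid = "c-runid-1"
        self.backlog = b""
        self.offset = 0
        self.data = {}

    def full_sync_from_master(self, new_data):
        self.data = dict(new_data)
        self.runid = "c-runid-2"  # old PSYNC (runid, offset) pairs now invalid
        self.backlog = b""
        self.offset = 0

    def psync(self, runid, offset):
        if runid == self.runid and offset <= self.offset:
            return ("CONTINUE", self.backlog[offset:])
        return ("FULLRESYNC", dict(self.data))

c = SubFixed()
# State as of the repro: D replicated "set a 2" (9 bytes) through C.
c.backlog, c.offset, c.data = b"set a 2\r\n", 9, {"a": 2}

c.full_sync_from_master({"a": 1})      # C full-syncs after B's SLAVEOF A
reply, payload = c.psync("c-runid-1", 9)
print(reply, payload)                  # FULLRESYNC -> D gets the correct a=1
```

Note the model resets state eagerly on every full sync for simplicity; the actual fix only disconnects chained slaves when the full SYNC is forced, so a successful PSYNC still lets them continue undisturbed.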

antirez added a commit that referenced this issue Jul 28, 2015
Using chained replication where C is a slave of B, which is in turn a slave of
A, if B reconnects the replication link with A but discovers it is no
longer possible to PSYNC, slaves of B must be disconnected and PSYNC
not allowed, since the new B dataset may be completely different after
the synchronization with the master.

Note that there are various semantic differences in the way this is
handled now compared to the past. In the past the semantics were:

1. When a slave lost the connection with its master, it disconnected the
chained slaves ASAP. This is not needed, since after a successful PSYNC with
the master the chained slaves can continue and don't need to resync in turn.

2. However, after a failed PSYNC the replication backlog was not reset, so a
slave was able to PSYNC successfully even if the instance did a full
sync with its master, containing now an entirely different data set.

Now instead chained slaves are not disconnected when the slave loses the
connection with its master, but only when it is forced to full SYNC with
its master. This means that if the slave having chained slaves does a
successful PSYNC, all its slaves can continue without trouble.

See issue #2694 for more details.
antirez added a commit that referenced this issue Aug 20, 2015
antirez added a commit that referenced this issue Aug 20, 2015
antirez added a commit that referenced this issue Aug 21, 2015
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Aug 29, 2016
@antirez antirez closed this as completed Jul 15, 2017