
Replication inconsistent issue #2694

Closed
GeorgeBJ opened this issue Jul 24, 2015 · 4 comments

Comments

@GeorgeBJ

Redis versions 2.8 and 3.0.3

  1. Initially there is a standalone instance A and a chain B (master) -> C (slave) -> D (slave).
  2. In A execute "set a 1"; in B execute "set a 2". Now key a has value 2 in B, C and D.
  3. In D execute "multi", "client kill :", "debug sleep 60", "exec" to make D attempt a psync after step 4.
  4. Make B a slave of A with the slaveof command.
  5. Wait for D to reconnect with C.

Expected result: the value of a in D is 1
Actual result: the value of a in D is 2

@GeorgeBJ
Author

I think in step 2 C should reset its backlog, so that D can only full sync with C.
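The failure mode GeorgeBJ describes can be sketched outside of Redis. Below is a minimal Python model (all names invented; this is not Redis source code) of an intermediate slave like C that keeps its run id and replication backlog across its own full sync, so a chained slave's partial resync is wrongly accepted against stale data:

```python
# Toy model of the bug: C full-syncs with a new master dataset but does NOT
# reset its backlog or run id, so D's PSYNC <runid> <offset> still "continues".

class Sub:  # stands in for C, the intermediate slave
    def __init__(self):
        self.runid = "c-runid"
        self.backlog = b""   # replication stream seen by chained slaves
        self.offset = 0
        self.data = {}

    def feed(self, cmd_bytes, key, val):
        # A write propagated from the master: apply it and append to backlog.
        self.data[key] = val
        self.backlog += cmd_bytes
        self.offset += len(cmd_bytes)

    def full_sync_from_master(self, new_data):
        # BUG (pre-fix behavior): dataset replaced, backlog/runid untouched.
        self.data = dict(new_data)

    def psync(self, runid, offset):
        # Partial resync is accepted if runid matches and offset is in range.
        if runid == self.runid and offset <= self.offset:
            return ("CONTINUE", self.backlog[offset:])
        return ("FULLRESYNC", dict(self.data))

c = Sub()
c.feed(b"set a 2\r\n", "a", 2)        # propagated from B; D applied it too
d_offset, d_data = c.offset, {"a": 2}  # D's view before disconnecting

c.full_sync_from_master({"a": 1})      # B resynced from A, so C now has a=1
reply, payload = c.psync("c-runid", d_offset)
print(reply, payload)                  # CONTINUE b'' -> D keeps stale a=2
print("D still has:", d_data)
```

With the stale backlog accepted, D receives an empty "CONTINUE" and never learns that C's dataset was replaced, which matches the observed result (a=2 in D).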

@antirez

antirez commented Jul 28, 2015

Thanks for submitting. I think I found the cause of this issue; working on a fix right now.

@antirez

antirez commented Jul 28, 2015

It will probably never be useful again, but since I wrote it, we can use it to better document the bug for the future. Here is the script to reproduce it easily:

#!/bin/bash
mkdir -p /tmp/a; rm -rf /tmp/a/*
mkdir -p /tmp/b; rm -rf /tmp/b/*
mkdir -p /tmp/c; rm -rf /tmp/c/*
mkdir -p /tmp/d; rm -rf /tmp/d/*
A=8888
B=8889
C=8810
D=8811
BIN=~/hack/redis/src/redis-server
$BIN --logfile /tmp/a/redis.log --port $A &
$BIN --logfile /tmp/b/redis.log --port $B &
$BIN --logfile /tmp/c/redis.log --port $C &
$BIN --logfile /tmp/d/redis.log --port $D &

sleep 2
redis-cli -p $A SLAVEOF NO ONE
redis-cli -p $B SLAVEOF NO ONE
redis-cli -p $C SLAVEOF NO ONE
redis-cli -p $D SLAVEOF NO ONE

redis-cli -p $A FLUSHALL
redis-cli -p $B FLUSHALL
redis-cli -p $C FLUSHALL
redis-cli -p $D FLUSHALL

# Set up the chain B <- C <- D (A stays standalone for now)
redis-cli -p $C SLAVEOF 127.0.0.1 $B
redis-cli -p $D SLAVEOF 127.0.0.1 $C

# Write the two keys
redis-cli -p $A set a 1
redis-cli -p $B set a 2
sleep 2

# Set up the SLEEP & RECONNECT condition for D: kill D's link to its master
# while D sleeps, so D attempts a PSYNC afterwards. The hardcoded client id
# (5 here) is the master link as shown by the CLIENT LIST output above.
redis-cli -p $D client list
(echo -e "multi\nclient kill id 5\ndebug sleep 5\nexec\n" | redis-cli -p $D) &

# Make B slave of A
sleep 1
redis-cli -p $B SLAVEOF 127.0.0.1 $A

redis-cli -p $A ping
redis-cli -p $B ping
redis-cli -p $C ping
redis-cli -p $D ping

# Fetch the value
sleep 6
echo "The following value should be 1 but is 2 because of the bug:"
redis-cli -p $D get a

# Kill servers
redis-cli -p $A SHUTDOWN NOSAVE
redis-cli -p $B SHUTDOWN NOSAVE
redis-cli -p $C SHUTDOWN NOSAVE
redis-cli -p $D SHUTDOWN NOSAVE

@antirez

antirez commented Jul 28, 2015

I wrote a first patch, then realized that this bug is just a manifestation of a deeper problem. The Redis replication code used to do two things:

  1. When a slave lost the connection with its master, it disconnected the chained slaves ASAP. This is not needed, since after a successful PSYNC with the master the chained slaves can continue and don't need to resync in turn.
  2. However, after a failed PSYNC the replication backlog was not reset, so a chained slave was able to PSYNC successfully even if the instance had performed a full sync with its master and now contained an entirely different data set.

So I'm writing a different fix that forces a full SYNC of the connected slaves only when the slave itself has to full SYNC with its master.
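The fixed behavior described above can be sketched with the same toy model (invented names; this is not the actual patch): when the intermediate slave is forced to full SYNC with its master, it invalidates its own replication state, so chained slaves cannot partially resync against the replaced dataset:

```python
# Toy model of the fix: on a full sync with the master, the intermediate
# slave resets its backlog and takes a new run id, so any PSYNC from a
# chained slave carrying the old run id falls back to a full resync.

class SubFixed:
    def __init__(self):
        self.runid = "c-runid-1"
        self.backlog = b""
        self.offset = 0
        self.data = {}

    def full_sync_from_master(self, new_data):
        self.data = dict(new_data)
        self.runid = "c-runid-2"  # old PSYNC (runid, offset) pairs now invalid
        self.backlog = b""
        self.offset = 0

    def psync(self, runid, offset):
        if runid == self.runid and offset <= self.offset:
            return ("CONTINUE", self.backlog[offset:])
        return ("FULLRESYNC", dict(self.data))

c = SubFixed()
# State as of the repro: D replicated "set a 2" (9 bytes) through C.
c.backlog, c.offset, c.data = b"set a 2\r\n", 9, {"a": 2}

c.full_sync_from_master({"a": 1})      # C full-syncs after B's SLAVEOF A
reply, payload = c.psync("c-runid-1", 9)
print(reply, payload)                  # FULLRESYNC -> D gets the correct a=1
```

Note the model resets state eagerly on every full sync for simplicity; the actual fix only disconnects chained slaves when the full SYNC is forced, so a successful PSYNC still lets them continue undisturbed.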

antirez added a commit that referenced this issue Jul 28, 2015
Using chained replication where C is a slave of B, which is in turn a slave of
A, if B reconnects the replication link with A but discovers it is no
longer possible to PSYNC, slaves of B must be disconnected and PSYNC
not allowed, since the new B dataset may be completely different after
the synchronization with the master.

Note that there are various semantic differences in the way this is
handled now compared to the past. In the past the semantics were:

1. When a slave lost the connection with its master, it disconnected the
chained slaves ASAP. This is not needed, since after a successful PSYNC with
the master the chained slaves can continue and don't need to resync in turn.

2. However, after a failed PSYNC the replication backlog was not reset, so a
slave was able to PSYNC successfully even if the instance did a full
sync with its master, containing now an entirely different data set.

Now instead chained slaves are not disconnected when the slave loses the
connection with its master, but only when it is forced to full SYNC with
its master. This means that if the slave having chained slaves does a
successful PSYNC, all its slaves can continue without trouble.

See issue #2694 for more details.
antirez added a commit that referenced this issue Aug 20, 2015
antirez added a commit that referenced this issue Aug 20, 2015
antirez added a commit that referenced this issue Aug 21, 2015
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Aug 29, 2016
@antirez antirez closed this as completed Jul 15, 2017