Replication fails with 2.8.3 slaves #1488

Closed
donspaulding opened this Issue Dec 31, 2013 · 19 comments


We get a "Protocol error from client" error when replication completes with 2.8.3 slaves. Rolling back to version 2.8.2 makes the error go away. Furthermore, we have slaves of the same master running older versions of Redis that do not exhibit this issue. At first glance this looks like a regression introduced by #1449.

I'm not sure which data to collect to help you reproduce the issue, so please let me know if there's more information I can provide.

[9290] 31 Dec 01:22:10.720 * Connecting to MASTER 127.0.0.1:6380
[9290] 31 Dec 01:22:10.721 * MASTER <-> SLAVE sync started
[9290] 31 Dec 01:22:10.721 * Non blocking connect for SYNC fired the event.
[9290] 31 Dec 01:22:10.849 * Master replied to PING, replication can continue...
[9290] 31 Dec 01:22:10.899 * (Non critical) Master does not understand REPLCONF listening-port: -ERR unknown command 'REPLCONF'
[9290] 31 Dec 01:22:10.899 * Partial resynchronization not possible (no cached master)
[9290] 31 Dec 01:22:10.950 * Master does not support PSYNC or is in error state (reply: -ERR unknown command 'PSYNC')
[9290] 31 Dec 01:22:10.950 * Retrying with SYNC...
[9290] 31 Dec 01:22:12.136 * MASTER <-> SLAVE sync: receiving 46765549 bytes from master
[9290] 31 Dec 01:22:14.761 - DB 0: 224736 keys (0 volatile) in 262144 slots HT.
[9290] 31 Dec 01:22:14.762 - 1 clients connected (0 slaves), 138179528 bytes in use
[9290] 31 Dec 01:22:19.779 - DB 0: 224736 keys (0 volatile) in 262144 slots HT.
[9290] 31 Dec 01:22:19.779 - 1 clients connected (0 slaves), 138179528 bytes in use
[9290] 31 Dec 01:22:24.803 - DB 0: 224736 keys (0 volatile) in 262144 slots HT.
[9290] 31 Dec 01:22:24.803 - 1 clients connected (0 slaves), 138179528 bytes in use
[9290] 31 Dec 01:22:28.969 * MASTER <-> SLAVE sync: Flushing old data
[9290] 31 Dec 01:22:30.888 * MASTER <-> SLAVE sync: Loading DB in memory
[9290] 31 Dec 01:22:34.550 * MASTER <-> SLAVE sync: Finished with success
[9290] 31 Dec 01:22:35.353 - DB 0: 224736 keys (0 volatile) in 262144 slots HT.
[9290] 31 Dec 01:22:35.353 - 2 clients connected (0 slaves), 138200368 bytes in use
[9290] 31 Dec 01:22:35.440 - Protocol error from client: addr=127.0.0.1:6380 fd=5 name= age=7 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=202 qbuf-free=32566 obl=0 oll=0 omem=0 events=r cmd=ping
[9290] 31 Dec 01:22:37.359 * Connecting to MASTER 127.0.0.1:6380
    .... <AND SO ON> ....

I believe the cmd= field in that line shows a different command depending on exactly which command a client runs first after the sync completes.

Here's the master server's INFO:

$ redis-cli -p 6380 info
redis_version:2.4.8
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:32
multiplexing_api:epoll
gcc_version:4.6.1
process_id:2305
uptime_in_seconds:5660266
uptime_in_days:65
lru_clock:433306
used_cpu_sys:18294.49
used_cpu_user:3419.75
used_cpu_sys_children:10052.08
used_cpu_user_children:34195.58
connected_clients:1
connected_slaves:9
client_longest_output_list:0
client_biggest_input_buf:4
blocked_clients:0
used_memory:121029984
used_memory_human:115.42M
used_memory_rss:128729088
used_memory_peak:126078008
used_memory_peak_human:120.24M
mem_fragmentation_ratio:1.06
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:1
last_save_time:1388453381
bgrewriteaof_in_progress:0
total_connections_received:1233625
total_commands_processed:21474513
expired_keys:0
evicted_keys:0
keyspace_hits:13018
keyspace_misses:221
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:47353
vm_enabled:0
role:master
slave0:127.0.0.1,36431,online
slave1:127.0.0.1,41795,online
slave2:127.0.0.1,59897,online
slave3:127.0.0.1,45379,online
slave4:127.0.0.1,59744,online
slave5:127.0.0.1,60378,online
slave6:127.0.0.1,60674,online
slave7:127.0.0.1,33161,send_bulk
slave8:127.0.0.1,34722,wait_bgsave
db0:keys=224736,expires=0

And the slave's INFO:

$ redis-cli info
# Server
redis_version:2.8.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:3ec550d40ac43a9d
redis_mode:standalone
os:Linux 3.9.3-x86_64-linode33 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:9290
run_id:988f3b3233b3b377a4cbedf0312853b545bf908e
tcp_port:6379
uptime_in_seconds:87772
uptime_in_days:1
hz:10
lru_clock:433305
config_file:/var/projects/configuration/redis.conf

# Clients
connected_clients:2
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:138238176
used_memory_human:131.83M
used_memory_rss:142532608
used_memory_peak:138407048
used_memory_peak_human:132.00M
used_memory_lua:33792
mem_fragmentation_ratio:1.03
mem_allocator:jemalloc-3.4.1

# Persistence
loading:0
rdb_changes_since_last_save:163
rdb_bgsave_in_progress:0
rdb_last_save_time:1388439273
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:7
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok

# Stats
total_connections_received:5
total_commands_processed:324738
instantaneous_ops_per_sec:0
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:76717
keyspace_misses:10779
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:167050

# Replication
role:slave
master_host:127.0.0.1
master_port:6380
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
slave_repl_offset:1
master_sync_left_bytes:46241261
master_sync_last_io_seconds_ago:0
master_link_down_since_seconds:4
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:4740.17
used_cpu_user:24389.60
used_cpu_sys_children:54.23
used_cpu_user_children:294.30

# Keyspace
db0:keys=224736,expires=0,avg_ttl=0

And finally our slave's redis.conf:

# Redis configuration file example

# Note on units: when memory size is needed, it is possible to specify
# it in the usual form of 1k 5GB 4M and so forth:
#
# 1k => 1000 bytes
# 1kb => 1024 bytes
# 1m => 1000000 bytes
# 1mb => 1024*1024 bytes
# 1g => 1000000000 bytes
# 1gb => 1024*1024*1024 bytes
#
# units are case insensitive so 1GB 1Gb 1gB are all the same.
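The unit rules above can be captured in a few lines; a minimal sketch in Python (a hypothetical helper for illustration, not Redis code):

```python
# Sketch of the unit rules above: suffixes ending in "b" are binary
# (powers of 1024), bare letters are decimal (powers of 1000), and
# matching is case insensitive. Hypothetical helper, not Redis code.
UNITS = {
    "": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}

def parse_memory_size(value: str) -> int:
    """Parse strings like '1k', '5GB', '4M' into a byte count."""
    value = value.strip().lower()
    digits = value.rstrip("kmgb")
    return int(digits) * UNITS[value[len(digits):]]
```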

# By default Redis does not run as a daemon. Use 'yes' if you need it.
# Note that Redis will write a pid file in /var/run/redis.pid when daemonized.
# -- warning:  saying yes makes this die under upstart
daemonize no

# When running daemonized, Redis writes a pid file in /var/run/redis.pid by
# default. You can specify a custom pid file location here.
# pidfile /var/run/redis-kaczynski.pid

# Accept connections on the specified port, default is 6379.
# If port 0 is specified Redis will not listen on a TCP socket.
port 6379

# If you want you can bind a single interface, if the bind option is not
# specified all the interfaces will listen for incoming connections.
#
bind 127.0.0.1

# Specify the path for the unix socket that will be used to listen for
# incoming connections. There is no default, so Redis will not listen
# on a unix socket when not specified.
#
# unixsocket /tmp/redis.sock

# Close the connection after a client is idle for N seconds (0 to disable)
timeout 3600

# Set server verbosity to 'debug'
# it can be one of:
# debug (a lot of information, useful for development/testing)
# verbose (lots of rarely useful info, but not a mess like the debug level)
# notice (moderately verbose, what you want in production probably)
# warning (only very important / critical messages are logged)
loglevel verbose

# Specify the log file name. Also 'stdout' can be used to force
# Redis to log on the standard output. Note that if you use standard
# output for logging but daemonize, logs will be sent to /dev/null
# logfile stdout
# logfile /var/log/redis/redis.log
logfile /var/projects/kaczynski/logs/redis.kaczynski.log

# To enable logging to the system logger, just set 'syslog-enabled' to yes,
# and optionally update the other syslog parameters to suit your needs.
# syslog-enabled no

# Specify the syslog identity.
# syslog-ident redis

# Specify the syslog facility.  Must be USER or between LOCAL0-LOCAL7.
# syslog-facility local0

# Set the number of databases. The default database is DB 0, you can select
# a different one on a per-connection basis using SELECT <dbid> where
# dbid is a number between 0 and 'databases'-1
databases 16

################################ SNAPSHOTTING  #################################
#
# Save the DB on disk:
#
#   save <seconds> <changes>
#
#   Will save the DB if both the given number of seconds and the given
#   number of write operations against the DB occurred.
#
#   In the example below the behaviour will be to save:
#   after 900 sec (15 min) if at least 1 key changed
#   after 300 sec (5 min) if at least 10 keys changed
#   after 60 sec if at least 10000 keys changed
#
#   Note: you can disable saving entirely by commenting out all the "save" lines.

save 60 250
save 30 500
save 15 1500
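The three save points above combine with OR semantics: a snapshot starts as soon as any one (seconds, changes) pair is fully satisfied. A minimal sketch of that check (hypothetical helper name):

```python
# Each save point fires when BOTH its elapsed-seconds and changed-keys
# thresholds are met; the configured points combine with OR.
SAVE_POINTS = [(60, 250), (30, 500), (15, 1500)]  # from the config above

def should_bgsave(seconds_since_last_save: int, dirty_keys: int) -> bool:
    return any(seconds_since_last_save >= secs and dirty_keys >= changes
               for secs, changes in SAVE_POINTS)
```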

# Compress string objects using LZF when dumping .rdb databases?
# By default that's set to 'yes' as it's almost always a win.
# If you want to save some CPU in the saving child set it to 'no' but
# the dataset will likely be bigger if you have compressible values or keys.
rdbcompression yes

# The filename where to dump the DB
dbfilename kaczynski.redis.dump.rdb

# The working directory.
#
# The DB will be written inside this directory, with the filename specified
# above using the 'dbfilename' configuration directive.
#
# Also the Append Only File will be created inside this directory.
#
# Note that you must specify a directory here, not a file name.
dir /var/projects/kaczynski/data

################################# REPLICATION #################################

# Master-Slave replication. Use slaveof to make a Redis instance a copy of
# another Redis server. Note that the configuration is local to the slave
# so for example it is possible to configure the slave to save the DB with a
# different interval, or to listen to another port, and so on.
#
slaveof 127.0.0.1 6380


# If the master is password protected (using the "requirepass" configuration
# directive below) it is possible to tell the slave to authenticate before
# starting the replication synchronization process, otherwise the master will
# refuse the slave request.
#
# masterauth <master-password>

# When a slave loses the connection with the master, or when replication
# is still in progress, the slave can act in two different ways:
#
# 1) if slave-serve-stale-data is set to 'yes' (the default) the slave will
#    still reply to client requests, possibly with out of date data, or the
#    data set may just be empty if this is the first synchronization.
#
# 2) if slave-serve-stale-data is set to 'no' the slave will reply with
#    an error "SYNC with master in progress" to all kinds of commands
#    except INFO and SLAVEOF.
#
slave-serve-stale-data yes

################################## SECURITY ###################################

# Require clients to issue AUTH <PASSWORD> before processing any other
# commands.  This might be useful in environments in which you do not trust
# others with access to the host running redis-server.
#
# This should stay commented out for backward compatibility and because most
# people do not need auth (e.g. they run their own servers).
#
# Warning: since Redis is pretty fast an outside user can try up to
# 150k passwords per second against a good box. This means that you should
# use a very strong password otherwise it will be very easy to break.
#
# requirepass foobared

# Command renaming.
#
# It is possible to change the name of dangerous commands in a shared
# environment. For instance the CONFIG command may be renamed into something
# hard to guess so that it will still be available for internal-use
# tools but not available for general clients.
#
# Example:
#
# rename-command CONFIG b840fc02d524045429941cc15f59e41cb7be6c52
#
# It is also possible to completely kill a command by renaming it into
# an empty string:
#
# rename-command CONFIG ""

################################### LIMITS ####################################

# Set the max number of connected clients at the same time. By default there
# is no limit, and it's up to the number of file descriptors the Redis process
# is able to open. The special value '0' means no limits.
# Once the limit is reached Redis will close all the new connections sending
# an error 'max number of clients reached'.
#
# maxclients 128

# Don't use more memory than the specified amount of bytes.
# When the memory limit is reached Redis will try to remove keys with an
# EXPIRE set. It will try to start freeing keys that are going to expire
# in little time and preserve keys with a longer time to live.
# Redis will also try to remove objects from free lists if possible.
#
# If all this fails, Redis will start to reply with errors to commands
# that will use more memory, like SET, LPUSH, and so on, and will continue
# to reply to most read-only commands like GET.
#
# WARNING: maxmemory can be a good idea mainly if you want to use Redis as a
# 'state' server or cache, not as a real DB. When Redis is used as a real
# database the memory usage will grow over the weeks, it will be obvious if
# it is going to use too much memory in the long run, and you'll have the time
# to upgrade. With maxmemory after the limit is reached you'll start to get
# errors for write operations, and this may even lead to DB inconsistency.
#
# maxmemory <bytes>

# MAXMEMORY POLICY: how Redis will select what to remove when maxmemory
# is reached. You can select among the following behaviors:
#
# volatile-lru -> remove the key with an expire set using an LRU algorithm
# allkeys-lru -> remove any key according to the LRU algorithm
# volatile-random -> remove a random key with an expire set
# allkeys-random -> remove a random key, any key
# volatile-ttl -> remove the key with the nearest expire time (minor TTL)
# noeviction -> don't expire at all, just return an error on write operations
#
# Note: with any of these policies, Redis will return an error on write
#       operations when there are no suitable keys for eviction.
#
#       At the time of writing these commands are: set setnx setex append
#       incr decr rpush lpush rpushx lpushx linsert lset rpoplpush sadd
#       sinter sinterstore sunion sunionstore sdiff sdiffstore zadd zincrby
#       zunionstore zinterstore hset hsetnx hmset hincrby incrby decrby
#       getset mset msetnx exec sort
#
# The default is:
#
# maxmemory-policy volatile-lru

# LRU and minimal TTL algorithms are not precise algorithms but approximated
# algorithms (in order to save memory), so you can also select the sample
# size to check. For instance, by default Redis will check three keys and
# pick the one that was used least recently; you can change the sample size
# using the following configuration directive.
#
# maxmemory-samples 3
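The approximated LRU described above amounts to: sample a few random keys and evict the one accessed longest ago. A purely illustrative sketch (Redis's real implementation works on its internal dict, not a Python one):

```python
import random

def pick_eviction_candidate(last_access: dict, samples: int = 3) -> str:
    """Sample up to `samples` random keys and return the least recently
    used one; larger samples approximate true LRU more closely."""
    sampled = random.sample(list(last_access), min(samples, len(last_access)))
    return min(sampled, key=last_access.get)
```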

############################## APPEND ONLY MODE ###############################

# By default Redis asynchronously dumps the dataset on disk. If you can live
# with the idea that the latest records will be lost if something like a crash
# happens this is the preferred way to run Redis. If instead you care a lot
# about your data and can't accept that a single record may be lost, you should
# enable the append only mode: when this mode is enabled Redis will append
# every write operation received in the file appendonly.aof. This file will
# be read on startup in order to rebuild the full dataset in memory.
#
# Note that you can have both the async dumps and the append only file if you
# like (you have to comment the "save" statements above to disable the dumps).
# Still if append only mode is enabled Redis will load the data from the
# log file at startup ignoring the dump.rdb file.
#
# IMPORTANT: Check the BGREWRITEAOF to check how to rewrite the append
# log file in background when it gets too big.

appendonly no

# The name of the append only file (default: "appendonly.aof")
# appendfilename appendonly.aof

# The fsync() call tells the Operating System to actually write data on disk
# instead of waiting for more data in the output buffer. Some OSes will really
# flush data on disk, some others will just try to do it ASAP.
#
# Redis supports three different modes:
#
# no: don't fsync, just let the OS flush the data when it wants. Faster.
# always: fsync after every write to the append only log. Slow, Safest.
# everysec: fsync only if one second passed since the last fsync. Compromise.
#
# The default is "everysec", usually the right compromise between
# speed and data safety. It's up to you to understand if you can relax this to
# "no", which will let the operating system flush the output buffer when
# it wants, for better performance (but if you can live with the idea of
# some data loss consider the default persistence mode, snapshotting),
# or on the contrary, use "always", which is very slow but a bit safer than
# everysec.
#
# If unsure, use "everysec".

# appendfsync always
# appendfsync everysec
appendfsync no

# When the AOF fsync policy is set to always or everysec, and a background
# saving process (a background save or AOF log background rewriting) is
# performing a lot of I/O against the disk, in some Linux configurations
# Redis may block too long on the fsync() call. Note that there is no fix for
# this currently, as even performing fsync in a different thread will block
# our synchronous write(2) call.
#
# In order to mitigate this problem it's possible to use the following option
# that will prevent fsync() from being called in the main process while a
# BGSAVE or BGREWRITEAOF is in progress.
#
# This means that while another child is saving, the durability of Redis is
# the same as "appendfsync none", which in practical terms means that it is
# possible to lose up to 30 seconds of log in the worst scenario (with the
# default Linux settings).
#
# If you have latency problems turn this to "yes". Otherwise leave it as
# "no" that is the safest pick from the point of view of durability.
no-appendfsync-on-rewrite no

# Automatic rewrite of the append only file.
# Redis is able to automatically rewrite the log file, implicitly calling
# BGREWRITEAOF, when the AOF log size grows by the specified percentage.
#
# This is how it works: Redis remembers the size of the AOF file after the
# latest rewrite (or if no rewrite happened since the restart, the size of
# the AOF at startup is used).
#
# This base size is compared to the current size. If the current size has
# grown by more than the specified percentage, the rewrite is triggered. You
# also need to specify a minimal size for the AOF file to be rewritten; this
# is useful to avoid rewriting the AOF file even if the percentage increase
# is reached but the file is still pretty small.
#
# Specify a percentage of zero in order to disable the automatic AOF
# rewrite feature.

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
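With the two directives above, the trigger condition amounts to: the AOF is at least 64mb and has at least doubled since its post-rewrite base. A sketch of that rule (hypothetical helper name):

```python
def should_auto_rewrite(current_size: int, base_size: int,
                        percentage: int = 100,
                        min_size: int = 64 * 1024 * 1024) -> bool:
    """True when the AOF grew by at least `percentage`% over its
    post-rewrite base size and is at least `min_size` bytes."""
    if percentage == 0 or current_size < min_size:
        return False
    growth_pct = (current_size - base_size) * 100 // max(base_size, 1)
    return growth_pct >= percentage
```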

################################## SLOW LOG ###################################

# The Redis Slow Log is a system to log queries that exceeded a specified
# execution time. The execution time does not include the I/O operations
# like talking with the client, sending the reply and so forth,
# but just the time needed to actually execute the command (this is the only
# stage of command execution where the thread is blocked and can not serve
# other requests in the meantime).
#
# You can configure the slow log with two parameters: one tells Redis
# what is the execution time, in microseconds, to exceed in order for the
# command to get logged, and the other parameter is the length of the
# slow log. When a new command is logged the oldest one is removed from the
# queue of logged commands.

# The following time is expressed in microseconds, so 1000000 is equivalent
# to one second. Note that a negative number disables the slow log, while
# a value of zero forces the logging of every command.
slowlog-log-slower-than 10000

# There is no limit to this length. Just be aware that it will consume memory.
# You can reclaim memory used by the slow log with SLOWLOG RESET.
slowlog-max-len 1024
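Together the two parameters behave like a threshold plus a bounded FIFO; a sketch (illustrative only, not Redis's actual slowlog implementation):

```python
from collections import deque

class SlowLog:
    """Record commands whose execution time meets the threshold, keeping
    only the newest `max_len` entries (the oldest is dropped first)."""
    def __init__(self, slower_than_usec: int = 10000, max_len: int = 1024):
        self.slower_than = slower_than_usec
        self.entries = deque(maxlen=max_len)

    def record(self, command: str, duration_usec: int) -> None:
        # A negative threshold disables logging; zero logs every command.
        if self.slower_than >= 0 and duration_usec >= self.slower_than:
            self.entries.append((command, duration_usec))
```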

############################### ADVANCED CONFIG ###############################

# Hashes are encoded in a special way (much more memory efficient) when they
# have at most a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure these limits with the following
# configuration directives.
# hash-max-zipmap-entries 512
# hash-max-zipmap-value 64

# Similarly to hashes, small lists are also encoded in a special way in order
# to save a lot of space. The special representation is only used when
# you are under the following limits:
list-max-ziplist-entries 512
list-max-ziplist-value 64

# Sets have a special encoding in just one case: when a set is composed
# of just strings that happen to be integers in radix 10 in the range
# of 64 bit signed integers.
# The following configuration setting sets the limit in the size of the
# set in order to use this special memory saving encoding.
set-max-intset-entries 512

# Similarly to hashes and lists, sorted sets are also specially encoded in
# order to save a lot of space. This encoding is only used when the length and
# elements of a sorted set are below the following limits:
zset-max-ziplist-entries 128
zset-max-ziplist-value 64

# Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in
# order to help rehashing the main Redis hash table (the one mapping top-level
# keys to values). The hash table implementation Redis uses (see dict.c)
# performs a lazy rehashing: the more operations you run against a hash table
# that is rehashing, the more rehashing "steps" are performed, so if the
# server is idle the rehashing is never complete and some more memory is used
# by the hash table.
#
# The default is to use this millisecond 10 times every second in order to
# actively rehash the main dictionaries, freeing memory when possible.
#
# If unsure:
# use "activerehashing no" if you have hard latency requirements and it is
# not a good thing in your environment that Redis can reply from time to time
# to queries with a 2 millisecond delay.
#
# use "activerehashing yes" if you don't have such hard requirements but
# want to free memory asap when possible.
activerehashing yes

################################## INCLUDES ###################################

# Include one or more other config files here.  This is useful if you
# have a standard template that goes to all Redis servers but also need
# to customize a few per-server settings.  Include files can include
# other files, so use this wisely.
#
# include /path/to/local.conf
# include /path/to/other.conf

pedigree commented Jan 5, 2014

I spent a couple of hours this morning trying to find out why my clients were aborting with invalid protocol exceptions, and will roll back to 2.8.2 to see if that fixes it.

Owner

antirez commented Jan 8, 2014

Hello, back to work today, investigating. Thanks for reporting!

Owner

antirez commented Jan 8, 2014

Hi again, I see here that you are using a 2.4.8 master with 2.8.3 slaves. While this should be possible in the first synchronization stage, this is not safe in the long run, or at least not guaranteed to be safe at all.

I wonder if @pedigree has the same setup.

Btw, what happens here AFAIK is that the 2.8.3 slave sends newlines that the 2.4.8 master is not able to understand; the master replies with an error, which 2.8.3 interprets as a protocol violation.

This is easy to fix in this specific instance, since 2.4.8 masters can be easily detected: they are flagged as not understanding PSYNC. However, we still need to check what happens with, for instance, 2.8.2 masters.

Investigating further the issue in order to provide a more general fix.
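The detection described here can be sketched as follows; the names are illustrative, not Redis internals (the actual fix sets a flag on the client object representing the master):

```python
# Sketch of the fix antirez outlines: a master that replies with an
# error to PSYNC is treated as pre-2.8, and the replica then suppresses
# the periodic REPLCONF ACK such masters would reject as unknown.
def master_is_pre_psync(psync_reply: str) -> bool:
    """An -ERR reply to PSYNC marks the master as pre-PSYNC (e.g. 2.4)."""
    return psync_reply.startswith("-ERR")

def should_send_replconf_ack(psync_reply: str) -> bool:
    return not master_is_pre_psync(psync_reply)
```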

pedigree commented Jan 8, 2014

I have a 1 x 2.6.17 -> (stunnel over WAN) -> 1 x 2.8.2 (patched with
interval sets) -> 3 x 2.8.2 (patched with interval sets, was gold 2.8.3)

I've been concentrating on issues with HipHop/HHVM just not connecting
after 24 hours of stress testing, but I am no longer seeing protocol
errors being logged with a 2.8.2 slave, or in the 2.8.3 middle slave/master.

I'll install a 2.8.3 for testing, but it is a custom code base, as
interval sets have dropped my system load so much that they're now going
to have to stay.


Owner

antirez commented Jan 8, 2014

@pedigree thanks for the additional info.

All: here is a clarification of the issue. The new problem was actually exposed by commit 0ff078d and not by the fix for issue #1449.

But the new behavior is actually more correct; it simply exposes a different, pre-existing bug: slaves of old masters that don't support PSYNC were still trying to send REPLCONF ACK commands to those masters.

The fix is easy and should be ready in a matter of minutes. Working at it right now.

@antirez antirez added a commit that referenced this issue Jan 8, 2014

@antirez antirez Don't send REPLCONF ACK to old masters.
Masters not understanding REPLCONF ACK will reply with errors to our
requests causing a number of possible issues.

This commit detects a global replication offset set to -1 at the end of
the replication, and marks the client representing the master with the
REDIS_PRE_PSYNC flag.

Note that this flag was called REDIS_PRE_PSYNC_SLAVE but now it is just
REDIS_PRE_PSYNC as it is used for both slaves and masters starting with
this commit.

This commit fixes issue #1488.
90a81b4

@antirez antirez added a commit that referenced this issue Jan 8, 2014

@antirez antirez Don't send REPLCONF ACK to old masters. (same commit message as above)
2a1a31c
Owner

antirez commented Jan 8, 2014

Hello, the above commit fixes the issue (tested). Redis 2.8.4 will be released in the coming days; however, you should be able to cherry-pick the commit into your branch or use the latest 2.8 commit from GitHub.

Thanks,
Salvatore

antirez closed this Jan 8, 2014

pedigree commented Jan 8, 2014

As I'm doing it via the web interface, I'm not sure how to, but I will
continue to update the interval set fork until I can convince you it's a
great pull request :)


Thanks very much for this @antirez. We'll test this after 2.8.4 hits the Ubuntu PPA we're using and follow up if we run into further issues.

Hi,

I am wondering if it is really fixed. I could reproduce the issue with Redis 2.8.4 as well as Redis 2.8.7.
My master runs Redis 2.4.17, and I was able to create a 2.4.17 slave against it easily.

However, when I create a slave via SLAVEOF using either 2.8.4 or 2.8.7, I get the following issue:

[20087] 10 Mar 14:11:19.551 * MASTER <-> SLAVE sync started
[20087] 10 Mar 14:11:19.552 * Non blocking connect for SYNC fired the event.
[20087] 10 Mar 14:11:19.552 * Master replied to PING, replication can continue...
[20087] 10 Mar 14:11:19.553 * Partial resynchronization not possible (no cached master)
[20087] 10 Mar 14:11:19.553 * Master does not support PSYNC or is in error state (reply: -ERR unknown command 'PSYNC')
[20087] 10 Mar 14:11:19.553 * Retrying with SYNC...

# Replication

role:slave
master_host:10.101.4.255
master_port:6402
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1394486339
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

Contributor

mattsta commented Mar 10, 2014

Your console output above doesn't show the same error that started this issue (Protocol error from client). It's normal for your 2.8 replica to fail on PSYNC (not supported by 2.4) and then fall back to SYNC.

Do you have more output showing the failure to sync against your master?

Do you have logs or console output from both the master and the attempted replica?

And, just curious, but why still use a 2.4 master? Has your master just been up for a year and a half without having to be restarted? You'll have memory improvements, performance fixes, and fewer edge case issues like this by upgrading everything to a recent 2.8.x.

Basically, I am not able to create a slave at all from either Redis 2.4.17 or Redis 2.6.11.

We want to migrate from Redis 2.4 to Redis 2.8.7. Since our master is still running 2.4.17, we want a slave on 2.8.7 before the switchover.

Regarding the 2.4 master: we switch the master over from one server to another every day, but both run Redis 2.4.17, so master/slave replication works fine between them.

Do you have more output showing the failure to sync against your master?
==> Yes, it is the same output, repeating in an infinite loop.

redis log from the slave server:

[23008] 10 Mar 15:22:41.007 * MASTER <-> SLAVE sync started
[23008] 10 Mar 15:22:41.007 * Non blocking connect for SYNC fired the event.
[23008] 10 Mar 15:22:41.008 * Master replied to PING, replication can continue...
[23008] 10 Mar 15:22:41.008 * Partial resynchronization not possible (no cached master)
[23008] 10 Mar 15:22:41.008 * Master does not support PSYNC or is in error state (reply: -ERR unknown command 'PSYNC')
[23008] 10 Mar 15:22:41.008 * Retrying with SYNC...
[23008] 10 Mar 15:22:46.017 # Opening the temp file needed for MASTER <-> SLAVE synchronization: Read-only file system
[23008] 10 Mar 15:22:46.918 * Connecting to MASTER uid2.dw.sc.gwallet.com:6415
[23008] 10 Mar 15:22:46.918 * MASTER <-> SLAVE sync started
[23008] 10 Mar 15:22:46.918 * Non blocking connect for SYNC fired the event.
[23008] 10 Mar 15:22:46.919 * Master replied to PING, replication can continue...
[23008] 10 Mar 15:22:46.919 * Partial resynchronization not possible (no cached master)
[23008] 10 Mar 15:22:46.919 * Master does not support PSYNC or is in error state (reply: -ERR unknown command 'PSYNC')
[23008] 10 Mar 15:22:46.919 * Retrying with SYNC...
[23008] 10 Mar 15:22:51.930 # Opening the temp file needed for MASTER <-> SLAVE synchronization: Read-only file system
[23008] 10 Mar 15:22:52.832 * Connecting to MASTER uid2.dw.sc.gwallet.com:6415
[23008] 10 Mar 15:22:52.832 * MASTER <-> SLAVE sync started

redis log from master:

[23642] 10 Mar 15:23:28 * Slave ask for synchronization
[23642] 10 Mar 15:23:28 * Starting BGSAVE for SYNC
[23642] 10 Mar 15:23:28 * Background saving started by pid 24427
[24427] 10 Mar 15:23:29 * DB saved on disk
[23642] 10 Mar 15:23:29 * Background saving terminated with success
[23642] 10 Mar 15:23:29 - DB 15: 985062 keys (985062 volatile) in 1048576 slots HT.
[23642] 10 Mar 15:23:29 - 0 clients connected (1 slaves), 127312240 bytes in use
[23642] 10 Mar 15:23:33 - Reading from client: Connection reset by peer
[23642] 10 Mar 15:23:34 - Accepted 10.101.4.143:57643
[23642] 10 Mar 15:23:34 * Slave ask for synchronization
[23642] 10 Mar 15:23:34 * Starting BGSAVE for SYNC
[23642] 10 Mar 15:23:34 * Background saving started by pid 24428
[23642] 10 Mar 15:23:34 - DB 15: 985062 keys (985062 volatile) in 1048576 slots HT.
[23642] 10 Mar 15:23:34 - 0 clients connected (1 slaves), 127312240 bytes in use
[24428] 10 Mar 15:23:35 * DB saved on disk
[23642] 10 Mar 15:23:35 * Background saving terminated with success
[23642] 10 Mar 15:23:39 - Reading from client: Connection reset by peer
[23642] 10 Mar 15:23:39 - DB 15: 985062 keys (985062 volatile) in 1048576 slots HT.
[23642] 10 Mar 15:23:39 - 0 clients connected (0 slaves), 127303608 bytes in use
[23642] 10 Mar 15:23:40 - Accepted 10.101.4.143:57645
[23642] 10 Mar 15:23:40 * Slave ask for synchronization

Please let me know if you need more information.

Contributor

mattsta commented Mar 10, 2014

Thanks for the additional information!

Redis has a sneaky way of showing errors: they are denoted by # in the log. So all of your lines with # Opening the temp file needed for MASTER <-> SLAVE synchronization: Read-only file system are errors: your filesystem has become read-only, which means your replica can't ingest the RDB from the master.

Try running fsck on your replica's file system and see if you can bring it back to a writable state.
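Since the severity marker is just a character after the timestamp, a quick way to surface only the error lines is to grep for it. A minimal sketch (the sample log file and its path are made up for illustration; point grep at your real logfile):

```shell
# Build a small sample log (hypothetical path; use your actual redis logfile).
cat > /tmp/redis-sample.log <<'EOF'
[23008] 10 Mar 15:22:41.008 * Retrying with SYNC...
[23008] 10 Mar 15:22:46.017 # Opening the temp file needed for MASTER <-> SLAVE synchronization: Read-only file system
[23008] 10 Mar 15:22:46.918 * Connecting to MASTER uid2.dw.sc.gwallet.com:6415
EOF

# In Redis logs, '#' after the timestamp marks warnings/errors,
# while '*' and '-' are informational; filter down to just the problems.
grep ' # ' /tmp/redis-sample.log
```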

Thank you very much, I got it fixed now. It was my problem: you were right, there was a permission issue writing the new rdb file.

Thanks for your quick response. I was able to create a slave of the redis 2.4.17 master.

tdk2nyt commented Apr 17, 2014

I met the same problem with redis 2.8.8

log of slave:

[2584] 17 Apr 16:37:15.041 # Server started, Redis version 2.8.8
[2584] 17 Apr 16:37:15.042 * The server is now ready to accept connections on port 6379
[2584] 17 Apr 16:37:16.043 - 0 clients connected (0 slaves), 750712 bytes in use
[2584] 17 Apr 16:37:16.044 * Connecting to MASTER 192.168.100.129:6379
[2584] 17 Apr 16:37:16.045 * MASTER <-> SLAVE sync started
[2584] 17 Apr 16:37:16.070 * Non blocking connect for SYNC fired the event.
[2584] 17 Apr 16:37:16.071 * Master replied to PING, replication can continue...
[2584] 17 Apr 16:37:16.071 * Partial resynchronization not possible (no cached master)
[2584] 17 Apr 16:37:16.072 * Full resync from master: b18b2915b1380087ff1a0016f9cf2994fa696e3f:1
[2584] 17 Apr 16:37:16.171 * MASTER <-> SLAVE sync: receiving 18 bytes from master

log of master

[3807] 17 Apr 16:37:05.546 # Server started, Redis version 2.8.8
[3807] 17 Apr 16:37:05.547 * DB loaded from disk: 0.000 seconds
[3807] 17 Apr 16:37:05.547 * The server is now ready to accept connections on port 6379
[3807] 17 Apr 16:37:06.546 - 0 clients connected (0 slaves), 750680 bytes in use
[3807] 17 Apr 16:37:11.599 - 0 clients connected (0 slaves), 750680 bytes in use
[3807] 17 Apr 16:37:16.045 - Accepted 192.168.100.130:39467
[3807] 17 Apr 16:37:16.069 * Slave asks for synchronization
[3807] 17 Apr 16:37:16.069 * Full resync requested by slave.
[3807] 17 Apr 16:37:16.069 * Starting BGSAVE for SYNC
[3807] 17 Apr 16:37:16.074 * Background saving started by pid 3811
[3811] 17 Apr 16:37:16.090 * DB saved on disk
[3811] 17 Apr 16:37:16.090 * RDB: 6 MB of memory used by copy-on-write
[3807] 17 Apr 16:37:16.168 * Background saving terminated with success
[3807] 17 Apr 16:37:16.179 * Synchronization with slave succeeded
[3807] 17 Apr 16:37:16.672 - 0 clients connected (1 slaves), 1820184 bytes in use

It seems that the master doesn't support the PSYNC command:
[root@os_1 forfun]# redis-cli
127.0.0.1:6379> help PSYNC

127.0.0.1:6379> help SYNC

SYNC -
summary: Internal command used for replication
since: 1.0.0
group: server

127.0.0.1:6379>

Is there any suggestion? Thanks a lot

Contributor

badboy commented Apr 17, 2014

I don't see any error in your instances, @tdk2nyt. PSYNC is a command used only internally. As the log already says: Partial resynchronization not possible (no cached master). This means there was never a full sync between your master and slave yet. A partial sync can only happen if, after a full sync, the link between master and slave goes down for a short time. Please read the replication documentation for more info (section "Partial resynchronization").
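To see whether a partial resync is even possible, you can look at the replication backlog fields that a 2.8 master reports in INFO replication. A sketch that works on captured output rather than a live server (the field values below are made up for illustration):

```shell
# Hypothetical output captured via: redis-cli -h <master> INFO replication
info='role:master
connected_slaves:1
master_repl_offset:12345
repl_backlog_active:1
repl_backlog_size:1048576'

# A partial resync needs an active backlog on the master, and the slave's
# replication offset must still fall inside it when the link comes back up.
echo "$info" | grep '^repl_backlog_active'
```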

tdk2nyt commented Apr 17, 2014

Thanks badboy!
Finally I checked the code and found I had misunderstood PSYNC: partial resync only works if the slave disconnects from the master for a short while; if the slave restarts, PSYNC cannot resume and a full sync happens instead.

Hello tdk2nyt, how do we know if PSYNC is working?

Contributor

mattsta commented Nov 19, 2014

If PSYNC is working, you will not see lines like:

* Master does not support PSYNC or is in error state (reply: -ERR unknown command 'PSYNC')
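Another concrete way to check is the sync counters the master keeps in INFO stats (sync_full, sync_partial_ok, sync_partial_err, available since 2.8). A sketch over captured output (the numbers below are made up for illustration):

```shell
# Hypothetical output captured via: redis-cli INFO stats | grep '^sync_'
stats='sync_full:3
sync_partial_ok:1
sync_partial_err:0'

# sync_partial_ok increments each time a slave's PSYNC request is served
# with a partial resync instead of a full RDB transfer.
echo "$stats" | grep '^sync_partial_ok'
```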

Is there any PSYNC log info we can look at?
