Master-slave replication fails when dump.rdb crosses 4GB of disk space #695

Closed
halur opened this Issue Oct 4, 2012 · 9 comments

halur commented Oct 4, 2012

Hi,

I have multiple Redis master-slave setups, and they have been working seamlessly. But since today I am facing a very strange problem.

I have a master on which all Redis mass inserts happen with redis-cli in pipe mode. The dump.rdb file on this master has now grown above 4 GB. AOF is also enabled on this server; the AOF file is 20 GB.
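
For reference, a rough sketch of what the mass insert looks like (commands.txt is a placeholder name; the file holds data already serialized in the Redis protocol):

# Rough sketch of the mass insert: commands.txt holds pre-serialized Redis
# protocol data, and redis-cli streams it to the server in pipe mode.
cat commands.txt | redis-cli --pipe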

I have observed that the master goes into repetitive BGSAVEs and BGREWRITEAOFs, which practically stops all inserts on the master. Whenever I shut down the slave, everything goes smoothly again.

Is there a limit on the number of keys or the DB size in Redis (for the master)?

Following are the server details,
Both master and slave are 4-core, 32 GB RAM servers. The system RAM status shows that 14 GB of RAM is used on the master.

First snapshot of Redis INFO on the master when I start the slave:

aof_current_size:24628818396
aof_base_size:22958745483
aof_pending_rewrite:0
aof_buffer_length:32862
aof_pending_bio_fsync:1
slave0:10.84.51.46,57873,wait_bgsave
db0:keys=31296759,expires=0

Here a BGSAVE is started and creation of a new dump file begins.

slave info :-

role:slave
master_host:x.x.x.x
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
master_sync_left_bytes:-1
master_sync_last_io_seconds_ago:9
master_link_down_since_seconds:73

After the BGSAVE completes on the master, it restarts another BGSAVE within a few seconds, and this keeps the BGSAVE activity in a loop.

On the slave, Redis INFO shows:

master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
master_sync_left_bytes:-1
master_sync_last_io_seconds_ago:11
master_link_down_since_seconds:847

But the temp.rdb file on the slave remains 0 bytes.

Whereas the master Redis logs show:

07:54:35 * DB saved on disk
07:54:50 * Slave ask for synchronization
07:54:50 * Waiting for next BGSAVE for SYNC

and the process restarts.
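
I am watching the sync fields on the slave with something like this (host and port are placeholders for the slave's address):

# Poll the replication-related INFO fields on the slave every 5 seconds.
watch -n 5 "redis-cli -h 127.0.0.1 -p 6379 INFO | grep -E 'master_link_status|master_sync|master_last_io'"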

I have many such master-slave pairs running and have never faced this problem before. Only on this master-slave group has my data grown fourfold.

Owner

antirez commented Oct 4, 2012

Hello, what version of Redis is this, and what kind of filesystem are you using for Redis persistence? Thanks.

halur commented Oct 4, 2012

Hi,
Redis version is 2.4.14.

Server and filesystem specs: the filesystem is ext3, and storage is Amazon EBS.

The server is an Amazon EC2 m2.2xlarge:
34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: No
API name: m2.2xlarge

Owner

antirez commented Oct 4, 2012

OK, so it's a 64-bit Redis running on a 64-bit platform on a normal filesystem without a 4GB limit (however, it would help to have the complete INFO output!). What error do you see on the slave side? Thank you.

Owner

antirez commented Oct 4, 2012

P.S. Note that we are investigating why the replication is not working; the issue you have with Redis being not very responsive is due to Amazon EBS being extremely slow, which is a known problem. But there is no way this should prevent replication, except perhaps via a timeout, and if it's a timeout we'll see it on the slave.

The common advice in this case is to use ephemeral storage for Redis persistence with some job that copies data to the EBS volume from time to time.
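
For example, the copy job can be as simple as a cron entry along these lines (the paths here are just placeholders, adapt them to your layout):

# Hypothetical cron entry: every 30 minutes, copy the latest RDB from the fast
# ephemeral disk to an EBS-backed directory, keeping a timestamped copy.
*/30 * * * * cp /mnt/ephemeral/redis/dump.rdb /ebs/redis-backup/dump.rdb.$(date +\%H\%M)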

halur commented Oct 4, 2012

Master info :-

redis_version:2.4.14
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.1.2
process_id:8940
uptime_in_seconds:48417
uptime_in_days:0
lru_clock:716174
used_cpu_sys:4263.36
used_cpu_user:11700.77
used_cpu_sys_children:1560.44
used_cpu_user_children:9561.38
connected_clients:18
connected_slaves:0
client_longest_output_list:1
client_biggest_input_buf:0
blocked_clients:0
used_memory:14433979448
used_memory_human:13.44G
used_memory_rss:15144411136
used_memory_peak:14855866728
used_memory_peak_human:13.84G
mem_fragmentation_ratio:1.05
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:1
changes_since_last_save:57897750
bgsave_in_progress:0
last_save_time:1349336855
bgrewriteaof_in_progress:0
total_connections_received:965628
total_commands_processed:1240221380
expired_keys:0
evicted_keys:0
keyspace_hits:48270793
keyspace_misses:6497737
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1665158
vm_enabled:0
role:master
aof_current_size:63993057587
aof_base_size:22958745483
aof_pending_rewrite:0
aof_buffer_length:629681
aof_pending_bio_fsync:1
db0:keys=31491013,expires=0

Master logs

[715] 04 Oct 14:02:14 * DB saved on disk
[8940] 04 Oct 14:02:15 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[8940] 04 Oct 14:02:16 * Background saving terminated with success
[8940] 04 Oct 14:02:18 * Background saving started by pid 7595
[8940] 04 Oct 14:02:22 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[8940] 04 Oct 14:02:25 - DB 0: 31491514 keys (0 volatile) in 33554432 slots HT.
[8940] 04 Oct 14:02:25 - 12 clients connected (1 slaves), 14434013864 bytes in use
[8940] 04 Oct 14:02:27 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[8940] 04 Oct 14:03:06 * Slave ask for synchronization
[8940] 04 Oct 14:03:06 * Waiting for next BGSAVE for SYNC
[8940] 04 Oct 14:03:08 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

Slave info :-

redis_version:2.4.14
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.1.2
process_id:22478
uptime_in_seconds:792
uptime_in_days:0
lru_clock:716260
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
connected_clients:1
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:726128
used_memory_human:709.11K
used_memory_rss:1552384
used_memory_peak:717528
used_memory_peak_human:700.71K
mem_fragmentation_ratio:2.14
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1349339088
bgrewriteaof_in_progress:0
total_connections_received:1
total_commands_processed:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
vm_enabled:0
role:slave
master_host:10.226.243.227
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
master_sync_left_bytes:-1
master_sync_last_io_seconds_ago:55
master_link_down_since_seconds:792

Slave logs :-

[22478] 04 Oct 14:08:00 - Client closed connection
[22478] 04 Oct 14:08:03 - 0 clients connected (0 slaves), 717528 bytes in use
[22478] 04 Oct 14:08:06 # Timeout receiving bulk data from MASTER...
[22478] 04 Oct 14:08:06 * Connecting to MASTER...
[22478] 04 Oct 14:08:06 * MASTER <-> SLAVE sync started
[22478] 04 Oct 14:08:06 * Non blocking connect for SYNC fired the event.

Here in the slave logs I am getting the following message:
Timeout receiving bulk data from MASTER...

Owner

antirez commented Oct 4, 2012

OK, it's just that the master is extremely slow at saving the RDB file, so you can alter the following config parameter:

repl-timeout ...

This will make the replication work again, as the slave will no longer retry the replication on timeout again and again.
While this fixes the problem, it seems that the disk is anyway too slow for your load, so you will probably still see latency spikes from time to time.

I suggest just modifying the config to start with, and later moving persistence to ephemeral storage if needed.
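
For example, something like this in the slave's redis.conf (300 here is just an illustrative value; pick something comfortably larger than the time the master needs to produce the RDB file):

# Allow up to 300 seconds without data from the master during the bulk
# transfer before the slave declares the sync timed out.
repl-timeout 300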

Closing the issue, but I'd appreciate it if you could report whether this fixed the problem. Thanks!

antirez closed this Oct 4, 2012

halur commented Oct 4, 2012

Hi,

After changing the timeout, the sync started working. Thank you very much.

Are there any proven statistics for such configuration parameters, or a list of do's and don'ts?

It would be very helpful. I sometimes feel that I am not using Redis to its full potential; there may be many possible tweaks that could increase performance.

Owner

antirez commented Oct 4, 2012

Actually the default parameter should be perfectly fine, since even while the RDB file is being saved by the master, Redis keeps sending empty newlines to the slave to prevent disconnection. I think that what you are experiencing is related to the fact that your server completely hangs for more than 60 seconds during the saving process because of the EBS volume slowness... However, I'm committing a change to the error message so that it now reads:

Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.

This may help :-)

As for other tuning you can perform: Redis is actually pretty straightforward, and the default configuration is often what works best on most systems.
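
If you want to double check what a given instance is actually using, you can inspect directives at runtime with something like the following (note that which parameters are exposed via CONFIG GET depends on the Redis version):

# Show the currently effective replication timeout, if the running version
# exposes it through CONFIG GET.
redis-cli CONFIG GET repl-timeout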

antirez added a commit that referenced this issue Oct 4, 2012

"Timeout receiving bulk data" error message modified.
The new message now contains a hint about modifying the repl-timeout
configuration directive if the problem persists.

This should normally not be needed, because while the master generates
the RDB file it makes sure to send newlines to the replication channel
to prevent timeouts. However there are times when masters running on
very slow systems can completely stop for seconds during the RDB saving
process. In such a case enlarging the timeout value can fix the problem.

See issue #695 for an example of this problem in an EC2 deployment.

v6 commented Aug 29, 2016

I just received this error, "Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.", and found it helpful.

Hopefully it will at least give the rest of us something to try before deeper troubleshooting.
