Skip to content
This repository

redis 99% cpu when master down #518

Open
donzell opened this Issue May 22, 2012 · 7 comments

3 participants

donzell jokea Salvatore Sanfilippo
donzell
donzell commented May 22, 2012

Hi,it's redis version 2.4.11
I started 16 redis -server on server A and B(each 8 redis-servers),all redis on A are slaves of redis on B,yes ,i have 8 master-slave shards. Today i found something when B was shutdown for some reason.

There was no client in truth.
All info below is about server A.

here's top's result:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12728 work 20 0 55888 15m 836 R 99.9 0.0 63:47.32 redis-server
12731 work 20 0 59984 16m 836 R 99.9 0.0 63:58.11 redis-server
12732 work 20 0 59984 15m 836 R 98.6 0.0 63:53.74 redis-server
12726 work 20 0 55888 13m 840 R 98.3 0.0 63:19.78 redis-server
12730 work 20 0 59984 15m 836 S 84.0 0.0 63:49.80 redis-server
12733 work 20 0 55888 15m 836 R 83.7 0.0 64:01.96 redis-server
12729 work 20 0 59984 15m 836 R 70.1 0.0 63:27.61 redis-server
12727 work 20 0 59984 16m 836 R 69.7 0.0 63:45.32 redis-server

And, netstat -lantp |grep redis

(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:6372 0.0.0.0:* LISTEN 12733/redis-server

tcp 0 0 0.0.0.0:6373 0.0.0.0:* LISTEN 12732/redis-server

tcp 0 0 0.0.0.0:6374 0.0.0.0:* LISTEN 12731/redis-server

tcp 0 0 0.0.0.0:6375 0.0.0.0:* LISTEN 12730/redis-server

tcp 0 0 0.0.0.0:6376 0.0.0.0:* LISTEN 12729/redis-server

tcp 0 0 0.0.0.0:6377 0.0.0.0:* LISTEN 12728/redis-server

tcp 0 0 0.0.0.0:6378 0.0.0.0:* LISTEN 12727/redis-server

tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 12726/redis-server

tcp 0 0 10.237.2.13:6377 10.237.2.37:20482 ESTABLISHED 12728/redis-server

tcp 0 0 10.237.2.13:6375 10.237.2.37:12221 ESTABLISHED 12730/redis-server

tcp 0 0 10.237.2.13:6373 10.237.2.37:56511 ESTABLISHED 12732/redis-server

tcp 0 0 10.237.2.13:6372 10.237.2.37:36262 ESTABLISHED 12733/redis-server

tcp 0 0 10.237.2.13:6379 10.237.2.37:57646 ESTABLISHED 12726/redis-server

tcp 0 0 10.237.2.13:6376 10.237.2.37:56959 ESTABLISHED 12729/redis-server

tcp 0 0 10.237.2.13:6378 10.237.2.37:54756 ESTABLISHED 12727/redis-server

tcp 0 0 10.237.2.13:6374 10.237.2.37:13526 ESTABLISHED 12731/redis-server
netstat show that they were masters,because 10.237.2.13 was ip of server A.but they were slaves ,i think.

redis's log.
[12727] 23 May 11:17:04 * MASTER <-> SLAVE sync started
[12727] 23 May 11:17:04 * Non blocking connect for SYNC fired the event.
[12727] 23 May 11:17:11 # I/O error writing to MASTER: Connection timed out
[12727] 23 May 11:17:11 * Connecting to MASTER...
[12727] 23 May 11:17:11 * MASTER <-> SLAVE sync started
[12727] 23 May 11:17:14 * Non blocking connect for SYNC fired the event.
[12727] 23 May 11:17:21 # I/O error writing to MASTER: Connection timed out
[12727] 23 May 11:17:21 * Connecting to MASTER...
[12727] 23 May 11:17:21 * MASTER <-> SLAVE sync started
[12727] 23 May 11:17:21 * Non blocking connect for SYNC fired the event.

info:
redis_version:2.4.11
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:3.4.5
process_id:12726
uptime_in_seconds:68187
uptime_in_days:0
lru_clock:1653765
used_cpu_sys:3030.79
used_cpu_user:887.53
used_cpu_sys_children:0.01
used_cpu_user_children:0.02
connected_clients:2
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:10693528
used_memory_human:10.20M
used_memory_rss:14192640
used_memory_peak:12591456
used_memory_peak_human:12.01M
mem_fragmentation_ratio:1.33
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:1
changes_since_last_save:391848
bgsave_in_progress:0
last_save_time:1337675232
bgrewriteaof_in_progress:0
total_connections_received:12
total_commands_processed:599612
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:321
vm_enabled:0
role:slave
aof_current_size:124128354
aof_base_size:8580218
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
master_host:10.237.2.37
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
master_link_down_since_seconds:4623
db0:keys=19505,expires=0

gdb: i got this stack info from every process when i attached gdb on it.
(gdb) bt
#0 0x000000302afbd1e2 in poll () from /lib64/tls/libc.so.6
#1 0x000000000040c6da in aeWait (fd=Variable "fd" is not available.
) at ae.c:371
#2 0x0000000000431026 in syncWrite (fd=7, ptr=0x446da2 "SYNC \r\n", size=7, timeout=6) at syncio.c:47
#3 0x000000000041bf0a in syncWithMaster (el=Variable "el" is not available.
) at replication.c:406
#4 0x000000000040c4d3 in aeProcessEvents (eventLoop=0x7f0e12c47000, flags=Variable "flags" is not available.
) at ae.c:344
#5 0x000000000040c755 in aeMain (eventLoop=0x7f0e12c47000) at ae.c:385
#6 0x0000000000411255 in main (argc=Variable "argc" is not available.
) at redis.c:1797
(gdb)

what confused me is output of netstat and 99% cpu usage.

jokea
jokea commented May 23, 2012

The strange thing is that netstat shows the connection to master 10.237.2.37 is established when you say that server B was shutdown.

Maybe poll() in aeWait should check more events like POLLPRI/POLLERR?

donzell
donzell commented May 23, 2012

10.237.2.37 isn't master,but slave,because local port is 6379,this is redis's listen port,so 10.237.2.37 is slave actually.

jokea
jokea commented May 23, 2012

According to info output, 10.237.2.37 seems to be master:

master_host:10.237.2.37
master_port:6379
donzell
donzell commented May 23, 2012

yeah,you are right.Connections to master has been lost.
netstat's output can be ignore now,they are from some clients on 10.237.2.37,when 10.237.2.37 was shutdown,10.237.2.13 don't know peer close.And tcp keepalive is off,so these are all dead connections.just ignore them.
tcp 0 0 10.237.2.13:6379 10.237.2.37:57646 ESTABLISHED 12726/redis-server

jokea
jokea commented May 23, 2012

Sorry I mistakenly made changes to an old pull request.
This commit fixes this issue:
jokea@bd9dd56

Salvatore Sanfilippo
Owner
antirez commented May 23, 2012

If I understand correctly there is nothing to merge as the pull request was already merged into 2.4.x as well? Thanks.

jokea
jokea commented May 23, 2012

Exactly right. But I made a mistake reopening that PR. Sorry.
Added a new pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.