Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

aeWait bug that may crash slave when fd exceeds 1024 #267

Closed
jokea opened this Issue · 4 comments

2 participants

@jokea

Let's modify redis.c to demonstrate this problem:
....
int main(int argc, char **argv) {
for (int i=1020; i<1030; i++)
aeWait(i, AE_READABLE, 10);
....
}

Run the program with strace:

#strace -T ./redis-server
....
select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010108>
select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010105>
select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010204>
select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010111>
select(1025, [1024], [], [], {1, 10000}) = 0 (Timeout) < 1.011132>
select(1026, [1025], [], [], {2, 10000}) = 0 (Timeout) < 2.012141>
select(1027, [1026], [], [], {4, 10000}) = 0 (Timeout) < 4.012903>
select(1028, [1027], [], [], {8, 10000}) = 0 (Timeout) < 8.014138>
select(1029, [1028], [], [], {16, 10000}) = 0 (Timeout) < 16.011984>
select(1030, [1029], [], [], {32, 10000}) = 0 (Timeout) < 32.014721>
...

See, the timeval parameter passed to select gets changed after fd exceeds 1024.
in aeWait(), we have:
...
365 tv.tv_sec = milliseconds/1000;
366 tv.tv_usec = (milliseconds%1000)*1000;
367 FD_ZERO(&rfds);
368 FD_ZERO(&wfds);
369 FD_ZERO(&efds);
370
371 if (mask & AE_READABLE) FD_SET(fd,&rfds);
372 if (mask & AE_WRITABLE) FD_SET(fd,&wfds);
373 if ((retval = select(fd+1, &rfds, &wfds, &efds, &tv)) > 0) {
...

Debugging using gdb shows that the FD_SET macro modified tv accidentally in line 371 and 372.
If we move line 365 and 366 after line 372, then it works as expected:
#strace -T ./redis-server
...
select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010135>
select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010162>
select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010146>
select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>
select(1025, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010143>
select(1026, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010137>
select(1027, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010388>
select(1028, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010151>
select(1029, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010206>
select(1030, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010160>
...

However, this is not the right solution to this problem, and I think using
poll(2) will be a better choice.

@jokea

More interesting result:
if we start from 1087:
for (i=1087; k<2000; i++) {
aeWait(i, AE_READABLE, 10);
...
}
strace shows this:
select(1088, [1087], [], [], {9223372036854775808, 10000}) = -1 EINVAL (Invalid argument) < 0.000006>
select(1089, [1088], [], [], {0, 10001}) = 0 (Timeout) < 0.010180>
select(1090, [1089], [], [], {0, 10002}) = 0 (Timeout) < 0.010110>
select(1091, [1090], [], [], {0, 10004}) = 0 (Timeout) < 0.010127>
select(1092, [1091], [], [], {0, 10008}) = 0 (Timeout) < 0.010119>
select(1093, [1092], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>
select(1094, [1092 1093], [], [], {0, 10032}) = 0 (Timeout) < 0.010151>

Note the first call to select returns -1, so the following line from redis-2.0
will cause redis block for an hour if fd happens to be 1087:
8104 if (syncReadLine(fd,buf,1024,3600) == -1) {
8105 close(fd);
8106 redisLog(REDIS_WARNING,"I/O error reading bulk count from MASTER: %s",
8107 strerror(errno));
8108 return REDIS_ERR;
8109 }

I think this caused several of our slave instances blocked and cost 100% of cpu today when
they connect to their master.

@jokea

The way to reproduce the problem with latest stable version:
1. start a master and set a password for it
#./redis-server

config set requirepass foobar

2.start a slave and sync from master:
#./redis-server -
port 6380

slaveof 127.0.0.1 6379

The log of slave:
[5894] 23 Dec 14:00:45 * Connecting to MASTER...
[5894] 23 Dec 14:00:45 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:45 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:45 # MASTER aborted replication with an error: ERR operation not permitted
[5894] 23 Dec 14:00:46 * Connecting to MASTER...
[5894] 23 Dec 14:00:46 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:46 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:46 # MASTER aborted replication with an error: ERR operation not permitted

The slave received error from master immediately after it sent "SYNC".

3.Attach 1100 idle clients to the slave:
#./redis-benchmark -p 6380 -c 1100 -I

4.Now the log of slave shows this:
[5894] 23 Dec 14:00:47 * Connecting to MASTER...
[5894] 23 Dec 14:00:47 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:47 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:47 # MASTER aborted replication with an error: ERR operation not permitted
[5894] 23 Dec 14:00:48 * Connecting to MASTER...
[5894] 23 Dec 14:00:48 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:48 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:55 # I/O error reading bulk count from MASTER: Connection timed out
[5894] 23 Dec 14:00:55 * Connecting to MASTER...
[5894] 23 Dec 14:00:55 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:55 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out

The slave is blocked for 7 seconds, and gets a time out error. The strace output:
...
write(1105, "SYNC \r\n", 7) = 7 <0.000062>
open("temp-1324620644.8034.rdb", O_WRONLY|O_CREAT|O_EXCL, 0644) = 1106 <0.000027>
epoll_ctl(3, EPOLL_CTL_ADD, 1105, {EPOLLIN, {u32=1105, u64=1105}}) = 0 <0.000006>
epoll_wait(3, {{EPOLLIN, {u32=1105, u64=1105}}}, 102400, 100) = 1 <0.000006>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.002646>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001798>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001797>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001797>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001805>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001801>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001784>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=405, ...}) = 0 <0.000011>
write(1, "[8034] 23 Dec 14:10:51 # I/O err"..., 88) = 88 <0.000048>
epoll_ctl(3, EPOLL_CTL_DEL, 1105, {0, {u32=1105, u64=1105}}) = 0 <0.000007>
...

@jokea

The problem exists in latest release and can cause the slave crash:
1. start master with a password:
#./redis-server

config set requirepass foo

2.start slave and attach 2200 idle clients:
#./redis-server -
port 6380
#./redis-benchmark -p 6380 -c 2200 -I

3.start replication:

slaveof 127.0.0.1 6379
and the slave crashed with the following backtrace:

[404] 26 Dec 15:43:22 - Accepted 127.0.0.1:56379
[404] 26 Dec 15:43:22 * SLAVE OF 127.0.0.1:6379 enabled (user request)
[404] 26 Dec 15:43:22 * Connecting to MASTER...
[404] 26 Dec 15:43:22 * MASTER <-> SLAVE sync started
[404] 26 Dec 15:43:22 * Non blocking connect for SYNC fired the event.
[404] 26 Dec 15:43:22 # === REDIS BUG REPORT START: Cut & paste starting from here ===
[404] 26 Dec 15:43:22 # Redis 2.4.5 crashed by signal: 11
[404] 26 Dec 15:43:22 # Failed assertion: (:0)
[404] 26 Dec 15:43:22 # --- STACK TRACE
[404] 26 Dec 15:43:22 # /lib64/libpthread.so.0 [0x344360e7c0]
[404] 26 Dec 15:43:22 # --- INFO OUTPUT
[404] 26 Dec 15:43:22 # redis_version:2.4.5
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:404
uptime_in_seconds:46
uptime_in_days:0
lru_clock:367964
used_cpu_sys:0.08
used_cpu_user:0.05
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
connected_clients:2200
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:19479080
used_memory_human:18.58M
used_memory_rss:11563008
used_memory_peak:19479080
used_memory_peak_human:18.58M
mem_fragmentation_ratio:0.59
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1324885356
bgrewriteaof_in_progress:0
total_connections_received:2203
total_commands_processed:3
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
vm_enabled:0
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
master_link_down_since_seconds:1324885403

[404] 26 Dec 15:43:22 # --- CLIENT LIST OUTPUT
[404] 26 Dec 15:43:22 # addr=127.0.0.1:54170 fd=5 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
addr=127.0.0.1:54171 fd=6 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
.....
addr=127.0.0.1:54210 fd=45 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
addr=127.0.0.1:54211 fd=46 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 ev
[404] 26 Dec 15:43:22 # === REDIS BUG REPORT END. Make sure to include from START to END. ===

Please report the crash opening an issue on github:

    http://github.com/antirez/redis/issues

Segmentation fault

@antirez
Owner

Fixed merging your pull request, thank you. Closing.

@antirez antirez closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.