aeWait bug that may crash slave when fd exceeds 1024 #267

jokea · 2011-12-22T10:10:56Z

Let's modify redis.c to demonstrate this problem:
....
int main(int argc, char **argv) {
for (int i=1020; i<1030; i++)
aeWait(i, AE_READABLE, 10);
....
}

Run the program with strace:

strace -T ./redis-server

....
select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010108>
select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010105>
select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010204>
select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010111>
select(1025, [1024], [], [], {1, 10000}) = 0 (Timeout) < 1.011132>
select(1026, [1025], [], [], {2, 10000}) = 0 (Timeout) < 2.012141>
select(1027, [1026], [], [], {4, 10000}) = 0 (Timeout) < 4.012903>
select(1028, [1027], [], [], {8, 10000}) = 0 (Timeout) < 8.014138>
select(1029, [1028], [], [], {16, 10000}) = 0 (Timeout) < 16.011984>
select(1030, [1029], [], [], {32, 10000}) = 0 (Timeout) < 32.014721>
...

See, the timeval parameter passed to select gets changed after fd exceeds 1024.
in aeWait(), we have:
...
365 tv.tv_sec = milliseconds/1000;
366 tv.tv_usec = (milliseconds%1000)*1000;
367 FD_ZERO(&rfds);
368 FD_ZERO(&wfds);
369 FD_ZERO(&efds);
370
371 if (mask & AE_READABLE) FD_SET(fd,&rfds);
372 if (mask & AE_WRITABLE) FD_SET(fd,&wfds);
373 if ((retval = select(fd+1, &rfds, &wfds, &efds, &tv)) > 0) {
...

Debugging using gdb shows that the FD_SET macro modified tv accidentally in line 371 and 372.
If we move line 365 and 366 after line 372, then it works as expected:

strace -T ./redis-server

...
select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010135>
select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010162>
select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010146>
select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>
select(1025, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010143>
select(1026, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010137>
select(1027, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010388>
select(1028, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010151>
select(1029, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010206>
select(1030, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010160>
...

However, this is not the right solution to this problem, and I think using
poll(2) will be a better choice.

jokea · 2011-12-22T14:29:51Z

More interesting result:
if we start from 1087:
for (i=1087; k<2000; i++) {
aeWait(i, AE_READABLE, 10);
...
}
strace shows this:
select(1088, [1087], [], [], {9223372036854775808, 10000}) = -1 EINVAL (Invalid argument) < 0.000006>
select(1089, [1088], [], [], {0, 10001}) = 0 (Timeout) < 0.010180>
select(1090, [1089], [], [], {0, 10002}) = 0 (Timeout) < 0.010110>
select(1091, [1090], [], [], {0, 10004}) = 0 (Timeout) < 0.010127>
select(1092, [1091], [], [], {0, 10008}) = 0 (Timeout) < 0.010119>
select(1093, [1092], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>
select(1094, [1092 1093], [], [], {0, 10032}) = 0 (Timeout) < 0.010151>

Note the first call to select returns -1, so the following line from redis-2.0
will cause redis block for an hour if fd happens to be 1087:
8104 if (syncReadLine(fd,buf,1024,3600) == -1) {
8105 close(fd);
8106 redisLog(REDIS_WARNING,"I/O error reading bulk count from MASTER: %s",
8107 strerror(errno));
8108 return REDIS_ERR;
8109 }

I think this caused several of our slave instances blocked and cost 100% of cpu today when
they connect to their master.

jokea · 2011-12-23T06:16:32Z

The way to reproduce the problem with latest stable version:

start a master and set a password for it
#./redis-server

config set requirepass foobar

2.start a slave and sync from master:
#./redis-server -
port 6380

slaveof 127.0.0.1 6379

The log of slave:
[5894] 23 Dec 14:00:45 * Connecting to MASTER...
[5894] 23 Dec 14:00:45 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:45 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:45 # MASTER aborted replication with an error: ERR operation not permitted
[5894] 23 Dec 14:00:46 * Connecting to MASTER...
[5894] 23 Dec 14:00:46 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:46 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:46 # MASTER aborted replication with an error: ERR operation not permitted

The slave received error from master immediately after it sent "SYNC".

3.Attach 1100 idle clients to the slave:
#./redis-benchmark -p 6380 -c 1100 -I

4.Now the log of slave shows this:
[5894] 23 Dec 14:00:47 * Connecting to MASTER...
[5894] 23 Dec 14:00:47 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:47 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:47 # MASTER aborted replication with an error: ERR operation not permitted
[5894] 23 Dec 14:00:48 * Connecting to MASTER...
[5894] 23 Dec 14:00:48 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:48 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:00:55 # I/O error reading bulk count from MASTER: Connection timed out
[5894] 23 Dec 14:00:55 * Connecting to MASTER...
[5894] 23 Dec 14:00:55 * MASTER <-> SLAVE sync started
[5894] 23 Dec 14:00:55 * Non blocking connect for SYNC fired the event.
[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out

The slave is blocked for 7 seconds, and gets a time out error. The strace output:
...
write(1105, "SYNC \r\n", 7) = 7 <0.000062>
open("temp-1324620644.8034.rdb", O_WRONLY|O_CREAT|O_EXCL, 0644) = 1106 <0.000027>
epoll_ctl(3, EPOLL_CTL_ADD, 1105, {EPOLLIN, {u32=1105, u64=1105}}) = 0 <0.000006>
epoll_wait(3, {{EPOLLIN, {u32=1105, u64=1105}}}, 102400, 100) = 1 <0.000006>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.002646>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001798>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001797>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001797>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001805>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001801>
select(1106, [1024], [], [], {1, 0}) = 0 (Timeout) <1.001784>
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=405, ...}) = 0 <0.000011>
write(1, "[8034] 23 Dec 14:10:51 # I/O err"..., 88) = 88 <0.000048>
epoll_ctl(3, EPOLL_CTL_DEL, 1105, {0, {u32=1105, u64=1105}}) = 0 <0.000007>
...

jokea · 2011-12-26T07:46:48Z

The problem exists in latest release and can cause the slave crash:

start master with a password:
#./redis-server

config set requirepass foo

2.start slave and attach 2200 idle clients:
#./redis-server -
port 6380
#./redis-benchmark -p 6380 -c 2200 -I

3.start replication:

slaveof 127.0.0.1 6379
and the slave crashed with the following backtrace:

[404] 26 Dec 15:43:22 - Accepted 127.0.0.1:56379
[404] 26 Dec 15:43:22 * SLAVE OF 127.0.0.1:6379 enabled (user request)
[404] 26 Dec 15:43:22 * Connecting to MASTER...
[404] 26 Dec 15:43:22 * MASTER <-> SLAVE sync started
[404] 26 Dec 15:43:22 * Non blocking connect for SYNC fired the event.
[404] 26 Dec 15:43:22 # === REDIS BUG REPORT START: Cut & paste starting from here ===
[404] 26 Dec 15:43:22 # Redis 2.4.5 crashed by signal: 11
[404] 26 Dec 15:43:22 # Failed assertion: (:0)
[404] 26 Dec 15:43:22 # --- STACK TRACE
[404] 26 Dec 15:43:22 # /lib64/libpthread.so.0 [0x344360e7c0]
[404] 26 Dec 15:43:22 # --- INFO OUTPUT
[404] 26 Dec 15:43:22 # redis_version:2.4.5
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:404
uptime_in_seconds:46
uptime_in_days:0
lru_clock:367964
used_cpu_sys:0.08
used_cpu_user:0.05
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
connected_clients:2200
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:19479080
used_memory_human:18.58M
used_memory_rss:11563008
used_memory_peak:19479080
used_memory_peak_human:18.58M
mem_fragmentation_ratio:0.59
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1324885356
bgrewriteaof_in_progress:0
total_connections_received:2203
total_commands_processed:3
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
vm_enabled:0
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
master_link_down_since_seconds:1324885403

[404] 26 Dec 15:43:22 # --- CLIENT LIST OUTPUT
[404] 26 Dec 15:43:22 # addr=127.0.0.1:54170 fd=5 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
addr=127.0.0.1:54171 fd=6 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
.....
addr=127.0.0.1:54210 fd=45 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 events=r cmd=NULL
addr=127.0.0.1:54211 fd=46 idle=10 flags=N db=0 sub=0 psub=0 qbuf=0 obl=0 oll=0 ev
[404] 26 Dec 15:43:22 # === REDIS BUG REPORT END. Make sure to include from START to END. ===

Please report the crash opening an issue on github:

    http://github.com/antirez/redis/issues

Segmentation fault

antirez · 2012-04-06T10:26:25Z

Fixed merging your pull request, thank you. Closing.

[5.0.7] add tarball tests

jokea mentioned this issue Jan 6, 2012

implement aeWait using poll(2) #287

Closed

jokea mentioned this issue Apr 6, 2012

slave crashed (2.6 branch) #438

Closed

antirez pushed a commit that referenced this issue Apr 6, 2012

implement aeWait using poll(2). Fixes issue #267.

7b4dc9b

antirez pushed a commit that referenced this issue Apr 6, 2012

implement aeWait using poll(2). Fixes issue #267.

3e1e1ac

antirez pushed a commit that referenced this issue Apr 6, 2012

implement aeWait using poll(2). Fixes issue #267.

a3ec16f

antirez closed this as completed Apr 6, 2012

JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Aug 29, 2016

implement aeWait using poll(2). Fixes issue redis#267.

86acaee

spring-operator mentioned this issue Mar 20, 2019

URL Cleanup spring-attic/redis-doc#1

Open

PatKamin added a commit to PatKamin/redis that referenced this issue Oct 28, 2022

Merge pull request redis#267 from szadam/5.0.7-devel

93cd358

[5.0.7] add tarball tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aeWait bug that may crash slave when fd exceeds 1024 #267

aeWait bug that may crash slave when fd exceeds 1024 #267

jokea commented Dec 22, 2011

jokea commented Dec 22, 2011

jokea commented Dec 23, 2011

jokea commented Dec 26, 2011

antirez commented Apr 6, 2012

aeWait bug that may crash slave when fd exceeds 1024 #267

aeWait bug that may crash slave when fd exceeds 1024 #267

Comments

jokea commented Dec 22, 2011

strace -T ./redis-server

strace -T ./redis-server

jokea commented Dec 22, 2011

jokea commented Dec 23, 2011

jokea commented Dec 26, 2011

antirez commented Apr 6, 2012