Bug Report - Server crashed due to ziplist #3740

Open
fgaule opened this Issue Jan 10, 2017 · 14 comments

@fgaule
fgaule commented Jan 10, 2017 edited
=== REDIS BUG REPORT START: Cut & paste starting from here ===
1846:M 10 Jan 02:25:56.897 # === ASSERTION FAILED ===
1846:M 10 Jan 02:25:56.897 # ==> ziplist.c:411 'NULL' is not true
1846:M 10 Jan 02:25:56.897 # (forcing SIGSEGV to print the bug report.)
1846:M 10 Jan 02:25:56.897 # Redis 3.2.6 crashed by signal: 11
1846:M 10 Jan 02:25:56.897 # Crashed running the instruction at: 0x45cb1a
1846:M 10 Jan 02:25:56.897 # Accessing address: 0xffffffffffffffff
1846:M 10 Jan 02:25:56.897 # Failed assertion: NULL (ziplist.c:411)

------ STACK TRACE ------
EIP:
/usr/local/bin/redis-server *:6502 [cluster](_serverAssert+0x6a)[0x45cb1a]

Backtrace:
/usr/local/bin/redis-server *:6502 [cluster](logStackTrace+0x29)[0x45e7c9]
/usr/local/bin/redis-server *:6502 [cluster](sigsegvHandler+0xac)[0x45eecc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f78d6c29330]
/usr/local/bin/redis-server *:6502 [cluster](_serverAssert+0x6a)[0x45cb1a]
/usr/local/bin/redis-server *:6502 [cluster][0x43183f]
/usr/local/bin/redis-server *:6502 [cluster](ziplistFind+0x24b)[0x4328db]
/usr/local/bin/redis-server *:6502 [cluster](hashTypeGetFromZiplist+0x90)[0x44ff40]
/usr/local/bin/redis-server *:6502 [cluster][0x45010c]
/usr/local/bin/redis-server *:6502 [cluster](hmgetCommand+0x72)[0x4515a2]
/usr/local/bin/redis-server *:6502 [cluster](call+0x85)[0x4270b5]
/usr/local/bin/redis-server *:6502 [cluster](processCommand+0x367)[0x42a1e7]
/usr/local/bin/redis-server *:6502 [cluster](processInputBuffer+0x105)[0x436e15]
/usr/local/bin/redis-server *:6502 [cluster](aeProcessEvents+0x218)[0x421488]
/usr/local/bin/redis-server *:6502 [cluster](aeMain+0x2b)[0x42173b]
/usr/local/bin/redis-server *:6502 [cluster](main+0x410)[0x41e6f0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f78d6872f45]
/usr/local/bin/redis-server *:6502 [cluster][0x41e962]

------ INFO OUTPUT ------
# Server
redis_version:3.2.6
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:3786d918480e6d9c
redis_mode:cluster
os:Linux 4.4.0-34-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.4
process_id:1846
run_id:9106514b69f19ec8a489b16fbc7c2a79a4a10caa
tcp_port:6502
uptime_in_seconds:651744
uptime_in_days:7
hz:10
lru_clock:7638148
executable:/usr/local/bin/redis-server
config_file:/etc/redis/redis.conf

# Clients
connected_clients:3
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:17997969880
used_memory_human:16.76G
used_memory_rss:19928453120
used_memory_rss_human:18.56G
used_memory_peak:18002086784
used_memory_peak_human:16.77G
total_system_memory:25282265088
total_system_memory_human:23.55G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:18000000000
maxmemory_human:16.76G
maxmemory_policy:allkeys-lru
mem_fragmentation_ratio:1.11
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:86471896
rdb_bgsave_in_progress:0
rdb_last_save_time:1483625502
rdb_last_bgsave_status:err
rdb_last_bgsave_time_sec:83
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:3
total_commands_processed:5255265
instantaneous_ops_per_sec:141
total_net_input_bytes:38725434125
total_net_output_bytes:133706554703
instantaneous_input_kbps:256.12
instantaneous_output_kbps:3990.56
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:157814
keyspace_hits:2994957
keyspace_misses:187616
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:11699.52
used_cpu_user:23734.44
used_cpu_sys_children:14845.52
used_cpu_user_children:46542.33

# Commandstats
cmdstat_hmset:calls=1965461,usec=1227848193,usec_per_call=624.71
cmdstat_hmget:calls=3182572,usec=3912851526,usec_per_call=1229.46
cmdstat_ping:calls=3,usec=5,usec_per_call=1.67
cmdstat_info:calls=107225,usec=3398055,usec_per_call=31.69
cmdstat_config:calls=1,usec=17,usec_per_call=17.00
cmdstat_cluster:calls=1,usec=185,usec_per_call=185.00
cmdstat_client:calls=2,usec=22,usec_per_call=11.00

# Cluster
cluster_enabled:1

# Keyspace
db0:keys=35974,expires=0,avg_ttl=0
hash_init_value: 1483657279

------ CLIENT LIST OUTPUT ------
id=554 addr=10.172.2.8:49208 fd=15 name= age=40701 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=info
id=558 addr=10.70.165.164:47558 fd=12 name= age=40534 idle=40534 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=559 addr=10.70.165.164:47560 fd=14 name= age=40534 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=5 oll=0 omem=0 events=r cmd=hmget

------ CURRENT CLIENT INFO ------
id=559 addr=10.70.165.164:47560 fd=14 name= age=40534 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=5 oll=0 omem=0 events=r cmd=hmget
argv[0]: 'HMGET'
argv[1]: 'hs:170110|170117|s|ec|2|op_de-s_demotr-s_de'
argv[2]: '236034'
argv[3]: '485253'
argv[4]: '833800'
argv[5]: '374146'
argv[6]: '544001'
argv[7]: '872071'
argv[8]: '486157'
argv[9]: '363151'
argv[10]: '331529'
argv[11]: '331531'
argv[12]: '235916'
argv[13]: '331530'
argv[14]: '236044'
argv[15]: '236307'
argv[16]: '331541'
argv[17]: '374295'
argv[18]: '236181'
argv[19]: '236315'
argv[20]: '344733'
argv[21]: '236446'
argv[22]: '235934'
argv[23]: '830355'
argv[24]: '344615'
argv[25]: '236198'
argv[26]: '235940'
argv[27]: '236079'
argv[28]: '235949'
argv[29]: '296117'
argv[30]: '363446'
argv[31]: '344496'
argv[32]: '836025'
argv[33]: '236340'
argv[34]: '435633'
argv[35]: '235962'
argv[36]: '490686'
argv[37]: '483899'
argv[38]: '236094'
argv[39]: '376517'
argv[40]: '355911'
argv[41]: '354119'
argv[42]: '907845'
argv[43]: '490190'
argv[44]: '344655'
argv[45]: '458827'
argv[46]: '361936'
argv[47]: '235988'
argv[48]: '354654'
argv[49]: '380382'
argv[50]: '236383'
argv[51]: '363226'
argv[52]: '235996'
argv[53]: '379364'
argv[54]: '490853'
argv[55]: '236007'
argv[56]: '354656'
argv[57]: '236393'
argv[58]: '360430'
argv[59]: '331502'
argv[60]: '236398'
argv[61]: '331499'
argv[62]: '819938'
argv[63]: '780667'
argv[64]: '378742'
argv[65]: '331894'
argv[66]: '458355'
argv[67]: '378736'
argv[68]: '358387'
argv[69]: '362483'
argv[70]: '236020'
argv[71]: '331890'
argv[72]: '788341'
argv[73]: '236024'
argv[74]: '236408'
argv[75]: '312825'
1846:M 10 Jan 02:25:56.900 # key 'hs:170110|170117|s|ec|2|op_de-s_demotr-s_de' found in DB containing the following object:
1846:M 10 Jan 02:25:56.900 # Object type: 4
1846:M 10 Jan 02:25:56.900 # Object encoding: 5
1846:M 10 Jan 02:25:56.900 # Object refcount: 1
1846:M 10 Jan 02:25:56.900 # Hash size: 2764

------ REGISTERS ------
1846:M 10 Jan 02:25:56.900 # 
RAX:0000000000000000 RBX:000000000000019b
RCX:00000000fbad000c RDX:0000000000000000
RDI:00007f78d6c0f760 RSI:0000000000000000
RBP:00000000004df8e8 RSP:00007fff0c191e80
R8 :00000000021d87a0 R9 :00007f78d6c0f7b8
R10:00007f78d6c0f7b8 R11:00007f78d6c0f7b0
R12:00000000004dfa97 R13:0000000000000000
R14:00007f73e2a302e4 R15:00000000000000ef
RIP:000000000045cb1a EFL:0000000000010202
CSGSFS:0000000000000033
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8f) -> 00007fff0c191f5c
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8e) -> 00007fff0c191f60
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8d) -> 00007f73e2a00000
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8c) -> 0000000a0d34372a
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8b) -> 0000000000039a02
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e8a) -> f000000000000010
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e89) -> 00007f7459787373
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e88) -> 0000000557213a00
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e87) -> 00000000004328db
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e86) -> 0000000000000001
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e85) -> 055bec00e2a0000a
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e84) -> 00007f7457213a00
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e83) -> 000000000043183f
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e82) -> 0000000000000006
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e81) -> 0000000000000001
1846:M 10 Jan 02:25:56.900 # (00007fff0c191e80) -> 00007f73e2a302e2

------ FAST MEMORY TEST ------
1846:M 10 Jan 02:25:56.901 # Bio thread for job type #0 terminated
1846:M 10 Jan 02:25:56.901 # Bio thread for job type #1 terminated
*** Preparing to test memory region 724000 (94208 bytes)
*** Preparing to test memory region 21cb000 (135168 bytes)
*** Preparing to test memory region 7f7288600000 (27051163648 bytes)
*** Preparing to test memory region 7f78d4dff000 (8388608 bytes)
*** Preparing to test memory region 7f78d5600000 (14680064 bytes)
*** Preparing to test memory region 7f78d6400000 (4194304 bytes)
*** Preparing to test memory region 7f78d6c11000 (20480 bytes)
*** Preparing to test memory region 7f78d6e33000 (16384 bytes)
*** Preparing to test memory region 7f78d755f000 (8192 bytes)
*** Preparing to test memory region 7f78d7569000 (4096 bytes)
*** Preparing to test memory region 7f78d756a000 (4096 bytes)
*** Preparing to test memory region 7f78d756d000 (8192 bytes)
*** Preparing to test memory region 7f78d756f000 (8192 bytes)
@antirez
Owner
antirez commented Jan 13, 2017

Hello, please can you tell me the value of the configuration parameter hash-max-ziplist-entries? Thanks.
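For reference, a way to check this (and the encoding of the affected key) from redis-cli; the session below is illustrative rather than taken from this instance. The CONFIG GET value is left as "..." because the reporter's answer is not preserved in the thread, while the OBJECT ENCODING and HLEN replies mirror the "Object encoding: 5" (the ziplist encoding in 3.2) and "Hash size: 2764" lines of the crash dump:

    127.0.0.1:6502> CONFIG GET hash-max-ziplist-entries
    1) "hash-max-ziplist-entries"
    2) "..."
    127.0.0.1:6502> OBJECT ENCODING "hs:170110|170117|s|ec|2|op_de-s_demotr-s_de"
    "ziplist"
    127.0.0.1:6502> HLEN "hs:170110|170117|s|ec|2|op_de-s_demotr-s_de"
    (integer) 2764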

@fgaule
fgaule commented Jan 13, 2017 edited
@antirez
Owner
antirez commented Jan 13, 2017

Thanks. Before going in depth with debugging the crash report, could you answer the following?

  1. Do you run in a cloud environment or on your own machines?
  2. If you are using your own hardware, are you using error-corrected memory modules?
  3. If you are using your own hardware and your RAM is not error corrected, could you run memtest86 on this box for some time?

Thanks.

@fgaule
fgaule commented Jan 13, 2017 edited
  1. Do you run in a cloud environment or on your own machines?
    OpenStack cloud environment (3 nodes)

  2. If you are using your own hardware, are you using error-corrected memory modules?
    I can find this out with the devops team if you need.

  3. If you are using your own hardware and your RAM is not error corrected, could you run memtest86 on this box for some time?
    Not possible :(

@antirez
Owner
antirez commented Jan 13, 2017

Could you kindly ask your Cloud vendor if they use ECC memory? I'm starting an investigation anyway, since this is the second bug report of a similar type we have received about ziplists. Even though the ziplist code has not changed significantly lately, it is possible that there are uncovered bugs. Thanks.

@antirez
Owner
antirez commented Jan 13, 2017

Sorry, another question: I see that in your hashes you stored numerical keys. Are the values also numbers, or are there also strings? What are the number ranges, if they are numbers? Thanks.

@fgaule
fgaule commented Jan 13, 2017 edited

No problem. In that case, numbers are stored as strings and their range is 200,000 to 999,000, expecting a maximum of 20,000 fields per key. (Before you think this is madness, we are benchmarking a key design idea which would match our use case.)

Feel free to ask me whatever you need ;)

@antirez
Owner
antirez commented Jan 13, 2017

Thanks! I'm going to write a stress test for ziplists in order to try to replicate similar conditions. In the past I wrote ziplist stress testers targeting the Redis API; this time I'll write a C program that uses ziplist.c directly, in order to test many operations per second and explore more states than otherwise possible. Btw, modeling problems with hashes composed of numbers is a good and very common design! Just a suggestion: once you use ziplists of, for example, 1000-3000 entries each, the key overhead is going to be small, very similar to using 20k items per ziplist. However, 20k items make operations pretty slow, so I suggest also trying a lower number of items per ziplist.
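As a rough illustration of the kind of tester described here (not the actual program), a minimal sketch that drives ziplist.c directly. It assumes it is built inside the Redis 3.2 src/ directory and linked against ziplist.o and its object-file dependencies (zmalloc.o, util.o, sds.o, endianconv.o, plus whatever allocator that build used); the functions and constants come from ziplist.h and zmalloc.h in that tree:

    /* ziplist_stress.c - illustrative sketch only, not the tester mentioned
     * above. Build inside the Redis 3.2 src/ directory, linking ziplist.o
     * and its dependencies (zmalloc.o, util.o, sds.o, endianconv.o, and the
     * allocator used by that build). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "ziplist.h"
    #include "zmalloc.h"

    /* Stub for the assertion hook that the asserts inside ziplist.o call
     * into (normally provided by debug.c). */
    void _serverAssert(char *estr, char *file, int line) {
        fprintf(stderr, "ASSERT FAILED: %s at %s:%d\n", estr, file, line);
        abort();
    }

    int main(void) {
        srand(1234);
        for (long iter = 0; iter < 100000; iter++) {
            unsigned char *zl = ziplistNew();
            char buf[32];
            int fields = 1000 + rand() % 2000;   /* 1000-3000 fields per hash */

            /* Build a flat field,value,field,value,... ziplist, the way the
             * hash type stores it, using numeric strings in the reported
             * 200000-999000 id range. */
            for (int i = 0; i < fields; i++) {
                snprintf(buf, sizeof(buf), "%d", 200000 + rand() % 800000);
                zl = ziplistPush(zl, (unsigned char*)buf, strlen(buf), ZIPLIST_TAIL);
                snprintf(buf, sizeof(buf), "%d", rand());
                zl = ziplistPush(zl, (unsigned char*)buf, strlen(buf), ZIPLIST_TAIL);
            }

            /* Query it the way the backtrace suggests hashTypeGetFromZiplist()
             * does: start from the first entry and let ziplistFind() skip one
             * entry (the value) between field comparisons. */
            for (int q = 0; q < 200; q++) {
                snprintf(buf, sizeof(buf), "%d", 200000 + rand() % 800000);
                unsigned char *p = ziplistIndex(zl, 0);
                if (p) ziplistFind(p, (unsigned char*)buf, strlen(buf), 1);
            }

            zfree(zl);
            if (iter % 100 == 0) printf("%ld hashes tested\n", iter);
        }
        return 0;
    }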

@fgaule
fgaule commented Jan 13, 2017 edited

"Could you kindly ask your Cloud vendor if they use ECC memory? "
I dont know if it is a good o bad news but they are currently using 'Advanced ECC'

PD: Thanks for the advice, benchmarks shows that having that big number of hash fields was pretty slow in terms of performance and cpu burner. I'm going to split my keys in buckets having 1000-3000 entries to check it out!
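One common way to lay out that kind of bucketing, sketched here with hypothetical key suffixes and values rather than anything from this thread: derive a small bucket index from the numeric field id, for example field_id % 10, which would split a 20,000-field key into ten hashes of roughly 2,000 fields each, append it to the original key, and have the client group the ids of each HMGET by bucket (236034 and 236044 both end in 4, so they land in the same bucket):

    HSET  hs:170110|170117|s|ec|2|op_de-s_demotr-s_de:4  236034  <value>
    HSET  hs:170110|170117|s|ec|2|op_de-s_demotr-s_de:4  236044  <value>
    HMGET hs:170110|170117|s|ec|2|op_de-s_demotr-s_de:4  236034 236044

In cluster mode the buckets of one logical key will generally hash to different slots; if keeping them on the same node matters, wrapping the shared prefix in {...} hash tags would pin all buckets of a key to one slot.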

@antirez
Owner
antirez commented Jan 13, 2017

Thanks @fgaule, the fact that they run error-corrected memory modules makes it more likely this is a Redis bug, which in theory is bad, but it actually makes it clearer where to look, which is better :-) So I'm happy to hear that. It means that the efforts in reproducing this issue are well spent. About your use case, please feel free to comment here if you need any additional hints on ziplist tradeoffs or other ways to model your problem. Bug reports and fast follow-ups deserve this and more :-) Thanks.

@fgaule
fgaule commented Jan 13, 2017

Thanks @antirez, I appreciate it!

@fgaule
fgaule commented Jan 13, 2017

Just in case: the virtual machines are running on "HP ProLiant BL465c Gen8" hardware.

@antirez
Owner
antirez commented Jan 13, 2017

Thank you, it's cool that the provider is giving assistance and details, btw.

@antirez antirez closed this Jan 13, 2017
@antirez antirez reopened this Jan 13, 2017
@antirez
Owner
antirez commented Jan 16, 2017

Hello again, quick update: the investigation is in progress. Nothing found so far, but this activity will take a few days... In case the crash happens again, a note here would be extremely useful. Thanks!
