
Redis 2.8.13 OOM crash even with maxmemory configured #2136

Closed
benzheren opened this issue Nov 11, 2014 · 26 comments

Comments

@benzheren

We are running a single Redis 2.8.13 instance on an AWS EC2 instance with 122 GB of memory, and we are trying to configure it as an LRU cache with settings like:

maxmemory 110000000000 (about 102.44 GB)
maxmemory-policy allkeys-lru
maxmemory-samples 5

We disabled both AOF and RDB persistence.

There is no crash log in the Redis log output, but the dmesg output after the crash is:

[605806.165743] cron invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[605806.165748] cron cpuset=/ mems_allowed=0
[605806.165750] CPU: 7 PID: 1401 Comm: cron Not tainted 3.13.0-29-generic #53-Ubuntu
[605806.165751] Hardware name: Xen HVM domU, BIOS 4.2.amazon 06/02/2014
[605806.165753] 0000000000000000 ffff881dead09980 ffffffff8171a214 ffff8800376617f0
[605806.165756] ffff881dead09a08 ffffffff81714b4f 0000000000000000 0000000000000000
[605806.165759] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[605806.165761] Call Trace:
[605806.165768] [] dump_stack+0x45/0x56
[605806.165771] [] dump_header+0x7f/0x1f1
[605806.165774] [] oom_kill_process+0x1ce/0x330
[605806.165778] [] ? security_capable_noaudit+0x15/0x20
[605806.165780] [] out_of_memory+0x414/0x450
[605806.165783] [] __alloc_pages_nodemask+0xa5c/0xb80
[605806.165786] [] alloc_pages_current+0xa3/0x160
[605806.165789] [] __page_cache_alloc+0x97/0xc0
[605806.165792] [] filemap_fault+0x185/0x410
[605806.165795] [] __do_fault+0x6f/0x530
[605806.165798] [] handle_mm_fault+0x492/0xf10
[605806.165801] [] ? arch_vtime_task_switch+0x94/0xa0
[605806.165803] [] ? vtime_common_task_switch+0x3d/0x40
[605806.165806] [] ? finish_task_switch+0x128/0x170
[605806.165809] [] __do_page_fault+0x184/0x560
[605806.165813] [] ? sched_clock+0x9/0x10
[605806.165815] [] ? sched_clock_local+0x1d/0x80
[605806.165818] [] ? acct_account_cputime+0x1c/0x20
[605806.165820] [] ? account_user_time+0x8b/0xa0
[605806.165822] [] ? vtime_account_user+0x54/0x60
[605806.165824] [] do_page_fault+0x1a/0x70
[605806.165827] [] page_fault+0x28/0x30
[605806.165829] Mem-Info:
[605806.165830] Node 0 DMA per-cpu:
[605806.165832] CPU 0: hi: 0, btch: 1 usd: 0
[605806.165833] CPU 1: hi: 0, btch: 1 usd: 0
[605806.165834] CPU 2: hi: 0, btch: 1 usd: 0
[605806.165835] CPU 3: hi: 0, btch: 1 usd: 0
[605806.165836] CPU 4: hi: 0, btch: 1 usd: 0
[605806.165837] CPU 5: hi: 0, btch: 1 usd: 0
[605806.165838] CPU 6: hi: 0, btch: 1 usd: 0
[605806.165838] CPU 7: hi: 0, btch: 1 usd: 0
[605806.165839] CPU 8: hi: 0, btch: 1 usd: 0
[605806.165840] CPU 9: hi: 0, btch: 1 usd: 0
[605806.165842] CPU 10: hi: 0, btch: 1 usd: 0
[605806.165843] CPU 11: hi: 0, btch: 1 usd: 0
[605806.165844] CPU 12: hi: 0, btch: 1 usd: 0
[605806.165844] CPU 13: hi: 0, btch: 1 usd: 0
[605806.165845] CPU 14: hi: 0, btch: 1 usd: 0
[605806.165846] CPU 15: hi: 0, btch: 1 usd: 0
[605806.165847] Node 0 DMA32 per-cpu:
[605806.165849] CPU 0: hi: 186, btch: 31 usd: 0
[605806.165850] CPU 1: hi: 186, btch: 31 usd: 0
[605806.165851] CPU 2: hi: 186, btch: 31 usd: 0
[605806.165851] CPU 3: hi: 186, btch: 31 usd: 0
[605806.165852] CPU 4: hi: 186, btch: 31 usd: 139
[605806.165853] CPU 5: hi: 186, btch: 31 usd: 0
[605806.165854] CPU 6: hi: 186, btch: 31 usd: 52
[605806.165855] CPU 7: hi: 186, btch: 31 usd: 30
[605806.165856] CPU 8: hi: 186, btch: 31 usd: 0
[605806.165857] CPU 9: hi: 186, btch: 31 usd: 0
[605806.165858] CPU 10: hi: 186, btch: 31 usd: 0
[605806.165859] CPU 11: hi: 186, btch: 31 usd: 0
[605806.165860] CPU 12: hi: 186, btch: 31 usd: 0
[605806.165861] CPU 13: hi: 186, btch: 31 usd: 0
[605806.165862] CPU 14: hi: 186, btch: 31 usd: 0
[605806.165863] CPU 15: hi: 186, btch: 31 usd: 0
[605806.165864] Node 0 Normal per-cpu:
[605806.165865] CPU 0: hi: 186, btch: 31 usd: 0
[605806.165866] CPU 1: hi: 186, btch: 31 usd: 0
[605806.165867] CPU 2: hi: 186, btch: 31 usd: 0
[605806.165868] CPU 3: hi: 186, btch: 31 usd: 0
[605806.165869] CPU 4: hi: 186, btch: 31 usd: 0
[605806.165870] CPU 5: hi: 186, btch: 31 usd: 0
[605806.165871] CPU 6: hi: 186, btch: 31 usd: 0
[605806.165872] CPU 7: hi: 186, btch: 31 usd: 0
[605806.165873] CPU 8: hi: 186, btch: 31 usd: 0
[605806.165874] CPU 9: hi: 186, btch: 31 usd: 0
[605806.165875] CPU 10: hi: 186, btch: 31 usd: 0
[605806.165876] CPU 11: hi: 186, btch: 31 usd: 0
[605806.165877] CPU 12: hi: 186, btch: 31 usd: 0
[605806.165878] CPU 13: hi: 186, btch: 31 usd: 0
[605806.165879] CPU 14: hi: 186, btch: 31 usd: 0
[605806.165880] CPU 15: hi: 186, btch: 31 usd: 0
[605806.165882] active_anon:31039555 inactive_anon:76 isolated_anon:0
[605806.165882] active_file:60 inactive_file:60 isolated_file:0
[605806.165882] unevictable:0 dirty:0 writeback:0 unstable:0
[605806.165882] free:139633 slab_reclaimable:6506 slab_unreclaimable:11931
[605806.165882] mapped:7 shmem:95 pagetables:61334 bounce:0
[605806.165882] free_cma:0
[605806.165885] Node 0 DMA free:15904kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[605806.165889] lowmem_reserve[]: 0 3744 122934 122934
[605806.165891] Node 0 DMA32 free:478708kB min:2056kB low:2568kB high:3084kB active_anon:3317128kB inactive_anon:20kB active_file:36kB inactive_file:36kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3836820kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:20kB slab_reclaimable:1292kB slab_unreclaimable:1440kB kernel_stack:40kB pagetables:29904kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9510 all_unreclaimable? yes
[605806.165895] lowmem_reserve[]: 0 0 119190 119190
[605806.165897] Node 0 Normal free:63920kB min:65516kB low:81892kB high:98272kB active_anon:120841092kB inactive_anon:284kB active_file:204kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:124067840kB managed:122050804kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:360kB slab_reclaimable:24732kB slab_unreclaimable:46284kB kernel_stack:4392kB pagetables:215432kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:850 all_unreclaimable? yes
[605806.165900] lowmem_reserve[]: 0 0 0 0
[605806.165902] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
[605806.165910] Node 0 DMA32: 1454*4kB (UEM) 3121*8kB (UEM) 1970*16kB (UEM) 919*32kB (EM) 469*64kB (UEM) 293*128kB (UEM) 212*256kB (EM) 60*512kB (UM) 157*1024kB (UM) 0*2048kB 18*4096kB (MR) = 478720kB
[605806.165920] Node 0 Normal: 1085*4kB (UEM) 510*8kB (UEM) 255*16kB (UEM) 345*32kB (UEM) 159*64kB (UEM) 71*128kB (UEM) 26*256kB (E) 14*512kB (EM) 7*1024kB (U) 0*2048kB 0*4096kB = 63796kB
[605806.165929] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[605806.165930] 221 total pagecache pages
[605806.165931] 0 pages in swap cache
[605806.165932] Swap cache stats: add 0, delete 0, find 0/0
[605806.165933] Free swap = 0kB
[605806.165934] Total swap = 0kB
[605806.165935] 31999901 pages RAM
[605806.165936] 0 pages HighMem/MovableOnly
[605806.165937] 504259 pages reserved
[605806.165937] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[605806.165947] [ 691] 0 691 12473 222 27 0 -1000 systemd-udevd
[605806.165949] [ 897] 0 897 2556 575 8 0 0 dhclient
[605806.165951] [ 1283] 101 1283 65115 3682 35 0 0 rsyslogd
[605806.165952] [ 1370] 0 1370 3635 41 12 0 0 getty
[605806.165953] [ 1373] 0 1373 3635 39 12 0 0 getty
[605806.165955] [ 1378] 0 1378 3635 42 12 0 0 getty
[605806.165956] [ 1379] 0 1379 3635 41 12 0 0 getty
[605806.165958] [ 1382] 0 1382 3635 40 12 0 0 getty
[605806.165959] [ 1401] 0 1401 5914 56 17 0 0 cron
[605806.165960] [ 1402] 0 1402 4785 42 13 0 0 atd
[605806.165962] [ 1414] 0 1414 15341 169 33 0 -1000 sshd
[605806.165963] [ 1435] 0 1435 1092 35 8 0 0 acpid
[605806.165965] [ 1436] 102 1436 9803 105 23 0 0 dbus-daemon
[605806.165967] [ 1442] 0 1442 4863 117 13 0 0 irqbalance
[605806.165968] [ 1456] 0 1456 10883 90 26 0 0 systemd-logind
[605806.165970] [ 1512] 0 1512 3635 41 12 0 0 getty
[605806.165971] [ 1514] 0 1514 3197 39 12 0 0 getty
[605806.165973] [ 3073] 0 3073 4869 51 13 0 0 upstart-udev-br
[605806.165974] [ 3077] 0 3077 3819 58 12 0 0 upstart-file-br
[605806.165976] [ 3078] 0 3078 3815 58 11 0 0 upstart-socket-
[605806.165977] [39705] 999 39705 19922 941 39 0 0 gmond
[605806.165979] [74655] 998 74655 31118305 31025991 60725 0 0 redis-server
[605806.165981] [76879] 1003 76879 6511 175 16 0 0 screen
[605806.165982] [76880] 1003 76880 5510 691 15 0 0 bash
[605806.165984] [77322] 0 77322 15918 116 34 0 0 sudo
[605806.165985] [77323] 0 77323 17122 3850 37 0 0 gdb
[605806.165986] [82568] 0 82568 26408 246 55 0 0 sshd
[605806.165988] [82665] 1003 82665 26408 249 53 0 0 sshd
[605806.165989] [82666] 1003 82666 5535 707 15 0 0 bash
[605806.165991] Out of memory: Kill process 74655 (redis-server) score 987 or sacrifice child
[605806.172830] Killed process 74655 (redis-server) total-vm:124473220kB, anon-rss:124103964kB, file-rss:0kB


My question is: why does the OOM kill still happen even with maxmemory configured? I happened to come across this article online: http://www.couyon.net/blog/using-redis-as-a-lru-cache-dont-do-it.

Is there something more we should pay attention to when we use Redis as an LRU cache?

@antirez
Contributor

antirez commented Nov 11, 2014

Either the limit is too high or there are other non-trivial users of memory on the instance. Could you please post the server INFO output as provided in the crash trace by Redis? Thanks.

@benzheren
Author

@antirez we did attach gdb to the crashed Redis process when we first started it. What is the easiest way to get the crash trace for you? We did not find a crash trace in the Redis log file. Is there any other way we can provide more information? This is what it looks like when I grep for Redis-related processes on the server:

redis     74655  6.7  0.0      0     0 ?        Zsl  Nov09 241:57 [redis-server] <defunct>
root      77322  0.0  0.0  63672   464 pts/1    S+   Nov10   0:00 sudo gdb /usr/local/bin/redis-server 74655
root      77323  0.0  0.0  68488 15440 pts/1    S+   Nov10   0:00 gdb /usr/local/bin/redis-server 74655

PID 74655 is the crashed Redis instance.

@antirez
Contributor

antirez commented Nov 12, 2014

Thanks, yep, when it's killed by the OOM killer there is no crash report, you are right. I was curious to check whether Redis was persisting to disk (had forked a child process) when the OOM killer killed it.

Btw, the OOM killer output already gives us some information, and together with your logs you can get the whole picture:

  1. anon-rss:124103964kB: the process was using about 118 GB when it was killed. maxmemory enforces the data size, but there are other buffers that can make a difference, so when limiting the data size with maxmemory it is better to use a lower limit (you can always raise it at runtime with CONFIG SET maxmemory newlimit-in-bytes).
  2. Another cause of extra memory usage is fragmentation. The Redis maxmemory check uses the "logical" memory usage, which is the sum of all allocations plus some overhead. The real memory usage is always at least a bit greater, or much greater if the workload is not a good fit for jemalloc's ability to avoid fragmentation.
  3. Copy-on-write during persistence is another source of memory usage. If you are using AOF or RDB, check the log file during rewrites or RDB generation for the additional memory used by the child process. In extreme conditions it can be 2x.
  4. Transparent huge pages make the problem at "3" much worse, so make sure to disable them (see the example commands below).
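
For reference, the runtime adjustments mentioned above look something like this (the values are just examples, adapt them to your setup):

# check logical vs. resident memory and the fragmentation ratio
redis-cli INFO memory | grep -E 'used_memory:|used_memory_rss|mem_fragmentation_ratio'

# lower (or later raise) the data-size limit at runtime, value in bytes
redis-cli CONFIG SET maxmemory 80000000000

# disable transparent huge pages (see http://redis.io/topics/admin), then restart Redis
echo never > /sys/kernel/mm/transparent_hugepage/enabled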

I hope this helps,
Salvatore

@benzheren
Author

@antirez Thanks for the feedback and suggestions. Since the last crash, we have set maxmemory to 100000000000, which is about 93 GB. But Redis's memory usage just keeps growing; the instance's free memory is below 8 GB at this moment. I have attached a free-memory chart covering the period since the instance started; it seems to me that Redis keeps eating memory. I also attached the output of INFO from the server at the time I took the snapshot of the free-memory chart.

[screenshot: free-memory chart since instance start, 2014-11-15 9:23 PM]

redis_version:2.8.13
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:e81c9ef879587901
redis_mode:standalone
os:Linux 3.13.0-29-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.2
process_id:82866
run_id:db2afcef13c75f4431e9793f0bf336754a1f4eda
tcp_port:6379
uptime_in_seconds:337598
uptime_in_days:3
hz:10
lru_clock:6770794
config_file:/etc/redis/6379.conf

# Clients
connected_clients:366
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:99968724296
used_memory_human:93.10G
used_memory_rss:118807183360
used_memory_peak:100028209648
used_memory_peak_human:93.16G
used_memory_lua:33792
mem_fragmentation_ratio:1.19
mem_allocator:jemalloc-3.6.0

# Persistence
loading:0
rdb_changes_since_last_save:157248575
rdb_bgsave_in_progress:0
rdb_last_save_time:1415719340
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:53845
total_commands_processed:1380700381
instantaneous_ops_per_sec:15340
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:221744
evicted_keys:13607389
keyspace_hits:1180605859
keyspace_misses:69929733
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:21812.20
used_cpu_user:15842.75
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Keyspace
db0:keys=23449039,expires=23441943,avg_ttl=534909200

For this instance, we disabled both AOF and RDB.

If memory keeps growing like this, is Redis still a good fit as an LRU cache? I have set maxmemory to 75% of the instance memory, and still the memory keeps growing. It seems to me it is not very memory efficient to use Redis as an LRU cache.

@benzheren
Author

I forgot to mention that the value which keeps growing is used_memory_rss.

The used_memory does not grow; it stays at the value we set for maxmemory.

@mattsta
Contributor

mattsta commented Nov 15, 2014

Summary:

  • Around 23 million total keys.
  • All keys have an expire set except for 7,096 of them.
  • There's around 20% memory fragmentation:
    • used_memory_rss / used_memory = 118807183360 / 99968724296 = 1.18844
  • So, you are using 93 GB of memory for your data (exactly the maxmemory limit, so Redis is evicting), but Redis itself is taking up 110 GB of total memory.

@antirez Is there a reason the maxmemory limit doesn't use the RSS value? People tend to set limits based on physical memory usage, not logical data size, right? If someone has 1 GB of data but an RSS of 2 GB due to fragmentation, there's no way to manage memory usage other than manually monitoring and frequently adjusting maxmemory.

Other people have mentioned this before too. Once Redis hits maxmemory, it can keep fragmenting itself until it consumes all system resources. The article mentioned above (http://www.couyon.net/blog/using-redis-as-a-lru-cache-dont-do-it) is a good intro to the topic too.

Right now Redis enforces a data size limit, but not a process size limit. By enforcing only the data size limit, the process size can grow unbounded under eviction pressure, which seems really bad.

So, technically maxmemory is misnamed (it ignores replication buffers, but not client buffers, and fragmentation). maxmemory is really a maximum data set size, and there's no way to enforce a Redis memory limit on the entire process's memory usage.

The quickest 90% fix would be to base maxmemory on RSS instead of logical memory usage:

diff --git a/src/redis.c b/src/redis.c
index eef5251..6d9f131 100644
--- a/src/redis.c
+++ b/src/redis.c
@@ -3160,7 +3160,7 @@ int freeMemoryIfNeeded(void) {

     /* Remove the size of slaves output buffers and AOF buffer from the
      * count of used memory. */
-    mem_used = zmalloc_used_memory();
+    mem_used = server.resident_set_size;
     if (slaves) {
         listIter li;
         listNode *ln;

That approach still allows memory growth beyond the limit because Redis doesn't count replication buffers towards eviction memory usage, but replication buffers tend to be small anyway (maybe? if they are small, why are we ignoring their memory usage since they don't count for much? if they are large, then... why ignore their memory usage since they can overflow the limits? boggle).

[Sidenote: my favorite party trick with maxmemory + a global eviction policy: have a client run a pipeline request containing thousands of commands so Redis has to build up a huge result buffer in memory (it could even be something dumb like LRANGE <biglist> 0 -1 5,000 times). That huge result buffer will count towards memory usage. Redis will see it is now using a lot of memory and start evicting keys. If your result buffer is big enough, Redis will evict all the keys, because no matter how many keys it evicts, the memory usage will never go below maxmemory during that eviction loop. Poof! All your data is now gone.]
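
A rough sketch of that trick with hiredis (key name and command count are made up, and obviously don't run this against anything you care about):

#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;

    /* Queue 5,000 LRANGE commands client-side without reading any reply. */
    for (int i = 0; i < 5000; i++)
        redisAppendCommand(c, "LRANGE biglist 0 -1");

    /* Only now flush and read replies one by one. While the server works
     * through the backlog, its output buffer for this client grows; that
     * buffer counts toward maxmemory, so the eviction loop starts deleting
     * keys to get back under the limit. */
    for (int i = 0; i < 5000; i++) {
        void *reply;
        if (redisGetReply(c, &reply) != REDIS_OK) break;
        freeReplyObject(reply);
    }
    redisFree(c);
    return 0;
}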

@benzheren
Author

@mattsta I agree with your suggestion. Currently the maxmemory config of Redis is really confusing, especially for people who try to use it as an LRU cache.

@antirez
Contributor

antirez commented Nov 16, 2014

Note: I'm replying to the original issue and how it can be solved first; the next comment is a reply to @mattsta.

@benzheren it looks like the problem is not the LRU algorithm of Redis itself: from its point of view it is indeed evicting data, but for your workload jemalloc is unfortunately fragmenting, although usually fragmentation grows only logarithmically. The best thing to do is to work in the reverse way: set a memory limit that allows for much higher fragmentation, for example up to 1.6, to be sure you have enough room, and monitor the actual fragmentation. You can do this with CONFIG SET maxmemory at runtime (but it is blocking if you don't lower it progressively). I also have some concern at this point that you may have THP enabled, and this may interfere with jemalloc's ability to free memory.
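
For example (illustrative numbers, not a recommendation): on a 122 GB box, budgeting for 1.6x fragmentation means a limit of roughly 122 / 1.6 ≈ 76 GB, so something like

CONFIG SET maxmemory 80000000000

(about 74.5 GB) keeps even a worst-case RSS of roughly 74.5 × 1.6 ≈ 119 GB under the physical memory.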

So also make sure at some point to disable transparent huge pages and restart the server (see http://redis.io/topics/admin if you don't know how to disable them).

Let's see what happens; maybe there is something else we are not considering here, since there are some parts of Redis that use plain malloc and are not tracked for memory usage, and we could have a leak there. However Lua memory usage, which could be one culprit, is low, so that's unlikely.

For now the best bet is basically to account upfront for the maximum fragmentation you'll experience. Many more details in the next reply to Matt.

@antirez
Contributor

antirez commented Nov 16, 2014

@benzheren oh, a few questions to understand what could be causing the fragmentation:

  1. Do you have a workload where progressively larger objects are created over time?
  2. What data types are you using?

Thanks,
Salvatore

@antirez
Contributor

antirez commented Nov 16, 2014

@mattsta @benzheren Limiting memory usage via RSS would be great, but it is extremely impractical, or rather impossible, except in two specific cases:

  1. When you are able to allocate all the memory at startup.
  2. When the allocation pattern makes fragmentation a non-issue: for example, a slab allocator that allocates and releases memory only in fixed 32 MB chunks will generate basically no fragmentation at all, so you can actually have an RSS-perfect limit.

However, limiting based on the observed RSS is not possible, because when you start freeing data the RSS does not change immediately: it may remain high because the allocator does not unmap a given memory region, or it may go down later, incrementally, as the allocator performs some cleanup.

So if we do what you suggest (mem_used = server.resident_set_size;), what happens is the following: when maxmemory is reached, at every memory-freeing iteration you still see that you need to free memory, since the RSS stays over the memory limit for quite some time, and you end up destroying your data set. Eventually the RSS may change and the eviction is not triggered again.

Matt's suggestion was actually implemented by me some time ago, with additional changes to try to cope with the RSS's slowness to change, but eventually I gave up in favor of another approach I never tried but which looks more promising: we should instead adjust the instantaneous memory reporting based on the fragmentation experienced. For example, if the maximum fragmentation seen so far when used memory is near peak memory is 1.3, the instantaneous memory usage used for the memory-limit check should be zmalloc_used_memory * fragmentation.

This could already improve things, since we would start enforcing the memory limit earlier, but it could still be imperfect with certain patterns. For example, we could reach the memory limit with a perfect fragmentation of 1.0, and later the system may start to fragment. The memory reporting would start to adapt and we would start freeing more data, but if the newly added objects cannot reuse the old allocations, the RSS could still go up, with little chance of going down in the future.

However, I believe this would already be pretty useful. Even the above problem has a fix: start with a pessimistic fragmentation figure, for example 1.3 or more. Once maxmemory is reached in this way, we keep monitoring the actual fragmentation and slowly adapt the estimate based on the real data. If the fragmentation changes we'll adapt in the long run, and we'll use all the memory in case it turns out to be very low.

The key to implementing such a system is to make gentle adjustments, otherwise you get trapped in problems similar to those of RSS monitoring itself (though not as severe, of course). For example, if you start with an estimated fragmentation F, you adapt it at every cycle by blending it like this: keep 99.99% of the current F and take 0.01% from the real fragmentation. These are just example numbers; you compute the percentages based on how much delay you want in "following" the current fragmentation figure.
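
A minimal sketch of that blending, with hypothetical names (not actual Redis code):

#include <stddef.h>

/* Exponential moving average of the observed fragmentation ratio.
 * frag_estimate starts pessimistic (e.g. 1.4) and drifts slowly toward
 * the measured RSS/used ratio; FRAG_ALPHA controls how fast it "follows". */
#define FRAG_ALPHA 0.0001   /* take 0.01% of the new sample per cycle */

static double frag_estimate = 1.4;

void updateFragEstimate(size_t rss, size_t zmalloc_used) {
    double measured = (double)rss / (double)zmalloc_used;
    frag_estimate = frag_estimate * (1.0 - FRAG_ALPHA) + measured * FRAG_ALPHA;
}

/* The figure compared against maxmemory would then be the logical usage
 * inflated by the estimate. */
size_t effectiveUsedMemory(size_t zmalloc_used) {
    return (size_t)((double)zmalloc_used * frag_estimate);
}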

Ok, I'll try to implement this stuff on Monday and report back... however I would love some help from @benzheren, because this is the kind of thing that should be confirmed in the field. We can make Redis better together. Thanks.

@benzheren
Author

@antirez Answers to your questions:

  1. We did NOT disable THP; we will change that and start a new instance to see how it affects performance.
  2. We mainly use this instance as an LRU cache, caching database objects (marshaled to strings); some of them can be pretty large compared to a simple string.
  3. For this instance, the majority data type is a simple key-value store (marshaled object strings). We use sets to store relations between objects. These are our two major data types.

@antirez
Contributor

antirez commented Nov 17, 2014

Hello @benzheren and @mattsta. I investigated how to improve maxmemory today. These are my findings.

First, let's start with what anyone using Redis in LRU mode today should do:

  1. Set a much lower maxmemory, budgeting for a 1.4x overhead, since jemalloc rarely fragments beyond that.
  2. Observe the real fragmentation over the course of the next days.
  3. Progressively raise the maxmemory limit with CONFIG SET according to the findings.

Now, on the topic of making the above less of a pain and more automated:

I can confirm that the algorithm can't adjust itself once the user-defined maxmemory limit is reached, because if the RSS is already higher than the configured parameter, it will stay higher even if we evict 50% of the objects in memory. jemalloc will only be able to actually reclaim memory if we evict almost the whole data set. Moreover, if we have already gone over the configured limit, that is already a problem.

It is also not possible to take decisions only once the RSS reaches the configured limit: when that happens, we'll likely find no fragmentation at all. The trouble starts once the server begins to evict because maxmemory has already been reached. If at that point we try to evict more objects to get live feedback from the RSS, nothing good happens, since the RSS will stay at the same level.

So basically what we can do is mimic what an operations person would do: guess a fragmentation ratio, set the maxmemory parameter accordingly, observe what happens, and adjust the setting.

I implemented the above strategy and indeed it seems to work. This is how it works:

  1. We start with a pessimistic guess of the fragmentation ratio for a given workload. I used 1.4.
  2. We set the actual memory limit to user_limit / 1.4.
  3. Once the server hits the limit, we start to sample the fragmentation every 1 million executed commands. If the fragmentation rises, we continue to monitor. At some point we check the fragmentation and find it equal to the previous sample: we have our guess.
  4. The 1.4 guess is updated with the actual fragmentation observed at "3". The enforced limit is updated accordingly. For successive CONFIG SET maxmemory commands, we'll use the new guess.

It is still not perfect, but I have a patch implementing the above that we can use to evaluate whether it is worth it. You can find it here: https://github.com/antirez/redis/commits/rssmaxmemory
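
In rough pseudo-C, the idea is something like this (my own sketch, not the actual patch; the names are made up):

#include <stddef.h>

static double frag_guess = 1.4;   /* pessimistic starting point */
static double last_sample = 0.0;

/* The limit actually enforced: the user-configured maxmemory divided by the
 * current fragmentation guess, so data plus fragmentation stays near the
 * user's number. */
size_t effectiveMaxmemory(size_t user_maxmemory) {
    return (size_t)((double)user_maxmemory / frag_guess);
}

/* Called roughly every million commands once the effective limit is hit. */
void sampleFragmentation(size_t rss, size_t zmalloc_used) {
    double frag = (double)rss / (double)zmalloc_used;
    if (last_sample != 0.0 && frag <= last_sample) {
        /* Fragmentation stopped rising: adopt the observed value, which in
         * turn raises the enforced limit toward the user's setting. */
        frag_guess = frag;
    }
    last_sample = frag;
}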

@antirez
Contributor

antirez commented Nov 17, 2014

p.s. the patch does not handle the user messing with CONFIG SET maxmemory while the sampling stage is still active. This is trivial to fix, but I avoided doing possibly wasted work before getting some feedback.

@benzheren
Author

@antirez more findings: currently, every time Redis automatically frees up more memory (maybe because of the maxmemory settings), we get some Redis connection timeout errors on the application side.

And at this moment, the mem_fragmentation_ratio is reaching 1.4.

# Memory
used_memory:79990215688
used_memory_human:74.50G
used_memory_rss:111736643584
used_memory_peak:100028209648
used_memory_peak_human:93.16G
used_memory_lua:33792
mem_fragmentation_ratio:1.40
mem_allocator:jemalloc-3.6.0

@antirez
Contributor

antirez commented Nov 17, 2014

@benzheren thanks for the update. It is possible that the fragmentation will go down later, however it looks like there is a workload stressing the allocator. This usually happens with progressively larger objects, and it is very hard to handle well. For example, memcached would not be able to easily reuse the slabs of small objects AFAIK (confirmation welcome), so it is not trivially solvable by copying other approaches, i.e. what memcached does (which would have many other side effects for Redis).

About the latency spikes: freeMemoryIfNeeded() is instrumented with the latency monitor, so please, if you can, enable it with CONFIG SET latency-monitor-threshold 100 (100 is in milliseconds; you can use a smaller figure if your client timeout is lower).

I'm not sure the latency issues are due to eviction: it is performed incrementally as commands run, so you would see it continuously. However, the latency monitor will give you some good info. After some time, when you see latency spikes again, you can run LATENCY DOCTOR to get a report from the latency monitoring framework.

However, note that if the latency is due to transparent huge pages, the latency monitoring system will not be able to detect it, since the stalls happen at random in non-instrumented places.
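
For reference, the sequence would look something like this (the threshold is just an example value):

CONFIG SET latency-monitor-threshold 100
... wait for the next latency spikes ...
LATENCY LATEST
LATENCY DOCTOR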

@antirez
Contributor

antirez commented Nov 17, 2014

p.s. of course, please post the LATENCY DOCTOR output here for reference when you grab one! Please feel free to post your next findings here as well. I would like to leverage your patience and use case to fix some Redis problem in this area, or at least to improve the documentation if we don't find anything to change.

@antirez
Contributor

antirez commented Nov 17, 2014

Ah, more importantly, your reported fragmentation is not real:

used_memory:79990215688
used_memory_human:74.50G
used_memory_rss:111736643584
used_memory_peak:100028209648
used_memory_peak_human:93.16G

The instance had a peak of 93 GB. Maybe you used CONFIG SET maxmemory to set a lower memory limit? RSS does not go backward, so this inflates the fragmentation figure.

@charsyam
Contributor

@benzheren if you just use Redis as a cache, I know of only one solution: remove all data periodically.
In my case, Redis used only 2 GB, but the RSS was over 12 GB.
In my experience, this easily occurs when your data sizes vary a lot and additions and deletions happen very often.

See this document: http://www.slideshare.net/deview/2b3arcus
Page 37 shows the general case of Redis memory usage vs. RSS, which is not bad.
But on page 39, Redis uses only 2.4 GB while its RSS is over 12 GB.

@antirez
Contributor

antirez commented Nov 17, 2014

Flushing all data (FLUSHALL) will make the RSS small again, but it is not practical in most environments IMHO. Allocator fragmentation is something we have to deal with. Moreover, IMHO @benzheren may not have a real fragmentation problem: the last reported fragmentation is high because there was a previous peak, and the earlier reported fragmentation was 1.19, which is not perfect but not critical. So in this specific case the best thing is: set a lower limit, restart the server, observe the fragmentation under your workload, then set a new limit, again with some spare memory. Some fragmentation is something we need to live with. The user reporting this issue probably has nothing to complain about, for his use case, regarding Redis making a less than perfect use of memory; the legitimate complaint is that it is counter-intuitive to have to set a lower limit because you cannot know up front what the actual memory usage will be. And I agree with him, however there is no silver bullet.

@mattsta
Contributor

mattsta commented Nov 19, 2014

We have some INFO reports showing used_memory_rss lower than used_memory — do we know what causes that?

Some examples:

@benzheren
Author

@antirez @mattsta following up on this issue: we've set up a new server with THP disabled (the only system-level config we changed compared to our previous server) and set maxmemory to a much lower value, budgeting for a 1.4 mem_fragmentation_ratio.

In addition, at the application level we changed the code to greatly decrease the size of the objects stored in Redis. Now, after a week, the server is much more stable, with mem_fragmentation_ratio at 1.10.

I will follow up with more data.

@benzheren
Author

@charsyam FLUSHDB is not a good idea in a production environment. Especially if you have lots of data and busy traffic, it can block Redis for a noticeable amount of time.

@charsyam
Contributor

@benzheren In my system we run it with ZooKeeper, so we can switch from Redis server A to Redis server B, then flush or kill the old one and recover it. So, as I already mentioned, flushing is an option when using Redis as a cache. :)

@charsyam
Contributor

@benzheren and I think using multiple instances on a physical server is more useful in this situation :)

@mac2000

mac2000 commented Apr 7, 2017

Not sure if this helps, but I caught the following case:

All servers started reporting the OOM error.

I ran KEYS * and got an empty list.

Looking at the code where the exception is thrown, I see in the comments that the error is raised if Redis was unable to free up enough memory, and looking inside freeMemoryIfNeeded it seems that it tries to find keys to free, but if there are no keys it won't ever return success.

So the question is: is it possible that memory is used while there are no keys, or was this the result of a hang?

And should freeMemoryIfNeeded deal with such situations, especially in systems where Redis is used as an in-memory cache only?
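
For context, the behavior I'm describing boils down to something like this (a simplified paraphrase, not the actual Redis source; the helper names are made up):

#include <stddef.h>

/* Hypothetical stand-ins for the real internals. */
typedef struct keyObject keyObject;
extern size_t     usedMemory(void);            /* logical usage minus slave/AOF buffers */
extern keyObject *pickEvictionCandidate(void); /* NULL when nothing is evictable */
extern void       deleteKey(keyObject *key);

#define SKETCH_OK  0
#define SKETCH_ERR 1

int freeMemoryIfNeededSketch(size_t maxmemory, int noeviction_policy) {
    size_t mem_used = usedMemory();
    if (mem_used <= maxmemory) return SKETCH_OK;
    if (noeviction_policy) return SKETCH_ERR;   /* callers reply with the OOM error */

    while (mem_used > maxmemory) {
        keyObject *key = pickEvictionCandidate();
        if (key == NULL) {
            /* No evictable keys left but still over the limit, e.g. because
             * the memory is held by buffers rather than keys: the function
             * keeps failing, and write commands keep getting the OOM error
             * even though KEYS * returns nothing. */
            return SKETCH_ERR;
        }
        deleteKey(key);
        mem_used = usedMemory();
    }
    return SKETCH_OK;
}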

@antirez
Contributor

antirez commented Apr 11, 2017

Hello, many things have changed in the Redis internals since this issue was opened, but there are currently no known bugs in that code path. However, the Redis 4.0 MEMORY command is now able to provide accurate profiling of memory usage, so the user can more easily see where the problem is.

It is possible for memory to be used without any keys; it depends on client output buffers, AOF buffers, Pub/Sub backlogs with slow consumers, and so forth. Recent versions of Redis terminate clients when they are using too much memory and can report all these conditions. Closing this issue since, as I said, there is no known bug, nor is there a precise hint here about something specific not working as expected (with a full investigation, INFO output, and so forth), so there is nothing I can proceed with.

@antirez antirez closed this as completed Apr 11, 2017