
Upgraded Jemalloc causes hang on Debian 8 #3799

Closed
rwky opened this issue Feb 9, 2017 · 41 comments
Labels
state:to-be-closed requesting the core team to close the issue

Comments

@rwky commented Feb 9, 2017

This is a weird one.

After upgrading to 3.2.7, Redis just hangs, using up all the CPU on one core; nothing is written to the log file, the RDB file isn't updated, and all connections fail.

Since this only started happening with 3.2.7, and the biggest change there was Jemalloc, I reverted 27e29f4 and tried again, and it worked fine. So it appears the Jemalloc upgrade is causing the hang.

Unfortunately I've not found an easy way of replicating this except by using production data.

I'm not sure what to do to debug further so I'm raising this issue.

@antirez (Contributor) commented Feb 9, 2017

Thanks @rwky, were you coming from 3.2.6?

@antirez (Contributor) commented Feb 9, 2017

P.S. Also, please report your glibc version if possible. Thank you.

@rwky (Author) commented Feb 9, 2017

Yep, straight upgrade from 3.2.6, which was working solidly.

glibc is ldd (Debian GLIBC 2.19-18+deb8u7) 2.19

@davidtgoldblatt

Two ideas:

  • Can you grab some stack traces to see where the spinning is happening?
  • Can you try building jemalloc with --enable-debug?

@rwky (Author) commented Feb 9, 2017

Will do. It'll be a little while before it acts up again; it may be tomorrow before I can grab the details.

@rwky (Author) commented Feb 9, 2017

We got lucky: it happened before I went to bed. Attached is the Redis log after sending it SIGSEGV; hopefully it's useful.
redis.txt
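
For context, a minimal sketch (not Redis's actual debug.c, just an illustration using glibc's execinfo API) of how a process can log its own backtrace from a signal handler, which is roughly how sending SIGSEGV to the server produces the stack trace in the attached log:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void dump_stack_handler(int sig) {
    void *frames[64];
    int n = backtrace(frames, 64);
    /* Write the symbolized frames straight to stderr (avoids malloc in the handler). */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    signal(sig, SIG_DFL);   /* restore the default handler and re-raise */
    raise(sig);
}

int main(void) {
    signal(SIGSEGV, dump_stack_handler);
    /* ... run the event loop; `kill -SEGV <pid>` now logs a backtrace ... */
    pause();
    return 0;
}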

@antirez (Contributor) commented Feb 10, 2017

@rwky thanks a lot. Based on your new observations, are you still confident that the problem only happens with Jemalloc 4.4.0? It may be safe to release Redis 3.2.8 with the commit reverted at this point... Thanks.

@rwky (Author) commented Feb 10, 2017

Yep, since I reverted that commit it works fine, so it's something Jemalloc-related, and it's 100% repeatable. I'm just not sure exactly what triggers it, something in our production workload.

@antirez (Contributor) commented Feb 10, 2017

Thank you, I think it's better to release 3.2.8 ASAP.

@antirez (Contributor) commented Feb 12, 2017

@davidtgoldblatt in case you did not notice, we have a stack trace thanks to @rwky:

------ STACK TRACE ------
EIP:
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]

Backtrace:
/usr/local/bin/redis-server 127.0.0.1:6379(logStackTrace+0x29)[0x4623a9]
/usr/local/bin/redis-server 127.0.0.1:6379(sigsegvHandler+0xa6)[0x462a46]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f5ffbb60890]
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_dss+0x1d8)[0x4c16f8]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_wrapper+0x948)[0x4c0a18]
/usr/local/bin/redis-server 127.0.0.1:6379(je_arena_chunk_ralloc_huge_expand+0x263)[0x4b7bd3]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4d6ff0]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc_no_move+0x314)[0x4d7804]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc+0x5c)[0x4d7b8c]
/usr/local/bin/redis-server 127.0.0.1:6379(je_realloc+0xb2)[0x4ae072]
/usr/local/bin/redis-server 127.0.0.1:6379(zrealloc+0x26)[0x431f76]
/usr/local/bin/redis-server 127.0.0.1:6379(sdsMakeRoomFor+0x2bd)[0x42f91d]
/usr/local/bin/redis-server 127.0.0.1:6379(readQueryFromClient+0xae)[0x43ab0e]
/usr/local/bin/redis-server 127.0.0.1:6379(aeProcessEvents+0x133)[0x425463]
/usr/local/bin/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x4257ab]
/usr/local/bin/redis-server 127.0.0.1:6379(main+0x40b)[0x42285b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5ffb7c7b45]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4229fe]

antirez added a commit that referenced this issue Feb 12, 2017
This reverts commit 153f2f0.

Jemalloc 4.4.0 is apparently causing deadlocks in certain
systems. See for example #3799.
As a cautionary step we are reverting the commit back and
releasing a new stable Redis version.
@davidtgoldblatt

Thanks, I had missed that. I'll take a look sometime tomorrow.

@jasone commented Feb 13, 2017

This is likely due to the bug fixed here on the dev branch. jemalloc issue #618 is tracking the backport, which will be part of the 4.5.0 release.

@rwky (Author) commented Feb 13, 2017

If the issue in @jasone's comment is the offending one, then once 4.5.0 is released, if someone wants to create a Redis branch with 4.5.0 in it, I'm happy to test it.

@davidtgoldblatt

I've had some trouble replicating this at the commit before the suspected fix. But given that 4.5 is coming out soon, I'm inclined not to spend much time on it so long as you don't mind trying it out after the fix. I'll ping this issue once it's released.

uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See redis/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   redis/redis#3796.

</ChangeLog>


uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
spinlock added a commit to CodisLabs/codis that referenced this issue Feb 14, 2017
    There are two bug fixes in redis-3.2.8:
    1. redis/redis#3799
        Downgrade jemalloc-4.4.0 to jemalloc 4.0.3
    2. redis/redis#3796
        Fix a crash in command MIGRATE
@rwky (Author) commented Feb 14, 2017

Sounds good. I had trouble replicating it without throwing real data at it; I don't know what exactly triggers it, I just know reverting jemalloc fixed it. Ping me once the fix is released and I'll check it out.

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Feb 14, 2017
================================================================================
Redis 3.2.8     Released Sun Feb 12 16:11:18 CET 2017
================================================================================

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See redis/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   redis/redis#3796.

================================================================================
Redis 3.2.7     Released Tue Jan 31 16:21:41 CET 2017
================================================================================

Main bug fixes and improvements in this release:

1. MIGRATE could incorrectly move keys between Redis Cluster nodes by turning
   keys with an expire set into persistent keys. This bug was introduced
   recently with the multiple-keys migration feature. It is now fixed. It only
   applies to Redis Cluster users that use the resharding features of Redis
   Cluster.

2. As Redis 4.0 beta and the unstable branch already did (for some months at
   this point), Redis 3.2.7 also aliases the Host: and POST commands to QUIT,
   avoiding processing of the remaining pipeline if there are pending commands.
   This is a security protection against a "Cross Scripting" attack, which
   usually involves trying to feed Redis with HTTP in order to execute commands.
   Example: a developer is running a local copy of Redis for development
   purposes. She also runs a web browser on the same computer. The web browser
   could send an HTTP request to http://127.0.0.1:6379 in order to access the
   Redis instance, since a specially crafted HTTP request may also be partially
   valid Redis protocol. However, if POST and Host: break the connection, this
   problem should be avoided. IMPORTANT: It is important to realize that it
   is not impossible that another way will be found to talk with a localhost
   Redis using a Cross Protocol attack not involving sending POST or Host:, so
   this is only a layer of protection and not a definitive fix for this class
   of issues.

3. A ziplist bug that could cause data corruption, could crash the server and
   MAY ALSO HAVE SECURITY IMPLICATIONS was fixed. The bug looks complex to
   exploit, but attacks always get worse, never better (cit). The bug is very,
   very hard to catch in practice; it required manual analysis of the ziplist
   code in order to be found. However, it is also possible that it rarely
   happened in the wild. Upgrading is required if you use LINSERT and other
   in-the-middle list manipulation commands.

4. We upgraded to Jemalloc 4.4.0, since the version we used to ship with Redis
   was an early 4.0 release of Jemalloc. This version may have several
   improvements, including the ability to better reclaim and use the memory of
   the system.
@Venorcis commented Mar 1, 2017

Jemalloc 4.5.0 is out, which probably fixes this
https://github.com/jemalloc/jemalloc/releases/tag/4.5.0

@antirez (Contributor) commented Mar 1, 2017

Thanks @Venorcis. I'm kindly asking @jasone to confirm the issue is fixed, since from reading the changelog it is not obvious which item in the list covers it, probably the locking-order one, but I want to be sure before upgrading. However, I'll let another few weeks pass without any patch release.

@jasone commented Mar 1, 2017

Yes, I think the issue is fixed (Fix chunk_alloc_dss() regression.). The stack trace above is consistent with test failures we experienced once @davidtgoldblatt implemented CI testing for FreeBSD.

@antirez (Contributor) commented Mar 1, 2017 via email

@rwky (Author) commented Mar 1, 2017

@antirez if you want to create a 3.x branch with Jemalloc 4.5.0, I'll test whether it works before you release it as a stable release.

joeylichang pushed a commit to ksarch-saas/redis that referenced this issue Mar 28, 2017
@siyangy commented Apr 20, 2017

@jasone We hit this exact same problem after upgrading to jemalloc 4.6.0 (4.5.0 as well) when reallocating a large chunk of memory.

Stacktrace:
Program received signal SIGINT, Interrupt.
0x00000000006dfa9a in je_chunk_alloc_dss ()
(gdb) where
#0 0x00000000006dfa9a in je_chunk_alloc_dss ()
#1 0x00000000006df170 in je_chunk_alloc_wrapper ()
#2 0x00000000006d273a in je_arena_chunk_ralloc_huge_expand ()
#3 0x00000000006f22e5 in huge_ralloc_no_move_expand ()
#4 0x00000000006f282b in je_huge_ralloc_no_move.part ()
#5 0x00000000006f43e9 in je_huge_ralloc ()
#6 0x00000000006d77c4 in je_arena_ralloc ()
#7 0x00000000006be00f in je_realloc ()
#8 0x00000000006b9bff in zrealloc ()
#9 0x00000000006b9455 in sdsMakeRoomFor ()

@jasone commented Apr 21, 2017

@siyangy, can you please specify precisely which jemalloc revision(s) you're testing with? There is no 4.6.0 release, so I want to make sure we're talking about a version that has this fix.

Assuming you're testing with at least 4.5.0, it would be really helpful to get a printout of the primary variables that impact the loop logic in chunk_alloc_dss(), namely size, alignment, max_cur, dss_next, and dss_prev.

@siyangy commented Apr 21, 2017

Sorry, we tried 4.3.1, 4.4.0 and 4.5.0, hitting the same thing, and we have verified that the fix is included in 4.5.0.

Here are the variables you asked for:
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe39cd348198558bf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe39a00b5d8d90cbf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe396eed526b144bf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe389f58522d98cbf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe388069b823d53bf

One thing worth noting is that it stops printing after a few iterations, rather than being stuck in the while loop (we put the print within the while(true) loop), while the program hangs. During our previous debugging we found out that it ends up calling zrealloc recursively (nested zrealloc within zrealloc).

@jasone commented Apr 21, 2017

Wow, is that dss_prev value for real, or is there perhaps an argument missing from the printf (or is the printf placed prior to the initialization)?

@jasone commented Apr 21, 2017

I'm having a hard time seeing how the max_cur and dss_next values could be correct either, assuming the backtrace from yesterday corresponds to when these numbers were collected. The backtrace is for an in-place expanding huge reallocation, and that would require at least one chunk to have already been allocated from dss for us to get into the core of chunk_alloc_dss() (otherwise max_cur would be NULL, and we'd bail out to label_oom).

@siyangy commented Apr 22, 2017

Okay, so we use this to print:

printf("SIZE %lu ALIGNMENT %lu MAX_CUR %p DSS_NEXT %p DSS_PREV %p\n",
                    size, alignment, max_cur, dss_next, dss_prev);

However, it is placed at the very beginning of the while(true) loop, where dss_prev is not yet initialized. We put it there because we assumed the code went into an infinite loop within this while(true). When we move this line to after dss_prev is initialized, nothing is printed out.

@siyangy commented Apr 22, 2017

@jasone After some more debugging I finally found out where the code gets stuck: there's a spin_adaptive in the function chunk_dss_max_update, so that's why we didn't see a valid dss_prev and dss_next updated - it actually never gets out of the spin in chunk_dss_max_update. For some reason our stacktrace doesn't show the spin_adaptive call. The comment before spin_adaptive says 'Another thread optimistically updated dss_max. Wait for it to finish.' Apparently dss_max is not updated as expected.
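
A minimal sketch (assumed for illustration; this is not jemalloc's actual chunk_dss_max_update()) of the pattern that comment describes: a thread optimistically publishes a new dss maximum before extending the break, and readers spin until the real break catches up. If the break can never reach the published value (for example because something outside the allocator shrank it), the spin never terminates, which would match the observed hang.

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

static _Atomic uintptr_t published_dss_max;   /* optimistically updated maximum */

static void *dss_max_update(void) {
    for (;;) {
        uintptr_t max_prev = atomic_load(&published_dss_max);
        uintptr_t max_cur = (uintptr_t)sbrk(0);   /* the real program break */
        if (max_prev > max_cur) {
            /* "Another thread optimistically updated dss_max. Wait for it to
             * finish."  If max_cur can never catch up, this loop spins forever
             * (jemalloc uses spin_adaptive() here instead of a raw busy loop). */
            continue;
        }
        return (void *)max_cur;
    }
}

int main(void) {
    /* With no concurrent optimistic update, this returns immediately. */
    return dss_max_update() == (void *)-1 ? 1 : 0;
}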

joeylichang pushed a commit to ksarch-saas/redis that referenced this issue Apr 25, 2017
@antirez (Contributor) commented Apr 26, 2017

Just found this in the CI test, happening with Jemalloc 4.0.3 after a server restart. Not sure if it's related, but it looks like a Jemalloc deadlock at first glance. It's worth noting that it happened immediately after the Redis server was restarted, so basically there is very little allocated at this point.

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd7b109d67f in _L_lock_1081 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fd7b109d5f8 in __GI___pthread_mutex_lock (mutex=0x7fd7b0a006b0)
    at ../nptl/pthread_mutex_lock.c:134
#3  0x00000000004eb280 in je_malloc_mutex_lock (mutex=0x7fd7b0a006b0)
    at include/jemalloc/internal/mutex.h:85
#4  je_tcache_bin_flush_small (tsd=<optimized out>, tcache=<optimized out>,
    tbin=0x7fd7b080d040, binind=<optimized out>, rem=100) at src/tcache.c:115
#5  0x00000000004bef71 in je_tcache_dalloc_small (binind=<optimized out>,
    ptr=<optimized out>, tcache=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/tcache.h:376
#6  je_arena_dalloc (tcache=<optimized out>, ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/arena.h:1271
#7  je_idalloctm (is_metadata=<optimized out>, tcache=<optimized out>,
    ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:1005
#8  je_iqalloc (tcache=<optimized out>, ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:1029
#9  ifree (tsd=<optimized out>, tcache=<optimized out>, ptr=<optimized out>)
    at src/jemalloc.c:1745
#10 je_free (ptr=0x7fd7b0227a70) at src/jemalloc.c:1839
#11 0x00000000004320dd in sdsfreesplitres (tokens=tokens@entry=0x7fd7b02c3f00,
    count=<optimized out>) at sds.c:851
#12 0x000000000046df16 in clusterLoadConfig (filename=<optimized out>)
    at cluster.c:269
#13 0x000000000046fb06 in clusterInit () at cluster.c:440
#14 0x000000000042f600 in initServer () at server.c:1911
#15 0x0000000000423473 in main (argc=<optimized out>, argv=0x7ffcd4163438)
    at server.c:3772

@bhuvanl commented Apr 26, 2017

Observed a similar stack with the server stuck (one of the threads spinning at 100% CPU) while running redis-server version Redis 3.9.102 (00000000/0) 64 bit:
(gdb) info stack
#0 0x00007ffff77411dd in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#1 0x00000000004bfdbd in je_malloc_mutex_lock (arena=, chunk=0x7fff2e400000, ptr=0x7fff2e4aee80) at include/jemalloc/internal/mutex.h:85
#2 je_arena_dalloc_large (arena=, chunk=0x7fff2e400000, ptr=0x7fff2e4aee80) at src/arena.c:2602
#3 0x000000000042e7fa in sdsRemoveFreeSpace (s=0x7fff2e4aee85 "") at sds.c:265
#4 0x00000000004295ca in clientsCronResizeQueryBuffer (c=0x7fffd43c6000) at server.c:823
#5 0x000000000042b19f in clientsCron () at server.c:863
#6 0x000000000042b34e in serverCron (eventLoop=, id=, clientData=) at server.c:1016
#7 0x00000000004234fd in processTimeEvents (eventLoop=0x7ffff0e360a0, flags=3) at ae.c:322
#8 aeProcessEvents (eventLoop=0x7ffff0e360a0, flags=3) at ae.c:423
#9 0x000000000042368b in aeMain (eventLoop=0x7ffff0e360a0) at ae.c:455
#10 0x000000000042be2b in main (argc=, argv=0x7fffffffe438) at server.c:3739

@antirez (Contributor) commented Apr 26, 2017

@bhuvanl 3.9.102 is using the Jemalloc that is known to have issues; RC3 downgraded to Jemalloc 4.0.3. However, strangely enough, I got the hang above with 4.0.3 as well in the unstable branch. It never happened in the past AFAIK; not sure if maybe the make distclean was not performed correctly by the CI, or what else to think.

@bhuvanl commented Apr 26, 2017

@antirez Seems like using RC3 resolved my issue, thanks for the quick response.

@jasone commented Apr 26, 2017

I've been poking at this issue off and on for the past few days, and none of the scenarios I can think of that place the blame on jemalloc seem plausible:

  • jemalloc isn't bootstrapped.
  • Some code outside jemalloc is adjusting the break by calling sbrk() with a negative argument (see "Do not assume dss break never decreases", jemalloc/jemalloc#802); a minimal sketch of this appears after this list.
    Note that as configured by redis, jemalloc does not resort to sbrk() unless mmap() fails. I can't quite put together a sequence of invalid calls that would corrupt jemalloc's data structures such that we'd see such behavior, but it is certainly worth verifying that these are valid attempts to use dss. We can potentially do this two different ways with relatively little effort:
    • Configure jemalloc with --enable-debug, in which case assertions will almost certainly detect misuses that could lead to the failure mode.
    • Configure jemalloc with --with-malloc-conf=dss:disabled, which completely shuts down use of sbrk(). This may just mask the issue, but it could also cause an easier-to-diagnose alternative failure mode.
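
As a concrete illustration of the second scenario above (an assumed, minimal example; nothing to do with jemalloc's own code), this is all it takes for code outside the allocator to move the program break down again, violating an allocator's assumption that the dss region only ever grows:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);     /* current program break */
    sbrk(1 << 20);              /* grow the break by 1 MiB */
    sbrk(-(1 << 20));           /* ...and shrink it again: a "negative sbrk()" */
    void *after = sbrk(0);
    printf("break before: %p, after: %p\n", before, after);
    return 0;
}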

If I had a live debugger session for the failure, it would probably be pretty quick work to figure out what's going on, but I'm out of ideas based on the incomplete evidence gathered so far.

Regarding the two stack traces recently posted with je_malloc_mutex_lock() in them, those failures look consistent with application-induced corruption, but they may well be completely unrelated to the older je_chunk_alloc_dss() issue.

@antirez (Contributor) commented Apr 27, 2017

Hello @jasone, thank you very much for the help with this issue. I realize that it is possible that the hangs are actually Redis bugs and not Jemalloc bugs, so your willingness to analyze the issue nonetheless is very welcome here.

I want to make sure I understand what you said here in order to start looking for bugs myself, so here are a few questions I hope you can answer:

  1. I understand that the original issue here was fixed in Jemalloc 4.5.0, is that correct? The OP @rwky apparently was able to toggle the problem on/off just by reverting or re-enabling the commit with the Jemalloc upgrade, and your initial comment says that the deadlock looks consistent with the bug that was fixed.
  2. However, for the subsequent hang reports (except the last two), you said they should not be due to Jemalloc but rather to bugs inside Redis and/or something external calling sbrk(). Could this sbrk() call simply be due to Redis being linked with libraries that use malloc() from libc, so that there is a mix, in the same process, of calls to libc malloc and jemalloc?
  3. The other hangs that look like corruption inside Redis: at least for the one I reported, that is plausible. I'll investigate, but I remember a recent fix there; maybe we are double-freeing or similar.
  4. Does the --enable-debug option have a serious speed penalty, or is it something I could enable in Redis normally?

Thank you. BTW, if my CI triggers the same bug again, I will not stop the process and will provide you with SSH access in case you want to perform a live debugging session.

@jasone commented Apr 27, 2017

  • (1) As far as I can tell, the original issue was fixed in jemalloc 4.5.0, and although it's possible additional issues remain, I cannot come up with any plausible explanations for how jemalloc could be messing up.
  • (2) If some other code besides jemalloc is concurrently calling sbrk(), it could in theory open up a race condition, but I don't think that's possible in (always single-threaded?) redis. If the system allocator is somehow being called in some cases, that would certainly cause serious problems.
  • (4) --enable-debug isn't ideal for performance-sensitive production deployment, but as long as you specify an optimization flag to the compiler, it's fast enough for most development uses. This is similar to how development versions of FreeBSD are built, and Facebook uses --enable-debug for most of its non-ASAN development/test builds.

@antirez (Contributor) commented Apr 27, 2017

Thanks @jasone, this is starting to get interesting:

  1. Ok, thanks. For Redis 4.0, would you kindly suggest the version that is currently the safest? I went back to 4.0.3, but perhaps, given that the old code is safe in certain regards but also has its own issues for lack of updates, maybe 4.6.0 is a better pick?
  2. Redis was never totally single-threaded, but now it is even less so... We definitely have malloc/free inside threads with Jemalloc. And... in the main thread, we have code that calls the system allocator (that is, the Lua interpreter, and potentially other stuff). So perhaps we should move Lua to use Jemalloc to avoid this problem? (See the sketch after this list.)
  3. Ok, looks like at least in the unstable branch it makes sense for us to go for the debug mode.
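
A minimal sketch of what point 2 could look like in practice, assuming Lua 5.1's lua_Alloc hook and Redis's zmalloc/zrealloc/zfree wrappers (which sit on top of jemalloc when Redis is built with MALLOC=jemalloc). This is not Redis's actual implementation, and the function names are made up for the example:

#include <stddef.h>
#include <lua.h>          /* deps/lua in the Redis tree */
#include "zmalloc.h"      /* Redis allocation wrappers */

/* Allocator hook following Lua's lua_Alloc contract: nsize == 0 means free,
 * otherwise behave like realloc (ptr may be NULL for a fresh allocation). */
static void *luaZmallocAlloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    (void)ud; (void)osize;
    if (nsize == 0) {
        zfree(ptr);
        return NULL;
    }
    return zrealloc(ptr, nsize);   /* zrealloc(NULL, n) behaves like zmalloc(n) */
}

/* Create the scripting state with the custom allocator instead of luaL_newstate(). */
lua_State *createLuaStateUsingZmalloc(void) {
    return lua_newstate(luaZmallocAlloc, NULL);
}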

Thank you

@antirez (Contributor) commented Apr 27, 2017

To clarify about threads, this is what we do in other threads:

  1. Reclaiming of objects in the background (when the UNLINK command is used), so imagine a busy loop calling jemalloc's free().
  2. Slow system calls. However, threads use a message-passing strategy to communicate instead of using mutexes, so the workers implementing the background syscalls will free() the messages after receiving them, as sketched below.
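
A minimal sketch of that ownership-transfer pattern (not Redis's actual bio.c/lazyfree.c; the names are made up, and a pipe stands in for whatever message-passing channel is used): the main thread hands a pointer to a background worker, and the worker calls free() on it with the same allocator that allocated it.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static int reclaim_pipe[2];   /* [0]: worker reads, [1]: main thread writes */

/* Background worker: receive pointers and reclaim them. */
static void *reclaim_worker(void *arg) {
    (void)arg;
    void *ptr;
    while (read(reclaim_pipe[0], &ptr, sizeof(ptr)) == (ssize_t)sizeof(ptr))
        free(ptr);   /* must be the same allocator that allocated ptr */
    return NULL;
}

/* Main thread: hand off an object instead of freeing it inline (cf. UNLINK). */
static void reclaim_async(void *ptr) {
    if (write(reclaim_pipe[1], &ptr, sizeof(ptr)) != (ssize_t)sizeof(ptr))
        free(ptr);   /* fall back to freeing synchronously on error */
}

int main(void) {
    pthread_t tid;
    if (pipe(reclaim_pipe) != 0) return 1;
    pthread_create(&tid, NULL, reclaim_worker, NULL);
    reclaim_async(malloc(1024));   /* ownership moves to the worker thread */
    close(reclaim_pipe[1]);        /* demo only: let the worker drain and exit */
    pthread_join(tid, NULL);
    return 0;
}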

In the future we are going to use threading a lot more. I've got an experimental patch implementing threaded I/O, and hopefully this will become real sooner or later. Also, Redis Modules already use threading and will use it more as well. So the threading story is getting more complex, but that's another matter; this is just to clarify what we have / what we'll get.

@jasone commented Apr 28, 2017

  • Suggested current jemalloc version: 4.5.0 (would suggest stable-4 if Windows were a target).
  • Re: allocator to use in Lua, I'd definitely suggest using the same allocator as for the rest of redis, in order to eliminate the possibility of erroneous mixed allocator use. I don't know anything about how redis utilizes Lua; the risks may be very high/low depending on details.

@antirez (Contributor) commented Apr 28, 2017

Understood, thank you @jasone.

@xhochy commented Aug 4, 2017

Note that we also see the problem reported in #3799 (comment) in Apache Arrow using jemalloc 4.5.0: https://issues.apache.org/jira/browse/ARROW-1282

Building with --with-malloc-conf=dss:disabled avoids the hanging issue.

I started to build up an environment to reproduce this but probably won't be able to continue the work for two weeks. Any suggestion for what to look out for would be very helpful.

JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
@filipecosta90 added the state:to-be-closed (requesting the core team to close the issue) label Jul 22, 2021
@filipecosta90 (Contributor)

@xhochy following up on Arrow (https://issues.apache.org/jira/browse/ARROW-1282), I can see that you provided a fix for jemalloc > 4.x, as in jemalloc/jemalloc#1005, correct?
@redis/core-team proposing to close this if the above is correct.

@oranagra (Member)

I think we can close this right away.
IIUC, the issue started when we upgraded to a new jemalloc (which made some assumptions about sbrk that were later improved?), but since then we have upgraded again several times (AFAIK the current version we're using doesn't use sbrk at all anymore).

The problem is probably caused by the fact that our Lua lib uses the libc allocator while the rest of Redis uses jemalloc, and that's still true, but hopefully it doesn't cause any serious issues anymore. (We can change that if we want, but there are some disadvantages, like causing a false sense of fragmentation that the defragger won't be able to fix.)

It could be that some distros are building Redis to use an external allocator (not the one embedded into Redis), and who knows which jemalloc version is being used, so it could still be 4.4.0. But also, even if we change something in the next Redis version, these changes will not make a difference, since these users may still be using Redis 3.2.

svmhdvn pushed a commit to svmhdvn/freebsd-ports that referenced this issue Jan 10, 2024