Upgraded Jemalloc causes hang on Debian 8 #3799

Open
rwky opened this Issue Feb 9, 2017 · 39 comments

@rwky

rwky commented Feb 9, 2017

This is a weird one.

After upgrading to 3.2.7, redis just hangs, using up all the CPU on one core; nothing is output to the log file, the rdb file isn't updated, and all connections fail.

Since this only started happening with 3.2.7, and the biggest change there was Jemalloc, I reverted 27e29f4 and tried again, and it worked fine. So it appears the Jemalloc upgrade is causing the hang.

Unfortunately I've not found an easy way of replicating this except by using production data.

I'm not sure what to do to debug further so I'm raising this issue.

@antirez


Owner

antirez commented Feb 9, 2017

Thanks @rwky, were you coming from 3.2.6?

@antirez


Owner

antirez commented Feb 9, 2017

p.s. also please report your glibc version if possible. Thank you.

@rwky


rwky commented Feb 9, 2017

Yep, straight upgrade from 3.2.6, which was working solidly.

glibc is ldd (Debian GLIBC 2.19-18+deb8u7) 2.19

@davidtgoldblatt


davidtgoldblatt commented Feb 9, 2017

Two ideas:

  • Can you grab some stack traces to see where the spinning is happening?
  • Can you try building jemalloc with --enable-debug?
@rwky


rwky commented Feb 9, 2017

Will do. It'll be a little while before it acts up again; it may be tomorrow before I can grab the details.

@rwky


rwky commented Feb 9, 2017

We got lucky: it happened before I went to bed. Attached is the redis log after sending it SIGSEGV; hopefully it's useful.
redis.txt

@antirez


Owner

antirez commented Feb 10, 2017

@rwky thanks a lot. Based on your new observations, are you still confident that the problem only happens with Jemalloc 4.4.0? It may be safe to release Redis 3.2.8 with the commit reverted at this point... Thanks.

@rwky


rwky commented Feb 10, 2017

Yep, since I reverted that commit it works fine, so it's something Jemalloc related, and it's 100% repeatable. I'm just not sure exactly what is triggering it; something in our production workload.

@antirez


Owner

antirez commented Feb 10, 2017

Thank you, I think it's better to release 3.2.8 ASAP.

@antirez


Owner

antirez commented Feb 12, 2017

@davidtgoldblatt in case you did not notice, we have a stack trace thanks to @rwky:

------ STACK TRACE ------
EIP:
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]

Backtrace:
/usr/local/bin/redis-server 127.0.0.1:6379(logStackTrace+0x29)[0x4623a9]
/usr/local/bin/redis-server 127.0.0.1:6379(sigsegvHandler+0xa6)[0x462a46]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f5ffbb60890]
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_dss+0x1d8)[0x4c16f8]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_wrapper+0x948)[0x4c0a18]
/usr/local/bin/redis-server 127.0.0.1:6379(je_arena_chunk_ralloc_huge_expand+0x263)[0x4b7bd3]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4d6ff0]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc_no_move+0x314)[0x4d7804]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc+0x5c)[0x4d7b8c]
/usr/local/bin/redis-server 127.0.0.1:6379(je_realloc+0xb2)[0x4ae072]
/usr/local/bin/redis-server 127.0.0.1:6379(zrealloc+0x26)[0x431f76]
/usr/local/bin/redis-server 127.0.0.1:6379(sdsMakeRoomFor+0x2bd)[0x42f91d]
/usr/local/bin/redis-server 127.0.0.1:6379(readQueryFromClient+0xae)[0x43ab0e]
/usr/local/bin/redis-server 127.0.0.1:6379(aeProcessEvents+0x133)[0x425463]
/usr/local/bin/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x4257ab]
/usr/local/bin/redis-server 127.0.0.1:6379(main+0x40b)[0x42285b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5ffb7c7b45]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4229fe]

antirez added a commit that referenced this issue Feb 12, 2017

Revert "Jemalloc updated to 4.4.0."
This reverts commit 153f2f0.

Jemalloc 4.4.0 is apparently causing deadlocks in certain
systems. See for example #3799.
As a cautionary step we are reverting the commit back and
releasing a new stable Redis version.
@davidtgoldblatt


davidtgoldblatt commented Feb 12, 2017

Thanks, I had missed that. I'll take a look sometime tomorrow.

@jasone


jasone commented Feb 13, 2017

This is likely due to the bug fixed here on the dev branch. jemalloc issue #618 is tracking the backport, which will be part of the 4.5.0 release.

@rwky


rwky commented Feb 13, 2017

If the bug @jasone mentioned is the offending issue, then once 4.5.0 is released, if someone wants to create a redis branch with 4.5.0 in it, I'm happy to test it.

@davidtgoldblatt


davidtgoldblatt commented Feb 14, 2017

I've had some trouble replicating this at the commit before the suspected fix. But given that 4.5 is coming out soon, I'm inclined not to spend much time on it, so long as you don't mind trying it out after the fix. I'll ping this issue once it's released.

uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017

osa
Upgrade from 3.2.7 to 3.2.8.
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

</ChangeLog>


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5


spinlock pushed a commit to CodisLabs/codis that referenced this issue Feb 14, 2017

spinlock
extern: upgrade redis-3.2.7 to redis-3.2.8
    There're 2 bug fixes in redis-3.2.8:
    1. antirez/redis#3799
        Downgrade jemalloc-4.4.0 to jemalloc 4.0.3
    2. antirez/redis#3796
        Fix a crash in command MIGRATE
@rwky


rwky commented Feb 14, 2017

Sounds good. I had trouble replicating it without throwing real data at it; I don't know exactly what triggers it, I just know that reverting jemalloc fixed it. Ping me once the fix is released and I'll check it out.

mat813 pushed a commit to mat813/freebsd-ports that referenced this issue Feb 14, 2017

osa
Upgrade from 3.2.7 to 3.2.8.

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Feb 14, 2017

fhajny
Update databases/redis to 3.2.8.
================================================================================
Redis 3.2.8     Released Sun Feb 12 16:11:18 CET 2017
================================================================================

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

================================================================================
Redis 3.2.7     Released Tue Jan 31 16:21:41 CET 2017
================================================================================

Main bugs fixes and improvements in this release:

1. MIGRATE could incorrectly move keys between Redis Cluster nodes by turning
   keys with an expire set into persisting keys. This bug was introduced with
   the multiple-keys migration recently. It is now fixed. Only applies to
   Redis Cluster users that use the resharding features of Redis Cluster.

2. As Redis 4.0 beta and the unstable branch already did (for some months at
   this point), Redis 3.2.7 also aliases the Host: and POST commands to QUIT
   avoiding to process the remaining pipeline if there are pending commands.
   This is a security protection against a "Cross Scripting" attack, that
   usually involves trying to feed Redis with HTTP in order to execute commands.
   Example: a developer is running a local copy of Redis for development
   purposes. She also runs a web browser in the same computer. The web browser
   could send an HTTP request to http://127.0.0.1:6379 in order to access the
   Redis instance, since a specially crafted HTTP request may also be partially
   valid Redis protocol. However if POST and Host: break the connection, this
   problem should be avoided. IMPORTANT: It is important to realize that it
   is not impossible that another way will be found to talk with a localhost
   Redis using a Cross Protocol attack not involving sending POST or Host: so
   this is only a layer of protection but not a definitive fix for this class
   of issues.

3. A ziplist bug that could cause data corruption, could crash the server and
   MAY ALSO HAVE SECURITY IMPLICATIONS was fixed. The bug looks complex to
   exploit, but attacks always get worse, never better (cit). The bug is very
   very hard to catch in practice, it required manual analysis of the ziplist
   code in order to be found. However it is also possible that rarely it
   happened in the wild. Upgrading is required if you use LINSERT and other
   in-the-middle list manipulation commands.

4. We upgraded to Jemalloc 4.4.0 since the version we used to ship with Redis
   was an early 4.0 release of Jemalloc. This version may have several
   improvements, including the ability to better reclaim/use the memory of
   the system.
@Venorcis


Venorcis commented Mar 1, 2017

Jemalloc 4.5.0 is out, which probably fixes this
https://github.com/jemalloc/jemalloc/releases/tag/4.5.0

@antirez


Owner

antirez commented Mar 1, 2017

Thanks @Venorcis. Kindly asking @jasone to confirm the issue is fixed, since from the changelog it is not obvious which entry in the list covers it (probably the locking order one), but I want to be sure before upgrading. However, I'll let a few more weeks pass without any patch release.

@jasone


jasone commented Mar 1, 2017

Yes, I think the issue is fixed (Fix chunk_alloc_dss() regression.). The stack trace above is consistent with test failures we experienced once @davidtgoldblatt implemented CI testing for FreeBSD.

@antirez


Owner

antirez commented Mar 1, 2017

@rwky


rwky commented Mar 1, 2017

@antirez if you want to create a 3.x branch with Jemalloc 4.5.0, I'll test whether it works before you release it as a stable release.

joeylichang added a commit to ksarch-saas/redis that referenced this issue Mar 28, 2017

Revert "Jemalloc updated to 4.4.0."
@siyangy


siyangy commented Apr 20, 2017

@jasone We hit this exact same problem after upgrading to jemalloc 4.6.0 (4.5.0 as well) when reallocating a large chunk of memory.

Stacktrace:
Program received signal SIGINT, Interrupt.
0x00000000006dfa9a in je_chunk_alloc_dss ()
(gdb) where
#0 0x00000000006dfa9a in je_chunk_alloc_dss ()
#1 0x00000000006df170 in je_chunk_alloc_wrapper ()
#2 0x00000000006d273a in je_arena_chunk_ralloc_huge_expand ()
#3 0x00000000006f22e5 in huge_ralloc_no_move_expand ()
#4 0x00000000006f282b in je_huge_ralloc_no_move.part ()
#5 0x00000000006f43e9 in je_huge_ralloc ()
#6 0x00000000006d77c4 in je_arena_ralloc ()
#7 0x00000000006be00f in je_realloc ()
#8 0x00000000006b9bff in zrealloc ()
#9 0x00000000006b9455 in sdsMakeRoomFor ()

@jasone


jasone commented Apr 21, 2017

@siyangy, can you please specify precisely which jemalloc revision(s) you're testing with? There is no 4.6.0 release, so I want to make sure we're talking about a version that has this fix.

Assuming you're testing with at least 4.5.0, it would be really helpful to get a printout of the primary variables that impact the loop logic in chunk_alloc_dss(), namely size, alignment, max_cur, dss_next, and dss_prev.

@siyangy


siyangy commented Apr 21, 2017

Sorry, we tried 4.3.1, 4.4.0 and 4.5.0, hitting the same thing, and we have verified that the fix is included in 4.5.0.

Here are the variables you asked for:
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe39cd348198558bf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe39a00b5d8d90cbf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe396eed526b144bf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe389f58522d98cbf
SIZE 2097152 ALIGNMENT 2097152 MAX_CUR 0x4022 DSS_NEXT 0x206 DSS_PREV 0xe388069b823d53bf

One thing worth noting is that it stops printing after a few iterations instead of being stuck in the while loop (we put the print within the while(true) loop), while the program hangs. During our previous debugging we found out that it ends up calling zrealloc recursively (nested zrealloc within zrealloc).
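Side note on the debugging technique: a printf placed inside allocator internals can itself allocate (stdio may lazily allocate its buffer), which risks perturbing the code being inspected. A minimal re-entry-safe alternative, as a sketch only (it assumes a POSIX environment, and the helper name dss_debug_print is made up for illustration):

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format into a stack buffer with snprintf and emit it
 * with write(2), so the print itself performs no heap allocation.  The
 * variables mirror the ones discussed above. */
void
dss_debug_print(size_t size, size_t alignment, void *max_cur,
    void *dss_next, void *dss_prev)
{
    char buf[160];
    int n = snprintf(buf, sizeof(buf),
        "SIZE %zu ALIGNMENT %zu MAX_CUR %p DSS_NEXT %p DSS_PREV %p\n",
        size, alignment, max_cur, dss_next, dss_prev);
    if (n > 0)
        (void)write(STDERR_FILENO, buf, (size_t)n);
}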

@jasone


jasone commented Apr 21, 2017

Wow, is that dss_prev value for real, or is there perhaps an argument missing from the printf (or is the printf placed prior to the initialization)?

@jasone


jasone commented Apr 21, 2017

I'm having a hard time seeing how the max_cur and dss_next values could be correct either, assuming the backtrace from yesterday corresponds to when these numbers were collected. The backtrace is for in-place expanding huge reallocation, and that would require at least one chunk to have already been allocated from dss for us to get into the core of chunk_alloc_dss() (max_cur would be NULL, and we'd bail out to label_oom).

@siyangy


siyangy commented Apr 22, 2017

Okay, so we use this to print:

printf("SIZE %lu ALIGNMENT %lu MAX_CUR %p DSS_NEXT %p DSS_PREV %p\n",
                    size, alignment, max_cur, dss_next, dss_prev);

However, it is put at the very beginning of the while(true) loop, where dss_prev is not initialized. We put it there because we assumed that it went into an infinite loop within this while(true). When we move this line to after dss_prev is initialized, nothing is printed out.

@siyangy


siyangy commented Apr 22, 2017

@jasone After some more debugging I finally found out where the code gets stuck: there's a spin_adaptive in the function chunk_dss_max_update, so that's why we didn't see valid dss_prev and dss_next values; it actually never gets out of the spin in chunk_dss_max_update. For some reason our stack trace doesn't show the spin_adaptive call. The comment before spin_adaptive says 'Another thread optimistically updated dss_max. Wait for it to finish.' Apparently dss_max is not updated as expected.
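For readers following along, here is a simplified sketch of the pattern being described (this is not jemalloc's actual source; the names are invented): a writer optimistically publishes a new upper bound before the underlying resource (the program break, in the dss case) has actually been extended, and readers spin until the real state catches up. If the writer never completes its update, or the published value is corrupted, the loop spins forever at 100% CPU, matching the observed hang.

#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

/* Simplified illustration, not jemalloc code. */
static _Atomic uintptr_t published_max;  /* analogous to dss_max */
static _Atomic uintptr_t actual_max;     /* analogous to sbrk(0) */

uintptr_t
wait_for_consistent_max(void)
{
    for (;;) {
        uintptr_t pub = atomic_load(&published_max);
        uintptr_t cur = atomic_load(&actual_max);
        if (pub > cur) {
            /* "Another thread optimistically updated dss_max.
             * Wait for it to finish." -- never exits if it never does. */
            sched_yield();  /* a real adaptive spin backs off gradually */
            continue;
        }
        return cur;
    }
}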

joeylichang added a commit to ksarch-saas/redis that referenced this issue Apr 25, 2017

Revert "Jemalloc updated to 4.4.0."
@antirez


Owner

antirez commented Apr 26, 2017

Just found this in the CI test, happening with Jemalloc 4.0.3 after a server restart. Not sure if it's related, but it looks like a Jemalloc deadlock at first glance. It's worth noting that it happened immediately after the Redis server was restarted, so basically there is very little allocated at this time.

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd7b109d67f in _L_lock_1081 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fd7b109d5f8 in __GI___pthread_mutex_lock (mutex=0x7fd7b0a006b0)
    at ../nptl/pthread_mutex_lock.c:134
#3  0x00000000004eb280 in je_malloc_mutex_lock (mutex=0x7fd7b0a006b0)
    at include/jemalloc/internal/mutex.h:85
#4  je_tcache_bin_flush_small (tsd=<optimized out>, tcache=<optimized out>,
    tbin=0x7fd7b080d040, binind=<optimized out>, rem=100) at src/tcache.c:115
#5  0x00000000004bef71 in je_tcache_dalloc_small (binind=<optimized out>,
    ptr=<optimized out>, tcache=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/tcache.h:376
#6  je_arena_dalloc (tcache=<optimized out>, ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/arena.h:1271
#7  je_idalloctm (is_metadata=<optimized out>, tcache=<optimized out>,
    ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:1005
#8  je_iqalloc (tcache=<optimized out>, ptr=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:1029
#9  ifree (tsd=<optimized out>, tcache=<optimized out>, ptr=<optimized out>)
    at src/jemalloc.c:1745
#10 je_free (ptr=0x7fd7b0227a70) at src/jemalloc.c:1839
#11 0x00000000004320dd in sdsfreesplitres (tokens=tokens@entry=0x7fd7b02c3f00,
    count=<optimized out>) at sds.c:851
#12 0x000000000046df16 in clusterLoadConfig (filename=<optimized out>)
    at cluster.c:269
#13 0x000000000046fb06 in clusterInit () at cluster.c:440
#14 0x000000000042f600 in initServer () at server.c:1911
#15 0x0000000000423473 in main (argc=<optimized out>, argv=0x7ffcd4163438)
    at server.c:3772
@bhuvanl


bhuvanl commented Apr 26, 2017

Observed a similar stack, server stuck (one of the threads spinning with 100% CPU), while running redis-server version Redis 3.9.102 (00000000/0) 64 bit
(gdb) info stack
#0 0x00007ffff77411dd in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#1 0x00000000004bfdbd in je_malloc_mutex_lock (arena=, chunk=0x7fff2e400000, ptr=0x7fff2e4aee80) at include/jemalloc/internal/mutex.h:85
#2 je_arena_dalloc_large (arena=, chunk=0x7fff2e400000, ptr=0x7fff2e4aee80) at src/arena.c:2602
#3 0x000000000042e7fa in sdsRemoveFreeSpace (s=0x7fff2e4aee85 "") at sds.c:265
#4 0x00000000004295ca in clientsCronResizeQueryBuffer (c=0x7fffd43c6000) at server.c:823
#5 0x000000000042b19f in clientsCron () at server.c:863
#6 0x000000000042b34e in serverCron (eventLoop=, id=, clientData=) at server.c:1016
#7 0x00000000004234fd in processTimeEvents (eventLoop=0x7ffff0e360a0, flags=3) at ae.c:322
#8 aeProcessEvents (eventLoop=0x7ffff0e360a0, flags=3) at ae.c:423
#9 0x000000000042368b in aeMain (eventLoop=0x7ffff0e360a0) at ae.c:455
#10 0x000000000042be2b in main (argc=, argv=0x7fffffffe438) at server.c:3739

@antirez


Owner

antirez commented Apr 26, 2017

@bhuvanl 3.9.102 is using the Jemalloc that is known to have issues; RC3 downgraded to Jemalloc 4.0.3. Strangely enough, however, I got the hang above with 4.0.3 as well in the unstable branch. It never happened in the past AFAIK; not sure if maybe the make distclean was not performed correctly by the CI, or what else to think.

@bhuvanl


bhuvanl commented Apr 26, 2017

@antirez Seems like using RC3 resolved my issue, thanks for the quick response.

@jasone


jasone commented Apr 26, 2017

I've been poking at this issue off and on for the past few days, and none of the scenarios I can think of that would put the blame on jemalloc seem plausible:

  • jemalloc isn't bootstrapped.
  • Some code outside jemalloc is adjusting the break by calling sbrk() with a negative argument (see jemalloc/jemalloc#802).
    Note that as configured by redis, jemalloc does not resort to sbrk() unless mmap() fails. I can't quite put together a sequence of invalid calls that would corrupt jemalloc's data structures such that we'd see such behavior, but it is certainly worth verifying that these are valid attempts to use dss. We can potentially do this two different ways with relatively little effort:
    • Configure jemalloc with --enable-debug, in which case assertions will almost certainly detect misuses that could lead to the failure mode.
    • Configure jemalloc with --with-malloc-conf=dss:disabled, which completely shuts down use of sbrk(). This may just mask the issue, but it could also cause an easier-to-diagnose alternative failure mode. (A sketch of setting this option from application code follows this comment.)

If I had a live debugger session for the failure, it would probably be pretty quick work to figure out what's going on, but I'm out of ideas based on the incomplete evidence gathered so far.

Regarding the two stack traces recently posted with je_malloc_mutex_lock() in them, those failures look consistent with application-induced corruption, but they may well be completely unrelated to the older je_chunk_alloc_dss() issue.
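Regarding the --with-malloc-conf=dss:disabled suggestion above: the same option can usually also be set from application code via jemalloc's global configuration string rather than a configure flag. A sketch only, assuming an installed jemalloc without a symbol prefix (Redis's bundled copy is built with the je_ prefix, in which case the symbols are je_malloc_conf and je_mallctl):

#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Read by jemalloc during bootstrap; asks it never to use sbrk()/dss. */
const char *malloc_conf = "dss:disabled";

int
main(void)
{
    const char *dss;
    size_t len = sizeof(dss);

    /* "opt.dss" reports the effective dss precedence ("disabled" here). */
    if (mallctl("opt.dss", &dss, &len, NULL, 0) == 0)
        printf("opt.dss = %s\n", dss);
    return 0;
}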

@antirez


Owner

antirez commented Apr 27, 2017

Hello @jasone, thank you very much for the help with this issue. I realize it is possible that the hangs are Redis bugs and not Jemalloc bugs, so your willingness to analyze the issue nonetheless is very welcome here.

I want to make sure I understand what you said here in order to start looking myself for bugs, so here are a few questions I hope you could answer:

  1. I understand that the original issue here was fixed in Jemalloc 4.5.0, is that correct? The OP @rwky apparently was able to toggle the problem on/off just by reverting or re-applying the commit with the Jemalloc upgrade, and your initial comment says that the deadlock looks consistent with the bug that was fixed.
  2. However, for the later hang reports (except the last two), you are saying they should not be due to Jemalloc but are bugs inside Redis and/or something external calling sbrk(). Could this sbrk() call just be that Redis is linked with libraries that use malloc() from libc, so that there is a mix, in the same process, of calls to libc malloc and jemalloc?
  3. The other bugs that look like corruptions in Redis, at least the one I reported, are plausible; I'll investigate, but I remember a recent fix there, maybe we are double-freeing or the like.
  4. Does the --enable-debug option have a serious speed penalty, or is it something I could enable in Redis normally?

Thank you. Btw, if my CI triggers the same bug again, I'll leave the process running and provide you with SSH access in case you want to perform a live debugging session.

@jasone


jasone commented Apr 27, 2017

  • (1) As far as I can tell, the original issue was fixed in jemalloc 4.5.0, and although it's possible additional issues remain, I cannot come up with any plausible explanations for how jemalloc could be messing up.
  • (2) If some other code besides jemalloc is concurrently calling sbrk(), it could in theory open up a race condition, but I don't think that's possible in (always single-threaded?) redis. If the system allocator is somehow being called in some cases, that would certainly cause serious problems.
  • (4) --enable-debug isn't ideal for performance-sensitive production deployment, but as long as you specify an optimization flag to the compiler, it's fast enough for most development uses. This is similar to how development versions of FreeBSD are built, and Facebook uses --enable-debug for most of its non-ASAN development/test builds.
@antirez


Owner

antirez commented Apr 27, 2017

Thanks @jasone, this is starting to get interesting:

  1. Ok thanks. For Redis 4.0, would you kindly suggest a version that is currently the safest? I went back to 4.0.3, but given that the old code is safe in certain regards yet also has its issues due to lack of updates, maybe 4.6.0 is a better pick?
  2. Redis was never totally single threaded, but now it is even less so... We definitely have malloc/free inside threads with Jemalloc. And... in the main thread, we have code that calls the system allocator (that is, the Lua interpreter, and potentially other stuff). So perhaps we should move Lua to use Jemalloc to avoid this problem?
  3. Ok, it looks like at least in the unstable branch it makes sense for us to go for the debug mode.

Thank you

@antirez


Owner

antirez commented Apr 27, 2017

To clarify about threads, this is what we do in other threads:

  1. Reclaiming of objects in the background (when the UNLINK command is used), so imagine a busy loop calling jemalloc free().
  2. Slow system calls. However threads use a message-passing strategy to communicate instead of using mutexes, so the workers implementing the background syscalls will free() the messages after receiving them (a minimal sketch of this hand-off pattern is shown below).

In the future we are going to use threading a lot more; I have an experimental patch implementing threaded I/O and hopefully this will become real sooner or later. Also, Redis Modules already use threading and will use it more as well. So the threading story is getting more complex, but that's another story; just to clarify what we have / what we'll get.
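As a concrete illustration of the hand-off described above, here is a minimal sketch (not Redis's actual bio.c/lazyfree code; names like bg_free and free_job are made up, and a simple mutex-protected list stands in for whatever queueing the real code uses). The key property is that the producer never touches an object after handing it off, and only the background consumer frees it:

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical job node: the main thread enqueues pointers to reclaim and
 * the background thread is the only one that ever frees them. */
typedef struct free_job {
    void *ptr;
    struct free_job *next;
} free_job;

static free_job *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;

/* Main thread (e.g. on UNLINK): hand the object off and forget about it. */
void
bg_free(void *ptr)
{
    free_job *job = malloc(sizeof(*job));
    job->ptr = ptr;
    pthread_mutex_lock(&queue_lock);
    job->next = queue_head;
    queue_head = job;
    pthread_cond_signal(&queue_nonempty);
    pthread_mutex_unlock(&queue_lock);
}

/* Background thread: drain the queue, freeing both payload and message. */
void *
bg_free_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        free_job *job = queue_head;
        queue_head = job->next;
        pthread_mutex_unlock(&queue_lock);
        free(job->ptr);   /* the busy loop of free() calls mentioned above */
        free(job);
    }
    return NULL;
}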

@jasone


jasone commented Apr 28, 2017

  • Suggested current jemalloc version: 4.5.0 (would suggest stable-4 if Windows were a target).
  • Re: allocator to use in Lua, I'd definitely suggest using the same allocator as for the rest of redis, in order to eliminate the possibility of erroneous mixed allocator use. I don't know anything about how redis utilizes Lua; the risks may be very high/low depending on details.
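For what it's worth, Lua does let the host supply the allocator when the interpreter state is created, so routing the embedded interpreter through the server's allocator would look roughly like the sketch below. This is an illustration only, not what Redis currently does; it assumes the standard lua_newstate/lua_Alloc API and Redis's zrealloc/zfree wrappers:

#include <lua.h>
#include "zmalloc.h"   /* Redis's allocator wrappers around jemalloc */

/* lua_Alloc-compatible hook: Lua funnels every allocation through this, so
 * the interpreter uses the same allocator as the rest of the server and no
 * libc-malloc/jemalloc mixing can occur for Lua-owned memory. */
static void *
redis_lua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    (void)ud; (void)osize;
    if (nsize == 0) {          /* lua_Alloc contract: free and return NULL */
        zfree(ptr);
        return NULL;
    }
    return zrealloc(ptr, nsize);
}

lua_State *
create_lua_state(void)
{
    /* lua_newstate (unlike luaL_newstate) accepts a custom allocator. */
    return lua_newstate(redis_lua_alloc, NULL);
}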
@antirez


Owner

antirez commented Apr 28, 2017

Understood, thank you @jasone.

@xhochy


xhochy commented Aug 4, 2017

Note that we also see the problem reported in #3799 (comment) in Apache Arrow using jemalloc 4.5.0: https://issues.apache.org/jira/browse/ARROW-1282

Building with --with-malloc-conf=dss:disabled avoids the hanging issue.

I started to build up an environment to reproduce this but probably won't be able to continue the work for two weeks. Any suggestion for what to look out for would be very helpful.

JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018

Revert "Jemalloc updated to 4.4.0."

@gnusi referenced this issue Feb 14, 2018

Closed

use c++14 #4581
