Upgraded Jemalloc causes hang on Debian 8 #3799

Open
rwky opened this Issue Feb 9, 2017 · 15 comments

@rwky
rwky commented Feb 9, 2017

This is a weird one.

After upgrading to 3.2.7, Redis just hangs, using up all the CPU on one core. Nothing is written to the log file, the RDB file isn't updated, and all connections fail.

Since this only started happening with 3.2.7, and the biggest change there was Jemalloc, I reverted 27e29f4 and tried again, and it worked fine. So it appears the Jemalloc upgrade is causing the hang.

Unfortunately I've not found an easy way of replicating this except by using production data.

I'm not sure what to do to debug further so I'm raising this issue.
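The symptom described above (the process is alive but stops answering clients) can be detected from outside with a timed PING. The following is a minimal sketch, not part of Redis: the function name `redis_responds` and its defaults are hypothetical, and it assumes the server accepts the inline `PING` command without authentication.

```python
import socket


def redis_responds(host="127.0.0.1", port=6379, timeout=2.0):
    """Return True if the server answers an inline PING with +PONG in time.

    A hung server may still complete the TCP handshake (the kernel queues the
    connection), so the recv() timeout is what reveals the hang.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            # A password-protected server replies with an error instead of
            # +PONG; this sketch treats that as "not responding".
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers refused connections, resets, and timeouts alike.
        return False
```

Calling `redis_responds()` from a cron job or monitoring check returns `False` both when the port is closed and when the server accepts the connection but never replies.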

@antirez
Owner
antirez commented Feb 9, 2017

Thanks @rwky, were you coming from 3.2.6?

@antirez antirez referenced this issue in jemalloc/jemalloc Feb 9, 2017
Open

deadlock in je_prof_boot2 #585

@antirez
Owner
antirez commented Feb 9, 2017

p.s. also please report your glibc version if possible. Thank you.

@rwky
rwky commented Feb 9, 2017

Yep, straight upgrade from 3.2.6, which was working solidly.

glibc is ldd (Debian GLIBC 2.19-18+deb8u7) 2.19

@davidtgoldblatt

Two ideas:

  • Can you grab some stack traces to see where the spinning is happening?
  • Can you try building jemalloc with --enable-debug?
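One way to grab stack traces from a spinning process is to attach gdb in batch mode. This is a minimal sketch, assuming gdb is installed and the caller is allowed to attach to the target process; the helper names `gdb_backtrace_command` and `capture_backtrace` are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints every
    thread's backtrace, and exits without entering interactive mode."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid, timeout=30):
    """Run gdb against `pid` and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Taking two or three captures a few seconds apart helps tell a real spin (the same frames every time) from a process that is merely busy.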
@rwky
rwky commented Feb 9, 2017

Will do. It'll be a little while before it acts up again; it may be tomorrow before I can grab the details.

@rwky
rwky commented Feb 9, 2017

We got lucky: it happened before I went to bed. Attached is the Redis log after sending it SIGSEGV; hopefully it's useful.
redis.txt

@antirez
Owner
antirez commented Feb 10, 2017

@rwky thanks a lot. Based on your new observations, are you still confident that the problem only happens with Jemalloc 4.4.0? It may be safe to release Redis 3.2.8 with the commit reverted at this point... Thanks.

@rwky
rwky commented Feb 10, 2017

Yep, since I reverted that commit it works fine, so it's something Jemalloc related, and it's 100% repeatable. I'm just not sure exactly what is triggering it; it's something in our production workload.

@antirez
Owner
antirez commented Feb 10, 2017

Thank you, I think it's better to release 3.2.8 ASAP.

@antirez
Owner
antirez commented Feb 12, 2017 edited

@davidtgoldblatt in case you did not notice, we have a stack trace thanks to @rwky:

------ STACK TRACE ------
EIP:
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]

Backtrace:
/usr/local/bin/redis-server 127.0.0.1:6379(logStackTrace+0x29)[0x4623a9]
/usr/local/bin/redis-server 127.0.0.1:6379(sigsegvHandler+0xa6)[0x462a46]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f5ffbb60890]
/usr/local/bin/redis-server 127.0.0.1:6379(je_spin_adaptive+0x22)[0x4e3552]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_dss+0x1d8)[0x4c16f8]
/usr/local/bin/redis-server 127.0.0.1:6379(je_chunk_alloc_wrapper+0x948)[0x4c0a18]
/usr/local/bin/redis-server 127.0.0.1:6379(je_arena_chunk_ralloc_huge_expand+0x263)[0x4b7bd3]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4d6ff0]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc_no_move+0x314)[0x4d7804]
/usr/local/bin/redis-server 127.0.0.1:6379(je_huge_ralloc+0x5c)[0x4d7b8c]
/usr/local/bin/redis-server 127.0.0.1:6379(je_realloc+0xb2)[0x4ae072]
/usr/local/bin/redis-server 127.0.0.1:6379(zrealloc+0x26)[0x431f76]
/usr/local/bin/redis-server 127.0.0.1:6379(sdsMakeRoomFor+0x2bd)[0x42f91d]
/usr/local/bin/redis-server 127.0.0.1:6379(readQueryFromClient+0xae)[0x43ab0e]
/usr/local/bin/redis-server 127.0.0.1:6379(aeProcessEvents+0x133)[0x425463]
/usr/local/bin/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x4257ab]
/usr/local/bin/redis-server 127.0.0.1:6379(main+0x40b)[0x42285b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5ffb7c7b45]
/usr/local/bin/redis-server 127.0.0.1:6379[0x4229fe]
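For triage, the frames in the trace above can be split into image, symbol, and address so the allocator frames stand out. This is a minimal sketch that assumes only the `image(symbol+offset)[address]` layout visible in the trace; the names `FRAME_RE`, `parse_frames`, and `allocator_frames` are hypothetical.

```python
import re

# One frame per line: image path, optional "(symbol+offset)", then "[address]".
FRAME_RE = re.compile(
    r"^(?P<image>.*?)"
    r"(?:\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return one dict per stack frame found in a Redis crash-log excerpt."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip headers such as "Backtrace:" and blank lines
        frames.append({
            "image": match.group("image"),
            # Frames without symbol information yield None here.
            "symbol": match.group("symbol") or None,
            "address": int(match.group("address"), 16),
        })
    return frames


def allocator_frames(frames, prefix="je_"):
    """List the symbols that carry the given prefix, in stack order."""
    return [f["symbol"] for f in frames
            if f["symbol"] and f["symbol"].startswith(prefix)]
```

Applied to the trace above, `allocator_frames` starts with `je_spin_adaptive`, which places the spin inside the allocator's realloc path rather than in Redis command handling.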
@antirez antirez added a commit that referenced this issue Feb 12, 2017
@antirez Revert "Jemalloc updated to 4.4.0."
This reverts commit 153f2f0.

Jemalloc 4.4.0 is apparently causing deadlocks in certain
systems. See for example #3799.
As a cautionary step we are reverting the commit back and
releasing a new stable Redis version.
7178cac
@davidtgoldblatt

Thanks, I had missed that. I'll take a look sometime tomorrow.

@bwzhang2011 bwzhang2011 referenced this issue in jemalloc/jemalloc Feb 13, 2017
Open

4.5.0 release #545

@jasone
jasone commented Feb 13, 2017

This is likely due to the bug fixed here on the dev branch. jemalloc issue #618 is tracking the backport which will be part of the 4.5.0 release.

@rwky
rwky commented Feb 13, 2017

If the bug @jasone mentions is the offending issue, then once 4.5.0 is released, if someone wants to create a Redis branch with 4.5.0 in it, I'm happy to test it.

@davidtgoldblatt

I've had some trouble replicating this at the commit before the suspected fix. But given that 4.5 is coming out soon, I'm inclined not to spend much time on it, so long as you don't mind trying it out after the fix. I'll ping this issue once it's released?

@uqs uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
osa Upgrade from 3.2.7 to 3.2.8.
<ChangeLog>

Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade
                          that is believed to potentially cause a server
                          deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

</ChangeLog>


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5
065c1a8
@uqs uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Feb 14, 2017
@osokin osokin Upgrade from 3.2.7 to 3.2.8.
0ec9c53
@spinlock spinlock added a commit to CodisLabs/codis that referenced this issue Feb 14, 2017
@spinlock spinlock extern: upgrade redis-3.2.7 to redis-3.2.8
    There're 2 bug fixes in redis-3.2.8:
    1. antirez/redis#3799
        Downgrade jemalloc-4.4.0 to jemalloc 4.0.3
    2. antirez/redis#3796
        Fix a crash in command MIGRATE
7c5cebb
@rwky
rwky commented Feb 14, 2017

Sounds good. I had trouble replicating it without throwing real data at it. I don't know what exactly triggers it; I just know reverting Jemalloc fixed it. Ping me once the fix is released and I'll check it out.

@mat813 mat813 pushed a commit to mat813/freebsd-ports that referenced this issue Feb 14, 2017
osa Upgrade from 3.2.7 to 3.2.8.
git-svn-id: https://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5
3b0dcd8
@jsonn jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Feb 14, 2017
fhajny Update databases/redis to 3.2.8.
================================================================================
Redis 3.2.8     Released Sun Feb 12 16:11:18 CET 2017
================================================================================

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See antirez/redis#3799.
   We reverted back to the previously used Jemalloc versions and plan
   to upgrade Jemalloc again after having more info about the
   cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   antirez/redis#3796.

================================================================================
Redis 3.2.7     Released Tue Jan 31 16:21:41 CET 2017
================================================================================

Main bug fixes and improvements in this release:

1. MIGRATE could incorrectly move keys between Redis Cluster nodes by turning
   keys with an expire set into persisting keys. This bug was introduced with
   the multiple-keys migration recently. It is now fixed. Only applies to
   Redis Cluster users that use the resharding features of Redis Cluster.

2. As Redis 4.0 beta and the unstable branch already did (for some months at
   this point), Redis 3.2.7 also aliases the Host: and POST commands to QUIT
   avoiding to process the remaining pipeline if there are pending commands.
   This is a security protection against a "Cross Scripting" attack, that
   usually involves trying to feed Redis with HTTP in order to execute commands.
   Example: a developer is running a local copy of Redis for development
   purposes. She also runs a web browser in the same computer. The web browser
   could send an HTTP request to http://127.0.0.1:6379 in order to access the
   Redis instance, since a specially crafted HTTP request may also be partially
   valid Redis protocol. However if POST and Host: break the connection, this
   problem should be avoided. IMPORTANT: It is important to realize that it
   is not impossible that another way will be found to talk with a localhost
   Redis using a Cross Protocol attack not involving sending POST or Host: so
   this is only a layer of protection but not a definitive fix for this class
   of issues.

3. A ziplist bug that could cause data corruption, could crash the server and
   MAY ALSO HAVE SECURITY IMPLICATIONS was fixed. The bug looks complex to
   exploit, but attacks always get worse, never better (cit). The bug is very
   very hard to catch in practice, it required manual analysis of the ziplist
   code in order to be found. However it is also possible that rarely it
   happened in the wild. Upgrading is required if you use LINSERT and other
   in-the-middle list manipulation commands.

4. We upgraded to Jemalloc 4.4.0 since the version we used to ship with Redis
   was an early 4.0 release of Jemalloc. This version may have several
   improvements including the ability to better reclaim/use the memory of
   system.
3b52612