Upgraded Jemalloc causes hang on Debian 8 #3799
Thanks @rwky, were you coming from 3.2.6?
P.S. Also please report your glibc version if possible. Thank you.
Yep, straight upgrade from 3.2.6, which was working solidly. glibc is
Two ideas:
Will do. It'll be a little while before it acts up again; it may be tomorrow before I can grab the details.
We got lucky that it happened before I went to bed. Attached is the redis log after sending it SIGSEGV; hopefully it's useful.
@rwky thanks a lot. Based on your new observations, are you still confident that the problem only happens with Jemalloc 4.4.0? It may be safe to release Redis 3.2.8 with the commit reverted at this point... Thanks.
Yep, since I reverted that commit it works fine, so it's something Jemalloc related, and it's 100% repeatable. I'm just not sure exactly what triggers it; something in our production workload.
Thank you, I think it's better to release 3.2.8 ASAP. |
@davidtgoldblatt in case you did not notice, we have a stack trace thanks to @rwky:
Thanks, I had missed that. I'll take a look sometime tomorrow.
This is likely due to the bug fixed here on the dev branch. jemalloc issue #618 is tracking the backport, which will be part of the 4.5.0 release.
If @jasone's comment identifies the offending issue, then once 4.5.0 is released, if someone wants to create a redis branch with 4.5.0 in it, I'm happy to test it.
I've had some trouble replicating this at the commit before the suspected fix. But given that 4.5 is coming out soon, I'm inclined not to spend much time on it, so long as you don't mind trying it out after the fix. I'll ping this issue once it's released.
<ChangeLog>
Upgrade urgency CRITICAL: This release reverts back the Jemalloc upgrade that is believed to potentially cause a server deadlock. A MIGRATE crash is also fixed.

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular conditions. See redis/redis#3799. We reverted back to the previously used Jemalloc versions and plan to upgrade Jemalloc again after having more info about the cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference: redis/redis#3796.
</ChangeLog>

git-svn-id: svn+ssh://svn.freebsd.org/ports/head@434063 35697150-7ecd-e111-bb59-0022644237b5
There are two bug fixes in redis-3.2.8:
1. redis/redis#3799 Downgrade jemalloc 4.4.0 to jemalloc 4.0.3
2. redis/redis#3796 Fix a crash in the MIGRATE command
Sounds good. I had trouble replicating it without throwing real data at it; I don't know what exactly triggers it, I just know reverting jemalloc fixed it. Ping me once the fix is released and I'll check it out.
================================================================================
Redis 3.2.8     Released Sun Feb 12 16:11:18 CET 2017
================================================================================

Two important bug fixes, the first of which is critical:

1. Apparently Jemalloc 4.4.0 may contain a deadlock under particular
   conditions. See redis/redis#3799. We reverted back to the previously used
   Jemalloc versions and plan to upgrade Jemalloc again after having more info
   about the cause of the bug.

2. MIGRATE could crash the server after a socket error. See for reference:
   redis/redis#3796.

================================================================================
Redis 3.2.7     Released Tue Jan 31 16:21:41 CET 2017
================================================================================

Main bug fixes and improvements in this release:

1. MIGRATE could incorrectly move keys between Redis Cluster nodes by turning
   keys with an expire set into persisting keys. This bug was introduced with
   the multiple-keys migration recently. It is now fixed. Only applies to
   Redis Cluster users that use the resharding features of Redis Cluster.

2. As Redis 4.0 beta and the unstable branch already did (for some months at
   this point), Redis 3.2.7 also aliases the Host: and POST commands to QUIT,
   avoiding processing of the remaining pipeline if there are pending
   commands. This is a security protection against a "Cross Protocol
   Scripting" attack, which usually involves trying to feed Redis with HTTP in
   order to execute commands. Example: a developer is running a local copy of
   Redis for development purposes. She also runs a web browser on the same
   computer. The web browser could send an HTTP request to
   http://127.0.0.1:6379 in order to access the Redis instance, since a
   specially crafted HTTP request may also be partially valid Redis protocol.
   However, if POST and Host: break the connection, this problem should be
   avoided.

   IMPORTANT: It is important to realize that it is not impossible that
   another way will be found to talk with a localhost Redis using a Cross
   Protocol attack not involving sending POST or Host:, so this is only a
   layer of protection and not a definitive fix for this class of issues.

3. A ziplist bug that could cause data corruption, could crash the server and
   MAY ALSO HAVE SECURITY IMPLICATIONS was fixed. The bug looks complex to
   exploit, but attacks always get worse, never better (cit.). The bug is very
   hard to catch in practice; it required manual analysis of the ziplist code
   in order to be found. However, it is also possible that it rarely happened
   in the wild. Upgrading is required if you use LINSERT and other
   in-the-middle list manipulation commands.

4. We upgraded to Jemalloc 4.4.0 since the version we used to ship with Redis
   was an early 4.0 release of Jemalloc. This version may have several
   improvements, including the ability to better reclaim/use the memory of
   the system.
Jemalloc 4.5.0 is out, which probably fixes this.
Yes, I think the issue is fixed (Fix chunk_alloc_dss() regression.). The stack trace above is consistent with test failures we experienced once @davidtgoldblatt implemented CI testing for FreeBSD.
Thanks, we will upgrade 4.0 RC asap and the stable release with some delay.
If we notice anything strange I'll write a note. Thank you!
@antirez if you want to create a 3.x branch with Jemalloc 4.5.0, I'll test if it works before you release it as a stable release.
This reverts commit 153f2f0. Jemalloc 4.4.0 is apparently causing deadlocks in certain systems. See for example redis#3799. As a cautionary step we are reverting the commit back and releasing a new stable Redis version.
@jasone We hit this exact same problem after upgrading to jemalloc 4.6.0 (4.5.0 as well) when reallocating a large chunk of memory. Stack trace:
@siyangy, can you please specify precisely which jemalloc revision(s) you're testing with? There is no 4.6.0 release, so I want to make sure we're talking about a version that has this fix. Assuming you're testing with at least 4.5.0, it would be really helpful to get a printout of the primary variables that impact the loop logic in
Sorry, we tried 4.3.1, 4.4.0 and 4.5.0, hitting the same thing, and we have verified that the fix is included in 4.5.0. Here are the variables you asked for. One thing worth noting: rather than being stuck printing in the while loop (we put the print within the while(true) loop), it stops printing after a few iterations while the program hangs. During our previous debugging we found out that it ends up calling zrealloc recursively (a nested zrealloc within zrealloc).
Wow, is that
I'm having a hard time seeing how the
Okay, so we use this to print:
However, it is placed at the very beginning of the while(true) loop, where dss_prev is not initialized. We put it there because we assumed the code went into an infinite loop within this while(true). When we move this line to after dss_prev is initialized, nothing is printed out.
@jasone After some more debugging I finally found where the code gets stuck: there's a spin_adaptive call in the function chunk_dss_max_update, which is why we never saw valid dss_prev and dss_next values updated; it never gets out of the spin in chunk_dss_max_update. For some reason our stack trace doesn't show the spin_adaptive call. The comment before spin_adaptive says "Another thread optimistically updated dss_max. Wait for it to finish." Apparently dss_max is not updated as expected.
Just found this in the CI test, happening with Jemalloc 4.0.3 after a server restart. Not sure if it's related, but it looks like a Jemalloc deadlock at first glance. It's worth noting that it happened immediately after the Redis server was restarted, so basically there is very little allocated at that time.
Observed a similar stack with the server stuck (one of the threads spinning at 100% CPU) while running redis-server version Redis 3.9.102 (00000000/0) 64 bit.
@bhuvanl 3.9.102 is using the Jemalloc that is known to have issues; RC3 downgraded to Jemalloc 4.0.3. However, strangely enough, I got the hang above with 4.0.3 as well in the
@antirez Seems like using RC3 resolved my issue, thanks for the quick response.
I've been poking at this issue off and on for the past few days, and none of the scenarios I can think of that would put the blame on jemalloc seem plausible:
If I had a live debugger session for the failure, it would probably be pretty quick work to figure out what's going on, but I'm out of ideas based on the incomplete evidence gathered so far. Regarding the two stack traces recently posted with
Hello @jasone, thank you very much for the help with this issue. I realize that these hangs may be Redis bugs and not Jemalloc bugs, so your willingness to analyze the issue nonetheless is very welcome here. I want to make sure I understand what you said, in order to start looking for bugs myself, so here are a few questions I hope you can answer:
Thank you. BTW, if my CI triggers the same bug again, I won't stop the process, and I'll provide you with SSH access in case you want to perform a live debugging session.
Thanks @jasone, this is starting to get interesting:
Thank you.
To clarify about threads, this is what we do in other threads:
In the future we are going to use threading a lot more; I've an experimental patch implementing threaded I/O, and hopefully this will land sooner or later. Also, Redis Modules use threading already and will use it more as well. So the threading story is getting more complex, but that's another story; just to clarify what we have / what we'll get.
Understood, thank you @jasone. |
Note that we also see the problem reported in #3799 (comment) in Apache Arrow using jemalloc 4.5.0: https://issues.apache.org/jira/browse/ARROW-1282. Building with
I started to build up an environment to reproduce this but probably won't be able to continue the work for two weeks. Any suggestion for what to look out for would be very helpful.
@xhochy, following up on Arrow (https://issues.apache.org/jira/browse/ARROW-1282): I can see that you provided a fix for jemalloc > 4.x, as in jemalloc/jemalloc#1005, correct?
I think we can close this right away. The problem is probably caused by the fact that our Lua lib uses the libc allocator while the rest of Redis uses jemalloc. That's still true, but hopefully it doesn't cause any serious issues anymore. (We could change that if we wanted, but there are some disadvantages, like causing a false sense of fragmentation that the defragger won't be able to fix.)

It could be that some distros are building Redis to use an external allocator (not the one embedded into Redis), and who knows which jemalloc version is being used, so it could still be 4.4.0. But even if we change something in the next Redis version, these changes will not make a difference, since these users may also still be using Redis 3.2.
This is a weird one.
After upgrading to 3.2.7, Redis just hangs, using up all the CPU on one core; nothing is output to the log file, the rdb file isn't updated, and all connections fail.
Since this only started happening with 3.2.7, and the biggest change there was Jemalloc, I reverted 27e29f4 and tried it again, and it worked fine. So it appears the Jemalloc upgrade is causing the hang.
Unfortunately I've not found an easy way of replicating this except by using production data.
I'm not sure what to do to debug further so I'm raising this issue.