process_info/1 gets stuck and never returns #7801
Comments
@nickva I have seen similar problems when running Erlang/OTP on Linux when the Linux I/O scheduler is not set to either (multiqueue) "none" with Linux ≥ 5.0 or (non-multiqueue) "noop". Are you using Linux with a different scheduler (like Linux ≥ 5.0 with "mq-deadline" or "bfq")? If so, it would be best to retest with the "none" or "noop" scheduler before considering it an Erlang/OTP problem.
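For reference, the active scheduler can also be checked straight from an Erlang shell; a minimal sketch, assuming a block device named "sda" (not taken from this issue, adjust for your disks):

```erlang
%% Reads the scheduler list for one block device; the active scheduler
%% is shown in brackets, e.g. "[none] mq-deadline kyber bfq".
{ok, Sched} = file:read_file("/sys/block/sda/queue/scheduler"),
io:format("~s", [Sched]).
```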
@okeuday that's interesting. Thanks for the tip to check. I checked the cluster and our IO schedulers are set to |
Do you still have the core files? If so, what does |
Thanks for taking a look, Rickard. Here is the output of
|
I attempted to create a reproducer which I think at least shows the difference between OTP 24 and 25: https://gist.github.com/nickva/d784043bee8e28f447bb3e66336712df

It spawns a set of gen_servers which open and hold a file handle. I ran these in parallel on the same test machine, with the hardware matching our production clusters. So far I noticed a large difference in stats gathering times when the pread calls block to fetch the data from the disk. It may be necessary to run the benchmark once to write the data, then stop it and issue a …

OTP 25

OTP 24
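For context, a rough sketch of the shape of that reproducer (this is not the actual gist code; the module and function names are made up): a gen_server holding an open file handle and serving pread requests, plus a helper that times process_info/2 over the workers.

```erlang
%% Illustrative sketch only; not the code from the gist linked above.
-module(pi_repro).
-behaviour(gen_server).
-export([start_link/1, measure/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Path) ->
    gen_server:start_link(?MODULE, Path, []).

init(Path) ->
    %% Hold an open raw file handle, similar to what couch_file does.
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    {ok, Fd}.

%% Each pread may block on disk I/O; that is when the stats gathering
%% latency difference between OTP 24 and 25 shows up.
handle_call({pread, Pos, Len}, _From, Fd) ->
    {reply, file:pread(Fd, Pos, Len), Fd}.

handle_cast(_Msg, Fd) ->
    {noreply, Fd}.

%% Time how long it takes to poll process_info/2 on all workers,
%% returning {Microseconds, Results}.
measure(Pids) ->
    timer:tc(fun() ->
        [erlang:process_info(P, [message_queue_len, dictionary]) || P <- Pids]
    end).
```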
Thanks! I know why this latency occurs in OTP 25 and it is quite an easy fix. I'll make a PR later today. I, however, do not understand how the process |
Unfortunately it comes from a production system and may have customer data embedded in it. I'll try to capture a few more of those core files; the issue happens almost every day or so. If you have any gdb scripts, like the etp-... ones, I can try running those against the core.

More details about the cluster: that particular cluster has a very spiky workload. It's mostly idle, but periodically, almost every hour, there is a constant 1k to 2k requests per second for about 10 minutes, which opens and inspects database and index files.

Some details about the application structure, in case there is something you might think of that could trigger the behavior:
|
I think I understand why the process is getting stuck forever, and will try to provide a fix for it as soon as possible, but I cannot give you an exact date. I've made a branch |
That's great news!
I tried it out and indeed it fixes the latency issue. The top is your branch and the bottom is OTP 24: both operations run within 15-30 msec instead of 30-40k msec as before. Thank you for the fix, Rickard
@rickard-green thank you, it's much appreciated. We'll try to make a test release with the patch and give it a try on the cluster where we see the issue happen most often.
I had tried a bit more to create a reproducer for the permanent stuck state, but so far wasn't able to. There is a chance the couch_file (gen_server with the file IO handle) sometimes is also called across a dist channel, so I tried that in the reproducer, but with no luck. There seems to be a specific set of timings or feature interactions that makes it difficult.

On a positive note, we tested the patch in production on the problematic 6 node cluster for three days, and so far no blockages were detected. Previously, among the 6 nodes, at least one used to get "stuck" every day and had to be restarted. This hasn't yet happened with your fix, so it looks promising!
Probably not related to this issue, but when attempting to reproduce the blocking behavior, the reproducer seems to generate a large binary memory leak across a dist channel. I created a separate issue for it: #7834
So far the original patch we applied has been running for about 10 days. The good news is that we didn't encounter any cases where the node gets stuck like before. However, there were two unexplained instances where memory usage climbed sharply from about 10GB to 60GB and then the node was rebooted by the OOM killer. Here is one of those cases, showing the free/cached/used memory metric:

The Erlang uptime metric for all 6 nodes over the 10 day period looks like:

The vertical line on the left is when the patch was applied, and the two "dips" in the graph are when the memory issue occurred. I was wondering if there is anything in the original patch which we applied (maint-25...cloudant:otp:OTP-25.3.2.7-signal-sched-fixes) that might cause this behavior. It almost seems like a deadlock of some sort?
@nickva Thanks! I doubt that #7822 causes these memory issues, but there can of course always be something that one misses. If it is involved, I think it is indirectly involved by changing timing in the system. You don't think the memory leak you found with the reproducer #7834 could be the cause of this?
How come you say it seems like a deadlock? Have you seen anything not making progress as expected?
Thanks for confirming! I wasn't sure, so I thought I'd share what we found and ask for your opinion. Good point that it could be the memory issue from #7834. We were mainly concerned as the behavior seems to be new on the cluster, seemingly appearing after we patched the beam.smp file. But since it happened only 2 times in 12 days, it may still be a coincidence and something unrelated.

It was mainly a guess based on how suddenly the memory increase happened, and on noticing that distribution couldn't send packets out. So it seems like the node was handling the incoming requests as always, but at some point it stops and memory starts piling up. I got a closer, zoomed-in picture of one of the incidents. The dist metrics (right axis) stop early because those are queried from the VM, which stopped responding to the metrics poll requests. Eventually, when the OOM killer restarted the node, the blue 60GB line also stops. After the reboot, memory went back to normal, in the range of 10GB or so.
Just to be sure, I've gone through all the changes again and feel quite confident that the #7822 PR won't cause these issues. At least not directly. What kind of distribution transport do you use? Our default tcp, our tls, or another implementation? Could it be outgoing buffers piling up when the distribution isn't able to send? Are you using the |
We use the default tcp transport with a buffer (a …). The buffer (…). It's probably a variation in traffic which may have triggered an unrelated bug, or, like you mentioned, it could even be that same memory leak bug #7834 just manifesting a lot quicker depending on some subtle timing. Thank you for taking another look, Rickard.
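For reference, a couple of inspection calls that can help when checking whether outgoing distribution data is piling up; a minimal sketch, assuming the default TCP transport where the per-connection controller is a port:

```erlang
%% Busy limit for distribution ports, in bytes (settable with +zdbbl).
erlang:system_info(dist_buf_busy_limit).

%% Bytes queued in the driver for each dist connection; only meaningful
%% for transports whose controlling entity is a port (the default TCP one).
[{Node, erlang:port_info(Ctrl, queue_size)}
 || {Node, Ctrl} <- erlang:system_info(dist_ctrl), is_port(Ctrl)].
```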
This issue has been fixed in OTP 26.2 and OTP 25.3.2.8.
Thank you, Rickard
After deploying the latest OTP 25, everything continued to work well; however, we keep seeing the sudden memory leak issue discussed above. I managed to get a few core dumps when it starts happening, and described it in more detail in #8044. I created a new issue as it doesn't seem to be related to this one. Just mentioning it here since the discussion started here.
Describe the bug
In Apache CouchDB we have a metrics gathering module. It gets the message queue lengths and dicts from processes. After upgrading from 24.3.4.13 to 25.3.2.7, that metrics gathering module started to get stuck when calling process_info on some processes. This happens randomly once every one or two days. We haven't noticed it happen previously on 24.3.4.13 and earlier Erlang versions.

When this condition happens, the node is up and it's possible to remsh into it. Getting process_info/1 for other processes seems to work well. On the stuck process, even calling exit(Pid, kill) has no effect.

Some examples of calls which get stuck:
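For illustration, the stuck calls are of roughly this shape, run against the pid of the stuck process (representative examples, not the exact snippet from the report):

```erlang
%% Representative examples; Pid is the stuck process.
erlang:process_info(Pid).
erlang:process_info(Pid, message_queue_len).
erlang:process_info(Pid, dictionary).
```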
Trying to find other ways to get more details, I gave erts_debug a try. Surprisingly, after calling erts_debug:get_internal_state({link_list, <0.18841.4434>}) it unblocked the stuck process_info/1 call in another terminal(!).

In other words, one remsh instance had the prompt hung on erlang:process_info(<0.32140.3880>). Launching a separate remsh prompt and running erts_debug:set_internal_state(available_internal_state, true), erts_debug:get_internal_state({link_list, <0.18841.4434>}) made the process_info/1 in the first terminal return a result.

The "stuck" process is a gen_server in CouchDB which performs file IO. It seems to get stuck in the pread file IO call.
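The two-shell sequence above, summarized (the pids are the ones quoted in this report; pid/3 is the shell helper for constructing a pid from its three components):

```erlang
%% remsh #1: this call hangs and never returns
erlang:process_info(pid(0, 32140, 3880)).

%% remsh #2: running this makes the call in remsh #1 return
erts_debug:set_internal_state(available_internal_state, true),
erts_debug:get_internal_state({link_list, pid(0, 18841, 4434)}).
```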
After it returns a result, another call to process_info/1 would be stuck again until get_internal_state/1 makes it "unstuck" and it returns a result, and so on.

To Reproduce
So far there is no way to easily reproduce the issue. It happens in production on a particular cluster every one or two days.
Expected behavior

process_info/1 and exit(Pid, kill) would function as designed.

Affected versions
So far we have noticed it happens on 25.3.2.7. That's the first OTP 25 version we tried in production. The issue doesn't seem to happen on 24.3.4.13 (note: that's also not the latest 24 version currently; the latest is 24.3.4.14).
Additional context
I was able to attach gdb and inspect more of the process state via some etp-commands. I triggered a few more stuck process_info calls, used the erts_debug get_link trick to get them unstuck, and then dumped the core (I have it available for further investigation).
Using the core file later, the same process info looks a bit different:
The process struct: