erts_no_of_hidden_dist_entries=-1 when racing happens between erts_do_net_exits and abort_pending_connection #6247
Comments
dist table is completely messed up
|
Ah I think it just happened that dep->state == ERTS_DE_STATE_IDLE |
I think I found the problem. PR #6258 contains a preliminary fix. |
Thanks a lot for the quick fix! #6258 seems to fix a legitimate bug, but I'm not sure it solves our problem. We would need several weeks to test it, because the problem currently happens approximately once per week across our entire cluster. Is it still possible that net_kernel shuts down while there is a pending connection, and the pending connection is also shut down because its port closed before erts_do_net_exits calls abort_pending_connection, so that erts_set_dist_entry_not_connected is called twice? |
I still think this could explain your problem. A terminating process (net_kernel) is marked with ERTS_PSFLG_EXITING before erts_continue_process() is called, where it detects F_DISTRIBUTION and calls erts_do_net_exits(). It may even yield in between and execute other things, like terminating ports. That would allow (I think) erts_internal:create_dist_channel to fail to send the message to the exiting net_kernel process, leave the DistEntry PENDING, and get erts_do_net_exits called on the port before it's called on net_kernel. I will get a second opinion on the fix though... |
I've updated the fix #6258 slightly. Would be good if you could try it. |
I can't say that I understand exactly how this bug would explain the particular messed-up state of the DistEntry lists and their counters. In case there is still some other bug messing up the DistEntry lists and you don't want to, or can't, run the debug emulator, a much more lightweight alternative could be to enable
|
I still haven't been able to reproduce this locally. We will test this fix in production, but it may take a few weeks to know whether it works. Your patch currently does the following things twice; is that expected? |
Suppose a dep in the "pending" list happens to have dep->state == ERTS_DE_STATE_IDLE and is currently in the erts_not_connected_dist_entries list. I think this results in the following: head = &erts_hidden_dist_entries, and dep is moved to the head of erts_not_connected_dist_entries. |
No. If sending the message to net_kernel fails, it resets the values: |
Sorry, I don't follow.
This is not allowed; only CONNECTED can be set to EXITING. It will then call |
The steps that I imagine triggered the issue are:
I think your fix solved that.
I think both cases could happen and your fix solves one of them. Am I understanding this right? |
A DistEntry (including its state) is protected by its read-write lock:
and the DistEntry lists and counters are further protected by one single read-write lock:
You seem to think that erts_set_dist_entry_not_connected() gets called with dep->state == IDLE. I don't think that was/is possible. erts_set_dist_entry_not_connected() is called in two places:
My bug scenario was
|
|
I think #6258 will fix this. Closing issue. |
Describe the bug
We occasionally see coredumps with backtraces like:
The coredump happens on node shutdown when there are pending connections. In both cases, we see that erts_no_of_hidden_dist_entries=-1. This means there are duplicated calls to erts_set_dist_entry_not_connected, which leave erts_no_of_hidden_dist_entries with a wrong value. (Even though the dist table is protected by an rwlock, it is still possible that one operation is blocked while the other passes through. The blocked operation then decrements the value again.)
We don't know exactly why unlink_carrier of the local variable "DistEntry** pending" would fail, or why beam/dist.c:947 abort_pending_connection(pending[i], pending[i]->connection_id, NULL) would fail, but the pending-shutdown race is surely worth fixing.
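To make the duplicated-call hypothesis concrete, here is a minimal toy model in C. It is only a sketch: every `toy_` name is hypothetical and none of this is the real ERTS code or the change in #6258. The branch structure (an entry that is neither pending nor published being counted as hidden) is an assumption based on the `head = &erts_hidden_dist_entries` observation in the comments. Note that a lock would only serialize the two calls; it would not stop the second call from running.

```c
/* Hypothetical toy model, NOT the real ERTS structures or functions. */
enum toy_state { TOY_IDLE, TOY_PENDING, TOY_CONNECTED };

struct toy_entry {
    enum toy_state state;
    int published;       /* nonzero for a visible (non-hidden) node */
};

struct toy_counters {
    int pending;
    int visible;
    int hidden;          /* plays the role of erts_no_of_hidden_dist_entries */
    int not_connected;
};

/* Unguarded: picks the counter to decrement from the entry's current
 * state and flags. An entry that is already IDLE with its flags cleared
 * falls into the "hidden" branch, so a second call decrements a counter
 * the entry never contributed to. */
static void toy_set_not_connected(struct toy_counters *c, struct toy_entry *e)
{
    if (e->state == TOY_PENDING)
        c->pending--;
    else if (e->published)
        c->visible--;
    else
        c->hidden--;

    e->state = TOY_IDLE;
    e->published = 0;
    c->not_connected++;
}

/* Guarded: a no-op if the entry is already not connected.
 * Returns 1 if the entry was moved, 0 otherwise. */
static int toy_set_not_connected_once(struct toy_counters *c, struct toy_entry *e)
{
    if (e->state == TOY_IDLE)
        return 0;
    toy_set_not_connected(c, e);
    return 1;
}
```

In this model, starting from one pending entry and calling `toy_set_not_connected` twice leaves `hidden` at -1, while calling `toy_set_not_connected_once` twice leaves it at 0, because the state check makes the operation idempotent.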
To Reproduce
This is a rare race, but logically it can certainly happen.
Expected behavior
erts_no_of_hidden_dist_entries should never drop below 0
Affected versions
25.0.2
Additional information
We are running dual dist protocols: non-hidden node connections use inet_drv, and hidden node connections use inet_tls_dist.
Also, in all recent coredumps, we have erts_no_of_pending_dist_entries = 1