Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

erts_no_of_hidden_dist_entries=-1 when racing happens between erts_do_net_exits and abort_pending_connection #6247

Closed
zzydxm opened this issue Aug 25, 2022 · 15 comments
Assignees
Labels
bug Issue is reported as a bug in progress team:VM Assigned to OTP team VM

Comments

@zzydxm
Copy link
Contributor

zzydxm commented Aug 25, 2022

Describe the bug
We occasionally see coredumps with backtraces like:

#0  0x00000000006139bf in erts_do_net_exits (dep=<optimized out>, reason=<optimized out>) at beam/dist.c:947
#1  0x000000000046c6fb in erts_continue_exit_process (p=0x17798b8) at beam/erl_process.c:13975
#2  0x00007f7c6a663fe0 in erts_beamasm:continue_exit/0 ()
#3  0x00007f7c61b347a8 in proc_lib:exit_p/3 () at proc_lib.erl:261
#4  0x00007f7c6a663f38 in erts_beamasm:normal_exit/0-CodeInfoPrologue ()
#0  0x0000000000523107 in unlink_carrier (crr=0x7fc1411a8ec0, cl=0x7fc0f385bee0) at beam/erl_alloc_util.c:1258
#1  destroy_carrier (allctr=0x7fc0f385bd80, blk=0x7fc1411a8ee0, busy_pcrr_pp=<optimized out>) at beam/erl_alloc_util.c:4305
#2  0x0000000000530e8e in do_erts_alcu_free (busy_pcrr_pp=0x0, p=0x7fc1411a8ee8, allctr=0x7fc0f385bd80, type=1250)
    at beam/erl_alloc_util.c:5953
#3  do_erts_alcu_free (busy_pcrr_pp=0x0, p=<optimized out>, allctr=0x7fc0f385bd80, type=1250) at beam/erl_alloc_util.c:5929
#4  erts_alcu_free_thr_spec (type=1250, extra=<optimized out>, p=<optimized out>) at beam/erl_alloc_util.c:5993
#5  0x0000000000613b7e in erts_do_net_exits (dep=<optimized out>, reason=<optimized out>) at beam/dist.c:950
#6  0x000000000046c6fb in erts_continue_exit_process (p=0x17b6750) at beam/erl_process.c:13975

The coredump happens on node shutdown when there is pending connections. In both case, we see that erts_no_of_hidden_dist_entries=-1. This means there are duplicated calls of erts_set_dist_entry_not_connected, causing wrong value of erts_no_of_hidden_dist_entries. (Even though there is rwlock for dist table, it is still possible that one operation is blocked and the other operation pass through. Then blocked operation decreased the value again.)

We don't know exactly why unlink_carrier of local variable "DistEntry** pending" would fail, or why beam/dist.c:947 abort_pending_connection(pending[i], pending[i]->connection_id, NULL) would fail, but the pending shutdown racing surely worth fixing.

To Reproduce
This is a rare racing, but logically it surely can happen.

Expected behavior
erts_no_of_hidden_dist_entries should never drop below 0

Affected versions
25.0.2

Additional information
We are running duel dist protocol, non-hidden node connection is using inet_drv, and hidden node connection is using inet_tls_dist

Also, on recent coredumps, we all have erts_no_of_pending_dist_entries = 1

@zzydxm zzydxm added the bug Issue is reported as a bug label Aug 25, 2022
@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 26, 2022

dist table is completely messed up

(gdb) set $cnt=0
(gdb) set $de=erts_pending_dist_entries
(gdb) while $de
 >    set $cnt=$cnt+1
 >    set $de=$de->next
 >end
(gdb) p $cnt
$29 = 8109
(gdb) p erts_no_of_pending_dist_entries
$30 = 1
(gdb) set $cnt=0
(gdb) set $de=erts_not_connected_dist_entries
(gdb) while $de
 >    set $cnt=$cnt+1
 >    set $de=$de->next
 >end
(gdb) p $cnt
$32 = 3818
(gdb) p erts_no_of_not_connected_dist_entries
$33 = 11930
(gdb) set $cnt=0
(gdb) set $de=erts_visible_dist_entries
(gdb) while $de
 >    set $cnt=$cnt+1
 >    set $de=$de->next
 >end
(gdb) p $cnt
$35 = 10809
(gdb) p erts_no_of_visible_dist_entries
$36 = 10809
(gdb) set $cnt=0
(gdb) set $de=erts_hidden_dist_entries
(gdb) while $de
 >    set $cnt=$cnt+1
 >    set $de=$de->next
 >end
(gdb) p $cnt
$38 = 0
(gdb) p erts_no_of_hidden_dist_entries
$39 = -1
(gdb)

@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 26, 2022

Ah I think it just happened that dep->state == ERTS_DE_STATE_IDLE
So if dep->state == ERTS_DE_STATE_IDLE, we can just return without doing anything

@sverker
Copy link
Contributor

sverker commented Aug 29, 2022

I think I found the problem. PR #6258 contains a preliminary fix.

@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 29, 2022

I think I found the problem. PR #6258 contains a preliminary fix.

Thanks a lot for the quick fix! #6258 seems fixed a legit bug but I'm not sure if it solves our problem. We would need several weeks to test it, because the problem currently happens approximately once per week in our entire cluster.

Is it still possible that net_kernel shutdown when there is a pending connection, and the pending connection is also shutdown because port closed before erts_do_net_exits call abort_pending_connection, so that erts_set_dist_entry_not_connected is called twice?

@sverker
Copy link
Contributor

sverker commented Aug 30, 2022

I still think this could explain your problem.

A terminating process (net_kernel) is market with ERTS_PSFLG_EXITING before erts_continue_process() is called where it detects F_DISTRIBUTION and calls erts_do_net_exits(). It may even yield in between and execute other things, like terminating ports.

That would allow (I think) for erts_internal:create_dist_channel to fail send msg to exiting net_kernel process, leave the DistEntry PENDING and getting erts_do_net_exits called on the port before it's called on net_kernel.

I will get a second opinion on the fix though...

@sverker
Copy link
Contributor

sverker commented Aug 30, 2022

I've updated the fix #6258 slightly. Would be good if you could try it.

@sverker
Copy link
Contributor

sverker commented Aug 30, 2022

I can't say that I understand exactly how this bug would explain the particular messed up state of the DistEntry lists and their counters.

In case there still is some other bug messing up the DistEntry lists and you don't want/can run debug emulator, a much more lightweight alternative could be to enable ASSERT's by changing them to ERTS_ASSERT in

erts_set_dist_entry_not_connected()
erts_set_dist_entry_pending()
erts_set_dist_entry_connected()

@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 31, 2022

I still haven't been able to do a local reproduce, we will test this fix on our prod, but it may take few weeks to know if it works.

You patch currently does following things twice, is it expected?
erts_atomic32_read_bor_nob(&pp->state, ERTS_PORT_SFLG_DISTRIBUTION);
erts_prtsd_set(pp, ERTS_PRTSD_DIST_ENTRY, dep);
erts_prtsd_set(pp, ERTS_PRTSD_CONN_ID, (void*)(UWord)dep->connection_id);

@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 31, 2022

If in any case that one dep in "pending" list happened that "dep->state == ERTS_DE_STATE_IDLE and dep is currently in erts_not_connected_dist_entries list". I think it will result in follow things:

head = &erts_hidden_dist_entries
erts_no_of_pending_dist_entries end up with 1 when erts_do_net_exits is done
erts_no_of_hidden_dist_entries end up with -1 when erts_do_net_exits is done
*head != dep so erts_hidden_dist_entries won't be changed and eventually be empty

dep will be moved to the head of the erts_not_connected_dist_entries
What I don't understand is why the entries after dep previously will be moved to erts_pending_dist_entries,

@sverker
Copy link
Contributor

sverker commented Aug 31, 2022

You patch currently does following things twice, is it expected?

No. If sending the message to net_kernel fails, it resets the values:
if (set_res == 0) {
erts_atomic32_read_band_nob(&pp->state, ~ERTS_PORT_SFLG_DISTRIBUTION);
erts_prtsd_set(pp, ERTS_PRTSD_DIST_ENTRY, NULL);
erts_prtsd_set(pp, ERTS_PRTSD_CONN_ID, NULL);
goto badarg;
}

@sverker
Copy link
Contributor

sverker commented Aug 31, 2022

If in any case that one dep in "pending" list happened that "dep->state == ERTS_DE_STATE_IDLE and dep is currently in erts_not_connected_dist_entries list". I think it will result in follow things:

Sorry, I don't follow.
The bug I fixed would make erts_do_net_exits() change dep->state from PENDING to EXITING here:

if (dep->state == ERTS_DE_STATE_EXITING) {
    ASSERT(erts_atomic32_read_nob(&dep->qflgs) & ERTS_DE_QFLG_EXIT);
}
else {
    dep->state = ERTS_DE_STATE_EXITING;

This is not allowed, only CONNECTED can be set to EXITING.

It will then call erts_set_dist_entry_not_connected() which will treat the dep as if it is not in pending list, but rather in either visible or hidden lists. If dep is first in list, it will mess up the lists and it gets really complicated (for me at least) to figure out how subsequent list operations can possibly further mess it up the lists and if what you saw is a possible outcome.

@zzydxm
Copy link
Contributor Author

zzydxm commented Aug 31, 2022

What I would imagine the steps that triggered the issue would be:

  1. A dist connection is in pending state
  2. net_kernel is shutting down and triggered erts_do_net_exits, which store the dist entry dep in the pending variable
  3. The inet_drv port of the dist connection closed due to some reason (e.g. remote node closed the connection) and also triggered erts_do_net_exits, which eventually did erts_set_dist_entry_not_connected and set dep->state = ERTS_DE_STATE_IDLE, dep -> dflag = 0
  4. erts_do_net_exits for net_kernel calls abort_pending_connection on dep
  5. erts_set_dist_entry_not_connected happens on a dep where dep->state = ERTS_DE_STATE_IDLE, which is unexpected
  6. dist table is messed up and erts_no_of_hidden_dist_entries is set to -1 because when dep -> dflag = 0, head variable in erts_set_dist_entry_not_connected will be set to erts_hidden_dist_entries
    Is this possible?

I think you fix solved that

  1. A dist connection is in pending state
  2. net_kernel is shutting down and triggered erts_do_net_exits, which store the dist entry in the "pending" variable
  3. dist_util.erl triggers erlang:setnode and then create_dist_channel but it failed and goes to exiting state
  4. dep->state was set to ERTS_DE_STATE_IDLE
  5. erts_set_dist_entry_not_connected happens on a dep where dep->state = ERTS_DE_STATE_IDLE
  6. dist table is messed up and erts_no_of_hidden_dist_entries is set to -1

I think both case could happen and you fix is solving one of them, am I understanding right?

@sverker
Copy link
Contributor

sverker commented Sep 1, 2022

A DistEntry (including its state) is protected by its read-write lock:

erts_de_rwlock(dep);
erts_de_rwunlock(dep);

and the DistEntry lists and counters are further protected by one single read-write lock:

erts_rwmtx_rwlock(&erts_dist_table_rwmtx);
erts_rwmtx_rwunlock(&erts_dist_table_rwmtx);

You seem to think that erts_set_dist_entry_not_connected() gets called with dep->state == IDLE. I don't think that was/is possible.

erts_set_dist_entry_not_connected() is called in two places:

  1. In erts_abort_pending_connection_rwunlock() where it only gets called if dep->state == PENDING
  2. In erts_do_net_exits() if found via a port or process. A DistEntry is only hooked into a port with erts_prtsd_set(prt, ERTS_PRTSD_DIST_ENTRY, dep) by erts_internal_create_dist_channel_3() were we change from PENDING to CONNECTED. And there is a similar case when hooked into a controlling process.

My bug scenario was

  1. dep in PENDING state.
  2. Node start shutting down. Start terminate net_kernel marking it as EXITING.
  3. Other scheduler runs erts_internal_create_dist_channel_3
  4. hooks PENDING dep into port
  5. but then fails to send msg to net_kernel as it's EXITING.
  6. The bug leaves PENDING dep while still hooked into port.
  7. erts_do_net_exit() is called for the port with PENDING dep (as the node is shutting down).
  8. dep->state is changed from PENDING to EXITING (which is wrong as it's still in pending list)
  9. erts_set_dist_entry_not_connected() is called which sees EXITING and assumes dep is in visible or hidden lists and messes up counters and the lists.

@zzydxm
Copy link
Contributor Author

zzydxm commented Sep 1, 2022

erts_dist_table_rwmtx is locked during setting pending list, but the lock is released before running abort_pending_connection, during that time, dep is also not locked, so state change could happen on dep
So I wound imagine dep can go to CONNECT and then IDLE during that period, (EXITING also introduces problem)

@sverker
Copy link
Contributor

sverker commented Sep 13, 2022

I think #6258 will fix this. Closing issue.

@sverker sverker closed this as completed Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug in progress team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

3 participants