
ERL-1337: Beam crash for dist connection in idle state #4364

Closed
OTP-Maintainer opened this issue Aug 31, 2020 · 5 comments
Labels: bug, priority:medium, team:VM
Milestone: OTP-23.2


Original reporter: JIRAUSER16503
Affected versions: OTP-21.0, OTP-22.0, OTP-23
Fixed in versions: OTP-23.2, OTP-22.3.4.13
Component: Not Specified
Migrated from: https://bugs.erlang.org/browse/ERL-1337


Here is the backtrace for the crash

OTP Version: otp23

(gdb) bt
#0 ethr_native_atomic64_inc_mb (var=0x10) at ../include/internal/x86_64/../i386/atomic.h:205
#1 ethr_atomic_inc (var=0x10) at ../include/internal/ethr_atomics.h:5117
#2 erts_mon_link_dist_inc_refc (mld=0x0) at beam/erl_monitor_link.h:557
#3 erts_monitor_dist_insert (mon=mon@entry=0x7f682409e5b0, dist=0x0) at beam/erl_monitor_link.h:1453
#4 0x0000000000554dc1 in erts_net_message (prt=prt@entry=0x0, dep=dep@entry=0x7f679c923288, conn_id=<optimized out>, hbuf=hbuf@entry=0x0, hlen=hlen@entry=0, bin=<optimized out>, buf=buf@entry=0x7f6633f37910 "\203D\002\a", len=len@entry=62)
 at beam/dist.c:2110
#5 0x000000000055557a in dist_ctrl_put_data_2 (A__p=0x7f66df2ff6b8, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/dist.c:3899
#6 0x0000000000441a11 in process_main (x_reg_array=0x7f698d10a380, f_reg_array=0xfffffffe) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:331
#7 0x000000000045a422 in sched_thread_func (vesdp=0x7f698d6ccec0) at beam/erl_process.c:8520
#8 0x00000000006a9a81 in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
#9 0x00007f69dc4b9ea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f69dbfda8dd in clone () from /lib64/libc.so.6
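
The key detail in the frames above: erts_monitor_dist_insert() is called with dist=0x0 (frame #3), so erts_mon_link_dist_inc_refc() gets mld=0x0 (frame #2), and the atomic increment in frame #0 operates on var=0x10, i.e. a field offset into a NULL pointer. The structure dump below (apparently the connection's DistEntry) confirms the connection is in ERTS_DE_STATE_IDLE with mld = 0x0. A minimal standalone sketch of the 0x10 fault address (not the actual OTP source; the struct layout and field names here are invented for illustration):

/* Illustration only -- not the OTP source. A made-up stand-in for the
 * monitor/link-dist struct; the point is that incrementing a counter field
 * of a NULL struct pointer faults on the field's offset (0x10 here),
 * matching "var=0x10" in frame #0. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    void *dist;   /* hypothetical field, offset 0x00 */
    void *lock;   /* hypothetical field, offset 0x08 */
    long  refc;   /* hypothetical refcount, offset 0x10 */
} fake_mon_link_dist;

int main(void)
{
    /* With mld == NULL, the address of mld->refc is numerically just the
     * field offset, so the refcount increment dereferences address 0x10
     * and segfaults. */
    printf("offsetof(refc) = 0x%zx\n", offsetof(fake_mon_link_dist, refc));
    return 0;
}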

$14 = {
 hash_bucket = {
 next = 0x7f672344da10,
 hvalue = 81170127
 },
 next = 0x7f677c505628,
 prev = 0x0,
 rwmtx = {
 rwmtx = {
 mtxb = {
 flgs = {
 counter = 0
 },
 aux_scnt = 50,
 main_scnt = 2000,
 qlck = {
 __data = {
 __lock = 0,
 __count = 0,
 __owner = 0,
 __nusers = 0,
 __kind = 0,
 __spins = 0,
 __elision = 0,
 __list = {
 __prev = 0x0,
 __next = 0x0
 }
 },
 __size = '\000' <repeats 39 times>,
 __align = 0
 },
 q = 0x0
 },
 type = ETHR_RWMUTEX_TYPE_EXTREMELY_FREQUENT_READ,
 rq_end = 0x0,
 tdata = {
 ra = 0x7f679c9943c0,
 rs = -1667677248
 }
 }
 },
 sysname = 6290507,
 creation = 1,
 input_handler = {
 counter = -5
 },
 cid = 18446744073709551611,
 connection_id = 1,
 state = ERTS_DE_STATE_IDLE,
 pending_nodedown = 0,
 suspended_nodeup = 0x0,
 dflags = 0,
 opts = 0,
 mld = 0x0,
 qlock = {
 mtx = {
 pt_mtx = {
 __data = {
 __lock = 0,
 __count = 0,
 __owner = 0,
 __nusers = 0,
 __kind = 0,
 __spins = 0,
 __elision = 0,
 __list = {
 __prev = 0x0,
 __next = 0x0
 }
 },
 __size = '\000' <repeats 39 times>,
 __align = 0
 }
 }
 },
 qflgs = {
 counter = 0
 },
 qsize = {
 counter = 0
 },
 in = {
 counter = 0
 },
 out = {
 counter = 0
 },
 out_queue = {
 first = 0x0,
 last = 0x0
 },
 suspended = 0x0,
 tmp_out_queue = {
 first = 0x0,
 last = 0x0
 },
 finalized_out_queue = {
 first = 0x0,
 last = 0x0
 },
 dist_cmd_scheduled = {
 counter = 0
 },
 dist_cmd = {
 counter = 0
 },
 send = 0x0,
 cache = 0x0,
 later_op = {
 later = 5095115,
 func = 0x7f679c9233fa,
 data = 0xf0,
 next = 0x7f680b30e428
 },
 sequences = 0x0
}

(gdb) p mdp
$15 = (ErtsMonitorData *) 0x7f682409e5b0
(gdb) p *mdp
$16 = {
 origin = {
 node = {
 signal = {
 next = 0x7f6824040000,
 specific = {
 next = 0x7f6824043050,
 attachment = 0x7f6824043050
 },
 tag = 140085258825368
 },
 tree = {
 parent = 140085257568256,
 right = 0x7f6824043050,
 left = 0x7f6824172e98
 },
 list = {
 next = 0x7f6824040000,
 prev = 0x7f6824043050
 }
 },
 other = {
 item = 12163347387939,
 ptr = 0xb1000001623
 },
 offset = 0,
 key_offset = 80,
 flags = 16,
 type = 3
 },
 target = {
 node = {
 signal = {
 next = 0x0,
 specific = {
 next = 0x0,
 attachment = 0x0
 },
 tag = 0
 },
 tree = {
 parent = 0,
 right = 0x0,
 left = 0x0
 },
 list = {
 next = 0x0,
 prev = 0x0
 }
 },
 other = {
 item = 140085257954898,
 ptr = 0x7f682409e652
 },
 offset = 40,
 key_offset = 40,
 flags = 17,
 type = 3
 },
 ref = 140085257954858,
 refc = {
 counter = 2
 }
}

Please do let me know if you need any additional information.

sverker said:

Can you share the core dump file and beam.smp?


JIRAUSER16503 said:

Unfortunately, our company policy does not allow us to share the core dump, but I can run commands and share the results from the core file with you.


JIRAUSER16503 said:

Looking further into the crash, this thread is trying to monitor the same peer in spg.

(gdb) bt
#0 0x00000000005754fd in get_match_pseudo_process (heap_size=40, c_p=0x7f69951c0e78) at beam/erl_db_util.c:461
#1 db_prog_match (c_p=c_p@entry=0x7f69951c0e78, self=self@entry=0x7f69951c0e78, bprog=0x7f6777f4b6a8, term=term@entry=140084277043058, termp=termp@entry=0x0, arity=arity@entry=0,
 in_flags=in_flags@entry=(ERTS_PAM_COPY_RESULT | ERTS_PAM_CONTIGUOUS_TUPLE), return_flags=return_flags@entry=0x7f69824ad8f0) at beam/erl_db_util.c:2048
#2 0x000000000057c16b in db_match_dbterm (tb=tb@entry=0x7f68272ddef0, c_p=0x7f69951c0e78, bprog=<optimized out>, obj=0x7f67e9925f60, hpp=hpp@entry=0x7f69824adc18, extra=extra@entry=2) at beam/erl_db_util.c:5428
#3 0x0000000000584bf5 in match_traverse (ctx=ctx@entry=0x7f69824adbd0, pattern=<optimized out>, chunk_size=chunk_size@entry=0, hpp=hpp@entry=0x7f69824adc18, lock_for_write=lock_for_write@entry=0, ret=0x7f69824adc50, iterations_left=106,
 extra_match_validator=0x0) at beam/erl_db_hash.c:1601
#4 0x000000000058538a in db_select_chunk_hash (safety=ITER_SAFE, ret=<optimized out>, reverse=<optimized out>, chunk_size=0, pattern=<optimized out>, tid=<optimized out>, tbl=<optimized out>, p=<optimized out>) at beam/erl_db_hash.c:2028
#5 db_select_hash (p=<optimized out>, tbl=<optimized out>, tid=<optimized out>, pattern=<optimized out>, reverse=<optimized out>, ret=<optimized out>, safety=ITER_SAFE) at beam/erl_db_hash.c:2006
#6 0x0000000000560e78 in ets_select2 (p=p@entry=0x7f69951c0e78, tb=0x7f68272ddef0, tid=<optimized out>, ms=<optimized out>) at beam/erl_db.c:3459
#7 0x000000000056b113 in ets_select_2 (A__p=0x7f69951c0e78, BIF__ARGS=0x7f698d167280, A__I=<optimized out>) at beam/erl_db.c:3440
#8 0x00000000004437b6 in process_main (x_reg_array=0x7f698d167280, f_reg_array=0x7f66c75d0960) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:491
#9 0x000000000045a422 in sched_thread_func (vesdp=0x7f698d7b7f40) at beam/erl_process.c:8520
#10 0x00000000006a9a81 in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
#11 0x00007f69dc4b9ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f69dbfda8dd in clone () from /lib64/libc.so.6



(gdb) etp-process-info p
 Pid: <0.354.0>
 State: active-sys | sig-in-q | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal

Flags: using-db
 Registered name: spg
 Current function: unknown
 I: #Cp<gen_server:loop/7+0x1e0>
 Heap size: 121536
 Old-heap size: 833026
 Mbuf size: 2362
 Msgq len: 1 (inner=0, outer=1)
 Msgq Flags: on-heap
 Parent: <0.333.0>
 Pointer: (Process*)0x7f69951c0e78

(gdb) etp-stacktrace p
%%% WARNING: The process is currently running, so c_p->stop will not be correct
%%% Consider using -emu variant instead
% Stacktrace (26)
I: #Cp<gen_server:loop/7+0x1e0>.
0: #Cp<spg:handle_info/2+0xbb0>.
5: #Cp<gen_server:try_dispatch/4+0x80>.
14: #Cp<gen_server:handle_msg/6+0x4b0>.
21: #Cp<proc_lib:init_p_do_apply/3+0x50>.
25: #Cp<terminate process normally>.

 

(gdb) etp-stackdump p
%%% WARNING: The process is currently running, so c_p->stop will not be correct
%%% Consider using -emu variant instead
% Stacktrace (26)
I: #Cp<gen_server:loop/7+0x1e0>.
0: #Cp<spg:handle_info/2+0xbb0>.
1: [].
2: <0.354.0>.
3: <remote/1.205.0>.
4: {state,spg,

Could it be that we are monitoring a pid that may already have been disconnected, and there is a race in spg between the disconnect and the monitor?

The crash is on the beam with pid <0.354.0>, and the segfaulting thread was trying to connect to <remote/1.205.0>.


sverker said:

We think we know the problem. There is a race between handling incoming data (the crashing thread) and another thread handling outgoing data on the same distribution channel, causing the connection to fail.
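
A generic sketch of that race shape (not the actual OTP code; every name below is invented): one thread handles incoming data on the connection and uses the per-connection monitor/link state, while another thread, handling a failure on the outgoing side, tears that state down. If the reader checks the connection state before the teardown but uses the state afterwards without re-validating under the lock, it ends up dereferencing a NULL pointer, which is consistent with the mld = 0x0 / ERTS_DE_STATE_IDLE combination in the first backtrace.

/* Illustration of the race class only -- not the OTP code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct connection {
    pthread_mutex_t lock;
    int state;              /* 1 = connected, 0 = idle */
    long *mon_link_state;   /* freed and NULLed when the connection fails */
};

static void *incoming_data_thread(void *arg)
{
    struct connection *c = arg;

    pthread_mutex_lock(&c->lock);
    int connected = c->state;     /* check... */
    pthread_mutex_unlock(&c->lock);

    usleep(1000);                 /* window in which the teardown can run */

    if (connected) {
        /* ...then use: the per-connection state is used without re-checking
         * the connection under the lock, so this may dereference NULL. */
        (*c->mon_link_state)++;
    }
    return NULL;
}

static void *outgoing_failure_thread(void *arg)
{
    struct connection *c = arg;

    pthread_mutex_lock(&c->lock);
    c->state = 0;                 /* outgoing side failed -> connection idle */
    free(c->mon_link_state);
    c->mon_link_state = NULL;
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

int main(void)
{
    struct connection c = { PTHREAD_MUTEX_INITIALIZER, 1, calloc(1, sizeof(long)) };
    pthread_t in, out;

    pthread_create(&in, NULL, incoming_data_thread, &c);
    pthread_create(&out, NULL, outgoing_failure_thread, &c);
    pthread_join(in, NULL);
    pthread_join(out, NULL);
    puts("survived this run; whether it crashes depends on thread timing");
    return 0;
}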


sverker said:

Proposed fix: https://github.com/erlang/otp/pull/2780

OTP-Maintainer added the bug, team:VM, and priority:medium labels on Feb 10, 2021
OTP-Maintainer added this to the OTP-23.2 milestone on Feb 10, 2021