
ERL-1337: Beam crash for dist connection in idle state #4364

Closed
OTP-Maintainer opened this issue Aug 31, 2020 · 5 comments
Labels: bug, priority:medium, team:VM
Milestone: OTP-23.2


Original reporter: JIRAUSER16503
Affected versions: OTP-21.0, OTP-22.0, OTP-23
Fixed in versions: OTP-23.2, OTP-22.3.4.13
Component: Not Specified
Migrated from: https://bugs.erlang.org/browse/ERL-1337


Here is the backtrace for the crash

OTP Version: otp23

(gdb) bt
#0 ethr_native_atomic64_inc_mb (var=0x10) at ../include/internal/x86_64/../i386/atomic.h:205
#1 ethr_atomic_inc (var=0x10) at ../include/internal/ethr_atomics.h:5117
#2 erts_mon_link_dist_inc_refc (mld=0x0) at beam/erl_monitor_link.h:557
#3 erts_monitor_dist_insert (mon=mon@entry=0x7f682409e5b0, dist=0x0) at beam/erl_monitor_link.h:1453
#4 0x0000000000554dc1 in erts_net_message (prt=prt@entry=0x0, dep=dep@entry=0x7f679c923288, conn_id=<optimized out>, hbuf=hbuf@entry=0x0, hlen=hlen@entry=0, bin=<optimized out>, buf=buf@entry=0x7f6633f37910 "\203D\002\a", len=len@entry=62)
 at beam/dist.c:2110
#5 0x000000000055557a in dist_ctrl_put_data_2 (A__p=0x7f66df2ff6b8, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/dist.c:3899
#6 0x0000000000441a11 in process_main (x_reg_array=0x7f698d10a380, f_reg_array=0xfffffffe) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:331
#7 0x000000000045a422 in sched_thread_func (vesdp=0x7f698d6ccec0) at beam/erl_process.c:8520
#8 0x00000000006a9a81 in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
#9 0x00007f69dc4b9ea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f69dbfda8dd in clone () from /lib64/libc.so.6
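
The key detail in the frames above: erts_monitor_dist_insert() is called with dist=0x0 (frame #3), so erts_mon_link_dist_inc_refc() gets mld=0x0 (frame #2), and the atomic increment in frame #0 operates on var=0x10, i.e. a field offset into a NULL pointer. The structure dump below (apparently the connection's DistEntry) confirms the connection is in ERTS_DE_STATE_IDLE with mld = 0x0. A minimal standalone sketch of the 0x10 fault address (not the actual OTP source; the struct layout and field names here are invented for illustration):

/* Illustration only -- not the OTP source. A made-up stand-in for the
 * monitor/link-dist struct; the point is that incrementing a counter field
 * of a NULL struct pointer faults on the field's offset (0x10 here),
 * matching "var=0x10" in frame #0. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    void *dist;   /* hypothetical field, offset 0x00 */
    void *lock;   /* hypothetical field, offset 0x08 */
    long  refc;   /* hypothetical refcount, offset 0x10 */
} fake_mon_link_dist;

int main(void)
{
    /* With mld == NULL, the address of mld->refc is numerically just the
     * field offset, so the refcount increment dereferences address 0x10
     * and segfaults. */
    printf("offsetof(refc) = 0x%zx\n", offsetof(fake_mon_link_dist, refc));
    return 0;
}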

$14 = {
 hash_bucket = {
 next = 0x7f672344da10,
 hvalue = 81170127
 },
 next = 0x7f677c505628,
 prev = 0x0,
 rwmtx = {
 rwmtx = {
 mtxb = {
 flgs = {
 counter = 0
 },
 aux_scnt = 50,
 main_scnt = 2000,
 qlck = {
 __data = {
 __lock = 0,
 __count = 0,
 __owner = 0,
 __nusers = 0,
 __kind = 0,
 __spins = 0,
 __elision = 0,
 __list = {
 __prev = 0x0,
 __next = 0x0
 }
 },
 __size = '\000' <repeats 39 times>,
 __align = 0
 },
 q = 0x0
 },
 type = ETHR_RWMUTEX_TYPE_EXTREMELY_FREQUENT_READ,
 rq_end = 0x0,
 tdata = {
 ra = 0x7f679c9943c0,
 rs = -1667677248
 }
 }
 },
 sysname = 6290507,
 creation = 1,
 input_handler = {
 counter = -5
 },
 cid = 18446744073709551611,
 connection_id = 1,
 state = ERTS_DE_STATE_IDLE,
 pending_nodedown = 0,
 suspended_nodeup = 0x0,
 dflags = 0,
 opts = 0,
 mld = 0x0,
 qlock = {
 mtx = {
 pt_mtx = {
 __data = {
 __lock = 0,
 __count = 0,
 __owner = 0,
 __nusers = 0,
 __kind = 0,
 __spins = 0,
 __elision = 0,
 __list = {
 __prev = 0x0,
 __next = 0x0
 }
 },
 __size = '\000' <repeats 39 times>,
 __align = 0
 }
 }
 },
 qflgs = {
 counter = 0
 },
 qsize = {
 counter = 0
 },
 in = {
 counter = 0
 },
 out = {
 counter = 0
 },
 out_queue = {
 first = 0x0,
 last = 0x0
 },
 suspended = 0x0,
 tmp_out_queue = {
 first = 0x0,
 last = 0x0
 },
 finalized_out_queue = {
 first = 0x0,
 last = 0x0
 },
 dist_cmd_scheduled = {
 counter = 0
 },
 dist_cmd = {
 counter = 0
 },
 send = 0x0,
 cache = 0x0,
 later_op = {
 later = 5095115,
 func = 0x7f679c9233fa,
 data = 0xf0,
 next = 0x7f680b30e428
 },
 sequences = 0x0
}

(gdb) p mdp
$15 = (ErtsMonitorData *) 0x7f682409e5b0
(gdb) p *mdp
$16 = {
 origin = {
 node = {
 signal = {
 next = 0x7f6824040000,
 specific = {
 next = 0x7f6824043050,
 attachment = 0x7f6824043050
 },
 tag = 140085258825368
 },
 tree = {
 parent = 140085257568256,
 right = 0x7f6824043050,
 left = 0x7f6824172e98
 },
 list = {
 next = 0x7f6824040000,
 prev = 0x7f6824043050
 }
 },
 other = {
 item = 12163347387939,
 ptr = 0xb1000001623
 },
 offset = 0,
 key_offset = 80,
 flags = 16,
 type = 3
 },
 target = {
 node = {
 signal = {
 next = 0x0,
 specific = {
 next = 0x0,
 attachment = 0x0
 },
 tag = 0
 },
 tree = {
 parent = 0,
 right = 0x0,
 left = 0x0
 },
 list = {
 next = 0x0,
 prev = 0x0
 }
 },
 other = {
 item = 140085257954898,
 ptr = 0x7f682409e652
 },
 offset = 40,
 key_offset = 40,
 flags = 17,
 type = 3
 },
 ref = 140085257954858,
 refc = {
 counter = 2
 }
}

Please do let me know if you need any additional information.

sverker said:

Can you share the core dump file and beam.smp?


JIRAUSER16503 said:

Unfortunately, our company policy does not allow us to share the core dump, but I can run commands and share the results from the core file with you.


JIRAUSER16503 said:

Looking further into the crash, this thread is trying to monitor the same peer in spg.

(gdb) bt
#0 0x00000000005754fd in get_match_pseudo_process (heap_size=40, c_p=0x7f69951c0e78) at beam/erl_db_util.c:461
#1 db_prog_match (c_p=c_p@entry=0x7f69951c0e78, self=self@entry=0x7f69951c0e78, bprog=0x7f6777f4b6a8, term=term@entry=140084277043058, termp=termp@entry=0x0, arity=arity@entry=0,
 in_flags=in_flags@entry=(ERTS_PAM_COPY_RESULT | ERTS_PAM_CONTIGUOUS_TUPLE), return_flags=return_flags@entry=0x7f69824ad8f0) at beam/erl_db_util.c:2048
#2 0x000000000057c16b in db_match_dbterm (tb=tb@entry=0x7f68272ddef0, c_p=0x7f69951c0e78, bprog=<optimized out>, obj=0x7f67e9925f60, hpp=hpp@entry=0x7f69824adc18, extra=extra@entry=2) at beam/erl_db_util.c:5428
#3 0x0000000000584bf5 in match_traverse (ctx=ctx@entry=0x7f69824adbd0, pattern=<optimized out>, chunk_size=chunk_size@entry=0, hpp=hpp@entry=0x7f69824adc18, lock_for_write=lock_for_write@entry=0, ret=0x7f69824adc50, iterations_left=106,
 extra_match_validator=0x0) at beam/erl_db_hash.c:1601
#4 0x000000000058538a in db_select_chunk_hash (safety=ITER_SAFE, ret=<optimized out>, reverse=<optimized out>, chunk_size=0, pattern=<optimized out>, tid=<optimized out>, tbl=<optimized out>, p=<optimized out>) at beam/erl_db_hash.c:2028
#5 db_select_hash (p=<optimized out>, tbl=<optimized out>, tid=<optimized out>, pattern=<optimized out>, reverse=<optimized out>, ret=<optimized out>, safety=ITER_SAFE) at beam/erl_db_hash.c:2006
#6 0x0000000000560e78 in ets_select2 (p=p@entry=0x7f69951c0e78, tb=0x7f68272ddef0, tid=<optimized out>, ms=<optimized out>) at beam/erl_db.c:3459
#7 0x000000000056b113 in ets_select_2 (A__p=0x7f69951c0e78, BIF__ARGS=0x7f698d167280, A__I=<optimized out>) at beam/erl_db.c:3440
#8 0x00000000004437b6 in process_main (x_reg_array=0x7f698d167280, f_reg_array=0x7f66c75d0960) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:491
#9 0x000000000045a422 in sched_thread_func (vesdp=0x7f698d7b7f40) at beam/erl_process.c:8520
#10 0x00000000006a9a81 in thr_wrapper (vtwd=<optimized out>) at pthread/ethread.c:118
#11 0x00007f69dc4b9ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f69dbfda8dd in clone () from /lib64/libc.so.6



(gdb) etp-process-info p
 Pid: <0.354.0>
 State: active-sys | sig-in-q | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal

Flags: using-db
 Registered name: spg
 Current function: unknown
 I: #Cp<gen_server:loop/7+0x1e0>
 Heap size: 121536
 Old-heap size: 833026
 Mbuf size: 2362
 Msgq len: 1 (inner=0, outer=1)
 Msgq Flags: on-heap
 Parent: <0.333.0>
 Pointer: (Process*)0x7f69951c0e78

(gdb) etp-stacktrace p
%%% WARNING: The process is currently running, so c_p->stop will not be correct
%%% Consider using -emu variant instead
% Stacktrace (26)
I: #Cp<gen_server:loop/7+0x1e0>.
0: #Cp<spg:handle_info/2+0xbb0>.
5: #Cp<gen_server:try_dispatch/4+0x80>.
14: #Cp<gen_server:handle_msg/6+0x4b0>.
21: #Cp<proc_lib:init_p_do_apply/3+0x50>.
25: #Cp<terminate process normally>.

 

(gdb) etp-stackdump p
%%% WARNING: The process is currently running, so c_p->stop will not be correct
%%% Consider using -emu variant instead
% Stacktrace (26)
I: #Cp<gen_server:loop/7+0x1e0>.
0: #Cp<spg:handle_info/2+0xbb0>.
1: [].
2: <0.354.0>.
3: <remote/1.205.0>.
4: {state,spg,

Could it be that we are monitoring a pid that may already have been disconnected, and there is a race in spg between the disconnect and the monitor?

The crash is on the beam with pid <0.354.0>, and the segfaulting thread was trying to connect to <remote/1.205.0>.


sverker said:

We think we know the problem. There is a race between handling incoming data (the crashing thread) and another thread handling outgoing data on the same distribution channel, causing the connection to fail.
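
A generic sketch of that race shape (not the actual OTP code; every name below is invented): one thread handles incoming data on the connection and uses the per-connection monitor/link state, while another thread, handling a failure on the outgoing side, tears that state down. If the reader checks the connection state before the teardown but uses the state afterwards without re-validating under the lock, it ends up dereferencing a NULL pointer, which is consistent with the mld = 0x0 / ERTS_DE_STATE_IDLE combination in the first backtrace.

/* Illustration of the race class only -- not the OTP code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct connection {
    pthread_mutex_t lock;
    int state;              /* 1 = connected, 0 = idle */
    long *mon_link_state;   /* freed and NULLed when the connection fails */
};

static void *incoming_data_thread(void *arg)
{
    struct connection *c = arg;

    pthread_mutex_lock(&c->lock);
    int connected = c->state;     /* check... */
    pthread_mutex_unlock(&c->lock);

    usleep(1000);                 /* window in which the teardown can run */

    if (connected) {
        /* ...then use: the per-connection state is used without re-checking
         * the connection under the lock, so this may dereference NULL. */
        (*c->mon_link_state)++;
    }
    return NULL;
}

static void *outgoing_failure_thread(void *arg)
{
    struct connection *c = arg;

    pthread_mutex_lock(&c->lock);
    c->state = 0;                 /* outgoing side failed -> connection idle */
    free(c->mon_link_state);
    c->mon_link_state = NULL;
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

int main(void)
{
    struct connection c = { PTHREAD_MUTEX_INITIALIZER, 1, calloc(1, sizeof(long)) };
    pthread_t in, out;

    pthread_create(&in, NULL, incoming_data_thread, &c);
    pthread_create(&out, NULL, outgoing_failure_thread, &c);
    pthread_join(in, NULL);
    pthread_join(out, NULL);
    puts("survived this run; whether it crashes depends on thread timing");
    return 0;
}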


sverker said:

Proposed fix: https://github.com/erlang/otp/pull/2780

OTP-Maintainer added the bug, team:VM, and priority:medium labels on Feb 10, 2021
OTP-Maintainer added this to the OTP-23.2 milestone on Feb 10, 2021