Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERL-1064: SIGBUS occured at erts_emasculate_writable_binary #3920

Closed
OTP-Maintainer opened this issue Oct 10, 2019 · 15 comments
Closed

ERL-1064: SIGBUS occured at erts_emasculate_writable_binary #3920

OTP-Maintainer opened this issue Oct 10, 2019 · 15 comments
Assignees
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Milestone

Comments

@OTP-Maintainer
Copy link

Original reporter: shino
Affected version: OTP-22.1
Fixed in versions: OTP-22.1.7, OTP-21.3.8.11
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-1064


SIGBUS occurred at erts_emasculate_writable_binary.
Because the application is a commercial software, sharing full core dump is not possible.
I have core dump locally, any additional information except commercial part can be shared.
Also, the error is reproducible and can be debugged (by me) with some patches if needed.

Some notes on the commercial application under the case:
- Roughly, an UDP proxy/filter, intensively uses inet. Concretely, it's an WebRTC SFU.
- Binary operations are used in many parts, creations, binary matches, concat some values to binaries
- No native code (NIFs, drivers) other than OTP/BEAM
- SIGBUS errors are reproducible with some steps, within 1 minute
- Details below are SIGBUS case on Linux, but errors happened on macOS too

Environment information:
- OTP version: OTP 22.1
- Configure flags when building
  - --enable-kernel-poll --enable-threads --enable-dynamic-ssl-lib --enable-dirty-schedulers  --enable-sharing-preserving --without-javac --without-odbc --disable-hipe --disable-native-libs --disable-sctp
- Application versions if different than the main distribution
  - all apps are default
- Any source patches that may be relevant
  - no patches, just cloned from github
- Operating system/distribution
  - x86_64 GNU/Linux, ArchLinux

BEAM related information:
- Any NIF modules (including OTP-supplied modules) that are loaded when the error occurs
  - only OTP-supplied ones: crypto
  - not sure inside BEAM nifs, but prim_file is used (not intensively)
- Any dynamically loaded drivers (including OTP-supplied) that are in use when the error occurs
  - inet driver (not sure they are dynamically loaded)
- Any parameters used when starting the VM (e.g. -smpdisable +A 1024)
  - +Q 65536 +P 200000 +e 10000 +spp true +Mulmbcs 32767 +Mumbcgs 1 +Musmbcs 2047
- Hardware configuration (e.g. 4 core Intel i7 with 4 GB of memory)
  - CPU: Intel Core i9 9980HK 2.4GHz with 32GB RAM
- If there is a core dump, debug the core and do a backtrace for each thread:
  - will attach

Some narrowing cause effort:
- By removing --enable-sharing-preserving, the error disappears, works fine completely
- Using OTP 22.0.7, no error happens, regardless --enable-sharing-preserving or not


Core dump examination by gdb from here.

* Message of gdb just after core is loaded:

Core was generated by `/home/shino/g/sora/_build/dev/rel/sora/bin/sora -Q 65536 -P 200000 -e 10000 -sp'.
Program terminated with signal SIGBUS, Bus error.
#0  erts_emasculate_writable_binary (pb=pb@entry=0x7f7648a744b0) at beam/erl_bits.c:1647
1647        if (unused >= 8) {
[Current thread is 1 (Thread 0x7f764ab73700 (LWP 620397))]


* The thread backtrace which triggered SIGBUS:

(gdb) bt
#0  erts_emasculate_writable_binary (pb=pb@entry=0x7f7648a744b0) at beam/erl_bits.c:1647
#1  0x0000557b3ef07ca1 in copy_shared_perform_x (obj=<optimized out>, obj@entry=140146001791186, size=size@entry=115, info=info@entry=0x7f764ab72a10, hpp=hpp@entry=0x7f764ab72a08, off_heap=0x7f764f240d18)
    at beam/copy.c:1665
#2  0x0000557b3ef3b6ba in erts_send_message (sender=sender@entry=0x7f7648411838, receiver=receiver@entry=0x7f764f240b10, receiver_locks=receiver_locks@entry=0x7f764ab72c44, message=message@entry=140146001791186)
    at beam/erl_message.c:760
#3  0x0000557b3ef166fe in do_send (p=p@entry=0x7f7648411838, to=to@entry=11098195498035, msg=msg@entry=140146001791186, return_term=return_term@entry=140146001791186, refp=refp@entry=0x7f764ab72ca8,
    dist_ctx=dist_ctx@entry=0x7f764ab72cb0, connect=1, suspend=1) at beam/bif.c:2094
#4  0x0000557b3ef1b027 in erl_send (p=0x7f7648411838, to=11098195498035, msg=140146001791186) at beam/bif.c:2263
#5  0x0000557b3ee6b33f in process_main (x_reg_array=0x7f7648a744b0, f_reg_array=0x7f7648a477da) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5297
#6  0x0000557b3ee71447 in sched_thread_func (vesdp=0x7f764def1f80) at beam/erl_process.c:8465
#7  0x0000557b3f0a4f8a in thr_wrapper (vtwd=0x7ffde6b48860) at pthread/ethread.c:118
#8  0x00007f76927a74cf in start_thread () from /usr/lib/libpthread.so.0
#9  0x00007f76926c92d3 in clone () from /usr/lib/libc.so.6


* Disassemble erts_emasculate_writable_binary:

(gdb) set disassembly-flavor intel
(gdb) disas erts_emasculate_writable_binary
Dump of assembler code for function erts_emasculate_writable_binary:
   0x0000557b3efcb7e0 <+0>:     push   r12
   0x0000557b3efcb7e2 <+2>:     push   rbp
   0x0000557b3efcb7e3 <+3>:     push   rbx
   0x0000557b3efcb7e4 <+4>:     mov    rbp,QWORD PTR [rdi+0x18]
   0x0000557b3efcb7e8 <+8>:     mov    r12,QWORD PTR [rdi+0x8]
   0x0000557b3efcb7ec <+12>:    mov    QWORD PTR [rdi+0x28],0x0
=> 0x0000557b3efcb7f4 <+20>:    mov    rax,QWORD PTR [rbp+0x10]
   0x0000557b3efcb7f8 <+24>:    sub    rax,r12
   0x0000557b3efcb7fb <+27>:    cmp    rax,0x7
   0x0000557b3efcb7ff <+31>:    ja     0x557b3efcb810 <erts_emasculate_writable_binary+48>
   0x0000557b3efcb801 <+33>:    pop    rbx
   0x0000557b3efcb802 <+34>:    pop    rbp
   0x0000557b3efcb803 <+35>:    pop    r12
   0x0000557b3efcb805 <+37>:    ret
   0x0000557b3efcb806 <+38>:    nop    WORD PTR cs:[rax+rax*1+0x0]
   0x0000557b3efcb810 <+48>:    mov    rbx,rdi
   0x0000557b3efcb813 <+51>:    test   BYTE PTR [rbp+0x0],0x2
   0x0000557b3efcb817 <+55>:    jne    0x557b3efcb860 <erts_emasculate_writable_binary+128>
   0x0000557b3efcb819 <+57>:    movabs rax,0x1fffffffffffffff
   0x0000557b3efcb823 <+67>:    cmp    r12,rax
   0x0000557b3efcb826 <+70>:    ja     0x557b3efcb8a0 <erts_emasculate_writable_binary+192>
   0x0000557b3efcb828 <+72>:    mov    edi,0x632a
   0x0000557b3efcb82d <+77>:    mov    rsi,QWORD PTR [rip+0x1bc544]        # 0x557b3f187d78 <erts_allctrs+344>
   0x0000557b3efcb834 <+84>:    lea    rcx,[r12+0x1f]
   0x0000557b3efcb839 <+89>:    mov    rdx,rbp
   0x0000557b3efcb83c <+92>:    call   QWORD PTR [rip+0x1bc526]        # 0x557b3f187d68 <erts_allctrs+328>
   0x0000557b3efcb842 <+98>:    test   rax,rax
   0x0000557b3efcb845 <+101>:   je     0x557b3efcb880 <erts_emasculate_writable_binary+160>
   0x0000557b3efcb847 <+103>:   mov    QWORD PTR [rax+0x10],r12
   0x0000557b3efcb84b <+107>:   mov    QWORD PTR [rbx+0x18],rax
   0x0000557b3efcb84f <+111>:   add    rax,0x18
   0x0000557b3efcb853 <+115>:   mov    QWORD PTR [rbx+0x20],rax
   0x0000557b3efcb857 <+119>:   pop    rbx
   0x0000557b3efcb858 <+120>:   pop    rbp
   0x0000557b3efcb859 <+121>:   pop    r12
   0x0000557b3efcb85b <+123>:   ret
   0x0000557b3efcb85c <+124>:   nop    DWORD PTR [rax+0x0]
   0x0000557b3efcb860 <+128>:   movabs rax,0x1fffffffffffffff
   0x0000557b3efcb86a <+138>:   cmp    r12,rax
   0x0000557b3efcb86d <+141>:   ja     0x557b3efcb886 <erts_emasculate_writable_binary+166>
   0x0000557b3efcb86f <+143>:   mov    edi,0x62aa
   0x0000557b3efcb874 <+148>:   jmp    0x557b3efcb82d <erts_emasculate_writable_binary+77>
   0x0000557b3efcb876 <+150>:   nop    WORD PTR cs:[rax+rax*1+0x0]
   0x0000557b3efcb880 <+160>:   test   BYTE PTR [rbp+0x0],0x2
   0x0000557b3efcb884 <+164>:   je     0x557b3efcb8a0 <erts_emasculate_writable_binary+192>
   0x0000557b3efcb886 <+166>:   mov    rdx,r12
   0x0000557b3efcb889 <+169>:   mov    rsi,rbp
   0x0000557b3efcb88c <+172>:   mov    edi,0x62aa
   0x0000557b3efcb891 <+177>:   call   0x557b3eeb22a0 <erts_realloc_enomem>
   0x0000557b3efcb896 <+182>:   nop    WORD PTR cs:[rax+rax*1+0x0]
   0x0000557b3efcb8a0 <+192>:   mov    rdx,r12
   0x0000557b3efcb8a3 <+195>:   mov    rsi,rbp
   0x0000557b3efcb8a6 <+198>:   mov    edi,0x632a
   0x0000557b3efcb8ab <+203>:   call   0x557b3eeb22a0 <erts_realloc_enomem>
End of assembler dump.


* Corresponding source code of the "=>" mark in the above disassemble

binp->orig_size at the line
https://github.com/erlang/otp/blob/OTP-21.1/erts/emulator/beam/erl_bits.c#L1603
(disclaimer: I'm not good C programmer, this information may be wrong...)


* Erlang object to be copied

Some erlang terms are shown as BoxedError.
Line breaks and spaces are inserted by band for readability.

(gdb) frame 3
#3  0x0000557b3ef166fe in do_send (p=p@entry=0x7f7648411838, to=to@entry=11098195498035,
    msg=msg@entry=140146001791186, return_term=return_term@entry=140146001791186, refp=refp@entry=0x7f764ab72ca8,
    dist_ctx=dist_ctx@entry=0x7f764ab72cb0, connect=1, suspend=1) at beam/bif.c:2094
2094            erts_send_message(p, rp, &rp_locks, msg);
(gdb) etp msg
{multistream_rtp,{rtp,2,0,1,0,1,120,403,571602579,2109553208,[],
 #HeapBinary<0x18,0x601f9220500debe,0xf000f000200018c,0x1200>,
 {48862,[{rtp_ext_abs_send_time,16318726},{rtp_ext_video_timing,#BoxedError<0x7f7648a73b92>,
          2,15,15,18,0,0}]},
 #HeapBinary<0x34,0x4f18054c4a2e0b0,0x442c2c0c241c1900,0x19890000a84124cc,0x3ef4ab0b783c04,
         0x707480dcc037bb09,0x60d6db535fadacc5,0x7f760084c397>},
 #{Keys:{connection_id,rtp_metadata}
 Values:{
   #HeapBinary<0x24,0x3065316164623530,0x65342d333261382d,0x2d626635622d6238,0x6666363334613863,0x7f7631663232>
   {rtp_metadata,false,1570602800542723,vp8,false,undefined,{rtp_vp8,1,1,1,0,1,1,1,0,8900,84,2,0,
    undefined,7,1 ,0,...},
 #BoxedError<0x7f7648a744b2>}}}}.
@OTP-Maintainer
Copy link
Author

sverker said:

You say OTP-21.1 but function copy_shared_perform_x indicates this is OTP-22.1. Have you reproduced it on both?


@OTP-Maintainer
Copy link
Author

shino said:

Sorry. There were two typos.

Corrections:
- OTP-21.1 was wrong. SIGBUS happens at OTP-22.1 with sharing preserving
-  "Using OTP 20.0.7, no error happens, regardless --enable-sharing-preserving or not" was wrong.
  Correction is "Using OTP 22.0.7, no error happens, regardless --enable-sharing-preserving or not"


@OTP-Maintainer
Copy link
Author

sverker said:

Thank you. That makes this bug report much more interesting to look at. :)

@OTP-Maintainer
Copy link
Author

sverker said:

Can you run the following commands in gdb on core dump:

{noformat}
(gdb) frame 0
(gdb) info registers
(gdb) print *pb
{noformat}

@OTP-Maintainer
Copy link
Author

shino said:

Thanks for your investigation.

{noformat}
(gdb) frame 0
#0  erts_emasculate_writable_binary (pb=pb@entry=0x7f7648a744b0) at beam/erl_bits.c:1647
1647        if (unused >= 8) {
(gdb) info registers
rax            0x7f7648a477d8      140146001606616
rbx            0x7f7648a744b0      140146001790128
rcx            0x2b                43
rdx            0x3                 3
rsi            0x7f7648a477da      140146001606618
rdi            0x7f7648a744b0      140146001790128
rbp            0xa84124cc442c      0xa84124cc442c
rsp            0x7f764ab728b0      0x7f764ab728b0
r8             0x0                 0
r9             0x0                 0
r10            0x1                 1
r11            0x0                 0
r12            0x2e                46
r13            0x7f764ab72a88      140146036386440
r14            0x2b                43
r15            0x7f764ab72a10      140146036386320
rip            0x557b3efcb7f4      0x557b3efcb7f4 <erts_emasculate_writable_binary+20>
eflags         0x10282             [ SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) print *pb
$1 = {thing_word = 3, size = 46, next = 0x2c0c241c190004f1, val = 0xa84124cc442c, bytes = 0xf4ab0b783c041989 <error: Cannot access memory at address 0xf4ab0b783c041989>, flags = 0}
{noformat}

@OTP-Maintainer
Copy link
Author

shino said:

Tried to print ProcBin members, but all failed. (I'm not sure this information is relevant)


{noformat}
(gdb) print *pb->next
Cannot access memory at address 0x2c0c241c190004f1
(gdb) print *pb->val
Cannot access memory at address 0xa84124cc442c
(gdb) print *pb->bytes
Cannot access memory at address 0xf4ab0b783c041989
{noformat}


@OTP-Maintainer
Copy link
Author

sverker said:

The ProcBin structure looks corrupted, that why you can't follow the pointers.

Doing human remote controlled debugging like this is not the most effective way, but I have one more idea.
Could you do
{noformat}
(gdb) frame 2
(gdb) print *sender
{noformat}


Other ways forward:
* Use {{git bisect}} between OTP-22.0.7 and OTP-22.1 to see which commit that introduced the problem.
* Try reproduce with debug compiled Erlang VM in hope of getting a nicer earlier crash. Build debug VM with (export ERL_TOP=<OTP source dir> && cd $ERL_TOP/erts/emulator && make debug) and run with bin/cerl -debug in source tree.
* Try reproduce with https://rr-project.org. Once reproduced, you can replay until crash, set memory watch point on corrupted data and then execute backwards to find the code that did the corruption. The Erlang VM can be run under rr with bin/cerl -rr in a built OTP source tree. Combining -debug and -rr gives the best debugging experience.

@OTP-Maintainer
Copy link
Author

shino said:

Totally agree for inefficient debugging, especially with large time zone difference :(
Thanks a lot for your patience and suggestions!
I will try the ways and update this thread.

Information about sender object:

{noformat}
(gdb) frame 2
#2  0x0000557b3ef3b6ba in erts_send_message (sender=sender@entry=0x7f7648411838,
 receiver=receiver@entry=0x7f764f240b10,
 receiver_locks=receiver_locks@entry=0x7f764ab72c44,
 message=message@entry=140146001791186) at beam/erl_message.c:760
760                     message = copy_shared_perform(message, msize, &info, &hp, ohp);
(gdb) print *sender
$3 = {common = {id = 10582799422275, refc = {atmc = {counter = 2}, sint = 2}, tracer = 18446744073709551611,
 trace_flags = 0, timer = {counter = 0}, u = {alive = {started_interval = 266, reg = 0x0,
 links = 0x7f76484122c8, lt_monitors = 0x7f7648411e28, monitors = 0x0},
 release = {later = 266, func = 0x0, data = 0x7f76484122c8, next = 0x7f7648411e28}}},
 htop = 0x7f7648a748f0, stop = 0x7f7648a784e0, heap = 0x7f7648a73558, hend = 0x7f7648a78628,
 abandoned_heap = 0x0, heap_sz = 2586, min_heap_size = 233, min_vheap_size = 46422, max_heap_size = 3,
 arity = 0, arg_reg = 0x7f76484118e8, max_arg_reg = 6,
 def_arg_reg = {140146143985504, 386635, 140146007670809, 0, 0, 4000}, cp = 0x0, i = 0x7f764bbd89a8,
 catches = 2, fcalls = 3849, rcount = 0, schedule_count = 0, reds = 2122712, group_leader = 7730941136403,
 flags = 134225920, fvalue = 18446744073709551611, freason = 0, ftrace = 18446744073709551611, next = 0x0,
 sig_qs = {first = 0x0, last = 0x7f7648411978, save = 0x7f7648411978, cont = 0x0, cont_last = 0x7f7648411990,
 nmsigs = {next = 0x0, last = 0x0}, saved_last = 0x0, len = 0},
 bif_timers = 0x7f7651532d30, dictionary = 0x7f75900002c8, seq_trace_clock = 0, seq_trace_lastcnt = 0,
 seq_trace_token = 18446744073709551611,
 u = {terminate = 0x2730b, initial = {module = 160523, function = 386635, arity = 5}},
 current = 0x7f764bbd8990, parent = 8074538520243, static_flags = 0, high_water = 0x7f7648a73ae8,
 old_hend = 0x7f76491ea710, old_htop = 0x7f76491e93c0, old_heap = 0x7f76491e2448, gen_gcs = 122,
 max_gen_gcs = 65535, off_heap = {first = 0x7f7648a74890, overhead = 31}, mbuf = 0x7f7648a0c778,
 live_hf_end = 0xfffffffffffffff8, msg_frag = 0x0, mbuf_sz = 13, psd = {counter = 0}, bin_vheap_sz = 46422,
 bin_old_vheap_sz = 46422, bin_old_vheap = 4656, sys_task_qs = 0x0, dirty_sys_tasks = 0x0,
 state = {counter = 41002}, dirty_state = {counter = 0},
 sig_inq = {first = 0x0, last = 0x7f7648411aa8, len = 0, nmsigs = {next = 0x0, last = 0x0}},
 trace_msg_q = 0x0, lock = {flags = {counter = 1}, queue = {0x0, 0x0, 0x0, 0x0, 0x0}},
 scheduler_data = 0x7f764def1f80, run_queue = {counter = 140146090201664}}
{noformat}

@OTP-Maintainer
Copy link
Author

sverker said:

Here's my conclusions so far.

We got what looks like a corrupted ProcBin, which is an on-heap structure of a binary term pointing to an off-heap shared binary payload structure Binary.

This ProcBin lies between sender->high_water and sender->htop which means it was created after the last garbage collection of the sender process.

The size of the ProcBin (46 bytes) could be correct but the three pointers 'next', 'val' and 'bytes' look totally corrupted. They don't look like pointers, terms, ascii, utf8 or anything else I can recognize. It could maybe be payload data of an on-heap binary (ErlHeapBin), but how it would have got inside a ProcBin I don't know.

@OTP-Maintainer
Copy link
Author

shino said:

Some bisect work executed.

1. bisect along with first parent (--first-parent)

I'm not sure how much intermediate commits within a topic branch are qualified,
tried to apply bisect steps for first parent merge commits.
The first merge commit that leads to crash:

{noformat}
commit 65874ca12f22c0135399c9161b3091a39e40e20f
Merge: d203e84754 9603aed003
Author: John Högberg <john@erlang.org>
Date:   Mon Aug 12 15:18:56 2019 +0200

    Merge branch 'john/erts/bs_get_binary2-heap-binaries/OTP-15977' into maint
{noformat}

https://github.com/erlang/otp/commit/65874ca12f22c0135399c9161b3091a39e40e20f

2. bisect in the branch

The first commit that leads to crash:

{noformat}
commit 9f73f98ad6ddf70d30dfc0a1daf7c992dcc5061a
Author: John Högberg <john@erlang.org>
Date:   Thu Aug 8 10:06:25 2019 +0200

    erts: Create heap binaries in bs_get_binary2

    ErlSubBin is a large struct that often dwarfs the region of memory
    it points at, and it's common for them to refer to a ProcBin which
    must be kept around as long as the SubBin lives, using up even more
    heap space and keeping the referenced binary alive regardless of
    how small the sub-binary is.
{noformat}

https://github.com/erlang/otp/commit/9f73f98ad6ddf70d30dfc0a1daf7c992dcc5061a

Hope this add some infomation :fingerscrossed:

@OTP-Maintainer
Copy link
Author

sverker said:

Thank you. That help me find the bug.

But the bug was actually not in the bisected commit. That commit just causes more binaries to be created on-heap, which in turn triggered an already existing bug in copy_shared_perform_x. I managed to reproduce on both 22.1 and OTP-21.3.8.10 with this code and {{--enable-sharing-preserving}}:

{code:java}
-module(t).
-export([go/0]).

go() ->
    To = spawn_link(fun() -> receive _ -> ok end end),
    HeapBin = list_to_binary(lists:seq(1,63)),
    Len = bit_size(HeapBin) - 1,
    <<SubBin:Len/bits, _/bits>> = HeapBin,
    erlang:display(SubBin),
    To ! {HeapBin, SubBin, HeapBin},
    ok.
{code}

I will work on a fix...

@OTP-Maintainer
Copy link
Author

shino said:

Great work!

> But the bug was actually not in the bisected commit.

Makes sense. The commit does not seem harmful at all.

> I will work on a fix...

:smile:

@OTP-Maintainer
Copy link
Author

sverker said:

Proposed fix:
https://github.com/erlang/otp/pull/2446

@OTP-Maintainer
Copy link
Author

shino said:

I confirmed our application continues working with the branch.

Thank you very much!!


@OTP-Maintainer
Copy link
Author

sverker said:

Fix released in OTP-22.1.7 and OTP-21.3.8.11.

 

@OTP-Maintainer OTP-Maintainer added bug Issue is reported as a bug team:VM Assigned to OTP team VM priority:medium labels Feb 10, 2021
@OTP-Maintainer OTP-Maintainer added this to the OTP-22.1.7 milestone Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

2 participants