Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERL-1044: Segmentation fault occur regularly in test #4166

Closed
OTP-Maintainer opened this issue Sep 13, 2019 · 20 comments
Closed

ERL-1044: Segmentation fault occur regularly in test #4166

OTP-Maintainer opened this issue Sep 13, 2019 · 20 comments
Assignees
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Milestone

Comments

@OTP-Maintainer
Copy link

Original reporter: siri
Affected versions: OTP-22.1.3, OTP-21.1.1
Fixed in versions: OTP-22.1.7, OTP-21.3.8.11
Component: Not Specified
Migrated from: https://bugs.erlang.org/browse/ERL-1044


Dear friends!

We got a segfault and a core dump from OTP-21.1.1 running on Red Hat RHEL 6.  We wonder if it's a known bug, and if it possibly has been corrected in a later release?

The problem occurs regularly in our test environment, and we have several core files, size 200-800 MB - thus to big to attach here. Below are the backtraces from two of them. Please let me know how to proceed or who to send the file(s) to. 

Thanks!

/siri

{noformat}
Core was generated by `/usr/lib64/erlang/erts-10.1.1/bin/beam.smp -K true -W w -Bc -- -root /usr/lib64'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000601412 in rbt_delete ()
[Current thread is 1 (process 3813)]
(gdb) bt
#0  0x0000000000601412 in rbt_delete ()
#1  0x0000000000601ee2 in aoff_get_free_block ()
#2  0x00000000004d8632 in mbc_alloc ()
#3  0x00000000004dc9f8 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005cf23e in erts_monitor_create ()
#5  0x0000000000579902 in erts_net_message ()
#6  0x0000000000538919 in driver_output2 ()
#7  0x0000000000649764 in tcp_reply_data ()
#8  0x0000000000649c2a in tcp_deliver ()
#9  0x000000000064a1a4 in tcp_recv ()
#10 0x000000000064aeaf in tcp_inet_input ()
#11 0x000000000055709b in erts_port_task_execute ()
#12 0x00000000004a530f in erts_schedule ()
#13 0x0000000000489891 in process_main ()
#14 0x0000000000497f81 in sched_thread_func ()
#15 0x00000000008121fa in thr_wrapper ()
#16 0x00007fd2f1a35aa1 in ?? ()
#17 0x00007fd2a8a7d700 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)
{noformat}







{noformat}
Core was generated by `/usr/lib64/erlang/erts-10.1.1/bin/beam.smp -K true -W w -Bc -- -root /usr/lib64'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000601481 in rbt_delete ()
[Current thread is 1 (process 1124)]
(gdb) bt
#0  0x0000000000601481 in rbt_delete ()
#1  0x0000000000601ee2 in aoff_get_free_block ()
#2  0x00000000004d8632 in mbc_alloc ()
#3  0x00000000004dc9f8 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005e5939 in erts_erts_pcre_malloc ()
#5  0x0000000000809244 in erts_pcre_exec ()
#6  0x00000000005e6fb3 in re_run ()
#7  0x000000000048a547 in process_main ()
#8  0x0000000000497f81 in sched_thread_func ()
#9  0x00000000008121fa in thr_wrapper ()
#10 0x00007f624d18caa1 in ?? ()
#11 0x00007f61ffefd700 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb)
{noformat}





@OTP-Maintainer
Copy link
Author

lukas said:

Hello, both crashes are faults in the memory allocators which means that somewhere someone mismanaged memory. I don't recognize any of these crashes.

Do you run any nifs and or linked-in driver? If you do, then most likely there is something wrong one of those.

@OTP-Maintainer
Copy link
Author

siri said:

I've looked through all deps (a few open source things like lager, cowboy, goldrush, ranch and a few internal) but I can't see anything but erlang code :( :)

@OTP-Maintainer
Copy link
Author

sverker said:

I think the next step on this matter is to try reproduce same kind of crash on latest OTP-21 (currently OTP-21.3.8.7).

That would be a better motivation to spend some serious post-mortem detective work on a core dump. :( :)


@OTP-Maintainer
Copy link
Author

siri said:

Hi Sverker/VM - I finally got the same up and running with 22.1.3. Sadly, we still see segfaults :( 

Examples:

{code:java}
Core was generated by `/usr/lib/erlang/erts-10.5.2/bin/beam.smp -K true -W w -Bc -- -root /usr/lib/erl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000005d4e18 in ?? ()
[Current thread is 1 (process 25080)]
(gdb) bt
#0  0x00000000005d4e18 in ?? ()
#1  0x00000000005d593f in ?? ()
#2  0x00000000004a4f02 in ?? ()
#3  0x00000000004a8428 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005a1f6e in erts_monitor_create ()
#5  0x0000000000547afc in erts_net_message ()
#6  0x000000000050a292 in driver_output2 ()
#7  0x000000000062a00a in ?? ()
#8  0x000000000062cd0e in ?? ()
#9  0x000000000062d2a4 in ?? ()
#10 0x000000000062d65f in ?? ()
#11 0x0000000000528953 in erts_port_task_execute ()
#12 0x0000000000471bde in erts_schedule ()
#13 0x0000000000453b61 in process_main ()
#14 0x0000000000462944 in ?? ()
#15 0x00000000006879ea in ?? ()
#16 0x00007fde5ccf7aa1 in ?? ()
#17 0x00007fde0fdff700 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)
{code}


{code:java}
Core was generated by `/usr/lib/erlang/erts-10.5.2/bin/beam.smp -K true -W w -Bc -- -root /usr/lib/erl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000005d3cf9 in ?? ()
[Current thread is 1 (process 25946)]
(gdb) bt
#0  0x00000000005d3cf9 in ?? ()
#1  0x00000000005d44a0 in ?? ()
#2  0x00000000005d4b4f in ?? ()
#3  0x00000000004a2136 in ?? ()
#4  0x00000000004a5317 in ?? ()
#5  0x00000000004a7cf1 in ?? ()
#6  0x00000000004a860c in erts_alcu_check_delayed_dealloc ()
#7  0x000000000048f90b in erts_alloc_scheduler_handle_delayed_dealloc ()
#8  0x000000000046b856 in ?? ()
#9  0x000000000046c745 in ?? ()
#10 0x0000000000471587 in erts_schedule ()
#11 0x0000000000453b61 in process_main ()
#12 0x0000000000462944 in ?? ()
#13 0x00000000006879ea in ?? ()
#14 0x00007f8e80dccaa1 in ?? ()
#15 0x00007f8e33dff700 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb)
{code}

@OTP-Maintainer
Copy link
Author

sverker said:

Ways forward:
* Share core dumps from these crashes with us. Either just attach here or more privately of they might contain sensitive information.
* Try reproduce with a debug built VM and hope for an earlier more informative crash (dump). Build with {{ERL_TOP=<otp-source-dir> && cd $ERL_TOP/erts/emulator && make debug}} and then you can do a quick and dirty install by copy $ERL_TOP/bin/*/beam.debug.smp over the installed beam.smp and copy erl_child_setup.debug as-is to same directory.

@OTP-Maintainer
Copy link
Author

siri said:

Hi, I sent a mail to John earlier today - now I sent it to you as well, Sverker.

Will try to build a debug VM and let you know how it goes.

@OTP-Maintainer
Copy link
Author

sverker said:

We need the beam.smp file as well, or point us to exactly which download you used.

@OTP-Maintainer
Copy link
Author

siri said:

It's 22.1.3, downloaded from erlang solutions. For centos6.

[^beam.smp]

@OTP-Maintainer
Copy link
Author

sverker said:

Do you have more core dumps from 22.1.3?

Erlang Solutions seems to strip their beam.smp files from debug info, which made things more difficult. But after some successful fumbling in the dark I managed to draw some conclusions.

@OTP-Maintainer
Copy link
Author

siri said:

Yes, I will send you a couple by mail. I also have a debug compiled beam.smp running now, and waiting for it to core. It has been running since yesterday evening but still nothing - which is a bit disturbing since the "normal" beam goes down at least once during the night. But I'll keep it running and hope for the "best"....

@OTP-Maintainer
Copy link
Author

siri said:

Attached [^beam.debug.smp] - used for core.3863 and core.15635 which are sent by mail to Sverker.

@OTP-Maintainer
Copy link
Author

siri said:

And here is a snippet from erlang.log.*

{noformat}
===== ALIVE Fri Oct 25 00:14:51 CEST 2019

===== Fri Oct 25 00:20:18 CEST 2019
erts_refc_read(): Bad refc found (refc=0 < 1)!
heart: Fri Oct 25 00:20:18 2019: Erlang has closed.^M
heart: Fri Oct 25 00:20:18 2019: Wait 5 seconds for Erlang to terminate nicely^M
heart: Fri Oct 25 00:20:23 2019: Erlang still alive, kill it^M

=====
===== LOGGING STARTED Fri Oct 25 00:20:25 CEST 2019
=====
heart_beat_kill_pid = 15635
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe] [type-assertions] [debug-compiled] [lock-c
hecking]

Eshell V10.5.2  (abort with ^G)
(cdb@service-cdb1.ci.nordnet.se)1> heart: Fri Oct 25 00:20:27 2019: Executed "/opt/service-cdb/bin/ctl start" -> 0. Terminating.^M

===== ALIVE Fri Oct 25 00:35:27 CEST 2019

===== ALIVE Fri Oct 25 00:50:27 CEST 2019
Performing operations on uninitialized lock!
Currently no locks are locked by the scheduler 1 thread.
heart: Fri Oct 25 00:53:16 2019: Erlang has closed.
heart: Fri Oct 25 00:53:16 2019: Wait 5 seconds for Erlang to terminate nicely
heart: Fri Oct 25 00:53:21 2019: Erlang still alive, kill it

=====
===== LOGGING STARTED Fri Oct 25 00:53:23 CEST 2019
=====
heart_beat_kill_pid = 16540
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe] [type-assertions] [debug-compiled] [lock-c
hecking]

Eshell V10.5.2  (abort with ^G)
(cdb@service-cdb1.ci.nordnet.se)1> heart: Fri Oct 25 00:53:26 2019: Executed "/opt/service-cdb/bin/ctl start" -> 0. Terminating.
{noformat}


@OTP-Maintainer
Copy link
Author

sverker said:

I think we found it. Proposed fix:
https://github.com/erlang/otp/pull/2432

Would be nice if you could give it a spin.

@OTP-Maintainer
Copy link
Author

siri said:

Great! I'll do a rebuild. If I add the diff on top of 22.1.3, can I just replace the beam.smp again - as I did with the debug build?

@OTP-Maintainer
Copy link
Author

sverker said:

Yes, that will work.

@OTP-Maintainer
Copy link
Author

siri said:

Hi,

the fix has been running over the weekend and it's the only one of four nodes that did not have at least one core file this morning. I will leave it running for a couple of more days before finally concluding, but it definitely seems promising. :)

@OTP-Maintainer
Copy link
Author

siri said:

Still running nicely - no cores since mid day Friday  :) I've now deployed this to all our 8 test machines and will leave it there for a while.

Could you please say something about what would trigger the fault that you've fixed? Just curious, since we have a few other systems still running on 21.1.3  where we haven't seen the crash but wondering if it is a potential problem.

@OTP-Maintainer
Copy link
Author

sverker said:

The necessary scenario to trigger the fault is
 # a connection toward a remote node has to be established  (or at least attempted)
 # then disconnected
 # and then a connection attempt made again to the same node name

and the reconnect attempt has to happen at the same time as some process, holding the the last reference to distribution entry of the remote node, is either garbage collected or terminated.

I can't point to any specific reason why there should be a lower risk of happening on 21.1.3, but there are a lot of changes between 21 and 22 and the race for this to happen is quite narrow.

 

@OTP-Maintainer
Copy link
Author

sverker said:

Fix released in 22.1.7.

 

@OTP-Maintainer
Copy link
Author

sverker said:

and also fixed in 21.3.8.11.

@OTP-Maintainer OTP-Maintainer added bug Issue is reported as a bug team:VM Assigned to OTP team VM priority:medium labels Feb 10, 2021
@OTP-Maintainer OTP-Maintainer added this to the OTP-22.1.7 milestone Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

2 participants