
BEAM crashes with segmentation fault #7683

Closed
dvic opened this issue Sep 26, 2023 · 17 comments · Fixed by #7712

Comments

@dvic

dvic commented Sep 26, 2023

Describe the bug
We're getting random segmentation faults in our project, mostly when running an import script.

To Reproduce
I can share access to our private repo where it can be reproduced (with some luck :).

Here is the crash report:
beam.smp-2023-09-26-134305_ips.txt (text formatted)

beam.smp-2023-09-26-134305_ips_json.txt

More crash reports:
beam.smp-2023-09-27-222920.txt

Here is the crash report with debug enabled:
beam.debug.smp-2023-09-29-112754.txt

beam.debug.smp-2023-09-29-113729.txt

OTP 24.3.4.13 crash reports:

Expected behavior
No segmentation faults.

Affected versions
Debian Bookworm (Docker) and macOS; tried Elixir 1.15.6 with OTP 24, 25, and 26

Additional context
The app makes heavy use of ETS tables and we suspect it has something to do with that.

Is there anything else we can do to debug where this is coming from?

@dvic dvic added the bug Issue is reported as a bug label Sep 26, 2023
@dvic
Author

dvic commented Sep 26, 2023

I've narrowed it down to a :ets.select_replace call: if I comment that one out, the import script runs fine without any segfaults 🤨

@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Sep 27, 2023
@dvic
Author

dvic commented Sep 27, 2023

Update: we’ve narrowed it down to the usage of the following library: https://github.com/evadne/gen_magic

Replacing gen_magic with a rustler project calling libmagic directly resolves the segfaults.

Not sure if this should remain open? As far as I know, gen_magic uses ports and contains no NIF stuff, so it still might be a bug in OTP.

Scratch that, just encountered a segfault again and random values appearing in our ets tables :(

@mikpe
Contributor

mikpe commented Sep 27, 2023

It would help if you could provide a reproducer in Erlang. Looks like it's happening in copy_struct_x(). I don't see any out-of-tree NIFs. Since OTP-24 w/ ARM64 is affected I think we can discount the JIT.

@dvic
Author

dvic commented Sep 27, 2023

Yeah, I'm trying to reproduce it, but I'm having a hard time doing so in Elixir, let alone Erlang :( I now have a script that reproduces it fairly reliably on OTP 26 but not on OTP 24 and OTP 25. We'll keep trying and keep posting updates here.

Every time it crashes, though, it crashes at copy_struct_x (copy.c:955); that's the only constant...

@janwillemvd

That’s a better question for the Erlang/OTP team. :)

Anyone with some advice or tips for debugging segfaults in erlang? The copy_struct_x still looks like the only constant factor out here.

@dvic
Author

dvic commented Sep 28, 2023

Update, I now encountered an example where the crash reason was different:

Thread 7 Crashed:: erts_sched_4
0   ???                           	       0x141f50da4 ???
1   ???                           	       0x141eed98c ???

(not sure if this helps)

I'll attach a full log in the description.

@dvic
Author

dvic commented Sep 28, 2023

I've also updated the description: the crash also happens in production (Debian Bookworm running in Docker).

@garazdawi
Contributor

Anyone with some advice or tips for debugging segfaults in erlang?

Check the Types and Flavors section in the development guide.

If you can reproduce the error in the debug emulator, that would help a lot as the resulting core file will contain a lot more information.

@janwillemvd

Thanks @garazdawi!

@dvic
Author

dvic commented Sep 29, 2023

I've managed to reproduce it with a debug enabled beam.

I used export KERL_RELEASE_TARGET="debug asan"; however, I can't reproduce it with cerl -asan, only with cerl -debug. Before it exits, I see varying assertion failures, such as:

  • beam/erl_gc.c:4062:checked_header_arity() Assertion failed: TYPE ASSERTION: is_header(x)
  • beam/utils.c:1450:eq() Assertion failed: !"Unknown boxed subtab in EQ"

I've attached new dumps to the description, but here are the two example thread crashes in question:

Thread 4 Crashed:: erts_sched_1
0   libsystem_kernel.dylib        	       0x187cec764 __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x187d23c28 pthread_kill + 288
2   libsystem_c.dylib             	       0x187c31ae8 abort + 180
3   beam.debug.smp                	       0x100b54650 erl_assert_error + 128 (sys.c:959)
4   beam.debug.smp                	       0x100a9b164 checked_header_arity + 76 (erl_term.c:119)
5   beam.debug.smp                	       0x100a81d18 check_no_empty_boxed_non_literal_term + 96 (erl_gc.c:4062)
6   beam.debug.smp                	       0x100a81a74 check_all_heap_terms_in_range + 204 (erl_gc.c:3964)
7   beam.debug.smp                	       0x100a81874 erts_dbg_check_heap_terms + 412 (erl_gc.c:4030)
8   beam.debug.smp                	       0x100a81cac erts_dbg_check_no_empty_boxed_non_literal_on_heap + 40 (erl_gc.c:4086)
9   beam.debug.smp                	       0x100a7ae74 garbage_collect + 2160 (erl_gc.c:737)
10  beam.debug.smp                	       0x100a7bb0c erts_garbage_collect_nobump + 140 (erl_gc.c:902)
11  ???                           	       0x1018b55cc ???
12  ???                           	       0x101caf000 ???
13  ???                           	       0x101cb2ea8 ???
Thread 5 Crashed:: erts_sched_2
0   libsystem_kernel.dylib        	       0x187cec764 __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x187d23c28 pthread_kill + 288
2   libsystem_c.dylib             	       0x187c31ae8 abort + 180
3   beam.debug.smp                	       0x104db0650 erl_assert_error + 128 (sys.c:959)
4   beam.debug.smp                	       0x104ba14c4 eq + 5976 (utils.c:1450)
5   ???                           	       0x105efa5f0 ???
6   ???                           	       0x106b66fa8 ???

Let me know if there's anything else that I can do.

@dvic
Author

dvic commented Sep 29, 2023

Update: I can confirm again that the crashes occur on OTP 24 and OTP 25. However, on OTP 24 the reasons are different:

Thread 4 Crashed:: 1_scheduler
0   libsystem_kernel.dylib        	       0x187cec764 __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x187d23c28 pthread_kill + 288
2   libsystem_c.dylib             	       0x187c31ae8 abort + 180
3   beam.debug.smp                	       0x10445c170 erl_assert_error + 128 (sys.c:955)
4   beam.debug.smp                	       0x10439b1a8 do_minor + 1408 (erl_gc.c:1607)
5   beam.debug.smp                	       0x104399730 minor_collection + 876 (erl_gc.c:1417)
6   beam.debug.smp                	       0x10439247c garbage_collect + 2476 (erl_gc.c:748)
7   beam.debug.smp                	       0x104392fe4 erts_garbage_collect_nobump + 140 (erl_gc.c:890)
8   beam.debug.smp                	       0x10412a21c scheduler_gc_proc + 140 (erl_process.c:9405)
9   beam.debug.smp                	       0x104121594 erts_schedule + 9696 (erl_process.c:10161)
10  beam.debug.smp                	       0x104141110 process_main + 972 (beam_emu.c:356)
11  beam.debug.smp                	       0x10411c484 sched_thread_func + 684 (erl_process.c:8656)
12  beam.debug.smp                	       0x10453cf6c thr_wrapper + 296 (ethread.c:122)
13  libsystem_pthread.dylib       	       0x187d23fa8 _pthread_start + 148
14  libsystem_pthread.dylib       	       0x187d1eda0 thread_start + 8
  • segfault

beam.debug.smp-2023-09-29-122001.txt

Thread 13 Crashed:: 10_scheduler
0   beam.debug.smp                	       0x1026752dc move_boxed + 604 (erl_gc.h:91)
1   beam.debug.smp                	       0x1027c7268 do_minor + 1600 (erl_gc.c:1612)
2   beam.debug.smp                	       0x1027c5730 minor_collection + 876 (erl_gc.c:1417)
3   beam.debug.smp                	       0x1027be47c garbage_collect + 2476 (erl_gc.c:748)
4   beam.debug.smp                	       0x1027bd928 erts_gc_after_bif_call_lhf + 512 (erl_gc.c:456)
5   beam.debug.smp                	       0x10256f2c4 process_main + 9600 (beam_hot.h:376)
6   beam.debug.smp                	       0x102548484 sched_thread_func + 684 (erl_process.c:8656)
7   beam.debug.smp                	       0x102968f6c thr_wrapper + 296 (ethread.c:122)
8   libsystem_pthread.dylib       	       0x187d23fa8 _pthread_start + 148
9   libsystem_pthread.dylib       	       0x187d1eda0 thread_start + 8

@garazdawi
Contributor

So, there is a term in the heap after GC that is corrupt. Can you make the beam.debug.smp and core file (from a debian if possible) available to me somehow? If you don't want to post it here, you can e-mail lukas@erlang.org.

If you cannot give it to me, then we'll have to do this the slow way, with me posting gdb commands to dig out more information.

@dvic
Author

dvic commented Sep 29, 2023

So, there is a term in the heap after GC that is corrupt. Can you make the beam.debug.smp and core file (from a debian if possible) available to me somehow? If you don't want to post it here, you can e-mail lukas@erlang.org.

If you cannot give it to me, then we'll have to do this the slow way, with me posting gdb commands to dig out more information.

I will try to replicate this on a debian and get back to you. But don't you also need the source code of the project to trigger this? This is currently the command I'm using to trigger it:

cerl -debug -noshell -elixir_root /Users/dvic/.asdf/installs/elixir/1.15.6-otp-26/bin/../lib -pa /Users/dvic/.asdf/installs/elixir/1.15.6-otp-26/bin/../lib/elixir/ebin -s elixir start_cli -elixir ansi_enabled true -extra -S $MIX run scripts/benchmarks/pluto_benchmark.exs

I can set up a Debian machine with the full source code and give you SSH access (the reproduction case does not involve any of our customer data), if that's an option.

@garazdawi
Contributor

I can set up a Debian machine with the full source code and give you SSH access (the reproduction case does not involve any of our customer data), if that's an option.

That would be even better. If we can get it to reproduce there we can hopefully use the beautiful tool rr and this will be fixed in no time.

@dvic
Author

dvic commented Sep 29, 2023

I can set up a Debian machine with the full source code and give you SSH access (the reproduction case does not involve any of our customer data), if that's an option.

That would be even better. If we can get it to reproduce there we can hopefully use the beautiful tool rr and this will be fixed in no time.

Nice! I'll get on this right away and let you know by email once I've got this set up.

@max-au
Contributor

max-au commented Sep 29, 2023

Anyone with some advice or tips for debugging segfaults in erlang? The copy_struct_x still looks like the only constant factor out here.

(shameless plug) If you're able to reproduce it locally, then you can likely use the technique described at https://max-au.com/debugging-the-beam/ to run the BEAM under a debugger.

@garazdawi garazdawi self-assigned this Oct 2, 2023
garazdawi added a commit to garazdawi/otp that referenced this issue Oct 3, 2023
If the body of a matchspec would return a flatmap with
a variable ('$1', '$_' etc) as one of the keys and the
variable was not an immediate, the key term would not
be copied to the receiving process's heap. This would
later corrupt the term in the table as the GC could
place move markers in it.

Also fixed a bug in the stack estimation logic when
a flatmap with all constant values, but not constant
keys was encountered.

Closes erlang#7683
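
As an illustration only, a match spec of the shape this commit message describes could look like the Elixir sketch below. The table name and data are hypothetical and not taken from the reporter's project. The body returns a one-key map whose key is the match variable `:"$1"` (written `'$1'` in Erlang), bound here to a tuple, which is not an immediate term. This shows the shape of the spec; it is not a reproducer for the crash.

```elixir
# Hypothetical table and data for illustration; not from the reporter's project.
table = :ets.new(:flatmap_example, [:set])
true = :ets.insert(table, {{:user, 1}, "alice"})

# Match spec: the head binds '$1' to the key and '$2' to the value; the body
# returns a one-key map with '$1' as the key. Here '$1' is bound to a tuple,
# which is a boxed (non-immediate) term stored inside the table.
match_spec = [{{:"$1", :"$2"}, [], [%{:"$1" => :"$2"}]}]

result = :ets.select(table, match_spec)
IO.inspect(result, label: "selected")
```

Maps with at most 32 keys are stored as flatmaps in the VM, so a one-key map like this one falls under the flatmap case described above.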
garazdawi added a commit to garazdawi/otp that referenced this issue Oct 3, 2023
If the body of a matchspec would return a hashmap with
a variable ('$1', '$_' etc) as one of the keys or values
and the variable was not an immediate, the term would not
be copied to the receiving process's heap. This would
later corrupt the term in the table as the GC could
place move markers in it.

Also fixed an issue with the stack-estimation logic for
when such a hashmap was encountered.

Closes: erlang#7683
@garazdawi garazdawi linked a pull request Oct 3, 2023 that will close this issue
garazdawi added a commit to garazdawi/otp that referenced this issue Oct 10, 2023
If the body of a matchspec would return a hashmap with
a variable ('$1', '$_' etc) as one of the keys or values
and the variable was not an immediate, the term would not
be copied to the receiving process's heap. This would
later corrupt the term in the table as the GC could
place move markers in it.

Also fixed an issue with the stack-estimation logic for
when such a hashmap was encountered.

Closes: erlang#7683
@garazdawi garazdawi added this to the OTP-26.1.2 milestone Oct 10, 2023
@janwillemvd

Thanks, @garazdawi!! 🙌🏻
