Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Variadic function cast causing segfaults on Apple ARM64 #2708

Merged

Conversation

tehprofessor
Copy link
Contributor

Overview

Howdy!

This is a fix for segfaults happening on Apple ARM64. The segfaults are caused by differences in how Apple ARM64 treats variadic functions. x86 will only use the stack when necessary, after filling the available registers; while, 🍎 ARM64 will always places variadic arguments onto the stack regardless of many registers are available. The segfault manifests as what look like errors in bif functions (e.g. is_integer_1) and do not leave an erl crash dump (only available logs were from the system's crash report).

This was originally developed against to master (otp-24) but has been back-ported here to maint. The updated build configuration provided by #2700 is required so it's currently included in this branch.

I'm running the tests now and will get back with the results when they finish (in the morning).

FWIW, so far, Cowboy's test suite[1] and cowlib passed all tests without error on master (otp-24). Using this branch (based off maint) with Elixir's master-otp-23 (from asdf I'll get the elixir sha later) Phoenix's test suite passed all tests without error.

Issue

The problem itself (seems) to boil down to variadic casts happening when calling bifs in erts/emulator/beam/erl_db_util.c - db_prog_match:

bif = (Eterm (*)(Process*, ...)) *pc++;

bif points to functions from erts/emulator/beam/erl_bif_op.c which have a fixed number of arguments e.g.

BIF_RETTYPE is_integer_1(BIF_ALIST_1)
{
    if (is_integer(BIF_ARG_1)) {
	BIF_RET(am_true);
    }
    BIF_RET(am_false);
}

Casting the fixed arguments to be variadic then causes the registers to be incorrectly marshaled and the VM segfaults.

The Fix

All of the functions in erts/emulator/beam/erl_bif_op.c are defined using a set of macros defined in erts/emulator/beam/bif.h with the relevant parts being:

#define BIF_RETTYPE Eterm

#define BIF_P A__p

#define BIF_ALIST Process* A__p, Eterm* BIF__ARGS, BeamInstr *A__I

#define BIF_ALIST_0 BIF_ALIST
/* omitted for brevity */
#define BIF_ALIST_4 BIF_ALIST

#define BIF_ARG_1  (BIF__ARGS[0])
/* omitted for brevity */
#define BIF_ARG_4  (BIF__ARGS[3])

Using these macros (on 🍎 ARM64), we were then able to declare function pointers in db_prog_match matching the expected function signature:

BIF_RETTYPE (*bif_call_1)(BIF_ALIST_1);

and then call it:

/* omitted for brevity */
	case matchCall1:
             bif_call_1 = (BIF_RETTYPE (*)(BIF_ALIST_1)) *pc++;
             t = (*bif_call_1)(build_proc, esp-1, bif_args);

Closing Remarks

I know, after battling these segfaults for a week. this is a very hot path of code execution , and I'll gladly admit I don't have the expertise required right now to know if there is a better approach for this fix or if this poses its own set of problems. This has been way out of my comfort zone (but damn it felt good! I'll be back for more); so, I'd like to emphasize I'm very receptive to feedback. The change in the PR as of opening it, is the simplest and most obvious (?) solution I could muster... and so that's exactly what's here.

I have a gigantic amount of symbolicated crash logs, I'd normally attach to help others without the platform see the issues, but... I'm not sure how much I can and can't share since I'm bound to the 🍎 developer kit NDA. I'm going to ask in their forums later if I can share some of the crash logs here or via email (if Apple needs that) for posterity.

Finally I'm happy, and waiting 😄 , to rebase after #2700 can be backported to maint as that's a dependency for this patch to work... Thank you for time!

[1] I'm not really sure how cowboy's test suite ran as it looks (based on the output) like it needs go and I definitely do not have a working instance of go on the machine.

Co-Authored-By: mjc mjc@kernel.org

@CLAassistant
Copy link

CLAassistant commented Aug 7, 2020

CLA assistant check
All committers have signed the CLA.

@tehprofessor tehprofessor force-pushed the erts/apple-silicon-variadic-fix-maint branch 4 times, most recently from 2a88d90 to 43fbd08 Compare August 9, 2020 02:22
     Fix for Segfaults on Apple ARM64. Apple silicon does not support
     re-casting functions with fixed params as variadic, and instead
     requires the the function pointer to match the signature of the
     definition. If they do not match arguments will not be correctly
     marshaled, causing the function implementation to look for
     parameters in the wrong places. To enable Apple ARM64 support, we
     must explicitly cast to the correct function signature instead of
     using varargs `...`.

Co-Authored-By: mjc <mjc@kernel.org>
@tehprofessor tehprofessor force-pushed the erts/apple-silicon-variadic-fix-maint branch from 43fbd08 to af8a2f5 Compare August 10, 2020 06:14
@bjorng bjorng self-assigned this Aug 10, 2020
@bjorng bjorng added fix team:VM Assigned to OTP team VM labels Aug 10, 2020
bif = (Eterm (*)(Process*, ...)) *pc++;
t = (*bif)(build_proc, esp-1);
bif = (BIF_RETTYPE (*)(BIF_ALIST)) *pc++;
t = (*bif)(build_proc, esp-1, NULL);
Copy link
Contributor Author

@tehprofessor tehprofessor Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've got everything updated correctly. The one thing I'm not entirely sure of is if this should be copied into bif_args, if this is an optimization for this specific case, or something else? I'm new to c; so, if I'm understanding this correctly, esp-1 works here without copying into bif_args because esp-1 and bif_args[0] essentially point to the same value and both are pointers so semantically it's all good? Then by not copying into bif_args[0] we save an allocation/step?

And, please feel free to tell me to just google more 😄 !

Thank you both for your time reviewing this, and please let me know if there's anything else I can change, improve, or have missed 😅 here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copying of the arguments to bif_args when there are 2 or 3 arguments is only necessary because the arguments would be in the wrong order if we would pass esp-2 or esp-3, respectively. When there is a single argument, there is no need to copy it because it can't be in the wrong order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bjorng

@bjorng bjorng added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Aug 10, 2020
@bjorng bjorng changed the base branch from master to maint August 10, 2020 06:57
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Aug 10, 2020
@bjorng
Copy link
Contributor

bjorng commented Aug 10, 2020

Many thanks for working on this.

I have added this pull request to our daily builds. If everything goes well, we will merge it within a few days.

@bjorng bjorng merged commit d22e3fc into erlang:maint Aug 13, 2020
@bjorng
Copy link
Contributor

bjorng commented Aug 13, 2020

Thanks for your pull request.

@tehprofessor
Copy link
Contributor Author

Just wanted to say "thank you" @bjorng and @mikpe for the code reviews and feedback! I really appreciate you both taking the time to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix team:VM Assigned to OTP team VM testing currently being tested, tag is used by OTP internal CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants