
Invalid argument in wait__ (ethr_event.c) #5339

Closed
evnu opened this issue Nov 2, 2021 · 5 comments · Fixed by #5345
Assignees: rickard-green
Labels: bug (Issue is reported as a bug), in progress, team:VM (Assigned to OTP team VM)

Comments

evnu commented Nov 2, 2021

Describe the bug

When running our test suite on macOS (and only there!), we are sometimes hit with this:

pthread/ethr_event.c:584: Fatal error in wait__(): Invalid argument (22)

It seems that the select_timeout value can be too large; see this LLDB session (using a debug VM built from commit 947eb1f):

% "$ERL_TOP/bin/cerl" -lldb \
  -pa $(find /usr/local/Cellar -name "ebin" -type d | grep elixir) \
  -elixir ansi_enabled true -noshell -s elixir start_cli \
  -- -extra /usr/local/bin/mix test
(lldb) target create "beam.smp"
Current executable set to 'beam.smp' (x86_64).
(lldb) settings set -- target.run-args  "--" "-root" "/Users/vpp/otp" "-progname" "/Users/vpp/otp/bin/cerl" "--" "-home" "/Users/vpp" "--" "-pa" "/usr/local/Cellar/elixir/1.12.3/lib/logger/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/elixir/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/ex_unit/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/mix/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/eex/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/iex/ebin" "-elixir" "ansi_enabled" "true" "-noshell" "-s" "elixir" "start_cli" "--" "-extra" "/usr/local/bin/mix" "test"
(lldb) command source -s 0 '/tmp/.cerllldb.62540'
Executing commands in '/tmp/.cerllldb.62540'.
(lldb) env TERM=dumb
(lldb) command script import /Users/vpp/otp/erts/etc/unix/etp.py
(lldb) breakpoint set --file ethr_event.c --line 584 -c "errno == 22 && sres < 0"
Breakpoint 1: where = beam.smp`wait__ + 2066 at ethr_event.c:584:10, address = 0x000000010044a9f2
(lldb) r
...
.Process 62635 stopped
* thread #10, name = '6_scheduler', stop reason = breakpoint 1.1
    frame #0: 0x000000010044a9f2 beam.smp`wait__(e=0x00000001423ec890, spincount=0, timeout=511635978304965000) at ethr_event.c:584:10
   581 		else {
   582 		    res = EINTR;
   583 		    if (sres < 0 && errno != EINTR)
-> 584 			ETHR_FATAL_ERROR__(errno);
   585 		    /* else:
   586 		     *   Event is *probably* set, but it can be a
   587 		     *   lingering writer. That is, it is important
Target 0: (beam.smp) stopped.
(lldb) p select_timeout
(timeval) $0 = (tv_sec = 511635978, tv_usec = 289965)

The fd sets in select() do not look suspicious. There aren't too many of them:

(lldb) p fd + 1
(int) $0 = 24
(lldb) p *rsetp
(fd_set) $1 = {
  fds_bits = {
    [0] = 8388608
    [1] = 0
    [2] = 0
    [3] = 0
    [4] = 0
    [5] = 0
    [6] = 0
    [7] = 0
    [8] = 0
    [9] = 0
    [10] = 0
    [11] = 0
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 0
    [16] = 0
    [17] = 0
    [18] = 0
    [19] = 0
    [20] = 0
    [21] = 0
    [22] = 0
    [23] = 0
    [24] = 0
    [25] = 0
    [26] = 0
    [27] = 0
    [28] = 0
    [29] = 0
    [30] = 0
    [31] = 0
  }
}
(lldb) p *esetp
(fd_set) $2 = {
  fds_bits = {
    [0] = 8388608
    [1] = 0
    [2] = 0
    [3] = 0
    [4] = 0
    [5] = 0
    [6] = 0
    [7] = 0
    [8] = 0
    [9] = 0
    [10] = 0
    [11] = 0
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 0
    [16] = 0
    [17] = 0
    [18] = 0
    [19] = 0
    [20] = 0
    [21] = 0
    [22] = 0
    [23] = 0
    [24] = 0
    [25] = 0
    [26] = 0
    [27] = 0
    [28] = 0
    [29] = 0
    [30] = 0
    [31] = 0
  }
}

To Reproduce
This happens during our large test suite. It does not always happen during the same tests, so I cannot provide a minimal example.

Affected versions
I can reliably reproduce this with OTP 24, but it also happens with earlier versions. We encountered it with OTP 22 and OTP 21 as well.

Additional context

I can currently trigger this reliably and can provide more debugging information as needed. Providing a core dump would also be possible if we can find a secure channel for it. Note that our software uses NIFs implemented with rustler, some of which run on dirty schedulers. However, we have not yet seen any other indicators of corruption due to NIFs (no segfaults, for example; we have been running the NIFs in one form or another for about 3 years).

macOS version information:

% sw_vers
ProductName:	macOS
ProductVersion:	11.1
BuildVersion:	20C69

EINVAL can be triggered for select() with the following C example on our test machine (empty fd sets!):

#include <stdio.h>
#include <errno.h>
#include <sys/time.h>

int main(void) {
  struct timeval select_timeout;
  int res;

  /* A very large (but valid) timeout value. */
  select_timeout.tv_sec = 511693940;
  select_timeout.tv_usec = 724117;

  /* No fd sets at all; only the timeout can cause a failure. */
  res = select(1, NULL, NULL, NULL, &select_timeout);

  printf("res: %i, errno: %i\n", res, errno);

  return 0;
}
% make large_val_in_select
cc     large_val_in_select.c   -o large_val_in_select
% ./large_val_in_select
res: -1, errno: 22
@evnu evnu added the bug Issue is reported as a bug label Nov 2, 2021
@rickard-green rickard-green self-assigned this Nov 2, 2021
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Nov 2, 2021
@rickard-green rickard-green linked a pull request Nov 2, 2021 that will close this issue
@rickard-green (Contributor)

Thanks for the detailed bug report!

It turns out that the tv_sec field is not allowed to be larger than 100'000'000 on macOS. Unfortunately, this is not documented.

I've made pull request #5342, which should fix this issue. The branch of the pull request is based on OTP 22.3.4 so that we can reuse it for patches on OTP 22 and onwards; it should merge cleanly to any OTP version later than OTP 22.3.4. Note that I have only done very limited testing of it, but I'm quite confident that it will solve this issue. Please try it out in your test environment and confirm that it fixes the crashes you see.

@rickard-green (Contributor)

Hmm, I just looked at the POSIX documentation of select(), and it seems that implementations are not required to support timeout values larger than 31 days. I therefore think I'll modify pull request #5342 to not use timeout values larger than that on any platform. That will be work for tomorrow, though...

@evnu (Author) commented Nov 3, 2021

Thanks for the quick reply and PR @rickard-green! I applied e90aa6f on top of OTP-24.1.4. With that, our tests no longer crash!

@rickard-green (Contributor)

> Thanks for the quick reply and PR @rickard-green! I applied e90aa6f on top of OTP-24.1.4. With that, our tests no longer crash!

Great, thanks!

I've replaced that PR with PR #5345 which will treat select() the same regardless of platform and limit the maximum timeout value to 31 days.

rickard-green added a commit that referenced this issue Nov 11, 2021
* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-24

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-23

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-22

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
@rickard-green (Contributor)

The fix has been released in the patch packages OTP 24.1.5, OTP 23.3.4.9, and OTP 22.3.4.23 now.

This issue was closed.