
Invalid argument in wait__ (ethr_event.c) #5339

Closed
evnu opened this issue Nov 2, 2021 · 5 comments · Fixed by #5345
Assignees: rickard-green
Labels: bug (Issue is reported as a bug), in progress, team:VM (Assigned to OTP team VM)

Comments

evnu commented Nov 2, 2021

Describe the bug

When running our test suite on macOS (and only there!), we are sometimes hit with this:

pthread/ethr_event.c:584: Fatal error in wait__(): Invalid argument (22)

It seems that the select_timeout value can be too large; see this LLDB session (using a debug VM built from commit 947eb1f):

% "$ERL_TOP/bin/cerl" -lldb \
  -pa $(find /usr/local/Cellar -name "ebin" -type d | grep elixir) \
  -elixir ansi_enabled true -noshell -s elixir start_cli \
  -- -extra /usr/local/bin/mix test
(lldb) target create "beam.smp"
Current executable set to 'beam.smp' (x86_64).
(lldb) settings set -- target.run-args  "--" "-root" "/Users/vpp/otp" "-progname" "/Users/vpp/otp/bin/cerl" "--" "-home" "/Users/vpp" "--" "-pa" "/usr/local/Cellar/elixir/1.12.3/lib/logger/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/elixir/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/ex_unit/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/mix/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/eex/ebin" "/usr/local/Cellar/elixir/1.12.3/lib/iex/ebin" "-elixir" "ansi_enabled" "true" "-noshell" "-s" "elixir" "start_cli" "--" "-extra" "/usr/local/bin/mix" "test"
(lldb) command source -s 0 '/tmp/.cerllldb.62540'
Executing commands in '/tmp/.cerllldb.62540'.
(lldb) env TERM=dumb
(lldb) command script import /Users/vpp/otp/erts/etc/unix/etp.py
(lldb) breakpoint set --file ethr_event.c --line 584 -c "errno == 22 && sres < 0"
Breakpoint 1: where = beam.smp`wait__ + 2066 at ethr_event.c:584:10, address = 0x000000010044a9f2
(lldb) r
...
.Process 62635 stopped
* thread #10, name = '6_scheduler', stop reason = breakpoint 1.1
    frame #0: 0x000000010044a9f2 beam.smp`wait__(e=0x00000001423ec890, spincount=0, timeout=511635978304965000) at ethr_event.c:584:10
   581 		else {
   582 		    res = EINTR;
   583 		    if (sres < 0 && errno != EINTR)
-> 584 			ETHR_FATAL_ERROR__(errno);
   585 		    /* else:
   586 		     *   Event is *probably* set, but it can be a
   587 		     *   lingering writer. That is, it is important
Target 0: (beam.smp) stopped.
(lldb) p select_timeout
(timeval) $0 = (tv_sec = 511635978, tv_usec = 289965)

The fd sets in select() do not look suspicious. There aren't too many of them:

(lldb) p fd + 1
(int) $0 = 24
(lldb) p *rsetp
(fd_set) $1 = {
  fds_bits = {
    [0] = 8388608
    [1] = 0
    [2] = 0
    [3] = 0
    [4] = 0
    [5] = 0
    [6] = 0
    [7] = 0
    [8] = 0
    [9] = 0
    [10] = 0
    [11] = 0
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 0
    [16] = 0
    [17] = 0
    [18] = 0
    [19] = 0
    [20] = 0
    [21] = 0
    [22] = 0
    [23] = 0
    [24] = 0
    [25] = 0
    [26] = 0
    [27] = 0
    [28] = 0
    [29] = 0
    [30] = 0
    [31] = 0
  }
}
(lldb) p *esetp
(fd_set) $2 = {
  fds_bits = {
    [0] = 8388608
    [1] = 0
    [2] = 0
    [3] = 0
    [4] = 0
    [5] = 0
    [6] = 0
    [7] = 0
    [8] = 0
    [9] = 0
    [10] = 0
    [11] = 0
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 0
    [16] = 0
    [17] = 0
    [18] = 0
    [19] = 0
    [20] = 0
    [21] = 0
    [22] = 0
    [23] = 0
    [24] = 0
    [25] = 0
    [26] = 0
    [27] = 0
    [28] = 0
    [29] = 0
    [30] = 0
    [31] = 0
  }
}

To Reproduce
This happens during our large test suite. It does not always happen during the same tests, so I cannot provide a minimal example.

Affected versions
I can reliably reproduce this with OTP 24, but it also happens with earlier versions. We encountered it with OTP 22 and OTP 21 as well.

Additional context

I can currently trigger this reliably and can provide more debugging information as needed. Providing a core dump would also be possible if we can find a secure channel for it. Note that our software uses NIFs implemented with rustler, some of which run on dirty schedulers. However, we have not yet seen any other indicators of corruption due to NIFs (no segfaults, for example; we have been running the NIFs in one form or another for about 3 years).

macOS version information:

% sw_vers
ProductName:	macOS
ProductVersion:	11.1
BuildVersion:	20C69

EINVAL can be triggered for select() with the following C example on our test machine (empty fd sets!):

#include <stdio.h>
#include <errno.h>
#include <sys/time.h>

int main(void) {
  struct timeval select_timeout;
  int res;

  /* A very large (but valid) timeout value. */
  select_timeout.tv_sec = 511693940;
  select_timeout.tv_usec = 724117;

  /* No fd sets at all; only the timeout can cause a failure. */
  res = select(1, NULL, NULL, NULL, &select_timeout);

  printf("res: %i, errno: %i\n", res, errno);

  return 0;
}
% make large_val_in_select
cc     large_val_in_select.c   -o large_val_in_select
% ./large_val_in_select
res: -1, errno: 22
@evnu evnu added the bug Issue is reported as a bug label Nov 2, 2021
@rickard-green rickard-green self-assigned this Nov 2, 2021
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Nov 2, 2021
@rickard-green rickard-green linked a pull request Nov 2, 2021 that will close this issue
@rickard-green (Contributor)

Thanks for the detailed bug report!

It turns out that the tv_sec field is not allowed to be larger than 100'000'000 on macOS. Unfortunately, this is not documented.

I've made pull request #5342, which should fix this issue. The branch of the pull request is based on OTP 22.3.4 so that we can reuse it for patches on OTP 22 and onwards; it should merge cleanly to any OTP version later than OTP 22.3.4. Note that I have only done very limited testing of it, but I'm quite confident that it will solve this issue. Please try it out in your test environment and confirm that it fixes the crashes you see.

@rickard-green (Contributor)

Hmm, I just looked at the POSIX documentation of select(), and it seems that implementations are not required to support timeout values larger than 31 days. I therefore think I'll modify pull request #5342 to not use timeout values larger than that on any platform. That will be work for tomorrow, though...

@evnu (Author) commented Nov 3, 2021

Thanks for the quick reply and PR @rickard-green! I applied e90aa6f on top of OTP-24.1.4. With that, our tests no longer crash!

@rickard-green (Contributor)

> Thanks for the quick reply and PR @rickard-green! I applied e90aa6f on top of OTP-24.1.4. With that, our tests no longer crash!

Great, thanks!

I've replaced that PR with PR #5345 which will treat select() the same regardless of platform and limit the maximum timeout value to 31 days.

rickard-green added a commit that referenced this issue Nov 11, 2021
* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-24

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-23

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
rickard-green pushed a commit that referenced this issue Nov 11, 2021
…t-22

* rickard/select-timeout-fix/GH-5339/OTP-17735:
  Limit timeout value in calls to select()
@rickard-green (Contributor)

The fix has been released in the patch packages OTP 24.1.5, OTP 23.3.4.9, and OTP 22.3.4.23 now.

This issue was closed.