Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading a module that contains on_load may hang #7466

Open
jcpetruzza opened this issue Jul 3, 2023 · 5 comments
Open

Loading a module that contains on_load may hang #7466

jcpetruzza opened this issue Jul 3, 2023 · 5 comments
Assignees
Labels
bug Issue is reported as a bug stalled waiting for input by the Erlang/OTP team team:VM Assigned to OTP team VM

Comments

@jcpetruzza
Copy link

Describe the bug
If module foo contains an on_load annotation, then a call to code:load_binary/3 (and probably code:load_abs/1,2) may hang if there are concurrent calls to these functions, or even code:ensure_loaded/1) on the same module.

To Reproduce
Given these two modules:

%% file: foo.erl
-module(foo).

-on_load(init_module/0).

-export([bar/0]).

bar() -> ok.

init_module() ->
    io:format("Initializing ~p....~n", [?MODULE]),
    % wait some time, to make the race condition likely to happen
    timer:sleep(2_000),
    io:format(" done initializing ~p!~n", [?MODULE]).
%% file: repro.erl
-module(repro).

-export([go/0, do_load_binary/1]).

do_load_binary(Parent) ->
  {ok, Bin} = file:read_file("foo.beam"),
  code:load_binary(foo, "foo.beam", Bin),
  Parent ! {done, self()}.

go() ->
  Self = self(),
  X = spawn(fun() -> do_load_binary(Self) end),
  Y = spawn(fun() -> do_load_binary(Self) end),
  A = receive
    {done, X} -> ok
    after 10_000 -> not_ok
  end,
  B = receive
    {done, Y} -> ok
    after 10_000 -> not_ok
  end,
  {A, B}.

Run :

$ erlc foo.erl repro.erl && erl -noshell -eval 'Res = repro:go(), io:format("~p~n", [Res]), init:stop()'

Expected behavior
Both concurrent calls to code:load_binary/3 in repro:go/0 should eventually return, so we expect go() to return {ok, ok}. Instead, we see:

Initializing foo....
 done initializing foo!
Initializing foo....
 done initializing foo!
{ok,not_ok}

Without the after clause, the second receive will wait forever.

Affected versions
Both R25 and R26 show this problem, I didn't try it on older versions.

Additional context
As far as I can see, the problem is in code_server:try_finish_module, since the action that is passed to handle_pending_on_load will always return a {noreply, ...} when the module to load contains an on_load; however, if the action ends up waiting since there is an on_load in progress, this means that no reply will ever be sent to the caller.

We found the race condition investigating tests that would sometimes block. In our case, the race was between code:ensure_loaded/1 and code:load_binary/3 and the issue happens, again, when the action for load_binary waits on the on_load from ensure_loaded to finish, but this combination is harder to reproduce consistently.

@jcpetruzza jcpetruzza added the bug Issue is reported as a bug label Jul 3, 2023
@jhogberg jhogberg added the team:VM Assigned to OTP team VM label Jul 4, 2023
@jhogberg
Copy link
Contributor

jhogberg commented Jul 4, 2023

Thanks for your report! Just as a heads-up, we're a bit short-handed over the summer vacation period so we likely won't get around to fixing this until most of us are back from vacation.

@jhogberg jhogberg added the stalled waiting for input by the Erlang/OTP team label Jul 4, 2023
@mikpe
Copy link
Contributor

mikpe commented Jul 5, 2023

I don't think on_load/1 is intended to permit arbitrary Erlang code. Some years ago I mentioned using on_load/1 to implement linking support for a statically strongly typed language compiled to the BEAM, and @bjorng strongly advised against that, on the grounds that on_load/1 was only intended for loading NIFs. Maybe that has changed, but I wouldn't bet on it. (But if there are limitations they should be documented.)

@jhogberg
Copy link
Contributor

jhogberg commented Jul 6, 2023

on_load/1 complicates code loading in general (it still needs to appear atomic) so it's a fairly bad idea to do anything complicated in it, but it nevertheless shouldn't fail silently in this manner. Even if all you do is call load_nif it's still possible for this race to occur so we'll need to fix it before long.

leonardb added a commit to leonardb/elibphonenumber that referenced this issue Aug 10, 2023
When using OTP-26 our application crashes when io:format is used in the on_load function.

There is an open issue in otp repo with a similar issue where it's mentioned that on_load is somewhat brittle. erlang/otp#7466 (comment)

```
=INFO REPORT==== 10-Aug-2023::11:01:37.449084 ===
init got unexpected: {io_request,<0.1805.0>,
                         #Ref<0.3063375415.1090256898.200776>,
                         {put_chars,unicode,io_lib,format,
                             ["Loading library: ~p ~n",
                              [<<"/opt/smd/_build/local/rel/smd/lib/elibphonenumber-8.13.17/priv/phonenumber_util_nif">>]]}}

```
@frazze-jobb frazze-jobb assigned jhogberg and unassigned frazze-jobb Dec 11, 2023
@josevalim
Copy link
Contributor

Here is a patch that fixes this issue:

diff --git a/lib/kernel/src/code_server.erl b/lib/kernel/src/code_server.erl
index 7689acf124..4a5dbfc3c2 100644
--- a/lib/kernel/src/code_server.erl
+++ b/lib/kernel/src/code_server.erl
@@ -1387,16 +1387,19 @@ handle_pending_on_load_1(Mod, From, [H|T]) ->
 handle_pending_on_load_1(_, _, []) -> [].
 
 finish_on_load(PidRef, OnLoadRes, #state{on_load=OnLoad0}=St0) ->
-    case lists:keyfind(PidRef, 1, OnLoad0) of
-	false ->
-	    %% Since this process in general silently ignores messages
-	    %% it doesn't understand, it should also ignore a 'DOWN'
-	    %% message with an unknown reference.
-	    St0;
-	{PidRef,Mod,Waiting} ->
-	    St = finish_on_load_1(Mod, OnLoadRes, Waiting, St0),
-	    OnLoad = [E || {R,_,_}=E <- OnLoad0, R =/= PidRef],
-	    St#state{on_load=OnLoad}
+    case lists:keytake(PidRef, 1, OnLoad0) of
+        false ->
+            %% Since this process in general silently ignores messages
+            %% it doesn't understand, it should also ignore a 'DOWN'
+            %% message with an unknown reference.
+            St0;
+        {value, {PidRef,Mod,Waiting}, OnLoad} ->
+            %% The first callback we register (the last on the list),
+            %% must be the first to run, as it is the one that effectively
+            %% loads the module. We also must reset the OnLoad state before
+            %% running the actions, otherwise we may add new entries to
+            %% Waiting (which won't ever be picked up).
+            finish_on_load_1(Mod, OnLoadRes, lists:reverse(Waiting), St0#state{on_load=OnLoad})
     end.
 
 finish_on_load_1(Mod, OnLoadRes, Waiting, St) ->

The root cause was that we were not clearing up the OnLoad state for that module before triggering the actions, which would cause new entries to be added to Waiting (and those would never be picked up).

I was not able to write a test though due to limited CT knowledge, so I only have the fix. :) Anyone is welcome to use the patch above.

@josevalim
Copy link
Contributor

josevalim commented Feb 18, 2024

I will try to submit a patch for this one in the upcoming week. EDIT: I will do it once OTP 27 is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug stalled waiting for input by the Erlang/OTP team team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

5 participants