Crash in code_server.erl (badmatch: error) #8510

Closed
robashton opened this issue May 26, 2024 · 8 comments · Fixed by #8744

robashton commented May 26, 2024

Describe the bug
In certain scenarios (our massive PureScript codebase compiling into Erlang), possibly pertaining to short-lived processes coupled with modules with larger-than-normal load times (though I'm not sure), we get a bad match thrown inside code_server. The place where this reproduces most is in our Cowboy use: if we make some small requests just after start-up, we get the error with the module cowboy_static@ps (which nearly everything references). The file itself is nothing remarkable and I don't think it is related to the issue.

If I preload the module with a manual load this does not happen.

I think this relates to the changes made in #7503 and #6736

To Reproduce
I can't easily build an isolated example

Expected behavior
The code should load and not crash code_server

Affected versions
26+

Additional context
This bit I can help with: I've pulled down the latest and built it, and the problem exists on line 1184 of code_server.erl, where we do a maps:take on the loading collection and the module doesn't exist in there.
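For illustration, a minimal shell session (not the OTP source; the module name is only an example) showing why this raises a badmatch: maps:take/2 returns the atom error when the key is absent, so matching the result directly against a two-tuple crashes.

1> Loading = #{}.
#{}
2> {Waiting, Rest} = maps:take(cowboy_static@ps, Loading).
** exception error: no match of right hand side value error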

A trace of the messages to and from the code server pid reveals the following order of events

4672: <9432.50.0> send: {code_server,{{binary_data,3620},
              "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_deps/norsk/e
bin/cowboy_static@ps.beam",
              #Ref<0.167300051.2764308486.137148>}} -> <0.1045.0>
4672: <9432.50.0> 'receive': {code_call,<0.1045.0>,
           {load_module,#Ref<0.167300051.2764439556.148827>,cowboy_static@ps,
                        "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_de
ps/norsk/ebin/cowboy_static@ps.beam",
                        false,#Ref<0.167300051.2764308486.137148>}}
4673: <9432.50.0> 'receive': {{'LOADER_DOWN',{cowboy_static@ps,{binary_data,3620},
                                  "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/medi
a-shell_deps/norsk/ebin/cowboy_static@ps.beam"}},
 #Ref<0.167300051.2764308486.137148>,process,<0.1045.0>,shutdown}
4673: <9432.50.0> spawn: <0.1047.0>, {erlang,apply,
                    [{"fun code_server:-handle_on_load/5-fun-0-/0",
                      [cowboy_static@ps]},
                     []]}
4673: <9432.50.0> send_to_non_existing_process: {code_server,{{binary_data,3620},
              "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/media-shell_deps/norsk/e
bin/cowboy_static@ps.beam",
              #Ref<0.167300051.2764308481.176919>}}, <0.1046.0>
4673: <9432.50.0> 'receive': {{'LOADER_DOWN',{cowboy_static@ps,{binary_data,3620},
                                  "/private/var/tmp/_bazel_robashton/01ea9ac11aff3c301299cf1e989e0d76/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/media-shell.runfiles/__main__/medi
a-shell_deps/norsk/ebin/cowboy_static@ps.beam"}},
 #Ref<0.167300051.2764308481.176919>,process,<0.1046.0>,noproc}
4673: <9432.50.0> 'receive': {'DOWN',#Ref<0.167300051.2764308481.176918>,process,<0.1047.0>,ok}
4673: <9432.50.0> send: {finish_after_on_load,{cowboy_static@ps,true},
                      <0.50.0>,#Ref<0.167300051.2764308481.176922>} -> erts_code_purger
4675: <9432.50.0> 'receive': {reply,finish_after_on_load,ok,#Ref<0.167300051.2764308481.176922>}
4675: <9432.50.0> 'receive': timeout

Essentially we get some LOADER_DOWN message, remove the module from the map, and then try to read the module from the map.

Sticking a case statement around the maps:take "fixes" this, but I don't know if that is what is needed or if it's just a band-aid, not fully understanding the end-to-end flow of this code and not having sat down with pen and paper to work it out yet. If this is obvious to somebody else then I'm happy for them to make the fix, or tell me what the fix is and I'll do it.
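For concreteness, a hedged sketch of what that band-aid looks like (an illustrative helper, not the actual code_server change):

%% Defensive variant of the maps:take at the crash site: tolerate the entry
%% having already been removed (e.g. by the LOADER_DOWN handling) instead of
%% crashing the code server with a badmatch.
take_loading(Mod, Loading) ->
    case maps:take(Mod, Loading) of
        {Waiting, Rest} -> {ok, Waiting, Rest};
        error -> {already_removed, Loading}
    end.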

Cheers,

Rob

@robashton added the bug label May 26, 2024

josevalim commented May 27, 2024

I believe I can see why this bug happens.

  1. We store the module as loading here:

false -> get_object_code_for_loading(St1, Mod, From)

  2. Then, to reply success, we schedule it here (I say schedule because it is postponed by handle_pending_on_load):

Action = fun(_, S) ->
                 case (EnsureLoaded =/= false) andalso erlang:module_loaded(Mod) of
                     true ->
                         reply_loading(EnsureLoaded, Mod, {module, Mod}, S);
                     false ->
                         try_finish_module_1(File, Mod, PC, From, EnsureLoaded, S)
                 end
         end,
handle_pending_on_load(Action, Mod, From, St).

  3. However, if we receive a loader_down, we immediately delete it:

State = loader_down(State0, Info),

Then, when we process the pending on load queue, the loader down message has already been processed.

I believe the fix is to wrap both of the lines below in a handle_pending_on_load:

State = loader_down(State0, Info),

reply_loading(Ref, Mod, Error, S);

Otherwise we may process those events out of order. cc @michalmuskala
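To make the ordering problem concrete, here is a much-simplified, self-contained model (not the OTP source; all names are made up) of the race described above:

-module(race_model).
-export([demo/0]).

demo() ->
    %% my_mod is tracked as loading, with one waiting caller.
    Loading0 = #{my_mod => [caller1]},
    %% The finish/reply step is deferred behind the pending on_load work,
    %% capturing the expectation that my_mod is still in the loading map.
    Finish = fun(Loading) -> maps:take(my_mod, Loading) end,
    %% Meanwhile, LOADER_DOWN is handled immediately and removes the entry.
    Loading1 = maps:remove(my_mod, Loading0),
    %% When the deferred step finally runs, the entry is gone; in
    %% code_server the result is pattern matched, hence {badmatch,error}.
    error = Finish(Loading1),
    ok.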

@josevalim (Contributor)

Actually, we pretty much have two different locking mechanisms for loading a module: one for loading on the client, and another for handling on_load callbacks. We should unify them into one.
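Purely as an illustration of what a single mechanism could mean (a sketch under my own assumptions, not the shape of the eventual fix): one pending table per module, so loader exits and on_load completions are handled through the same queue.

%% Illustrative types only: each module has one pending entry, whether we
%% are waiting on a loader process or on an on_load callback, and waiting
%% callers queue up in either case.
-type pending_reason() :: {loader, pid()} | {on_load, pid()}.
-type pending() :: #{module() => {pending_reason(), Waiting :: [term()]}}.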

@rickard-green added the team:VM label May 27, 2024

josevalim commented Jun 3, 2024

I am still looking into this, but I am busy with the upcoming Elixir v1.17 release; sorry for the delay.

@robashton, a temporary solution would most likely be to preload any modules with -on_load callbacks in them. For example, you can do it in your application:start/2 callback.
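A hedged sketch of that workaround (the module, supervisor, and application names below are placeholders, not from this thread):

-module(my_app_app).
-behaviour(application).
-export([start/2, stop/1]).

start(_Type, _Args) ->
    %% Eagerly load every module that defines an -on_load callback so the
    %% loads happen here, serially, rather than racing at request time.
    OnLoadMods = [some_nif_module, another_on_load_module],
    _ = [code:ensure_loaded(M) || M <- OnLoadMods],
    my_app_sup:start_link().

stop(_State) ->
    ok.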

@robashton (Author)

We are just running a forked OTP for now that pretends the error doesn't exist; we aren't blocked, so no worries!

josevalim added a commit to josevalim/otp that referenced this issue Aug 25, 2024
Prior to this patch, the code server had two
internal queues, one to track module loading
and another to track on_load callbacks. This
pull request refactors the code to have a
single queue, in order to fix bugs and improve
maintainability.

Closes erlang#7466.
Closes erlang#8510.

axelson commented Sep 13, 2024

Are there any workarounds for this in the meantime? It seems like we'd have to go back pretty far to avoid the changes from #7503 and #6736

@josevalim (Contributor)

This happens on modules with on_load callbacks in them, so you can try preloading them beforehand (for example, in your helper). We also haven't had reports of this happening earlier than OTP 27. More concurrency was added in 27, which would explain why we suddenly started seeing it.


axelson commented Sep 16, 2024

Hmm, we're currently using OTP 26 so I guess you have a report of it now 😄

I did find some modules in our dependencies that have on_load callbacks (via cd deps; rg on_load), and I've now added loading of those at the bottom of our test_helper.exs:

Code.ensure_loaded(JsonXema)
Code.ensure_loaded(Bcrypt.Base)
Code.ensure_loaded(Jaxon.Parsers.NifParser)

I've also removed some usages of Patch, which loads modules at runtime, and now the frequency appears to have gone way down (although it is still happening occasionally).

@josevalim (Contributor)

Good to know! If Patch is relying on on_load callbacks, then that's likely the root cause. I'd recommend opening up an issue, because, as far as I know, the on_load callback is not recommended for anything besides loading NIFs.
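For reference, the conventional use of -on_load is to load a NIF when the module itself is loaded; a minimal sketch (the application and library names are placeholders):

-module(my_nif).
-export([do_thing/1]).
-on_load(init/0).

init() ->
    %% Load the native library shipped in the application's priv directory.
    SoName = filename:join(code:priv_dir(my_app), "my_nif"),
    erlang:load_nif(SoName, 0).

do_thing(_Arg) ->
    %% Replaced by the NIF implementation once load_nif/2 succeeds.
    erlang:nif_error(nif_not_loaded).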

kikofernandez pushed a commit to josevalim/otp that referenced this issue Oct 24, 2024