
Reduce repeated work in concurrent code loading situations #7503

Merged
merged 1 commit into from Sep 6, 2023

Conversation

michalmuskala
Contributor

With the changes in #6736, significant work during code loading was moved away from the code server and into the requesting process. However, this could lead to a lot of repeated work, especially at system startup, when many similar processes start at once, all trying to load the same module at the same time and overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that ensures only one process at a time tries to load a given module. In the added test, which showcases the worst-case scenario of a long load path and many concurrent requests, this change reduces the runtime (on my local machine) from over 8 s to around 200 ms.

Furthermore, the special sync operation can be merged with this, saving one round-trip to the code server.

Fixes #7479
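The synchronisation described above is essentially a "single-flight" pattern: the first requester performs the expensive load, while concurrent requesters for the same module block until it finishes and reuse the result. A minimal conceptual sketch in Python follows; this is not the actual Erlang implementation, and `SingleFlightLoader`, `ensure_loaded`, and `load_fn` are illustrative names:

```python
import threading

class SingleFlightLoader:
    """Conceptual model: deduplicate concurrent load requests per key.

    The first caller for a key performs the expensive load; concurrent
    callers for the same key block until it finishes and reuse the result.
    """
    def __init__(self, load_fn):
        self._load_fn = load_fn   # the expensive operation, e.g. a path search
        self._lock = threading.Lock()
        self._inflight = {}       # key -> Event set when the load completes
        self._results = {}        # key -> loaded value

    def ensure_loaded(self, key):
        with self._lock:
            if key in self._results:          # already loaded: reuse
                return self._results[key]
            event = self._inflight.get(key)
            if event is None:                 # we are the designated loader
                event = threading.Event()
                self._inflight[key] = event
                loader = True
            else:                             # someone else is loading
                loader = False
        if loader:
            value = self._load_fn(key)        # only one process does this
            with self._lock:
                self._results[key] = value
                del self._inflight[key]
            event.set()                       # wake all waiters
            return value
        event.wait()                          # wait for the loader to finish
        with self._lock:
            return self._results[key]
```

With this shape, any number of concurrent requests for the same key trigger the expensive load exactly once, which is the effect the change aims for on the code server.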

@github-actions
Contributor

github-actions bot commented Jul 19, 2023

CT Test Results

2 files, 65 suites, 1 h 2 m 43 s ⏱️
1 496 tests: 1 249 passed ✔️, 247 skipped 💤, 0 failed
1 687 runs: 1 393 passed ✔️, 294 skipped 💤, 0 failed

Results for commit cb85704.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@gonzalobf
Contributor

We have a server with a lot of concurrent processes doing the same work in parallel at system startup. The server got stuck after a few seconds, and we had to manually reduce the load to get it working.

The server was running OTP-26.0.2, and a coredump showed 18K {code_call,<0.26608.5>,{get_object_code,utils_module}} @from=<0.26608.5> messages in the code_server mailbox.

I have tried this fix in production on OTP-26.0.2, and I can confirm that it resolves the issue for us.

Thank you @michalmuskala :)

@jhogberg jhogberg self-assigned this Jul 21, 2023
@jhogberg jhogberg added team:VM Assigned to OTP team VM enhancement testing currently being tested, tag is used by OTP internal CI labels Jul 21, 2023
@jhogberg jhogberg added this to the OTP-26.1 milestone Jul 21, 2023
@jhogberg jhogberg removed the testing currently being tested, tag is used by OTP internal CI label Jul 28, 2023
@jhogberg
Contributor

This PR seems to cause crashes in multi_load_SUITE:ensure_modules_loaded/1, can you look into it?

*** System report during multi_load_SUITE:ensure_modules_loaded/1 2023-07-28 01:40:46.450 ***

=ERROR REPORT==== 28-Jul-2023::01:40:46.448231 ===
Error in process <0.50.0> on node test_server@idril with exit value:
{function_clause,
    [{code_server,try_finish_module,
         ["/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/test_server/ct_run.test_server@idril.2023-07-28_00.51.09/test.kernel_test.logs/run.2023-07-28_00.52.42/log_private/multi_load_ensure_modules_loaded/m__1_195074.beam",
          m__1_195074,#Ref<0.599638063.1805516801.41042>,true,<0.119740.0>,
          {state,<0.49.0>,"/daily_build/otp",
              [{"/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/test_server/ct_run.test_server@idril.2023-07-28_00.51.09/test.kernel_test.logs/run.2023-07-28_00.52.42/log_private/multi_load_ensure_modules_loaded",
                nocache},
               {"/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/test_server/ct_run.test_server@idril.2023-07-28_00.51.09/test.kernel_test.logs/run.2023-07-28_00.52.42/log_private",
                nocache},
               {"/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/kernel_test",
                nocache},
               {".",nocache},
               {"/daily_build/otp/lib/kernel-9.1/ebin",nocache},
               {"/daily_build/otp/lib/stdlib-5.0.3/ebin",nocache},
               {"/daily_build/otp/lib/sasl-4.2.1/ebin",nocache},
               {"/daily_build/proper/ebin",nocache},
               {"/daily_build/otp/lib/xmerl-1.3.31.1/ebin",nocache},
               {"/daily_build/otp/lib/wx-2.3/ebin",nocache},
               {"/daily_build/otp/lib/tools-3.6/ebin",nocache},
               {"/daily_build/otp/lib/tftp-1.1/ebin",nocache},
               {"/daily_build/otp/lib/syntax_tools-3.1/ebin",nocache},
               {"/daily_build/otp/lib/ssl-11.0.2/ebin",nocache},
               {"/daily_build/otp/lib/ssh-5.0.1/ebin",nocache},
               {"/daily_build/otp/lib/snmp-5.15/ebin",nocache},
               {"/daily_build/otp/lib/runtime_tools-2.0/ebin",nocache},
               {"/daily_build/otp/lib/reltool-1.0/ebin",nocache},
               {"/daily_build/otp/lib/public_key-1.14/ebin",nocache},
               {"/daily_build/otp/lib/parsetools-2.5/ebin",nocache},
               {"/daily_build/otp/lib/os_mon-2.9/ebin",nocache},
               {"/daily_build/otp/lib/odbc-2.14.1/ebin",nocache},
               {"/daily_build/otp/lib/observer-2.15/ebin",nocache},
               {"/daily_build/otp/lib/mnesia-4.22/ebin",nocache},
               {"/daily_build/otp/lib/megaco-4.5/ebin",nocache},
               {"/daily_build/otp/lib/jinterface-1.14/ebin",nocache},
               {"/daily_build/otp/lib/inets-9.0.2/ebin",nocache},
               {"/daily_build/otp/lib/ftp-1.2/ebin",nocache},
               {"/daily_build/otp/lib/eunit-2.8.2/ebin",nocache},
               {"/daily_build/otp/lib/et-1.7/ebin",nocache},
               {"/daily_build/otp/lib/erts-14.1/ebin",nocache},
               {"/daily_build/otp/lib/erl_interface-5.4/ebin",nocache},
               {"/daily_build/otp/lib/erl_docgen-1.5/ebin",nocache},
               {"/daily_build/otp/lib/eldap-1.2.11/ebin",nocache},
               {"/daily_build/otp/lib/edoc-1.2/ebin",nocache},
               {"/daily_build/otp/lib/diameter-2.3/ebin",nocache},
               {"/daily_build/otp/lib/dialyzer-5.1.1/ebin",nocache},
               {"/daily_build/otp/lib/debugger-5.3.2/ebin",nocache},
               {"/daily_build/otp/lib/crypto-5.3/ebin",nocache},
               {"/daily_build/otp/lib/compiler-8.3.3/ebin",nocache},
               {"/daily_build/otp/lib/common_test-1.25.1/ebin",nocache},
               {"/daily_build/otp/lib/asn1-5.1/ebin",nocache},
               {"/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/test_server",
                nocache},
               {"/ldisk/daily_build/26_opu_c_docker_72968.2023-07-27_21/test/kernel_test/application_SUITE_data/deadlock",
                nocache}],
              code_server,code_names,interactive,[],#{}}],
         [{file,"code_server.erl"},{line,1118}]},
     {code_server,loop,1,[{file,"code_server.erl"},{line,165}]}]}

@michalmuskala
Contributor Author

I'll take a look


monitor_loader(#state{loading = Loading0} = St, Mod, Pid, Bin, FName) ->
    Tag = {'LOADER_DOWN', {Mod, Bin, FName}},
    Ref = erlang:monitor(process, Pid, [{tag, Tag}]),

Very neat use of the tag to avoid the need for keeping stuff in state. Love it.


Yeah, I've wanted to do this ever since this feature was introduced. It saves on the "back" mapping.
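For readers unfamiliar with the `{tag, Tag}` monitor option being praised here: the caller-chosen tag travels with the 'DOWN' message itself, so the receiver needs no separate ref-to-context table. A rough Python model of the idea follows; Erlang monitors are not Python threads, and `monitor_with_tag` is a hypothetical name:

```python
import queue
import threading

def monitor_with_tag(target, tag, mailbox):
    """Run target in a thread; on exit, deliver a notification carrying the
    caller-chosen tag (a rough model of monitor(process, Pid, [{tag, Tag}]))."""
    def run():
        try:
            target()
        finally:
            mailbox.put((tag, "DOWN"))  # the tag travels with the event itself
    t = threading.Thread(target=run)
    t.start()
    return t

mailbox = queue.Queue()
# The tag carries everything needed to handle the exit, so the receiver
# keeps no separate ref -> {Mod, Bin, FName} bookkeeping.
tag = ("LOADER_DOWN", ("my_mod", b"<beam bytes>", "my_mod.beam"))
monitor_with_tag(lambda: None, tag, mailbox).join()
received_tag, event = mailbox.get()
```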

@michalmuskala
Contributor Author

Sorry for the delay, I got a bit sick.

I've sent an update that should fix multi_load_SUITE. This keeps the old behaviour for ensure_modules_loaded. I initially tried to add the locking mechanism there too, but due to the intermediate processes in do_par, the logic was becoming quite unwieldy. So I limited the bug fix to just ensure_loaded, since that's the function called implicitly by the error handler and the main one causing issues.

@jhogberg jhogberg added the testing currently being tested, tag is used by OTP internal CI label Aug 16, 2023
@jhogberg
Contributor

Thanks, I've added it back to our tests :)

@michalmuskala
Contributor Author

Any updates on the results of the latest tests?

@frazze-jobb
Contributor

The tests look good!

@frazze-jobb frazze-jobb merged commit 4ab6d3f into erlang:maint Sep 6, 2023
15 checks passed
@michalmuskala michalmuskala deleted the async_ensure_loaded branch September 7, 2023 13:30
josevalim added a commit to josevalim/otp that referenced this pull request Oct 6, 2023
Erlang/OTP 25 only attempted to perform code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of the decentralization. However, we received reports of increased cpu/memory usage in Erlang/OTP 26.1 in code that was calling code:ensure_loaded/1 on a hot path. The underlying code was fixed but, given that erlang#7503 added the server back into the equation for ensure_loaded, we can add the mode check back to preserve the Erlang/OTP 25 behaviour.
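The restored guard can be sketched as follows. This is a hypothetical Python model of the mode check, not the code_server code; `ensure_loaded` and `load_from_path` are illustrative names:

```python
def ensure_loaded(module, loaded, mode, load_from_path):
    """Model of the restored OTP 25 behaviour: in embedded mode, never
    search the load path; only report whether the module is present."""
    if module in loaded:
        return ("module", module)       # already loaded: nothing to do
    if mode == "embedded":
        return ("error", "embedded")    # embedded mode never loads from disk
    return load_from_path(module)       # interactive mode: attempt the load
```

With the check in place, hot-path calls for missing modules in embedded mode return immediately instead of repeatedly hitting the load path.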
josevalim added a commit to josevalim/otp that referenced this pull request Oct 6, 2023
josevalim added a commit to josevalim/otp that referenced this pull request Oct 15, 2023
The first regression is that code loading is attempted even when the mode is embedded.

Erlang/OTP 25 only attempted to perform code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of the decentralization. However, we received reports of increased cpu/memory usage in Erlang/OTP 26.1 in code that was calling code:ensure_loaded/1 on a hot path. The underlying code was fixed but, given that erlang#7503 added the server back into the equation for ensure_loaded, we can add the mode check back to preserve the Erlang/OTP 25 behaviour.

The second regression would cause the caller process to deadlock when attempting to load a file with an invalid .beam more than once.
josevalim added a commit to josevalim/otp that referenced this pull request Dec 13, 2023
josevalim added a commit to josevalim/otp that referenced this pull request Jan 24, 2024
We were accidentally returning the old closed-over state, which led to a loss of information, causing the code_server to crash.

This was introduced in erlang#7503.
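The bug class described here, where a new state is computed but the old closed-over state keeps being used, can be illustrated with a hypothetical Python sketch (not the actual code_server code; `handle`, `loop_buggy`, and `loop_fixed` are illustrative names):

```python
# A handler returns new state, but the buggy loop keeps reusing the old
# closed-over state, so every update is silently lost.

def handle(state, msg):
    new_state = dict(state)
    new_state[msg] = True
    return new_state

def loop_buggy(state, msgs):
    for m in msgs:
        handle(state, m)           # new state discarded; old state reused
    return state                   # returns the old closed-over state

def loop_fixed(state, msgs):
    for m in msgs:
        state = handle(state, m)   # thread the new state through the loop
    return state
```

In a server loop, losing updates this way leaves the state inconsistent with reality, which can later surface as a crash like the function_clause above.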