Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce contention on the code_server #6736

Merged
merged 2 commits into from Feb 10, 2023

Conversation

josevalim
Copy link
Contributor

@josevalim josevalim commented Jan 25, 2023

load_file/1, load_abs/2, and load_binary/3 now
perform most of the work on the client.

is_sticky/1 and is_loaded/1 now read directly
from the module database.

PS: I could not move ensure_loaded/1 to the
client for reasons I could not understand yet.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 25, 2023

CT Test Results

       2 files       65 suites   1h 6m 4s ⏱️
1 456 tests 1 225 ✔️ 227 💤 4
1 644 runs  1 367 ✔️ 271 💤 6

For more details on these failures, see this check.

Results for commit c9e6925.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

load_file/1, load_abs/2, and load_binary/3 now
perform most of the work on the client.

is_sticky/1 and is_loaded/1 now read directly
from the module database.
@bjorng bjorng added the team:VM Assigned to OTP team VM label Jan 26, 2023
@bjorng
Copy link
Contributor

bjorng commented Jan 26, 2023

A thought about ensure_loaded: If two processes calls code:ensure_loaded/1 for the same module, the module must only be loaded once. That is, when the code server has done finish_loading for the first request to load the module, it must discard the second request request to load the module. One way to implement that would be to compare the MD5 for the module. Not sure if that was the actual issue you encountered.

@bjorng
Copy link
Contributor

bjorng commented Jan 26, 2023

Have you observed actual contention caused by the functions you have optimized? (That is, will this optimization make a meaningful difference?)

When running in interactive mode, I would expect most code loading being caused by a call to an unloaded module and that the error_handler module would then call code:ensure_loaded/1 to load the module in question. I could be wrong.

@josevalim
Copy link
Contributor Author

@bjorng no, I didn't measure anything. My thoughts are that the approach is generally preferable in terms of design. In particular:

  • is_sticky and is_loaded are positive changes with no drawbacks (perhaps we should even enable read_concurrency in the table?)

  • Moving load_abs and load_binary to the client have no harm - but may not provide any benefit

  • load_file though needs to talk to the code server twice, passing the .beam binary around, so I understand if we would prefer to not change it

Do you think those are fair?

I will share what I did for ensure_loaded later.

@josevalim
Copy link
Contributor Author

@bjorng here is the commit for ensure_loaded:

josevalim@8f7b636

and here is the test failure I get:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
code_SUITE:on_load_binary failed on line 1304
Reason: {badmatch,{undef,[{on_load_binary,ok,[],[]},
        {code_S...}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

@bjorng
Copy link
Contributor

bjorng commented Jan 26, 2023

I am only worried about potential unintended consequences because of lack serialization of the request.

Moving relatively cheap calls such as is_sticky and is_loaded to the client might not be worth it.

Loading potentially very expensive calls such as the call to prepare_loading is probably worth moving to the client (possibly even if there will be two calls to the server).

I will try to take a look at why ensure_loaded does not work tomorrow. I think that there is a much greater potential for more concurrency if that would work.

@josevalim
Copy link
Contributor Author

Moving relatively cheap calls such as is_sticky and is_loaded to the client might not be worth it.

Can you think of a scenario where it would introduce bugs? Given the writes are serialized, we will always read our own writes. The only race would be if two different processes are writing and reading at the same time, but then the result is not guaranteed in any way.

In any case, you are right in that they are cheap and the benefit in moving them is minimal. I will be glad to amend it in any way desired.

@bjorng
Copy link
Contributor

bjorng commented Jan 27, 2023

Can you think of a scenario where it would introduce bugs?

No, I am just paranoid. 😉

I have looked at the failed test case and I now roughly understands what happens. The following test module is used:

-module(on_load_binary).
-export([ok/0]).
-on_load({init, 0}).

init() ->
    on_load_binary_test_case_process !
        {on_load_binary, self()},
    receive go -> ok end.

ok() -> true.

One process attempts to load it using code:load_binary/3, and another process tries to call the ok/0 function, which will cause a call to code:ensure_loaded/1 from the error handler. It seems that init/1 is called once for each load attempt, which is not supposed to happen; the load attempts should be serialised and init1/ should only be called once. I am not sure why the test case terminates with an exception instead of hanging.

I will look more at this next week.

@frazze-jobb
Copy link
Contributor

I have now taken a look at this new ensure loaded function and fixed the issue.
The issue was that the second process needed to wait for the on_load function to finish before it can use the module.
I added a call in ensure loaded that asks the code_server if the module has an on_load function pending. If so wait for that to finish.

@frazze-jobb frazze-jobb added the testing currently being tested, tag is used by OTP internal CI label Feb 8, 2023
@frazze-jobb frazze-jobb merged commit f36b268 into erlang:master Feb 10, 2023
@frazze-jobb
Copy link
Contributor

Merged, thanks @josevalim for your contribution!

michalmuskala added a commit to michalmuskala/otp that referenced this pull request Jul 19, 2023
With changes in erlang#6736 significant work when code loading was moved
away from the code server and shifted into the requesting process.
However, there could be a lot of repeated work, especially on system
startup, if a lot of similar processes are started, all trying to
load the same module at the same time, overwhelming the code server
with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded`
that makes sure only one process at a time tries to load the module.
In the added test showcasing the worst-case scenario of long load path
and many concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.
michalmuskala added a commit to michalmuskala/otp that referenced this pull request Jul 19, 2023
With changes in erlang#6736 significant work when code loading was moved
away from the code server and shifted into the requesting process.
However, there could be a lot of repeated work, especially on system
startup, if a lot of similar processes are started, all trying to
load the same module at the same time, overwhelming the code server
with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded`
that makes sure only one process at a time tries to load the module.
In the added test showcasing the worst-case scenario of long load path
and many concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.
michalmuskala added a commit to michalmuskala/otp that referenced this pull request Jul 19, 2023
With changes in erlang#6736 significant work when code loading was moved away from
the code server and shifted into the requesting process. However, there
could be a lot of repeated work, especially on system startup, if a lot
of similar processes are started, all trying to load the same module
at the same time, overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that
makes sure only one process at a time tries to load the module. In the
added test showcasing the worst-case scenario of long load path and many
concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.
michalmuskala added a commit to michalmuskala/otp that referenced this pull request Jul 19, 2023
With changes in erlang#6736 significant work when code loading was moved away from
the code server and shifted into the requesting process. However, there
could be a lot of repeated work, especially on system startup, if a lot
of similar processes are started, all trying to load the same module
at the same time, overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that
makes sure only one process at a time tries to load the module. In the
added test showcasing the worst-case scenario of long load path and many
concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.

Furthermore, the special sync operation can be merged with this and
save one round-trip to the code server.
michalmuskala added a commit to michalmuskala/otp that referenced this pull request Jul 19, 2023
With changes in erlang#6736 significant work when code loading was moved away from
the code server and shifted into the requesting process. However, there
could be a lot of repeated work, especially on system startup, if a lot
of similar processes are started, all trying to load the same module
at the same time, overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that
makes sure only one process at a time tries to load the module. In the
added test showcasing the worst-case scenario of long load path and many
concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.

Furthermore, the special sync operation can be merged with this and
save one round-trip to the code server.
michalmuskala added a commit to michalmuskala/otp that referenced this pull request Aug 16, 2023
With changes in erlang#6736 significant work when code loading was moved away from
the code server and shifted into the requesting process. However, there
could be a lot of repeated work, especially on system startup, if a lot
of similar processes are started, all trying to load the same module
at the same time, overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that
makes sure only one process at a time tries to load the module. In the
added test showcasing the worst-case scenario of long load path and many
concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.

Furthermore, the special sync operation can be merged with this and
save one round-trip to the code server.
potatosalad pushed a commit to WhatsApp/warts that referenced this pull request Sep 7, 2023
Summary:
With changes in erlang#6736 significant work when code loading was moved away from
the code server and shifted into the requesting process. However, there
could be a lot of repeated work, especially on system startup, if a lot
of similar processes are started, all trying to load the same module
at the same time, overwhelming the code server with `get_object_code` requests.

In this change, we add additional synchronisation to `ensure_loaded` that
makes sure only one process at a time tries to load the module. In the
added test showcasing the worst-case scenario of long load path and many
concurrent requests, this changes the runtime (on my local machine)
from 8s+ to around 200ms.

Furthermore, the special sync operation can be merged with this and
save one round-trip to the code server.

Test Plan: Upstream CI

Reviewers: vlanvin, skatepalli, #whatsapp_devinfra_ctt, #whatsapp_clr

Reviewed By: vlanvin

Differential Revision: https://phabricator.intern.facebook.com/D48425445
josevalim added a commit to josevalim/otp that referenced this pull request Oct 6, 2023
Erlang/OTP 25 only attempted to perform
code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of
the decentralization. However, we received
reports of increased cpu/memory usage in
Erlang/OTP 26.1 in a code that was calling
code:ensure_loaded/1 on a hot path. The
underlying code was fixed but, given erlang#7503
added the server back into the equation for
ensure_loaded we can add the mode check
back to preserve Erlang/OTP 25 behaviour.
josevalim added a commit to josevalim/otp that referenced this pull request Oct 6, 2023
Erlang/OTP 25 only attempted to perform
code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of
the decentralization. However, we received
reports of increased cpu/memory usage in
Erlang/OTP 26.1 in a code that was calling
code:ensure_loaded/1 on a hot path. The
underlying code was fixed but, given erlang#7503
added the server back into the equation for
ensure_loaded we can add the mode check
back to preserve Erlang/OTP 25 behaviour.
josevalim added a commit to josevalim/otp that referenced this pull request Oct 15, 2023
The first regression is not attempt to
load code if mode is embedded.

Erlang/OTP 25 only attempted to perform
code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of
the decentralization. However, we received
reports of increased cpu/memory usage in
Erlang/OTP 26.1 in a code that was calling
code:ensure_loaded/1 on a hot path. The
underlying code was fixed but, given erlang#7503
added the server back into the equation for
ensure_loaded we can add the mode check
back to preserve Erlang/OTP 25 behaviour.

The second regression would cause the caller
process to deadlock if attempting to load a
file with invalid .beam more than once.
josevalim added a commit to josevalim/otp that referenced this pull request Dec 13, 2023
The first regression is not attempt to
load code if mode is embedded.

Erlang/OTP 25 only attempted to perform
code loading if the mode was interactive:

https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301

This check was removed in erlang#6736 as part of
the decentralization. However, we received
reports of increased cpu/memory usage in
Erlang/OTP 26.1 in a code that was calling
code:ensure_loaded/1 on a hot path. The
underlying code was fixed but, given erlang#7503
added the server back into the equation for
ensure_loaded we can add the mode check
back to preserve Erlang/OTP 25 behaviour.

The second regression would cause the caller
process to deadlock if attempting to load a
file with invalid .beam more than once.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:VM Assigned to OTP team VM testing currently being tested, tag is used by OTP internal CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants