Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve timeout problems around running >= 200 realizations #5597

Conversation

valentin-krasontovitsch
Copy link
Contributor

@valentin-krasontovitsch valentin-krasontovitsch commented Jun 19, 2023

Issue
Resolves #5273

Approach
The failing symptoms occur in components communicating over websockets, in particular the ensemble evaluator as a server and the job queue, gui/cli, and job dispatch scripts as clients.

There are several improvements around the communication between those components.
The main (suspected) fix is backing of the so-called JobQueueNode busy business, by letting it sleep for a longer amount of time instead of one second between each querying of the queue system.
Rationale behind this is that the JobQueueNode keeps track of the state of a whole realization, with coarse resolution (not considering forward models). For long running realizations (>= 1h) it is quite unnecessary to query the state every second, as it does not change as long as the realization is running.
Given that each JobQueueNode runs in its own thread, @berland had the idea that this might starve components running in other threads for resources, which was somewhat confirmed by looking at timing logs (how long did some processing take) and their respective timestamps.

Pre review checklist

  • Added appropriate release note label
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • NOT RELEVANT - Updated documentation
  • Ensured new behaviour is tested

Adding labels helps the maintainers when writing release notes. This is the list of release note labels.

@valentin-krasontovitsch valentin-krasontovitsch changed the title Test threading sleep with poly Resolve timeout problems around running many (>= 200) realizations Jun 20, 2023
@valentin-krasontovitsch valentin-krasontovitsch changed the title Resolve timeout problems around running many (>= 200) realizations Resolve timeout problems around running >= 200 realizations Jun 20, 2023
@valentin-krasontovitsch valentin-krasontovitsch added the release-notes:bug-fix Automatically categorise as bug fix in release notes label Jun 20, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2023

Codecov Report

Merging #5597 (61a5bb6) into main (c1477f0) will decrease coverage by 0.01%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##             main    #5597      +/-   ##
==========================================
- Coverage   77.07%   77.06%   -0.01%     
==========================================
  Files         394      394              
  Lines       25784    25797      +13     
  Branches     1739     1739              
==========================================
+ Hits        19873    19881       +8     
- Misses       5335     5340       +5     
  Partials      576      576              
Impacted Files Coverage Δ
src/ert/ensemble_evaluator/evaluator.py 91.70% <ø> (ø)
src/ert/ensemble_evaluator/sync_ws_duplexer.py 90.12% <ø> (ø)
src/ert/_c_wrappers/job_queue/queue.py 86.08% <88.88%> (-1.86%) ⬇️
src/ert/_c_wrappers/job_queue/job_queue_node.py 90.05% <100.00%> (+0.55%) ⬆️
src/ert/ensemble_evaluator/dispatch.py 95.71% <100.00%> (+0.71%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@valentin-krasontovitsch valentin-krasontovitsch force-pushed the test-threading-sleep-with-poly branch 3 times, most recently from 61a5bb6 to 6efaa73 Compare June 21, 2023 08:11
The websockets client in the job queue implements a manual backoff,
which is no longer necessary with python > 3.6

We remove the test covering the reconnection attempts of the websocket
client used in the job queue. As the code stands now, it is difficult to
test the desired behaviour, cf. the reconnect test in the websockets
library itself. We postpone until we have refactored this part of the
code into an experiment server.

Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>
The dispatch message batcher (part of the ensemble evaluator) was
measuring and logging the duration of the processing of messages before
actually processing them with the respective handlers.

We move the (final part of the measuring and) logging to a done callback
executed after processing of a batch is complete.
We are experiencing many timeouts in clients talking to the websocket
server running as part of the ensemble evaluator.

This affects both the job queue, as well as the job runner scripts
connecting from the compute nodes.

We have observed that the ensemble evaluator (ee) has to handle many
events coming from the job runner scripts, which can take more than 10
seconds, which blocks the ee and stops it from taking care of its client
connections (in particular receiving, sending, and responding to ping
pongs, which are part of websockets).

We thus increase the ping pong timeouts and intervals on both clients
and server, and increase the open connection timeout for clients, as
both situations have been observed to cause timeouts.

Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>
@valentin-krasontovitsch valentin-krasontovitsch force-pushed the test-threading-sleep-with-poly branch 2 times, most recently from ab3b2df to c6a04ee Compare June 21, 2023 10:07
@kvashchuka
Copy link
Contributor

I will dare to ask - should we add a test for this fix?

@valentin-krasontovitsch
Copy link
Contributor Author

we've discussed the testing. at the moment, we are in the unfortunate situation of being pressed for time.

furthermore, the initial problems we were encountering and trying to fix were difficult to encapsulate in a unit test.

now we have an easy way of reproducing, using an integration test (run ert with this config) based on the poly example.

from yesterday's standup / discussions i understood that running that breaking example might bring down the computer it's running on, but i'm willing to take that risk (on a github action runner) ; )

i suggest to look into it after this is merged and the komodo build is under way.

@valentin-krasontovitsch valentin-krasontovitsch force-pushed the test-threading-sleep-with-poly branch 2 times, most recently from 52fc095 to 3b67331 Compare June 21, 2023 10:36
Instead of reconnecting every time we have changes to publish, we
reuse connections.

Furthermore, the job queue was trying to publish changes, establishing
a new connection each time, regardless of whether there were any changes
or not, spinning in a loop with a one second sleep.

We also move the loop code into its own function, for increased
readability.

Furthermore, we do not have a maximum reconnect counter with final
raising of an exception. The job queue will try to reconnect
indefinitely whenever a connection times out, and raise if the socket
it's trying to connect to is not open.
The rationale behind this is that we should be able to recover from
timeouts and the server will become responsive again even if it's busy
or a while.
We instead log whenever a connection is dropped by the job queue.

There is a minor rename of variables.

Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>
Co-authored-by: Julius Parulek <jparu@equinor.com>
The job queue node monitors the job running on a a single compute node,
which is responsible for running all forward models for one realization,
and runs in its own thread.

It queries the queue system regularly, in between sleeps.

We suspect that an increase of realizations, which leads to an increase
of threads, starves other threads for resources, for instance the thread
of the ensemble evaluator server, which in turn leads to timeouts in
connections the server holds.

We increase the sleep duration (after an initial start-up phase) and add
a random offset to stagger execution of the job threads.

Co-authored-by: Håvard Berland <havb@equinor.com>
Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful job @valentin-krasontovitsch and co.! Let's get it in! 🚀

@valentin-krasontovitsch valentin-krasontovitsch merged commit b2f5cc9 into equinor:main Jun 21, 2023
36 of 37 checks passed
@valentin-krasontovitsch valentin-krasontovitsch deleted the test-threading-sleep-with-poly branch June 21, 2023 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running ert with >= 200 realizations fails
4 participants