Resolve timeout problems around running >= 200 realizations #5597

valentin-krasontovitsch · 2023-06-19T13:34:57Z

Issue
Resolves #5273

Approach
The failing symptoms occur in components communicating over websockets, in particular the ensemble evaluator as a server and the job queue, gui/cli, and job dispatch scripts as clients.

There are several improvements around the communication between those components.
The main (suspected) fix is backing of the so-called JobQueueNode busy business, by letting it sleep for a longer amount of time instead of one second between each querying of the queue system.
Rationale behind this is that the JobQueueNode keeps track of the state of a whole realization, with coarse resolution (not considering forward models). For long running realizations (>= 1h) it is quite unnecessary to query the state every second, as it does not change as long as the realization is running.
Given that each JobQueueNode runs in its own thread, @berland had the idea that this might starve components running in other threads for resources, which was somewhat confirmed by looking at timing logs (how long did some processing take) and their respective timestamps.

Pre review checklist

Added appropriate release note label
PR title captures the intent of the changes, and is fitting for release notes.
Commit history is consistent and clean, in line with the contribution guidelines.
NOT RELEVANT - Updated documentation
Ensured new behaviour is tested

Adding labels helps the maintainers when writing release notes. This is the list of release note labels.

src/ert/_c_wrappers/job_queue/job_queue_node.py

src/ert/_c_wrappers/job_queue/queue.py

src/ert/ensemble_evaluator/_builder/_legacy.py

codecov-commenter · 2023-06-20T06:27:02Z

Codecov Report

Merging #5597 (61a5bb6) into main (c1477f0) will decrease coverage by 0.01%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##             main    #5597      +/-   ##
==========================================
- Coverage   77.07%   77.06%   -0.01%     
==========================================
  Files         394      394              
  Lines       25784    25797      +13     
  Branches     1739     1739              
==========================================
+ Hits        19873    19881       +8     
- Misses       5335     5340       +5     
  Partials      576      576

Impacted Files	Coverage Δ
src/ert/ensemble_evaluator/evaluator.py	`91.70% <ø> (ø)`
src/ert/ensemble_evaluator/sync_ws_duplexer.py	`90.12% <ø> (ø)`
src/ert/_c_wrappers/job_queue/queue.py	`86.08% <88.88%> (-1.86%)`	⬇️
src/ert/_c_wrappers/job_queue/job_queue_node.py	`90.05% <100.00%> (+0.55%)`	⬆️
src/ert/ensemble_evaluator/dispatch.py	`95.71% <100.00%> (+0.71%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

src/ert/ensemble_evaluator/dispatch.py

src/ert/_c_wrappers/job_queue/queue.py

The websockets client in the job queue implements a manual backoff, which is no longer necessary with python > 3.6 We remove the test covering the reconnection attempts of the websocket client used in the job queue. As the code stands now, it is difficult to test the desired behaviour, cf. the reconnect test in the websockets library itself. We postpone until we have refactored this part of the code into an experiment server. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>

The dispatch message batcher (part of the ensemble evaluator) was measuring and logging the duration of the processing of messages before actually processing them with the respective handlers. We move the (final part of the measuring and) logging to a done callback executed after processing of a batch is complete.

We are experiencing many timeouts in clients talking to the websocket server running as part of the ensemble evaluator. This affects both the job queue, as well as the job runner scripts connecting from the compute nodes. We have observed that the ensemble evaluator (ee) has to handle many events coming from the job runner scripts, which can take more than 10 seconds, which blocks the ee and stops it from taking care of its client connections (in particular receiving, sending, and responding to ping pongs, which are part of websockets). We thus increase the ping pong timeouts and intervals on both clients and server, and increase the open connection timeout for clients, as both situations have been observed to cause timeouts. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>

kvashchuka · 2023-06-21T10:09:20Z

I will dare to ask - should we add a test for this fix?

valentin-krasontovitsch · 2023-06-21T10:22:06Z

we've discussed the testing. at the moment, we are in the unfortunate situation of being pressed for time.

furthermore, the initial problems we were encountering and trying to fix were difficult to encapsulate in a unit test.

now we have an easy way of reproducing, using an integration test (run ert with this config) based on the poly example.

from yesterday's standup / discussions i understood that running that breaking example might bring down the computer it's running on, but i'm willing to take that risk (on a github action runner) ; )

i suggest to look into it after this is merged and the komodo build is under way.

Instead of reconnecting every time we have changes to publish, we reuse connections. Furthermore, the job queue was trying to publish changes, establishing a new connection each time, regardless of whether there were any changes or not, spinning in a loop with a one second sleep. We also move the loop code into its own function, for increased readability. Furthermore, we do not have a maximum reconnect counter with final raising of an exception. The job queue will try to reconnect indefinitely whenever a connection times out, and raise if the socket it's trying to connect to is not open. The rationale behind this is that we should be able to recover from timeouts and the server will become responsive again even if it's busy or a while. We instead log whenever a connection is dropped by the job queue. There is a minor rename of variables. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com> Co-authored-by: Julius Parulek <jparu@equinor.com>

The job queue node monitors the job running on a a single compute node, which is responsible for running all forward models for one realization, and runs in its own thread. It queries the queue system regularly, in between sleeps. We suspect that an increase of realizations, which leads to an increase of threads, starves other threads for resources, for instance the thread of the ensemble evaluator server, which in turn leads to timeouts in connections the server holds. We increase the sleep duration (after an initial start-up phase) and add a random offset to stagger execution of the job threads. Co-authored-by: Håvard Berland <havb@equinor.com>

xjules

Wonderful job @valentin-krasontovitsch and co.! Let's get it in! 🚀