Resolve timeout problems around running >= 200 realizations #5597
Conversation
Force-pushed from ee3be7e to 2f3e8fc
Codecov Report
@@            Coverage Diff             @@
##             main    #5597      +/-   ##
==========================================
- Coverage   77.07%   77.06%   -0.01%
==========================================
  Files         394      394
  Lines       25784    25797      +13
  Branches     1739     1739
==========================================
+ Hits        19873    19881       +8
- Misses       5335     5340       +5
  Partials      576      576
Force-pushed from 2f3e8fc to 11fe445
Force-pushed from 61a5bb6 to 6efaa73
Force-pushed from 6efaa73 to 53c1e73
The websockets client in the job queue implements a manual backoff, which is no longer necessary with Python > 3.6. We remove the test covering the reconnection attempts of the websocket client used in the job queue: as the code stands now, it is difficult to test the desired behaviour (cf. the reconnect test in the websockets library itself). We postpone this until we have refactored this part of the code into an experiment server. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>
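For illustration, this is roughly the shape of the manual backoff loop being removed; the function name and retry parameters are assumptions for this sketch, not the actual ERT code:

```python
import asyncio

import websockets

# A sketch of a manual reconnect loop with exponential backoff; names
# and limits are illustrative only.
async def connect_with_backoff(uri, max_retries=10, initial_delay=1.0):
    delay = initial_delay
    for _ in range(max_retries):
        try:
            return await websockets.connect(uri)
        except OSError:
            await asyncio.sleep(delay)
            delay = min(delay * 2, 60)  # exponential backoff, capped at 60s
    raise ConnectionError(f"could not connect to {uri}")
```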
The dispatch message batcher (part of the ensemble evaluator) was measuring and logging the duration of message processing before the respective handlers had actually run. We move the (final part of the measurement and the) logging into a done callback that is executed after processing of a batch is complete.
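A minimal sketch of the done-callback approach, assuming an asyncio event loop; `_handle_event` and `submit_batch` are hypothetical names, not the ERT API:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

async def _handle_event(event):
    ...  # stand-in for the real per-message handler

async def _process_batch(batch):
    for event in batch:
        await _handle_event(event)  # dispatch to the respective handler

def _log_batch_done(start_time, task):
    # Runs only once the batch has actually been processed.
    logger.debug("batch processed in %.2fs", time.monotonic() - start_time)

def submit_batch(loop, batch):
    task = loop.create_task(_process_batch(batch))
    task.add_done_callback(functools.partial(_log_batch_done, time.monotonic()))
```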
We are experiencing many timeouts in clients talking to the websocket server running as part of the ensemble evaluator. This affects both the job queue, as well as the job runner scripts connecting from the compute nodes. We have observed that the ensemble evaluator (ee) has to handle many events coming from the job runner scripts, which can take more than 10 seconds, which blocks the ee and stops it from taking care of its client connections (in particular receiving, sending, and responding to ping pongs, which are part of websockets). We thus increase the ping pong timeouts and intervals on both clients and server, and increase the open connection timeout for clients, as both situations have been observed to cause timeouts. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com>
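A sketch of how such settings are passed to the websockets library; the numeric values here are illustrative assumptions, not the values chosen in the PR:

```python
import asyncio

import websockets

PING_INTERVAL = 60  # seconds between keepalive pings (websockets defaults to 20)
PING_TIMEOUT = 60   # seconds to wait for the pong before closing the connection
OPEN_TIMEOUT = 60   # seconds a client waits for the opening handshake

async def connect_client(uri):
    # Client side: relax the keepalive and allow a slow handshake.
    return await asyncio.wait_for(
        websockets.connect(uri, ping_interval=PING_INTERVAL, ping_timeout=PING_TIMEOUT),
        timeout=OPEN_TIMEOUT,
    )

async def start_server(handler, host, port):
    # Server side: matching keepalive settings, so a busy event loop does
    # not drop clients on missed pongs.
    return await websockets.serve(
        handler, host, port, ping_interval=PING_INTERVAL, ping_timeout=PING_TIMEOUT
    )
```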
Force-pushed from ab3b2df to c6a04ee
I will dare to ask: should we add a test for this fix?
We've discussed the testing. At the moment, we are in the unfortunate situation of being pressed for time. Furthermore, the initial problems we were encountering and trying to fix were difficult to encapsulate in a unit test. Now we have an easy way of reproducing, using an integration test (run ert with this config) based on the poly example. From yesterday's standup / discussions I understood that running that breaking example might bring down the computer it's running on, but I'm willing to take that risk (on a GitHub Actions runner) ; ) I suggest looking into it after this is merged and the komodo build is under way.
Force-pushed from 52fc095 to 3b67331
Instead of reconnecting every time we have changes to publish, we reuse connections. Previously, the job queue was trying to publish changes, establishing a new connection each time, regardless of whether there were any changes, spinning in a loop with a one-second sleep. We also move the loop code into its own function for increased readability. Furthermore, we no longer have a maximum reconnect counter that eventually raises an exception: the job queue will try to reconnect indefinitely whenever a connection times out, and raise if the socket it is trying to connect to is not open. The rationale behind this is that we should be able to recover from timeouts, since the server will become responsive again even if it's busy for a while. Instead, we log whenever a connection is dropped by the job queue. There is a minor rename of variables. Co-authored-by: Anna Kvashchuk <kvashchuk.anna@gmail.com> Co-authored-by: Julius Parulek <jparu@equinor.com>
Force-pushed from 3b67331 to 1a9e4c8
The job queue node monitors the job running on a single compute node, which is responsible for running all forward models for one realization, and runs in its own thread. It queries the queue system regularly, in between sleeps. We suspect that an increase in realizations, which leads to an increase in threads, starves other threads for resources, for instance the thread of the ensemble evaluator server, which in turn leads to timeouts in the connections the server holds. We increase the sleep duration (after an initial start-up phase) and add a random offset to stagger execution of the job threads. Co-authored-by: Håvard Berland <havb@equinor.com>
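A minimal sketch of such staggered polling; the poll intervals, the `job` interface, and `poll_loop` are assumptions for illustration:

```python
import random
import time

SHORT_POLL = 1.0  # seconds, during the initial start-up phase
LONG_POLL = 5.0   # seconds, once the realization is up and running

def poll_loop(job, startup_polls=10):
    # Each job thread gets its own random offset, so the threads do not
    # all wake up and query the queue system at the same instant.
    time.sleep(random.uniform(0.0, LONG_POLL))
    polls = 0
    while not job.is_done():
        job.query_queue_system()
        polls += 1
        time.sleep(SHORT_POLL if polls < startup_polls else LONG_POLL)
```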
Force-pushed from 1a9e4c8 to b12c482
Wonderful job @valentin-krasontovitsch and co.! Let's get it in! 🚀
Issue
Resolves #5273
Approach
The failing symptoms occur in components communicating over websockets: the ensemble evaluator as a server, and the job queue, GUI/CLI, and job dispatch scripts as clients.
There are several improvements around the communication between those components.
The main (suspected) fix is backing off the so-called JobQueueNode busy business, by letting it sleep for a longer amount of time, instead of one second, between each query of the queue system.
The rationale behind this is that the JobQueueNode keeps track of the state of a whole realization at coarse resolution (not considering forward models). For long-running realizations (>= 1h), it is quite unnecessary to query the state every second, as it does not change as long as the realization is running.
Given that each JobQueueNode runs in its own thread, @berland had the idea that this might starve components running in other threads for resources, which was somewhat confirmed by looking at timing logs (how long some processing took) and their respective timestamps.
Pre review checklist
Adding labels helps the maintainers when writing release notes; see the list of release note labels.