Closed
Description
We have been encountering an odd issue where submissions fail, but the frontend is not updated and stays stuck in the scoring phase.
Sometimes, when a submission is transferred to our workers, the worker (i.e. not our scoring program, but some part of the worker code) crashes, and does so in a way that is not visible on the frontend (i.e. the submission status is not updated). This leads to really confusing situations where participants think that their submissions are just scoring for a very long time, when in reality they are not even running.
Below is the worker-side log for the whole event, with some information redacted (I replaced some bits with <...>).
compute_worker | [2025-04-11 07:51:05,910: INFO/MainProcess] Received task: <...>
compute_worker | [2025-04-11 07:51:05,911: INFO/ForkPoolWorker-1] Received run arguments: {<...>}
compute_worker | [2025-04-11 07:51:05,911: INFO/ForkPoolWorker-1] Checking if cache directory needs to be pruned...
compute_worker | [2025-04-11 07:51:05,912: INFO/ForkPoolWorker-1] Cache directory does not need to be pruned!
compute_worker | [2025-04-11 07:51:05,912: INFO/ForkPoolWorker-1] Getting bundle <...> to unpack @ program
compute_worker | [2025-04-11 07:51:06,026: INFO/ForkPoolWorker-1] Getting bundle <...> to unpack @ input/ref
compute_worker | [2025-04-11 07:51:06,235: INFO/ForkPoolWorker-1] Getting bundle <...> to unpack @ input/res
compute_worker | [2025-04-11 07:51:06,325: INFO/ForkPoolWorker-1] Failed. Retrying in 60 seconds...
compute_worker | [2025-04-11 07:54:15,797: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmpo45waxce
compute_worker | [2025-04-11 07:54:15,807: ERROR/ForkPoolWorker-1] Task <...> raised unexpected: URLError(TimeoutError(110, 'Connection timed out'))
compute_worker | Traceback (most recent call last):
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 1346, in do_open
compute_worker | h.request(req.get_method(), req.selector, req.data, headers,
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
compute_worker | self._send_request(method, url, body, headers, encode_chunked)
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
compute_worker | self.endheaders(body, encode_chunked=encode_chunked)
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
compute_worker | self._send_output(message_body, encode_chunked=encode_chunked)
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
compute_worker | self.send(msg)
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 980, in send
compute_worker | self.connect()
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 1447, in connect
compute_worker | super().connect()
compute_worker | File "/usr/local/lib/python3.9/http/client.py", line 946, in connect
compute_worker | self.sock = self._create_connection(
compute_worker | File "/usr/local/lib/python3.9/socket.py", line 844, in create_connection
compute_worker | raise err
compute_worker | File "/usr/local/lib/python3.9/socket.py", line 832, in create_connection
compute_worker | sock.connect(sa)
compute_worker | TimeoutError: [Errno 110] Connection timed out
compute_worker |
compute_worker | During handling of the above exception, another exception occurred:
compute_worker |
compute_worker | Traceback (most recent call last):
compute_worker | File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
compute_worker | R = retval = fun(*args, **kwargs)
compute_worker | File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
compute_worker | return self.run(*args, **kwargs)
compute_worker | File "/compute_worker.py", line 112, in run_wrapper
compute_worker | run.prepare()
compute_worker | File "/compute_worker.py", line 787, in prepare
compute_worker | zip_file = self._get_bundle(url, path, cache=cache_this_bundle)
compute_worker | File "/compute_worker.py", line 451, in _get_bundle
compute_worker | urlretrieve(url, bundle_file)
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 239, in urlretrieve
compute_worker | with contextlib.closing(urlopen(url, data)) as fp:
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 214, in urlopen
compute_worker | return opener.open(url, data, timeout)
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 517, in open
compute_worker | response = self._open(req, data)
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 534, in _open
compute_worker | result = self._call_chain(self.handle_open, protocol, protocol +
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 494, in _call_chain
compute_worker | result = func(*args)
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 1389, in https_open
compute_worker | return self.do_open(http.client.HTTPSConnection, req,
compute_worker | File "/usr/local/lib/python3.9/urllib/request.py", line 1349, in do_open
compute_worker | raise URLError(err)
compute_worker | urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
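To make the failure mode concrete, here is a minimal sketch of the kind of handling that might keep the frontend in sync. It is not the actual compute_worker.py code: `download_with_retries`, `run_with_status_reporting`, and the `report_status` callback are hypothetical names, and the `"Failed"` status string is an assumption. The idea is that the bundle download retries a bounded number of times with an explicit timeout, and any exception raised during preparation is reported before it propagates to Celery.

```python
import shutil
import time
import urllib.request


def download_with_retries(url, dest, attempts=3, delay=60, timeout=30,
                          opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url to dest, retrying on I/O errors; re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # An explicit timeout bounds how long a stalled connection can block.
            with opener(url, timeout=timeout) as response, open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
            return dest
        except OSError as error:
            # URLError and socket timeouts are both subclasses of OSError.
            last_error = error
            if attempt < attempts:
                sleep(delay)
    raise last_error


def run_with_status_reporting(prepare, report_status):
    """Run prepare(); on any exception, report the failure, then re-raise."""
    try:
        prepare()
    except Exception as error:
        # report_status is a hypothetical callback that would tell the
        # frontend the submission failed instead of leaving it "scoring".
        report_status("Failed", f"Worker error during preparation: {error!r}")
        raise
```

In the log above, a single "Failed. Retrying in 60 seconds..." message is followed about three minutes later by the uncaught `URLError`, with no status update logged in between. Wrapping the `run.prepare()` call along these lines would be one way to surface that error to participants.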