Recover from crashed worker #21

@andig

2025-09-21T08:13:38Z app[908057d9a02328] fra [info][2025-09-21 08:13:38 +0000] [649] [CRITICAL] WORKER TIMEOUT (pid:7180)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info][2025-09-21 08:13:38 +0000] [7180] [ERROR] Error handling request /optimize/charge-schedule
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]Traceback (most recent call last):
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/gunicorn/workers/sync.py", line 135, in handle
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    self.handle_request(listener, req, client, addr)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    respiter = self.wsgi(environ, resp.start_response)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask/app.py", line 1498, in __call__
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    return self.wsgi_app(environ, start_response)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask/app.py", line 1473, in wsgi_app
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    response = self.full_dispatch_request()
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask/app.py", line 880, in full_dispatch_request
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    rv = self.dispatch_request()
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask/app.py", line 865, in dispatch_request
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask_restx/api.py", line 402, in wrapper
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    resp = resource(*args, **kwargs)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask/views.py", line 110, in view
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    return current_app.ensure_sync(self.dispatch_request)(**kwargs)  # type: ignore[no-any-return]
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask_restx/resource.py", line 41, in dispatch_request
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    resp = meth(*args, **kwargs)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/flask_restx/marshalling.py", line 244, in wrapper
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    resp = f(*args, **kwargs)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/app/app.py", line 171, in post
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    result = optimizer.solve()
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/app/optimizer.py", line 276, in solve
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    self.problem.solve(solver)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    ~~~~~~~~~~~~~~~~~~^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/pulp/pulp.py", line 2092, in solve
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    status = solver.actualSolve(self, **kwargs)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/pulp/apis/coin_api.py", line 140, in actualSolve
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    return self.solve_CBC(lp, **kwargs)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/pulp/apis/coin_api.py", line 218, in solve_CBC
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    if cbc.wait() != 0:
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]       ~~~~~~~~^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/subprocess.py", line 1280, in wait
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    return self._wait(timeout=timeout)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]           ~~~~~~~~~~^^^^^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/subprocess.py", line 2066, in _wait
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    (pid, sts) = self._try_wait(0)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]                 ~~~~~~~~~~~~~~^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/subprocess.py", line 2024, in _try_wait
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    (pid, sts) = os.waitpid(self.pid, wait_flags)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]  File "/usr/local/lib/python3.13/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    sys.exit(1)
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]    ~~~~~~~~^^^
2025-09-21T08:13:38Z app[908057d9a02328] fra [info]SystemExit: 1
2025-09-21T08:13:38Z app[908057d9a02328] fra [info][2025-09-21 08:13:38 +0000] [7180] [INFO] Worker exiting (pid: 7180)

About six minutes later, this leads to workers being killed and the proxy finding no healthy instance:

2025-09-21T08:19:25Z app[908057d9a02328] fra [info][2025-09-21 08:19:25 +0000] [649] [ERROR] Worker (pid:7339) was sent SIGKILL! Perhaps out of memory?
2025-09-21T08:19:25Z app[908057d9a02328] fra [info][2025-09-21 08:19:25 +0000] [7364] [INFO] Booting worker with pid: 7364
2025-09-21T08:19:26Z proxy[908057d9a02328] fra [error][PR03] could not find a good candidate within 20 attempts at load balancing. last error: [PR01] no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
2025-09-21T08:19:28Z proxy[908057d9a02328] fra [error][PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2025-09-21T08:19:29Z proxy[908057d9a02328] fra [error][PR04] could not find a good candidate within 20 attempts at load balancing
2025-09-21T08:19:32Z proxy[908057d9a02328] arn [error][PR04] could not find a good candidate within 20 attempts at load balancing
2025-09-21T08:20:02Z app[908057d9a02328] fra [info][2025-09-21 08:20:02 +0000] [649] [CRITICAL] WORKER TIMEOUT (pid:7336)
2025-09-21T08:20:04Z app[908057d9a02328] fra [info][2025-09-21 08:20:04 +0000] [649] [ERROR] Worker (pid:7336) was sent SIGKILL! Perhaps out of memory?
2025-09-21T08:20:04Z app[908057d9a02328] fra [info][2025-09-21 08:20:04 +0000] [7369] [INFO] Booting worker with pid: 7369

After the first worker timeout, something seems to consume all available resources until the machine is exhausted.
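
One possible reading of the traceback, not confirmed by the logs: the worker is aborted while blocked in `cbc.wait()`, so the CBC child process may keep running after the worker exits and continue to hold CPU and memory. The sketch below shows the general pattern of tying a child process's lifetime to its parent with a deadline. `run_with_deadline` is a hypothetical helper, and since PuLP launches CBC itself, this would need adapting rather than dropping in.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd and make sure the child never outlives this call.

    Returns the child's exit code, or None if it had to be killed
    because it exceeded deadline_s seconds.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        return None
    finally:
        # Runs on timeout and also when an exception such as SystemExit
        # (raised by a signal handler like gunicorn's handle_abort)
        # propagates through wait(), so the child is not left orphaned.
        if proc.poll() is None:
            proc.kill()
            proc.wait()  # reap the child to avoid a zombie process


if __name__ == "__main__":
    # Stand-in for a long-running solver; the real command is an assumption.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(slow, deadline_s=1.0))
```

If the solver is the cause, a deadline shorter than the gunicorn worker timeout would let the request fail cleanly instead of the worker being aborted. Depending on the PuLP version, the CBC solver wrapper may also accept its own time limit option, which would be worth checking before wrapping the process manually.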
