[Bug] Compute Crashes if it Can't Connect to HPC #85

alexandermichels · 2023-02-15T18:29:01Z

It appears that if Core cannot connect to any single HPC when submitting a job, it will crash. Error stack below:

1676474207QdpNk: [event] JOB_QUEUED job [1676474207QdpNk] is queued, waiting for registration
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "POST /job/1676474207QdpNk/submit HTTP/1.1" 200 945 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
1676474207QdpNk: [event] JOB_REGISTERED job [1676474207QdpNk] is registered with the supervisor, waiting for initialization
10.0.2.4 - - [15/Feb/2023:15:16:58 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:17:08 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
Error: Timed out while waiting for handshake
    at Timeout._onTimeout (/job_supervisor/node_modules/ssh2/lib/client.js:695:19)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7)
Error: Not connected
    at Client.exec (/job_supervisor/node_modules/ssh2/lib/client.js:722:11)
    at /job_supervisor/node_modules/node-ssh/lib/cjs/index.js:252:24
    at new Promise (<anonymous>)
    at NodeSSH.execCommand (/job_supervisor/node_modules/node-ssh/lib/cjs/index.js:251:16)
    at Supervisor.<anonymous> (/job_supervisor/production/src/Supervisor.js:179:55)
    at step (/job_supervisor/production/src/Supervisor.js:33:23)
    at Object.throw (/job_supervisor/production/src/Supervisor.js:14:53)
    at rejected (/job_supervisor/production/src/Supervisor.js:6:65)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

We need better error handling for these cases so that a failed connection doesn't bring down Core. Off the top of my head, we could:

Reject the job and tell the user that the HPC could not be reached.
Add exponential backoff to the connection: if we can't connect, we try again in 1 second, then 2 seconds, 4, 8, ... until the job can go through.
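
The two options could be combined. The following is only a rough sketch, not code from Core: `withBackoff`, `backoffDelayMs`, `Result`, and all parameters are hypothetical names. It also assumes a cap on the number of attempts (the suggestion above retries indefinitely), so that the job can be rejected with a readable message once the retries run out instead of an error propagating up and crashing the process.

```typescript
// Hypothetical helper; none of these names come from the Core codebase.
type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: string; attempts: number };

// Delay before the retry that follows failed attempt `attempt` (0-based):
// 1 s, 2 s, 4 s, 8 s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `action` up to `maxAttempts` times, waiting with exponential backoff
// between failures. Errors are caught and returned rather than thrown, so
// the caller can reject the job without an unhandled rejection.
async function withBackoff<T>(
  action: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1000,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<Result<T>> {
  let lastError = "unknown error";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return { ok: true, value: await action() };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await wait(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  return { ok: false, error: lastError, attempts: maxAttempts };
}
```

In the supervisor, the SSH connect and command calls would be passed in as `action`. When the result comes back with `ok: false`, the job could be marked as failed with a message such as "could not connect to HPC after 5 attempts" and reported to the user.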


alexandermichels · 2023-12-13T20:43:06Z

#105 is currently being tested to solve this issue.

alexandermichels · 2024-04-02T19:20:27Z

@JTSIV1 This is solved by #108, correct? Only thing left to do is have the SDK catch the error?

alexandermichels assigned JTSIV1 Apr 2, 2024

alexandermichels closed this as completed Apr 4, 2024
