Exponential backoff implementation #105

JTSIV1 · 2023-11-27T22:25:48Z

Should prevent short server blips from crashing job/core
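For readers unfamiliar with the technique, here is a minimal sketch of an exponential backoff schedule. The helper name `backoffDelaySeconds` and the base/cap values are assumptions made for illustration and are not taken from this PR's code.

```typescript
// Hypothetical helper: seconds to wait before retry number `attempt` (0-based).
// The delay doubles on every attempt and is capped so it never grows unbounded.
function backoffDelaySeconds(
  attempt: number,
  baseSeconds = 1,
  capSeconds = 60
): number {
  return Math.min(capSeconds, baseSeconds * 2 ** attempt);
}

const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map((n) => backoffDelaySeconds(n));
console.log(schedule); // [1, 2, 4, 8, 16, 32, 60, 60]
```

A short server blip is then absorbed by the first few quick retries, while a long outage is retried at most once per cap interval.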

alexandermichels · 2024-01-19T16:22:29Z

Hey @JTSIV1, great job so far. As we discussed, the issue is that the job doesn't emit the job end event.

From looking over the code, I think what we need to do, in the portion of the code that says "we waited too long give up" (this if: https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR147), is emit the JOB_FAILED event ourselves with a message like "Failed to connect to HPC".

Not sure which of these two syntaxes used elsewhere in the supervisor should be used: (1) self.emitter.registerEvents (https://github.com/cybergis/cybergis-compute-core/blob/v2/src/Supervisor.ts#L68) or (2) this.maintainerMasterEventEmitter.emit(...), as in cybergis-compute-core/src/Supervisor.ts, line 191 in 3247c1c:

this.maintainerMasterEventEmitter.emit("job_end", job.hpc, job.id);

Spend a bit of time looking into the differences between those two, and maybe @zimo-xiao can offer a bit of insight.

If this works, you should also be able to remove the || exit part in this line (https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR195), because emitting the job fail event should flip the job to isEnd.

Something interesting to check with the current code: does your current "job fail" show a job fail event in the database? I'm guessing not because it seems that the Emitter is the one updating that in the DB. So one way to check if this change is working is to pull up the database and go to the events table to look for these JOB_FAILED/JOB_ENDED events.
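The suggestion in this comment (emit a failure event when the backoff gives up, rather than exiting silently) can be sketched roughly as follows. Everything here is hypothetical: the `Job` shape, the `EmitEvent` callback, `connectWithBackoff`, and the 60-second limit are stand-ins for illustration, not the actual Supervisor or Emitter API in cybergis-compute-core.

```typescript
// All names below are hypothetical stand-ins, not the real Supervisor API.
type Job = { id: string; isEnd: boolean };
type EmitEvent = (job: Job, type: string, message: string) => void;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `connect` with doubling waits; once the total wait would exceed
// `maxWaitSeconds`, emit JOB_FAILED and stop instead of crashing.
async function connectWithBackoff(
  job: Job,
  connect: () => Promise<void>,
  emitEvent: EmitEvent,
  maxWaitSeconds = 60,
  sleepFn: (ms: number) => Promise<void> = sleep
): Promise<boolean> {
  let wait = 1; // seconds before the next retry
  let waited = 0; // total seconds spent waiting so far
  while (true) {
    try {
      await connect();
      return true;
    } catch {
      if (waited + wait > maxWaitSeconds) {
        // "We waited too long, give up": report the failure explicitly.
        emitEvent(job, "JOB_FAILED", "Failed to connect to HPC");
        return false;
      }
      await sleepFn(wait * 1000);
      waited += wait;
      wait *= 2;
    }
  }
}

// Demo: a connection that never succeeds and an emitter stand-in that
// records the event and marks the job as ended (assumed behaviour).
const events: string[] = [];
const demoJob: Job = { id: "demo", isEnd: false };
const demoDone = connectWithBackoff(
  demoJob,
  async () => {
    throw new Error("HPC unreachable");
  },
  (job, type, message) => {
    events.push(`${type}: ${message}`);
    job.isEnd = true;
  },
  7,
  async () => {} // skip real sleeping in the demo
);
```

If the real emitter behaves like the stand-in and marks the job as ended, the separate exit check mentioned above would no longer be needed. That is an assumption to verify against the events table.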

zimo-xiao · 2024-01-19T16:41:33Z

src/Supervisor.ts

+        }
+        try {
+          console.log("jdebug ok");
+          await sleep(wait * 1000);


await sleep will block execution here, since the jobs are processed sequentially. A more appropriate method might be to wrap the createMaintainerWorker call at line 113 (this.createMaintainerWorker(job);) within a Promise, use setTimeout as the timer, and interrupt the execution using reject:

new Promise((resolve, reject) => {
  console.log("Function starts");
  setTimeout(() => {
    console.log("setTimeout callback runs");
    reject('Interrupted by setTimeout');
  }, 1000);
  // Simulate some work
  for (let i = 0; i < 5; i++) {
    console.log(`Processing ${i}`);
  }
  resolve('Completed successfully');
});

Also, you can use this to emit a failure event:

self.emitter.registerEvents(
  job,
  "JOB_REGISTERED",
  `job [${job.id}] is registered with the supervisor, waiting for initialization`
);
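As a rough illustration of the Promise-plus-setTimeout idea from this review comment, the sketch below wraps asynchronous work in a timeout. `withTimeout` and `slowWorker` are hypothetical names, and `slowWorker` merely stands in for an asynchronous worker start-up; none of this is taken from Supervisor.ts. Note that a timer can only interrupt work that yields to the event loop: if the wrapped work runs synchronously to completion, the promise settles before the timer callback ever fires.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
    work.then(
      (value) => {
        clearTimeout(timer); // work finished first: cancel the timer
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Stand-in for an asynchronous worker start-up (hypothetical).
const slowWorker = (ms: number) =>
  new Promise<string>((resolve) =>
    setTimeout(() => resolve("worker ready"), ms)
  );

// Finishes well inside its time limit.
const fast = withTimeout(slowWorker(10), 200);
// Exceeds its time limit, so the wrapper rejects; we keep only the message.
const slow = withTimeout(slowWorker(100), 20).catch((e: Error) => e.message);
```

Because nothing here blocks with a long await inside the supervisor loop, other jobs could continue to be processed while a slow connection attempt is pending. The underlying work is not cancelled by the rejection; it is only no longer waited on.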

alexandermichels · 2024-02-15T21:18:24Z

Abandoning this effort in favor of: #108

JTSIV1 added 2 commits November 27, 2023 16:24

Exponential backoff implementation w/ some bugs

5e69ed0

Update Supervisor.ts

a85bae8

alexandermichels mentioned this pull request Dec 13, 2023

[Bug] Compute Crashes if it Can't Connect to HPC #85

Closed

Resolved sleep issue

c997f9d

zimo-xiao reviewed Jan 19, 2024

Working backoff solution

88e3bbb

alexandermichels closed this Feb 15, 2024
