Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential backoff implementation #105

Closed
wants to merge 4 commits into from
Closed

Conversation

JTSIV1
Copy link
Contributor

@JTSIV1 JTSIV1 commented Nov 27, 2023

  • Should prevent short server blips from crashing job/core

@alexandermichels
Copy link
Member

Hey @JTSIV1 , great job so far. We discussed the issue being that the job doesn't emit the job end event.

From looking over the code, I think maybe what we need to do is in the portion of the code that says "we waited too long give up" (this if: https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR147) is to emit the JOB_FAILED event ourselves with a message like "Failed to connect to HPC".

Not sure which of these two syntaxes used elsewhere in the supervisor should be used: (1) self.emitter.registerEvents (https://github.com/cybergis/cybergis-compute-core/blob/v2/src/Supervisor.ts#L68) or (2) this.maintainerMasterEventEmitter.emit(...) (

this.maintainerMasterEventEmitter.emit("job_end", job.hpc, job.id);
). Spend a bit of time looking into the differences between those two and maybe @zimo-xiao can offer a bit of insight.

If this works you should also be able to remove the || exit part in this line (https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR195) because the job fail event being emitted should flip the job to isEnd.

Something interesting to check with the current code: does your current "job fail" show a job fail event in the database? I'm guessing not because it seems that the Emitter is the one updating that in the DB. So one way to check if this change is working is to pull up the database and go to the events table to look for these JOB_FAILED/JOB_ENDED events.

}
try {
console.log("jdebug ok");
await sleep(wait * 1000);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await sleep will block the execution of the job as they are executed sequentially. A more appropriate method might be to wrap the createMaintianerWorker at line 113 this.createMaintainerWorker(job); within a Promise, use setTimeout as the timer, and interrupt the execution using reject:

   new Promise((resolve, reject) => {
        console.log("Function starts");

        setTimeout(() => {
            console.log("setTimeout callback runs");
            reject('Interrupted by setTimeout');
        }, 1000);

        // Simulate some work
        for (let i = 0; i < 5; i++) {
            console.log(`Processing ${i}`);
        }

        resolve('Completed successfully');
    });

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you can use this to emit a failure event:

          self.emitter.registerEvents(
            job,
            "JOB_REGISTERED",
            `job [${job.id}] is registered with the supervisor, waiting for initialization`
          );

@alexandermichels
Copy link
Member

Abandoning this effort in favor of: #108

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants