Bugfix: Add image building job deletion delay #345
Merged
Context
Currently, when the Turing API server triggers an image building job as part of launching a batch ensembling job/service, it first checks the cluster for an existing image building job with the same name. If one is found, its status corresponds to one of several situations.
In particular, we're interested in the second scenario, where a new job is triggered right after the deletion of the old (failed) run (see this for more details).
A problem can occur when the new job is created too soon after the old job is deleted: the k8s resources have not yet had enough time to be removed from the cluster, so the kube API server returns an error saying that the creation of the new job has failed because another job with the very same name already exists. When this happens, the (new) image building job fails immediately.
This PR thus introduces a new method to wait for the deletion to be complete before proceeding with the creation of another image builder job.
Main Modifications
api/turing/imagebuilder/imagebuilder.go
- Addition of a helper method that waits for image building jobs to be completely removed from the cluster