Bugfix: Add image building job deletion delay #345

@deadlycoconuts deadlycoconuts commented Jun 15, 2023

Context

Currently, when the Turing API server triggers an image building job as part of launching a batch ensembling job/service, it first checks the cluster for an existing image building job with the same name. The result of that check corresponds to one of the following situations:

  • Found; running -> an existing image building job is already taking place and still running -> wait for the job to complete
  • Found; failed -> an existing image building job has failed -> delete the old job -> begin a new one
  • Not found -> begin a new one
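
The three cases above can be summarised as a small decision function. This is only an illustrative sketch: the `JobStatus`, `Action`, and `nextAction` names are hypothetical and are not taken from the Turing codebase, which inspects the actual k8s job status instead.

```go
package main

// JobStatus is a hypothetical simplification of the three situations
// listed above; it is not a type from the Turing codebase.
type JobStatus int

const (
	JobNotFound JobStatus = iota // no job with this name exists in the cluster
	JobRunning                   // an existing job is still running
	JobFailed                    // an existing job has failed
)

// Action is a hypothetical label for what the API server does next.
type Action string

const (
	ActionWait            Action = "wait for the existing job to complete"
	ActionDeleteAndCreate Action = "delete the old job, then begin a new one"
	ActionCreate          Action = "begin a new one"
)

// nextAction maps the status of the job that was looked up to the next step.
func nextAction(status JobStatus) Action {
	switch status {
	case JobRunning:
		return ActionWait
	case JobFailed:
		return ActionDeleteAndCreate
	default:
		// JobNotFound: nothing to wait for or clean up.
		return ActionCreate
	}
}
```

Only the `JobFailed` branch involves a delete followed immediately by a create, which is the sequence this PR is concerned with.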

In particular, we're interested in the second scenario, where a new job is triggered right after the deletion of the old (failed) job (see this for more details):

// Delete the old (failed) image building job.
err = ib.clusterController.DeleteJob(context.Background(), ib.imageBuildingConfig.BuildNamespace, job.Name)
if err != nil {
	log.Errorf("error deleting job: %v", err)
	return "", ErrDeleteFailedJob
}

// Create the replacement job right away, without waiting for the deletion to finish.
job, err = ib.createKanikoJob(kanikoJobName, imageRef, request.ArtifactURI, request.BuildLabels,
	request.EnsemblerFolder, request.BaseImageRefTag)
if err != nil {
	log.Errorf("unable to build image %s, error: %v", imageRef, err)
	return "", ErrUnableToBuildImage
}

A problem can occur when the new job is created too soon after the old one is deleted: the k8s resources have not yet had time to be removed from the cluster, so the kube API server rejects the creation request with an error saying that a job with the very same name already exists. When this happens, the (new) image building job fails immediately.

This PR thus introduces a new method to wait for the deletion to be complete before proceeding with the creation of another image builder job.

Main Modifications

  • api/turing/imagebuilder/imagebuilder.go - Added a helper method that waits for image building jobs to be completely removed from the cluster

@deadlycoconuts deadlycoconuts self-assigned this Jun 15, 2023
@deadlycoconuts deadlycoconuts force-pushed the add_image_building_job_deletion_delay branch from 426736c to cc53f2e Compare June 15, 2023 06:53
@deadlycoconuts deadlycoconuts added the type: bug Something isn't working label Jun 16, 2023
@deadlycoconuts deadlycoconuts marked this pull request as ready for review June 16, 2023 05:06
@deadlycoconuts deadlycoconuts changed the title Add image building job deletion delay Bugfix: Add image building job deletion delay Jun 16, 2023

@krithika369 krithika369 left a comment

LGTM. Thanks, @deadlycoconuts !

@deadlycoconuts deadlycoconuts merged commit e8989c7 into caraml-dev:main Jun 19, 2023
12 checks passed
@deadlycoconuts deadlycoconuts deleted the add_image_building_job_deletion_delay branch July 18, 2023 04:30