@vishalbollu
Contributor

This reverts the order of operations so that the queue length is checked before the job status, because a pod does not automatically terminate when one of its containers terminates. For a TensorFlow batch job, the pod has both the batch container and the TF Serving container. Even if the batch container exits, the TF Serving container keeps running, so the pod never exits. Since the pod never exits, job.active stays non-zero and the Cortex job remains in the running state.

Revert to checking the length of the queue for now until a better workaround is found.
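
To illustrate why the order matters, here is a minimal sketch in Python. It is not the operator's actual code: the names `K8sJobStatus`, `JobState`, `resolve_queue_first`, and `resolve_status_first` are hypothetical, and the assumption that an empty queue means the work is finished is a simplification of what the description above says.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    # Hypothetical states, for illustration only.
    RUNNING = "running"
    FINISHED = "finished"
    STALLED = "stalled"


@dataclass
class K8sJobStatus:
    # Mirrors the idea of job.active: the number of pods still running.
    active: int = 0


def resolve_queue_first(queue_length: int, k8s: K8sJobStatus) -> JobState:
    """Check the queue before the Kubernetes job status."""
    # Assumption: an empty queue means every batch has been processed,
    # even if a sidecar (e.g. TF Serving) keeps the pod alive.
    if queue_length == 0:
        return JobState.FINISHED
    if k8s.active > 0:
        return JobState.RUNNING
    # Work remains in the queue but no pods are running.
    return JobState.STALLED


def resolve_status_first(queue_length: int, k8s: K8sJobStatus) -> JobState:
    """Check the Kubernetes job status before the queue (the problematic order)."""
    # A pod whose batch container has exited still counts as active while
    # another container in the same pod is running.
    if k8s.active > 0:
        return JobState.RUNNING
    if queue_length == 0:
        return JobState.FINISHED
    return JobState.STALLED


if __name__ == "__main__":
    # Batch container exited, TF Serving container still running, queue drained.
    lingering_pod = K8sJobStatus(active=1)
    print(resolve_queue_first(0, lingering_pod))
    print(resolve_status_first(0, lingering_pod))
```

With a drained queue and one lingering pod, the queue-first check reports the job as finished, while the status-first check keeps reporting it as running.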


checklist:

  • run make test and make lint
  • test manually (i.e. build/push all images, restart operator, and re-deploy APIs)

@vishalbollu vishalbollu requested a review from RobertLucian March 2, 2021 02:04
@vishalbollu vishalbollu merged commit 8ff681e into master Mar 2, 2021
@vishalbollu vishalbollu deleted the check-queue-before-k8s-job branch March 2, 2021 02:19
3 participants