Check sqs queue length before checking job status #1925

vishalbollu · 2021-03-02T02:04:11Z

Reverting the order of operations to check the queue length before the job status, because a pod doesn't terminate automatically when only one of its containers terminates. For a TensorFlow batch job, a pod has the batch container and the TF Serving container. Even after the batch container exits, the TF Serving container keeps running, so the pod never exits. Since the pod never exits, job.active stays non-zero and the cortex job remains in the running state.

Revert to checking the length of the queue for now until a better workaround is found.
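
A minimal sketch of the ordering described above, for illustration only. It is not the code in cron.go: the types `QueueMetrics` and `JobStatus`, the function `nextAction`, and the three actions are hypothetical names, and the handling of the "no active pods but messages remain" case is an assumption.

```go
package main

import "fmt"

// QueueMetrics is a hypothetical summary of the job's SQS queue.
type QueueMetrics struct {
	Visible    int // messages waiting to be received
	NotVisible int // messages in flight (received but not yet deleted)
}

// IsEmpty reports whether no messages are waiting or in flight.
func (q QueueMetrics) IsEmpty() bool {
	return q.Visible == 0 && q.NotVisible == 0
}

// JobStatus is a hypothetical stand-in for the pod counters
// reported in a Kubernetes Job's status.
type JobStatus struct {
	Active    int
	Succeeded int
	Failed    int
}

// Action is what the (hypothetical) cron decides to do on this tick.
type Action string

const (
	ActionWait        Action = "wait"        // work remains and workers are alive
	ActionFinalize    Action = "finalize"    // queue drained, wrap up the job
	ActionInvestigate Action = "investigate" // work remains but no worker is active
)

// nextAction checks the queue length first and only then the job status.
// With a sidecar such as TF Serving still running, Active can stay
// non-zero after all batches are processed, so an empty queue wins.
func nextAction(q QueueMetrics, s JobStatus) Action {
	if q.IsEmpty() {
		return ActionFinalize
	}
	if s.Active == 0 {
		// Assumption: messages left with no active pods needs attention.
		return ActionInvestigate
	}
	return ActionWait
}

func main() {
	// Queue drained, but the pod is still counted as active
	// because the serving container never exited.
	drained := nextAction(QueueMetrics{}, JobStatus{Active: 1})
	fmt.Println(drained)

	// Messages remain and a worker is still active.
	busy := nextAction(QueueMetrics{Visible: 3}, JobStatus{Active: 1})
	fmt.Println(busy)
}
```

Checking job.active first would report the drained case as still running, which is the behavior this PR reverts.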

checklist:

run make test and make lint
test manually (i.e. build/push all images, restart operator, and re-deploy APIs)

Update cron.go

4126a0d

vishalbollu requested a review from RobertLucian March 2, 2021 02:04

RobertLucian approved these changes Mar 2, 2021


vishalbollu merged commit 8ff681e into master Mar 2, 2021

vishalbollu deleted the check-queue-before-k8s-job branch March 2, 2021 02:19
