k8s glue: clearml_agent: ERROR: Execution required enqueued task #56

Closed
Shaked opened this issue Apr 8, 2021 · 12 comments

Shaked commented Apr 8, 2021

While using the k8s glue, there seems to be some weird behaviour which ends up with the following error:

clearml_agent: ERROR: Execution required enqueued task, but task id=ba22288b727d44a3bc49f44d158b8b2c is not queued.

Steps to reproduce:

  • Run a task:
     from clearml import Task

     # project_name, model_name and custom_image are defined elsewhere
     task = Task.init(project_name=project_name, task_name=model_name)
     trains_logger = Task.current_task().get_logger()
     task.set_base_docker(custom_image)
     task.execute_remotely(queue_name='shakedtest')
  • Make sure that the above task fails for some reason, e.g. a missing dependency:

ERROR: Could not find a version that satisfies the requirement trains==0.16.2rc3 (from -r /tmp/cached-reqsdvw6mmq0.txt (line 124)) (from versions: 0.9.0rc2, 0.9.0rc3, 0.9.0rc4, 0.9.0, 0.9.1rc0, 0.9.1rc1, 0.9.1, 0.9.2rc0, 0.9.2, 0.9.3rc0, 0.9.3rc1, 0.9.3rc3, 0.9.3, 0.9.3.1, 0.9.4rc1, 0.10.0rc2, 0.10.0, 0.10.1rc0, 0.10.1, 0.10.2rc0, 0.10.2rc1, 0.10.2rc2, 0.10.2, 0.10.3rc0, 0.10.3rc1, 0.10.3rc2, 0.10.3rc3, 0.10.3rc4, 0.10.3, 0.10.4rc1, 0.10.4rc2, 0.10.4rc3, 0.10.4, 0.10.5rc0, 0.10.5rc1, 0.10.5rc2, 0.10.5, 0.10.6, 0.10.7rc3, 0.10.7rc4, 0.10.7rc5, 0.10.7rc6, 0.10.7rc7, 0.10.7, 0.11.0rc1, 0.11.0rc2, 0.11.0rc3, 0.11.0, 0.11.1rc1, 0.11.1rc2, 0.11.1rc3, 0.11.1rc4, 0.11.1rc5, 0.11.1rc6, 0.11.1rc7, 0.11.1, 0.11.2, 0.11.3rc0, 0.11.3, 0.12.0rc0, 0.12.0rc1, 0.12.0rc2, 0.12.0rc3, 0.12.0rc4, 0.12.0, 0.12.1rc0, 0.12.1rc1, 0.12.1rc3, 0.12.1rc4, 0.12.1, 0.12.2rc0, 0.12.2rc1, 0.12.2rc2, 0.12.2rc4, 0.12.2, 0.12.3rc0, 0.13.0rc0, 0.13.0rc2, 0.13.0, 0.13.1rc0, 0.13.1rc1, 0.13.1rc2, 0.13.1, 0.13.2rc0, 0.13.2rc1, 0.13.2rc2, 0.13.2rc3, 0.13.2, 0.13.3, 0.13.4rc0, 0.13.4rc1, 0.14.0, 0.14.1rc0, 0.14.1rc1, 0.14.1rc2, 0.14.1, 0.14.2rc0, 0.14.2, 0.14.3rc0, 0.14.3, 0.15.0rc0, 0.15.0, 0.15.1rc0, 0.15.1, 0.15.2rc0, 0.16.0, 0.16.1, 0.16.2rc0, 0.16.2rc1, 0.16.2rc2, 0.16.2, 0.16.3rc0, 0.16.3, 0.16.4rc0, 0.16.4)
ERROR: No matching distribution found for trains==0.16.2rc3 (from -r /tmp/cached-reqsdvw6mmq0.txt (line 124))

Expected

There are two options here:

  1. Stop the task, stop the pod that just failed (Error state) and make sure it doesn't run again.
  2. Do not stop the task and allow the pod to recover so that it won't end up with the mentioned error.

Actual

The task is marked as failed in the clearml/trains UI. However, the agent keeps trying to run the pod that was originally started for the task, and it keeps failing in an endless loop with the following error message:

Execution required enqueued task, but task id= is not queued.

Unfortunately it is not possible to change the restartPolicy of the pod:

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.

Therefore I believe that this should be handled by the agent itself, i.e. if the task fails, the agent should act accordingly.


bmartinn commented Apr 8, 2021

Hi @Shaked

Let me see if I understand you correctly, it seems the Task failed and the pod restarts, is that correct?

It seems the main issue is the restart policy of the pod; basically it should be none (if the pod ended for any reason, leave it as is).

Could it be that it was set incorrectly in the template yaml provided to the k8s glue?
Are you saying the new k8s restart policy cannot be set to None any longer? Is this an issue with a specific Kubernetes distro/version?

Regarding the second error Execution required enqueued task, but task id= is not queued.

The idea is that you will be able to abort a job from clearml even if it is pending in the k8s scheduler.
To implement this feature (and create visibility), when the k8s-glue pulls a Task from the original queue, it pushes it into a temporary k8s_scheduler queue. If a user wants to abort the execution of that Task before it actually starts, they can dequeue the Task from the k8s_scheduler queue. Then, when k8s actually launches the Task, the agent checks whether it is still queued (i.e. in the k8s_scheduler queue); if it isn't, it simply aborts with the error "... task id= is not queued."
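
A minimal sketch of that hand-off, using plain in-memory queues (the queue objects and helper names here are illustrative only, not the glue's actual code or the ClearML API):

    from collections import deque

    # Tasks pushed by users into the user-facing queue
    original_queue = deque(["ba22288b727d44a3bc49f44d158b8b2c"])
    # Temporary queue owned by the k8s-glue while the pod is pending
    k8s_scheduler = deque()

    def glue_pull_and_schedule():
        # The glue moves the task into the temporary queue so it stays
        # visible (and abortable) while waiting for k8s to start the pod.
        task_id = original_queue.popleft()
        k8s_scheduler.append(task_id)
        return task_id

    def agent_start_in_pod(task_id):
        # When the pod finally runs, the agent checks the task is still
        # queued; a user abort (dequeue) before this point fails the check.
        if task_id not in k8s_scheduler:
            raise RuntimeError(
                "clearml_agent: ERROR: Execution required enqueued task, "
                "but task id={} is not queued.".format(task_id)
            )
        k8s_scheduler.remove(task_id)
        # ... actual task execution would start here ...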


Shaked commented Apr 9, 2021

Let me see if I understand you correctly, it seems the Task failed and the pod restarts, is that correct?

Correct

It seems the main issue is the restart policy of the pod; basically it should be none (if the pod ended for any reason, leave it as is).

This is not possible by k8s design:

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template

Could it be that it was set incorrectly in the template yaml provided to the k8s glue?

Nope

The idea is that you will be able to abort a job from clearml even if it is pending in the k8s scheduler.
To implement this feature (and create visibility), when the k8s-glue pulls a Task from the original queue, it pushes it into a temporary k8s_scheduler queue. If a user wants to abort the execution of that Task before it actually starts, they can dequeue the Task from the k8s_scheduler queue. Then, when k8s actually launches the Task, the agent checks whether it is still queued (i.e. in the k8s_scheduler queue); if it isn't, it simply aborts with the error "... task id= is not queued."

I understand that part; however, in this case the user didn't want to abort the execution of the task, it simply failed while the agent kept trying to run the worker pod. This ends in a loop of restarts: the first attempt fails for the original reason the task failed, and all subsequent attempts fail with Execution required enqueued task, but task id= is not queued., since the task has already failed and there is no queued task.

In order to make this work, the agent should create a k8s Kind: Job instead of a Pod, which would allow control of the restart policy:

https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-template

Only a RestartPolicy equal to Never or OnFailure is allowed.
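
For illustration only, a minimal batch/v1 Job template expressed as the same kind of Python dict the glue manipulates could look like this (the name, image and command values are placeholders, not the glue's actual output):

    job_template = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "clearml-id-<task_id>"},
        "spec": {
            # never retry a failed pod at the Job level
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs only accept Never or OnFailure here
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "clearml-task",
                        "image": "<custom_image>",
                        "command": ["/bin/sh", "-c", "<glue-generated agent command>"],
                    }],
                },
            },
        },
    }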

allegroai-git pushed a commit that referenced this issue Apr 12, 2021
bmartinn commented

Hi @Shaked
Quick update, a fix was pushed to the master, which should solve the restart issue.
Basically if the pod started running, it should always return exit code 0 (i.e. success).
Please feel free to test :)

This should solve any restart-policy issue, since from the k8s perspective the pod fully executed itself.
This does not mean that the Task did not fail; if it did, you will be able to see it in the UI. But from the k8s "Job scheduling" aspect it was successful, as it started the container and everything worked as expected.
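
A rough sketch of that idea (the inner command is a placeholder, not the glue's actual generated command):

    # The container command is wrapped so the pod always exits 0, regardless
    # of the task's real exit status. The task's failure is still visible in
    # the ClearML UI, but k8s sees a successful pod and has no reason to
    # restart it.
    task_cmd = "<glue-generated command that runs the task>"
    container_command = ["/bin/sh", "-c", "{} ; exit 0".format(task_cmd)]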

In order to make this work, the agent should create a k8s Kind: Job instead of a Pod, which would allow control of the restart policy:

This is a good idea; in a way this really is a "Job", not a service. I'm trying to find when exactly Kind: Job was introduced into Kubernetes; if the feature is too new, I would not want to default to it and break older deployments.
It seems it is already supported in Kubernetes v1.17 and OpenShift v4.1, though I'm not sure these are the first versions that introduced it ...
wdyt?


Shaked commented Apr 25, 2021

@bmartinn

Hi @Shaked
Quick update, a fix was pushed to the master, which should solve the restart issue.
Basically if the pod started running, it should always return exit code 0 (i.e. success).
Please feel free to test :)

This should solve any restart-policy issue, since from the k8s perspective the pod fully executed itself.
This does not mean that the Task did not fail; if it did, you will be able to see it in the UI. But from the k8s "Job scheduling" aspect it was successful, as it started the container and everything worked as expected.

I'm currently running the latest version of clearml-agent from master, which includes these latest changes:

root@clearml-agent-98c78db49-ph87t:/usr/local/lib/python3.9/site-packages/clearml_agent# sed -n '35,0p' glue/k8s.py
    KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"
root@clearml-agent-98c78db49-ph87t:/usr/local/lib/python3.9/site-packages/clearml_agent# sed -n '339,0p' glue/k8s.py
            if self.max_pods_limit:

and the failed pod keeps on trying to connect while the clearml task has failed:

(screenshot: shakedkleinorbach@Shakeds-MBP terminal, 2021-04-25 15-28-52)

The first time it failed was for something internal:

ModuleNotFoundError: No module named 'compute_overlap'
Running task 'bc9bf8e753174a249b8d6ac25661676c'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.i8hxne10.txt', '/tmp/.clearml_agent_out.i8hxne10.txt'
DONE: Running task 'bc9bf8e753174a249b8d6ac25661676c', exit status 1

But then the pod was marked as Completed yet kept on running:

You should consider upgrading via the '/usr/bin/python3.6 -m pip install --upgrade pip' command.
clearml_agent: ERROR: Execution required enqueued task, but task id=bc9bf8e753174a249b8d6ac25661676c is not queued.

The clearml task has failed of course:

(screenshot: Log 2021-04-25 15-30-49)

This is a good idea; in a way this really is a "Job", not a service. I'm trying to find when exactly Kind: Job was introduced into Kubernetes; if the feature is too new, I would not want to default to it and break older deployments.
It seems it is already supported in Kubernetes v1.17 and OpenShift v4.1, though I'm not sure these are the first versions that introduced it ...
wdyt?

It seems like Job has been around for some time now. This is an archived static file from the k8s website: https://raw.githubusercontent.com/kubernetes/website/release-1.10/static/docs/reference/generated/kubernetes-api/v1.10/index.html

I was only able to find this cached page: http://web.archive.org/web/20201129194804/https://v1-15.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

and in the k8s Slack channel they even say that Job was introduced somewhere around 1.5: https://kubernetes.slack.com/archives/C09NXKJKA/p1619353295486400?thread_ts=1619349773.473000&cid=C09NXKJKA

Job v1 in 1.10 : https://raw.githubusercontent.com/kubernetes/website/snapshot-final-v1.14/static/docs/reference/generated/kubernetes-api/v1.10/index.html

Job v1 in 1.5 and mention of v1beta1 v2alpha1 in 1.5 doc (edited)

I guess at first it shouldn't be set as the default, as it might break older deployments and confuse your users. I'd suggest using the k8s way of making these types of changes: first announce it without making it the default, and let users know that it will become the default after version X.Y.

Shaked

bmartinn commented

and the failed pod keeps on trying to connect while the clearml task has failed:

Can you verify that the pod cmd ended with ; exit 0? This means the pod will always be successful, hence it will never restart (that's before we change to Jobs, see below).

It seems like Job has been around for some time now. This is an archived static file from the k8s website ....

I'm in, let's have it as a flag; it probably makes sense not to have it as the default :)


Shaked commented Apr 25, 2021

Can you verify that the pod cmd ended with ; exit 0? This means the pod will always be successful, hence it will never restart (that's before we change to Jobs, see below).

Yes, it ended with exit 0:

    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0

I think that otherwise it would not have set the pod reason to Completed.

This also happens when I abort a clearml task, not only when it fails.

I'm in, let's have it as a flag; it probably makes sense not to have it as the default :)

+1


bmartinn commented Apr 25, 2021

Wait, now I'm confused: if it says Reason: Completed, why did it restart the pod?
In other words, what's the difference between a completed run and an aborted/failed one from the k8s perspective? (Basically there shouldn't be any difference, and in all cases it should not restart.)


Shaked commented Apr 25, 2021

I think that you have to explicitly add restartPolicy: Never to the pod, otherwise:

Always means that the container will be restarted even if it exited with a zero exit code (i.e. successfully). This is useful when you don't care why the container exited, you just want to make sure that it is always running (e.g. a web server). This is the default.

OnFailure means that the container will only be restarted if it exited with a non-zero exit code (i.e. something went wrong). This is useful when you want to accomplish a certain task with the pod, and ensure that it completes successfully - if it doesn't it will be restarted until it does.

Never means that the container will not be restarted regardless of why it exited.

https://stackoverflow.com/questions/40530946/what-is-the-difference-between-always-and-on-failure-for-kubernetes-restart-poli

The spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.

The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
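
For reference, a small sketch of the back-off schedule described above (doubling from 10s, capped at five minutes):

    delay, cap, schedule = 10, 300, []
    for _ in range(8):
        schedule.append(min(delay, cap))
        delay *= 2
    print(schedule)  # [10, 20, 40, 80, 160, 300, 300, 300]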


Shaked commented Apr 25, 2021

I guess it should be added in glue/k8s.py, somewhere around template['spec'].setdefault('containers', []):

...
        template.setdefault('spec', {})
        template['spec'].setdefault('containers', [])
        template['spec']['restartPolicy'] = 'Never'  # <------- this line
        if labels:
...


Shaked commented Apr 26, 2021

@bmartinn I just tried to add restartPolicy: Never to the pod template and it worked for both abort and failure:

clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Pending   0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Pending   0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     ContainerCreating   0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   1/1     Running             0          6s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Completed           0          43s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Terminating         0          47s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Terminating         0          47s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Pending             0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Pending             0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     ContainerCreating   0          0s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   1/1     Running             0          6s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Completed           0          2m24s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Terminating         0          2m26s
clearml-gcpdevqueue-id-05494501ea0940ebaac12c70791e58fb   0/1     Terminating         0          2m26s

I do want to note one thing: this works because the scheduler creates a Kind: Pod. If it were a Kind: Deployment, this setup would not have worked, as a Deployment's pod template restartPolicy can only be Always.


bmartinn commented Apr 28, 2021

Hi @Shaked
Apologies for the delayed reply here; great news that it worked!

I do want to note one thing: this works because the scheduler creates a Kind: Pod.

The glue forcefully overwrites the Kind: Pod set in the yaml template, making sure we are not accidentally using Kind: Deployment.

I will make sure that (1) the glue automatically sets restartPolicy: Never, and (2) we add the Kind: Job option as an additional execution flag.
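
A rough sketch of what those two changes could look like (the function and flag names are illustrative, not the actual clearml-agent options):

    def build_template(pod_spec, use_job_kind=False):
        # (1) always pin restartPolicy so k8s never relaunches the task pod
        pod_spec = dict(pod_spec)
        pod_spec["restartPolicy"] = "Never"
        if not use_job_kind:
            return {"apiVersion": "v1", "kind": "Pod", "spec": pod_spec}
        # (2) opt-in flag: wrap the pod spec in a batch/v1 Job instead
        return {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}},
        }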

I will update here once merged

Committed :) 4f18bb7

jkhenning commented

Closing this as it was already released. Please reopen if required.
