k8s glue: clearml_agent: ERROR: Execution required enqueued task #56
Comments
Hi @Shaked, let me see if I understand you correctly: it seems the Task failed and the pod restarts, is that correct? It seems the main issue is the restart policy of the pod; basically it should be none (if the pod ended for any reason, leave it as is). Could it be that it was set incorrectly in the template yaml provided to the k8s glue? Regarding the second error: the idea is that you will be able to abort a job from
Correct
This is not possible by k8s design:
Nope
I understand that part, however in this case the user didn't want to abort the execution of the task; it just failed, while the agent kept trying to run the worker pod. This ends with a loop of restarts: the first fails for the original reason the task failed for, and then all the other tries fail with the `Execution required enqueued task` error. In order to make this work, the agent should create k8s Jobs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-template
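For illustration, a minimal sketch of what such a Job could look like; the name, image, and agent command here are placeholders, not what the glue actually generates:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: clearml-task-abc123          # hypothetical name derived from the task id
spec:
  backoffLimit: 0                    # do not retry the pod when it fails
  template:
    spec:
      restartPolicy: Never           # Job pods only allow Never or OnFailure
      containers:
      - name: clearml-agent          # placeholder container name
        image: allegroai/clearml-agent   # assumed image, for illustration only
        command: ["clearml-agent", "execute", "--id", "<task-id>"]
```

With `backoffLimit: 0` the Job gives up after the first failed pod instead of looping, which is exactly the behaviour asked for above.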
Hi @Shaked, this should solve any restart policy issue, since from the k8s perspective the pod fully executed itself (see the sketch below).
This is a good idea; in a way this really is a "Job", not a service. I'm trying to find when exactly the `Job` resource was introduced.
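A sketch of that idea in pod-template terms; the agent invocation is illustrative, not the glue's actual generated command:

```yaml
spec:
  restartPolicy: OnFailure
  containers:
  - name: clearml-agent                  # placeholder container name
    image: allegroai/clearml-agent       # assumed image, for illustration only
    command: ["/bin/bash", "-c"]
    # The trailing 'exit 0' makes the container report success even when the
    # task itself failed, so under OnFailure k8s considers the pod completed
    # and does not restart it. Note that restartPolicy Always restarts the
    # container regardless of exit code, so the trick only helps here.
    args: ["clearml-agent execute --id <task-id> ; exit 0"]
```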
I'm currently running the latest version of clearml-agent from master, including these latest changes:
and the failed pod keeps trying to connect while the clearml task has already failed. The first time it failed was for something internal:
But then it was marked as
The clearml task has failed of course:
It seems like I was only able to find this cached page: http://web.archive.org/web/20201129194804/https://v1-15.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/ and in the k8s Slack channel they even say that Job was introduced somewhere around 1.5: https://kubernetes.slack.com/archives/C09NXKJKA/p1619353295486400?thread_ts=1619349773.473000&cid=C09NXKJKA
I guess at first it shouldn't be set as the default, as it might break older deployments and confuse your users. I'd suggest using the k8s way of making these types of changes: first announce it, but don't make it the default, and let the users know that it will become the default after version X.Y.
Can you verify that the pod cmd ended with `exit 0`?
I'm in, let's have it as a flag; it probably makes sense not to have it as the default :)
Yes, it ended with `exit 0`:
I think that otherwise it would not have set the pod reason to `Completed`.
This also happens when I abort a clearml task, not only when it fails.
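For reference, this is roughly the status fragment in question on a pod whose container exited with code 0 (abridged output of `kubectl get pod <name> -o yaml`; the container name is a placeholder):

```yaml
status:
  phase: Succeeded
  containerStatuses:
  - name: clearml-agent    # placeholder container name
    state:
      terminated:
        exitCode: 0
        reason: Completed  # shown in the STATUS column of 'kubectl get pods'
```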
+1
Wait, now I'm confused, if it says
I think that you have to explicitly add `restartPolicy: Never`:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
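In template terms, that would mean something like the following in the yaml provided to the glue; a sketch only, with placeholder container details:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: clearml-worker              # hypothetical name
spec:
  restartPolicy: Never              # valid values: Always (default), OnFailure, Never
  containers:
  - name: clearml-agent             # placeholder container name
    image: allegroai/clearml-agent  # assumed image, for illustration only
```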
I guess it should be added somewhere around `clearml_agent/glue/k8s.py` line 415 (commit 537b67e).
@bmartinn I just tried to add `restartPolicy: Never` to the template, and it works.
I do want to note one thing: this works because the scheduler creates a |
Hi @Shaked
The glue forcefully overwrites the `restartPolicy`; I will make sure that we have it set correctly, and I will update here once merged.
Committed :) 4f18bb7
Closing this as it was already released. Please reopen if required.
While using the k8s glue, there seems to be a weird behaviour which ends up with the following error:
Steps to reproduce:
Expected
There are two options here:
Actual
The task is marked as `failed` within the clearml/trains UI. However, the agent keeps trying to run the pod which was initiated by the task, and it keeps failing with the following error message in an endless loop.

Unfortunately, it is not possible to change the restartPolicy of an existing pod.
Therefore I believe that this should be handled by the agent itself, i.e. if the task fails, the agent should act accordingly.