Kubeflow components with gpu requirements, do not start on karpenter provisioned nodes #540
Comments
(updated) Bottlerocket OS only has containerd as the container runtime, and the installed Argo might be using docker as the default executor. Emissary is the default executor for the Argo version (3.3.8) shipped in Kubeflow 1.6. Can you try this on Kubeflow 1.6 and let us know if you are able to reproduce it? Also, can you format the workflow spec correctly? Based on your Slack message, you already tried changing the executor to emissary and are facing another issue.
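One way to confirm that a pending pod still depends on the Docker executor is to look for a hostPath volume pointing at `/var/run/docker.sock` in its spec, which is what the `docker-sock` volume in the reported error refers to. The following is a minimal sketch using only the Python standard library; the function name `find_docker_socket_volumes` and the example manifest are hypothetical and not taken from this issue.

```python
from typing import Any, Dict, List

# Path that the Docker executor expects on the host. Nodes that run only
# containerd (such as Bottlerocket) do not provide a socket at this path.
DOCKER_SOCKET_PATH = "/var/run/docker.sock"


def find_docker_socket_volumes(pod: Dict[str, Any]) -> List[str]:
    """Return the names of volumes in a pod manifest that mount the Docker socket."""
    volumes = pod.get("spec", {}).get("volumes") or []
    return [
        volume.get("name", "<unnamed>")
        for volume in volumes
        if (volume.get("hostPath") or {}).get("path") == DOCKER_SOCKET_PATH
    ]


# Hypothetical manifest for illustration only; not the pod spec from this issue.
example_pod = {
    "spec": {
        "volumes": [
            {
                "name": "docker-sock",
                "hostPath": {"path": "/var/run/docker.sock", "type": "Socket"},
            },
            {"name": "scratch", "emptyDir": {}},
        ]
    }
}

print(find_docker_socket_volumes(example_pod))
```

To run it against a real pod, the manifest can be obtained with `kubectl get pod <pod-name> -o json` and parsed with `json.load`. A non-empty result suggests the workflow is still using the docker executor rather than emissary.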
@mathsavvy Which deployment option of Kubeflow do you use?
@surajkota I am using Kubeflow on AWS deployment (RDS+S3) v1.5.
Which EKS version are you using?
@ananth102 EKS 1.21
I was able to get a gpu smoke check/volume operation to work with karpenter on 1.6.1. I will try it on 1.5. This was my karpenter configuration:
Pipeline I used:
@weshallsin are you able to confirm this on 1.6.1?
@surajkota Sorry to keep you guys waiting. I am currently travelling, will check and update here asap.
@weshallsin any updates?
Closing since there has been no update in the last 1+ months. Please reopen the issue when you have more data. Thanks
Hi @ryansteakley @surajkota @ananth102 Sorry for such a delayed response. I was finally able to check this on Kubeflow v1.6. The smoke test shared by @ananth102 does work well, but we wish to use Python function-based Kubeflow components and that still doesn't work. Please note my component code below:
The instance (g4dn.2xlarge) was provisioned by karpenter, but the component is stuck in the pending state and I see the following error on describing the pod:
Hi @weshallsin, the error looks like it is related to the limits specified. Can you try the following:
@weshallsin can you confirm this can be closed now that GPU instances are working through karpenter?
Can you please post the resolution?
@surajkota @askulkarni2 I can confirm that the issue was with the memory units. The pipeline components with GPU requirements are working after upgrading to the latest Kubeflow.
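The thread does not show which memory unit was wrong, so the following is only a sketch of how a memory limit string could be checked before it is passed to a pipeline component. It assumes the standard Kubernetes resource quantity format (a number followed by an optional suffix such as `Mi`, `Gi`, `M`, or `G`); the helper name `is_valid_quantity` is hypothetical, and the KFP SDK may apply its own, stricter validation.

```python
import re

# A number followed by an optional binary suffix (Ki..Ei), decimal suffix
# (n, u, m, k, M, G, T, P, E), or a decimal exponent such as e3.
_QUANTITY = re.compile(
    r"[+-]?(\d+(\.\d*)?|\.\d+)"
    r"(Ki|Mi|Gi|Ti|Pi|Ei|[numkMGTPE]|[eE][+-]?\d+)?"
)


def is_valid_quantity(value: str) -> bool:
    """Return True if value looks like a Kubernetes resource quantity."""
    return _QUANTITY.fullmatch(value) is not None


# Suffixes are case-sensitive, and "GB"/"gb" are not accepted spellings.
for candidate in ["16Gi", "16G", "16GB", "16gb", "1.5Gi"]:
    print(candidate, is_valid_quantity(candidate))
```

Running a check like this on the value given for a memory request or limit catches spellings such as `16GB` before the pod is created.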
I am using kubeflow pipelines for model training and everything works fine when using self-managed node-groups.
The issues arise when using karpenter to provision nodes for pipeline components that require GPUs. The gpu node gets provisioned but the pipeline component doesn't start. It is stuck with the following message:
MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
I am using karpenter with Bottlerocket AMI Family.
Kubeflow version : v1.5
kfp version : 1.6.3
API version : v1
Here is a sample pod spec: