
Kubeflow components with gpu requirements, do not start on karpenter provisioned nodes #540

Closed
weshallsin opened this issue Jan 9, 2023 · 14 comments
Labels
bug Something isn't working work in progress Has been assigned and is in progress


@weshallsin

weshallsin commented Jan 9, 2023

I am using Kubeflow Pipelines for model training, and everything works fine when using self-managed node groups.
The issue arises when using Karpenter to provision nodes for pipeline components that require GPUs. The GPU node gets provisioned, but the pipeline component doesn't start. It is stuck with the following message:
```
MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
```
I am using Karpenter with the Bottlerocket AMI family.

Kubeflow version: v1.5
kfp version: 1.6.3
API version: v1

Here is a sample workflow spec:

```json
{
  "kind": "Workflow",
  "apiVersion": "argoproj.io/v1alpha1",
  "metadata": {
    "generateName": "demo-setup-pipeline-agent-test-",
    "creationTimestamp": null,
    "labels": {
      "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3"
    },
    "annotations": {
      "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3",
      "pipelines.kubeflow.org/pipeline_compilation_time": "2022-12-22T17:29:41.058053",
      "pipelines.kubeflow.org/pipeline_spec": "{\"name\": \"demo_setup_pipeline_agent_test\"}"
    }
  },
  "spec": {
    "templates": [
      {
        "name": "clearml-agent",
        "inputs": {},
        "outputs": {},
        "metadata": {
          "annotations": {
            "pipelines.kubeflow.org/component_ref": "{}",
            "pipelines.kubeflow.org/component_spec": "{\"implementation\": {\"container\": {\"args\": [], \"command\": [\"sh\", \"-c\", \"(PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' --user) \u0026\u0026 \\\"$0\\\" \\\"$@\\\"\", \"sh\", \"-ec\", \"program_path=$(mktemp)\\nprintf \\\"%s\\\" \\\"$0\\\" \u003e \\\"$program_path\\\"\\npython3 -u \\\"$program_path\\\" \\\"$@\\\"\\n\", \"def clearml_agent():\\n\\n    def clearml_setup():\\n        import os\\n        from clearml import Task\\n\\n        CLEARML_PROJECT = 'Vodafone Sentiment'\\n        CLEARML_TASK = 'vodafone dataset download'\\n        os.environ[\\\"CLEARML_PROJECT\\\"] = CLEARML_PROJECT\\n        os.environ[\\\"CLEARML_TASK\\\"] = CLEARML_TASK\\n        os.environ['MPLBACKEND'] = \\\"TkAg\\\" \\n\\n        Task.set_credentials(\\n         api_host=\\\"https://api.clear.ml\\\", \\n         web_host=\\\"https://app.clear.ml\\\", \\n         files_host=\\\"https://files.clear.ml\\\", \\n         key='********', \\n         secret='****'\\n        )\\n\\n        os.system('clearml-agent daemon --queue default')\\n\\n    clearml_setup()\\n\\nimport argparse\\n_parser = argparse.ArgumentParser(prog='Clearml agent', description='')\\n_parsed_args = vars(_parser.parse_args())\\n\\n_outputs = clearml_agent(**_parsed_args)\\n\"], \"image\": \"huggingface/transformers-pytorch-gpu\"}}, \"name\": \"Clearml agent\"}"
          },
          "labels": {
            "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3",
            "pipelines.kubeflow.org/pipeline-sdk-type": "kfp"
          }
        },
        "container": {
          "name": "",
          "image": "huggingface/transformers-pytorch-gpu",
          "command": [
            "sh",
            "-c",
            "(PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' --user) \u0026\u0026 \"$0\" \"$@\"",
            "sh",
            "-ec",
            "program_path=$(mktemp)\nprintf \"%s\" \"$0\" \u003e \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n",
            "def clearml_agent():\n\n    def clearml_setup():\n        import os\n        from clearml import Task\n\n        CLEARML_PROJECT = 'Vodafone Sentiment'\n        CLEARML_TASK = 'vodafone dataset download'\n        os.environ[\"CLEARML_PROJECT\"] = CLEARML_PROJECT\n        os.environ[\"CLEARML_TASK\"] = CLEARML_TASK\n        os.environ['MPLBACKEND'] = \"TkAg\" \n\n        Task.set_credentials(\n         api_host=\"https://api.clear.ml\", \n         web_host=\"https://app.clear.ml\", \n         files_host=\"https://files.clear.ml\", \n         key='*************', \n         secret='***************'\n        )\n\n        os.system('clearml-agent daemon --queue default')\n\n    clearml_setup()\n\nimport argparse\n_parser = argparse.ArgumentParser(prog='Clearml agent', description='')\n_parsed_args = vars(_parser.parse_args())\n\n_outputs = clearml_agent(**_parsed_args)\n"
          ],
          "resources": {
            "limits": {
              "cpu": "10",
              "memory": "20G",
              "nvidia.com/gpu": "1"
            }
          }
        }
      },
      {
        "name": "demo-setup-pipeline-agent-test",
        "inputs": {},
        "outputs": {},
        "metadata": {},
        "dag": {
          "tasks": [
            {
              "name": "clearml-agent",
              "template": "clearml-agent",
              "arguments": {}
            }
          ]
        }
      }
    ],
    "entrypoint": "demo-setup-pipeline-agent-test",
    "arguments": {},
    "serviceAccountName": "pipeline-runner"
  },
  "status": {
    "startedAt": null,
    "finishedAt": null
  }
}
```
@weshallsin weshallsin added the bug Something isn't working label Jan 9, 2023
@weshallsin weshallsin changed the title Kubeflow components with gpu requirements, do not start on karpenter nodes Kubeflow components with gpu requirements, do not start on karpenter provisioned nodes Jan 9, 2023
@surajkota
Contributor

surajkota commented Jan 10, 2023

(updated) Bottlerocket OS only ships containerd as the container runtime, and the Argo installed might be using docker as the default executor. Emissary is the default executor for the Argo version (3.3.8) shipped in Kubeflow 1.6. Can you try this on Kubeflow 1.6 and let us know if you are able to reproduce it?
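If a deployment is still on the docker executor, the switch is a one-line change in the Argo controller ConfigMap. This is a sketch, not a tested manifest; the ConfigMap name and `kubeflow` namespace assume a default Kubeflow install:

```yaml
# workflow-controller-configmap in the kubeflow namespace (default Kubeflow install).
# The docker executor mounts /var/run/docker.sock from the host, which does not
# exist on containerd-only hosts like Bottlerocket; emissary has no such dependency.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  containerRuntimeExecutor: emissary
```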

Also, can you format the workflow spec correctly?

Based on your Slack message, you already tried changing the executor to emissary and are facing another issue:

My pods keep getting evicted with the error: The node was low on resource: ephemeral-storage, whereas the same pods run fine on self-managed node groups. I have tried adding ephemeral-storage requests and limits as suggested here: https://stackoverflow.com/questions/59906810/the-node-was-low-on-resource-ephemeral-storage but it does not work either.

@mathsavvy Which deployment option of Kubeflow do you use?

@weshallsin
Author

@surajkota I am using the Kubeflow on AWS deployment (RDS+S3), v1.5.

@ananth102
Contributor

Which EKS version are you using?

@weshallsin
Author

@ananth102 EKS 1.21

@ananth102
Contributor

ananth102 commented Jan 16, 2023

I was able to get a GPU smoke check/volume operation to work with Karpenter on 1.6.1. I will try it on 1.5. This was my Karpenter configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["p"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
```

Pipeline I used:

```python
import kfp
from kfp import dsl, components

@components.create_component_from_func
def write_to_volume():
    with open("/mnt/file.txt", "w") as file:
        file.write("Hello world")

def gpu_smoking_check_op():
    return dsl.ContainerOp(
        name='check',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1)

@dsl.pipeline(
    name='GPU smoke check',
    description='smoke check as to whether GPU env is ready.'
)
def gpu_pipeline():
    gpu_smoking_check = gpu_smoking_check_op()
    vop = dsl.VolumeOp(
        name="create-pvc",
        resource_name="my-pvc",
        modes=dsl.VOLUME_MODE_RWO,
        size="1Gi"
    )

    write_to_volume().add_pvolumes({"/mnt": vop.volume})
```

@surajkota
Contributor

@weshallsin are you able to confirm this on 1.6.1?

@weshallsin
Author

@surajkota Sorry to keep you guys waiting. I am currently travelling, will check and update here asap.

@ryansteakley
Contributor

@weshallsin any updates?

@surajkota
Contributor

Closing since there has been no update in over a month. Please reopen the issue when you have more data.

Thanks

@surajkota surajkota closed this as not planned Mar 13, 2023
@weshallsin
Author

weshallsin commented May 4, 2023

Hi @ryansteakley @surajkota @ananth102

Sorry for such a delayed response. I was finally able to check this on kubeflow v1.6.

The smoke test shared by @ananth102 does work, but we wish to use Python function-based Kubeflow components, and those still don't work. My component code is below:

```python
import kfp
from kfp.components import create_component_from_func
import kfp.dsl as dsl

def gpu_check():
    from subprocess import getoutput
    gpu_info = getoutput('nvidia-smi')
    print(gpu_info)

smoke_test_op = create_component_from_func(
    gpu_check, base_image='tensorflow/tensorflow:latest-gpu')

@dsl.pipeline(
  name='gpu-test-pipeline',
)
def gpu_pipeline():
    task = smoke_test_op().set_gpu_limit(1).set_memory_limit('5').set_cpu_limit('5')
```

The instance (g4dn.2xlarge) was provisioned by Karpenter, but the component is stuck in the pending state and I see the following error when describing the pod:

```
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  4m1s (x2 over 4m2s)  default-scheduler  0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient nvidia.com/gpu.
  Normal   Nominated         3m57s                karpenter          Pod should schedule on ip-xx-xx-xx-xxx.ec2.internal
  Normal   Scheduled         3m18s                default-scheduler  Successfully assigned kubeflow-user-example-com/gpu-test-pipeline-5slr5-3385059224 to ip-xx-xx-xx-xxx.ec2.internal
  Normal   Pulling           3m17s                kubelet            Pulling image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance"
  Normal   Pulled            3m13s                kubelet            Successfully pulled image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance" in 4.557527705s (4.557535404s including waiting)
  Normal   Created           3m13s                kubelet            Created container init
  Normal   Started           3m13s                kubelet            Started container init
  Normal   Pulled            3m2s                 kubelet            Container image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance" already present on machine
  Normal   Created           3m2s                 kubelet            Created container wait
  Normal   Started           3m2s                 kubelet            Started container wait
  Normal   Pulling           3m2s                 kubelet            Pulling image "tensorflow/tensorflow:latest-gpu"
  Normal   Pulled            41s                  kubelet            Successfully pulled image "tensorflow/tensorflow:latest-gpu" in 2m21.024395881s (2m21.024404372s including waiting)
  Normal   Created           41s                  kubelet            Created container main
  Warning  Failed            24s                  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 5 (current usage: 2859008, peak usage: 3244032): unknown
```

@surajkota surajkota reopened this May 5, 2023
@surajkota
Contributor

Hi @weshallsin, the error looks like it is related to the limits specified. Can you try the following:

@askulkarni2

@weshallsin can you confirm this can be closed, now that GPU instances are working through Karpenter?

@surajkota
Contributor

Can you please post the resolution?

@surajkota surajkota added the work in progress Has been assigned and is in progress label May 13, 2023
@weshallsin
Author

weshallsin commented May 15, 2023

@surajkota @askulkarni2 I can confirm that the issue was with the memory units: a bare '5' in set_memory_limit('5') is interpreted as 5 bytes. The pipeline components with GPU requirements are working after upgrading to the latest Kubeflow.
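For reference, a minimal sketch (not the kfp or Kubernetes implementation; `parse_quantity` is a hypothetical helper) of how Kubernetes-style quantity suffixes change the meaning of a memory limit, and why a bare '5' fails:

```python
# Hypothetical helper illustrating Kubernetes-style quantity suffixes.
# A bare number like '5' means 5 bytes, which matches the error above:
# "unable to set memory limit to 5 (current usage: 2859008 ...)".

SUFFIXES = {
    "": 1,                                            # no suffix: plain bytes
    "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,  # decimal suffixes
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
}

def parse_quantity(quantity: str) -> int:
    """Return the number of bytes a quantity string represents."""
    number, suffix = quantity, ""
    # Try the longest suffixes first so "Gi" is not misread as "G".
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if s and quantity.endswith(s):
            number, suffix = quantity[: -len(s)], s
            break
    return int(float(number) * SUFFIXES[suffix])

print(parse_quantity("5"))   # 5 bytes -- far too small for any container
print(parse_quantity("5G"))  # 5000000000 -- what set_memory_limit('5G') means
```

So `set_memory_limit('5G')` (or `'5Gi'`) would have requested a usable limit, while `'5'` asked the kernel for a 5-byte cgroup memory cap.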

@surajkota surajkota mentioned this issue Jun 5, 2023