
Kubeflow components with gpu requirements, do not start on karpenter provisioned nodes #540

Closed
weshallsin opened this issue Jan 9, 2023 · 14 comments
Labels
bug Something isn't working work in progress Has been assigned and is in progress


@weshallsin

weshallsin commented Jan 9, 2023

I am using Kubeflow Pipelines for model training, and everything works fine when using self-managed node groups.
The issue arises when using Karpenter to provision nodes for pipeline components that require GPUs. The GPU node gets provisioned, but the pipeline component doesn't start. It is stuck with the following message:
```
MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
```
I am using Karpenter with the Bottlerocket AMI family.

Kubeflow version: v1.5
kfp version: 1.6.3
API version: v1

Here is a sample workflow spec:

```json
{
  "kind": "Workflow",
  "apiVersion": "argoproj.io/v1alpha1",
  "metadata": {
    "generateName": "demo-setup-pipeline-agent-test-",
    "creationTimestamp": null,
    "labels": {
      "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3"
    },
    "annotations": {
      "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3",
      "pipelines.kubeflow.org/pipeline_compilation_time": "2022-12-22T17:29:41.058053",
      "pipelines.kubeflow.org/pipeline_spec": "{\"name\": \"demo_setup_pipeline_agent_test\"}"
    }
  },
  "spec": {
    "templates": [
      {
        "name": "clearml-agent",
        "inputs": {},
        "outputs": {},
        "metadata": {
          "annotations": {
            "pipelines.kubeflow.org/component_ref": "{}",
            "pipelines.kubeflow.org/component_spec": "{\"implementation\": {\"container\": {\"args\": [], \"command\": [\"sh\", \"-c\", \"(PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' --user) \u0026\u0026 \\\"$0\\\" \\\"$@\\\"\", \"sh\", \"-ec\", \"program_path=$(mktemp)\\nprintf \\\"%s\\\" \\\"$0\\\" \u003e \\\"$program_path\\\"\\npython3 -u \\\"$program_path\\\" \\\"$@\\\"\\n\", \"def clearml_agent():\\n\\n    def clearml_setup():\\n        import os\\n        from clearml import Task\\n\\n        CLEARML_PROJECT = 'Vodafone Sentiment'\\n        CLEARML_TASK = 'vodafone dataset download'\\n        os.environ[\\\"CLEARML_PROJECT\\\"] = CLEARML_PROJECT\\n        os.environ[\\\"CLEARML_TASK\\\"] = CLEARML_TASK\\n        os.environ['MPLBACKEND'] = \\\"TkAg\\\" \\n\\n        Task.set_credentials(\\n         api_host=\\\"https://api.clear.ml\\\", \\n         web_host=\\\"https://app.clear.ml\\\", \\n         files_host=\\\"https://files.clear.ml\\\", \\n         key='********', \\n         secret='****'\\n        )\\n\\n        os.system('clearml-agent daemon --queue default')\\n\\n    clearml_setup()\\n\\nimport argparse\\n_parser = argparse.ArgumentParser(prog='Clearml agent', description='')\\n_parsed_args = vars(_parser.parse_args())\\n\\n_outputs = clearml_agent(**_parsed_args)\\n\"], \"image\": \"huggingface/transformers-pytorch-gpu\"}}, \"name\": \"Clearml agent\"}"
          },
          "labels": {
            "pipelines.kubeflow.org/kfp_sdk_version": "1.6.3",
            "pipelines.kubeflow.org/pipeline-sdk-type": "kfp"
          }
        },
        "container": {
          "name": "",
          "image": "huggingface/transformers-pytorch-gpu",
          "command": [
            "sh",
            "-c",
            "(PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'clearml' 'clearml-agent' --user) \u0026\u0026 \"$0\" \"$@\"",
            "sh",
            "-ec",
            "program_path=$(mktemp)\nprintf \"%s\" \"$0\" \u003e \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n",
            "def clearml_agent():\n\n    def clearml_setup():\n        import os\n        from clearml import Task\n\n        CLEARML_PROJECT = 'Vodafone Sentiment'\n        CLEARML_TASK = 'vodafone dataset download'\n        os.environ[\"CLEARML_PROJECT\"] = CLEARML_PROJECT\n        os.environ[\"CLEARML_TASK\"] = CLEARML_TASK\n        os.environ['MPLBACKEND'] = \"TkAg\" \n\n        Task.set_credentials(\n         api_host=\"https://api.clear.ml\", \n         web_host=\"https://app.clear.ml\", \n         files_host=\"https://files.clear.ml\", \n         key='*************', \n         secret='***************'\n        )\n\n        os.system('clearml-agent daemon --queue default')\n\n    clearml_setup()\n\nimport argparse\n_parser = argparse.ArgumentParser(prog='Clearml agent', description='')\n_parsed_args = vars(_parser.parse_args())\n\n_outputs = clearml_agent(**_parsed_args)\n"
          ],
          "resources": {
            "limits": {
              "cpu": "10",
              "memory": "20G",
              "nvidia.com/gpu": "1"
            }
          }
        }
      },
      {
        "name": "demo-setup-pipeline-agent-test",
        "inputs": {},
        "outputs": {},
        "metadata": {},
        "dag": {
          "tasks": [
            {
              "name": "clearml-agent",
              "template": "clearml-agent",
              "arguments": {}
            }
          ]
        }
      }
    ],
    "entrypoint": "demo-setup-pipeline-agent-test",
    "arguments": {},
    "serviceAccountName": "pipeline-runner"
  },
  "status": {
    "startedAt": null,
    "finishedAt": null
  }
}
```
@weshallsin weshallsin added the bug Something isn't working label Jan 9, 2023
@weshallsin weshallsin changed the title Kubeflow components with gpu requirements, do not start on karpenter nodes Kubeflow components with gpu requirements, do not start on karpenter provisioned nodes Jan 9, 2023
@surajkota
Contributor

surajkota commented Jan 10, 2023

(updated) Bottlerocket OS only ships containerd as the container runtime, and the Argo installed might be using docker as the default executor. Emissary is the default executor for the Argo version (3.3.8) shipped in Kubeflow 1.6. Can you try this on Kubeflow 1.6 and let us know if you are able to reproduce it?
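If a deployment is still on the docker executor, the switch is a one-line change in the Argo controller ConfigMap. This is a sketch, not a tested manifest; the ConfigMap name and `kubeflow` namespace assume a default Kubeflow install:

```yaml
# workflow-controller-configmap in the kubeflow namespace (default Kubeflow install).
# The docker executor mounts /var/run/docker.sock from the host, which does not
# exist on containerd-only hosts like Bottlerocket; emissary has no such dependency.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  containerRuntimeExecutor: emissary
```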

Also, can you format the workflow spec correctly?

Based on your Slack message, you already tried changing the executor to emissary and are facing another issue:

My pods keep getting evicted with the error: The node was low on resource: ephemeral-storage, whereas the same pods run fine on self-managed node groups. I have tried adding ephemeral-storage requests and limits as suggested here: https://stackoverflow.com/questions/59906810/the-node-was-low-on-resource-ephemeral-storage but it does not work either.

@mathsavvy Which deployment option of Kubeflow do you use?

@weshallsin
Author

@surajkota I am using the Kubeflow on AWS deployment (RDS+S3), v1.5.

@ananth102
Contributor

Which EKS version are you using?

@weshallsin
Author

@ananth102 EKS 1.21

@ananth102
Contributor

ananth102 commented Jan 16, 2023

I was able to get a GPU smoke check/volume operation to work with Karpenter on 1.6.1. I will try it on 1.5. This was my Karpenter configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["p"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
```

Pipeline I used:

```python
import kfp
from kfp import dsl, components

@components.create_component_from_func
def write_to_volume():
    with open("/mnt/file.txt", "w") as file:
        file.write("Hello world")

def gpu_smoking_check_op():
    return dsl.ContainerOp(
        name='check',
        image='tensorflow/tensorflow:latest-gpu',
        command=['sh', '-c'],
        arguments=['nvidia-smi']
    ).set_gpu_limit(1)

@dsl.pipeline(
    name='GPU smoke check',
    description='smoke check as to whether GPU env is ready.'
)
def gpu_pipeline():
    gpu_smoking_check = gpu_smoking_check_op()
    vop = dsl.VolumeOp(
        name="create-pvc",
        resource_name="my-pvc",
        modes=dsl.VOLUME_MODE_RWO,
        size="1Gi"
    )

    write_to_volume().add_pvolumes({"/mnt": vop.volume})
```

@surajkota
Contributor

@weshallsin are you able to confirm this on 1.6.1?

@weshallsin
Author

@surajkota Sorry to keep you guys waiting. I am currently travelling, will check and update here asap.

@ryansteakley
Contributor

@weshallsin any updates?

@surajkota
Contributor

Closing since there has been no update in over a month. Please reopen the issue when you have more data.

Thanks

@surajkota surajkota closed this as not planned Mar 13, 2023
@weshallsin
Author

weshallsin commented May 4, 2023

Hi @ryansteakley @surajkota @ananth102

Sorry for such a delayed response. I was finally able to check this on kubeflow v1.6.

The smoke test shared by @ananth102 does work, but we wish to use Python function-based Kubeflow components, and those still don't work. My component code is below:

```python
import kfp
from kfp.components import create_component_from_func
import kfp.dsl as dsl

def gpu_check():
    from subprocess import getoutput
    gpu_info = getoutput('nvidia-smi')
    print(gpu_info)

smoke_test_op = create_component_from_func(
    gpu_check, base_image='tensorflow/tensorflow:latest-gpu')

@dsl.pipeline(
  name='gpu-test-pipeline',
)
def gpu_pipeline():
    task = smoke_test_op().set_gpu_limit(1).set_memory_limit('5').set_cpu_limit('5')
```

The instance (g4dn.2xlarge) was provisioned by Karpenter, but the component is stuck in the pending state and I see the following error when describing the pod:

```
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  4m1s (x2 over 4m2s)  default-scheduler  0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient nvidia.com/gpu.
  Normal   Nominated         3m57s                karpenter          Pod should schedule on ip-xx-xx-xx-xxx.ec2.internal
  Normal   Scheduled         3m18s                default-scheduler  Successfully assigned kubeflow-user-example-com/gpu-test-pipeline-5slr5-3385059224 to ip-xx-xx-xx-xxx.ec2.internal
  Normal   Pulling           3m17s                kubelet            Pulling image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance"
  Normal   Pulled            3m13s                kubelet            Successfully pulled image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance" in 4.557527705s (4.557535404s including waiting)
  Normal   Created           3m13s                kubelet            Created container init
  Normal   Started           3m13s                kubelet            Started container init
  Normal   Pulled            3m2s                 kubelet            Container image "gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance" already present on machine
  Normal   Created           3m2s                 kubelet            Created container wait
  Normal   Started           3m2s                 kubelet            Started container wait
  Normal   Pulling           3m2s                 kubelet            Pulling image "tensorflow/tensorflow:latest-gpu"
  Normal   Pulled            41s                  kubelet            Successfully pulled image "tensorflow/tensorflow:latest-gpu" in 2m21.024395881s (2m21.024404372s including waiting)
  Normal   Created           41s                  kubelet            Created container main
  Warning  Failed            24s                  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 5 (current usage: 2859008, peak usage: 3244032): unknown
```

@surajkota surajkota reopened this May 5, 2023
@surajkota
Contributor

Hi @weshallsin, the error looks like it is related to the limits specified. Can you try the following:

@askulkarni2

@weshallsin can you confirm this can be closed, now that GPU instances are working through Karpenter?

@surajkota
Contributor

Can you please post the resolution?

@surajkota surajkota added the work in progress Has been assigned and is in progress label May 13, 2023
@weshallsin
Author

weshallsin commented May 15, 2023

@surajkota @askulkarni2 I can confirm that the issue was with the memory units: a bare '5' in set_memory_limit('5') is interpreted as 5 bytes. The pipeline components with GPU requirements are working after upgrading to the latest Kubeflow.
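For reference, a minimal sketch (not the kfp or Kubernetes implementation; `parse_quantity` is a hypothetical helper) of how Kubernetes-style quantity suffixes change the meaning of a memory limit, and why a bare '5' fails:

```python
# Hypothetical helper illustrating Kubernetes-style quantity suffixes.
# A bare number like '5' means 5 bytes, which matches the error above:
# "unable to set memory limit to 5 (current usage: 2859008 ...)".

SUFFIXES = {
    "": 1,                                            # no suffix: plain bytes
    "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,  # decimal suffixes
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
}

def parse_quantity(quantity: str) -> int:
    """Return the number of bytes a quantity string represents."""
    number, suffix = quantity, ""
    # Try the longest suffixes first so "Gi" is not misread as "G".
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if s and quantity.endswith(s):
            number, suffix = quantity[: -len(s)], s
            break
    return int(float(number) * SUFFIXES[suffix])

print(parse_quantity("5"))   # 5 bytes -- far too small for any container
print(parse_quantity("5G"))  # 5000000000 -- what set_memory_limit('5G') means
```

So `set_memory_limit('5G')` (or `'5Gi'`) would have requested a usable limit, while `'5'` asked the kernel for a 5-byte cgroup memory cap.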

@surajkota surajkota mentioned this issue Jun 5, 2023