
[WIP] Initial support for Intel Gaudi accelerators (PR version #2)#2300

Closed
dweineha wants to merge 16 commits into dell:main from dweineha:release_1.7_fixes2

Conversation

@dweineha

This PR adds initial support for Intel Gaudi accelerators; it rectifies some issues identified in the previous PR.

NOTE: This is for feedback and review only.

yhluo946 and others added 8 commits July 18, 2024 14:45
Add OS prerequisite for Gaudi drivers.
Only Ubuntu is supported.

---------
Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add Preliminary Intel Gaudi support to omnia accelerator playbooks

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Fix Intel Gaudi scripts to make ansible-lint happy.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Use true instead of yes, and add surrounding spaces for jinja2
expressions.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Require a newer version of mpi-operator.

Signed-off-by: Fengfeng Tao <fengfeng.tao@intel.com>
Intel Gaudi uses a custom runtime. Configure containerd to use
it on nodes that have Intel Gaudi accelerators installed.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
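For reference, a minimal sketch of what such a containerd change could look like as Ansible tasks; the TOML section names and the habana-container-runtime binary path follow Habana's public containerd instructions and are assumptions here, not necessarily the exact tasks in this PR:

- name: Register the habana runtime in containerd (sketch; Gaudi nodes only)
  ansible.builtin.blockinfile:
    path: /etc/containerd/config.toml
    marker: "# {mark} ANSIBLE MANAGED BLOCK - habana runtime"
    block: |
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
        runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
          # Binary path assumed from the habanalabs-container-runtime package
          BinaryName = "/usr/bin/habana-container-runtime"

- name: Restart containerd to pick up the runtime change
  ansible.builtin.service:
    name: containerd
    state: restarted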
Add the necessary configuration to deploy Kubernetes
with support for Intel Gaudi.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add postscripts for installing Intel Gaudi drivers after OS provisioning.
Rename "gaudi" to "intelgaudi" in software config per Dell's feedback.
accelerator_config is going to be deprecated,
change gaudi version input and check the same way AMD does.
Do not install base packages on non-Gaudi nodes.
Remove tuneD.
Remove unused variables.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
@@ -0,0 +1,21 @@
// Copyright 2024 Intel Corporation.
Collaborator

C code is legally not allowed in the omnia repo.


This file is used to verify whether the TPC compiler works, just like test_ROCm_code.cpp.
Since omnia already contains test_ROCm_code.cpp, does that mean omnia allows C++ code but not C code?

Collaborator

We need to remove test_ROCm_code.cpp also


@yupengzh-intel, just remove our c code. @priti-parate should take care of the other ones that are not part of this PR.


let's remove it and remove the test too

- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml

- name: Include accelerator_config
Collaborator

not required as we need to read values from software_config.json


Ok, will remove that.

Comment thread accelerator/roles/intel/vars/main.yml Outdated
intel_gaudi_device_pattern: "Processing accelerators: Habana Labs Ltd."

intel_habana_packages:
- habanalabs-container-runtime
Collaborator

container-runtime should be part of scheduler.yml. Just after the Kubernetes installation we can call the k8s_habana_container_runtime role.


Will move to scheduler


intel_habana_packages:
- habanalabs-container-runtime
- habanalabs-dkms
Collaborator

All other packages need to be read from gaudi.json, meaning the playbook should install whatever is present in gaudi.json; that way, if a package changes later, the change is applied automatically just by updating the gaudi.json file.


I see that AMD defines the package names in the role too, but I think it is a good idea to define the package list only in the JSON, to achieve centralized configuration. The provisioning postscripts, however, still need to manage their own copy.


We agree to merge; this will be handled in another PR. @yupengzh-intel, please add a comment: TODO: move to a central config file.
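For illustration, a centralized entry in intelgaudi.json might look like the sketch below; the outer structure and the extra package entries are assumptions, only the habanalabs-dkms line mirrors the format shown later in this PR:

{
  "intelgaudi": {
    "cluster": [
      { "package": "habanalabs-dkms={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" },
      { "package": "habanalabs-container-runtime={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" },
      { "package": "habanatools={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" }
    ]
  }
}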

- habanatools

intel_apt_base_packages:
- cmake
Collaborator

All dependency packages also need to be added to gaudi.json.


Ok, will change


Same as above; it needs to be centralized.

Do CodeQL scans on push and pull requests.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
"package": "habanalabs-dkms={{ intelgaudi_version }}",
"type": "deb",
"repo_name": "gaudi"
},
Collaborator

Different subgroups for Gaudi drivers and the Habana stack.

Collaborator

So Gaudi.json should have subgroup details as below (screenshot omitted).


will be taken care of in a subsequent PR

{"name": "pytorch"},
{"name": "tensorflow"}
{"name": "tensorflow"},
{"name": "intelgaudi", "version": "1.16.2-2"}
Collaborator

Add a subgroup for habana, similar to amdgpu and rocm. Only if gaudi is present in softwares should the habana stack be set up.

Collaborator

@priti-parate priti-parate Jul 31, 2024

We expect changes something like the below in software_config.json, so the group name can be "intelgaudi" and the subgroup will be habana (screenshot omitted).
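Since the screenshot is not reproduced here, an illustrative sketch of the requested software_config.json layout, assuming it follows the amdgpu/rocm pattern (the exact keys and version are assumptions):

"softwares": [
    {"name": "intelgaudi", "version": "1.16.2-2"}
],
"intelgaudi": [
    {"name": "habana"}
]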


will be taken care of in a subsequent PR


# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status
  ansible.builtin.include_tasks: validate_amd.yml
Collaborator

Please do not remove this; instead, add a new task.


same as above


# Usage: configure_gaudi.yml
gaudi_postscripts_path:
- { src: "{{ role_path }}/templates/omnia_gaudi.j2", dest: "/install/postscripts/omnia_gaudi", mode: "755" }
Collaborator

@Aditya-DP have you also implemented it this way?

Collaborator

@priti-parate we have made this change with respect to the gaudi postscript.


same as the previous comment

@@ -0,0 +1,70 @@
#!/bin/bash
Collaborator

@Aditya-DP @abhishek-sa1 does this align with our implementation?

Collaborator

@priti-parate This has already been implemented by our team and needs to be reverted.


@priti-parate, this was requested by you in the past. @abhishek-sa1, can you provide the link to the PR your team is working on? I don't see any PR or branch regarding this. We shouldn't have multiple work streams. Let's sync tomorrow in our Omnia meeting to close on this.


The script is intended as a reference and can be overwritten by a subsequent PR

@@ -0,0 +1,70 @@
#!/bin/bash
################################################################################################################
# omnia_rocm:
Collaborator

change rocm to habana

@@ -0,0 +1,42 @@
name: Security Scanning
Collaborator

Please revert; the omnia team will be taking care of this.

Comment thread accelerator/README.rst Outdated
============

The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform or the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_. These tools allow users to unlock the potential of installed GPUs.
The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform, the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_ or the `Intel Gaudi <https://docs.habana.ai/en/latest/index.html>`_ platform. These tools allow users to unlock the potential of installed GPUs.
Collaborator

The doc will be updated by the omnia doc team; please remove this.


Shouldn't we keep this atomic? Why not have all the core modifications regarding Gaudi3 SW support in one place? That way we avoid issues related to features that are not fully complete. We don't want to miss the documentation. Can I suggest pushing this as part of this PR, and then if other changes are needed doing them in a new PR?


Agreed to revert this one

update_cache: true

# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages
Collaborator

Installing tuned packages will be part of a different utility; the omnia team is writing a playbook for that.


So, just to be clear, setting the number of hugepages here is required by Gaudi for running certain workloads. This value is not being set by any default tuned profile. I don't think this should be removed from the Gaudi provisioning task.
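For context, the task under discussion is roughly the sketch below; the hugepage count is left as a variable because the actual number is the one supplied by the Habana team, not something assumed here:

- name: Set the number of hugepages
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "{{ gaudi_hugepages_count }}"  # placeholder variable for the Habana-recommended value
    state: present
    sysctl_set: true
    reload: true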

Collaborator

We will be creating a separate utility to set such specific parameters for any GPU, and that will be called in Omnia.yml; hence we can remove this from here.


ok we will remove this

Comment thread utils/check_intel_gaudi_device.yml Outdated
# limitations under the License.
---

- name: Check Intel Gaudi HPU
Collaborator

Can we move this complete file to scheduler/roles/k8s_habana_container_runtime/tasks/ and update the filename to check_prerequisite.yml


I will move it to the new role.


# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages
  ansible.posix.sysctl:
Collaborator

Setting tuned parameters will also be part of the new playbook.

@ghandoura ghandoura Aug 1, 2024

ok we will remove this. Will be addressed in a future PR

Comment thread scheduler/scheduler.yml Outdated
- k8s_prepare_services

- name: Check for Intel Gaudi accelerator
  ansible.builtin.import_playbook: ../utils/check_intel_gaudi_device.yml
Collaborator

Create a role named scheduler/roles/k8s_habana_container_runtime and place this task there with the file name check_pre_requisite.yml.

- hostvars['127.0.0.1']['k8s_support']
- "'kube_control_plane' in group_names"
block:
- name: Change containerd runtime
Collaborator

Place this task in scheduler/roles/k8s_habana_container_runtime/tasks/main.yml


@yhluo946 can you please make this change? Feel free to reach out to @Katakam-Rakesh if you need help.

# limitations under the License.
---

- name: Uncomment habana-container-runtime config mount_accelerators line
Collaborator

Please move this file to scheduler/roles/k8s_habana_container_runtime/tasks/change_containerd_runtime.yml.
k8s_habana_container_runtime/tasks/main.yml should call check_pre_requisite.yml and change_containerd_runtime.yml
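A minimal sketch of the requested role entry point, assuming the two task files named above:

# scheduler/roles/k8s_habana_container_runtime/tasks/main.yml (sketch)
---
- name: Check prerequisites for the habana container runtime
  ansible.builtin.include_tasks: check_pre_requisite.yml

- name: Switch containerd to the habana runtime
  ansible.builtin.include_tasks: change_containerd_runtime.yml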


@yhluo946 can you please move this to a separate role and move the container runtime package installation here

- name: Check node accelerator status
  ansible.builtin.shell: |
    set -o pipefail
    lspci | grep -i "{{ intel_gaudi_device_pattern }}"
Collaborator

We don't need a separate playbook just for this command.


We agreed to keep this for the moment

- "'habanalabs-device-plugin-daemonset' not in k8s_pods.stdout"
- k8s_version >= minimal_gaudi_k8s_version
- accelerator_type is defined
- accelerator_type == "habana"
Collaborator

The accelerator_type variable is available to the kube_node group only, as it was set in utils/check_intel_gaudi_device.yml. In k8s_start_services/tasks/deploy_k8s_services.yaml the accelerator_type variable will always be in the VARIABLE NOT DEFINED state, because that role is called on the kube_control_plane group. Please fix this scenario.


# HABANA PLUGIN
- name: Deploy Habana Device plugin
  ansible.builtin.command: "kubectl create -f '{{ habana_device_plugin_yaml_url }}'"
Collaborator

The habana device plugin should be deployed only on nodes that have Gaudi, not on the other nodes.

E.g., say the k8s cluster has 4 nodes, two of which have Gaudi. In this case the plugin should be scheduled on the 2 Gaudi nodes but not on the other 2.


The device plugin is a daemonset. More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
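If the DaemonSet ever needs to be restricted to Gaudi nodes, one option is a nodeSelector on a node label; the fragment below is illustrative only, the label name is hypothetical (it is not necessarily what the upstream manifest uses) and would have to be applied to Gaudi nodes first, e.g. kubectl label node <node> habana.ai/gaudi=true:

# Illustrative DaemonSet pod-spec fragment, not the upstream manifest
spec:
  template:
    spec:
      nodeSelector:
        habana.ai/gaudi: "true"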

Collaborator

What will be the status of the habana-device-plugin pod on a node that does not have Gaudi?

Author

@francares Does the daemonset make use of a selector to ensure that it only runs on nodes with Gaudi?


We agreed to handle this in a separate PR. The expected behavior is to not have any failed pod.

ansible.builtin.include_tasks: include_local_repo_config.yml

# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status
Collaborator

Check the xCAT installation status for Gaudi; please use a specific task name.


we can keep this as a TODO, this will be addressed by another PR.

---

- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml
Collaborator

Is this separate playbook required just for an lspci command?

@ghandoura ghandoura Aug 1, 2024

We agreed to keep it separate for the moment

args:
  executable: /bin/bash

- name: Add the habanalabs kernel module
Collaborator

All three can be clubbed into one task with a list variable.
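For example, a single looping task could replace the three modprobe tasks; only habanalabs is taken from the diff, the list variable name and any additional module names are illustrative:

- name: Add the Habana kernel modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop: "{{ intel_habana_kernel_modules }}"  # e.g. ["habanalabs", ...] defined in vars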

Deploy pytorch image for Gaudi device.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Change containerd runtime in prepare instead of start.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Check whether any remote node has a Gaudi device, and set that fact on
localhost, using tasks/facts delegation; this fact can then be accessed
globally using hostvars.

After running the hl-smi check, add a new task that delegates to localhost,
which will store the is_gaudi_cluster fact if a Gaudi device is detected on
any of the kube_node hosts.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
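A minimal sketch of the delegation described in this commit; gaudi_check is assumed to be the registered result of the hl-smi/lspci detection task, and the delegated host may be named 127.0.0.1 rather than localhost in the inventory:

- name: Record that the cluster has at least one Gaudi node
  ansible.builtin.set_fact:
    is_gaudi_cluster: true
  delegate_to: localhost
  delegate_facts: true
  when: gaudi_check.rc | default(1) == 0

# Accessible from any play afterwards:
#   hostvars['localhost']['is_gaudi_cluster'] | default(false)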
* Revert documentation changes.
* Add all packages to intelgaudi.json.
* Remove C code and related steps.
* Refactor repeated steps into looping.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do changes in a Habana specific role instead of doing them
in the general k8s prepare services role.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Fix repo for Gaudi dependencies and add device plugin image
in local repo.

Signed-off-by: Adam Ghandoura <adam.ghandoura@intel.com>
Remove .github/workflow/security-scanning.yaml;
upstream will provide this.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
@dweineha
Author

dweineha commented Aug 2, 2024

This is superseded by PR version #3; closing this. Thanks for all feedback!

@dweineha dweineha closed this Aug 2, 2024