
[WIP] Initial support for Intel Gaudi accelerators (PR version #2)#2300

Closed
dweineha wants to merge 16 commits into dell:main from dweineha:release_1.7_fixes2

Conversation

@dweineha

This PR adds initial support for Intel Gaudi accelerators; it rectifies some issues identified in the previous PR.

NOTE: This is for feedback and review only.

yhluo946 and others added 8 commits July 18, 2024 14:45
Add OS prerequisite for Gaudi drivers.
Only Ubuntu is supported.

---------
Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add Preliminary Intel Gaudi support to omnia accelerator playbooks

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Fix Intel Gaudi scripts to make ansible-lint happy.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Use true instead of yes, and add surrounding spaces for jinja2
expressions.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Require a newer version of mpi-operator.

Signed-off-by: Fengfeng Tao <fengfeng.tao@intel.com>
Intel Gaudi uses a custom runtime. Configure containerd to use
it on nodes that have Intel Gaudi accelerators installed.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
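For reference, a minimal sketch of what such a containerd change could look like as Ansible tasks; the TOML section names and the habana-container-runtime binary path follow Habana's public containerd instructions and are assumptions here, not necessarily the exact tasks in this PR:

- name: Register the habana runtime in containerd (sketch; Gaudi nodes only)
  ansible.builtin.blockinfile:
    path: /etc/containerd/config.toml
    marker: "# {mark} ANSIBLE MANAGED BLOCK - habana runtime"
    block: |
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
        runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
          # Binary path assumed from the habanalabs-container-runtime package
          BinaryName = "/usr/bin/habana-container-runtime"

- name: Restart containerd to pick up the runtime change
  ansible.builtin.service:
    name: containerd
    state: restarted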
Add the necessary configuration to deploy Kubernetes
with support for Intel Gaudi.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add postscripts for installing Intel Gaudi drivers after OS provisioning.
Rename "gaudi" to "intelgaudi" in software config per Dell's feedback.
accelerator_config is going to be deprecated,
change gaudi version input and check the same way AMD does.
Do not install base packages on non-Gaudi nodes.
Remove tuneD.
Remove unused variables.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
@@ -0,0 +1,21 @@
// Copyright 2024 Intel Corporation.
Collaborator

C code is legally not allowed in the omnia repo.


This file is used to verify whether the TPC compiler works, just like test_ROCm_code.cpp.
Since omnia already contains test_ROCm_code.cpp, does that mean omnia allows C++ code but not C code?

Collaborator

We need to remove test_ROCm_code.cpp also


@yupengzh-intel, just remove our c code. @priti-parate should take care of the other ones that are not part of this PR.


let's remove it and remove the test too

- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml

- name: Include accelerator_config
Collaborator

not required as we need to read values from software_config.json


Ok, will remove that.

Comment thread accelerator/roles/intel/vars/main.yml Outdated
intel_gaudi_device_pattern: "Processing accelerators: Habana Labs Ltd."

intel_habana_packages:
- habanalabs-container-runtime
Collaborator

container-runtime should be part of scheduler.yml. Just after the Kubernetes installation we can call the k8s_habana_container_runtime role.


Will move to scheduler


intel_habana_packages:
- habanalabs-container-runtime
- habanalabs-dkms
Collaborator

All other packages need to be read from gaudi.json, meaning the playbook should install whatever is present in gaudi.json; that way, if a package changes later, the change is applied automatically just by updating the gaudi.json file.


I see that AMD defines the package names in the role too, but I think it is a good idea to define the package list only in the JSON, to achieve centralized configuration. The provisioning postscripts, however, still need to manage their own copy.


We agree to merge; this will be handled in another PR. @yupengzh-intel, please add a comment: TODO: move to a central config file.
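For illustration, a centralized entry in intelgaudi.json might look like the sketch below; the outer structure and the extra package entries are assumptions, only the habanalabs-dkms line mirrors the format shown later in this PR:

{
  "intelgaudi": {
    "cluster": [
      { "package": "habanalabs-dkms={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" },
      { "package": "habanalabs-container-runtime={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" },
      { "package": "habanatools={{ intelgaudi_version }}", "type": "deb", "repo_name": "gaudi" }
    ]
  }
}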

- habanatools

intel_apt_base_packages:
- cmake
Collaborator

All dependency packages also need to be added to gaudi.json.


Ok, will change


Same as above; it needs to be centralized.

Do CodeQL scans on push and pull requests.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
"package": "habanalabs-dkms={{ intelgaudi_version }}",
"type": "deb",
"repo_name": "gaudi"
},
Collaborator

Different subgroups for Gaudi drivers and the Habana stack.

Collaborator

So Gaudi.json should have subgroup details as below (screenshot omitted).


will be taken care of in a subsequent PR

{"name": "pytorch"},
{"name": "tensorflow"}
{"name": "tensorflow"},
{"name": "intelgaudi", "version": "1.16.2-2"}
Collaborator

Add a subgroup for habana, similar to amdgpu and rocm. Only if gaudi is present in softwares should the habana stack be set up.

Collaborator

@priti-parate priti-parate Jul 31, 2024

We expect changes something like the below in software_config.json, so the group name can be "intelgaudi" and the subgroup will be habana (screenshot omitted).
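Since the screenshot is not reproduced here, an illustrative sketch of the requested software_config.json layout, assuming it follows the amdgpu/rocm pattern (the exact keys and version are assumptions):

"softwares": [
    {"name": "intelgaudi", "version": "1.16.2-2"}
],
"intelgaudi": [
    {"name": "habana"}
]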


will be taken care of in a subsequent PR


# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status
  ansible.builtin.include_tasks: validate_amd.yml
Collaborator

Please do not remove this; instead, add a new task.


same as above


# Usage: configure_gaudi.yml
gaudi_postscripts_path:
- { src: "{{ role_path }}/templates/omnia_gaudi.j2", dest: "/install/postscripts/omnia_gaudi", mode: "755" }
Collaborator

@Aditya-DP have you also implemented it this way?

Collaborator

@priti-parate we have made this change with respect to the gaudi postscript.


same as the previous comment

@@ -0,0 +1,70 @@
#!/bin/bash
Collaborator

@Aditya-DP @abhishek-sa1 does this align with our implementation?

Collaborator

@priti-parate This has already been implemented by our team and needs to be reverted.


@priti-parate, this was requested by you in the past. @abhishek-sa1, can you provide the link to the PR your team is working on? I don't see any PR or branch regarding this. We shouldn't have multiple work streams. Let's sync tomorrow in our Omnia meeting to close on this.


The script is intended as a reference and can be overwritten by a subsequent PR

@@ -0,0 +1,70 @@
#!/bin/bash
################################################################################################################
# omnia_rocm:
Collaborator

change rocm to habana

@@ -0,0 +1,42 @@
name: Security Scanning
Collaborator

Please revert; the omnia team will be taking care of this.

Comment thread accelerator/README.rst Outdated
============

The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform or the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_. These tools allow users to unlock the potential of installed GPUs.
The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform, the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_ or the `Intel Gaudi <https://docs.habana.ai/en/latest/index.html>`_ platform. These tools allow users to unlock the potential of installed GPUs.
Collaborator

The doc will be updated by the omnia doc team; please remove this.


Shouldn't we keep this atomic? Why not have all the core modifications regarding Gaudi3 SW support in one place? That way we avoid issues related to features that are not fully complete. We don't want to miss the documentation. Can I suggest pushing this as part of this PR, and then if other changes are needed doing them in a new PR?


Agreed to revert this one

update_cache: true

# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages
Collaborator

Installing tuned packages will be part of a different utility; the omnia team is writing a playbook for that.


So, just to be clear, setting the number of hugepages here is required by Gaudi for running certain workloads. This value is not being set by any default tuned profile. I don't think this should be removed from the Gaudi provisioning task.
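For context, the task under discussion is roughly the sketch below; the hugepage count is left as a variable because the actual number is the one supplied by the Habana team, not something assumed here:

- name: Set the number of hugepages
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "{{ gaudi_hugepages_count }}"  # placeholder variable for the Habana-recommended value
    state: present
    sysctl_set: true
    reload: true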

Collaborator

We will be creating a separate utility to set such specific parameters for any GPU, and that will be called in Omnia.yml; hence we can remove this from here.


ok we will remove this

Comment thread utils/check_intel_gaudi_device.yml Outdated
# limitations under the License.
---

- name: Check Intel Gaudi HPU
Collaborator

Can we move this complete file to scheduler/roles/k8s_habana_container_runtime/tasks/ and update the filename to check_prerequisite.yml


I will move it to the new role.


# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages
  ansible.posix.sysctl:
Collaborator

Setting tuned parameters will also be part of the new playbook.

@ghandoura ghandoura Aug 1, 2024

ok we will remove this. Will be addressed in a future PR

Comment thread scheduler/scheduler.yml Outdated
- k8s_prepare_services

- name: Check for Intel Gaudi accelerator
  ansible.builtin.import_playbook: ../utils/check_intel_gaudi_device.yml
Collaborator

Create a role named scheduler/roles/k8s_habana_container_runtime and place this task there with the file name check_pre_requisite.yml.

- hostvars['127.0.0.1']['k8s_support']
- "'kube_control_plane' in group_names"
block:
- name: Change containerd runtime
Collaborator

Place this task in scheduler/roles/k8s_habana_container_runtime/tasks/main.yml


@yhluo946 can you please make this change? Feel free to reach out to @Katakam-Rakesh if you need help.

# limitations under the License.
---

- name: Uncomment habana-container-runtime config mount_accelerators line
Collaborator

Please move this file to scheduler/roles/k8s_habana_container_runtime/tasks/change_containerd_runtime.yml.
k8s_habana_container_runtime/tasks/main.yml should call check_pre_requisite.yml and change_containerd_runtime.yml
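A minimal sketch of the requested role entry point, assuming the two task files named above:

# scheduler/roles/k8s_habana_container_runtime/tasks/main.yml (sketch)
---
- name: Check prerequisites for the habana container runtime
  ansible.builtin.include_tasks: check_pre_requisite.yml

- name: Switch containerd to the habana runtime
  ansible.builtin.include_tasks: change_containerd_runtime.yml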


@yhluo946 can you please move this to a separate role and move the container runtime package installation here

- name: Check node accelerator status
  ansible.builtin.shell: |
    set -o pipefail
    lspci | grep -i "{{ intel_gaudi_device_pattern }}"
Collaborator

We don't need a separate playbook just for this command.


We agreed to keep this for the moment

- "'habanalabs-device-plugin-daemonset' not in k8s_pods.stdout"
- k8s_version >= minimal_gaudi_k8s_version
- accelerator_type is defined
- accelerator_type == "habana"
Collaborator

The accelerator_type variable is available to the kube_node group only, as it was set in utils/check_intel_gaudi_device.yml. In k8s_start_services/tasks/deploy_k8s_services.yaml the accelerator_type variable will always be in the VARIABLE NOT DEFINED state, because that role is called on the kube_control_plane group. Please fix this scenario.


# HABANA PLUGIN
- name: Deploy Habana Device plugin
  ansible.builtin.command: "kubectl create -f '{{ habana_device_plugin_yaml_url }}'"
Collaborator

The habana device plugin should be deployed only on nodes that have Gaudi, not on the other nodes.

E.g., say the k8s cluster has 4 nodes, two of which have Gaudi. In this case the plugin should be scheduled on the 2 Gaudi nodes but not on the other 2.


The device plugin is a daemonset. More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
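If the DaemonSet ever needs to be restricted to Gaudi nodes, one option is a nodeSelector on a node label; the fragment below is illustrative only, the label name is hypothetical (it is not necessarily what the upstream manifest uses) and would have to be applied to Gaudi nodes first, e.g. kubectl label node <node> habana.ai/gaudi=true:

# Illustrative DaemonSet pod-spec fragment, not the upstream manifest
spec:
  template:
    spec:
      nodeSelector:
        habana.ai/gaudi: "true"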

Collaborator

What will be the status of the habana-device-plugin pod on a node that does not have Gaudi?

Author

@francares Does the daemonset make use of a selector to ensure that it only runs on nodes with Gaudi?


We agreed to handle this in a separate PR. The expected behavior is to not have any failed pod.

ansible.builtin.include_tasks: include_local_repo_config.yml

# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status
Collaborator

Check the xCAT installation status for Gaudi; please use a specific task name.


we can keep this as a TODO, this will be addressed by another PR.

---

- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml
Collaborator

Is this separate playbook required just for an lspci command?

@ghandoura ghandoura Aug 1, 2024

We agreed to keep it separate for the moment

args:
  executable: /bin/bash

- name: Add the habanalabs kernel module
Collaborator

All three can be clubbed into one task with a list variable.
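For example, a single looping task could replace the three modprobe tasks; only habanalabs is taken from the diff, the list variable name and any additional module names are illustrative:

- name: Add the Habana kernel modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop: "{{ intel_habana_kernel_modules }}"  # e.g. ["habanalabs", ...] defined in vars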

Deploy pytorch image for Gaudi device.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Change containerd runtime in prepare instead of start.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Check whether any remote node has a Gaudi device, and set that fact on
localhost, using tasks/facts delegation; this fact can then be accessed
globally using hostvars.

After running the hl-smi check, add a new task that delegates to localhost,
which will store the is_gaudi_cluster fact if a Gaudi device is detected on
any of the kube_node hosts.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
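A minimal sketch of the delegation described in this commit; gaudi_check is assumed to be the registered result of the hl-smi/lspci detection task, and the delegated host may be named 127.0.0.1 rather than localhost in the inventory:

- name: Record that the cluster has at least one Gaudi node
  ansible.builtin.set_fact:
    is_gaudi_cluster: true
  delegate_to: localhost
  delegate_facts: true
  when: gaudi_check.rc | default(1) == 0

# Accessible from any play afterwards:
#   hostvars['localhost']['is_gaudi_cluster'] | default(false)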
* Revert documentation changes.
* Add all packages to intelgaudi.json.
* Remove C code and related steps.
* Refactor repeated steps into looping.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do changes in a Habana specific role instead of doing them
in the general k8s prepare services role.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Fix repo for Gaudi dependencies and add device plugin image
in local repo.

Signed-off-by: Adam Ghandoura <adam.ghandoura@intel.com>
Remove .github/workflow/security-scanning.yaml;
upstream will provide this.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
@dweineha
Author

dweineha commented Aug 2, 2024

This is superseded by PR version #3; closing this. Thanks for all feedback!

@dweineha dweineha closed this Aug 2, 2024