[WIP] Initial support for Intel Gaudi accelerators (PR version #2) #2300
dweineha wants to merge 16 commits into
Conversation
Add OS prerequisite for Gaudi drivers. Only Ubuntu is supported. --------- Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add Preliminary Intel Gaudi support to omnia accelerator playbooks Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Fix Intel Gaudi scripts to make ansible-lint happy. Signed-off-by: David Weinehall <david.weinehall@intel.com>
Use true instead of yes, and add surrounding spaces for jinja2 expressions. Signed-off-by: David Weinehall <david.weinehall@intel.com>
Require a newer version of mpi-operator. Signed-off-by: Fengfeng Tao <fengfeng.tao@intel.com>
Intel Gaudi uses a custom runtime. Configure containerd to use it on nodes that have Intel Gaudi accelerators installed. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add the necessary configuration to deploy Kubernetes with support for Intel Gaudi. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add postscripts for installing Intel Gaudi drivers after OS provisioning. Rename "gaudi" to "intelgaudi" in the software config per Dell's feedback. accelerator_config is going to be deprecated; take the gaudi version input and check it the same way AMD does. Do not install base packages on non-Gaudi nodes. Remove TuneD. Remove unused variables. Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
@@ -0,0 +1,21 @@
// Copyright 2024 Intel Corporation.
C code is not legally allowed within the omnia repo.
This file is used to verify whether the TPC compiler works, just like test_ROCm_code.cpp. Since omnia already contains test_ROCm_code.cpp, does that mean omnia only allows C++ code but not C code?
We need to remove test_ROCm_code.cpp as well.
@yupengzh-intel, just remove our C code. @priti-parate should take care of the other ones that are not part of this PR.
Let's remove it and remove the test too.
- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml
- name: Include accelerator_config

Not required, as we need to read the values from software_config.json.
intel_gaudi_device_pattern: "Processing accelerators: Habana Labs Ltd."

intel_habana_packages:
  - habanalabs-container-runtime

container-runtime should be part of scheduler.yml; just after the Kubernetes installation we can call the k8s_habana_container_runtime role.
intel_habana_packages:
  - habanalabs-container-runtime
  - habanalabs-dkms

All other packages need to be read from gaudi.json, meaning the playbook should install whatever is present in gaudi.json, so that if a package changes later the change is applied automatically just by updating the gaudi.json file.
I see AMD defines the package names in the role too, but I think it is a good idea to define the package list only in the JSON file, to achieve centralized configuration. For the provisioning postscripts, however, we still need to manage their own copy.
We agree to merge, and this will be handled in another PR. @yupengzh-intel, please add a comment: TODO: move to a central config file.
  - habanatools
intel_apt_base_packages:
  - cmake

All dependency packages also need to be added to gaudi.json.
Same as above; it needs to be centralized.
Do CodeQL scans on push and pull requests. Signed-off-by: David Weinehall <david.weinehall@intel.com>
"package": "habanalabs-dkms={{ intelgaudi_version }}",
"type": "deb",
"repo_name": "gaudi"
},

Use different subgroups for the Gaudi drivers and the Habana stack.
This will be taken care of in a subsequent PR.
{"name": "pytorch"},
{"name": "tensorflow"}
{"name": "tensorflow"},
{"name": "intelgaudi", "version": "1.16.2-2"}

Add a subgroup for habana, similar to amdgpu and rocm. Only if gaudi is present in the softwares list should the Habana stack be set up.
This will be taken care of in a subsequent PR.
# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status
  ansible.builtin.include_tasks: validate_amd.yml

Please do not remove this; instead, add a new task.
# Usage: configure_gaudi.yml
gaudi_postscripts_path:
  - { src: "{{ role_path }}/templates/omnia_gaudi.j2", dest: "/install/postscripts/omnia_gaudi", mode: "755" }

@priti-parate, we have made this change with respect to the gaudi postscript.
@@ -0,0 +1,70 @@
#!/bin/bash

@Aditya-DP @abhishek-sa1, is this aligned with our implementation?
@priti-parate This has already been implemented by our team and needs to be reverted.
@priti-parate, this was requested by you in the past. @abhishek-sa1, can you provide the link to the PR your team is working on? I don't see any PR or branch regarding this. We shouldn't have multiple work streams. Let's sync tomorrow in our Omnia meeting to close on this.
The script is intended as a reference and can be overwritten by a subsequent PR.
@@ -0,0 +1,70 @@
#!/bin/bash
################################################################################################################
# omnia_rocm:

Change rocm to habana.
@@ -0,0 +1,42 @@
name: Security Scanning

Please revert; the omnia team will be taking care of this.
============

The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform or the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_. These tools allow users to unlock the potential of installed GPUs.
The accelerator role allows users to set up the `AMD ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_ platform, the `CUDA Nvidia toolkit <https://developer.nvidia.com/cuda-zone>`_ or the `Intel Gaudi <https://docs.habana.ai/en/latest/index.html>`_ platform. These tools allow users to unlock the potential of installed GPUs.

The documentation will be updated by the omnia doc team; we request that this change be removed.
Shouldn't we keep atomicity here? Why not have all the core modifications for Gaudi3 SW support in one place? That way we avoid issues related to features that are not fully complete, and we do not want to miss the documentation. Can I suggest pushing this as part of this PR, and then doing any further changes in a new PR?
update_cache: true

# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages

Installing tuned packages will be part of a different utility; the omnia team is writing a playbook for the same.
So, just to be clear: setting the number of hugepages here is required by Gaudi for running certain workloads. This value is not set by any default tuned profile. I don't think this should be removed from the Gaudi provisioning task.
We will be creating a separate utility to set such GPU-specific parameters, and that will be called in omnia.yml; hence we can remove it from here.
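For reference, the hugepages task under discussion can be sketched as a single sysctl call. This is a sketch only; the value shown is a placeholder, since the real number is the one defined by the Habana team for IDC clusters:

```yaml
# Hypothetical sketch of the task being discussed. The value below is a
# placeholder, not the actual IDC-cluster number.
- name: Set the number of hugepages
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "150000"  # placeholder; use the Habana-recommended value
    state: present
    sysctl_set: true
    reload: true
```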
# limitations under the License.
---

- name: Check Intel Gaudi HPU

Can we move this complete file to scheduler/roles/k8s_habana_container_runtime/tasks/ and rename it to check_prerequisite.yml?
# This number was defined by Habana team and is the number used in IDC clusters
- name: Set the number of hugepages
  ansible.posix.sysctl:

Setting tuned parameters will also be part of the new playbook.
OK, we will remove this. It will be addressed in a future PR.
- k8s_prepare_services

- name: Check for Intel Gaudi accelerator
  ansible.builtin.import_playbook: ../utils/check_intel_gaudi_device.yml

Create a role named scheduler/roles/k8s_habana_container_runtime and place this task there with the file name check_pre_requisite.yml.
- hostvars['127.0.0.1']['k8s_support']
- "'kube_control_plane' in group_names"
block:
  - name: Change containerd runtime

Place this task in scheduler/roles/k8s_habana_container_runtime/tasks/main.yml.
@yhluo946, can you please make this change? Feel free to reach out to @Katakam-Rakesh if you need help.
# limitations under the License.
---

- name: Uncomment habana-container-runtime config mount_accelerators line

Please move this file to scheduler/roles/k8s_habana_container_runtime/tasks/change_containerd_runtime.yml. k8s_habana_container_runtime/tasks/main.yml should call check_pre_requisite.yml and change_containerd_runtime.yml.
@yhluo946, can you please move this to a separate role and move the container runtime package installation here?
- name: Check node accelerator status
  ansible.builtin.shell: |
    set -o pipefail
    lspci | grep -i "{{ intel_gaudi_device_pattern }}"

We don't need a separate playbook just for this command.
We agreed to keep this for the moment.
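The detection logic behind the task above reduces to a single grep over lspci output. A minimal sketch, using an illustrative sample line in place of real `lspci` output (the sample PCI address and device ID are assumptions):

```shell
# Hypothetical sketch of the Gaudi detection check. On a real node the
# sample line would come from `lspci` itself.
pattern="Processing accelerators: Habana Labs Ltd."
sample="3d:00.0 Processing accelerators: Habana Labs Ltd. Device 1020 (rev 01)"

# grep exits 0 when the pattern matches, which is what the Ansible task keys on.
if printf '%s\n' "$sample" | grep -qi "$pattern"; then
  echo "gaudi-present"
else
  echo "gaudi-absent"
fi
```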
- "'habanalabs-device-plugin-daemonset' not in k8s_pods.stdout"
- k8s_version >= minimal_gaudi_k8s_version
- accelerator_type is defined
- accelerator_type == "habana"

The accelerator_type variable is available to the kube_node group only, as it was set in utils/check_intel_gaudi_device.yml. In k8s_start_services/tasks/deploy_k8s_services.yaml the accelerator_type variable will always be in the VARIABLE NOT DEFINED state, as this role is called on the kube_control_plane group. Please fix this scenario.
# HABANA PLUGIN
- name: Deploy Habana Device plugin
  ansible.builtin.command: "kubectl create -f '{{ habana_device_plugin_yaml_url }}'"

The habana device plugin should be deployed only on nodes that have Gaudi, not on the other nodes. For example, say a k8s cluster has 4 nodes, of which two have Gaudi. In that case the plugin should be scheduled on the 2 nodes with Gaudi, but not on the other 2.
The device plugin is a daemonset. More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
What will the status of the habana-device-plugin pod be on a node that does not have Gaudi?
@francares Does the daemonset make use of a selector to ensure that it only runs on nodes with Gaudi?
We agreed to handle this in a separate PR. The expected behavior is to not have any failed pods.
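For illustration, the selector idea raised in this thread could look roughly like the fragment below. This is a sketch, not the plugin's actual manifest: the node label key/value and the image reference are assumptions.

```yaml
# Hypothetical DaemonSet fragment: the nodeSelector restricts scheduling to
# nodes carrying an assumed "gaudi: present" label, so non-Gaudi nodes never
# run the plugin pod and no pods are left in a failed state.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: habanalabs-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: habanalabs-device-plugin
  template:
    metadata:
      labels:
        name: habanalabs-device-plugin
    spec:
      nodeSelector:
        gaudi: "present"  # assumed label applied to Gaudi nodes
      containers:
        - name: habanalabs-device-plugin
          image: habana-device-plugin:example  # placeholder image reference
```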
ansible.builtin.include_tasks: include_local_repo_config.yml

# TODO: need a way to differentiate platforms, then run different validations
- name: Check xcat installation status

Check the xCAT installation status for Gaudi; please use a specific task name.
We can keep this as a TODO; it will be addressed in another PR.
---

- name: Check if accelerator is present on node
  ansible.builtin.include_tasks: verify_has_accelerators.yml

Is a separate playbook required just for an lspci command?
We agreed to keep it separate for the moment.
args:
  executable: /bin/bash

- name: Add the habanalabs kernel module

All three tasks can be clubbed into one task with a list variable.
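The suggestion above could be sketched as a single looping task; the module names here are assumptions based on the habanalabs driver stack, not taken from the diff:

```yaml
# Hypothetical consolidation: one looping task instead of three
# near-identical ones. Module names are illustrative.
- name: Add the habanalabs kernel modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - habanalabs
    - habanalabs_cn
    - habanalabs_en
```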
Deploy pytorch image for Gaudi device. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Change containerd runtime in prepare instead of start. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Check whether any remote node has a Gaudi device, and set that fact on localhost using tasks/facts delegation; this fact can then be accessed globally using hostvars. After running the hl-smi check, add a new task that delegates to localhost and stores the is_gaudi_cluster fact if a Gaudi device is detected on any kube_node. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
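The delegation described in this commit message might look roughly like the following sketch (the `hl_smi_check` register name is an assumption for illustration):

```yaml
# Hypothetical sketch: after the hl-smi check runs on each kube_node, store
# an is_gaudi_cluster fact on localhost so any later play can read it via
# hostvars['127.0.0.1']['is_gaudi_cluster'].
- name: Record that the cluster has a Gaudi device
  ansible.builtin.set_fact:
    is_gaudi_cluster: true
  delegate_to: localhost
  delegate_facts: true
  when: hl_smi_check.rc == 0  # hl_smi_check is an assumed register name
```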
* Revert documentation changes.
* Add all packages to intelgaudi.json.
* Remove C code and related steps.
* Refactor repeated steps into looping.
Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do changes in a Habana specific role instead of doing them in the general k8s prepare services role. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Fix repo for Gaudi dependencies and add device plugin image in local repo. Signed-off-by: Adam Ghandoura <adam.ghandoura@intel.com>
Remove .github/workflow/security-scanning.yaml; upstream will provide this. Signed-off-by: David Weinehall <david.weinehall@intel.com>
This is superseded by PR version #3; closing this. Thanks for all feedback!


This PR adds initial support for Intel Gaudi accelerators and rectifies some issues identified in the previous PR.
NOTE: This is for feedback and review only.