Initial support for Intel Gaudi accelerators (PR version #3) #2301
Add OS prerequisite for Gaudi drivers. Only Ubuntu is supported. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add Preliminary Intel Gaudi support to omnia accelerator playbooks Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Fix Intel Gaudi scripts to make ansible-lint happy. Signed-off-by: David Weinehall <david.weinehall@intel.com>
Use true instead of yes, and add surrounding spaces for jinja2 expressions. Signed-off-by: David Weinehall <david.weinehall@intel.com>
Require a newer version of mpi-operator. Signed-off-by: Fengfeng Tao <fengfeng.tao@intel.com>
Intel Gaudi uses a custom runtime. Configure containerd to use it on nodes that have Intel Gaudi accelerators installed. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add the necessary configuration to deploy Kubernetes with support for Intel Gaudi. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add postscripts for installing Intel Gaudi drivers after OS provisioning. Rename "gaudi" to "intelgaudi" in software config per Dell's feedback. accelerator_config is going to be deprecated, change gaudi version input and check the same way AMD does. Do not install base packages on non-Gaudi nodes. Remove tuneD. Remove unused variables. Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do CodeQL scans on push and pull requests. Signed-off-by: David Weinehall <david.weinehall@intel.com>
Deploy pytorch image for Gaudi device. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Change containerd runtime in prepare instead of start. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Check whether any remote node has a Gaudi device, and set that fact on localhost using task/fact delegation; this fact can then be accessed globally using hostvars. After running the hl-smi check, add a new task that delegates to localhost and stores the is_gaudi_cluster fact if a Gaudi device is detected on any node in kube_node. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
* Revert documentation changes. * Add all packages to intelgaudi.json. * Remove C code and related steps. * Refactor repeated steps into looping. Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do changes in a Habana specific role instead of doing them in the general k8s prepare services role. Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Fix repo for Gaudi dependencies and add device plugin image in local repo. Signed-off-by: Adam Ghandoura <adam.ghandoura@intel.com>
Remove .github/workflow/security-scanning.yaml; upstream will provide this. Signed-off-by: David Weinehall <david.weinehall@intel.com>
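The fact delegation described in the is_gaudi_cluster commit above could look roughly like the following. This is a sketch under stated assumptions, not the PR's actual tasks: the task names and the `hl_smi_result` register variable are made up for illustration, while `hl-smi`, `is_gaudi_cluster`, and the `hostvars` lookup come from the commit message.

```yaml
# Sketch only: task names and hl_smi_result are assumptions, not taken from the PR.
- name: Check for Gaudi devices with hl-smi
  ansible.builtin.command: hl-smi
  register: hl_smi_result
  changed_when: false
  failed_when: false   # nodes without Gaudi hardware or tooling must not fail the play

- name: Record on localhost that at least one node has a Gaudi device
  ansible.builtin.set_fact:
    is_gaudi_cluster: true
  delegate_to: localhost
  delegate_facts: true   # store the fact on localhost rather than on the current node
  when: hl_smi_result.rc | default(1) == 0

# Any later play can then read the fact through hostvars.
- name: Example consumer of the delegated fact
  ansible.builtin.debug:
    msg: "Gaudi cluster detected"
  when: hostvars['localhost']['is_gaudi_cluster'] | default(false)
```

The `default(false)` on the consumer side covers clusters where no node set the fact at all.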
The following Todos will be handled in different PRs:
Revert incorrect change. Signed-off-by: David Weinehall <david.weinehall@intel.com>
- name: Install prerequisite
  ansible.builtin.include_tasks: install_prerequisite_ubuntu.yml
  when:
The `when` condition is already checked here, so the `when` condition inside install_prerequisite_ubuntu.yml is not required.
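A minimal sketch of the pattern this comment points at, for reference. The actual condition is cut off in the hunk above, so the `ansible_distribution` check below is only a stand-in, and the included file's content is hypothetical.

```yaml
# Caller: the condition gates the whole include.
# NOTE: the real condition is not visible in the diff; this one is an assumption.
- name: Install prerequisite
  ansible.builtin.include_tasks: install_prerequisite_ubuntu.yml
  when: ansible_distribution == 'Ubuntu'

# install_prerequisite_ubuntu.yml (hypothetical content): the file is only
# loaded when the caller's condition holds, so its tasks do not need to
# repeat the same check.
- name: Update apt cache
  ansible.builtin.apt:
    update_cache: true
```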
- name: Uncomment habana-container-runtime config mount_accelerators line
  lineinfile:
Please update all modules to use their fully qualified `ansible.builtin` names.
Ex: instead of `lineinfile`, use `ansible.builtin.lineinfile`. Do the same for the other tasks in this file.
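For illustration, the task above with the fully qualified module name might look like this. Only the task name and the module come from the diff; the `regexp`, `line`, and `backrefs` values are assumptions about how the line is uncommented, and the path variable is the one quoted later in this review.

```yaml
# Sketch: the regexp/line/backrefs arguments are assumed, not copied from the PR.
- name: Uncomment habana-container-runtime config mount_accelerators line
  ansible.builtin.lineinfile:
    path: "{{ habana_container_runtime_cfg_file_path }}"
    regexp: '^\s*#\s*(mount_accelerators\s*=.*)$'
    line: '\1'        # keep the setting itself, drop the leading comment marker
    backrefs: true    # leaves the file untouched if no commented line matches
```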
k8s_version: "{{ hostvars['localhost']['k8s_version'] }}"
containerd_cfg_file_path: "/etc/containerd/config.toml"
These variables are used in the k8s_habana_container_runtime role, so please move them to k8s_habana_container_runtime/vars/main.yml.
k8s_version: "{{ hostvars['localhost']['k8s_version'] }}"
habana_container_runtime_cfg_file_path: "/etc/habana-container-runtime/config.toml"
This variable is not used in this role. Please move it to k8s_habana_container_runtime/vars/main.yml.
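As a rough sketch, the vars file requested in the last two comments could end up like this. The two path values are copied from the hunks quoted above; whether `k8s_version` should also move, and how the final file is laid out, are not visible here and are left as assumptions.

```yaml
---
# k8s_habana_container_runtime/vars/main.yml (sketch, layout assumed)
containerd_cfg_file_path: "/etc/containerd/config.toml"
habana_container_runtime_cfg_file_path: "/etc/habana-container-runtime/config.toml"
```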
This PR adds initial support for Intel Gaudi accelerators; this rectifies issues found in PR version #2.