
Initial support for Intel Gaudi accelerators (PR version #3) #2301

Merged
sujit-jadhav merged 19 commits into dell:release_1.7 from dweineha:release_1.7_fixes2 on Aug 5, 2024

Conversation

@dweineha dweineha commented Aug 2, 2024

This PR adds initial support for Intel Gaudi accelerators. It rectifies the issues found in PR version #2.

yhluo946 and others added 16 commits July 18, 2024 14:45
Add OS prerequisite for Gaudi drivers.
Only Ubuntu is supported.

---------
Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add preliminary Intel Gaudi support to Omnia accelerator playbooks.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Fix Intel Gaudi scripts to make ansible-lint happy.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Use true instead of yes, and add surrounding spaces for jinja2
expressions.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Require a newer version of mpi-operator.

Signed-off-by: Fengfeng Tao <fengfeng.tao@intel.com>
Intel Gaudi uses a custom runtime. Configure containerd to use
it on nodes that have Intel Gaudi accelerators installed.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add the necessary configuration to deploy Kubernetes
with support for Intel Gaudi.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Add postscripts for installing Intel Gaudi drivers after OS provisioning.
Rename "gaudi" to "intelgaudi" in software config per Dell's feedback.
accelerator_config is going to be deprecated;
change the Gaudi version input and check to work the same way AMD's does.
Do not install base packages on non-Gaudi nodes.
Remove tuneD.
Remove unused variables.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do CodeQL scans on push and pull requests.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Deploy pytorch image for Gaudi device.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Change containerd runtime in prepare instead of start.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Check whether any remote node has a Gaudi device, and set that fact on
localhost, using tasks/facts delegation; this fact can then be accessed
globally using hostvars.

After running the hl-smi check, add a new task that delegates to localhost
and stores the is_gaudi_cluster fact if a Gaudi device is detected on
any kube_node host.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
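
The delegated-fact pattern this commit describes could look roughly like the sketch below. Only `hl-smi`, `is_gaudi_cluster`, and `kube_node` come from the commit message; the play layout, the task names, the `hl_smi_check` register variable, and the choice to treat a failing `hl-smi` as "no Gaudi device" are assumptions for illustration, not the PR's actual tasks.

```yaml
# Sketch only: task names and hl_smi_check are hypothetical.
- name: Detect Gaudi devices on Kubernetes nodes
  hosts: kube_node
  gather_facts: false
  tasks:
    - name: Check for Gaudi devices with hl-smi
      ansible.builtin.command: hl-smi
      register: hl_smi_check
      changed_when: false
      # Assumption: a missing or failing hl-smi means the node has no Gaudi device.
      failed_when: false

    - name: Record on localhost that at least one node has a Gaudi device
      ansible.builtin.set_fact:
        is_gaudi_cluster: true
      delegate_to: localhost
      # Store the fact on the delegated host (localhost), not on the current node.
      delegate_facts: true
      when: hl_smi_check.rc | default(1) == 0
```

Later plays could then read the value with something like `hostvars['localhost']['is_gaudi_cluster'] | default(false)`, so the fact stays usable when no node reported a device.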
* Revert documentation changes.
* Add all packages to intelgaudi.json.
* Remove C code and related steps.
* Refactor repeated steps into looping.

Signed-off-by: Yupeng Zhang <yupeng.zhang@intel.com>
Do changes in a Habana-specific role instead of doing them
in the general k8s prepare services role.

Signed-off-by: Yuhao Luo <yuhao.luo@intel.com>
Fix repo for Gaudi dependencies and add device plugin image
in local repo.

Signed-off-by: Adam Ghandoura <adam.ghandoura@intel.com>
Remove .github/workflow/security-scanning.yaml;
upstream will provide this.

Signed-off-by: David Weinehall <david.weinehall@intel.com>

ghandoura commented Aug 2, 2024

The following TODOs will be handled in separate PRs:

  • Fix the issue when running the validation playbook by platform: accelerator/roles/accelerator_validation/tasks/main.yml

  • Use a centralized way to install Intel Gaudi dependencies

  • Use different subgroups for Gaudi drivers and Habana stack

  • The Omnia Gaudi provision (post)script serves only as a reference and will be overwritten by a subsequent PR

  • Need to fix: device plugin deployment on non-Gaudi nodes

  • Need to implement Gaudi qualification during omnia.yaml deployment

Revert incorrect change.

Signed-off-by: David Weinehall <david.weinehall@intel.com>

- name: Install prerequisite
  ansible.builtin.include_tasks: install_prerequisite_ubuntu.yml
  when:
Collaborator

@Katakam-Rakesh Katakam-Rakesh Aug 2, 2024

The when condition is already checked here, so the when condition inside install_prerequisite_ubuntu.yml is not required.

Done
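
The reviewer's point can be sketched as follows. With `ansible.builtin.include_tasks`, the `when` on the include is evaluated once and decides whether the file is loaded at all, so the same check inside the included file is redundant. The actual condition is cut off in the diff above; the distribution test shown here is a purely illustrative assumption.

```yaml
# Sketch only: the real `when:` expression is not visible in the diff;
# the Ubuntu check below is a hypothetical stand-in.
- name: Install prerequisite
  ansible.builtin.include_tasks: install_prerequisite_ubuntu.yml
  when: ansible_facts['distribution'] | lower == 'ubuntu'
```

If the condition is false, none of the tasks in `install_prerequisite_ubuntu.yml` run, and they need no `when` of their own for the same check.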

---

- name: Uncomment habana-container-runtime config mount_accelerators line
  lineinfile:
Collaborator

Please update all the modules to use their ansible.builtin fully qualified names.
Example: instead of lineinfile, use ansible.builtin.lineinfile. Do the same for the other tasks in this file.

Done
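
A version of the task using the fully qualified module name might look like the sketch below. The task name and the `habana_container_runtime_cfg_file_path` variable appear in this review; the `regexp`, the replacement `line`, and the use of `backrefs` are assumptions about the file's contents and should be checked against the real config file.

```yaml
# Sketch only: regexp and line are assumptions about the commented-out entry.
- name: Uncomment habana-container-runtime config mount_accelerators line
  ansible.builtin.lineinfile:
    path: "{{ habana_container_runtime_cfg_file_path }}"
    regexp: '^#\s*mount_accelerators\s*=\s*false'
    line: 'mount_accelerators = false'
    # With backrefs enabled, the file is left untouched when the regexp does not match,
    # so re-running the task after the line is uncommented changes nothing.
    backrefs: true
```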

---

k8s_version: "{{ hostvars['localhost']['k8s_version'] }}"
containerd_cfg_file_path: "/etc/containerd/config.toml"
Collaborator

These variables are used in the k8s_habana_container_runtime role, so please move them to k8s_habana_container_runtime/vars/main.yml.

Done

---

k8s_version: "{{ hostvars['localhost']['k8s_version'] }}"
habana_container_runtime_cfg_file_path: "/etc/habana-container-runtime/config.toml"
Collaborator

This variable is not used in this role. Please move it to k8s_habana_container_runtime/vars/main.yml.

Missed that one; done now.
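
After both requested moves, the role's vars file would plausibly contain at least the two path variables quoted in the threads above. This is a sketch assembled from those quoted lines; whatever else the file holds is not shown in the review.

```yaml
# k8s_habana_container_runtime/vars/main.yml
# Sketch only: built from the two variables quoted in the review threads.
containerd_cfg_file_path: "/etc/containerd/config.toml"
habana_container_runtime_cfg_file_path: "/etc/habana-container-runtime/config.toml"
```

Keeping the variables in the role that consumes them means the role does not depend on another role's vars being loaded first.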

@priti-parate priti-parate self-requested a review August 5, 2024 11:27
@sujit-jadhav sujit-jadhav merged commit 077a9a9 into dell:release_1.7 Aug 5, 2024

8 participants