From 0a6bcf2615277ead3d6ff9e05841e464db853bd3 Mon Sep 17 00:00:00 2001
From: csplinter
Date: Fri, 17 Oct 2025 10:21:58 -0500
Subject: [PATCH 1/4] update EKS GPU AMI docs

---
 latest/ug/ml/ml-eks-k8s-device-plugin.adoc    | 262 ++++++++++++++++++
 latest/ug/ml/ml-eks-optimized-ami.adoc        | 194 ++++++++-----
 latest/ug/nodes/eks-ami-build-scripts.adoc    |  10 +-
 latest/ug/nodes/eks-ami-deprecation-faqs.adoc |  99 ++-----
 .../nodes/eks-optimized-ami-bottlerocket.adoc |   9 +-
 latest/ug/nodes/eks-optimized-ami.adoc        |  25 +-
 6 files changed, 412 insertions(+), 187 deletions(-)
 create mode 100644 latest/ug/ml/ml-eks-k8s-device-plugin.adoc

diff --git a/latest/ug/ml/ml-eks-k8s-device-plugin.adoc b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc
new file mode 100644
index 000000000..562a6ad94
--- /dev/null
+++ b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc
@@ -0,0 +1,262 @@
+include::../attributes.txt[]
+
+[.topic]
+[#ml-eks-k8s-device-plugin]
+= Install Kubernetes device plugin for GPUs
+:info_titleabbrev: Install device plugin for GPUs
+
+Kubernetes https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/[device plugins] have been the primary mechanism for advertising specialized infrastructure, such as GPUs and network adapters, as consumable resources for Kubernetes workloads. While https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/[Dynamic Resource Allocation] (DRA) is positioned as the future of device management in Kubernetes, most specialized infrastructure providers are early in their support for DRA drivers. Kubernetes device plugins remain a widely available approach for using GPUs in Kubernetes clusters today.
+
+== Considerations
+
+* When using the EKS-optimized AL2023 AMIs with NVIDIA GPUs, you must install the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA Kubernetes device plugin]. 
You can install and manage the NVIDIA Kubernetes device plugin with Helm, your choice of Kubernetes tooling, or the NVIDIA GPU operator.
+* When using the EKS-optimized Bottlerocket AMIs with NVIDIA GPUs, you do not need to install the NVIDIA Kubernetes device plugin, as it is already included in the EKS-optimized Bottlerocket AMIs. This includes when you use GPU instances with EKS Auto Mode.
+* When using the EKS-optimized AL2023 or Bottlerocket AMIs with AWS Inferentia or Trainium chips, you must install the Neuron Kubernetes device plugin, and optionally install the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension]. For more information, see the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Neuron documentation for running on EKS].
+
+[#eks-nvidia-device-plugin]
+== Install NVIDIA Kubernetes device plugin
+
+The following procedure describes how to install the NVIDIA Kubernetes device plugin and run a sample test on NVIDIA GPU instances.
+
+=== Prerequisites
+
+* EKS cluster created
+* NVIDIA GPU nodes running in the cluster using the EKS-optimized AL2023 NVIDIA AMI
+* Helm installed in your command-line environment, see <>.
+
+=== Procedure
+
+. Add the `nvdp` Helm chart repository.
++
+[source,bash]
+----
+helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
+----
++
+. Update your local Helm repository to make sure that you have the most recent charts.
++
+[source,bash]
+----
+helm repo update
+----
++
+. Get the latest version of the NVIDIA Kubernetes device plugin.
++
+[source,bash]
+----
+helm search repo nvdp --devel
+----
++
+[source,bash]
+----
+NAME                         CHART VERSION  APP VERSION  DESCRIPTION
+nvdp/gpu-feature-discovery   0.17.4         0.17.4       ...
+nvdp/nvidia-device-plugin    0.17.4         0.17.4       ...
+----
++
+. 
Install the NVIDIA Kubernetes device plugin on your cluster, replacing `0.17.4` with the latest version from the command above.
++
+[source,bash,subs="verbatim,attributes,quotes"]
+----
+helm install nvdp nvdp/nvidia-device-plugin \
+  --namespace nvidia \
+  --create-namespace \
+  --version [.replaceable]`0.17.4` \
+  --set gfd.enabled=true
+----
++
+. Verify the NVIDIA Kubernetes device plugin is running in your cluster. The example output below is from a cluster with two nodes.
++
+[source,bash]
+----
+kubectl get ds -n nvidia nvdp-nvidia-device-plugin
+----
++
+[source,bash]
+----
+NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
+nvdp-nvidia-device-plugin   2         2         2       2            2                           11m
+----
++
+. Verify that your nodes have allocatable GPUs with the following command.
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
+----
++
+[source,bash]
+----
+NAME                                           GPU
+ip-192-168-11-225.us-west-2.compute.internal   1
+ip-192-168-24-96.us-west-2.compute.internal    1
+----
++
+. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`12.9.1-base-amzn2023` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node.
++
+[source,yaml,subs="verbatim,attributes,quotes"]
+----
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nvidia-smi
+spec:
+  restartPolicy: OnFailure
+  containers:
+  - name: nvidia-smi
+    image: nvidia/cuda:12.9.1-base-amzn2023
+    args:
+    - "nvidia-smi"
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+----
++
+. Apply the manifest with the following command.
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl apply -f nvidia-smi.yaml
+----
+. After the Pod has finished running, view its logs with the following command. 
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl logs nvidia-smi
+----
++
+An example output is as follows.
++
+[source,bash,subs="verbatim,attributes"]
+----
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI XXX.XXX.XX Driver Version: XXX.XXX.XX CUDA Version: XX.X |
+|-----------------------------------------+------------------------+----------------------+
+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+| | | MIG M. |
+|=========================================+========================+======================|
+| 0 NVIDIA L4 On | 00000000:31:00.0 Off | 0 |
+| N/A 27C P8 11W / 72W | 0MiB / 23034MiB | 0% Default |
+| | | N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| No running processes found |
++-----------------------------------------------------------------------------------------+
+----
+
+[#eks-neuron-device-plugin]
+== Install Neuron Kubernetes device plugin
+
+The following procedure describes how to install the Neuron Kubernetes device plugin and run a sample test on an Inferentia instance.
+
+=== Prerequisites
+
+* EKS cluster created
+* Neuron nodes (Inferentia or Trainium instances) running in the cluster using the EKS-optimized AL2023 Neuron AMI
+* Helm installed in your command-line environment, see <>.
+
+=== Procedure
+
+. Install the Neuron Kubernetes device plugin on your cluster.
++
+[source,bash]
+----
+helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
+  --set "npd.enabled=false"
+----
++
+. Verify the Neuron Kubernetes device plugin is running in your cluster. 
The example output below is from a cluster with a single Neuron node.
++
+[source,bash]
+----
+kubectl get ds -n kube-system neuron-device-plugin
+----
++
+[source,bash]
+----
+NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
+neuron-device-plugin   1         1         1       1            1                           72s
+----
++
+. Verify that your nodes have allocatable NeuronCores with the following command.
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
+----
++
+[source,bash]
+----
+NAME                                           NeuronCore
+ip-192-168-47-173.us-west-2.compute.internal   2
+----
++
+. Verify that your nodes have allocatable NeuronDevices with the following command.
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neuron"
+----
++
+[source,bash]
+----
+NAME                                           NeuronDevice
+ip-192-168-47-173.us-west-2.compute.internal   1
+----
++
+. Create a file named `neuron-ls.yaml` with the following contents. This manifest launches a https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html[Neuron Monitor] container that has the `neuron-ls` tool installed.
++
+[source,yaml]
+----
+apiVersion: v1
+kind: Pod
+metadata:
+  name: neuron-ls
+spec:
+  restartPolicy: Never
+  containers:
+  - name: neuron-container
+    image: public.ecr.aws/g4h4h0b5/neuron-monitor:1.0.0
+    command: ["/bin/sh"]
+    args: ["-c", "neuron-ls"]
+    resources:
+      limits:
+        aws.amazon.com/neuron: 1
+  tolerations:
+  - key: "aws.amazon.com/neuron"
+    operator: "Exists"
+    effect: "NoSchedule"
+----
++
+. Apply the manifest with the following command.
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl apply -f neuron-ls.yaml
+----
+. After the Pod has finished running, view its logs with the following command. 
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl logs neuron-ls
+----
++
+An example output is below.
++
+[source,bash,subs="verbatim,attributes"]
+----
+instance-type: inf2.xlarge
+instance-id: ...
++--------+--------+--------+---------+
+| NEURON | NEURON | NEURON |   PCI   |
+| DEVICE | CORES  | MEMORY |   BDF   |
++--------+--------+--------+---------+
+| 0      | 2      | 32 GB  | 00:1f.0 |
++--------+--------+--------+---------+
+----
\ No newline at end of file
diff --git a/latest/ug/ml/ml-eks-optimized-ami.adoc b/latest/ug/ml/ml-eks-optimized-ami.adoc
index 3c94a8739..8df6b65d6 100644
--- a/latest/ug/ml/ml-eks-optimized-ami.adoc
+++ b/latest/ug/ml/ml-eks-optimized-ami.adoc
@@ -2,83 +2,131 @@ include::../attributes.txt[]

[.topic]
[#ml-eks-optimized-ami]
-= Run GPU-accelerated containers (Linux on EC2)
-:info_titleabbrev: Set up Linux GPU AMIs
+= Use EKS-optimized accelerated AMIs for GPU instances
+:info_titleabbrev: Use EKS Linux GPU AMIs

-The Amazon EKS optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS optimized Amazon Linux AMIs. For details on these AMIs, see <>.
-The following text describes how to enable {aws} Neuron-based workloads.
+Amazon EKS supports EKS-optimized Amazon Linux and Bottlerocket AMIs for GPU instances. The EKS-optimized accelerated AMIs simplify running AI and ML workloads in EKS clusters by providing pre-built, validated operating system images for the accelerated Kubernetes stack. In addition to the core Kubernetes components that are included in the standard EKS-optimized AMIs, the EKS-optimized accelerated AMIs include the kernel modules and drivers required to run the NVIDIA GPU-based `G` and `P` EC2 instances, and the AWS link:machine-learning/inferentia/[Inferentia,type="marketing"] and link:machine-learning/trainium/[Trainium,type="marketing"] EC2 instances, in EKS clusters. 
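The AMI IDs for these variants are published in {aws} Systems Manager Parameter Store, which is a convenient way to pin the latest image in automation. The sketch below is a hedged example: the Kubernetes version `1.33` is an assumption, and the parameter paths should be verified against the EKS and Bottlerocket documentation for your version.

```shell
# Sketch: look up the latest accelerated AMI IDs from SSM Parameter Store.
# K8S_VERSION is an assumption -- substitute your cluster's Kubernetes version.
K8S_VERSION=1.33

# EKS-optimized AL2023 x86_64 NVIDIA variant
AL2023_NVIDIA_PARAM="/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/nvidia/recommended/image_id"

# EKS-optimized Bottlerocket x86_64 aws-k8s-nvidia variant
BOTTLEROCKET_NVIDIA_PARAM="/aws/service/bottlerocket/aws-k8s-${K8S_VERSION}-nvidia/x86_64/latest/image_id"

# Requires the AWS CLI and credentials; skipped when they are unavailable.
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
  aws ssm get-parameter --name "$AL2023_NVIDIA_PARAM" --query 'Parameter.Value' --output text
  aws ssm get-parameter --name "$BOTTLEROCKET_NVIDIA_PARAM" --query 'Parameter.Value' --output text
fi
```

Each `get-parameter` call prints an AMI ID such as `ami-0123456789abcdef0` for the {aws} Region configured in your CLI profile.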
-.To enable {aws} Neuron (ML accelerator) based workloads -For details on training and inference workloads using Neuron in Amazon EKS, see the following references: +The table below shows the supported GPU instance types for each EKS-optimized accelerated AMI variant. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] and https://github.com/bottlerocket-os/bottlerocket/blob/develop/CHANGELOG.md[Bottlerocket releases] on GitHub for the latest updates to the AMI variants. -* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the _{aws} Neuron Documentation_ -* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training] in {aws} Neuron EKS Samples on GitHub -* <> +[%header,cols="2,4"] +|=== +|EKS AMI variant | EC2 instance types -The following procedure describes how to run a workload on a GPU based instance with the Amazon EKS optimized accelerated AMIs. +|AL2023 x86_64 NVIDIA +|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn -. After your GPU nodes join your cluster, you must apply the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA device plugin for Kubernetes] as a DaemonSet on your cluster. Replace [.replaceable]`vX.X.X` with your desired https://github.com/NVIDIA/k8s-device-plugin/releases[NVIDIA/k8s-device-plugin] version before running the following command. -+ -[source,bash,subs="verbatim,attributes"] ----- -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml ----- -. You can verify that your nodes have allocatable GPUs with the following command. -+ -[source,bash,subs="verbatim,attributes"] ----- -kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" ----- -. Create a file named `nvidia-smi.yaml` with the following contents. 
Replace [.replaceable]`tag` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node. -+ -[source,yaml,subs="verbatim,attributes"] ----- -apiVersion: v1 -kind: Pod -metadata: - name: nvidia-smi -spec: - restartPolicy: OnFailure - containers: - - name: nvidia-smi - image: nvidia/cuda:tag - args: - - "nvidia-smi" - resources: - limits: - nvidia.com/gpu: 1 ----- -. Apply the manifest with the following command. -+ -[source,bash,subs="verbatim,attributes"] ----- -kubectl apply -f nvidia-smi.yaml ----- -. After the Pod has finished running, view its logs with the following command. -+ -[source,bash,subs="verbatim,attributes"] ----- -kubectl logs nvidia-smi +|AL2023 ARM NVIDIA +|p6e-gb200, g5g + +|AL2023 x86_64 Neuron +|inf1, inf2, trn1, trn2 + +|Bottlerocket x86_64 aws-k8s-nvidia +|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn + +|Bottlerocket aarch64/arm64 aws-k8s-nvidia +|g5g + +|Bottlerocket x86_64 aws-k8s +|inf1, inf2, trn1, trn2 +|=== + +[#eks-amis-nvidia] +== EKS-optimized NVIDIA AMIs + +By using the EKS-optimized NVIDIA AMIs, you agree to https://s3.amazonaws.com/EULA/NVidiaEULAforAWS.pdf[NVIDIA's Cloud End User License Agreement (EULA)]. + +To find the latest EKS-optimized NVIDIA AMIs, see <> and <>. + +When using Amazon Elastic Fabric Adaptor (EFA) with the EKS-optimized AL2023 or Bottlerocket NVIDIA AMIs, you must install the EFA device plugin separately. For more information, see <>. + +[#eks-amis-nvidia-al2023] +== EKS AL2023 NVIDIA AMIs + +When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html[NVIDIA GPU operator] with the EKS-optimized AL2023 NVIDIA AMIs, you must disable the operator installation of the driver and toolkit, as these are already included in the EKS AMIs. 
The EKS-optimized AL2023 NVIDIA AMIs do not include the NVIDIA Kubernetes device plugin or the NVIDIA DRA driver, and these must be installed separately. For more information, see <>.
+
+In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA AMIs include the following components.
+
+* NVIDIA driver
+* NVIDIA CUDA runtime libraries
+* NVIDIA container toolkit
+* NVIDIA fabric manager
+* NVIDIA persistence daemon (`nvidia-persistenced`)
+* NVIDIA IMEX driver
+* NVIDIA NVLink Subnet Manager
+* EFA minimal (kernel module and rdma-core)
+
+See the EKS AL2023 NVIDIA AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation script] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[kernel loading script] for details on how the EKS AMIs configure the NVIDIA dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command.
+
+To track the status of upgrading the EKS-optimized NVIDIA AMIs to the NVIDIA 580 driver, see https://github.com/awslabs/amazon-eks-ami/issues/2470[GitHub issue #2470]. The NVIDIA 580 driver is required to use CUDA 13+.
+
+When building custom AMIs with the EKS-optimized AMIs as the base, it is not recommended or supported to run an operating system upgrade (for example, `dnf upgrade`) or upgrade any of the Kubernetes or GPU packages that are included in the EKS-optimized AMIs, as this risks breaking component compatibility. If you do upgrade the operating system or packages that are included in the EKS-optimized AMIs, it is recommended to thoroughly test in a development or staging environment before deploying to production. 
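To make the GPU operator guidance above concrete, here is a minimal sketch of installing the operator with its driver and toolkit management disabled. The chart repository URL and the `driver.enabled` and `toolkit.enabled` values reflect the upstream `gpu-operator` Helm chart; verify them against the chart version you actually deploy.

```shell
# Sketch: install the NVIDIA GPU operator on nodes running the EKS-optimized
# AL2023 NVIDIA AMI. The driver and container toolkit already ship in the AMI,
# so the operator is told not to manage them. Value names are assumptions --
# confirm them against the gpu-operator chart version you use.
OPERATOR_VALUES="--set driver.enabled=false --set toolkit.enabled=false"

# Requires Helm and access to a cluster; skipped when either is unavailable.
if command -v helm >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
  helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace ${OPERATOR_VALUES}
fi
```

Leaving the operator to manage only the components the AMI does not provide (such as the device plugin and feature discovery) avoids a second, conflicting driver installation on the node.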
+ +When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[runtime]. + +[#eks-amis-nvidia-bottlerocket] +== EKS Bottlerocket NVIDIA AMIs + +When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html[NVIDIA GPU operator] with the EKS-optimized Bottlerocket NVIDIA AMIs, you must disable the operator installation of the driver, toolkit, and device plugin as these are already included in the EKS AMIs. + +In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. + +* NVIDIA driver +* NVIDIA CUDA runtime libraries +* NVIDIA container toolkit +* NVIDIA fabric manager +* NVIDIA IMEX driver +* NVIDIA NVLink Subnet Manager +* EFA minimal (kernel module and rdma-core) + +See the Bottlerocket Version Information in the https://bottlerocket.dev/en/[Bottlerocket documentation] for details on the installed packages and their versions. The EKS-optimized Bottlerocket NVIDIA AMIs support kernel 6.12 and NVIDIA driver 580 version for Kubernetes versions 1.33 and above. The NVIDIA 580 driver is required to use CUDA 13+. 
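One way to confirm which Bottlerocket image, kernel line, and driver stack your nodes are actually running is to read the node info the kubelet reports. This sketch uses standard `kubectl` custom columns and assumes you have access to the cluster.

```shell
# Sketch: report the OS image, kernel, and allocatable GPUs for each node,
# so you can confirm the Bottlerocket version, kernel line (e.g. 6.12), and
# that the NVIDIA GPUs are advertised to the scheduler.
COLUMNS='NAME:.metadata.name,OS:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion,GPU:.status.allocatable.nvidia\.com/gpu'

# Requires access to a cluster; skipped when kubectl or a cluster is unavailable.
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  kubectl get nodes -o "custom-columns=${COLUMNS}"
fi
```

The `OS` column shows a string such as `Bottlerocket OS 1.x.x (aws-k8s-1.33-nvidia)`, which identifies the variant without logging in to the node.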
+
+[#eks-amis-neuron]
+== EKS-optimized Neuron AMIs
+
+For details on how to run training and inference workloads using Neuron with Amazon EKS, see the following references:
+
+* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the {aws} Neuron Documentation
+* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training example] in {aws} Neuron EKS Samples on GitHub
+* <>
+
+To find the latest EKS-optimized Neuron AMIs, see <> and <>.
+
+When using Amazon Elastic Fabric Adapter (EFA) with the EKS-optimized AL2023 or Bottlerocket Neuron AMIs, you must install the EFA device plugin separately. For more information, see <>.
+
+[#eks-amis-neuron-al2023]
+== EKS AL2023 Neuron AMIs
+
+The EKS-optimized AL2023 Neuron AMIs do not include the Neuron Kubernetes device plugin or the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension], and these must be installed separately. For more information, see <>.
+
+In addition to the standard EKS AMI components, the EKS-optimized AL2023 Neuron AMIs include the following components.
+
+* Neuron driver (aws-neuronx-dkms)
+* Neuron tools (aws-neuronx-tools)
+* EFA minimal (kernel module and rdma-core)
+
+See the EKS AL2023 Neuron AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-neuron-driver.sh[installation script] for details on how the EKS AMIs configure the Neuron dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command. 
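As a concrete illustration of the `dnf list installed` check mentioned above, the snippet below filters a package listing for the Neuron components. The sample listing and its versions are illustrative placeholders, not pinned releases; on a real node you would pipe the live `dnf` output instead.

```shell
# Sketch: filter an AL2023 Neuron node's package list for Neuron components.
# On the node itself you would run:  dnf list installed | grep -i neuronx
# The sample listing below is illustrative; versions vary by AMI release.
sample_packages='aws-neuronx-dkms.noarch 2.x.y-dkms.amzn2023
aws-neuronx-tools.x86_64 2.x.y-1.amzn2023'

# Count how many Neuron packages appear in the sample listing.
printf '%s\n' "$sample_packages" | grep -c 'aws-neuronx'
```

Both packages named in the component list above (`aws-neuronx-dkms` and `aws-neuronx-tools`) should appear; a missing entry suggests the node was not built from the Neuron AMI variant.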
+ +[#eks-amis-neuron-bottlerocket] +== EKS Bottlerocket Neuron AMIs + +The standard Bottlerocket variants (aws-k8s) include the Neuron dependencies that are automatically detected and loaded when running on AWS Inferentia or Trainium EC2 instances. + +The EKS-optimized Bottlerocket AMIs do not include the Neuron Kubernetes device plugin or the https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler.html[Neuron Kubernetes scheduler extension], and these must be installed separately. For more information, see <>. + +In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket Neuron AMIs include the following components. + +* Neuron driver (aws-neuronx-dkms) +* EFA minimal (kernel module and rdma-core) + +When using the EKS-optimized Bottlerocket AMIs with Neuron instances, the following must be configured in the Bottlerocket user-data. This setting allows the container to take ownership of the mounted Neuron device based on the `runAsUser` and `runAsGroup` values provided in the workload specification. For more information on Neuron support in Bottlerocket, see the https://github.com/bottlerocket-os/bottlerocket/blob/develop/QUICKSTART-EKS.md#neuron-support[Quickstart on EKS readme] on GitHub. + +[source,toml] ---- -+ -An example output is as follows. -+ -[source,bash,subs="verbatim,attributes"] +[settings] +[settings.kubernetes] +device-ownership-from-security-context = true ---- -Mon Aug 6 20:23:31 20XX -+-----------------------------------------------------------------------------+ -| NVIDIA-SMI XXX.XX Driver Version: XXX.XX | -|-------------------------------+----------------------+----------------------+ -| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | -|===============================+======================+======================| -| 0 Tesla V100-SXM2... 
On | 00000000:00:1C.0 Off | 0 | -| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default | -+-------------------------------+----------------------+----------------------+ -+-----------------------------------------------------------------------------+ -| Processes: GPU Memory | -| GPU PID Type Process name Usage | -|=============================================================================| -| No running processes found | -+-----------------------------------------------------------------------------+ ----- \ No newline at end of file + +See the https://github.com/bottlerocket-os/bottlerocket-kernel-kit/blob/develop/CHANGELOG.md[Bottlerocket kernel kit changelog] for information on the Neuron driver version included in the EKS-optimized Bottlerocket AMIs. \ No newline at end of file diff --git a/latest/ug/nodes/eks-ami-build-scripts.adoc b/latest/ug/nodes/eks-ami-build-scripts.adoc index 09e8d237e..3f0e0fa8f 100644 --- a/latest/ug/nodes/eks-ami-build-scripts.adoc +++ b/latest/ug/nodes/eks-ami-build-scripts.adoc @@ -12,12 +12,16 @@ Amazon Elastic Kubernetes Service (Amazon EKS) has open-source scripts that are [IMPORTANT] ==== -Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version `1.32` is the last version for which Amazon EKS will release AL2 AMIs. From version `1.33` onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs. +Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version `1.32` is the last version for which Amazon EKS will release AL2 AMIs. From version `1.33` onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs. For more information, see <>. ==== -The Amazon EKS optimized Amazon Linux (AL) AMIs are built on top of AL2 and AL2023, specifically for use as nodes in Amazon EKS clusters. 
Amazon EKS provides open-source build scripts in the https://github.com/awslabs/amazon-eks-ami[Amazon EKS AMI Build Specification] repository that you can use to view the configurations for `kubelet`, the runtime, the {aws} IAM Authenticator for Kubernetes, and build your own AL-based AMI from scratch.
+The Amazon EKS-optimized Amazon Linux (AL) AMIs are built on top of AL2 and AL2023, specifically for use as nodes in Amazon EKS clusters. Amazon EKS provides open-source build scripts in the https://github.com/awslabs/amazon-eks-ami[Amazon EKS AMI Build Specification] repository that you can use to view the configurations for `kubelet`, the runtime, the {aws} IAM Authenticator for Kubernetes, and build your own AL-based AMI from scratch.

-This repository contains the specialized https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2/runtime/bootstrap.sh[bootstrap script] and https://awslabs.github.io/amazon-eks-ami/nodeadm/[nodeadm script] that runs at boot time. These scripts configure your instance's certificate data, control plane endpoint, cluster name, and more. The scripts are considered the source of truth for Amazon EKS optimized AMI builds, so you can follow the GitHub repository to monitor changes to our AMIs.
+This repository contains the specialized https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2/runtime/bootstrap.sh[bootstrap script for AL2] and https://awslabs.github.io/amazon-eks-ami/nodeadm/[nodeadm tool for AL2023] that run at boot time. These scripts configure your instance's certificate data, control plane endpoint, cluster name, and more. The scripts are considered the source of truth for Amazon EKS-optimized AMI builds, so you can follow the GitHub repository to monitor changes to our AMIs.
+
+When building custom AMIs with the EKS-optimized AMIs as the base, it is not recommended or supported to run an operating system upgrade (for example, 
`dnf upgrade`) or upgrade any of the Kubernetes or GPU packages that are included in the EKS-optimized AMIs, as this risks breaking component compatibility. If you do upgrade the operating system or packages that are included in the EKS-optimized AMIs, it is recommended to thoroughly test in a development or staging environment before deploying to production. + +When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[runtime]. == Prerequisites diff --git a/latest/ug/nodes/eks-ami-deprecation-faqs.adoc b/latest/ug/nodes/eks-ami-deprecation-faqs.adoc index 147cf3042..972b2572e 100644 --- a/latest/ug/nodes/eks-ami-deprecation-faqs.adoc +++ b/latest/ug/nodes/eks-ami-deprecation-faqs.adoc @@ -72,83 +72,14 @@ We do not recommend the continued use of `cgroupv1`. Instead, we recommend migra Kubernetes version 1.32 is the last version for which Amazon EKS will release AL2 (Amazon Linux 2) AMIs. For https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html[supported] Kubernetes versions up to 1.32, EKS will continue to release AL2 AMIs (AL2_ARM_64, AL2_x86_64) and AL2-accelerated AMIs (AL2_x86_64_GPU) until November 26, 2025. After this date, EKS will stop releasing AL2-optimized and AL2-accelerated AMIs for all Kubernetes versions. Note that the EOS date for EKS AL2-optimized and AL2-accelerated AMIs is independent of the standard and extended support timelines for Kubernetes versions by EKS. 
-=== NVIDIA drivers comparison for AL2, AL2023, and Bottlerocket AMIs - -[%header,cols="5"] -|=== -|Driver Branch -|Amazon Linux 2 AMI -|Amazon Linux 2023 AMI -|Bottlerocket AMI -|End-of-Life Date - -|R535 -|Not Supported -|Not Supported -|Not Supported -|https://docs.nvidia.com/ai-enterprise/index.html#release-documentation[September 2027] - -|R550 -|Supported -|Supported -|Not Supported -|https://docs.nvidia.com/ai-enterprise/index.html#release-documentation[April 2025] - -|R560 -|Not Supported -|Supported -|Not Supported -|https://docs.nvidia.com/ai-enterprise/index.html#release-documentation[March 2025] - -|R570 -|Not Supported -|Supported -|Supported -|https://docs.nvidia.com/ai-enterprise/index.html#release-documentation[February 2026] -|=== - -To learn more, see https://docs.nvidia.com/ai-enterprise/index.html#release-documentation[Nvidia Release Documentation]. - -=== NVIDIA CUDA versions comparison for AL2, AL2023, and Bottlerocket AMIs - -[%header,cols="4"] -|=== -|CUDA Version -|AL2 Support -|AL2023 Support -|Bottlerocket Support - -|https://developer.nvidia.com/cuda-toolkit-archive[10.1] -|Supported -|Not supported -|Not Supported - -|https://developer.nvidia.com/cuda-toolkit-archive[11.8] -|Supported -|Supported -|Supported - -|https://developer.nvidia.com/cuda-toolkit-archive[12.0] -|Not supported -|Supported -|Supported - -|https://developer.nvidia.com/cuda-toolkit-archive[12.5] -|Not supported -|Supported -|Supported -|=== - -To learn more, see https://developer.nvidia.com/cuda-toolkit-archive[CUDA Release Documentation]. 
- === Supported drivers and Linux kernel versions comparison for AL2, AL2023, and Bottlerocket AMIs [%header,cols="4"] |=== |Component -|AL2 AMI Source -|AL2023 AMI Source -|Bottlerocket AMI Source +|EKS AL2 AMI +|EKS AL2023 AMI +|EKS Bottlerocket AMI |Base OS Compatibility |RHEL7/CentOS 7 @@ -156,26 +87,28 @@ To learn more, see https://developer.nvidia.com/cuda-toolkit-archive[CUDA Releas |N/A |CUDA Toolkit -|https://developer.nvidia.com/cuda-toolkit-archive[CUDA 11.x–12.x] -|https://developer.nvidia.com/cuda-toolkit-archive[CUDA 12.5+] -|CUDA 11.x (12.5 coming soon) +|12.x +|12.x +|12.x,13.x |NVIDIA GPU Driver -|https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software[R550] -|https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software[R565] -|https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software[R570] +|R570 +|R570 +|R570, R580 |{aws} Neuron Driver -|https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/neuron2.x/announce-no-support-al2.html[2.19] -|https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/neuron2.x/announce-no-support-al2.html[2.19+] -|2.20 +|2.20+ +|2.20+ +|2.20+ |Linux Kernel -|https://docs.aws.amazon.com/linux/al2/ug/aml2-kernel.html[5.10] -|https://docs.aws.amazon.com/linux/al2023/ug/compare-with-al2-kernel.html[6.1], 6.12 +|5.10 +|6.1, 6.12 |6.1, 6.12 |=== +For more information on NVIDIA driver and CUDA compatibility, see the https://docs.nvidia.com/datacenter/tesla/drivers/index.html#supported-drivers-and-cuda-toolkit-versions[NVIDIA documentation]. + === {aws} Neuron compatibility with AL2 AMIs Starting from https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/rn.html#neuron-2-20-0-whatsnew[{aws} Neuron release 2.20], the Neuron Runtime (`aws-neuronx-runtime-lib`) used by EKS AL-based AMIs no longer supports Amazon Linux 2 (AL2). 
diff --git a/latest/ug/nodes/eks-optimized-ami-bottlerocket.adoc b/latest/ug/nodes/eks-optimized-ami-bottlerocket.adoc index 31098508e..d0bb545bd 100644 --- a/latest/ug/nodes/eks-optimized-ami-bottlerocket.adoc +++ b/latest/ug/nodes/eks-optimized-ami-bottlerocket.adoc @@ -17,21 +17,17 @@ link:bottlerocket/[Bottlerocket,type="marketing"] is an open source Linux distri Using Bottlerocket with your Amazon EKS cluster has the following advantages: - - * *Higher uptime with lower operational cost and lower management complexity* – Bottlerocket has a smaller resource footprint, shorter boot times, and is less vulnerable to security threats than other Linux distributions. Bottlerocket's smaller footprint helps to reduce costs by using less storage, compute, and networking resources. * *Improved security from automatic OS updates* – Updates to Bottlerocket are applied as a single unit which can be rolled back, if necessary. This removes the risk of corrupted or failed updates that can leave the system in an unusable state. With Bottlerocket, security updates can be automatically applied as soon as they're available in a minimally disruptive manner and be rolled back if failures occur. * *Premium support* – {aws} provided builds of Bottlerocket on Amazon EC2 is covered under the same {aws} Support plans that also cover {aws} services such as Amazon EC2, Amazon EKS, and Amazon ECR. - [#bottlerocket-considerations] == Considerations Consider the following when using Bottlerocket for your AMI type: - - -* Bottlerocket supports Amazon EC2 instances with `x86_64` and `arm64` processors. The Bottlerocket AMI isn't recommended for use with Amazon EC2 instances with an Inferentia chip. +* Bottlerocket supports Amazon EC2 instances with `x86_64` and `arm64` processors. +* Bottlerocket supports Amazon EC2 instances with GPUs. For more information, see <>. * Bottlerocket images don't include an SSH server or a shell. You can employ out-of-band access methods to allow SSH. 
These approaches enable the admin container and allow you to pass some bootstrapping configuration steps with user data. For more information, refer to the following sections in https://github.com/bottlerocket-os/bottlerocket/blob/develop/README.md[Bottlerocket OS] on GitHub: + ** https://github.com/bottlerocket-os/bottlerocket/blob/develop/README.md#exploration[Exploration] @@ -42,7 +38,6 @@ Consider the following when using Bottlerocket for your AMI type: ** By default, a https://github.com/bottlerocket-os/bottlerocket-control-container[control container] is enabled. This container runs the https://github.com/aws/amazon-ssm-agent[{aws} Systems Manager agent] that you can use to run commands or start shell sessions on Amazon EC2 Bottlerocket instances. For more information, see link:systems-manager/latest/userguide/session-manager-getting-started.html[Setting up Session Manager,type="documentation"] in the _{aws} Systems Manager User Guide_. ** If an SSH key is given when creating the node group, an admin container is enabled. We recommend using the admin container only for development and testing scenarios. We don't recommend using it for production environments. For more information, see https://github.com/bottlerocket-os/bottlerocket/blob/develop/README.md#admin-container[Admin container] on GitHub. - [#bottlerocket-more-information] == More information diff --git a/latest/ug/nodes/eks-optimized-ami.adoc b/latest/ug/nodes/eks-optimized-ami.adoc index 746feebe3..ebd70192b 100644 --- a/latest/ug/nodes/eks-optimized-ami.adoc +++ b/latest/ug/nodes/eks-optimized-ami.adoc @@ -28,31 +28,14 @@ The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (A ==== [#gpu-ami] -== Amazon EKS optimized accelerated Amazon Linux AMIs +== Amazon EKS-optimized accelerated Amazon Linux AMIs -The Amazon EKS optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS optimized Amazon Linux AMIs.
They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads. +The Amazon EKS-optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS optimized Amazon Linux AMIs. They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads. -In addition to the standard Amazon EKS optimized AMI configuration, the accelerated AMIs include the following: - -* NVIDIA drivers -* `nvidia-container-toolkit` -* {aws} Neuron driver - -For a list of the latest components included in the accelerated AMIs, see the `amazon-eks-ami` https://github.com/awslabs/amazon-eks-ami/releases[Releases] on GitHub. - -[NOTE] -==== - -* Make sure to specify the applicable instance type in your node {aws} CloudFormation template. By using the Amazon EKS optimized accelerated AMIs, you agree to https://s3.amazonaws.com/EULA/NVidiaEULAforAWS.pdf[NVIDIA's Cloud End User License Agreement (EULA)]. -* The Amazon EKS optimized accelerated AMIs were previously referred to as the _Amazon EKS optimized AMIs with GPU support_. -* Previous versions of the Amazon EKS optimized accelerated AMIs installed the `nvidia-docker` repository. The repository is no longer included in Amazon EKS AMI version `v20200529` and later. - -==== - -For details on running workloads on Amazon EKS optimized accelerated Amazon Linux AMIs, see <>. +For more information, see <>. [#arm-ami] -== Amazon EKS optimized Arm Amazon Linux AMIs +== Amazon EKS-optimized Arm Amazon Linux AMIs Arm instances deliver significant cost savings for scale-out and Arm-based applications such as web servers, containerized microservices, caching fleets, and distributed data stores. 
When adding Arm nodes to your cluster, review the following considerations. From 742c4c24e8aaace5a911e0c35c0678c983baeeec Mon Sep 17 00:00:00 2001 From: csplinter Date: Fri, 17 Oct 2025 10:28:03 -0500 Subject: [PATCH 2/4] correct typo --- latest/ug/ml/ml-eks-k8s-device-plugin.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/latest/ug/ml/ml-eks-k8s-device-plugin.adoc b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc index 562a6ad94..47e047624 100644 --- a/latest/ug/ml/ml-eks-k8s-device-plugin.adoc +++ b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc @@ -158,7 +158,7 @@ The following procedure describes how to install the Neuron Kubernetes device pl === Prerequisites * EKS cluster created -* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 NVIDIA AMI +* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 or Bottlerocket Neuron AMI * Helm installed in your command-line environment, see <>. === Procedure From 8622ce695b36332be7ed968fc4c4bd552219ba56 Mon Sep 17 00:00:00 2001 From: csplinter Date: Fri, 17 Oct 2025 13:43:52 -0500 Subject: [PATCH 3/4] addressing feedback --- latest/ug/ml/ml-eks-k8s-device-plugin.adoc | 23 +++++++++++-------- latest/ug/ml/ml-eks-optimized-ami.adoc | 18 +++++++++------ latest/ug/nodes/eks-ami-deprecation-faqs.adoc | 2 +- latest/ug/nodes/eks-optimized-ami.adoc | 14 +++++------ 4 files changed, 33 insertions(+), 24 deletions(-) diff --git a/latest/ug/ml/ml-eks-k8s-device-plugin.adoc b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc index 47e047624..fa5942fb3 100644 --- a/latest/ug/ml/ml-eks-k8s-device-plugin.adoc +++ b/latest/ug/ml/ml-eks-k8s-device-plugin.adoc @@ -92,7 +92,7 @@ ip-192-168-11-225.us-west-2.compute.internal 1 ip-192-168-24-96.us-west-2.compute.internal 1 ---- + -. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`12.9.1-base-amzn2023` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. 
This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node. +. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node. + [source,yaml,subs="verbatim,attributes,quotes"] ---- @@ -103,13 +103,18 @@ metadata: spec: restartPolicy: OnFailure containers: - - name: nvidia-smi - image: nvidia/cuda:12.9.1-base-amzn2023 - args: - - "nvidia-smi" - resources: - limits: - nvidia.com/gpu: 1 + - name: gpu-demo + image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal + command: ['/bin/sh', '-c'] + args: ['nvidia-smi && tail -f /dev/null'] + resources: + limits: + nvidia.com/gpu: 1 + tolerations: + - key: 'nvidia.com/gpu' + operator: 'Equal' + value: 'true' + effect: 'NoSchedule' ---- + . Apply the manifest with the following command. @@ -158,7 +163,7 @@ The following procedure describes how to install the Neuron Kubernetes device pl === Prerequisites * EKS cluster created -* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 or Bottlerocket Neuron AMI +* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI * Helm installed in your command-line environment, see <>. 
=== Procedure diff --git a/latest/ug/ml/ml-eks-optimized-ami.adoc b/latest/ug/ml/ml-eks-optimized-ami.adoc index 8df6b65d6..5207641d3 100644 --- a/latest/ug/ml/ml-eks-optimized-ami.adoc +++ b/latest/ug/ml/ml-eks-optimized-ami.adoc @@ -14,7 +14,7 @@ The table below shows the supported GPU instance types for each EKS-optimized ac |EKS AMI variant | EC2 instance types |AL2023 x86_64 NVIDIA -|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn +|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn |AL2023 ARM NVIDIA |p6e-gb200, g5g @@ -23,7 +23,7 @@ The table below shows the supported GPU instance types for each EKS-optimized ac |inf1, inf2, trn1, trn2 |Bottlerocket x86_64 aws-k8s-nvidia -|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn +|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn |Bottlerocket aarch64/arm64 aws-k8s-nvidia |g5g @@ -49,7 +49,7 @@ When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/late In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA AMIs include the following components. * NVIDIA driver -* NVIDIA CUDA runtime libraries +* NVIDIA CUDA user mode driver * NVIDIA container toolkit * NVIDIA fabric manager * NVIDIA persistenced @@ -57,10 +57,12 @@ In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA * NVIDIA NVLink Subnet Manager * EFA minimal (kernel module and rdma-core) -See the EKS AL2023 NVIDIA AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation script] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[kernel loading script] for details on how the EKS AMIs configure the NVIDIA dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. 
You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command. +For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[NVIDIA documentation]. The CUDA version reported by `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers. To track the status of the EKS-optimized NVIDIA AMIs' upgrade to the NVIDIA 580 driver, see https://github.com/awslabs/amazon-eks-ami/issues/2470[GitHub issue #2470]. The NVIDIA 580 driver is required to use CUDA 13+. +See the EKS AL2023 NVIDIA AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation script] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[kernel loading script] for details on how the EKS AMIs configure the NVIDIA dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command. + When building custom AMIs with the EKS-optimized AMIs as the base, it is not recommended or supported to run an operating system upgrade (for example, `dnf upgrade`) or upgrade any of the Kubernetes or GPU packages that are included in the EKS-optimized AMIs, as this risks breaking component compatibility. If you do upgrade the operating system or packages that are included in the EKS-optimized AMIs, it is recommended to thoroughly test in a development or staging environment before deploying to production.
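Because the CUDA version printed by `nvidia-smi` describes the host's user mode driver rather than any container, it is useful to capture both the driver and CUDA versions when auditing a node. A minimal parsing sketch — the header line below is a hypothetical sample; on a real GPU node you would feed it actual `nvidia-smi` output instead:

```shell
# Sketch: extract driver and CUDA versions from an nvidia-smi-style header line.
# The sample header is illustrative; the exact layout can vary by driver release.
header='| NVIDIA-SMI 570.124.06   Driver Version: 570.124.06   CUDA Version: 12.8 |'
driver_version=$(printf '%s\n' "$header" | sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p')
cuda_version=$(printf '%s\n' "$header" | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
echo "host driver: $driver_version, max supported CUDA: $cuda_version"
```

The `cuda_version` captured here is the ceiling for the CUDA runtime/libraries that application containers scheduled on that node can use.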
When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[runtime]. @@ -70,15 +72,17 @@ When building custom AMIs for GPU instances, it is recommended to build separate When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html[NVIDIA GPU operator] with the EKS-optimized Bottlerocket NVIDIA AMIs, you must disable the operator installation of the driver, toolkit, and device plugin as these are already included in the EKS AMIs. -In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. +In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. The minimal dependencies for EFA (kernel module and rdma-core) are installed in all Bottlerocket variants. * NVIDIA driver -* NVIDIA CUDA runtime libraries +* NVIDIA CUDA user mode driver * NVIDIA container toolkit * NVIDIA fabric manager +* NVIDIA persistenced * NVIDIA IMEX driver * NVIDIA NVLink Subnet Manager -* EFA minimal (kernel module and rdma-core) + +For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[NVIDIA documentation]. 
The CUDA version reported by `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers. See the Bottlerocket Version Information in the https://bottlerocket.dev/en/[Bottlerocket documentation] for details on the installed packages and their versions. The EKS-optimized Bottlerocket NVIDIA AMIs support kernel 6.12 and NVIDIA driver version 580 for Kubernetes versions 1.33 and later. The NVIDIA 580 driver is required to use CUDA 13+. diff --git a/latest/ug/nodes/eks-ami-deprecation-faqs.adoc b/latest/ug/nodes/eks-ami-deprecation-faqs.adoc index 972b2572e..63d224845 100644 --- a/latest/ug/nodes/eks-ami-deprecation-faqs.adoc +++ b/latest/ug/nodes/eks-ami-deprecation-faqs.adoc @@ -86,7 +86,7 @@ Kubernetes version 1.32 is the last version for which Amazon EKS will release AL |Fedora/CentOS 9 |N/A -|CUDA Toolkit +|https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[CUDA user mode driver] |12.x |12.x |12.x, 13.x diff --git a/latest/ug/nodes/eks-optimized-ami.adoc b/latest/ug/nodes/eks-optimized-ami.adoc index ebd70192b..e5e01e221 100644 --- a/latest/ug/nodes/eks-optimized-ami.adoc +++ b/latest/ug/nodes/eks-optimized-ami.adoc @@ -7,10 +7,10 @@ include::../attributes.txt[] [abstract] -- -The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes. +The Amazon EKS-optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes. -- -The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes.
The AMIs are configured to work with Amazon EKS and they include the following components: +The Amazon EKS-optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes. The AMIs are configured to work with Amazon EKS and they include the following components: * `kubelet` * {aws} IAM Authenticator @@ -20,7 +20,7 @@ The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (A ==== * You can track security or privacy events for Amazon Linux at the https://alas.aws.amazon.com/[Amazon Linux security center] by choosing the tab for your desired version. You can also subscribe to the applicable RSS feed. Security and privacy events include an overview of the issue, what packages are affected, and how to update your instances to correct the issue. -* Before deploying an accelerated or Arm AMI, review the information in <> and <>. +* Before deploying an accelerated or Arm AMI, review the information in <> and <>. * Amazon EC2 `P2` instances aren't supported on Amazon EKS because they require `NVIDIA` driver version 470 or earlier. * Any newly created managed node groups in clusters on version `1.30` or newer will automatically default to using AL2023 as the node operating system. Previously, new node groups would default to AL2. You can continue to use AL2 by choosing it as the AMI type when creating a new node group. * Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version `1.32` is the last version for which Amazon EKS will release AL2 AMIs. From version `1.33` onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs. 
@@ -30,7 +30,7 @@ The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (A [#gpu-ami] == Amazon EKS-optimized accelerated Amazon Linux AMIs -The Amazon EKS-optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS optimized Amazon Linux AMIs. They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads. +The Amazon EKS-optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS-optimized Amazon Linux AMIs. They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads. For more information, see <>. @@ -47,13 +47,13 @@ Arm instances deliver significant cost savings for scale-out and Arm-based appli [#linux-more-information] == More information -For more information about using Amazon EKS optimized Amazon Linux AMIs, see the following sections: +For more information about using Amazon EKS-optimized Amazon Linux AMIs, see the following sections: * To use Amazon Linux with managed node groups, see <>. * To launch self-managed Amazon Linux nodes, see <>. * For version information, see <>. -* To retrieve the latest IDs of the Amazon EKS optimized Amazon Linux AMIs, see <>. -* For open-source scripts that are used to build the Amazon EKS optimized AMIs, see <>. +* To retrieve the latest IDs of the Amazon EKS-optimized Amazon Linux AMIs, see <>. +* For open-source scripts that are used to build the Amazon EKS-optimized AMIs, see <>. 
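The latest EKS-optimized AMI IDs are published as {aws} Systems Manager public parameters. A sketch of assembling such a parameter path — the Kubernetes version, architecture, and variant values are examples, and the `aws ssm get-parameter` call is shown only as a comment because it requires AWS credentials:

```shell
# Sketch: build the SSM public parameter name for the recommended
# EKS-optimized AL2023 AMI ID. Input values below are examples.
ami_param() {
  k8s_version="$1"   # e.g. 1.33
  arch="$2"          # x86_64 or arm64
  variant="$3"       # e.g. standard, nvidia, or neuron
  echo "/aws/service/eks/optimized-ami/${k8s_version}/amazon-linux-2023/${arch}/${variant}/recommended/image_id"
}

# With AWS credentials configured, you could then run (not executed here):
#   aws ssm get-parameter --name "$(ami_param 1.33 x86_64 standard)" \
#     --query 'Parameter.Value' --output text
ami_param 1.33 x86_64 standard
```

Keeping the lookup parameterized this way makes it easy to pin node groups to the recommended AMI for a given Kubernetes version and architecture in automation.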
include::al2023.adoc[leveloffset=+1] From 7a41584237fd5d938716e04770362b74d31a82b1 Mon Sep 17 00:00:00 2001 From: csplinter Date: Thu, 23 Oct 2025 09:40:00 -0500 Subject: [PATCH 4/4] add device plugin, mig manager to BR components --- latest/ug/ml/ml-eks-optimized-ami.adoc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/latest/ug/ml/ml-eks-optimized-ami.adoc b/latest/ug/ml/ml-eks-optimized-ami.adoc index 5207641d3..736af8d92 100644 --- a/latest/ug/ml/ml-eks-optimized-ami.adoc +++ b/latest/ug/ml/ml-eks-optimized-ami.adoc @@ -74,6 +74,7 @@ When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/late In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. The minimal dependencies for EFA (kernel module and rdma-core) are installed in all Bottlerocket variants. +* NVIDIA Kubernetes device plugin * NVIDIA driver * NVIDIA CUDA user mode driver * NVIDIA container toolkit @@ -81,6 +82,7 @@ In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket N * NVIDIA persistenced * NVIDIA IMEX driver * NVIDIA NVLink Subnet Manager +* NVIDIA MIG manager For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[NVIDIA documentation]. The CUDA version shown from `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.