Releases: aws/aws-ofi-nccl

AWS OFI NCCL v1.12.1

25 Oct 05:10
v1.12.1-aws
2301579

All users of v1.12.0-aws are strongly advised to apply this fix when using EFA Installer >= 1.35.0.

Bug fixes:

  • The platform-aws VF sorting code caused significant performance regressions or
    crashes when used atop the latest EFA driver releases. This sorting code has been
    reverted, which mitigates the problem. (adb47dc)

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Digests:

3722e0790b98e65d04f143fe8484fd0a05dcec6419eeac1cdcad5e49f6c7cf8e  aws-ofi-nccl-1.12.1-aws.tar.gz
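
A downloaded tarball can be checked against the digest published with its release. The sketch below is one way to do that, not an official tool: the function names `file_digest` and `verify_tarball` are hypothetical, and it assumes that 64-character digests are SHA-256 and 128-character digests are SHA-512, which matches the lengths on this page but is stated explicitly only for the sha512 releases.

```python
import hashlib
import hmac


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tarball(path, expected_hex, algorithm=None):
    """Compare a file's digest with the published one.

    If no algorithm is given, it is guessed from the digest length
    (assumption: 64 hex chars = SHA-256, 128 hex chars = SHA-512).
    """
    expected = expected_hex.strip().lower()
    if algorithm is None:
        by_length = {64: "sha256", 128: "sha512"}
        if len(expected) not in by_length:
            raise ValueError("cannot infer hash algorithm from digest length")
        algorithm = by_length[len(expected)]
    return hmac.compare_digest(file_digest(path, algorithm), expected)
```

For example, `verify_tarball("aws-ofi-nccl-1.12.1-aws.tar.gz", "<digest from the release notes>")` should return `True` for an intact download and `False` otherwise.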

AWS OFI NCCL v1.11.1

25 Oct 05:10
v1.11.1-aws
2db8375

All users of v1.11.0-aws are strongly advised to apply this fix when using EFA Installer >= 1.35.0.

Bug fixes:

  • The platform-aws VF sorting code caused significant performance regressions or
    crashes when used atop the latest EFA driver releases. This sorting code has been
    reverted, which mitigates the problem. (84b7cfa)

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Digests:

6d95eff619208e30d11044068c3781c1c079b180a683d422ce9f6a96ebeadb80  aws-ofi-nccl-1.11.1-aws.tar.gz

AWS OFI NCCL v1.12.0

08 Oct 01:41
v1.12.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.23.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Support for tuner v3 APIs
  • Support for AllGather and ReduceScatter in the tuner
  • Support for PAT algorithm in the tuner

Bug fixes:

  • Fixed NULL pointer access in the endpoint per communicator path
  • Replaced the NVLSTree option in the tuner with RING if nRanks==nNodes
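
The NVLSTree replacement rule in the last bug fix can be pictured as a small post-processing step on the tuner's choice. This is only an illustrative sketch of the rule as worded in the release note, not the plugin's actual code: the function `pick_algorithm` and the string algorithm names are hypothetical.

```python
def pick_algorithm(candidate: str, n_ranks: int, n_nodes: int) -> str:
    """Apply the fallback described in the release note.

    When there is exactly one rank per node (n_ranks == n_nodes),
    an NVLSTree choice is replaced with Ring; every other choice
    is returned unchanged.
    """
    if candidate == "NVLSTree" and n_ranks == n_nodes:
        return "Ring"
    return candidate
```

With 4 ranks on 4 nodes, `pick_algorithm("NVLSTree", 4, 4)` yields `"Ring"`, while 8 ranks on 4 nodes keeps `"NVLSTree"`.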

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

7d9e41ce04253a32a13542e7f4c2d20c2a5a43cdfb575fe153954c5faed8cf85eb08dab76ee0f883109f7610bb43cb8b703fe2f1e98b8f02bbfa866dd1c268e1  aws-ofi-nccl-1.12.0-aws.tar.gz

AWS OFI NCCL v1.11.0

19 Aug 20:28
v1.11.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.22.3-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Autogenerate topology file on P5 by default, with detected topology, instead of using a static file
  • Support for AWS P5e instance type

Bug fixes:

  • Fixed segfault for platform-aws builds for instance types not explicitly configured
  • Fixed a failure in the MR cache in the SENDRECV protocol for providers that don't require memory registration
  • Re-enabled WRITE_IN_ORDER_ALIGNED_128_BYTES setting and check on P5.
  • Added check to cause an error when using old blocking connect_v4/accept_v4 interfaces with RDMA protocol. The previous release changed connection establishment such that these interfaces cause deadlock.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208  aws-ofi-nccl-1.11.0-aws.tar.gz

AWS OFI NCCL v1.10.0

06 Aug 21:38
v1.10.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

New Features:

  • Replaced the model-based tuner with one based on regions derived from experimental evaluations.
  • Changed properties reported to NCCL to signal that registered MRs are global, in order to support user buffer registrations.
  • Added the option to use different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.
  • Updated plugin to use the zero-copy path in the EFA provider for fi_send/fi_recv operations.
  • Shrank the control message to 32 bytes to fit in inline data for EFA.

Bug Fixes:

  • Disabled Libfabric shared memory when possible.
  • Disabled RDMA eager messages on Neuron by default for better performance.
  • Ensured plugin's multi-rail protocol consistently sorts rails in order of VF index for better performance.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

fa296339a7e40fa420e2934c3a44f9a18ad3a9d798b7f129b35f46892f76532b70996fe36f309e3dedd2823ed9a819a4578f7c8241d8549805c49811b38ae14f  aws-ofi-nccl-1.10.0-aws.tar.gz

AWS OFI NCCL v1.9.2

17 Jun 23:33
v1.9.2-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Improved tuner model to make better decisions on P5 instances.
  • Added support in the RDMA protocol for truncation when the size passed to the isend call is greater than the size in the corresponding irecv.
  • Fixed bug that prevented the tuner from getting loaded with NCCL 2.19 and 2.20.
  • Fixed the logging statement reporting whether a domain is created per thread or per process.
  • Updated plugin to not advertise global MR support, to avoid a performance regression in user-registered buffers.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

1e344f38baa1080c04d2c99a1390f51e2a9ce2a57d69c7494061bf4e5da5a4310328bafc323cb36f43b5fcd0d330bd1bd5eec257596de2125aa5c38096b78a01  aws-ofi-nccl-1.9.2-aws.tar.gz

AWS OFI NCCL v1.9.1

15 Apr 21:45
v1.9.1-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This is a bugfix release which requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Fix release distribution generation to include missing headers introduced in v1.9.0. This fixes issue #382.
  • Restrict libcuda link-time dependency to builds with testing enabled
  • Build fixes to explicitly link against libm and libpthread used by the plugin

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

77e44dcdb77e6b25cae882d2124b6d9a2a66f2b85321ae827ec7e3fd88bacd214a537a2490a578af44b7457cc655b2e382fc148b6ed8594a68a30d145f3ce70e  aws-ofi-nccl-1.9.1-aws.tar.gz

AWS OFI NCCL v1.9.0

05 Apr 22:07

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.21.5-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

New Features:

  • Support v8 plugin interface introduced with NCCL 2.20. This enables the use of the user memory registration feature recently introduced in NCCL.
  • Update the tuner component to support v2 ext-tuner interface introduced with NCCL 2.21.
  • Reduce ordering constraints for control messages, to reduce head of line blocking under congestion.

Bug Fixes:

  • Increase the number of communicators to 256K (from 4K), supporting larger all-to-all groups.
  • Improve logging in some corner case error conditions.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

7c86650f2f275b97bd08ff66b24ae8fef593269c068ec543259903d0eec80a0fe4153a3f171700e7e3dcb3b809a1d6aba82d5e7dc52ec138eacd7353629d1bc0  aws-ofi-nccl-1.9.0-aws.tar.gz

AWS OFI NCCL v1.8.1

25 Feb 21:40
v1.8.1-aws

This is a bugfix release that requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

Bug Fixes:

  • Fix an issue with the ID pool's reference counting and allocation
  • Improved error propagation for failed NCCL requests, allowing applications to fail early instead of blocking on requests that can never be completed.

The plugin has been tested with the following libfabric providers using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

4ee21380176d5a76e4af0233ac44d1d46f92fd34941ecfaa104b7567a16cc84503c0abe59e540d36d79675bb3cc443979ed319f39582e301814d0653ea184508  aws-ofi-nccl-1.8.1-aws.tar.gz

AWS OFI NCCL v1.8.0

19 Feb 17:48
v1.8.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL v2.19.4-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.4.8 and later).

New Features:

  • A tuner component for the plugin that picks the optimal NCCL algorithm and protocol at a given scale and message size.
  • Improved communicator and memory region identifier management.
  • Migrated from CUDA Runtime API to functional equivalents in CUDA Driver API in preparation for dma-buf support for memory registration. With this change, the plugin uses the same mechanism as NCCL to interact with the CUDA subsystem.
  • No longer forcing a flush operation for network operations when running with H100 GPUs, even when running with older NCCL versions (< v2.19.1).
  • Improvements to internal device-agnostic APIs.
  • Support for NCCL v7 ext-net plugin interface introduced in NCCL v2.19.3.
  • Support for Ubuntu 22.04 LTS distribution.

Bug Fixes:

  • Set the maximum NVLS tree chunk size used to 512KiB to recover from a performance regression introduced in NCCL v2.19.4, using a parameter introduced in NCCL v2.20.3.
  • Prevent possible invocation of CUDA calls in libfabric by requiring a libfabric version of v1.18.0 or newer.
  • Fix debug prints that reported incorrect device IDs during initialization
  • Fixes to MAX_COMM computation.
  • Better handling of NVLS enablement when NCCL is statically linked to applications
  • Fixes to internal API return codes
  • Configuration system fixes for Neuron builds
  • Fixes to make plugin environment parsing case-insensitive
  • Miscellaneous fixes that address memory leaks, NULL dereferences, and compiler warnings.
  • Updates and improvements to the project documentation.

Testing:

This release has been tested extensively with NCCL v2.19.4-1 for functionality and performance. It has also been lightly tested with NCCL v2.20.3-1, which was released earlier this week, and with Libfabric versions up to v1.19.0.

Checksum (sha512) for the release tarball:

7bad7995e99649dc3ae4c46b2b0011225134703050ae83ab837cd46a7ff979079809cbd117e50cf5169428dd397ab099fea6249d12f891bff94b2d5579b0c0d9  aws-ofi-nccl-1.8.0-aws.tar.gz