[Feature]: AWS EFA (Elastic Fabric Adapter) support #1781

@un-def

Problem

https://aws.amazon.com/hpc/efa/

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, High Performance Computing (HPC) applications using the Message Passing Interface (MPI) and Machine Learning (ML) applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs.

This feature “is available as an optional EC2 networking feature that you can enable on any supported EC2 instance at no additional cost”, but it currently cannot be used with dstack because:

  1. It is not requested by dstack during provisioning.
  2. It is not supported in the VM images used by dstack.
  3. Host EFA devices are not mounted inside the run container.

Solution

  1. Request this feature by default on supported instances.
  2. Update VM images to include a preinstalled EFA kernel driver and libfabric.
  3. Mount EFA devices inside containers.
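
A rough sketch of what steps 1 and 3 could look like, not dstack's actual implementation. It assumes the EC2 `RunInstances` request takes a `NetworkInterfaces` entry with `InterfaceType` set to `"efa"`, and that EFA devices show up on the host as `/dev/infiniband/uverbs*`. The helper names `efa_network_interface` and `efa_device_flags` are hypothetical.

```python
import glob
import os


def efa_network_interface(subnet_id, security_group_ids, device_index=0):
    """Build one NetworkInterfaces entry that asks EC2 for an EFA interface.

    The dict is meant to be passed in the NetworkInterfaces list of an
    EC2 RunInstances request (assumption: the instance type supports EFA).
    """
    return {
        "DeviceIndex": device_index,
        "InterfaceType": "efa",
        "SubnetId": subnet_id,
        "Groups": list(security_group_ids),
    }


def efa_device_flags(dev_dir="/dev/infiniband"):
    """Return `--device` flags for every uverbs device found in dev_dir.

    Assumption: EFA devices are exposed as /dev/infiniband/uverbs*.
    Returns an empty list when the host has no such devices.
    """
    flags = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "uverbs*"))):
        flags += ["--device", path]
    return flags
```

On a host without EFA, `efa_device_flags()` returns an empty list, so the container command line stays unchanged and non-EFA instances are unaffected.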

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes
