Distributed Training APIs

SageMaker distributed training libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications. They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.

The SageMaker Distributed Data Parallel Library

smd_data_parallel
sdp_versions/latest
smd_data_parallel_use_sm_pysdk
smd_data_parallel_release_notes/smd_data_parallel_change_log
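As a rough illustration of the "small code changes" mentioned above, the sketch below builds the estimator arguments that turn on the data parallel library through the SageMaker Python SDK's `distribution` parameter. The helper name `data_parallel_estimator_kwargs`, the default instance count, and the instance type are assumptions made for this example; check the supported instance types and framework versions for your setup before using it.

```python
def data_parallel_estimator_kwargs(instance_count=2, instance_type="ml.p4d.24xlarge"):
    """Hypothetical helper: build keyword arguments for a SageMaker framework estimator.

    The defaults are illustrative assumptions, not recommendations.
    """
    return {
        "instance_count": instance_count,
        "instance_type": instance_type,
        # This dictionary is what enables the SageMaker distributed data parallel library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = data_parallel_estimator_kwargs()
print(kwargs["distribution"])

# With the SageMaker Python SDK installed and AWS credentials configured,
# the arguments could be passed to a framework estimator, for example:
#
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(
#       entry_point="train.py",            # your training script (assumed name)
#       role=role,                         # an IAM role ARM you supply
#       framework_version="<version>",     # a version supported by the library
#       py_version="<python version>",
#       **kwargs,
#   )
#   estimator.fit("s3://<your-bucket>/<prefix>")
```

The training script itself usually also needs a few library-specific lines; the pages listed above cover those per framework.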

The SageMaker Distributed Model Parallel Library

Note

Since the release of the SageMaker model parallelism (SMP) version 2 in December 2023, this documentation is no longer maintained. The live documentation is available at SageMaker model parallelism library v2 in the Amazon SageMaker User Guide.

The documentation for the SMP library v1.x is archived and available at Run distributed training with the SageMaker model parallelism library in the Amazon SageMaker User Guide, and the SMP v1.x API reference is available in the SageMaker Python SDK v2.199.0 documentation.
