SageMaker's Model Parallelism Library

Use Amazon SageMaker's model parallel library to train large deep learning (DL) models that are difficult to train because they do not fit within the memory of a single GPU. The library automatically and efficiently splits a model across multiple GPUs and instances. Using the library, you can reach a target prediction accuracy faster by efficiently training larger DL models with billions or trillions of parameters.
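To make the GPU memory limitation concrete, the following rough estimate may help. It is a back-of-the-envelope sketch, not part of the library: it assumes full-precision (FP32) training with the Adam optimizer, which needs roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for the two optimizer moments), and it ignores activation memory. The function name `training_state_gib` and the 40 GiB GPU size are hypothetical values chosen for illustration.

```python
import math


def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Approximate memory (GiB) for weights, gradients, and Adam state.

    Assumes FP32 training with Adam (about 16 bytes per parameter) and
    ignores activations, so the real requirement is higher.
    """
    return num_params * bytes_per_param / 2**30


# Hypothetical example: a 10-billion-parameter model on 40 GiB GPUs.
gpu_memory_gib = 40
needed_gib = training_state_gib(10_000_000_000)

# Lower bound on the number of GPUs needed just to hold the training state.
min_gpus = math.ceil(needed_gib / gpu_memory_gib)
```

Under these assumptions the training state alone exceeds the memory of a single GPU several times over, which is the situation where splitting the model across devices becomes necessary.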

You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK.
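As a minimal sketch of what enabling the library through the SageMaker Python SDK can look like, the snippet below builds the `distribution` dictionary that a framework estimator (for example, `sagemaker.pytorch.PyTorch`) accepts. The helper `build_distribution` is a hypothetical name, and the parameter values are illustrative only; the set of supported parameters depends on the library version, so check the API documentation before relying on them.

```python
def build_distribution(partitions: int = 2,
                       microbatches: int = 4,
                       processes_per_host: int = 8) -> dict:
    """Build a `distribution` config that turns on model parallelism.

    The values here are placeholders for illustration, not recommendations.
    """
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    # Number of partitions to split the model into.
                    "partitions": partitions,
                    # Number of microbatches each mini-batch is split into.
                    "microbatches": microbatches,
                },
            }
        },
        # The library launches training processes through MPI.
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
        },
    }


distribution = build_distribution()
```

The resulting dictionary would be passed as the `distribution` argument when constructing the estimator, alongside the usual training script, role, instance type, and instance count.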

Use the following sections to learn more about model parallelism and the SageMaker model parallel library. This library's API documentation is located at Distributed Training APIs in the SageMaker Python SDK documentation.

To track the latest updates of the library, see the SageMaker Model Parallel Release Notes in the SageMaker Python SDK documentation.

Topics

Introduction to Model Parallelism
Supported Frameworks and AWS Regions
Core Features of the SageMaker Model Parallelism Library
Run a SageMaker Distributed Training Job with Model Parallelism
Extended Features of the SageMaker Model Parallel Library for PyTorch
SageMaker Distributed Model Parallelism Best Practices
The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls
Model Parallel Troubleshooting
