This repository has been archived by the owner on Jun 15, 2023. It is now read-only.

SageMaker's Model Parallelism Library

Use Amazon SageMaker's model parallel library to train large deep learning (DL) models that are difficult to train because of GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances. With the library, you can reach a target prediction accuracy sooner by efficiently training larger DL models that have billions or trillions of parameters.

You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK.
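As a rough illustration of what access through the SageMaker Python SDK can look like, the sketch below builds the `distribution` dictionary that a framework estimator accepts to turn on the library. The helper name `build_distribution`, the chosen values, and the commented-out estimator arguments (script name, role, instance type) are assumptions made for this example. Check the parameter names against the SDK documentation for the library version you use.

```python
from typing import Any, Dict


def build_distribution(
    partitions: int, microbatches: int, processes_per_host: int
) -> Dict[str, Any]:
    """Build a `distribution` dict that enables the model parallel library.

    `build_distribution` is a hypothetical helper for this example,
    not part of the SageMaker Python SDK.
    """
    if min(partitions, microbatches, processes_per_host) < 1:
        raise ValueError("all arguments must be positive integers")
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    # Number of model partitions to split the model into.
                    "partitions": partitions,
                    # Number of microbatches each batch is divided into.
                    "microbatches": microbatches,
                },
            }
        },
        # The library launches its worker processes through MPI.
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
    }


distribution = build_distribution(partitions=2, microbatches=4, processes_per_host=8)

# Assumed usage (needs the `sagemaker` package, AWS credentials, and an IAM role;
# every value in angle brackets is a placeholder):
#
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="<your-training-script>.py",
#     role="<your-iam-role-arn>",
#     instance_type="<gpu-instance-type>",
#     instance_count=1,
#     framework_version="<pytorch-version>",
#     py_version="<python-version>",
#     distribution=distribution,
# )
# estimator.fit()
```

Keeping the configuration in a plain dictionary means it can be inspected or unit tested without an AWS account. Only the commented-out estimator call needs access to the service.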

Use the following sections to learn more about model parallelism and the SageMaker model parallel library. The library's API documentation is available under Distributed Training APIs in the SageMaker Python SDK documentation.

To track the latest updates of the library, see the SageMaker Model Parallel Release Notes in the SageMaker Python SDK documentation.

Topics