New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.
Date: March 07, 2022
Currency Updates
- Support for PyTorch 1.10.2
- Support for Hugging Face Transformers 4.16.2
Improvements
Additional support for the :ref:`smdmp-pytorch-tensor-parallel`.
Added support for FP32 residual addition to avoid overflow (NaN loss values) for large models with more than 100 billion parameters when using FP16. This is integrated into the following module:
Added support for the following two NVIDIA Megatron fused kernels:

- Fusion of attention masking and softmax (``fused_softmax``)
- Fusion of bias addition and Gelu activation (``fused_bias_gelu``)

To learn more about these options and how to use them, see the :class:`smp.tensor_parallelism` context manager.
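The snippet below is a minimal sketch of enabling these kernels through the context manager; the keyword names are taken from the kernel names above and are assumptions to verify against the :class:`smp.tensor_parallelism` API documentation.

.. code:: python

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    smp.init()

    # Modules created inside the context are tensor-parallelized where supported.
    # fused_softmax / fused_bias_gelu are assumed keyword flags for the two
    # NVIDIA Megatron fused kernels listed above.
    with smp.tensor_parallelism(enabled=True, fused_softmax=True, fused_bias_gelu=True):
        hidden = nn.Linear(1024, 1024)

    model = smp.DistributedModel(nn.Sequential(hidden, nn.ReLU()))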
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
Deep Learning Container for PyTorch 1.10.2:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
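As an illustration (the entry point, role, and instance settings below are placeholders), the container can be selected explicitly through the SageMaker Python SDK's ``image_uri`` parameter rather than letting the SDK resolve one:

.. code:: python

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",             # placeholder training script
        role="<sagemaker-execution-role>",  # placeholder IAM role
        image_uri=(
            "763104351884.dkr.ecr.<region>.amazonaws.com/"
            "pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker"
        ),
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
    )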
Date: December 20, 2021
New Features
PyTorch
Added extended memory-saving features for PyTorch 1.8.1:
For more information, see the following documentation:
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Container(s):
Deep Learning Container for PyTorch 1.8.1:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
Date: November 03, 2021
New Features
- PyTorch
- Currency update for PyTorch 1.10.0
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
Deep Learning Container for PyTorch 1.10.0:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
Date: June 29, 2021
New Features
- TensorFlow
- Added support for TensorFlow v2.5.0.
- Added support for ``keras.model.fit()``.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
Deep Learning Container for TensorFlow 2.5.0:
763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
Deep Learning Container for PyTorch 1.9.1:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
- New Features
- Bug Fixes
- Known Issues
New Features
- PyTorch
- Exposes a new decorator, ``register_post_partition_hook``. This allows invoking the decorated methods just after model partition but before executing the first step, for example, to load a checkpoint. Refer to the SageMaker distributed model parallel API documentation for more information, and see the sketch after this list.
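A minimal sketch of how such a hook might be registered, assuming the usual ``smdistributed.modelparallel.torch`` import alias (check the API documentation for the exact decorator semantics):

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()
    model = smp.DistributedModel(torch.nn.Linear(10, 10))

    # The decorated function runs right after the model is partitioned, but
    # before the first step; for example, it can load a checkpoint.
    # "checkpoint.pt" is a hypothetical path used only for illustration.
    @smp.register_post_partition_hook
    def load_checkpoint():
        model.load_state_dict(torch.load("checkpoint.pt"), strict=False)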
Bug Fixes
- PyTorch
- Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.
- TensorFlow
- Fixed an issue that caused hangs when training some models with XLA enabled.
Known Issues
- PyTorch
- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until that fix makes its way into the next release of PyTorch, only call ``optimizer.step()`` on processes that have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0`` (see the sketch after this list).
- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be the slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: pytorch/pytorch#50636. This issue does not exist with PyTorch 1.8.
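For the ``optimizer.step()`` crash above, a minimal sketch of the suggested guard, assuming the standard ``smdistributed.modelparallel.torch`` model and optimizer wrappers:

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()
    model = smp.DistributedModel(torch.nn.Linear(16, 16))
    optimizer = smp.DistributedOptimizer(torch.optim.Adadelta(model.parameters()))

    # ... forward/backward inside an smp.step-decorated function ...

    # Workaround: call optimizer.step() only on processes that own at least one
    # local parameter after partitioning.
    if len(list(model.local_parameters())) > 0:
        optimizer.step()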
- New Features
- Bug Fixes
- Known Issues
New Features
PyTorch
Add support for PyTorch 1.8
- Adds a new method, ``register_comm_hook``, to ``DistributedModel`` (for PyTorch 1.8 and newer only). This method behaves the same as the corresponding method with the same name in the ``torch.DistributedDataParallel`` API. Refer to the SageMaker distributed model parallel API documentation for more information, and see the sketch after this list.
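As a hedged illustration (the call signature is assumed to mirror ``torch.nn.parallel.DistributedDataParallel.register_comm_hook``, and the stock hook helpers are assumed to be available in PyTorch 1.8), a gradient-compression hook could be attached like this:

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    smp.init()
    model = smp.DistributedModel(torch.nn.Linear(1024, 1024))

    # Compress gradients to FP16 during allreduce; passing None as the state
    # selects the default process group, as in DistributedDataParallel.
    model.register_comm_hook(None, default_hooks.fp16_compress_hook)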
Improvements
- Adds a configuration, ``active_microbatches``, to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Refer to the SageMaker Python SDK parameters API documentation for more information, and see the sketch after this list.
- Adds a configuration, ``deterministic_server``, to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the SageMaker Python SDK parameters API documentation for more information.
- Parameter passing is now supported in ``module.forward`` methods for ``DistributedModel`` and its submodules. This removes the restriction of having to pass ``nn.Parameter`` to the ``__init__`` call and make it a member of the module in order to use it.
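A hedged sketch of where these launch-time options go in the SageMaker Python SDK; the entry point, role, and instance settings below are placeholders, and the parameter names come from the bullets above:

.. code:: python

    from sagemaker.pytorch import PyTorch

    smp_options = {
        "enabled": True,
        "parameters": {
            "partitions": 2,
            "microbatches": 8,
            "active_microbatches": 4,      # cap in-flight microbatches to limit memory use
            "deterministic_server": True,  # deterministic request order across data parallel ranks
        },
    }

    estimator = PyTorch(
        entry_point="train.py",             # placeholder training script
        role="<sagemaker-execution-role>",  # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.8.1",
        py_version="py36",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )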
Bug Fixes
- PyTorch
- Fixed a case where training hangs due to a module having computation which requires grads that is not used by the final output of the module. Now such a situation raises an error with suggestions on making such computation compatible.
- Fixed an issue with buffers which caused the buffers to not be on the correct device after a model is partitioned, and not be synchronized across steps (when ``broadcast_buffers`` is True). This could have caused correctness issues in models with buffers.
Known Issues
- PyTorch
- ``mp_barrier`` and ``get_mp_process_group`` are wrongly marked as deprecated methods. Ignore the deprecation warning.
- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until that fix makes its way into the next release of PyTorch, only call ``optimizer.step()`` on processes that have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0``.
- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be the slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: pytorch/pytorch#50636. This issue does not exist with PyTorch 1.8.
- New Features
- Bug Fixes
- Known Issues
New Features
PyTorch
Add support for PyTorch 1.7.1
- Adds support for the ``gradient_as_bucket_view`` (PyTorch 1.7.1 only), ``find_unused_parameters`` (PyTorch 1.7.1 only), and ``broadcast_buffers`` options to ``smp.DistributedModel``. These options behave the same as the corresponding options (with the same names) in the ``torch.DistributedDataParallel`` API. Refer to the SageMaker distributed model parallel API documentation for more information, and see the sketch after this list.
- Adds support for the ``join`` (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of ``smp.DistributedModel`` to be able to train with uneven inputs across participating processes.
- Adds support for ``_register_comm_hook`` (PyTorch 1.7.1 only), which will register the callable as a communication hook for DDP. NOTE: Like in DDP, this is an experimental API and subject to change.
- Tensorflow
- Adds support for Tensorflow 2.4.1
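A minimal sketch of what passing these DDP-style options might look like; the keyword names are taken from the list above and should be verified against the ``smp.DistributedModel`` API documentation:

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()

    # The three options below are assumed keyword arguments of smp.DistributedModel,
    # mirroring torch.nn.parallel.DistributedDataParallel.
    model = smp.DistributedModel(
        torch.nn.Linear(128, 128),
        gradient_as_bucket_view=True,   # PyTorch 1.7.1 only
        find_unused_parameters=False,   # PyTorch 1.7.1 only
        broadcast_buffers=True,
    )

    # The join() context manager (PyTorch 1.7.1 only) allows training with
    # uneven inputs across participating processes.
    with model.join():
        pass  # run the training loop here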
Bug Fixes
- PyTorch
- Serialization: Fix a bug with serialization/flattening where instances of subclasses of dict/OrderedDict were serialized/deserialized or internally flattened/unflattened as regular dicts.
- Tensorflow
- Fix a bug that may cause a hang during evaluation when there is no model input for one partition.
Known Issues
- PyTorch
- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The root cause was found to be the slowdown in performance of ``.grad`` method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: pytorch/pytorch#50636.
- New Features
- Bug Fixes
- Improvements
- Performance
- Known Issues
New Features
The following sections describe new features that are common across frameworks and features that are framework specific.
Common across frameworks
Custom slicing support (``smp_slice`` method) for objects passed to ``smp.step``-decorated functions

To pass an object to ``smp.step`` that contains tensors that need to be split across microbatches and is not an instance of list, dict, tuple, or set, you should implement the ``smp_slice`` method for the object. Below is an example of how to use this with PyTorch:
.. code:: python

    class CustomType:
        def __init__(self, tensor):
            self.data = tensor

        # SMP will call this to invoke slicing on the object, passing in the total
        # number of microbatches (num_mb) and the current microbatch index (mb).
        def smp_slice(self, num_mb, mb, axis):
            dim_size = list(self.data.size())[axis]
            split_size = dim_size // num_mb
            sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
            return CustomType(sliced_tensor)

    custom_obj = CustomType(torch.ones(4,))

    @smp.step()
    def step(custom_obj):
        loss = model(custom_obj)
        model.backward(loss)
        return loss
PyTorch
Add support for ``smp.DistributedModel.cpu()``

``smp.DistributedModel.cpu()`` allgathers parameters and buffers across all ``mp_ranks`` and moves them to the CPU.

Add ``trace_memory_usage`` option to ``smp.DistributedModel`` to measure memory usage per module

Adds the ``trace_memory_usage`` option to ``smp.DistributedModel``. This attempts to measure memory usage per module during tracing. If this is disabled, memory usage is estimated through the sizes of tensors returned from the module. This option is disabled by default.
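A minimal sketch of how these two additions could be used together; the training loop is omitted and the checkpoint filename is purely illustrative:

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()

    # trace_memory_usage is passed at wrap time, per the note above.
    model = smp.DistributedModel(torch.nn.Linear(256, 256), trace_memory_usage=True)

    # ... training ...

    # Allgather parameters and buffers across all mp_ranks and move them to the
    # CPU, e.g. before saving a full copy of the model from a single rank.
    model.cpu()
    if smp.rank() == 0:
        torch.save(model.state_dict(), "model_cpu.pt")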
Bug Fixes
- PyTorch
- ``torch.nn.Sequential``: Fix a bug with ``torch.nn.Sequential`` which causes a failure with the error message ``shouldnt go less than 0, there is a bug`` when the inputs to the first module don't require grads.
- ``smp.DistributedModel``: Fix a bug with ``DistributedModel`` execution when a module has multiple parents. The bug surfaces with the error message: ``actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList``
- ``apex.optimizers.FusedNovoGrad``: Fix a bug with ``apex.optimizers.FusedNovoGrad`` which surfaces with the error message: ``KeyError: 'exp_avg_sq'``
Improvements
Usability
- PyTorch
- ``smp.DistributedModel``: Improve the error message when the forward pass on ``smp.DistributedModel`` is called outside the ``smp.step``-decorated function.
- ``smp.load``: Add user-friendly error messages when loading checkpoints with ``smp.load``.
Partitioning Algorithm
- PyTorch
- Better memory balancing by taking into account the existing modules already assigned to the parent, while partitioning the children of a given module.
Performance
- Tensorflow
- Addresses long pre-processing times introduced by the SMP XLA optimizer when dealing with large graphs and a large number of microbatches. BERT (large) preprocessing time goes down from 40 minutes to 6 minutes on p3.16xlarge.
Known Issues
- PyTorch
- Serialization for Torch in SMP overwrites instances of dict subclasses to be dict itself, instead of instances of the subclass. One of the use cases which fails because of this issue is when a user implements a subclass of OrderedDict with the ``__getitem__`` method. After serialization/deserialization in SMP, indexing on the object will lead to errors. A workaround is to use the dict keys to access the underlying item.