Release Notes

New features, bug fixes, and improvements are regularly made to the SageMaker model parallelism library.

SageMaker Distributed Model Parallel 1.15.0 Release Notes

Date: Apr. 27. 2023

Currency Updates

  • Added support for PyTorch v2.0.0. Note that the library does not support torch.compile in this release.

New Features

  • Using sharded data parallelism with tensor parallelism together is now available for PyTorch 1.13.1. It allows you to train with smaller global batch sizes while scaling up to large clusters. For more information, see Sharded data parallelism with tensor parallelism in the Amazon SageMaker Developer Guide.
  • Added support for saving and loading full model checkpoints when using sharded data parallelism. This is enabled by using the standard checkpointing API, smp.save_checkpoint with partial=False. Previously, full checkpoints had to be created by merging partial checkpoint files after training finished.
  • DistributedTransformer now supports the ALiBi position embeddings. When using DistributedTransformer, you can set the use_alibi parameter to True to use the Triton-based flash attention kernels. This helps evaluate sequences longer than those used for training.
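To illustrate why ALiBi helps with sequences longer than those seen in training, here is a minimal pure-Python sketch of the general ALiBi idea (a per-head linear penalty on query-key distance). It is not the library's Triton-based implementation; the function names alibi_slopes and alibi_bias are hypothetical, and the slope formula assumes a power-of-two number of heads.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (head + 1) for head in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias added to attention scores for one head.

    Entry [i][j] is -slope * (i - j) for j <= i, and -inf for future positions.
    """
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
# The bias depends only on the distance i - j, so the same function works
# for any sequence length, including lengths not used during training.
short_bias = alibi_bias(slopes[0], 3)
long_bias = alibi_bias(slopes[0], 6)
```

Because no learned position table is involved, the only length-dependent quantity is the size of the bias matrix.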

Bug Fixes

  • When using tensor parallelism, parameters were initialized multiple times unnecessarily. This release fixes the issue so that each parameter is initialized exactly once. This not only saves time, but also ensures that the random generator behavior is similar to the non-tensor parallelism case.

Known issues

  • Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • SageMaker training container for PyTorch v2.0.0
  • SageMaker training container for PyTorch v1.13.1

Binary file of this version of the library for custom container users:

  • For PyTorch v2.0.0
  • For PyTorch v1.13.1

Release History

SageMaker Distributed Model Parallel 1.14.0 Release Notes

Date: Jan. 30. 2023

Currency Updates

  • Added support for PyTorch v1.13.1


Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • SageMaker training container for PyTorch v1.13.1


Binary file of this version of the library for custom container users:

  • For PyTorch 1.13.1

SageMaker Distributed Model Parallel 1.13.0 Release Notes

Date: Dec. 15. 2022

New Features

  • Sharded data parallelism now supports a new backend for collectives called SMDDP Collectives. For supported scenarios, SMDDP Collectives are on by default for the AllGather operation. For more information, see Sharded data parallelism with SMDDP Collectives in the Amazon SageMaker Developer Guide.
  • Introduced FlashAttention for DistributedTransformer to improve memory usage and computational performance of models such as GPT2, GPTNeo, GPTJ, GPTNeoX, BERT, and RoBERTa.

Bug Fixes

  • Fixed initialization of lm_head in DistributedTransformer to use a provided range for initialization, when weights are not tied with the embeddings.


Improvements

  • When a module has no parameters, the library now executes it on the same rank as its parent during pipeline parallelism.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • SageMaker training container for PyTorch v1.12.1


Binary file of this version of the library for custom container users:

  • For PyTorch 1.12.1

SageMaker Distributed Model Parallel 1.11.0 Release Notes

Date: August. 17. 2022

New Features

The following new features are added for PyTorch.

  • The library implements sharded data parallelism, which is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across data parallel groups. With sharded data parallelism, you can reduce the per-GPU memory footprint of a model by sharding the training state over multiple GPUs. To learn more, see Sharded Data Parallelism in the Amazon SageMaker Developer Guide.
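As a rough worked example of the memory saving described above, the following sketch estimates the per-GPU size of the training state when it is sharded evenly. The helper name per_gpu_state_gib is hypothetical, and the figure of 16 bytes per parameter is an assumption (a common estimate for mixed-precision training with Adam: fp16 parameters and gradients plus fp32 master weights and two optimizer moments). Activations and temporary buffers are not counted.

```python
def per_gpu_state_gib(num_params, bytes_per_param, shard_degree):
    """Estimate the training-state footprint per GPU in GiB.

    Assumes parameters, gradients, and optimizer states are all split
    evenly across shard_degree GPUs.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / shard_degree / 2**30


# Assumed numbers for illustration only: a 10-billion-parameter model
# at 16 bytes of training state per parameter.
unsharded = per_gpu_state_gib(10 * 10**9, 16, shard_degree=1)
sharded = per_gpu_state_gib(10 * 10**9, 16, shard_degree=8)
print(f"unsharded: {unsharded:.1f} GiB, sharded over 8 GPUs: {sharded:.1f} GiB")
```

Under these assumptions the footprint shrinks linearly with the sharding degree, which is what makes larger models fit on the same hardware.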

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • DLC for PyTorch 1.12.0


Binary file of this version of the library for custom container users:

  • For PyTorch 1.12.0

SageMaker Distributed Model Parallel 1.10.1 Release Notes

Date: August. 8. 2022

Currency Updates

  • Added support for Transformers v4.21.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • DLC for PyTorch 1.11.0


Binary file of this version of the library for custom container users:

  • For PyTorch 1.11.0

SageMaker Distributed Model Parallel 1.10.0 Release Notes

Date: July. 19. 2022

New Features

The following new features are added for PyTorch.


  • The SageMaker distributed model parallel library manages and optimizes CPU memory by garbage-collecting non-local parameters in general and during checkpointing.
  • Changed the GPT-2 translate functions (smdistributed.modelparallel.torch.nn.huggingface.gpt2) to save memory by not keeping two copies of the weights at the same time.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • DLC for PyTorch 1.11.0

  • DLC for PyTorch 1.12.0


Binary file of this version of the library for custom container users:

  • For PyTorch 1.11.0
  • For PyTorch 1.12.0

SageMaker Distributed Model Parallel 1.9.0 Release Notes

Date: May. 3. 2022

Currency Updates

  • Added support for PyTorch 1.11.0

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

  • PyTorch 1.11.0 DLC


Binary file of this version of the library for custom container users:

SageMaker Distributed Model Parallel 1.8.1 Release Notes

Date: April. 23. 2022

New Features

  • Added support for more configurations of the Hugging Face Transformers GPT-2 and GPT-J models with tensor parallelism: scale_attn_weights, scale_attn_by_inverse_layer_idx, reorder_and_upcast_attn. To learn more about these features, please refer to the following model configuration classes in the Hugging Face Transformers documentation:
  • Added support for activation checkpointing of modules which pass keyword value arguments and arbitrary structures in their forward methods. This helps support activation checkpointing with Hugging Face Transformers models even when tensor parallelism is not enabled.

Bug Fixes

  • Fixed a correctness issue with tensor parallelism for the GPT-J model, which was due to improper scaling during gradient reduction for some layer normalization modules.
  • Fixed the creation of unnecessary additional processes which take up some GPU memory on GPU 0 when the smp.allgather collective is called.


Improvements

  • Improved activation offloading so that activations are preloaded on a per-layer basis, instead of preloading all activations for a micro batch at once as before. This not only improves memory efficiency and performance, but also makes activation offloading a useful feature for non-pipeline parallelism cases.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

  • HuggingFace 4.17.0 DLC with PyTorch 1.10.2
  • The binary file of this version of the library for custom container users

SageMaker Distributed Model Parallel 1.8.0 Release Notes

Date: March. 23. 2022

New Features

  • Added tensor parallelism support for the GPT-J model. When using the GPT-J model of Hugging Face Transformers v4.17.0 with tensor parallelism, the SageMaker model parallel library automatically replaces the model with a tensor parallel distributed GPT-J model. For more information, see Support for Hugging Face Transformer Models in the Amazon SageMaker Model Parallel Training developer guide.

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

  • HuggingFace 4.17.0 DLC with PyTorch 1.10.2

The binary file of this version of the library for custom container users:

SageMaker Distributed Model Parallel 1.7.0 Release Notes

Date: March. 07. 2022

Currency Updates

  • Support for PyTorch 1.10.2
  • Support for Hugging Face Transformers 4.16.2


Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

  • PyTorch 1.10.2


SageMaker Distributed Model Parallel 1.6.0 Release Notes

Date: December. 20. 2021

New Features

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Container(s):

  • Deep Learning Container for PyTorch 1.8.1:


SageMaker Distributed Model Parallel 1.5.0 Release Notes

Date: November. 03. 2021

New Features

  • PyTorch
    • Currency update for PyTorch 1.10.0

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

  • Deep Learning Container for PyTorch 1.10.0:


SageMaker Distributed Model Parallel 1.4.0 Release Notes

Date: June. 29. 2021

New Features

  • TensorFlow
    • Added support for TensorFlow v2.5.0.
    • Added support for

Migration to AWS Deep Learning Containers

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

  • Deep Learning Container for TensorFlow 2.5.0:

  • Deep Learning Container for PyTorch 1.9.1:


SageMaker Distributed Model Parallel 1.3.1 Release Notes

  • New Features
  • Bug Fixes
  • Known Issues

New Features

  • TensorFlow
    • Exposes a new decorator register_post_partition_hook. This allows invoking the decorated methods just after model partition but before executing the first step, for example to load a checkpoint. Refer to the SageMaker distributed model parallel API documentation for more information.

Bug Fixes

  • PyTorch
    • Improved memory efficiency when using active microbatches by clearing activations at end of each microbatch.
  • TensorFlow
    • Fixed issue that caused hangs when training some models with XLA enabled.

Known Issues

  • PyTorch
    • A crash was observed when optimizer.step() was called for certain optimizers such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until that fix reaches the next PyTorch release, only call optimizer.step() on processes that have at least one local parameter. You can check this with len(list(model.local_parameters())) > 0.
    • A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be the slowdown in performance of .grad method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: pytorch/pytorch#50636. This issue does not exist with PyTorch 1.8.
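The optimizer.step() workaround from the first known issue above can be sketched as a small guard. The helper name step_if_local_parameters is hypothetical; the only library call it relies on is model.local_parameters(), as named in the note. The stub classes exist only so the sketch runs without the library and are not part of any real API.

```python
def step_if_local_parameters(model, optimizer):
    """Call optimizer.step() only if this process owns at least one parameter.

    Returns True when the step was taken, False when it was skipped.
    """
    has_local_params = len(list(model.local_parameters())) > 0
    if has_local_params:
        optimizer.step()
    return has_local_params


# Stand-ins for a partitioned model and an optimizer (illustration only).
class StubModel:
    def __init__(self, params):
        self._params = params

    def local_parameters(self):
        return iter(self._params)


class StubOptimizer:
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


optimizer = StubOptimizer()
step_if_local_parameters(StubModel(["weight"]), optimizer)  # step is taken
step_if_local_parameters(StubModel([]), optimizer)          # step is skipped
```

In a real training script, the same guard would wrap the existing optimizer.step() call on every rank.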

SageMaker Distributed Model Parallel 1.3.0 Release Notes

  • New Features
  • Bug Fixes
  • Known Issues

New Features

  • PyTorch

    Add support for PyTorch 1.8

    • Adds a new method to DistributedModel register_comm_hook (for PyTorch 1.8 and newer only). This method behaves the same as the corresponding method with the same name in torch.DistributedDataParallel API. Refer to the SageMaker distributed model parallel API documentation for more information.


  • Adds a configuration active_microbatches to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Refer to the SageMaker Python SDK parameters API documentation for more information.
  • Adds a configuration deterministic_server to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the SageMaker Python SDK parameters API documentation for more information.
  • Parameter passing is now supported in module.forward methods for DistributedModel and its submodules. This removes the restriction of having to pass nn.Parameter to the __init__ call and making it a member of the module to use it.

Bug Fixes

  • PyTorch
    • Fixed a case where training hangs because a module has computation that requires grads but is not used by the final output of the module. Such a situation now raises an error with suggestions on making the computation compatible.
    • Fixed an issue with buffers which caused the buffers to not be on the correct device after a model is partitioned, and not be synchronized across steps (when broadcast_buffers is True). This could have caused correctness issues in models with buffers.

Known Issues

  • PyTorch
    • mp_barrier and get_mp_process_group are wrongly marked as deprecated methods. Ignore the deprecation warning.
    • A crash was observed when optimizer.step() was called for certain optimizers such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until that fix reaches the next PyTorch release, only call optimizer.step() on processes that have at least one local parameter. You can check this with len(list(model.local_parameters())) > 0.
    • A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be the slowdown in performance of .grad method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: pytorch/pytorch#50636. This issue does not exist with PyTorch 1.8.

SageMaker Distributed Model Parallel 1.2.0 Release Notes

  • New Features
  • Bug Fixes
  • Known Issues

New Features

  • PyTorch

    Add support for PyTorch 1.7.1

    • Adds support for gradient_as_bucket_view (PyTorch 1.7.1 only), find_unused_parameters (PyTorch 1.7.1 only) and broadcast_buffers options to smp.DistributedModel. These options behave the same as the corresponding options (with the same names) in torch.DistributedDataParallel API. Refer to the SageMaker distributed model parallel API documentation for more information.
    • Adds support for join (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of smp.DistributedModel to be able to train with uneven inputs across participating processes.
    • Adds support for _register_comm_hook (PyTorch 1.7.1 only) which will register the callable as a communication hook for DDP. NOTE: Like in DDP, this is an experimental API and subject to change.
  • TensorFlow
    • Adds support for TensorFlow 2.4.1

Bug Fixes

  • PyTorch
    • Serialization: Fix a bug with serialization/flattening where instances of subclasses of dict/OrderedDicts were serialized/deserialized or internally flattened/unflattened as regular dicts.
  • TensorFlow
    • Fix a bug that may cause a hang during evaluation when there is no model input for one partition.

Known Issues

  • PyTorch
    • A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The root cause was found to be the slowdown in performance of .grad method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: pytorch/pytorch#50636.

SageMaker Distributed Model Parallel 1.1.0 Release Notes

  • New Features
  • Bug Fixes
  • Improvements
  • Performance
  • Known Issues

New Features

The following sections describe new feature releases that are common across frameworks and that are framework specific.

Common across frameworks

  • Custom slicing support (smp_slice method) for objects passed to smp.step decorated functions

    To pass an object to smp.step that contains tensors that need to be split across microbatches and is not an instance of list, dict, tuple or set, you should implement the smp_slice method for the object.

    Below is an example of how to use this with PyTorch:

    import torch
    import smdistributed.modelparallel.torch as smp

    class CustomType:
        def __init__(self, tensor):
            self.data = tensor

        # SMP calls this to slice the object, passing the total number of microbatches (num_mb),
        # the current microbatch index (mb), and the axis to split along.
        def smp_slice(self, num_mb, mb, axis):
            dim_size = list(self.data.size())[axis]
            split_size = dim_size // num_mb
            sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
            return CustomType(sliced_tensor)

    custom_obj = CustomType(torch.ones(4,))

    # model is assumed to be an smp.DistributedModel created elsewhere.
    @smp.step
    def step(custom_obj):
        loss = model(custom_obj)
        return loss
  • PyTorch

    • Add support for smp.DistributedModel.cpu()

      smp.DistributedModel.cpu() allgathers parameters and buffers across all mp_ranks and moves them to the CPU.

    • Add trace_memory_usage option to smp.DistributedModel to measure memory usage per module

      Adds trace_memory_usage option to smp.DistributedModel. This attempts to measure memory usage per module during tracing. If this is disabled, memory usage is estimated through the sizes of tensors returned from the module. This option is disabled by default.

Bug Fixes

  • PyTorch
    • torch.nn.Sequential: Fix a bug with torch.nn.Sequential that causes a failure with the error message "shouldnt go less than 0, there is a bug" when the inputs to the first module don’t require grads.
    • smp.DistributedModel: Fix a bug with DistributedModel execution when a module has multiple parents. The bug surfaces with the error message: actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList
    • apex.optimizers.FusedNovoGrad: Fix a bug with apex.optimizers.FusedNovoGrad which surfaces with the error message: KeyError: 'exp_avg_sq'



Improvements

  • PyTorch
    • smp.DistributedModel: Improve the error message when the forward pass on smp.DistributedModel is called outside the smp.step decorated function.
    • smp.load: Add user friendly error messages when loading checkpoints with smp.load.

Partitioning Algorithm

  • PyTorch
    • Better memory balancing by taking into account the existing modules already assigned to the parent, while partitioning the children of a given module.


Performance

  • TensorFlow
    • Addresses long pre-processing times introduced by SMP XLA optimizer when dealing with large graphs and large number of microbatches. BERT (large) preprocessing time goes down from 40 minutes to 6 minutes on p3.16xlarge.

Known Issues

  • PyTorch
    • Serialization for Torch in SMP overwrites instances of dict subclass to be dict itself, instead of the instances of subclass. One of the use cases which fails because of this issue is when a user implements a subclass of OrderedDict with the __getitem__ method. After serialization/deserialization in SMP, indexing on the object will lead to errors. A workaround is to use the dict keys to access the underlying item.
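The failure mode and workaround described in this known issue can be illustrated with a small pure-Python sketch. The class IndexableDict is hypothetical, and converting with dict(...) is only a stand-in for the effect of the serialization round trip described above (the subclass instance comes back as a plain dict); it does not call the library.

```python
from collections import OrderedDict


class IndexableDict(OrderedDict):
    """Hypothetical OrderedDict subclass that also accepts integer positions."""

    def __getitem__(self, key):
        if isinstance(key, int):
            key = list(self.keys())[key]
        return super().__getitem__(key)


outputs = IndexableDict(loss=0.25, logits=[1.0, 2.0])
first_item = outputs[0]  # positional indexing works on the subclass

# Stand-in for the round trip: the object comes back as a plain dict.
flattened = dict(outputs)

try:
    flattened[0]
    positional_lookup_failed = False
except KeyError:
    # A plain dict has no positional indexing, so this lookup fails.
    positional_lookup_failed = True

# Workaround: access the item by its dict key instead.
loss = flattened["loss"]
```

Accessing values by key works both before and after the round trip, so code written this way is unaffected by the issue.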