From 0035dff700ed7b5410a67f1fde0400ff831d6f1f Mon Sep 17 00:00:00 2001
From: Ajay P
Date: Mon, 5 Apr 2021 12:28:39 -0700
Subject: [PATCH 1/4] Updated release notes and API docs for smd model
 parallel 1.3.1

---
 .../smd_model_parallel_change_log.md          | 30 ++++++++++++++++++++++++++++++
 .../latest/smd_model_parallel_tensorflow.rst  | 24 ++++++++++++++++--------
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
index 1c573ff45d..1a3790b3e1 100644
--- a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
+++ b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
@@ -1,3 +1,33 @@
+# SageMaker Distributed Model Parallel 1.3.1 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### TensorFlow
+
+- Exposes a new decorator ``register_post_partition_hook``. This allows invoking the decorated method just after model partition but before executing the first step, for example to load a checkpoint. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
+
+## Bug Fixes
+
+### PyTorch
+
+- Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.
+
+### TensorFlow
+
+- Fixed an issue that caused hangs when training some models with XLA enabled.
+
+## Known Issues
+
+### PyTorch
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers such as AdaDelta, when the partition on which this method is called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next release of PyTorch, please call ``optimizer.step()`` only on processes which have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0``.
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in `.grad` method calls in PyTorch 1.7.1 compared to 1.6. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
 # Sagemaker Distributed Model Parallel 1.3.0 Release Notes
 
 - New Features

diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
index 49e6c5363b..9fb32dc78b 100644
--- a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
+++ b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
@@ -102,13 +102,6 @@ TensorFlow API
                       max_to_keep=None,
                       checkpoint_name="ckpt")
 
-
-   **Important:** ``smp.CheckpointManager.restore()`` must be called after
-   the first training step. This is because the first call of the
-   ``smp.step`` function constructs and partitions the model, which must
-   take place before the checkpoint restore. Calling it before the first
-   ``smp.step`` call might result in hangs or unexpected behavior.
-
    **Parameters**
 
 - ``checkpoint``: A `tf.train.Checkpoint
@@ -154,7 +147,22 @@ TensorFlow API
 
 .. code:: python
 
 for step, inputs in enumerate(train_ds):
-    if step == 1:                    # NOTE: restore occurs on the second step
+    if step == 0:                    
         ckpt_manager.restore()
     loss = train_step(inputs)
 
+.. function:: register_post_partition_hook(hook)
+
+   Registers a callable ``hook`` to
+   be executed after the model is partitioned. This is useful in situations
+   where an operation needs to be executed after the model partition during
+   the first call to ``smp.step``, but before the actual execution of the
+   first forward pass.
+
+   .. code:: python 
+
+      @smp.register_post_partition_hook
+      def test_eager():
+          # All statements here will be executed right after partition but before the first forward pass
+          tf.print("Entered hook through eager context")
+
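The ``register_post_partition_hook`` decorator introduced in this patch is intended for work that must run after partitioning but before the first forward pass, such as loading a checkpoint. Below is a minimal sketch of that checkpoint use case, assuming the standard ``smp`` TensorFlow import and a ``ckpt_manager`` created with ``smp.CheckpointManager`` as in the API doc above; the function name ``restore_checkpoint`` is hypothetical.

.. code:: python

   import smdistributed.modelparallel.tensorflow as smp

   @smp.register_post_partition_hook
   def restore_checkpoint():
       # Runs right after the first smp.step call partitions the model,
       # and before the first forward pass executes.
       ckpt_manager.restore()  # assumes a ckpt_manager built as in the docs
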
From c833826d5580edb0659f1aeca511895328355011 Mon Sep 17 00:00:00 2001
From: Ajay P
Date: Mon, 5 Apr 2021 12:35:26 -0700
Subject: [PATCH 2/4] Updated API doc

---
 .../latest/smd_model_parallel_tensorflow.rst  | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
index 9fb32dc78b..cc9a015065 100644
--- a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
+++ b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
@@ -83,7 +83,21 @@ TensorFlow API
     with smp.partition(3):
         z = tf.reduce_sum(y)             # placed in partition 3
 
- ​
+
+.. function:: register_post_partition_hook(hook)
+
+   Registers a callable ``hook`` to
+   be executed after the model is partitioned. This is useful in situations
+   where an operation needs to be executed after the model partition during
+   the first call to ``smp.step``, but before the actual execution of the
+   first forward pass.
+
+   .. code:: python 
+
+      @smp.register_post_partition_hook
+      def test_eager():
+          # All statements here will be executed right after partition but before the first forward pass
+          tf.print("Entered hook through eager context")
 
 .. class:: smp.CheckpointManager
 
@@ -151,18 +165,4 @@ TensorFlow API
         ckpt_manager.restore()
     loss = train_step(inputs)
 
-.. function:: register_post_partition_hook(hook)
-
-   Registers a callable ``hook`` to
-   be executed after the model is partitioned. This is useful in situations
-   where an operation needs to be executed after the model partition during
-   the first call to ``smp.step``, but before the actual execution of the
-   first forward pass.
-
-   .. code:: python 
-
-      @smp.register_post_partition_hook
-      def test_eager():
-          # All statements here will be executed right after partition but before the first forward pass
-          tf.print("Entered hook through eager context")
 
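The ``smp.CheckpointManager`` pattern that these patches update now restores on the first step (``step == 0``) rather than the second. A minimal end-to-end sketch, assuming ``train_ds``, ``train_step``, and a ``tf.train.Checkpoint`` object named ``ckpt`` are defined as in the surrounding RST examples:

.. code:: python

   ckpt_manager = smp.CheckpointManager(checkpoint=ckpt,
                                        max_to_keep=None,
                                        checkpoint_name="ckpt")

   for step, inputs in enumerate(train_ds):
       if step == 0:
           # Restore on the first step, per the updated example in this patch.
           ckpt_manager.restore()
       loss = train_step(inputs)
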
From 35eda59460f798006920bc0ab8d7f14e3e015948 Mon Sep 17 00:00:00 2001
From: Ajay P
Date: Wed, 7 Apr 2021 13:26:43 -0700
Subject: [PATCH 3/4] Modified statements to conform to AWS style

---
 .../smd_model_parallel_change_log.md          | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
index 1a3790b3e1..b775acbfab 100644
--- a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
+++ b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md
@@ -8,7 +8,7 @@
 
 ### TensorFlow
 
-- Exposes a new decorator ``register_post_partition_hook``. This allows invoking the decorated method just after model partition but before executing the first step, for example to load a checkpoint. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
+- Exposes a new decorator ``register_post_partition_hook``. This allows invoking the decorated method just after model partition but before executing the first step, for example to load a checkpoint. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
 
 ## Bug Fixes
 
@@ -24,9 +24,9 @@
 
 ### PyTorch
 
-- A crash was observed when ``optimizer.step()`` was called for certain optimizers such as AdaDelta, when the partition on which this method is called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next release of PyTorch, please call ``optimizer.step()`` only on processes which have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0``.
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers such as AdaDelta, when the partition on which this method is called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next release of PyTorch, call ``optimizer.step()`` only on processes which have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0``.
 
-- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in `.grad` method calls in PyTorch 1.7.1 compared to 1.6. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
 
 # Sagemaker Distributed Model Parallel 1.3.0 Release Notes
 
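For the ``optimizer.step()`` known issue above, the suggested guard can be applied directly in the training loop. A minimal sketch, assuming ``model`` is the ``smp``-wrapped PyTorch model exposing ``local_parameters()`` and ``optimizer`` is one of the affected optimizers such as AdaDelta:

.. code:: python

   # Step the optimizer only on processes that were assigned at least one
   # local parameter during partitioning (workaround for the PyTorch bug
   # fixed in pytorch/pytorch#52944).
   if len(list(model.local_parameters())) > 0:
       optimizer.step()
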
From 2e2c6316ad9e73c72f255f24961af29e99494feb Mon Sep 17 00:00:00 2001
From: Talia <31782251+TEChopra1000@users.noreply.github.com>
Date: Thu, 8 Apr 2021 16:21:57 -0700
Subject: [PATCH 4/4] documentation: removing white space

---
 .../smp_versions/latest/smd_model_parallel_tensorflow.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
index cc9a015065..5f61bf8040 100644
--- a/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
+++ b/doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst
@@ -92,7 +92,7 @@ TensorFlow API
    the first call to ``smp.step``, but before the actual execution of the
    first forward pass.
 
-   .. code:: python 
+   .. code:: python
 
       @smp.register_post_partition_hook
       def test_eager():
@@ -161,7 +161,7 @@ TensorFlow API
 
 .. code:: python
 
 for step, inputs in enumerate(train_ds):
-    if step == 0:                    
+    if step == 0:
         ckpt_manager.restore()
     loss = train_step(inputs)