Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (#114)

* Update sagemaker.md (#250)

* Bumping version to 0.9.0 (#251)

* Skip using standalone keras Py3.7+ (#253)

* Gradtape zcc (#252)

* Fix Incorrect Log Statement (#256)

* Incorrect number of tensors saved with MirroredStrategy (#257)

* Change Version to 0.8.1 (#258)

* Save Scalars With Mirrored Strategy (#259)

* skip flaky test (#262)

* Don't export to collections for all workers with unsupported distrib training (#263)

* version bump (#265)

* Avoiding Basehook object pickling (#266)

* handle eager tensors (#271)

* TF 2.x: Support for keras to estimator (#268)

* Revert "TF 2.x: Support for keras to estimator (#268)" (#273)

This reverts commit 749bded.

* Disable TB Testing  (#275)

* Support for TF 2 estimator (#274)

* Adding a TF2 Hvd example and test (#279)

* Moved end of training log from info to debug (#281)

#280

* Adding action class (#285)

* Adding action class
Actions added: stop training job, email, sms

* Fix buildspec used for PR CI (#287)

* Adding a test to check that PT model is saved without issues (#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (#282)
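As a rough illustration of what #282 enables (not the PR's own example), the collections named in its title can be requested on a KerasHook; the collection names below follow the CollectionKeys constants that appear in the tests later in this commit.

```python
# Sketch only: enabling the collections named in #282 on a KerasHook.
import smdebug.tensorflow as smd
from smdebug.tensorflow import CollectionKeys, SaveConfig

hook = smd.KerasHook(
    out_dir="/tmp/smdebug_run",
    save_config=SaveConfig(save_interval=1),
    include_collections=[
        CollectionKeys.INPUTS,     # model inputs
        CollectionKeys.OUTPUTS,    # labels and predictions
        CollectionKeys.GRADIENTS,  # gradients
        CollectionKeys.LAYERS,     # layer inputs and outputs
    ],
)
# Pass `callbacks=[hook]` to model.fit() to record these tensors.
```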

* Pin pytest version (#293)

* Load IRIS Dataset from S3 (#298)

* Load dataset from s3 (#299)

* remove problematic log (#300)

* Change Enum (#301)

* Doc update (#292)

* rename enum (#305)

* version bump to 0.9.1 (#304)

* modify asserts (#307)

* version compare (#306)

* Support TF 2.3 Tests (#312)

* Disable TB in ZCC for AWS TF 2.3.0 (#316)

* Update Assert Statements For New TF 2.2.0 DLC (#320)

* Version Bump (#319)

* add a note for TF 2.2 limited support (#303)


Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

* TF 2.2 documentation update  (#322)

* update TF 2.2 smdebug features
* Update code samples and notes for the new Python SDK and smdebug; add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>

* Adding pagination in list_training_jobs (#323)

* Adding pagination in list_training_jobs

* Test Custom Step Usecase (#331)

* save tf2 model (#333)

* Add ability to only save shapes of tensors (#328)

* Revert "Add ability to only save shapes of tensors (#328)" (#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (#332)

* Default hook config (#338)

* version bump (#339)

* TF ZCC limitation footnote (#342)

* Ability to save shapes (#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <nihal42harish@gmail.com>
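A rough usage sketch of the shape-saving path described by the #341 sub-commits above, assuming the reduction config exposes a save_shape flag and the trial read API exposes shape(); the exact names in the released API may differ.

```python
# Sketch only, under the assumptions stated above; not the PR's own example.
import smdebug.tensorflow as smd
from smdebug.tensorflow import ReductionConfig, SaveConfig

hook = smd.KerasHook(
    out_dir="/tmp/smdebug_shapes",
    save_all=True,
    save_config=SaveConfig(save_interval=1),
    reduction_config=ReductionConfig(save_shape=True),  # record shapes, not values
)

# ... train with `callbacks=[hook]` ...

trial = smd.create_trial("/tmp/smdebug_shapes")
for tname in trial.tensor_names():
    print(tname, trial.tensor(tname).shape(step_num=0))  # read API added in #341
```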

* Support Inputs and Labels in the dict format (#345)

* 0.9.4 (#347)

* Refactor Make Numpy Array (#329)

* warn gradtape users  about tf.function support (#348)

* Support all tf types (#346)

* Model Subclassing Test (#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (#352)

* framework version updates (#360)

* list training jobs improvements (#349)

* Earlier, list_training_jobs would make 50 attempts regardless, which could generate unnecessary traffic. Now (see the sketch below):
* if training jobs are found with the prefix, we break
* if exceptions are caught more than 5 times, we break
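A minimal sketch of those break conditions, using the standard boto3 list_training_jobs call; the helper name and structure here are illustrative, not smdebug's actual code.

```python
# Illustrative sketch of the early-exit pagination described above.
import boto3


def list_training_jobs_with_prefix(prefix, max_attempts=50, max_exceptions=5):
    client = boto3.client("sagemaker")
    jobs, next_token, exceptions_seen = [], None, 0
    for _ in range(max_attempts):
        try:
            kwargs = {"NameContains": prefix}
            if next_token:
                kwargs["NextToken"] = next_token
            response = client.list_training_jobs(**kwargs)
        except Exception:
            exceptions_seen += 1
            if exceptions_seen > max_exceptions:
                break  # too many failures, stop retrying
            continue
        jobs.extend(response["TrainingJobSummaries"])
        if jobs:
            break  # jobs found with this prefix, no need to keep paginating
        next_token = response.get("NextToken")
        if not next_token:
            break  # no more pages
    return jobs
```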

* Handle Deprecation Of experimental_ref api (#356)

* check file exist before moving (#364)

* Check that the file exists before moving it when closing the file.
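A minimal sketch of that guard, assuming a plain os.rename-based move; the actual writer code may differ.

```python
import os


def safe_move(src_path, dst_path):
    # Only move the temp file if it actually exists; closing a writer that
    # never wrote anything should not raise FileNotFoundError here.
    if os.path.exists(src_path):
        os.rename(src_path, dst_path)
```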

* Support Saving Tensors in Graph Mode with add_for_mode (#353)

* Change layer name logic (#357)

* Pass Variable Length Argument To Old Function Call (#366)

* test concat layers (#367)

* Update README.md (#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (#373)

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (#374)

* returning list instead of dict keys (#376)

Fix the return of _get_sm_tj_jobs_with_prefix; this function should always return a list.
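The pattern behind the fix, in isolation: dict.keys() returns a view object, so callers that expect a list (for example, for indexing) can break; wrapping it in list() makes the return type predictable. This is a generic illustration, not the function's real body.

```python
jobs = {"job-prefix-a": "InProgress", "job-prefix-b": "Completed"}

keys_view = jobs.keys()  # dict_keys view: supports iteration but not indexing
names = list(keys_view)  # always a plain list, which is what callers expect

print(names[0])          # safe: "job-prefix-a"
```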

* Add support for mixed precision training (#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (#380)

* pytorch tmp (#382)

* extend zcc to 2.1.2 (#384)

* disable pytorch (#386)

* Removed the redundant installation of smdebug and smdebug-rules (#391)

* Incrementing the version to 0.9.5 (#396)

* pin tensorflow dataset in test config (#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>
9 people committed Nov 6, 2020
1 parent 29fd12e commit a996f1e
Showing 13 changed files with 75 additions and 29 deletions.
2 changes: 1 addition & 1 deletion config/buildspec_vanilla_framework_tests.yml
@@ -18,7 +18,7 @@ phases:
- sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0
- cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh
- pip install --upgrade pip==19.3.1
- pip install -q pytest==5.3.3 pytest-cov wheel pyYaml pytest-html keras==2.3.1 mxnet torch xgboost pre-commit tensorflow_datasets torchvision
- pip install -q pytest pytest-cov wheel pyYaml pytest-html keras==2.3.1 mxnet torch xgboost pre-commit tensorflow_datasets==4.0.1 torchvision
- cd $CODEBUILD_SRC_DIR && chmod +x config/install_smdebug.sh && chmod +x config/check_smdebug_install.sh && ./config/install_smdebug.sh;

build:
2 changes: 1 addition & 1 deletion config/install_smdebug.sh
@@ -75,7 +75,7 @@ if [ "$run_pytest_mxnet" == 'enable' ]; then
fi
if [ "$run_pytest_tensorflow" == 'enable' ]; then
./config/check_smdebug_install.sh tensorflow
pip install tensorflow_datasets
pip install tensorflow_datasets==4.0.1
fi
if [ "$run_pytest_pytorch" == 'enable' ]; then
./config/check_smdebug_install.sh torch
2 changes: 1 addition & 1 deletion config/tests.sh
@@ -73,7 +73,7 @@ if [ "$run_pytest_tensorflow" = "enable" ] ; then
fi

if [ "$run_pytest_tensorflow2" = "enable" ] ; then
pip install tensorflow_datasets
pip install tensorflow_datasets==4.0.1
run_for_framework tensorflow2
run_profiler_test tensorflow2
fi
2 changes: 1 addition & 1 deletion smdebug/_version.py
@@ -1 +1 @@
__version__ = "0.9.4"
__version__ = "0.9.5"
1 change: 1 addition & 0 deletions smdebug/pytorch/hook.py
@@ -28,6 +28,7 @@
except ImportError:
herring = None


DEFAULT_INCLUDE_COLLECTIONS = [CollectionKeys.LOSSES]


2 changes: 0 additions & 2 deletions smdebug/tensorflow/keras.py
@@ -152,8 +152,6 @@ def register_model(self, model):
# It attaches a hook to every layer of the model to capture
# layer values
self.model = model
if self.tape is not None:
self._wrap_model_with_input_output_saver()
self._wrap_model_with_input_output_saver()
self.has_registered_model = True

26 changes: 21 additions & 5 deletions tests/tensorflow2/test_keras.py
@@ -7,8 +7,10 @@
`python tests/tensorflow2/test_keras.py` from the main directory.
"""
# Standard Library
import json
import re
import time
from pathlib import Path

# Third Party
import pytest
@@ -27,6 +29,7 @@
from smdebug.core.modes import ModeKeys
from smdebug.core.reduction_config import ALLOWED_NORMS, ALLOWED_REDUCTIONS
from smdebug.exceptions import TensorUnavailableForStep
from smdebug.profiler.profiler_constants import DEFAULT_PREFIX
from smdebug.tensorflow import ReductionConfig, SaveConfig


@@ -558,7 +561,7 @@ def test_include_regex(out_dir, tf_eager_mode):

tr = create_trial_fast_refresh(out_dir)
tnames = tr.tensor_names(collection="custom_coll")
assert len(tnames) == 12
assert len(tnames) == (12 if is_tf_2_2() else 4)
for tname in tnames:
assert tr.tensor(tname).value(0) is not None

@@ -729,10 +732,7 @@ def test_keras_fit_pure_eager(out_dir, tf_eager_mode):
helper_keras_fit(trial_dir=out_dir, hook=hook, eager=tf_eager_mode, run_eagerly=True)

trial = smd.create_trial(path=out_dir)
if is_tf_2_2():
assert len(trial.tensor_names()) == 27
else:
assert len(trial.tensor_names()) == (20 if is_tf_2_3() else 21)
assert len(trial.tensor_names()) == (27 if is_tf_2_2() else 13)
assert len(trial.tensor_names(collection=CollectionKeys.BIASES)) == 2
assert len(trial.tensor_names(collection=CollectionKeys.WEIGHTS)) == 2
assert len(trial.tensor_names(collection=CollectionKeys.OPTIMIZER_VARIABLES)) == 5
@@ -882,3 +882,19 @@ def test_save_layer_inputs_and_outputs(out_dir, tf_eager_mode):
"dense_1/inputs"
).value(0)
assert boolean_matrix.all()


def test_hook_timeline_file_write(set_up_smprofiler_config_path, out_dir, tf_eager_mode):
hook = smd.KerasHook(out_dir=out_dir, save_all=False)
helper_keras_fit(trial_dir=out_dir, hook=hook, eager=tf_eager_mode, steps=["train", "eval"])

files = []
for path in Path(out_dir + "/" + DEFAULT_PREFIX).rglob("*.json"):
files.append(path)

assert len(files) == 1

with open(files[0]) as timeline_file:
events_dict = json.load(timeline_file)

assert events_dict
12 changes: 9 additions & 3 deletions tests/tensorflow2/test_model_subclassing.py
@@ -2,6 +2,7 @@
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Flatten
from tensorflow.keras.models import Model
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -78,7 +79,12 @@ def test_subclassed_model(out_dir):
trial = smd.create_trial(out_dir)
assert len(trial.tensor_names(collection=smd.CollectionKeys.LAYERS)) == 8

assert trial.tensor_names(collection=smd.CollectionKeys.INPUTS) == ["model_input"]
assert trial.tensor_names(collection=smd.CollectionKeys.OUTPUTS) == ["labels", "predictions"]
assert trial.tensor_names(collection=smd.CollectionKeys.LOSSES) == ["loss"]
assert len(trial.tensor_names(collection=smd.CollectionKeys.GRADIENTS)) == 6
if is_tf_2_2():
# Feature to save model inputs and outputs was first added for TF 2.2.0
assert trial.tensor_names(collection=smd.CollectionKeys.INPUTS) == ["model_input"]
assert trial.tensor_names(collection=smd.CollectionKeys.OUTPUTS) == [
"labels",
"predictions",
]
assert len(trial.tensor_names(collection=smd.CollectionKeys.GRADIENTS)) == 6
6 changes: 6 additions & 0 deletions tests/tensorflow2/test_support_dicts.py
@@ -1,6 +1,8 @@
# Third Party
import numpy as np
import pytest
import tensorflow as tf
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -29,6 +31,10 @@ def create_model():
return model


@pytest.mark.skipif(
is_tf_2_2() is False,
reason="Feature to save model inputs and outputs was first added for TF 2.2.0",
)
def test_support_dicts(out_dir):
model = create_model()
optimizer = tf.keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
21 changes: 11 additions & 10 deletions tests/tensorflow2/test_tensorflow2_datatypes.py
@@ -1,31 +1,32 @@
# Third Party
import numpy as np
from packaging import version
import pytest
from tensorflow.python.framework.dtypes import _NP_TO_TF
from tests.tensorflow2.utils import is_tf_2_2

# First Party
from smdebug.core.tfevent.util import _get_proto_dtype


@pytest.mark.skipif(
is_tf_2_2() is False, reason="Brain Float Is Unavailable in lower versions of TF"
)
def test_tensorflow2_datatypes():
# _NP_TO_TF contains all the mappings
# of numpy to tf types
try:
from tensorflow import __version__ as tf_version
from tensorflow.python import _pywrap_bfloat16

if version.parse(tf_version) >= version.parse("2.0.0"):
from tensorflow.python import _pywrap_bfloat16

# TF 2.x.x Implements a Custom Numpy Datatype for Brain Floating Type
# Which is currently only supported on TPUs
_np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
_NP_TO_TF.pop(_np_bfloat16)
# TF 2.x.x Implements a Custom Numpy Datatype for Brain Floating Type
# Which is currently only supported on TPUs
_np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
_NP_TO_TF.pop(_np_bfloat16)
except (ModuleNotFoundError, ValueError, ImportError):
pass

for _type in _NP_TO_TF:
try:
_get_proto_dtype(np.dtype(_type))
except Exception:
assert False
assert False, f"{_type} not supported"
assert True
16 changes: 14 additions & 2 deletions tests/zero_code_change/pt_utils.py
@@ -7,22 +7,34 @@
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from packaging import version


def get_dataloaders() -> Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]:
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# Temporary Change to allow the test to run with pytorch 1.7 RC3
# Smdebug breaks when num_workers>0 for Pytorch 1.7.0
if version.parse(torch.__version__) >= version.parse("1.7.0"):
num_workers = 0
else:
num_workers = 2

trainset = torchvision.datasets.CIFAR10(
root="./data", train=True, download=True, transform=transform
)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
trainloader = torch.utils.data.DataLoader(
trainset, batch_size=4, shuffle=True, num_workers=num_workers
)

testset = torchvision.datasets.CIFAR10(
root="./data", train=False, download=True, transform=transform
)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
testloader = torch.utils.data.DataLoader(
testset, batch_size=4, shuffle=False, num_workers=num_workers
)

classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck")
return trainloader, testloader
5 changes: 5 additions & 0 deletions tests/zero_code_change/test_pytorch_integration.py
@@ -12,6 +12,7 @@

# Third Party
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
@@ -22,6 +23,10 @@
from smdebug.core.utils import SagemakerSimulator, ScriptSimulator


@pytest.mark.skipif(
torch.__version__ == "1.7.0",
reason="Disabling the test temporarily until we root cause the version incompatibility",
)
@pytest.mark.parametrize("script_mode", [False])
@pytest.mark.parametrize("use_loss_module", [True, False])
def test_pytorch(script_mode, use_loss_module):
@@ -12,7 +12,8 @@
# Third Party
import pytest
import tensorflow.compat.v2 as tf
from tests.tensorflow2.utils import is_tf_2_2, is_tf_2_3
from packaging import version
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -104,8 +105,8 @@ def helper_test_keras_v2_gradienttape(
print(log)
train_acc_metric.reset_states()
hook = smd.get_hook()
if not (is_tf_2_2() or is_tf_2_3()):
assert not hook # only supported on TF 2.2 and greater
if version.parse(tf.__version__) < version.parse("2.1.2"):
assert not hook # only supported on TF 2.1.2 and greater
return
assert hook
hook.close()
