Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (#114)

* Update sagemaker.md (#250)

* Bumping version to 0.9.0 (#251)

* Skip using standalone keras Py3.7+ (#253)

* Gradtape zcc (#252)

* Fix Incorrect Log Statement (#256)

* Incorrect number of tensors saved with MirroredStrategy (#257)

* Change Version to 0.8.1 (#258)

* Save Scalars With Mirrored Strategy (#259)

* skip flaky test (#262)

* Don't export to collections for all workers with unsupported distrib training (#263)

* version bump (#265)

* Avoiding Basehook object pickling (#266)

* handle eager tensors (#271)

* TF 2.x: Support for keras to estimator (#268)

* Revert "TF 2.x: Support for keras to estimator (#268)" (#273)

This reverts commit 749bded.

* Disable TB Testing  (#275)

* Support for TF 2 estimator (#274)

* Adding a TF2 Hvd example and test (#279)

* Moved end of training log from info to debug (#281)

#280

* Adding action class (#285)

* Adding action class
Actions added: stop training job, email, sms

* Fix buildspec used for PR CI (#287)

* Adding a test to check that PT model is saved without issues (#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (#282)
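As a rough illustration of what #282 enables (not the PR's own example), the collections named in its title can be requested on a KerasHook; the collection names below follow the CollectionKeys constants that appear in the tests later in this commit.

```python
# Sketch only: enabling the collections named in #282 on a KerasHook.
import smdebug.tensorflow as smd
from smdebug.tensorflow import CollectionKeys, SaveConfig

hook = smd.KerasHook(
    out_dir="/tmp/smdebug_run",
    save_config=SaveConfig(save_interval=1),
    include_collections=[
        CollectionKeys.INPUTS,     # model inputs
        CollectionKeys.OUTPUTS,    # labels and predictions
        CollectionKeys.GRADIENTS,  # gradients
        CollectionKeys.LAYERS,     # layer inputs and outputs
    ],
)
# Pass `callbacks=[hook]` to model.fit() to record these tensors.
```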

* Pin pytest version (#293)

* Load IRIS Dataset from S3 (#298)

* Load dataset from s3 (#299)

* remove problematic log (#300)

* Change Enum (#301)

* Doc update (#292)

* rename enum (#305)

* version bump to 0.9.1 (#304)

* modify asserts (#307)

* version compare (#306)

* Support TF 2.3 Tests (#312)

* Disable TB in ZCC for AWS TF 2.3.0 (#316)

* Update Assert Statements For New TF 2.2.0 DLC (#320)

* Version Bump (#319)

* add a note for TF 2.2 limited support (#303)


Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Nihal Harish <nihal42harish@gmail.com>

* TF 2.2 documentation update  (#322)

* update TF 2.2 smdebug features
* Update code samples and notes for the new Python SDK and smdebug; add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>

* Adding pagination in list_training_jobs (#323)

* Adding pagination in list_training_jobs

* Test Custom Step Usecase (#331)

* save tf2 model (#333)

* Add ability to only save shapes of tensors (#328)

* Revert "Add ability to only save shapes of tensors (#328)" (#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (#332)

* Default hook config (#338)

* version bump (#339)

* TF ZCC limitation footnote (#342)

* Ability to save shapes (#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <nihal42harish@gmail.com>
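A rough usage sketch of the shape-saving path described by the #341 sub-commits above, assuming the reduction config exposes a save_shape flag and the trial read API exposes shape(); the exact names in the released API may differ.

```python
# Sketch only, under the assumptions stated above; not the PR's own example.
import smdebug.tensorflow as smd
from smdebug.tensorflow import ReductionConfig, SaveConfig

hook = smd.KerasHook(
    out_dir="/tmp/smdebug_shapes",
    save_all=True,
    save_config=SaveConfig(save_interval=1),
    reduction_config=ReductionConfig(save_shape=True),  # record shapes, not values
)

# ... train with `callbacks=[hook]` ...

trial = smd.create_trial("/tmp/smdebug_shapes")
for tname in trial.tensor_names():
    print(tname, trial.tensor(tname).shape(step_num=0))  # read API added in #341
```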

* Support Inputs and Labels in the dict format (#345)

* 0.9.4 (#347)

* Refactor Make Numpy Array (#329)

* warn gradtape users  about tf.function support (#348)

* Support all tf types (#346)

* Model Subclassing Test (#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (#352)

* framework version updates (#360)

* list training jobs improvements (#349)

* Earlier, list_training_jobs would make 50 attempts regardless, which could generate unnecessary traffic. Now (see the sketch below):
* if training jobs are found with the prefix, we break
* if exceptions are caught more than 5 times, we break
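A minimal sketch of those break conditions, using the standard boto3 list_training_jobs call; the helper name and structure here are illustrative, not smdebug's actual code.

```python
# Illustrative sketch of the early-exit pagination described above.
import boto3


def list_training_jobs_with_prefix(prefix, max_attempts=50, max_exceptions=5):
    client = boto3.client("sagemaker")
    jobs, next_token, exceptions_seen = [], None, 0
    for _ in range(max_attempts):
        try:
            kwargs = {"NameContains": prefix}
            if next_token:
                kwargs["NextToken"] = next_token
            response = client.list_training_jobs(**kwargs)
        except Exception:
            exceptions_seen += 1
            if exceptions_seen > max_exceptions:
                break  # too many failures, stop retrying
            continue
        jobs.extend(response["TrainingJobSummaries"])
        if jobs:
            break  # jobs found with this prefix, no need to keep paginating
        next_token = response.get("NextToken")
        if not next_token:
            break  # no more pages
    return jobs
```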

* Handle Deprecation Of experimental_ref api (#356)

* check file exist before moving (#364)

* Check that the file exists before moving it when closing the file.
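A minimal sketch of that guard, assuming a plain os.rename-based move; the actual writer code may differ.

```python
import os


def safe_move(src_path, dst_path):
    # Only move the temp file if it actually exists; closing a writer that
    # never wrote anything should not raise FileNotFoundError here.
    if os.path.exists(src_path):
        os.rename(src_path, dst_path)
```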

* Support Saving Tensors in Graph Mode with add_for_mode (#353)

* Change layer name logic (#357)

* Pass Variable Length Argument To Old Function Call (#366)

* test concat layers (#367)

* Update README.md (#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (#373)

Co-authored-by: NihalHarish <nihal42harish@gmail.com>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (#374)

* returning list instead of dict keys (#376)

Fix the return of _get_sm_tj_jobs_with_prefix; this function should always return a list.
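The pattern behind the fix, in isolation: dict.keys() returns a view object, so callers that expect a list (for example, for indexing) can break; wrapping it in list() makes the return type predictable. This is a generic illustration, not the function's real body.

```python
jobs = {"job-prefix-a": "InProgress", "job-prefix-b": "Completed"}

keys_view = jobs.keys()  # dict_keys view: supports iteration but not indexing
names = list(keys_view)  # always a plain list, which is what callers expect

print(names[0])          # safe: "job-prefix-a"
```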

* Add support for mixed precision training (#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (#380)

* pytorch tmp (#382)

* extend zcc to 2.1.2 (#384)

* disable pytorch (#386)

* Removed the redundant installation of smdebug and smdebug-rules (#391)

* Incrementing the version to 0.9.5 (#396)

* pin tensorflow dataset in test config (#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Vikas-kum <vikumar@amazon.com>
Co-authored-by: Vandana Kannan <vandanavk@users.noreply.github.com>
Co-authored-by: Anirudh <anirudhkrec@gmail.com>
Co-authored-by: Miyoung <myoung8739@gmail.com>
Co-authored-by: Miyoung Choi <cmiyoung@amazon.com>
Co-authored-by: Rahul Huilgol <huilgolr@amazon.com>
Co-authored-by: Amol Lele <19983848+leleamol@users.noreply.github.com>
9 people committed Nov 6, 2020
1 parent 29fd12e commit a996f1e
Showing 13 changed files with 75 additions and 29 deletions.
2 changes: 1 addition & 1 deletion config/buildspec_vanilla_framework_tests.yml
@@ -18,7 +18,7 @@ phases:
- sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0
- cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh
- pip install --upgrade pip==19.3.1
- pip install -q pytest==5.3.3 pytest-cov wheel pyYaml pytest-html keras==2.3.1 mxnet torch xgboost pre-commit tensorflow_datasets torchvision
- pip install -q pytest pytest-cov wheel pyYaml pytest-html keras==2.3.1 mxnet torch xgboost pre-commit tensorflow_datasets==4.0.1 torchvision
- cd $CODEBUILD_SRC_DIR && chmod +x config/install_smdebug.sh && chmod +x config/check_smdebug_install.sh && ./config/install_smdebug.sh;

build:
2 changes: 1 addition & 1 deletion config/install_smdebug.sh
@@ -75,7 +75,7 @@ if [ "$run_pytest_mxnet" == 'enable' ]; then
fi
if [ "$run_pytest_tensorflow" == 'enable' ]; then
./config/check_smdebug_install.sh tensorflow
pip install tensorflow_datasets
pip install tensorflow_datasets==4.0.1
fi
if [ "$run_pytest_pytorch" == 'enable' ]; then
./config/check_smdebug_install.sh torch
2 changes: 1 addition & 1 deletion config/tests.sh
@@ -73,7 +73,7 @@ if [ "$run_pytest_tensorflow" = "enable" ] ; then
fi

if [ "$run_pytest_tensorflow2" = "enable" ] ; then
pip install tensorflow_datasets
pip install tensorflow_datasets==4.0.1
run_for_framework tensorflow2
run_profiler_test tensorflow2
fi
2 changes: 1 addition & 1 deletion smdebug/_version.py
@@ -1 +1 @@
__version__ = "0.9.4"
__version__ = "0.9.5"
1 change: 1 addition & 0 deletions smdebug/pytorch/hook.py
@@ -28,6 +28,7 @@
except ImportError:
herring = None


DEFAULT_INCLUDE_COLLECTIONS = [CollectionKeys.LOSSES]


2 changes: 0 additions & 2 deletions smdebug/tensorflow/keras.py
@@ -152,8 +152,6 @@ def register_model(self, model):
# It attaches a hook to every layer of the model to capture
# layer values
self.model = model
if self.tape is not None:
self._wrap_model_with_input_output_saver()
self._wrap_model_with_input_output_saver()
self.has_registered_model = True

26 changes: 21 additions & 5 deletions tests/tensorflow2/test_keras.py
@@ -7,8 +7,10 @@
`python tests/tensorflow2/test_keras.py` from the main directory.
"""
# Standard Library
import json
import re
import time
from pathlib import Path

# Third Party
import pytest
@@ -27,6 +29,7 @@
from smdebug.core.modes import ModeKeys
from smdebug.core.reduction_config import ALLOWED_NORMS, ALLOWED_REDUCTIONS
from smdebug.exceptions import TensorUnavailableForStep
from smdebug.profiler.profiler_constants import DEFAULT_PREFIX
from smdebug.tensorflow import ReductionConfig, SaveConfig


@@ -558,7 +561,7 @@ def test_include_regex(out_dir, tf_eager_mode):

tr = create_trial_fast_refresh(out_dir)
tnames = tr.tensor_names(collection="custom_coll")
assert len(tnames) == 12
assert len(tnames) == (12 if is_tf_2_2() else 4)
for tname in tnames:
assert tr.tensor(tname).value(0) is not None

@@ -729,10 +732,7 @@ def test_keras_fit_pure_eager(out_dir, tf_eager_mode):
helper_keras_fit(trial_dir=out_dir, hook=hook, eager=tf_eager_mode, run_eagerly=True)

trial = smd.create_trial(path=out_dir)
if is_tf_2_2():
assert len(trial.tensor_names()) == 27
else:
assert len(trial.tensor_names()) == (20 if is_tf_2_3() else 21)
assert len(trial.tensor_names()) == (27 if is_tf_2_2() else 13)
assert len(trial.tensor_names(collection=CollectionKeys.BIASES)) == 2
assert len(trial.tensor_names(collection=CollectionKeys.WEIGHTS)) == 2
assert len(trial.tensor_names(collection=CollectionKeys.OPTIMIZER_VARIABLES)) == 5
@@ -882,3 +882,19 @@ def test_save_layer_inputs_and_outputs(out_dir, tf_eager_mode):
"dense_1/inputs"
).value(0)
assert boolean_matrix.all()


def test_hook_timeline_file_write(set_up_smprofiler_config_path, out_dir, tf_eager_mode):
hook = smd.KerasHook(out_dir=out_dir, save_all=False)
helper_keras_fit(trial_dir=out_dir, hook=hook, eager=tf_eager_mode, steps=["train", "eval"])

files = []
for path in Path(out_dir + "/" + DEFAULT_PREFIX).rglob("*.json"):
files.append(path)

assert len(files) == 1

with open(files[0]) as timeline_file:
events_dict = json.load(timeline_file)

assert events_dict
12 changes: 9 additions & 3 deletions tests/tensorflow2/test_model_subclassing.py
@@ -2,6 +2,7 @@
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Flatten
from tensorflow.keras.models import Model
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -78,7 +79,12 @@ def test_subclassed_model(out_dir):
trial = smd.create_trial(out_dir)
assert len(trial.tensor_names(collection=smd.CollectionKeys.LAYERS)) == 8

assert trial.tensor_names(collection=smd.CollectionKeys.INPUTS) == ["model_input"]
assert trial.tensor_names(collection=smd.CollectionKeys.OUTPUTS) == ["labels", "predictions"]
assert trial.tensor_names(collection=smd.CollectionKeys.LOSSES) == ["loss"]
assert len(trial.tensor_names(collection=smd.CollectionKeys.GRADIENTS)) == 6
if is_tf_2_2():
# Feature to save model inputs and outputs was first added for TF 2.2.0
assert trial.tensor_names(collection=smd.CollectionKeys.INPUTS) == ["model_input"]
assert trial.tensor_names(collection=smd.CollectionKeys.OUTPUTS) == [
"labels",
"predictions",
]
assert len(trial.tensor_names(collection=smd.CollectionKeys.GRADIENTS)) == 6
6 changes: 6 additions & 0 deletions tests/tensorflow2/test_support_dicts.py
@@ -1,6 +1,8 @@
# Third Party
import numpy as np
import pytest
import tensorflow as tf
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -29,6 +31,10 @@ def create_model():
return model


@pytest.mark.skipif(
is_tf_2_2() is False,
reason="Feature to save model inputs and outputs was first added for TF 2.2.0",
)
def test_support_dicts(out_dir):
model = create_model()
optimizer = tf.keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
21 changes: 11 additions & 10 deletions tests/tensorflow2/test_tensorflow2_datatypes.py
@@ -1,31 +1,32 @@
# Third Party
import numpy as np
from packaging import version
import pytest
from tensorflow.python.framework.dtypes import _NP_TO_TF
from tests.tensorflow2.utils import is_tf_2_2

# First Party
from smdebug.core.tfevent.util import _get_proto_dtype


@pytest.mark.skipif(
is_tf_2_2() is False, reason="Brain Float Is Unavailable in lower versions of TF"
)
def test_tensorflow2_datatypes():
# _NP_TO_TF contains all the mappings
# of numpy to tf types
try:
from tensorflow import __version__ as tf_version
from tensorflow.python import _pywrap_bfloat16

if version.parse(tf_version) >= version.parse("2.0.0"):
from tensorflow.python import _pywrap_bfloat16

# TF 2.x.x Implements a Custom Numpy Datatype for Brain Floating Type
# Which is currently only supported on TPUs
_np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
_NP_TO_TF.pop(_np_bfloat16)
# TF 2.x.x Implements a Custom Numpy Datatype for Brain Floating Type
# Which is currently only supported on TPUs
_np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
_NP_TO_TF.pop(_np_bfloat16)
except (ModuleNotFoundError, ValueError, ImportError):
pass

for _type in _NP_TO_TF:
try:
_get_proto_dtype(np.dtype(_type))
except Exception:
assert False
assert False, f"{_type} not supported"
assert True
16 changes: 14 additions & 2 deletions tests/zero_code_change/pt_utils.py
@@ -7,22 +7,34 @@
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from packaging import version


def get_dataloaders() -> Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]:
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# Temporary Change to allow the test to run with pytorch 1.7 RC3
# Smdebug breaks when num_workers>0 for Pytorch 1.7.0
if version.parse(torch.__version__) >= version.parse("1.7.0"):
num_workers = 0
else:
num_workers = 2

trainset = torchvision.datasets.CIFAR10(
root="./data", train=True, download=True, transform=transform
)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
trainloader = torch.utils.data.DataLoader(
trainset, batch_size=4, shuffle=True, num_workers=num_workers
)

testset = torchvision.datasets.CIFAR10(
root="./data", train=False, download=True, transform=transform
)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
testloader = torch.utils.data.DataLoader(
testset, batch_size=4, shuffle=False, num_workers=num_workers
)

classes = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck")
return trainloader, testloader
5 changes: 5 additions & 0 deletions tests/zero_code_change/test_pytorch_integration.py
@@ -12,6 +12,7 @@

# Third Party
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
@@ -22,6 +23,10 @@
from smdebug.core.utils import SagemakerSimulator, ScriptSimulator


@pytest.mark.skipif(
torch.__version__ == "1.7.0",
reason="Disabling the test temporarily until we root cause the version incompatibility",
)
@pytest.mark.parametrize("script_mode", [False])
@pytest.mark.parametrize("use_loss_module", [True, False])
def test_pytorch(script_mode, use_loss_module):
@@ -12,7 +12,8 @@
# Third Party
import pytest
import tensorflow.compat.v2 as tf
from tests.tensorflow2.utils import is_tf_2_2, is_tf_2_3
from packaging import version
from tests.tensorflow2.utils import is_tf_2_2

# First Party
import smdebug.tensorflow as smd
@@ -104,8 +105,8 @@ def helper_test_keras_v2_gradienttape(
print(log)
train_acc_metric.reset_states()
hook = smd.get_hook()
if not (is_tf_2_2() or is_tf_2_3()):
assert not hook # only supported on TF 2.2 and greater
if version.parse(tf.__version__) < version.parse("2.1.2"):
assert not hook # only supported on TF 2.1.2 and greater
return
assert hook
hook.close()
