Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors

Amazon SageMaker Debugger

Overview

Amazon SageMaker Debugger is an offering from AWS that helps you automate the debugging of machine learning training jobs. This library powers Amazon SageMaker Debugger and helps you develop better, faster, and cheaper models by catching common errors quickly. It lets you save tensors from training jobs and makes those tensors available for analysis, all through a flexible and powerful API. It supports TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6+.

  • Zero Script Change experience on SageMaker when using supported containers
  • Full visibility into any tensor that is part of the training process
  • Real-time training job monitoring through Rules
  • Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
  • Actions on your training jobs based on the status of Rules
  • Interactive exploration of saved tensors
  • Distributed training support
  • TensorBoard support

Examples

Notebooks

We have a set of example notebooks here that demonstrate different features of SageMaker Debugger.

Running a Rule with Zero Script Change on SageMaker

This example uses the zero-script-change experience, where you can use your training script as-is. Refer to Running SageMaker jobs with Amazon SageMaker Debugger for more details.

import sagemaker as sm
import sagemaker.tensorflow  # explicit import so that sm.tensorflow resolves below
from sagemaker.debugger import rule_configs, Rule, CollectionConfig

# Choose a built-in rule to monitor your training job
rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    # configure your rule if applicable
    rule_parameters={"tensor_regex": ".*"},
    # specify collections to save for processing your rule
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

# Pass the rule to the estimator
sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
    entry_point="script.py",
    role=sm.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    # argument for smdebug below
    rules=[rule],
)

sagemaker_simple_estimator.fit()
tensors_path = sagemaker_simple_estimator.latest_job_debugger_artifacts_path()

from smdebug import modes
from smdebug.trials import create_trial

# Load the tensors saved by the job from its debugger output location
trial = create_trial(path=tensors_path)
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=modes.EVAL)}")

That's it! Amazon SageMaker automatically monitors your training job with the Rules you specified and creates a CloudWatch event that tracks the status of each Rule, so you can take any action based on that status.

If you want greater configuration and control, we offer that too. Head over here for more information.

Running Locally

This example requires Python 3.6+ and uses tf.keras. Install the library with:

pip install smdebug

To use Amazon SageMaker Debugger, simply add a callback hook:

import tensorflow as tf
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir='~/smd_outputs/')

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)

# Add the hook as a callback
model.fit(x_train, y_train, epochs=2, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])

# Create a trial to inspect the saved tensors
trial = smd.create_trial(path='~/smd_outputs/')
print(f"Saved these tensors: {trial.tensor_names()}")
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
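Once a trial is loaded, the saved values can be inspected step by step. The following is a minimal sketch of that kind of interactive exploration, not part of the library. It assumes the tensor of interest is a scalar (such as a loss) and relies only on the `tensor(name)`, `steps()`, and `value(step)` calls of the trial API. The helper `steps_where_value_increased` and the `FakeTensor` / `FakeTrial` stand-ins are hypothetical names introduced here so that the sketch runs without any saved tensors on disk.

```python
from typing import Dict, List


def steps_where_value_increased(trial, tensor_name: str) -> List[int]:
    """Return the saved steps at which a scalar tensor rose relative to the previous saved step."""
    tensor = trial.tensor(tensor_name)
    steps = sorted(tensor.steps())
    increased = []
    for prev, curr in zip(steps, steps[1:]):
        # float() assumes the saved value is a scalar (or a size-1 array)
        if float(tensor.value(curr)) > float(tensor.value(prev)):
            increased.append(curr)
    return increased


# Hypothetical in-memory stand-ins that mimic the three calls used above.
class FakeTensor:
    def __init__(self, values: Dict[int, float]):
        self._values = values

    def steps(self) -> List[int]:
        return list(self._values)

    def value(self, step: int) -> float:
        return self._values[step]


class FakeTrial:
    def __init__(self, tensors: Dict[str, FakeTensor]):
        self._tensors = tensors

    def tensor(self, name: str) -> FakeTensor:
        return self._tensors[name]


fake_trial = FakeTrial({"loss": FakeTensor({0: 2.0, 10: 1.5, 20: 1.7, 30: 1.2})})
print(steps_where_value_increased(fake_trial, "loss"))  # prints [20]
```

With a real trial, you would pass the object returned by `create_trial` in place of `fake_trial` and use one of the names reported by `trial.tensor_names()`.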

How It Works

Amazon SageMaker Debugger uses the construct of a Hook to save the values of requested tensors throughout the training process. You can then set up a Rule job that simultaneously monitors and validates these tensors to ensure that training is progressing as expected. A rule might check for vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to CloudWatch events, so that when a rule is triggered it changes the state of the CloudWatch event. You can configure any action on the CloudWatch event, such as stopping the training job, saving you time and money.
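The Hook and Rule interplay described above can be illustrated with a toy model in plain Python. This is a conceptual sketch only: `ToyHook` and `ToyExplodingValueRule` are hypothetical classes invented for illustration and are not the library's API. The rule here runs inline with the loop, whereas a real Rule job runs alongside training. The sketch shows a hook saving values at a fixed step interval, and a rule that inspects what was saved and signals that training should stop.

```python
import math
from collections import defaultdict
from typing import Dict, List


class ToyHook:
    """Hypothetical stand-in for a Hook: records values at every save_interval-th step."""

    def __init__(self, save_interval: int = 1):
        self.save_interval = save_interval
        # name -> step -> saved values
        self.saved: Dict[str, Dict[int, List[float]]] = defaultdict(dict)

    def record(self, step: int, name: str, values: List[float]) -> None:
        if step % self.save_interval == 0:
            self.saved[name][step] = list(values)


class ToyExplodingValueRule:
    """Hypothetical rule: fires when a saved value is NaN, infinite, or above a threshold."""

    def __init__(self, hook: ToyHook, threshold: float):
        self.hook = hook
        self.threshold = threshold

    def invoke_at_step(self, step: int) -> bool:
        for by_step in self.hook.saved.values():
            for v in by_step.get(step, []):
                if math.isnan(v) or math.isinf(v) or abs(v) > self.threshold:
                    return True
        return False


hook = ToyHook(save_interval=2)
rule = ToyExplodingValueRule(hook, threshold=1e3)

# Simulated training loop whose "gradient" grows tenfold at each step.
stopped_at = None
for step in range(10):
    hook.record(step, "gradient", [10.0 ** step])
    if rule.invoke_at_step(step):
        stopped_at = step  # analogous to acting on the rule's status change
        break

print(stopped_at)  # prints 4
```

The rule only sees steps the hook actually saved, which is why the simulated run stops at step 4 rather than step 3: with a save interval of 2, step 3 is never recorded. The same trade-off between save frequency and detection latency applies when choosing how often to save tensors.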

Amazon SageMaker Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Usage scenarios fall into the following categories:

  • SageMaker Zero-Script-Change: Here you specify which rules to use when setting up the estimator and run your existing script with no changes. See the first example above.
  • SageMaker Bring-Your-Own-Container: Here you specify the rules to use, and modify your training script minimally to enable SageMaker Debugger.
  • Non-SageMaker: Here you write custom rules (or manually analyze the tensors) and modify your training script minimally to enable SageMaker Debugger. See the second example above.

The reason for the different setups is that SageMaker Zero-Script-Change (via AWS Deep Learning Containers) uses custom framework forks of TensorFlow, PyTorch, MXNet, and XGBoost, which add our Hook to the training job and save requested tensors automatically. These framework forks are not available in custom containers or non-SageMaker environments, so you must modify your training script in those environments.

Docs

| Section | Description |
|---|---|
| SageMaker Training | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| APIs for Saving Tensors | Full description of our APIs for saving tensors |
| Programming Model for Analysis | Description of the programming model provided by our APIs, which lets you interactively explore saved tensors and write your own Rules to monitor your training jobs |

SageMaker Debugger in action

  • Using SageMaker Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training.
  • Using SageMaker Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss.

License

This library is licensed under the Apache 2.0 License.
