Doc update #292

Merged
60 commits merged on Jul 31, 2020
Commits
69b3943
Update README.md
mchoi8739 Apr 20, 2020
90ab484
response to the comment from vandanavk
mchoi8739 Apr 21, 2020
55d735e
fixing a typo
mchoi8739 Apr 21, 2020
fd39891
Fixing the example links
mchoi8739 Apr 23, 2020
f69a206
Staging for preview
mchoi8739 Apr 27, 2020
d1543aa
fixed the doc responding to the comments
mchoi8739 Apr 28, 2020
23aab4e
Update README.md
mchoi8739 Apr 27, 2020
d6b74cb
fix minor things
mchoi8739 Apr 28, 2020
4821ded
re-arange and edit README and sagemaker markdown files
mchoi8739 Apr 29, 2020
b1170d5
fixing few typos
mchoi8739 Apr 29, 2020
4c5cbaf
update README.md / add BYOC example
mchoi8739 Apr 30, 2020
ca3e7a9
minor fix
mchoi8739 Apr 30, 2020
5e549aa
fixed links
mchoi8739 Apr 30, 2020
8f3a171
Update README.md
mchoi8739 Apr 30, 2020
52a2a90
Update README.md
mchoi8739 Apr 30, 2020
fedefc0
Update README.md
mchoi8739 Apr 30, 2020
ef0707d
Update README.md
mchoi8739 Apr 30, 2020
77e0d34
Update README.md
mchoi8739 Apr 30, 2020
3c37a48
Update README.md
mchoi8739 Apr 30, 2020
5dceb53
Update README.md
mchoi8739 Apr 30, 2020
05039cc
Update README.md
mchoi8739 Apr 30, 2020
309c4a0
Update README.md
mchoi8739 Apr 30, 2020
97a1fb3
Update README.md
mchoi8739 Apr 30, 2020
a29c6c9
Update README.md
mchoi8739 Apr 30, 2020
4733be3
Update README.md
mchoi8739 Apr 30, 2020
7781628
Update README.md
mchoi8739 Apr 30, 2020
1e87842
Update README.md
mchoi8739 Apr 30, 2020
9e3e14c
Update README.md
mchoi8739 Apr 30, 2020
ed1b75b
Update README.md
mchoi8739 Apr 30, 2020
5059f99
Update README.md
mchoi8739 Apr 30, 2020
7935b87
Update README.md
mchoi8739 Apr 30, 2020
fd467b4
a few changes and re-ordering
mchoi8739 Apr 30, 2020
d08795b
model pruning resnet image
mchoi8739 Apr 30, 2020
c5e0963
update README.md
mchoi8739 Apr 30, 2020
5fd2281
fix issues
mchoi8739 May 5, 2020
8cc5cd9
sync up
mchoi8739 Jul 22, 2020
c9b8648
update tensorflow.md
mchoi8739 Jul 23, 2020
1db21b4
minor changes
mchoi8739 Jul 23, 2020
ef7d832
minor changes
mchoi8739 Jul 23, 2020
175079d
minor changes
mchoi8739 Jul 23, 2020
17f98b0
minor fix
mchoi8739 Jul 23, 2020
afd1615
update docs for TF 2.2 compatibility and others (#282)
mchoi8739 Jul 24, 2020
9396e81
Clean up README and sagemaker.md
mchoi8739 Jul 24, 2020
e27cc84
Clean up and reword
mchoi8739 Jul 24, 2020
cabdc17
rewording README / minor fix
mchoi8739 Jul 24, 2020
0b178a8
minor fix
mchoi8739 Jul 24, 2020
a863608
fix links / add sample links
mchoi8739 Jul 24, 2020
a213427
add the new "layers" collection
mchoi8739 Jul 24, 2020
7fb4afe
minor fix
mchoi8739 Jul 28, 2020
dad257c
minor fix
mchoi8739 Jul 28, 2020
1b1fe4f
minor fix
mchoi8739 Jul 29, 2020
3615c94
minor fix
mchoi8739 Jul 29, 2020
2680597
Merge branch 'doc-update' of https://github.com/mchoi8739/sagemaker-d…
mchoi8739 Jul 29, 2020
fc593b9
minor fix
mchoi8739 Jul 29, 2020
8c8390b
Merge branch 'master' of https://github.com/awslabs/sagemaker-debugger
mchoi8739 Jul 29, 2020
86bfe9d
Pin pytest version (#293)
NihalHarish Jul 29, 2020
10ec3bc
Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer In…
NihalHarish Jul 28, 2020
12f7d4e
Merge branch 'master' of https://github.com/mchoi8739/sagemaker-debug…
mchoi8739 Jul 29, 2020
e314da0
typo fix
mchoi8739 Jul 29, 2020
62531d3
retrigger CI
mchoi8739 Jul 30, 2020
205 changes: 165 additions & 40 deletions README.md

Large diffs are not rendered by default.

45 changes: 31 additions & 14 deletions docs/api.md
@@ -47,24 +47,37 @@
you're in. Defaults to "global".

## Hook
### Creating a Hook
Note that when using Zero Script Change supported containers in SageMaker, you generally do not need to create your hook object except for some advanced use cases where you need access to the hook.
By using AWS Deep Learning Containers, you can directly run your own training script without any additional effort to make it compatible with the SageMaker Python SDK. For a detailed developer guide for this, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).

`HookClass` or `hook_class` below will be `Hook` for PyTorch, MXNet, and XGBoost. It will be one of `KerasHook`, `SessionHook` or `EstimatorHook` for TensorFlow.
However, for some advanced use cases where you need access to customized tensors from targeted parts of a training script, you can manually construct the hook object. The smdebug library provides hook classes to make this process simple and compatible with the SageMaker ecosystem and Debugger.

The framework in `smd` import below refers to one of `tensorflow`, `mxnet`, `pytorch` or `xgboost`.

#### Hook when using the SageMaker Python SDK
If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API
as described in [AWS Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html),
a JSON file of the `CreateTrainingJob` API operation containing the hook configuration is automatically written to the training container.

To capture tensors from your training model, paste the following code at the top or in the main function of the training script. `create_from_json_file()` takes no arguments here, because SageMaker writes the configuration file to its default path in the container.
```python
import smdebug.Framework as smd
hook = smd.HookClass.create_from_json_file()
```

Depending on your choice of framework, `HookClass` needs to be replaced by one of `KerasHook`, `SessionHook`, or `EstimatorHook` for TensorFlow, or by `Hook` for PyTorch, MXNet, and XGBoost.

The framework in the `smd.Framework` import refers to one of `tensorflow`, `mxnet`, `pytorch`, or `xgboost`.

After choosing a framework and creating the hook object, embed the hook into the target parts of your training script to retrieve tensors and use them with the SageMaker Debugger Python SDK, as sketched below.

For more information about constructing the hook for your framework of choice and registering it with your model, see the following pages.

* [TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md)
* [MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md)
* [PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md)
* [XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md)
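
For instance, a minimal sketch with PyTorch (any of the frameworks above works the same way; this assumes the script runs in a SageMaker training container, where the configuration file exists at its default path):

```python
import smdebug.pytorch as smd

# Reads the hook configuration that SageMaker writes to the container,
# by default at /opt/ml/input/config/debughookconfig.json.
hook = smd.Hook.create_from_json_file()
```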

#### Configuring Hook using SageMaker Python SDK
After you make the minimal changes to your training script, you can configure the hook by passing parameters to the SageMaker Debugger API class `DebuggerHookConfig`.

```python
from sagemaker.debugger import DebuggerHookConfig
hook_config = DebuggerHookConfig(
    hook_parameters={
        "parameter": "value"
    })
```
The meaning of these parameters will become clear as you review the sections below. Note that all parameter values must be strings: for any parameter that accepts a list (such as `save_steps`, `reductions`, or `include_regex`), give the value as a single comma-separated string.

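For example, a minimal sketch of list-valued parameters passed as comma-separated strings (the parameter names come from the list below; the values are illustrative):

```python
from sagemaker.debugger import DebuggerHookConfig

# Every value must be a string; list-valued parameters such as
# save_steps are written as one comma-separated string.
hook_config = DebuggerHookConfig(
    hook_parameters={
        "save_steps": "0,10,20",
        "include_regex": "relu|weight",
        "save_all": "false",
    })
```

The available parameter names are: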
```
dry_run
save_all
...
```

@@ -147,7 +162,8 @@
Note that `smd` import below translates to `import smdebug.{framework} as smd`.
|`set_mode(mode)`| value of the enum `smd.modes` | Sets mode of the job, can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT` or `smd.modes.GLOBAL`. Refer [Modes](#modes) for more on that. |
|`create_from_json_file(`<br/>` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter. <br/> If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to.
|`close()` | - | Closes all files that are currently open by the hook |
| `save_scalar()` | `name (str)` <br/> `value (float)` <br/> `sm_metric (bool)`| Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric, which shows up in SageMaker Studio. Note that when `sm_metric` is False, this scalar resides only in your AWS account; setting it to True also saves the scalar on AWS servers. The default value of `sm_metric` is False. |
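
For example, a minimal sketch of saving a custom scalar (the metric name and value are illustrative):

```python
import smdebug.pytorch as smd  # the same call exists on every framework's hook

hook = smd.Hook.create_from_json_file()
accuracy = 0.93  # stands in for a value your training loop computes
# sm_metric=True additionally publishes the scalar as a SageMaker Metric.
hook.save_scalar("custom_accuracy", accuracy, sm_metric=True)
```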


### TensorFlow specific Hook API
Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.
Expand All @@ -157,12 +173,12 @@ Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHo
| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using Zero Script Change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of type `KerasHook`, you can pass in either an object of type `tf.train.Optimizer` or `tf.keras.Optimizer`. If the hook is of type `SessionHook` or `EstimatorHook`, the optimizer can only be of type `tf.train.Optimizer`. |
| `add_to_collection(`<br/> `collection_name, variable)` | `collection_name (str)` : name of the collection to add to. <br/> `variable` parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more. |
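
For illustration, a minimal Keras sketch of `wrap_optimizer` (the model and optimizer are placeholders; wrapping is only needed outside Zero Script Change environments):

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook.create_from_json_file()
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Wrapping returns the same optimizer, marked so smdebug can save gradients.
optimizer = hook.wrap_optimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="mse")
```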

The following hook APIs are specific to training scripts using the TF 2.x GradientTape ([Example](tensorflow.md#TF 2.x GradientTape example)):

| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
| `record_tensor_value(`<br/> `tensor_name, tensor_value)` | `tensor_name (str)` : name of the tensor to save. <br/> `tensor_value` EagerTensor to save. | `None` | Manually save metrics tensors while using TF 2.x GradientTape. |
| `save_tensor(`<br/>`tensor_name, `<br/>`tensor_value, `<br/>`collections_to_write)` | `tensor_name (str)` <br/> `tensor_value (float)` <br/> `collections_to_write (str)` | `None` | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated in favor of this method.|
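
A minimal sketch of the tape workflow (the tiny model and data are illustrative; the `save_tensor` arguments follow the table above):

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook.create_from_json_file()
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])
x = tf.random.normal((8, 2))
y = tf.random.normal((8, 1))

tape = hook.wrap_tape(tf.GradientTape())  # marked so smdebug can save gradients
with tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Manually save the loss into the built-in "losses" collection.
hook.save_tensor("loss", float(loss), "losses")
```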

### MXNet specific Hook API

@@ -217,6 +233,7 @@
The names of these collections are all lower case strings.
| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model |
| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, the evaluation metrics computed by the algorithm. |
| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model |
| `layers` | TensorFlow's KerasHook | Input and output of intermediate convolutional layers |
| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables, currently only supported in Keras. |
| `hyperparameters` | XGBoost | [Booster parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
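
For example, a hedged sketch of requesting one of these built-in collections through the SageMaker Python SDK (the `save_interval` value is illustrative):

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

hook_config = DebuggerHookConfig(
    collection_configs=[
        # Save the built-in "losses" collection every 10 steps.
        CollectionConfig(name="losses", parameters={"save_interval": "10"})
    ])
```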
5 changes: 3 additions & 2 deletions docs/mxnet.md
@@ -35,8 +35,9 @@
If using SageMaker, you will configure the hook in SageMaker's python SDK using
#### 2. Register the model to the hook
Call `hook.register_block(net)`.

#### 3. (Optional) Configure Collections, SaveConfig and ReductionConfig
See the [Common API](api.md) page for details on how to do this.
#### 3. Take actions using the hook APIs

For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [MXNet specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#mxnet-specific-hook-api).
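
For instance, a minimal sketch covering steps 1–3 (the network is a placeholder):

```python
from mxnet.gluon import nn
import smdebug.mxnet as smd

hook = smd.Hook.create_from_json_file()

net = nn.Dense(1)         # any Gluon block or network
hook.register_block(net)  # lets the hook save tensors flowing through it
```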

---

5 changes: 3 additions & 2 deletions docs/pytorch.md
@@ -36,8 +36,9 @@
Call `hook.register_module(net)`.
If using a loss which is a subclass of `nn.Module`, call `hook.register_loss(loss_criterion)` once before starting training.\
If using a loss which is a subclass of `nn.functional`, call `hook.record_tensor_value(loss)` after each training step.

#### 4. (Optional) Configure Collections, SaveConfig and ReductionConfig
See the [Common API](api.md) page for details on how to do this.
#### 4. Take actions using the hook APIs

For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [PyTorch specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#pytorch-specific-hook-api).
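
For instance, a minimal sketch covering hook creation, module registration, and loss registration (the model and loss are placeholders):

```python
import torch.nn as nn
import smdebug.pytorch as smd

hook = smd.Hook.create_from_json_file()

net = nn.Linear(4, 1)
hook.register_module(net)    # save tensors flowing through the module

loss_fn = nn.MSELoss()       # a loss that subclasses nn.Module
hook.register_loss(loss_fn)  # register once, before training starts
```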

---

Binary file added docs/resources/results_resnet.png
57 changes: 4 additions & 53 deletions docs/sagemaker.md
@@ -1,9 +1,6 @@
## Running SageMaker jobs with Amazon SageMaker Debugger

## Outline
- [Enabling SageMaker Debugger](#enabling-sagemaker-debugger)
- [Zero Script Change](#zero-script-change)
- [Bring your own training container](#bring-your-own-training-container)
### Outline
- [Configuring SageMaker Debugger](#configuring-sagemaker-debugger)
- [Saving data](#saving-data)
- [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage)
@@ -17,44 +14,6 @@
- [TensorBoard Visualization](#tensorboard-visualization)
- [Example Notebooks](#example-notebooks)

## Enabling SageMaker Debugger
There are two ways in which you can enable SageMaker Debugger while training on SageMaker.

### Zero Script Change
We have equipped the official Framework containers on SageMaker with custom versions of supported frameworks TensorFlow, PyTorch, MXNet and XGBoost. These containers enable you to use SageMaker Debugger with no changes to your training script, by automatically adding [SageMaker Debugger's Hook](api.md#glossary).

Here's a list of frameworks and versions which support this experience.

| Framework | Version |
| --- | --- |
| [TensorFlow](tensorflow.md) | 1.15, 2.1, 2.2 |
| [MXNet](mxnet.md) | 1.6 |
| [PyTorch](pytorch.md) | 1.4, 1.5 |
| [XGBoost](xgboost.md) | >=0.90-2 [As Built-in algorithm](xgboost.md#use-xgboost-as-a-built-in-algorithm)|

More details about which containers ship these deep learning frameworks can be found here: [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/). You do not have to specify any training container image to use them on SageMaker; you only need to specify one of the versions above.

### Bring your own training container

This library `smdebug` itself supports versions other than the ones listed above. If you want to use SageMaker Debugger with a different version, you will have to orchestrate your training script with a few lines of code. Before we discuss what these changes look like, let us take a look at the versions supported.

| Framework | Versions |
| --- | --- |
| [TensorFlow](tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
| Keras (with TensorFlow backend) | 2.3 |
| [MXNet](mxnet.md) | 1.4, 1.5, 1.6 |
| [PyTorch](pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 |

#### Setting up SageMaker Debugger with your script on your container

- Ensure that you are using Python3 runtime as `smdebug` only supports Python3.
- Install `smdebug` binary through `pip install smdebug`
- Make some minimal modifications to your training script to add SageMaker Debugger's Hook. Please refer to the framework pages linked below for instructions on how to do that.
- [TensorFlow](tensorflow.md)
- [PyTorch](pytorch.md)
- [MXNet](mxnet.md)
- [XGBoost](xgboost.md)

## Configuring SageMaker Debugger

@@ -185,17 +144,8 @@
Note that passing a `CollectionConfig` object to the Rule as `collections_to_save`
is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`.
This is just a shortcut for your convenience.
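
For example, a hedged sketch of the two equivalent ways to request the same collection (`loss_not_decreasing` stands in for any built-in rule):

```python
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                Rule, rule_configs)

cc = CollectionConfig(name="losses")

# These two configurations save the same collection:
rule = Rule.sagemaker(rule_configs.loss_not_decreasing(),
                      collections_to_save=[cc])
hook_config = DebuggerHookConfig(collection_configs=[cc])
```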

#### Built in Rules
The Built-in Rules, or SageMaker Rules, are described in detail on [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html).


| Scope of Validity | Rules |
|---|---|
| Generic Deep Learning models (TensorFlow, Apache MXNet, and PyTorch) |<ul><li>[`dead_relu`](https://docs.aws.amazon.com/sagemaker/latest/dg/dead-relu.html)</li><li>[`exploding_tensor`](https://docs.aws.amazon.com/sagemaker/latest/dg/exploding-tensor.html)</li><li>[`poor_weight_initialization`](https://docs.aws.amazon.com/sagemaker/latest/dg/poor-weight-initialization.html)</li><li>[`saturated_activation`](https://docs.aws.amazon.com/sagemaker/latest/dg/saturated-activation.html)</li><li>[`vanishing_gradient`](https://docs.aws.amazon.com/sagemaker/latest/dg/vanishing-gradient.html)</li><li>[`weight_update_ratio`](https://docs.aws.amazon.com/sagemaker/latest/dg/weight-update-ratio.html)</li></ul> |
| Generic Deep learning models (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm | <ul><li>[`all_zero`](https://docs.aws.amazon.com/sagemaker/latest/dg/all-zero.html)</li><li>[`class_imbalance`](https://docs.aws.amazon.com/sagemaker/latest/dg/class-imbalance.html)</li><li>[`confusion`](https://docs.aws.amazon.com/sagemaker/latest/dg/confusion.html)</li><li>[`loss_not_decreasing`](https://docs.aws.amazon.com/sagemaker/latest/dg/loss-not-decreasing.html)</li><li>[`overfit`](https://docs.aws.amazon.com/sagemaker/latest/dg/overfit.html)</li><li>[`overtraining`](https://docs.aws.amazon.com/sagemaker/latest/dg/overtraining.html)</li><li>[`similar_across_runs`](https://docs.aws.amazon.com/sagemaker/latest/dg/similar-across-runs.html)</li><li>[`tensor_variance`](https://docs.aws.amazon.com/sagemaker/latest/dg/tensor-variance.html)</li><li>[`unchanged_tensor`](https://docs.aws.amazon.com/sagemaker/latest/dg/unchanged-tensor.html)</li></ul>|
| Deep learning applications |<ul><li>[`check_input_images`](https://docs.aws.amazon.com/sagemaker/latest/dg/checkinput-mages.html)</li><li>[`nlp_sequence_ratio`](https://docs.aws.amazon.com/sagemaker/latest/dg/nlp-sequence-ratio.html)</li></ul> |
| XGBoost algorithm | <ul><li>[`tree_depth`](https://docs.aws.amazon.com/sagemaker/latest/dg/tree-depth.html)</li></ul>|

#### Built-in Rules
To find a full list of built-in rules that you can use with the SageMaker Python SDK, see the [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) page.

#### Running built-in SageMaker Rules
You can run a SageMaker built-in Rule as follows using the `Rule.sagemaker` method.
@@ -238,6 +188,7 @@
```python
sagemaker_estimator = sm.tensorflow.TensorFlow(
    ...
)
sagemaker_estimator.fit()
```
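
A fuller hedged sketch (the rule choice, script name, and instance settings are illustrative and follow SageMaker Python SDK v2 naming):

```python
import sagemaker as sm
from sagemaker.debugger import Rule, rule_configs

rules = [Rule.sagemaker(rule_configs.vanishing_gradient())]

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point="train.py",  # hypothetical training script
    role=sm.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.2",
    py_version="py37",
    rules=rules,
)
sagemaker_estimator.fit()
```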

#### Custom Rules

You can write your own rule, custom-made for your application, and provide it so that SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our page on [Programming Model for Analysis](analysis.md) describes the APIs that we provide to help you write your own rule.