More doc updates (#106)
* take out env var to new doc, and change some indentation in sagemaker.md

* Fix links in TF

* Merge API for all frameworks and increase details

* Update mxnet.md

* Update mxnet.md

* Update mxnet.md

* Update pytorch.md

* Update tensorflow.md

* Update tensorflow.md

* Update sagemaker.md

* Update api.md

* Update api.md
rahul003 committed Dec 10, 2019
1 parent 07ba669 commit 1bb5a1e
Showing 8 changed files with 469 additions and 467 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -116,7 +116,7 @@ These framework forks are not available in custom containers or non-SM environments
| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks <ul><li>[TensorFlow](docs/tensorflow.md)</li><li>[PyTorch](docs/pytorch.md)</li><li>[MXNet](docs/mxnet.md)</li><li>[XGBoost](docs/xgboost.md)</li></ul> | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [Programming Model for Analysis](docs/analysis.md) | Describes the programming model provided by our APIs, which allows you to interactively explore saved tensors and to write your own Rules to monitor your training jobs. |
| [APIs](docs/api.md) | Full description of our APIs on saving tensors |


## License
511 changes: 304 additions & 207 deletions docs/api.md

Large diffs are not rendered by default.

100 changes: 100 additions & 0 deletions docs/env_var.md
@@ -0,0 +1,100 @@

## Environment Variables

#### `USE_SMDEBUG`:

When using the official [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) or [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/), which support the [Zero Script Change experience](sagemaker.md#zero-script-change), SageMaker Debugger can be disabled by setting this variable to `0`. In that case, the hook is disabled regardless of the configuration given to the job through the SageMaker Python SDK. By default, this is set to `1`, signifying True.
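
As a minimal sketch, the flag can also be flipped from inside a training script, assuming it is set before the container initializes the debugger (setting it in the job's environment is the more reliable route):

```python
import os

# Disable SageMaker Debugger for this process. This only takes effect if it
# runs before the framework creates its hook, so prefer setting the variable
# in the job's environment rather than in the script itself.
os.environ["USE_SMDEBUG"] = "0"
```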

#### `SMDEBUG_CONFIG_FILE_PATH`:

Contains the path to the JSON file that describes the smdebug hook.

At a minimum, the JSON config should contain the path where smdebug should output tensors.
Example:

`{ "LocalPath": "/my/smdebug_hook/path" }`

In a SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON file.
In a non-SageMaker environment, SageMaker Debugger is not used unless this environment variable is set or
a hook is created manually.

Sample JSON from which a hook can be created:
```json
{
  "LocalPath": "/my/smdebug_hook/path",
  "HookParameters": {
    "save_all": false,
    "include_regex": "regex1,regex2",
    "save_interval": "100",
    "save_steps": "1,2,3,4",
    "start_step": "1",
    "end_step": "1000000",
    "reductions": "min,max,mean"
  },
  "CollectionConfigurations": [
    {
      "CollectionName": "collection_obj_name1",
      "CollectionParameters": {
        "include_regex": "regex5*",
        "save_interval": 100,
        "save_steps": "1,2,3",
        "start_step": 1,
        "reductions": "min"
      }
    }
  ]
}
```
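
As a sketch of how this variable is consumed (the config path below is illustrative), smdebug hooks provide a `create_from_json_file` class method that falls back to this environment variable when no explicit path is passed:

```python
import os
import smdebug.mxnet as smd  # the same pattern applies to the other framework modules

# Point smdebug at the JSON config, then build the hook from it instead of
# passing arguments in code.
os.environ["SMDEBUG_CONFIG_FILE_PATH"] = "/path/to/smdebug_config.json"
hook = smd.Hook.create_from_json_file()
```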

#### `TENSORBOARD_CONFIG_FILE_PATH`:

Contains the path to the JSON file that specifies where TensorBoard artifacts need to
be placed.

Sample JSON file:

`{ "LocalPath": "/my/tensorboard/path" }`

In a SageMaker environment, the presence of this JSON is necessary to log any TensorBoard artifact.
By default, this path is set to point to a pre-defined location in SageMaker.

`tensorboard_dir` can also be passed when creating the hook through the API, or in the JSON specified
by `SMDEBUG_CONFIG_FILE_PATH`. For this to take effect, `export_tensorboard` must be set to `True`.
This option to set `tensorboard_dir` is available in both SageMaker and non-SageMaker environments.
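
For example, a minimal sketch of enabling TensorBoard output when creating the hook through the API (paths are illustrative, and the same arguments exist on the other framework hooks):

```python
import smdebug.mxnet as smd

hook = smd.Hook(
    out_dir="/my/smdebug_hook/path",
    export_tensorboard=True,                 # required for tensorboard_dir to take effect
    tensorboard_dir="/my/tensorboard/path",  # where TensorBoard logs are written
)
```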


#### `CHECKPOINT_CONFIG_FILE_PATH`:

Contains the path to the JSON file that specifies where training checkpoints need to
be placed. This is used in the context of spot training.

Sample JSON file:

`{ "LocalPath": "/my/checkpoint/path" }`

In a SageMaker environment, the presence of this JSON is necessary to save checkpoints.
By default, this path is set to point to a pre-defined location in SageMaker.
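
As an illustration of the contract only (smdebug reads this file internally; nothing here is required in user code), a spot-training script could resolve the checkpoint directory like this:

```python
import json
import os

config_path = os.environ.get("CHECKPOINT_CONFIG_FILE_PATH")
if config_path and os.path.exists(config_path):
    with open(config_path) as f:
        checkpoint_dir = json.load(f)["LocalPath"]  # e.g. "/my/checkpoint/path"
```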


#### `SAGEMAKER_METRICS_DIRECTORY`:

Contains the path to the directory where metrics will be recorded for consumption by SageMaker Metrics.
This is relevant only in a SageMaker environment, where this variable points to a pre-defined location.


#### `TRAINING_END_DELAY_REFRESH`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. This environment variable
specifies how many seconds to wait before refreshing the index files to check whether training has ended
and new tensors are available. By default, this value is set to 1.


#### `INCOMPLETE_STEP_WAIT_WINDOW`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. A trial checks whether a step
specified in the smdebug hook has been completed. This environment variable
specifies the maximum number of incomplete steps that the trial will wait for before marking
half of them as complete. Default: 1000.
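
Both of these are ordinary environment variables, so an analysis script can set them before creating a trial. A sketch with illustrative values and an illustrative path:

```python
import os

# Illustrative values: refresh the index every 5 seconds, and wait on at most
# 200 incomplete steps before half of them are marked complete.
os.environ["TRAINING_END_DELAY_REFRESH"] = "5"
os.environ["INCOMPLETE_STEP_WAIT_WINDOW"] = "200"

from smdebug.trials import create_trial

trial = create_trial(path="/my/smdebug_hook/path")
print(trial.tensor_names())
```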
71 changes: 13 additions & 58 deletions docs/mxnet.md
@@ -3,17 +3,18 @@
## Contents
- [Support](#support)
- [How to Use](#how-to-use)
- [Example](#example)
- [Full API](#full-api)

---

## Support

### Versions
- The Zero Script Change experience, where you need no modifications to your training script, is supported in the official [SageMaker Framework Container for MXNet 1.6](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and the [AWS Deep Learning Container for MXNet 1.6](https://aws.amazon.com/machine-learning/containers/).

- This library itself supports MXNet 1.4, 1.5, and 1.6 when you use our API, which requires a few minimal changes to your training script.
  - Only Gluon models are supported
  - When a Gluon model is hybridized, inputs and outputs of intermediate layers cannot be saved
  - Parameter server based distributed training is not yet supported

---

@@ -39,10 +40,13 @@ See the [Common API](api.md) page for details on how to do this.

---

## Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.mxnet as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

import mxnet as mx
from mxnet import gluon
@@ -62,7 +66,7 @@ trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})
#######################################
# Here we register the block to smdebug
hook.register_block(net)
#######################################

batch_size = 100
mnist = mx.test_utils.get_mnist()
@@ -89,58 +93,9 @@ for i in range(args.epochs):
    metric.reset()
```

## Full API
See the [Common API](api.md) page for details about Collection, SaveConfig, and ReductionConfig.\
See the [Analysis](analysis.md) page for details about analyzing a training job.

## Hook
```python
__init__(
out_dir,
export_tensorboard = False,
tensorboard_dir = None,
dry_run = False,
reduction_config = None,
save_config = None,
include_regex = None,
include_collections = None,
save_all = False,
include_workers = "one",
)
```
Initializes the hook. Register your Gluon model with `register_block` (below) to start saving tensors.

* `out_dir` (str): Where to write the recorded tensors and metadata.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, don't write any files.
* `reduction_config` (ReductionConfig object): See the Common API page.
* `save_config` (SaveConfig object): See the Common API page.
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
* `include_workers` (str): Used for distributed training, can also be "all".
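
For instance, a minimal sketch (the output path is illustrative; `weights` and `gradients` are built-in collection names) of a hook that saves two collections every 100 steps:

```python
import smdebug.mxnet as smd

hook = smd.Hook(
    out_dir="/tmp/smdebug_outputs",                 # illustrative output path
    save_config=smd.SaveConfig(save_interval=100),  # save every 100 steps
    include_collections=["weights", "gradients"],
)
```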

```python
register_block(
self,
block,
)
```
Adds callbacks to the block for recording tensors.

* `block` (mx.gluon.Block): The block to use.
---

```python
save_scalar(
self,
name,
value,
searchable = False,
)
```
Call this method at any point in the training script to log a scalar value, such as accuracy.

* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
* `value` (float): Scalar value.
* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.

## Full API
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.

See the [Analysis](analysis.md) page for details about analyzing a training job.
72 changes: 15 additions & 57 deletions docs/pytorch.md
@@ -8,7 +8,6 @@
- [Full API](#full-api)

## Support

### Versions
- The Zero Script Change experience, where you need no modifications to your training script, is supported in the official [SageMaker Framework Container for PyTorch 1.3](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and the [AWS Deep Learning Container for PyTorch 1.3](https://aws.amazon.com/machine-learning/containers/).

@@ -44,8 +43,11 @@ See the [Common API](api.md) page for details on how to do this.

## Module Loss Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -59,9 +61,11 @@ net = Model()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook and the loss
hook.register_module(net)
hook.register_loss(criterion)
#######################################

# Training loop as usual
for (inputs, labels) in trainloader:
@@ -76,8 +80,11 @@ for (inputs, labels) in trainloader:

## Functional Loss Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -90,77 +97,28 @@ class Model(nn.Module):
net = Model()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook
hook.register_module(net)
#######################################

# Training loop, recording the loss at each iteration
for (inputs, labels) in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = F.cross_entropy(outputs, labels)

    #######################################
    # Manually record the loss
    hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
    #######################################

    loss.backward()
    optimizer.step()
```

---

## Full API
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.
See the [Analysis](analysis.md) page for details about analyzing a training job.

## Hook
```python
__init__(
out_dir,
export_tensorboard = False,
tensorboard_dir = None,
dry_run = False,
reduction_config = None,
save_config = None,
include_regex = None,
include_collections = None,
save_all = False,
include_workers = "one",
)
```
Initializes the hook. Register your model with `register_module` (below) to start saving tensors.

* `out_dir` (str): Where to write the recorded tensors and metadata.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, don't write any files.
* `reduction_config` (ReductionConfig object): See the Common API page.
* `save_config` (SaveConfig object): See the Common API page.
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
* `include_workers` (str): Used for distributed training, can also be "all".

```python
register_module(
self,
module,
)
```
Adds callbacks to the module for recording tensors.

* `module` (torch.nn.Module): The module to use.


```python
save_scalar(
self,
name,
value,
searchable = False,
)
```
Call this method at any point in the training script to log a scalar value, such as accuracy.

* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
* `value` (float): Scalar value.
* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.
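
A one-line usage sketch (the metric name and value are illustrative):

```python
hook.save_scalar("train_accuracy", 0.92, searchable=True)
```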