More doc updates (#106)
* take out env var to new doc, and change some indentation in sagemaker.md

* Fix links in TF

* Merge API for all frameworks and increase details

* Update mxnet.md

* Update mxnet.md

* Update mxnet.md

* Update pytorch.md

* Update tensorflow.md

* Update tensorflow.md

* Update sagemaker.md

* Update api.md

* Update api.md
rahul003 committed Dec 10, 2019
1 parent 07ba669 commit 1bb5a1e
Showing 8 changed files with 469 additions and 467 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -116,7 +116,7 @@ These framework forks are not available in custom containers or non-SM environments
| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks <ul><li>[TensorFlow](docs/tensorflow.md)</li><li>[PyTorch](docs/pytorch.md)</li><li>[MXNet](docs/mxnet.md)</li><li>[XGBoost](docs/xgboost.md)</li></ul> | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [Programming Model for Analysis](docs/analysis.md) | Describes the programming model provided by our APIs, which allows you to interactively explore saved tensors and to write your own Rules to monitor your training jobs. |
| [APIs](docs/api.md) | Full description of our APIs on saving tensors |


## License
511 changes: 304 additions & 207 deletions docs/api.md

Large diffs are not rendered by default.

100 changes: 100 additions & 0 deletions docs/env_var.md
@@ -0,0 +1,100 @@

## Environment Variables

#### `USE_SMDEBUG`:

When using the official [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) or [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/), which support the [Zero Script Change experience](sagemaker.md#zero-script-change), SageMaker Debugger can be disabled by setting this variable to `0`. In that case, the hook is disabled regardless of the configuration given to the job through the SageMaker Python SDK. By default, this is set to `1`, signifying True.
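
As a minimal sketch, the flag can also be flipped from inside a training script, assuming it is set before the container initializes the debugger (setting it in the job's environment is the more reliable route):

```python
import os

# Disable SageMaker Debugger for this process. This only takes effect if it
# runs before the framework creates its hook, so prefer setting the variable
# in the job's environment rather than in the script itself.
os.environ["USE_SMDEBUG"] = "0"
```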

#### `SMDEBUG_CONFIG_FILE_PATH`:

Contains the path to the JSON file that describes the smdebug hook.

At a minimum, the JSON config should contain the path where smdebug should output tensors.
Example:

`{ "LocalPath": "/my/smdebug_hook/path" }`

In a SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON file.
In a non-SageMaker environment, SageMaker Debugger is not used unless this environment variable is set or
a hook is created manually.

Sample JSON from which a hook can be created:
```json
{
  "LocalPath": "/my/smdebug_hook/path",
  "HookParameters": {
    "save_all": false,
    "include_regex": "regex1,regex2",
    "save_interval": "100",
    "save_steps": "1,2,3,4",
    "start_step": "1",
    "end_step": "1000000",
    "reductions": "min,max,mean"
  },
  "CollectionConfigurations": [
    {
      "CollectionName": "collection_obj_name1",
      "CollectionParameters": {
        "include_regex": "regex5*",
        "save_interval": 100,
        "save_steps": "1,2,3",
        "start_step": 1,
        "reductions": "min"
      }
    }
  ]
}
```
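
As a sketch of how this variable is consumed (the config path below is illustrative), smdebug hooks provide a `create_from_json_file` class method that falls back to this environment variable when no explicit path is passed:

```python
import os
import smdebug.mxnet as smd  # the same pattern applies to the other framework modules

# Point smdebug at the JSON config, then build the hook from it instead of
# passing arguments in code.
os.environ["SMDEBUG_CONFIG_FILE_PATH"] = "/path/to/smdebug_config.json"
hook = smd.Hook.create_from_json_file()
```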

#### `TENSORBOARD_CONFIG_FILE_PATH`:

Contains the path to the JSON file that specifies where TensorBoard artifacts need to
be placed.

Sample JSON file:

`{ "LocalPath": "/my/tensorboard/path" }`

In a SageMaker environment, the presence of this JSON is necessary to log any TensorBoard artifact.
By default, this path is set to point to a pre-defined location in SageMaker.

`tensorboard_dir` can also be passed when creating the hook through the API, or in the JSON specified
by `SMDEBUG_CONFIG_FILE_PATH`. For this to take effect, `export_tensorboard` must be set to `True`.
This option to set `tensorboard_dir` is available in both SageMaker and non-SageMaker environments.
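
For example, a minimal sketch of enabling TensorBoard output when creating the hook through the API (paths are illustrative, and the same arguments exist on the other framework hooks):

```python
import smdebug.mxnet as smd

hook = smd.Hook(
    out_dir="/my/smdebug_hook/path",
    export_tensorboard=True,                 # required for tensorboard_dir to take effect
    tensorboard_dir="/my/tensorboard/path",  # where TensorBoard logs are written
)
```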


#### `CHECKPOINT_CONFIG_FILE_PATH`:

Contains the path to the JSON file that specifies where training checkpoints need to
be placed. This is used in the context of spot training.

Sample JSON file:

`{ "LocalPath": "/my/checkpoint/path" }`

In a SageMaker environment, the presence of this JSON is necessary to save checkpoints.
By default, this path is set to point to a pre-defined location in SageMaker.
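
As an illustration of the contract only (smdebug reads this file internally; nothing here is required in user code), a spot-training script could resolve the checkpoint directory like this:

```python
import json
import os

config_path = os.environ.get("CHECKPOINT_CONFIG_FILE_PATH")
if config_path and os.path.exists(config_path):
    with open(config_path) as f:
        checkpoint_dir = json.load(f)["LocalPath"]  # e.g. "/my/checkpoint/path"
```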


#### `SAGEMAKER_METRICS_DIRECTORY`:

Contains the path to the directory where metrics will be recorded for consumption by SageMaker Metrics.
This is relevant only in a SageMaker environment, where this variable points to a pre-defined location.


#### `TRAINING_END_DELAY_REFRESH`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. This environment variable
specifies how many seconds to wait before refreshing the index files to check whether training has ended
and new tensors are available. By default, this value is set to 1.


#### `INCOMPLETE_STEP_WAIT_WINDOW`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. A trial checks whether a step
specified in the smdebug hook has been completed. This environment variable
specifies the maximum number of incomplete steps that the trial will wait for before marking
half of them as complete. Default: 1000.
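
Both of these are ordinary environment variables, so an analysis script can set them before creating a trial. A sketch with illustrative values and an illustrative path:

```python
import os

# Illustrative values: refresh the index every 5 seconds, and wait on at most
# 200 incomplete steps before half of them are marked complete.
os.environ["TRAINING_END_DELAY_REFRESH"] = "5"
os.environ["INCOMPLETE_STEP_WAIT_WINDOW"] = "200"

from smdebug.trials import create_trial

trial = create_trial(path="/my/smdebug_hook/path")
print(trial.tensor_names())
```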
71 changes: 13 additions & 58 deletions docs/mxnet.md
@@ -3,17 +3,18 @@
## Contents
- [Support](#support)
- [How to Use](#how-to-use)
- [Example](#example)
- [Full API](#full-api)

---

## Support

### Versions
- The Zero Script Change experience, where you need no modifications to your training script, is supported in the official [SageMaker Framework Container for MXNet 1.6](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and the [AWS Deep Learning Container for MXNet 1.6](https://aws.amazon.com/machine-learning/containers/).

- This library itself supports MXNet 1.4, 1.5, and 1.6 when you use our API, which requires a few minimal changes to your training script.
  - Only Gluon models are supported
  - When a Gluon model is hybridized, inputs and outputs of intermediate layers cannot be saved
  - Parameter server based distributed training is not yet supported

---

@@ -39,10 +40,13 @@ See the [Common API](api.md) page for details on how to do this.

---

## Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.mxnet as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

import mxnet as mx
from mxnet import gluon
@@ -62,7 +66,7 @@ trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})
#######################################
# Here we register the block to smdebug
hook.register_block(net)
#######################################

batch_size = 100
mnist = mx.test_utils.get_mnist()
@@ -89,58 +93,9 @@ for i in range(args.epochs):
    metric.reset()
```

## Full API
See the [Common API](api.md) page for details about Collection, SaveConfig, and ReductionConfig.\
See the [Analysis](analysis.md) page for details about analyzing a training job.

## Hook
```python
__init__(
out_dir,
export_tensorboard = False,
tensorboard_dir = None,
dry_run = False,
reduction_config = None,
save_config = None,
include_regex = None,
include_collections = None,
save_all = False,
include_workers = "one",
)
```
Initializes the hook. Register your Gluon model with `register_block` (below) to start saving tensors.

* `out_dir` (str): Where to write the recorded tensors and metadata.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, don't write any files.
* `reduction_config` (ReductionConfig object): See the Common API page.
* `save_config` (SaveConfig object): See the Common API page.
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
* `include_workers` (str): Used for distributed training, can also be "all".
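
For instance, a minimal sketch (the output path is illustrative; `weights` and `gradients` are built-in collection names) of a hook that saves two collections every 100 steps:

```python
import smdebug.mxnet as smd

hook = smd.Hook(
    out_dir="/tmp/smdebug_outputs",                 # illustrative output path
    save_config=smd.SaveConfig(save_interval=100),  # save every 100 steps
    include_collections=["weights", "gradients"],
)
```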

```python
register_block(
self,
block,
)
```
Adds callbacks to the block for recording tensors.

* `block` (mx.gluon.Block): The block to use.
---

```python
save_scalar(
self,
name,
value,
searchable = False,
)
```
Call this method at any point in the training script to log a scalar value, such as accuracy.

* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
* `value` (float): Scalar value.
* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.

## Full API
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.

See the [Analysis](analysis.md) page for details about analyzing a training job.
72 changes: 15 additions & 57 deletions docs/pytorch.md
@@ -8,7 +8,6 @@
- [Full API](#full-api)

## Support

### Versions
- The Zero Script Change experience, where you need no modifications to your training script, is supported in the official [SageMaker Framework Container for PyTorch 1.3](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and the [AWS Deep Learning Container for PyTorch 1.3](https://aws.amazon.com/machine-learning/containers/).

@@ -44,8 +43,11 @@ See the [Common API](api.md) page for details on how to do this.

## Module Loss Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -59,9 +61,11 @@ net = Model()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook and the loss
hook.register_module(net)
hook.register_loss(criterion)
#######################################

# Training loop as usual
for (inputs, labels) in trainloader:
@@ -76,8 +80,11 @@ for (inputs, labels) in trainloader:

## Functional Loss Example
```python
#######################################
# Create a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -90,77 +97,28 @@ class Model(nn.Module):
net = Model()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook
hook.register_module(net)
#######################################

# Training loop, recording the loss at each iteration
for (inputs, labels) in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = F.cross_entropy(outputs, labels)

    #######################################
    # Manually record the loss
    hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
    #######################################

    loss.backward()
    optimizer.step()
```

---

## Full API
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.
See the [Analysis](analysis.md) page for details about analyzing a training job.

## Hook
```python
__init__(
out_dir,
export_tensorboard = False,
tensorboard_dir = None,
dry_run = False,
reduction_config = None,
save_config = None,
include_regex = None,
include_collections = None,
save_all = False,
include_workers = "one",
)
```
Initializes the hook. Register your model with `register_module` (below) to start saving tensors.

* `out_dir` (str): Where to write the recorded tensors and metadata.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, don't write any files.
* `reduction_config` (ReductionConfig object): See the Common API page.
* `save_config` (SaveConfig object): See the Common API page.
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
* `include_workers` (str): Used for distributed training, can also be "all".

```python
register_module(
self,
module,
)
```
Adds callbacks to the module for recording tensors.

* `module` (torch.nn.Module): The module to use.


```python
save_scalar(
self,
name,
value,
searchable = False,
)
```
Call this method at any point in the training script to log a scalar value, such as accuracy.

* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
* `value` (float): Scalar value.
* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.
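
A one-line usage sketch (the metric name and value are illustrative):

```python
hook.save_scalar("train_accuracy", 0.92, searchable=True)
```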