Merge latest changes from smdebug to smprofiler (#68)
anirudhacharya committed Sep 11, 2020
1 parent ed56c1d commit b71fe99
Showing 38 changed files with 1,070 additions and 339 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -63,21 +63,23 @@ The following frameworks are available AWS Deep Learning Containers with the dee

| Framework | Version |
| --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0 |
| [MXNet](docs/mxnet.md) | 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 |
| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|

+**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions.

### AWS training containers with script mode

The `smdebug` library supports frameworks other than the ones listed above when you use AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script (a sketch follows the table).

| Framework | Versions |
| --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0 |
| Keras (with TensorFlow backend) | 2.3 |
| [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 |
| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)|
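
The "minimal changes" typically amount to creating an `smdebug` hook and registering your model with it. Here is a minimal sketch for PyTorch; the stand-in model, loss, and output directory are illustrative, not part of the original README:

```python
import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)         # stand-in for your real model
loss_fn = nn.CrossEntropyLoss()  # stand-in for your real loss

# out_dir is an illustrative local path; on SageMaker you can instead call
# smd.Hook.create_from_json_file() to pick up the injected configuration.
hook = smd.Hook(out_dir="/tmp/smdebug-demo",
                save_config=smd.SaveConfig(save_interval=100))
hook.register_module(model)   # capture tensors flowing through the model
hook.register_loss(loss_fn)   # also record the loss at each saved step
```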

### Debugger on custom containers or local machines
40 changes: 35 additions & 5 deletions docs/analysis.md
@@ -30,8 +30,10 @@ This page describes the programming model that SageMaker Debugger provides for y
* [steps](#steps-1)
* [value](#value)
* [reduction_value](#reduction_value)
-* [reduction_values](#reduction_values)
+* [shape](#shape)
* [values](#values)
+* [reduction_values](#reduction_values)
+* [shapes](#shapes)
* [workers](#workers-1)
* [prev_steps](#prev_steps)
* [Rules](#Rules)
@@ -356,6 +358,34 @@ trial.tensor(name).reduction_value(step_num, reduction_name,
###### Returns
`numpy.ndarray` The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training through the reduction config, it is loaded and returned. If the given reduction was not saved but the full tensor was saved, the reduction is computed on the fly and returned. If neither the requested reduction nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
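
For illustration (the trial path and tensor name below are hypothetical), fetching the mean of a gradient tensor at step 100 could look like this:

```python
from smdebug.trials import create_trial

trial = create_trial("s3://my-bucket/debug-output/job-1")  # hypothetical output location
mean_grad = trial.tensor("gradients/dense_1").reduction_value(
    step_num=100, reduction_name="mean", abs=False
)
print(float(mean_grad))  # unwrap the 1x1 numpy array to a Python float
```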

#### shape
Get the shape of the chosen tensor at a particular step.

```python
trial.tensor(name).shape(step_num, mode=modes.GLOBAL, worker=None)

```
###### Arguments
- `step_num (int)` The step number whose shape is to be returned for the mode passed through the next parameter.
- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shape of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
`tuple(int)` If the shape of this tensor was saved through the `save_shape` configuration in ReductionConfig, it is returned. If only the full tensor was saved, the shape is computed from it and returned. If neither the shape nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
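
A short hypothetical example, reusing the `trial` handle from the sketch above with an illustrative tensor name:

```python
shape = trial.tensor("conv0_weight").shape(step_num=0)  # tensor name is illustrative
print(shape)  # e.g. (64, 3, 7, 7)
```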

#### values
Get the values of the tensor for all steps of a given mode.

```python
trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
```

###### Arguments
- `mode (smdebug.modes enum value)` The mode whose steps should be considered. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the values of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
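
For example (tensor name illustrative, `trial` as assumed above), iterating over every saved step of a loss tensor:

```python
for step, arr in sorted(trial.tensor("losses/loss").values().items()):
    print(step, arr.mean())  # one numpy array per saved step
```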

#### reduction_values
Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through a reduction or aggregation operation. See the description of the `reduction_value` method for more details.
@@ -372,19 +402,19 @@ trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
###### Returns
`dict[(str, bool) -> numpy.ndarray]` A dictionary mapping tuples of the form `(reduction_name, abs)` to 1x1 numpy ndarrays, where `abs` is a boolean denoting whether the reduction was performed on the absolute value of the tensor. Note that this method only returns the reductions which were saved during the training job; it does not compute all known reductions if only the raw tensor was saved.
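
A hypothetical listing of every reduction saved at step 500 (`trial` handle and tensor name as assumed above):

```python
for (reduction_name, is_abs), value in (
        trial.tensor("weights/w0").reduction_values(step_num=500).items()):
    print(reduction_name, is_abs, float(value))
```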

-#### values
-Get the values of the tensor for all steps of a given mode.
+#### shapes
+Get the shapes of the tensor for all steps of a given mode.

```python
-trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
+trial.tensor(name).shapes(mode=modes.GLOBAL, worker=None)
```

###### Arguments
- `mode (smdebug.modes enum value)` The mode whose steps should be considered. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shapes of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
-`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
+`dict[int -> tuple(int)]` A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.
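
For example (same assumed `trial` handle and an illustrative tensor name):

```python
for step, shape in trial.tensor("conv0_weight").shapes().items():
    print(step, shape)  # e.g. 0 (64, 3, 7, 7)
```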

#### workers
Get all the workers for which this tensor was saved at a given step.
3 changes: 2 additions & 1 deletion docs/api.md
@@ -96,6 +96,7 @@ include_workers
include_regex
reductions
save_raw_tensor
+save_shape
save_interval
save_steps
start_step
@@ -163,6 +164,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`.
|`create_from_json_file(`<br/>` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates the hook from that configuration. This is an optional parameter. <br/> If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH`, and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path to which SageMaker writes the hook configuration.
|`close()` | - | Closes all files that are currently open by the hook |
| `save_scalar()` | `name (str)` <br/> `value (float)` <br/> `sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_tensor()`| `tensor_name (str)` <br/> `tensor_value (numpy.array or numpy.ndarray)` <br/> `collections_to_write (str or list[str])` | Manually save a tensor to the given collections. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
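
A hypothetical snippet showing both manual-save methods, assuming `hook` was created earlier:

```python
import numpy as np

hook.save_scalar("epoch_accuracy", 0.91, sm_metric=True)  # also surfaces as a SageMaker Metric
hook.save_tensor("attention_mask", np.ones((4, 16)),
                 collections_to_write="custom")           # "custom" is an illustrative collection
```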


### TensorFlow specific Hook API
@@ -178,7 +180,6 @@ The following hook APIs are specific to training scripts using the TF 2.x Gradie
| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
-| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|
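
A sketch of the tape-wrapping pattern; the model, data, optimizer, and `out_dir` below are stand-ins, not part of the documented API:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir="/tmp/smdebug-tape")  # illustrative output path
model = tf.keras.layers.Dense(2)                   # stand-in for your real model
opt = tf.keras.optimizers.SGD(0.1)
x = tf.random.normal((8, 4))
y = tf.zeros((8,), tf.int32)

with hook.wrap_tape(tf.GradientTape()) as tape:    # wrapped tape lets smdebug see gradients
    logits = model(x)
    loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y, logits, from_logits=True))
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))
```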

### MXNet specific Hook API

