Merge latest changes from smdebug to smprofiler (#68)
anirudhacharya committed Sep 11, 2020
1 parent ed56c1d commit b71fe99
Showing 38 changed files with 1,070 additions and 339 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -63,21 +63,23 @@ The following frameworks are available AWS Deep Learning Containers with the dee

| Framework | Version |
| --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0 |
| [MXNet](docs/mxnet.md) | 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 |
| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|

+**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions.

### AWS training containers with script mode

The `smdebug` library supports frameworks other than the ones listed above when you use AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script (a sketch follows the table).

| Framework | Versions |
| --- | --- |
-| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 |
+| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0 |
| Keras (with TensorFlow backend) | 2.3 |
| [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 |
-| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 |
+| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 |
| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)|
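
The "minimal changes" typically amount to creating an `smdebug` hook and registering your model with it. Here is a minimal sketch for PyTorch; the stand-in model, loss, and output directory are illustrative, not part of the original README:

```python
import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)         # stand-in for your real model
loss_fn = nn.CrossEntropyLoss()  # stand-in for your real loss

# out_dir is an illustrative local path; on SageMaker you can instead call
# smd.Hook.create_from_json_file() to pick up the injected configuration.
hook = smd.Hook(out_dir="/tmp/smdebug-demo",
                save_config=smd.SaveConfig(save_interval=100))
hook.register_module(model)   # capture tensors flowing through the model
hook.register_loss(loss_fn)   # also record the loss at each saved step
```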

### Debugger on custom containers or local machines
40 changes: 35 additions & 5 deletions docs/analysis.md
@@ -30,8 +30,10 @@ This page describes the programming model that SageMaker Debugger provides for y
* [steps](#steps-1)
* [value](#value)
* [reduction_value](#reduction_value)
-* [reduction_values](#reduction_values)
+* [shape](#shape)
* [values](#values)
+* [reduction_values](#reduction_values)
+* [shapes](#shapes)
* [workers](#workers-1)
* [prev_steps](#prev_steps)
* [Rules](#Rules)
@@ -356,6 +358,34 @@ trial.tensor(name).reduction_value(step_num, reduction_name,
###### Returns
`numpy.ndarray` The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training through the reduction config, it is loaded and returned. If the given reduction was not saved but the full tensor was saved, the reduction is computed on the fly and returned. If neither the requested reduction nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
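
For illustration (the trial path and tensor name below are hypothetical), fetching the mean of a gradient tensor at step 100 could look like this:

```python
from smdebug.trials import create_trial

trial = create_trial("s3://my-bucket/debug-output/job-1")  # hypothetical output location
mean_grad = trial.tensor("gradients/dense_1").reduction_value(
    step_num=100, reduction_name="mean", abs=False
)
print(float(mean_grad))  # unwrap the 1x1 numpy array to a Python float
```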

#### shape
Get the shape of the chosen tensor at a particular step.

```python
trial.tensor(name).shape(step_num, mode=modes.GLOBAL, worker=None)

```
###### Arguments
- `step_num (int)` The step number whose shape is to be returned for the mode passed through the next parameter.
- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shape of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
`tuple(int)` If the shape of this tensor was saved through the `save_shape` configuration in ReductionConfig, it is returned. If only the full tensor was saved, the shape is computed from it and returned. If neither the shape nor the full tensor is available, this method raises a `TensorUnavailableForStep` exception.
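
A short hypothetical example, reusing the `trial` handle from the sketch above with an illustrative tensor name:

```python
shape = trial.tensor("conv0_weight").shape(step_num=0)  # tensor name is illustrative
print(shape)  # e.g. (64, 3, 7, 7)
```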

#### values
Get the values of the tensor for all steps of a given mode.

```python
trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
```

###### Arguments
- `mode (smdebug.modes enum value)` The mode whose steps should be considered. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the values of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
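
For example (tensor name illustrative, `trial` as assumed above), iterating over every saved step of a loss tensor:

```python
for step, arr in sorted(trial.tensor("losses/loss").values().items()):
    print(step, arr.mean())  # one numpy array per saved step
```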

#### reduction_values
Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through a reduction or aggregation operation. See the description of the `reduction_value` method for more details.
@@ -372,19 +402,19 @@ trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
###### Returns
`dict[(str, bool) -> numpy.ndarray]` A dictionary mapping tuples of the form `(reduction_name, abs)` to 1x1 numpy ndarrays, where `abs` is a boolean denoting whether the reduction was performed on the absolute value of the tensor. Note that this method only returns the reductions which were saved during the training job; it does not compute all known reductions if only the raw tensor was saved.
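
A hypothetical listing of every reduction saved at step 500 (`trial` handle and tensor name as assumed above):

```python
for (reduction_name, is_abs), value in (
        trial.tensor("weights/w0").reduction_values(step_num=500).items()):
    print(reduction_name, is_abs, float(value))
```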

-#### values
-Get the values of the tensor for all steps of a given mode.
+#### shapes
+Get the shapes of the tensor for all steps of a given mode.

```python
-trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
+trial.tensor(name).shapes(mode=modes.GLOBAL, worker=None)
```

###### Arguments
- `mode (smdebug.modes enum value)` The mode whose steps should be considered. Defaults to `modes.GLOBAL`.
- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shapes of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`.

###### Returns
-`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
+`dict[int -> tuple(int)]` A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.
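
For example (same assumed `trial` handle and an illustrative tensor name):

```python
for step, shape in trial.tensor("conv0_weight").shapes().items():
    print(step, shape)  # e.g. 0 (64, 3, 7, 7)
```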

#### workers
Get all the workers for which this tensor was saved at a given step.
3 changes: 2 additions & 1 deletion docs/api.md
@@ -96,6 +96,7 @@ include_workers
include_regex
reductions
save_raw_tensor
+save_shape
save_interval
save_steps
start_step
@@ -163,6 +164,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`.
|`create_from_json_file(`<br/>` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates the hook from that configuration. This is an optional parameter. <br/> If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH`, and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path to which SageMaker writes the hook configuration.
|`close()` | - | Closes all files that are currently open by the hook |
| `save_scalar()` | `name (str)` <br/> `value (float)` <br/> `sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_tensor()`| `tensor_name (str)` <br/> `tensor_value (numpy.array or numpy.ndarray)` <br/> `collections_to_write (str or list[str])` | Manually save a tensor to the given collections. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
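
A hypothetical snippet showing both manual-save methods, assuming `hook` was created earlier:

```python
import numpy as np

hook.save_scalar("epoch_accuracy", 0.91, sm_metric=True)  # also surfaces as a SageMaker Metric
hook.save_tensor("attention_mask", np.ones((4, 16)),
                 collections_to_write="custom")           # "custom" is an illustrative collection
```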


### TensorFlow specific Hook API
@@ -178,7 +180,6 @@ The following hook APIs are specific to training scripts using the TF 2.x Gradie
| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
-| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|
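
A sketch of the tape-wrapping pattern; the model, data, optimizer, and `out_dir` below are stand-ins, not part of the documented API:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.KerasHook(out_dir="/tmp/smdebug-tape")  # illustrative output path
model = tf.keras.layers.Dense(2)                   # stand-in for your real model
opt = tf.keras.optimizers.SGD(0.1)
x = tf.random.normal((8, 4))
y = tf.zeros((8,), tf.int32)

with hook.wrap_tape(tf.GradientTape()) as tape:    # wrapped tape lets smdebug see gradients
    logits = model(x)
    loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y, logits, from_logits=True))
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))
```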

### MXNet specific Hook API

