
WIP docs #16

Closed · wants to merge 37 commits into master from docs · diff shown from 19 commits

Commits:
- e68b964 WIP docs (jarednielsen, Nov 19, 2019)
- 98411f4 Merge branch 'master' into docs (rahul003, Nov 19, 2019)
- aa4381c Merge branch 'master' into docs (rahul003, Nov 20, 2019)
- 0c82bf6 WIP (jarednielsen, Nov 20, 2019)
- cc0e6b9 Merge branch 'docs' of https://github.com/awslabs/sagemaker-debugger … (jarednielsen, Nov 20, 2019)
- 2cece86 Finished common_api.md (jarednielsen, Nov 20, 2019)
- 84b9e27 Merge branch 'master' into docs (jarednielsen, Nov 20, 2019)
- bc5f4f1 WIP (jarednielsen, Nov 20, 2019)
- ead9e6d Merge branch 'master' into docs (jarednielsen, Nov 20, 2019)
- e29fa6f MXNet first pass (jarednielsen, Nov 20, 2019)
- a4a1d4e WIP docs (jarednielsen, Nov 20, 2019)
- aa7c7c3 Merge branch 'master' into docs (jarednielsen, Nov 22, 2019)
- a380445 Address some comments, consolidate summary and glossary into README (jarednielsen, Nov 22, 2019)
- 25baa59 Address comments (jarednielsen, Nov 22, 2019)
- ba1998c Address comments (jarednielsen, Nov 22, 2019)
- b8fa902 Merge branch 'master' into docs (jarednielsen, Nov 23, 2019)
- 45be1ab Merge branch 'master' into docs (jarednielsen, Nov 25, 2019)
- fca73b8 Address some comments (jarednielsen, Nov 25, 2019)
- 3e98c04 WIP default collections (jarednielsen, Nov 25, 2019)
- 0aebf28 More docs (jarednielsen, Nov 25, 2019)
- ed9b6db Docs (jarednielsen, Nov 25, 2019)
- e0fedd0 WIP (jarednielsen, Nov 26, 2019)
- 2bd0816 Ready for first merge (jarednielsen, Nov 26, 2019)
- 2d6c046 Details about JSON file (jarednielsen, Nov 26, 2019)
- 59e0c7d Address some of Rahul's comments, format markdown with python (jarednielsen, Nov 26, 2019)
- 566e4c5 Highlights section (jarednielsen, Nov 26, 2019)
- 9442165 Sagemaker first (jarednielsen, Nov 26, 2019)
- e2b4372 Remove json spec (jarednielsen, Nov 26, 2019)
- 7b1a414 Typo (jarednielsen, Nov 26, 2019)
- d0562b6 docs (jarednielsen, Nov 26, 2019)
- d0ce252 Merge branch 'master' into docs (jarednielsen, Nov 26, 2019)
- a3b9d1d SageMaker ZCC front and center (jarednielsen, Nov 26, 2019)
- ab86c46 explain zcc (jarednielsen, Nov 26, 2019)
- 9c4f730 Merge branch 'master' into docs (jarednielsen, Nov 27, 2019)
- f9a45bd Docs for Trial, Tensor, Rule (#45) (rahul003, Nov 27, 2019)
- 0a509c7 Merge branch 'master' of https://github.com/awslabs/sagemaker-debugge… (rahul003, Nov 27, 2019)
- 6e90b23 Merge branch 'master' into docs (jarednielsen, Nov 27, 2019)
93 changes: 93 additions & 0 deletions documentation/README.md
# Sagemaker Debugger

- [Overview](#overview)
- [Install](#install)
- [Example Usage](#example-usage)
- [Concepts](#concepts)
- [Glossary](#glossary)

## Overview
Sagemaker Debugger is an AWS service to automatically debug your machine learning training process.
It helps you develop better, faster, cheaper models by catching common errors quickly.

## Install
```
pip install smdebug
```

Requires Python 3.6+.

## Example Usage
This example uses tf.keras. Say your training code looks like this:
```
model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
model.fit(x_train, y_train, epochs=args.epochs)
model.evaluate(x_test, y_test)
```

To use Sagemaker Debugger, simply add a callback hook:
```
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir=args.out_dir)

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])
model.evaluate(x_test, y_test, callbacks=[hook])
```

To analyze the result of the training run, create a trial and inspect the tensors.
```
trial = smd.create_trial(out_dir=args.out_dir)
print(f"Saved tensor values for {trial.tensors()}")
print(f"Loss values were {trial.get_collection("losses").values()}")
```
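
Beyond collection values, you can drill into individual tensors step by step. A sketch, assuming the Trial and Tensor methods described in the analysis docs (`trial.tensor(name)`, `tensor.steps()`, `tensor.value(step)`):
```
trial = smd.create_trial(out_dir=args.out_dir)
for tname in trial.tensors():
    tensor = trial.tensor(tname)
    for step in tensor.steps():
        # value(step) returns the saved tensor (or reduction) at that step
        print(f"{tname} at step {step}: {tensor.value(step)}")
```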


## Concepts
The steps to use Sagemaker Debugger in any framework are:

1. Create a `hook`.
2. Register your model and optimizer with the hook.
3. Specify the `rule` to be used.
4. After training, create a `trial` to manually analyze the tensors.

Framework-specific details are here:
- [Tensorflow](https://link.com)
- [PyTorch](https://link.com)
- [MXNet](https://link.com)
- [XGBoost](https://link.com)

## Glossary

The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.

**Hook**: The main interface to use during training. This object can be passed as a model hook/callback
in TensorFlow and Keras. It keeps track of collections and writes output files at each step.
- `hook = smd.Hook(out_dir="/tmp/mnist_job")`

**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
you're in. Defaults to "global".
- `train_mode = smd.modes.TRAIN`

**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
tensors to include/exclude.
- `collection = hook.get_collection("losses")`

**SaveConfig**: A Python dict specifying how often to save losses and tensors.
- `save_config = smd.SaveConfig(save_interval=10)`

**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`

**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors.
jarednielsen marked this conversation as resolved.
Show resolved Hide resolved
- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`

**Rule**: A condition that will trigger an exception and terminate the training job early, for example a vanishing gradient.
3 changes: 3 additions & 0 deletions documentation/analysis.md
# Analysis

TODO: Describe rules and trials.
208 changes: 208 additions & 0 deletions documentation/common_api.md

# Common API
These objects exist across all frameworks.
- [Modes](#modes)
- [Collection](#collection)
- [SaveConfig](#saveconfig)
- [ReductionConfig](#reductionconfig)
- [Hook from JSON](#hook-from-json)



## Modes
Used to signify which part of training you're in, similar to Keras modes. Choose from
```
smd.modes.TRAIN
smd.modes.EVAL
smd.modes.PREDICT
smd.modes.GLOBAL
```
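
If you register the hook yourself (for example in PyTorch or MXNet), you switch phases by calling the hook's `set_mode` method. A minimal sketch:
```
import smdebug.pytorch as smd

hook = smd.Hook(out_dir="/tmp/mnist_job")

hook.set_mode(smd.modes.TRAIN)  # steps saved from here on are tagged TRAIN
# ... training loop ...

hook.set_mode(smd.modes.EVAL)   # switch before evaluation
# ... evaluation loop ...
```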

## Collection

The Collection object groups tensors such as "losses", "weights", "biases", or "gradients".
A collection has its own list of tensors, include/exclude regex patterns, reduction config, and save config.
This lets you set different save and reduction configs for different sets of tensors.
These collections are also available during analysis.

You can choose which of these built-in collections to save (or define your own) via the hook's `include_collections` parameter; a sketch follows the table below. By default, only a few collections are saved.

| Framework | include_collections (default) |
|---|---|
| `TensorFlow` | METRICS, LOSSES, SEARCHABLE_SCALARS |
| `PyTorch` | LOSSES, SCALARS |
| `MXNet` | LOSSES, SCALARS |
| `XGBoost` | METRICS |
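
For example, a TensorFlow hook that also saves weights and gradients might be created as follows. This is a sketch; it assumes `include_collections` accepts collection names as plain strings:
```
import smdebug.tensorflow as smd

# Assumed: collection names passed as lowercase strings
hook = smd.KerasHook(
    out_dir="/tmp/mnist_job",
    include_collections=["losses", "weights", "gradients"],
)
```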

Each framework has pre-defined settings for certain collections. For example, TensorFlow's KerasHook
will automatically place weights into the `smd.CollectionKeys.WEIGHTS` collection. PyTorch uses the regex
`"^(?!gradient).*weight` to automatically place tensors in the weights collection.

| CollectionKey | Frameworks | Description |
|---|---|---|
| `ALL` | all | Saves all tensors. |
| `DEFAULT` | all | ??? |
| `WEIGHTS` | TensorFlow, PyTorch, MXNet | Matches all weights tensors. |
| `BIASES` | TensorFlow, PyTorch, MXNet | Matches all biases tensors. |
| `GRADIENTS` | TensorFlow, PyTorch, MXNet | Matches all gradients tensors. In TensorFlow, must use `hook.wrap_optimizer()`. |
| `LOSSES` | TensorFlow, PyTorch, MXNet | Matches all loss tensors. |
| `SCALARS` | TensorFlow, PyTorch, MXNet | Matches all scalar tensors, such as loss or accuracy. |
| `METRICS` | TensorFlow, XGBoost | ??? |
| `INPUTS` | TensorFlow | Matches all inputs to a layer (outputs of the previous layer). |
| `OUTPUTS` | TensorFlow | Matches all outputs of a layer (inputs of the following layer). |
| `SEARCHABLE_SCALARS` | TensorFlow | ??? |
| `OPTIMIZER_VARIABLES` | TensorFlow | Matches all optimizer variables. |
| `TENSORFLOW_SUMMARIES` | TensorFlow | ??? |
| `HYPERPARAMETERS` | XGBoost | ... |
| `PREDICTIONS` | XGBoost | ... |
| `LABELS` | XGBoost | ... |
| `FEATURE_IMPORTANCE` | XGBoost | ... |
| `AVERAGE_SHAP` | XGBoost | ... |
| `FULL_SHAP` | XGBoost | ... |
| `TREES` | XGBoost | ... |




```
coll = smd.Collection(
    name,
    include_regex = None,
    tensor_names = None,
    reduction_config = None,
    save_config = None,
    save_histogram = True,
)
```
`name` (str): Used to identify the collection.\
`include_regex` (list[str]): The regexes to match tensor names for the collection.\
`tensor_names` (list[str]): A list of tensor names to include.\
`reduction_config` (ReductionConfig object): Which reductions to store in the collection.\
`save_config` (SaveConfig object): Settings for how often to save the collection.\
`save_histogram` (bool): Whether to save histogram data for the collection. Only used if tensorboard support is enabled. Not computed for scalar collections such as losses.
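
For instance, a collection that captures every tensor whose name matches "relu" and saves it every 50 steps might look like this (a sketch; the collection name and regex are illustrative, not part of the library):
```
relu_coll = smd.Collection(
    name="relu_activations",
    include_regex=["relu"],                        # match tensor names containing "relu"
    save_config=smd.SaveConfig(save_interval=50),  # overrides the hook's default cadence
)
```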

### Accessing a Collection

| Function | Behavior |
|---|---|
| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with default settings if it doesn't already exist. |
| ```hook.get_collections()``` | Returns all collections as a dictionary with the keys being names of the collections. |
| ```hook.add_to_collection(collection_name, args)``` | Equivalent to calling `coll.add(args)` on the collection with name `collection_name`. |


### Methods on a Collection

| Method | Behavior |
|---|---|
| ```coll.include(regex)``` | Takes a regex string or a list of regex strings to match tensors to include in the collection. |
| ```coll.include_regex``` | Get or set include_regex for the collection. |
| ```coll.save_config``` | Get or set save_config for the collection. |
| ```coll.reduction_config``` | Get or set reduction config for the collection. |
| ```coll.add(tensor)``` | **(TensorFlow only)** Takes an instance or list or set of tf.Tensor/tf.Variable/tf.MirroredVariable/tf.Operation to add to the collection. |
| ```coll.add_keras_layer(layer, inputs=False, outputs=True)``` | **(tf.keras only)** Takes an instance of a tf.keras layer and logs input/output tensors for that layer. By default, only outputs are saved. |
| ```coll.add_module_tensors(module, inputs=False, outputs=True)``` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input/output tensors for that module. By default, only outputs are saved. |
| ```coll.add_block_tensors(block, inputs=False, outputs=True)``` | **(MXNet only)** Takes an instance of a Gluon block and logs input/output tensors for that block. By default, only outputs are saved. |
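
For example, to log the outputs of a single PyTorch module into its own collection (a sketch; `model.fc1` is an assumed module attribute and the collection name is illustrative):
```
# Assumed names: "fc1_outputs" and model.fc1 are for illustration only
coll = hook.get_collection("fc1_outputs")
coll.add_module_tensors(model.fc1, inputs=False, outputs=True)
```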



## SaveConfig
The SaveConfig class customizes the frequency of saving tensors.
The hook takes a SaveConfig object which is applied as default to all tensors included.
A collection can also have a SaveConfig object which is applied to the collection's tensors.

SaveConfig also allows you to save tensors when certain tensors become NaN.
The tensors to watch are passed as a list of strings giving their names.

```
save_config = smd.SaveConfig(
    mode_save_configs = None,
    save_interval = 100,
    start_step = 0,
    end_step = None,
    save_steps = None,
)
```
`mode_save_configs` (dict): Used for advanced cases; see details below.\
`save_interval` (int): How often, in steps, to save tensors. Defaults to 100. \
`start_step` (int): When to start saving tensors.\
`end_step` (int): When to stop saving tensors, exclusive.\
`save_steps` (list[int]): Specific steps to save tensors at. Combined, as a union, with the steps produced by the other parameters.

For example,

`SaveConfig()` will save at steps [0, 100, ...].\
`SaveConfig(save_interval=1)` will save at steps [0, 1, ...]\
`SaveConfig(save_interval=100, end_step=200)` will save at steps [0, 100].\
`SaveConfig(save_interval=100, end_step=201)` will save at steps [0, 100, 200].\
`SaveConfig(save_interval=100, start_step=150)` will save at steps [200, 300, ...].\
`SaveConfig(save_steps=[3, 7])` will save at steps [3, 7].

There is also a more advanced use case, where you specify a different SaveConfig for each mode.
It is best understood through an example:
```
SaveConfig(mode_save_configs={
    smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1),
    smd.modes.EVAL: smd.SaveConfigMode(save_interval=2),
    smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3),
    smd.modes.GLOBAL: smd.SaveConfigMode(save_interval=4),
})
```
Essentially, you create a dictionary mapping modes to SaveConfigMode objects. A SaveConfigMode takes
the same four parameters (save_interval, start_step, end_step, save_steps) as SaveConfig itself.
Any mode not specified falls back to the default configuration.

## ReductionConfig
ReductionConfig allows the saving of certain reductions of tensors instead
of saving the full tensor. The motivation here is to reduce the amount of data
saved, and increase the speed in cases where you don't need the full
tensor. The reduction operations are computed during training and then saved.

During analysis, these are available as reductions of the original tensor.
Please note that using a reduction config means the full tensor will not be available during analysis,
which restricts what you can do with the saved data.
**Reviewer comment (Contributor):** Reduction computation happens on the training job and can slow it down. This should be considered when the total size of saved tensors is large: the saved size is proportional to tensor size, the number of steps saved, and the number of tensors saved. It can be controlled either by tweaking the save config or by using reductions.
The hook takes a ReductionConfig object which is applied as default to all tensors included.
A collection can also have its own ReductionConfig object which is applied
to the tensors belonging to that collection.

```
reduction_config = smd.ReductionConfig(
    reductions = None,
    abs_reductions = None,
    norms = None,
    abs_norms = None,
    save_raw_tensor = False,
)
```
`reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod".\
`abs_reductions` (list[str]): Same as reductions, except the reduction will be computed on the absolute value of the tensor.\
`norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2".\
`abs_norms` (list[str]): Same as norms, except the norm will be computed on the absolute value of the tensor.\
`save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions.

For example,

`ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`

will return the standard deviation and variance, the mean of the absolute value, and the l1 norm.
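
To attach these, pass a ReductionConfig to the hook as the default and override it per collection where needed. A sketch, assuming the hook constructor's keyword is named `reduction_config` as in the framework docs:
```
# Default: save only the l2 norm of each tensor.
hook = smd.Hook(
    out_dir="/tmp/mnist_job",
    reduction_config=smd.ReductionConfig(norms=["l2"]),
)
# Override: keep the full loss tensors.
hook.get_collection("losses").reduction_config = smd.ReductionConfig(
    save_raw_tensor=True,
)
```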

## Hook from JSON
The simplest way to create a hook is by using the Python API, as described for each framework.
* [TensorFlow](https://link.com)
* [PyTorch](https://link.com)
* [MXNet](https://link.com)
* [XGBoost](https://link.com)

However, you may want to set up your hook configuration in a JSON file. A basic setup is shown here.
```
json_config_path = "/tmp/json_config.json"
hook = smd.get_hook(
    hook_type = None,
    json_config_path = json_config_path,
    create_if_not_exists = True,
)
```
`hook_type` only needs to be specified for TensorFlow, in which case it is one of ["session", "estimator", "keras"].\
The `create_if_not_exists` argument exists for internal reasons; set it to `True`.

The JSON file configuration is detailed further on [AWS Docs](https://link.com).