diff --git a/README.md b/README.md
index d824a9a2e..7cfb18d2b 100644
--- a/README.md
+++ b/README.md
@@ -116,7 +116,7 @@ These framework forks are not available in custom containers or non-SM environments
| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
| Frameworks | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
| [Programming Model for Analysis](docs/analysis.md) | For description of the programming model provided by our APIs which allows you to perform interactive exploration of tensors saved as well as to write your own Rules monitoring your training jobs. |
-| [APIs](docs/api.md) | Full description of our APIs |
| [APIs](docs/api.md) | Full description of our APIs for saving tensors |

## License

diff --git a/docs/api.md b/docs/api.md
index 970b0c2ca..b47ec3692 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -1,19 +1,27 @@
-# Common API
-These objects exist across all frameworks.
-- [Creating a Hook](#creating-a-hook)
-  - [Hook from SageMaker](#hook-from-sagemaker)
-  - [Hook from Python](#hook-from-python)
# Saving Tensors API

- [Glossary](#glossary)
- [Hook](#hook)
  - [Creating a Hook](#creating-a-hook)
    - [Hook when using SageMaker Python SDK](#hook-when-using-sagemaker-python-sdk)
    - [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk)
    - [Hook from Python constructor](#hook-from-python-constructor)
  - [Common Hook API](#common-hook-api)
  - [TensorFlow specific Hook API](#tensorflow-specific-hook-api)
  - [MXNet specific Hook API](#mxnet-specific-hook-api)
  - [PyTorch specific Hook API](#pytorch-specific-hook-api)
- [Modes](#modes)
- [Collection](#collection)
- [SaveConfig](#saveconfig)
- [ReductionConfig](#reductionconfig)
-- [Environment Variables](#environment-variables)

## Glossary
The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.

**Step**: A step is the work done by the training job for one batch (i.e. one forward and backward pass). (An exception is TensorFlow's Session interface, where a step also includes the initialization session run calls.) SageMaker Debugger is designed in terms of steps: when to save data is specified in terms of steps, and Rules are invoked on a step-by-step basis as well.

**Hook**: The main class to pass as a callback object, or to create callback functions. It keeps track of collections and writes output files at each step.
- `hook = smd.Hook(out_dir="/tmp/mnist_job")`

@@ -21,133 +29,239 @@ The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.
you're in. Defaults to "global".
- `train_mode = smd.modes.TRAIN`

-**Collection**: A group of tensors. Each collection contains its own save configuration and regexes for
-tensors to include/exclude.
**Collection**: A group of tensors. Each collection contains its own configuration for which tensors are part of it, and when to save them.
- `collection = hook.get_collection("losses")`

**SaveConfig**: A Python dict specifying how often to save losses and tensors.
- `save_config = smd.SaveConfig(save_interval=10)`

-**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor.
**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor. Reductions are simple floats.
- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])` **Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See [trials documentation](analysis.md). - `trial = smd.create_trial(out_dir="/tmp/mnist_job")` -**Rule**: A condition that will trigger an exception, for example a vanishing gradient. See [rules documentation](analysis.md). - +**Rule**: A condition to monitor the saved data for. It can trigger an exception when the condition is met, for example a vanishing gradient. See [rules documentation](analysis.md). --- -## Creating a Hook +## Hook +### Creating a Hook +Note that when using Zero Script Change supported containers in SageMaker, you generally do not need to create your hook object except for some advanced use cases where you need access to the hook. + +`HookClass` or `hook_class` below will be `Hook` for PyTorch, MXNet, and XGBoost. It will be one of `KerasHook`, `SessionHook` or `EstimatorHook` for TensorFlow. -### Hook from SageMaker +The framework in `smd` import below refers to one of `tensorflow`, `mxnet`, `pytorch` or `xgboost`. + +#### Hook when using SageMaker Python SDK If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API as described in [AWS Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html), -the a JSON file will be automatically written. You can create a hook from this file by calling +a JSON file containing the hook configuration will be automatically written to the training container. In such a case, you can create a hook from that configuration file by calling ```python +import smdebug.{framework} as smd hook = smd.{hook_class}.create_from_json_file() ``` -with no arguments and then use the hook Python API in your script. `hook_class` will be `Hook` for PyTorch, MXNet, and XGBoost. It will be one of `KerasHook`, `SessionHook`, `EstimatorHook` for TensorFlow. +with no arguments and then use the hook Python API in your script. + +#### Configuring Hook using SageMaker Python SDK +Parameters to the Hook are passed as below when using the SageMaker Python SDK. +```python +from sagemaker.debugger import DebuggerHookConfig +hook_config = DebuggerHookConfig( + s3_output_path='s3://smdebug-dev-demo-pdx/mnist', + hook_parameters={ + "parameter": "value" + }) +``` +The parameters can be one of the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So for any parameter which accepts a list (such as save_steps, reductions, include_regex), the value needs to be given as strings separated by a comma between them. +``` +dry_run +save_all +include_workers +include_regex +reductions +save_raw_tensor +save_interval +save_steps +start_step +end_step +train.save_interval +train.save_steps +train.start_step +train.end_step +eval.save_interval +eval.save_steps +eval.start_step +eval.end_step +predict.save_interval +predict.save_steps +predict.start_step +predict.end_step +global.save_interval +global.save_steps +global.start_step +global.end_step +``` -### Hook from Python +#### Hook from Python constructor See the framework-specific pages for more details. -* [TensorFlow](tensorflow.md) -* [PyTorch](pytorch.md) -* [MXNet](mxnet.md) -* [XGBoost](xgboost.md) + +HookClass below can be one of `KerasHook`, `SessionHook`, `EstimatorHook` for TensorFlow, or is just `Hook` for MXNet, Pytorch and XGBoost. 
+ +```python +hook = HookClass( + out_dir, + export_tensorboard = False, + tensorboard_dir = None, + dry_run = False, + reduction_config = None, + save_config = None, + include_regex = None, + include_collections = None, + save_all = False, + include_workers="one" +) +``` +##### Arguments +- `out_dir` (str): Path where to save tensors and metadata. This is a required argument. +- `export_tensorboard` (bool): Whether to export TensorBoard summaries (distributions and histograms for tensors saved, and scalar summaries for scalars saved). Defaults to `False`. Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example. +- `tensorboard_dir` (str): Path where to save TensorBoard artifacts. If this is not passed and `export_tensorboard` is True, then TensorBoard artifacts are saved in `out_dir/tensorboard` . Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example. +- `dry_run` (bool): If true, don't write any files +- `reduction_config`: ([ReductionConfig](#reductionconfig) object) Specifies the reductions to be applied as default for tensors saved. A collection can have its own `ReductionConfig` object which overrides this for the tensors which belong to that collection. +- `save_config`: ([SaveConfig](#saveconfig) object) Specifies when to save tensors. A collection can have its own `SaveConfig` object which overrides this for the tensors which belong to that collection. +- `include_regex` (list[str]): list of regex patterns which specify the tensors to save. Tensors whose names match these patterns will be saved +- `include_collections` (list[str]): List of which collections to save specified by name +- `save_all` (bool): Saves all tensors and collections. Increases the amount of disk space used, and can reduce the performance of the training job significantly, depending on the size of the model. +- `include_workers` (str): Used for distributed training. It can take the values `one` or `all`. `one` means only the tensors from one chosen worker will be saved. This is the default behavior. `all` means tensors from all workers will be saved. + +### Common Hook API +These methods are common for all hooks in any framework. + +Note that `smd` import below translates to `import smdebug.{framework} as smd`. + +| Method | Arguments | Behavior | +| --- | --- | --- | +|`add_collection(collection)` | `collection (smd.Collection)` | Takes a Collection object and adds it to the CollectionManager that the Hook holds. Note that you should only pass in a Collection object for the same framework as the hook | +|`get_collection(name)`| `name (str)` | Returns collection identified by the given name | +|`get_collections()` | - | Returns all collection objects held by the hook | +|`set_mode(mode)`| value of the enum `smd.modes` | Sets mode of the job, can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT` or `smd.modes.GLOBAL`. Refer [Modes](#modes) for more on that. | +|`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates the hook from that configuration. This is an optional parameter. If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH`, and then defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path, because this is the default path to which SageMaker writes the hook configuration. |
|`close()` | - | Closes all files that are currently open by the hook |

### TensorFlow specific Hook API
Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook, based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.

| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed, with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using Zero Script Change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of type `KerasHook`, you can pass in either an object of type `tf.train.Optimizer` or `tf.keras.Optimizer`. If the hook is of type `SessionHook` or `EstimatorHook`, the optimizer can only be of type `tf.train.Optimizer`. |
| `add_to_collection(collection_name, variable)` | `collection_name (str)`: name of the collection to add to.<br/>`variable`: parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more. |
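For illustration, here is a minimal sketch of `wrap_optimizer` in a script outside of Zero Script Change environments. It assumes TF 1.x, and that `args` and `loss` are defined as in the examples on the [TensorFlow page](tensorflow.md):

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.SessionHook(out_dir=args.out_dir, include_collections=["gradients", "losses"])

optimizer = tf.train.AdamOptimizer(learning_rate=args.lr)
# Returns the same optimizer object; smdebug can now identify and save gradients
optimizer = hook.wrap_optimizer(optimizer)
train_op = optimizer.minimize(loss)

sess = tf.train.MonitoredSession(hooks=[hook])
sess.run([train_op])
```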
### MXNet specific Hook API

| Method | Arguments | Behavior |
| --- | --- | --- |
| `register_block(block)` | `block (mx.gluon.Block)` | Calling this method applies the hook to the Gluon block representing the model, so SageMaker Debugger gets called by MXNet and can save the tensors required. |
| `save_scalar(name, value, sm_metric=False)` | `name (str)`<br/>`value (float)`<br/>`sm_metric (bool)` | Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric, which shows up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True also saves the scalar on AWS servers. The default value of `sm_metric` for this method is False. |
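To put these methods together, a minimal sketch (assuming `net` is a Gluon model and `acc` a computed accuracy, as in the example on the [MXNet page](mxnet.md)):

```python
import smdebug.mxnet as smd

hook = smd.Hook(out_dir=args.out_dir)
# Apply the hook to the Gluon block so MXNet calls SageMaker Debugger
hook.register_block(net)
# ... training loop ...
# Also surface a scalar as a SageMaker Metric
hook.save_scalar("accuracy", acc, sm_metric=True)
```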
### PyTorch specific Hook API

| Method | Arguments | Behavior |
| --- | --- | --- |
| `register_module(module)` | `module (torch.nn.Module)` | Calling this method applies the hook to the Torch Module representing the model, so SageMaker Debugger gets called by PyTorch and can save the tensors required. |
| `register_loss(loss_module)` | `loss_module (torch.nn.modules.loss._Loss)` | Calling this method applies the hook to the Torch Module representing the loss, so SageMaker Debugger can save losses. |
| `save_scalar(name, value, sm_metric=False)` | `name (str)`<br/>`value (float)`<br/>`sm_metric (bool)` | Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric, which shows up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True also saves the scalar on AWS servers. The default value of `sm_metric` for this method is False. |
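A minimal sketch combining these methods (assuming `net` and `criterion` are defined as in the examples on the [PyTorch page](pytorch.md)):

```python
import smdebug.pytorch as smd

hook = smd.Hook(out_dir=args.out_dir)
hook.register_module(net)      # net is a torch.nn.Module
hook.register_loss(criterion)  # criterion is e.g. nn.CrossEntropyLoss()
# ... training loop ...
hook.save_scalar("accuracy", acc, sm_metric=True)
```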
---

## Modes
Used to signify which part of training you're in, similar to Keras modes. `GLOBAL` mode is used as
-a default. Choose from
a default when no mode is set. Choose from
```python
-smd.modes.TRAIN
-smd.modes.EVAL
-smd.modes.PREDICT
-smd.modes.GLOBAL
smdebug.modes.TRAIN
smdebug.modes.EVAL
smdebug.modes.PREDICT
smdebug.modes.GLOBAL
```

The modes enum is also available under the alias `smdebug.{framework}.modes`.
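For instance, a minimal sketch of switching modes around Keras train and eval calls (assuming `hook`, `model` and the data already exist, as in the [TensorFlow page](tensorflow.md) examples):

```python
import smdebug.tensorflow as smd

hook.set_mode(smd.modes.TRAIN)
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])

hook.set_mode(smd.modes.EVAL)
model.evaluate(x_test, y_test, callbacks=[hook])
```

Saving tensors by mode lets you query, say, only the evaluation steps during analysis.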
---

## Collection

-The Collection object groups tensors such as "losses", "weights", "biases", or "gradients".
-A collection has its own list of tensors, include/exclude regex patterns, reduction config and save config.
-This allows setting of different save and reduction configs for different tensors.
-These collections are then also available during analysis.
The construct of a Collection groups tensors together. A Collection is identified by a string representing its name. It can be used to group tensors of a particular kind, such as "losses", "weights", "biases", or "gradients". A Collection has its own list of tensors specified by include regex patterns, and other parameters determining how these tensors should be saved and when. Using collections enables you to save different types of tensors at different frequencies and in different forms. These collections are then also available during analysis, so you can query a group of tensors at once.

-You can choose which of these builtin collections (or define your own) to save in the hook's `include_collections` parameter. By default, only a few collections are saved.
-
-| Framework | include_collections (default) |
-|---|---|
-| `TensorFlow` | METRICS, LOSSES, SEARCHABLE_SCALARS |
-| `PyTorch` | LOSSES, SCALARS |
-| `MXNet` | LOSSES, SCALARS |
-| `XGBoost` | METRICS |
There are a number of built-in collections that SageMaker Debugger manages by default. This means that the library takes care of identifying what tensors should be saved as part of each collection. You can also define custom collections, and there are a couple of different ways to do so.

-Each framework has pre-defined settings for certain collections. For example, TensorFlow's KerasHook
-will automatically place weights into the `smd.CollectionKeys.WEIGHTS` collection. PyTorch uses the regex
-`"^(?!gradient).*weight` to automatically place tensors in the weights collection.
You can specify which of these collections to save in the hook's `include_collections` parameter, or through the `collection_configs` parameter to the `DebuggerHookConfig` in the SageMaker Python SDK.

-| CollectionKey | Frameworks | Description |
-|---|---|---|
-| `ALL` | all | Saves all tensors. |
-| `DEFAULT` | all | ??? |
-| `WEIGHTS` | TensorFlow, PyTorch, MXNet | Matches all weights tensors. |
-| `BIASES` | TensorFlow, PyTorch, MXNet | Matches all biases tensors. |
-| `GRADIENTS` | TensorFlow, PyTorch, MXNet | Matches all gradients tensors. In TensorFlow non-DLC, must use `hook.wrap_optimizer()`. |
-| `LOSSES` | TensorFlow, PyTorch, MXNet | Matches all loss tensors. |
-| `SCALARS` | TensorFlow, PyTorch, MXNet | Matches all scalar tensors, such as loss or accuracy. |
-| `METRICS` | TensorFlow, XGBoost | Evaluation metrics computed by the algorithm. |
-| `INPUTS` | TensorFlow | Matches all inputs to a layer (outputs of the previous layer). |
-| `OUTPUTS` | TensorFlow | Matches all outputs of a layer (inputs of the following layer). |
-| `SEARCHABLE_SCALARS` | TensorFlow | Scalars that will go to SageMaker Metrics. |
-| `OPTIMIZER_VARIABLES` | TensorFlow | Matches all optimizer variables. |
-| `HYPERPARAMETERS` | XGBoost | [Booster paramameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
-| `PREDICTIONS` | XGBoost | Predictions on validation set (if provided) |
-| `LABELS` | XGBoost | Labels on validation set (if provided) |
-| `FEATURE_IMPORTANCE` | XGBoost | Feature importance given by [get_score()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score) |
-| `FULL_SHAP` | XGBoost | A matrix of (nsmaple, nfeatures + 1) with each record indicating the feature contributions ([SHAP values](https://github.com/slundberg/shap)) for that prediction. Computed on training data with [predict()](https://github.com/slundberg/shap) |
-| `AVERAGE_SHAP` | XGBoost | The sum of SHAP value magnitudes over all samples. Represents the impact each feature has on the model output. |
-| `TREES` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe) |

### Built in Collections
Below is a comprehensive list of the built-in collections that are managed by SageMaker Debugger. The Hook identifies the tensors that should be saved as part of each collection for that framework, and saves them if they were requested.

The names of these collections are all lower case strings.

| Name | Supported by frameworks/hooks | Description |
|---|---|---|
| `all` | all | Matches all tensors |
| `default` | all | The default collection created, which matches the regex patterns passed as `include_regex` to the Hook |
| `weights` | TensorFlow, PyTorch, MXNet | Matches all weights of the model |
| `biases` | TensorFlow, PyTorch, MXNet | Matches all biases of the model |
| `gradients` | TensorFlow, PyTorch, MXNet | Matches all gradients of the model. In TensorFlow, when not using Zero Script Change environments, you must use `hook.wrap_optimizer()`. |
| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model |
| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, the evaluation metrics computed by the algorithm. |
| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model |
| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metrics. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables; currently only supported in Keras. |
| `hyperparameters` | XGBoost | [Booster parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
| `predictions` | XGBoost | Predictions on validation set (if provided) |
| `labels` | XGBoost | Labels on validation set (if provided) |
| `feature_importance` | XGBoost | Feature importance given by [get_score()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score) |
| `full_shap` | XGBoost | A matrix of (nsample, nfeatures + 1) with each record indicating the feature contributions ([SHAP values](https://github.com/slundberg/shap)) for that prediction. Computed on training data with [predict()](https://github.com/slundberg/shap) |
| `average_shap` | XGBoost | The sum of SHAP value magnitudes over all samples. Represents the impact each feature has on the model output. |
| `trees` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe) |

### Default collections saved
The following collections are saved regardless of the hook configuration.

| Framework | Default collections saved |
|---|---|
| `TensorFlow` | METRICS, LOSSES, SM_METRICS |
| `PyTorch` | LOSSES |
| `MXNet` | LOSSES |
| `XGBoost` | METRICS |

If for some reason you want to disable the saving of these collections, you can do so by setting `end_step` to 0 in the collection's SaveConfig. When using the SageMaker Python SDK this would look like:
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    collection_configs=[
        CollectionConfig(name="metrics", parameters={"end_step": 0})
    ]
)
```
When configuring the Collection in your Python script, it would be as follows:
```python
hook.get_collection("metrics").save_config.end_step = 0
```

-### Accessing a Collection
### Creating or retrieving a Collection

| Function | Behavior |
|---|---|
-| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with default settings if it doesn't already exist. |
-| ```hook.get_collections()``` | Returns all collections as a dictionary with the keys being names of the collections. |
-| ```hook.add_to_collection(collection_name, args)``` | Equivalent to calling `coll.add(args)` on the collection with name `collection_name`. |
| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with default configuration if it doesn't already exist.<br/>A new collection created by default does not match any tensor, is configured to save histograms and distributions along with the tensor if TensorBoard support is enabled, and uses the reduction configuration and save configuration passed to the hook. |
### Properties of a Collection

| Property | Description |
|---|---|
-| `tensor_names` | Get or set list of tensor names as strings. |
-| `include_regex` | Get or set list of regexes to include. |
-| `reduction_config` | Get or set the ReductionConfig object. |
-| `save_config` | Get or set the SaveConfig object. |
| `tensor_names` | Get or set the list of tensor names as strings |
| `include_regex` | Get or set the list of regexes to include. Tensors whose names match these regex patterns will be included in the collection |
| `reduction_config` | Get or set the ReductionConfig object to be used for tensors that are part of this collection |
| `save_config` | Get or set the SaveConfig object to be used for tensors that are part of this collection |
| `save_histogram` | Get or set the boolean flag which determines whether to write histograms (enabling histograms and distributions in TensorBoard) for tensors that are part of this collection. Only applicable if TensorBoard support is enabled. |

### Methods on a Collection

@@ -160,17 +274,56 @@ coll = smd.Collection(
| ```coll.add_module_tensors(module, inputs=False, outputs=True)``` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input/output tensors for that module. By default, only outputs are saved. |
| ```coll.add_block_tensors(block, inputs=False, outputs=True)``` | **(MXNet only)** Takes an instance of a Gluon block, and logs input/output tensors for that module. By default, only outputs are saved. |

### Configuring Collection using SageMaker Python SDK
Parameters to configure a Collection are passed as below when using the SageMaker Python SDK.
```python
from sagemaker.debugger import CollectionConfig
coll_config = CollectionConfig(
    name="weights",
    parameters={ "parameter": "value" })
```
The parameters can be one of the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So any parameter which accepts a list (such as save_steps, reductions, include_regex) needs to be given as a comma-separated string.

```
include_regex
save_histogram
reductions
save_raw_tensor
save_interval
save_steps
start_step
end_step
train.save_interval
train.save_steps
train.start_step
train.end_step
eval.save_interval
eval.save_steps
eval.start_step
eval.end_step
predict.save_interval
predict.save_steps
predict.start_step
predict.end_step
global.save_interval
global.save_steps
global.start_step
global.end_step
```

---

## SaveConfig
The SaveConfig class customizes the frequency of saving tensors.
The hook takes a SaveConfig object which is applied as default to all tensors included.
A collection can also have a SaveConfig object which is applied to the collection's tensors.
You can also choose to have a different configuration for when to save tensors based on the mode of the job.

-SaveConfig also allows you to save tensors when certain tensors become nan.
-This list of tensors to watch for is taken as a list of strings representing names of tensors.

This class is available in both the `smdebug` and `smdebug.{framework}` namespaces.
```python +import smdebug as smd save_config = smd.SaveConfig( mode_save_configs = None, save_interval = 100, @@ -179,25 +332,28 @@ save_config = smd.SaveConfig( save_steps = None, ) ``` -`mode_save_configs` (dict): Used for advanced cases; see details below.\ -`save_interval` (int): How often, in steps, to save tensors. Defaults to 100. \ -`start_step` (int): When to start saving tensors.\ -`end_step` (int): When to stop saving tensors, exclusive.\ -`save_steps` (list[int]): Specific steps to save tensors at. Union with all other parameters. - -For example, - -`SaveConfig()` will save at steps [0, 100, ...].\ -`SaveConfig(save_interval=1)` will save at steps [0, 1, ...]\ -`SaveConfig(save_interval=100, end_step=200)` will save at steps [0, 200].\ -`SaveConfig(save_interval=100, end_step=201)` will save at steps [0, 100, 200].\ -`SaveConfig(save_interval=100, start_step=150)` will save at steps [200, 300, ...].\ -`SaveConfig(save_steps=[3, 7])` will save at steps [3, 7]. - +##### Arguments +- `mode_save_configs` (dict): Used for advanced cases; see details below. +- `save_interval` (int): How often, in steps, to save tensors. Defaults to 500. A step is saved if `step % save_interval == 0` +- `start_step` (int): When to start saving tensors. +- `end_step` (int): When to stop saving tensors, exclusive. +- `save_steps` (list[int]): Specific steps to save tensors at. Union with save_interval. + +##### Examples + +- `SaveConfig()` will save at steps 0, 500, ... +- `SaveConfig(save_interval=1)` will save at steps 0, 1, ... +- `SaveConfig(save_interval=100, end_step=200)` will save at steps 0, 100 +- `SaveConfig(save_interval=100, end_step=201)` will save at steps 0, 100, 200 +- `SaveConfig(save_interval=100, start_step=150)` will save at steps 200, 300, ... +- `SaveConfig(save_steps=[3, 7])` will save at steps 0, 3, 7, 500, ... + +##### Specifying different configuration based on mode There is also a more advanced use case, where you specify a different SaveConfig for each mode. It is best understood through an example: ```python -SaveConfig(mode_save_configs={ +import smdebug as smd +smd.SaveConfig(mode_save_configs={ smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1), smd.modes.EVAL: smd.SaveConfigMode(save_interval=2), smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3), @@ -209,6 +365,9 @@ take the same four parameters (save_interval, start_step, end_step, save_steps) Any mode not specified will default to the default configuration. If a mode is provided but not all params are specified, we use the default values for non-specified parameters. +#### Configuration using SageMaker Python SDK +Refer [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk) + --- ## ReductionConfig @@ -220,12 +379,14 @@ and then saved. During analysis, these are available as reductions of the original tensor. Please note that using reduction config means that you will not have -the full tensor available during analysis, so this can restrict what you can do with the tensor saved. +the full tensor available during analysis, so this can restrict what you can do with the tensor saved. You can choose to also save the raw tensor along with the reductions if you so desire. + The hook takes a ReductionConfig object which is applied as default to all tensors included. 
A collection can also have its own ReductionConfig object which is applied to the tensors belonging to that collection. ```python +import smdebug as smd reduction_config = smd.ReductionConfig( reductions = None, abs_reductions = None, @@ -234,118 +395,54 @@ reduction_config = smd.ReductionConfig( save_raw_tensor = False, ) ``` -`reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod".\ -`abs_reductions` (list[str]): Same as reductions, except the reduction will be computed on the absolute value of the tensor.\ -`norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2".\ -`abs_norms` (list[str]): Same as norms, except the norm will be computed on the absolute value of the tensor.\ -`save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions. + +##### Arguments +- `reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod" +- `abs_reductions` (list[str]): Same as reductions, except the reduction will be computed on the absolute value of the tensor +- `norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2" +- `abs_norms` (list[str]): Same as norms, except the norm will be computed on the absolute value of the tensor +- `save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions For example, `ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])` -will return the standard deviation and variance, the mean of the absolute value, and the l1 norm. +will save the standard deviation and variance, the mean of the absolute value, and the l1 norm. +#### Configuration using SageMaker Python SDK +The reductions are passed as part of the "reductions" parameter to HookParameters or Collection Parameters. +Refer [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk) for more on that. ---- - -## Environment Variables - -#### `USE_SMDEBUG`: - -Setting this variable to 0 turns off the hook that is created by default. This can be used -if the user doesn't want to use SageMaker Debugger. - -#### `SMDEBUG_CONFIG_FILE_PATH`: - -Contains the path to the JSON file that describes the smdebug hook. - -At the minimum, the JSON config should contain the path where smdebug should output tensors. -Example: - -`{ "LocalPath": "/my/smdebug_hook/path" }` - -In SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON. -In non-SageMaker environment, SageMaker-Debugger is not used if this environment variable is not set and -a hook is not created manually. 
- -Sample JSON from which a hook can be created: -```json -{ - "LocalPath": "/my/smdebug_hook/path", - "HookParameters": { - "save_all": false, - "include_regex": "regex1,regex2", - "save_interval": "100", - "save_steps": "1,2,3,4", - "start_step": "1", - "end_step": "1000000", - "reductions": "min,max,mean" - }, - "CollectionConfigurations": [ - { - "CollectionName": "collection_obj_name1", - "CollectionParameters": { - "include_regex": "regexe5*", - "save_interval": 100, - "save_steps": "1,2,3", - "start_step": 1, - "reductions": "min" - } - }, - ] -} - +The parameter "reductions" can take a comma separated string consisting of the following values: +``` +min +max +median +mean +std +variance +sum +prod +l1 +l2 +abs_min +abs_max +abs_median +abs_mean +abs_std +abs_variance +abs_sum +abs_prod +abs_l1 +abs_l2 ``` -#### `TENSORBOARD_CONFIG_FILE_PATH`: - -Contains the path to the JSON file that specifies where TensorBoard artifacts need to -be placed. - -Sample JSON file: - -`{ "LocalPath": "/my/tensorboard/path" }` - -In SageMaker environment, the presence of this JSON is necessary to log any Tensorboard artifact. -By default, this path is set to point to a pre-defined location in SageMaker. - -tensorboard_dir can also be passed while creating the hook using the API or -in the JSON specified in SMDEBUG_CONFIG_FILE_PATH. For this, export_tensorboard should be set to True. -This option to set tensorboard_dir is available in both, SageMaker and non-SageMaker environments. - - -#### `CHECKPOINT_CONFIG_FILE_PATH`: - -Contains the path to the JSON file that specifies where training checkpoints need to -be placed. This is used in the context of spot training. - -Sample JSON file: - -`{ "LocalPath": "/my/checkpoint/path" }` - -In SageMaker environment, the presence of this JSON is necessary to save checkpoints. -By default, this path is set to point to a pre-defined location in SageMaker. - - -#### `SAGEMAKER_METRICS_DIRECTORY`: - -Contains the path to the directory where metrics will be recorded for consumption by SageMaker Metrics. -This is relevant only in SageMaker environment, where this variable points to a pre-defined location. - - -#### `TRAINING_END_DELAY_REFRESH`: - -During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This -directory contains collections, events, and index files. This environment variable -specifies how many seconds to wait before refreshing the index files to check if training has ended -and the tensor is available. By default value, this value is set to 1. - +--- -#### `INCOMPLETE_STEP_WAIT_WINDOW`: +## Frameworks -During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This -directory contains collections, events, and index files. A trial checks to see if a step -specified in the smdebug hook has been completed. This environment variable -specifies the maximum number of incomplete steps that the trial will wait for before marking -half of them as complete. 
Default: 1000 +For details on what's supported for different framework, go here: +* [TensorFlow](tensorflow.md) +* [PyTorch](pytorch.md) +* [MXNet](mxnet.md) +* [XGBoost](xgboost.md) diff --git a/docs/env_var.md b/docs/env_var.md new file mode 100644 index 000000000..c0abc5b70 --- /dev/null +++ b/docs/env_var.md @@ -0,0 +1,100 @@ + +## Environment Variables + +#### `USE_SMDEBUG`: + +When using official [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/) which support the [Zero Script Change experience](sagemaker.md#zero-script-change), SageMaker Debugger can be disabled by setting this variable to `0`. In such a case, the hook is disabled regardless of what configuration is given to the job through the SageMaker Python SDK. By default this is set to `1` signifying True. + +#### `SMDEBUG_CONFIG_FILE_PATH`: + +Contains the path to the JSON file that describes the smdebug hook. + +At the minimum, the JSON config should contain the path where smdebug should output tensors. +Example: + +`{ "LocalPath": "/my/smdebug_hook/path" }` + +In SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON. +In non-SageMaker environment, SageMaker-Debugger is not used if this environment variable is not set and +a hook is not created manually. + +Sample JSON from which a hook can be created: +```json +{ + "LocalPath": "/my/smdebug_hook/path", + "HookParameters": { + "save_all": false, + "include_regex": "regex1,regex2", + "save_interval": "100", + "save_steps": "1,2,3,4", + "start_step": "1", + "end_step": "1000000", + "reductions": "min,max,mean" + }, + "CollectionConfigurations": [ + { + "CollectionName": "collection_obj_name1", + "CollectionParameters": { + "include_regex": "regexe5*", + "save_interval": 100, + "save_steps": "1,2,3", + "start_step": 1, + "reductions": "min" + } + }, + ] +} + +``` + +#### `TENSORBOARD_CONFIG_FILE_PATH`: + +Contains the path to the JSON file that specifies where TensorBoard artifacts need to +be placed. + +Sample JSON file: + +`{ "LocalPath": "/my/tensorboard/path" }` + +In SageMaker environment, the presence of this JSON is necessary to log any Tensorboard artifact. +By default, this path is set to point to a pre-defined location in SageMaker. + +tensorboard_dir can also be passed while creating the hook using the API or +in the JSON specified in SMDEBUG_CONFIG_FILE_PATH. For this, export_tensorboard should be set to True. +This option to set tensorboard_dir is available in both, SageMaker and non-SageMaker environments. + + +#### `CHECKPOINT_CONFIG_FILE_PATH`: + +Contains the path to the JSON file that specifies where training checkpoints need to +be placed. This is used in the context of spot training. + +Sample JSON file: + +`{ "LocalPath": "/my/checkpoint/path" }` + +In SageMaker environment, the presence of this JSON is necessary to save checkpoints. +By default, this path is set to point to a pre-defined location in SageMaker. + + +#### `SAGEMAKER_METRICS_DIRECTORY`: + +Contains the path to the directory where metrics will be recorded for consumption by SageMaker Metrics. +This is relevant only in SageMaker environment, where this variable points to a pre-defined location. + + +#### `TRAINING_END_DELAY_REFRESH`: + +During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. 
This
directory contains collections, events, and index files. This environment variable
specifies how many seconds to wait before refreshing the index files, to check whether training has ended
and the tensor is available. By default, this value is set to 1.


#### `INCOMPLETE_STEP_WAIT_WINDOW`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. A trial checks to see if a step
specified in the smdebug hook has been completed. This environment variable
specifies the maximum number of incomplete steps that the trial will wait for before marking
half of them as complete. Default: 1000

diff --git a/docs/mxnet.md b/docs/mxnet.md
index 0dd77fbb9..fb42ef8c4 100644
--- a/docs/mxnet.md
+++ b/docs/mxnet.md
@@ -3,17 +3,18 @@
## Contents
- [Support](#support)
- [How to Use](#how-to-use)
-- [Example](#mxnet-example)
- [Example](#example)
- [Full API](#full-api)

---

## Support

-### Versions
- Zero Script Change experience where you need no modifications to your training script is supported in the official [SageMaker Framework Container for MXNet 1.6](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html), or the [AWS Deep Learning Container for MXNet 1.6](https://aws.amazon.com/machine-learning/containers/).
- This library itself supports the following versions when you use our API, which requires a few minimal changes to your training script: MXNet 1.4, 1.5, 1.6.
- Only Gluon models are supported
- When the Gluon model is hybridized, inputs and outputs of intermediate layers cannot be saved
- Parameter server based distributed training is not yet supported

---

@@ -39,10 +40,13 @@ See the [Common API](api.md) page for details on how to do this.

---

-## MXNet Example
## Example
```python
#######################################
# Creating a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.mxnet as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

import mxnet as mx
from mxnet import gluon

@@ -62,7 +66,7 @@ trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})
#######################################
# Here we register the block to smdebug
hook.register_block(net)
#######################################

batch_size = 100
mnist = mx.test_utils.get_mnist()

@@ -89,58 +93,9 @@ for i in range(args.epochs):
    metric.reset()
```

-## Full API
-See the [Common API](https://link.com) page for details about Collection, SaveConfig, and ReductionConfig.\
-See the [Analysis](https://link.com) page for details about analyzing a training job.
-
-## Hook
-```python
-__init__(
-    out_dir,
-    export_tensorboard = False,
-    tensorboard_dir = None,
-    dry_run = False,
-    reduction_config = None,
-    save_config = None,
-    include_regex = None,
-    include_collections= None,
-    save_all = False,
-    include_workers = "one",
-)
-```
-Initializes the hook. Pass this object as a callback to Keras' `model.fit(), model.evaluate(), model.evaluate()`.
-
-* `out_dir` (str): Where to write the recorded tensors and metadata.
-* `export_tensorboard` (bool): Whether to use TensorBoard logs.
-* `tensorboard_dir` (str): Where to save TensorBoard logs.
-* `dry_run` (bool): If true, don't write any files.
-* `reduction_config` (ReductionConfig object): See the Common API page.
-* `save_config` (SaveConfig object): See the Common API page.
-* `include_regex` (list[str]): List of additional regexes to save.
-* `include_collections` (list[str]): List of collections to save.
-* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
-* `include_workers` (str): Used for distributed training, can also be "all".
-
-```python
-register_block(
-    self,
-    block,
-)
-```
-Adds callbacks to the module for recording tensors.
-
-* `block` (mx.gluon.Block): The block to use.
---

-```python
-save_scalar(
-    self,
-    name,
-    value,
-    searchable = False,
-)
-```
-Call this method at any point in the training script to log a scalar value, such as accuracy.
## Full API
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.

-* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
-* `value` (float): Scalar value.
-* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.
See the [Analysis](analysis.md) page for details about analyzing a training job.

diff --git a/docs/pytorch.md b/docs/pytorch.md
index 58f474fda..e07bfe80b 100644
--- a/docs/pytorch.md
+++ b/docs/pytorch.md
@@ -8,7 +8,6 @@
- [Full API](#full-api)

## Support
-
### Versions
- Zero Script Change experience where you need no modifications to your training script is supported in the official [SageMaker Framework Container for PyTorch 1.3](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html), or the [AWS Deep Learning Container for PyTorch 1.3](https://aws.amazon.com/machine-learning/containers/).

@@ -44,8 +43,11 @@ See the [Common API](api.md) page for details on how to do this.

## Module Loss Example
```python
#######################################
# Creating a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -59,9 +61,11 @@ net = Model()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook and the loss
hook.register_module(net)
hook.register_loss(criterion)
#######################################

# Training loop as usual
for (inputs, labels) in trainloader:
@@ -76,8 +80,11 @@ for (inputs, labels) in trainloader:
## Functional Loss Example
```python
#######################################
# Creating a hook. Refer to the `API for Saving Tensors` page for more on this
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################

class Model(nn.Module):
    def __init__(self):
@@ -90,18 +97,22 @@ class Model(nn.Module)
net = Model()
optimizer = optim.Adam(net.parameters(), lr=args.lr)

#######################################
# Register the hook
hook.register_module(net)
#######################################

# Training loop, recording the loss at each iteration
for (inputs, labels) in trainloader:
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = F.cross_entropy(outputs, labels)

    #######################################
    # Manually record the loss
    hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
    #######################################

    loss.backward()
    optimizer.step()
```

---

## Full API
-See the [Common API](api.md) page for details about Collection, SaveConfig, and ReductionConfig.\
See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.
See the [Analysis](analysis.md) page for details about analyzing a training job.
-
-## Hook
-```python
-__init__(
-    out_dir,
-    export_tensorboard = False,
-    tensorboard_dir = None,
-    dry_run = False,
-    reduction_config = None,
-    save_config = None,
-    include_regex = None,
-    include_collections= None,
-    save_all = False,
-    include_workers = "one",
-)
-```
-Initializes the hook. Pass this object as a callback to Keras' `model.fit(), model.evaluate(), model.evaluate()`.
-
-* `out_dir` (str): Where to write the recorded tensors and metadata.
-* `export_tensorboard` (bool): Whether to use TensorBoard logs.
-* `tensorboard_dir` (str): Where to save TensorBoard logs.
-* `dry_run` (bool): If true, don't write any files.
-* `reduction_config` (ReductionConfig object): See the Common API page.
-* `save_config` (SaveConfig object): See the Common API page.
-* `include_regex` (list[str]): List of additional regexes to save.
-* `include_collections` (list[str]): List of collections to save.
-* `save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.
-* `include_workers` (str): Used for distributed training, can also be "all".
-
-```python
-register_module(
-    self,
-    module,
-)
-```
-Adds callbacks to the module for recording tensors.
-
-* `module` (torch.nn.Module): The module to use.
-
-```python
-save_scalar(
-    self,
-    name,
-    value,
-    searchable = False,
-)
-```
-Call this method at any point in the training script to log a scalar value, such as accuracy.
-
-* `name` (str): Name of the scalar. A prefix 'scalar/' will be added to it.
-* `value` (float): Scalar value.
-* `searchable` (bool): If True, the scalar value will be written to SageMaker Metrics.
diff --git a/docs/sagemaker.md b/docs/sagemaker.md index aa5bce967..94d93b837 100644 --- a/docs/sagemaker.md +++ b/docs/sagemaker.md @@ -6,15 +6,16 @@ - [Bring your own training container](#bring-your-own-training-container) - [Configuring SageMaker Debugger](#configuring-sagemaker-debugger) - [Saving data](#saving-data) - - [Saving first party collections](#saving-first-party-collections) + - [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage) - [Saving reductions for a custom collection](#saving-reductions-for-a-custom-collection) - [Enabling TensorBoard summaries](#enabling-tensorboard-summaries) - [Rules](#rules) - [Built In Rules](#built-in-rules) - [Custom Rules](#custom-rules) - - [Interactive Exploration](#interactive-exploration) - - [SageMaker Studio](#sagemaker-studio) - - [TensorBoard Visualization](#tensorboard-visualization) +- [Interactive Exploration](#interactive-exploration) +- [SageMaker Studio](#sagemaker-studio) +- [TensorBoard Visualization](#tensorboard-visualization) +- [Example Notebooks](#example-notebooks) ## Enabling SageMaker Debugger There are two ways in which you can enable SageMaker Debugger while training on SageMaker. @@ -65,8 +66,8 @@ Regardless of which of the two above ways you have enabled SageMaker Debugger, y SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are made available in the SageMaker Python SDK through the `DebuggerHookConfig` class. -##### Saving first party collections that we manage -Learn more about these first party collections [here](api.md). +#### Saving built-in collections that we manage +Learn more about these built in collections [here](api.md). ```python from sagemaker.debugger import DebuggerHookConfig, CollectionConfig @@ -103,7 +104,7 @@ sagemaker_estimator = sm.tensorflow.TensorFlow( sagemaker_estimator.fit() ``` -##### Saving reductions for a custom collection +#### Saving reductions for a custom collection You can define your collection of tensors. You can also choose to save certain reductions of tensors only instead of saving the full tensor. You may choose to do this to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions will be available for analysis. The raw tensor will not be saved. ```python @@ -134,7 +135,7 @@ sagemaker_estimator = sm.tensorflow.TensorFlow( sagemaker_estimator.fit() ``` -##### Enabling TensorBoard summaries +#### Enabling TensorBoard summaries SageMaker Debugger can automatically generate tensorboard scalar summaries, distributions and histograms for tensors saved. This can be enabled by passing a `TensorBoardOutputConfig` object when creating an Estimator as follows. @@ -175,6 +176,8 @@ sagemaker_estimator = sm.tensorflow.TensorFlow( sagemaker_estimator.fit() ``` +For more details, refer our [API page](api.md). + ### Rules Here are some examples on how to run Rules with your training jobs. @@ -194,14 +197,14 @@ Scope of Validity | Rules | | XGBoost algorithm | | -##### Running built-in SageMaker Rules +#### Running built-in SageMaker Rules You can run a SageMaker built-in Rule as follows using the `Rule.sagemaker` method. The first argument to this method is the base configuration that is associated with the Rule. We configure them as much as possible. 
You can take a look at the ruleconfigs that we populate for all built-in rules [here](https://github.com/awslabs/sagemaker-debugger-rulesconfig).
You can choose to customize these parameters using the other parameters.

-These rules are run on our pre-built Docker images which are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html)
These rules are run on our pre-built Docker images, which are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html). You are not charged for the instances when running SageMaker built-in rules.

A list of all our built-in rules is provided [below](#built-in-rules).

@@ -240,7 +243,7 @@ sagemaker_estimator.fit()
You can write your own rule custom-made for your application and provide it, so SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our page on [Programming Model for Analysis](analysis.md) describes the APIs that we provide to help you write your own rule. Please refer to [this example notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow_keras_custom_rule/tf-keras-custom-rule.ipynb) for a demonstration of creating your custom rule and running it on SageMaker.

-##### Running custom Rules
#### Running custom Rules
To run a custom rule, you have to provide a few additional parameters.
Key parameters to note are a file which has the implementation of your Rule class (`source`),
the name of the Rule class (`rule_to_invoke`), the type of instance to run the Rule job on (`instance_type`),

@@ -289,20 +292,26 @@ sagemaker_estimator = sm.tensorflow.TensorFlow(
sagemaker_estimator.fit()
```

-### Interactive Exploration
For more details, refer to our [Analysis page](analysis.md).

## Interactive Exploration
The `smdebug` SDK also allows you to perform interactive and real-time exploration of the data saved. You can choose to inspect the tensors saved, or visualize them through your custom plots. You can retrieve these tensors as numpy arrays, allowing you to use your favorite analysis libraries right in a SageMaker notebook instance. We have a couple of example notebooks demonstrating this.
- [Real-time analysis in a notebook during training](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mxnet_realtime_analysis/mxnet-realtime-analysis.ipynb)
- [Interactive tensor analysis in a notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mnist_tensor_analysis/mnist_tensor_analysis.ipynb)

-### SageMaker Studio
## SageMaker Studio
SageMaker Debugger is on by default for supported training jobs on the official SageMaker Framework containers (or AWS Deep Learning Containers) during SageMaker training jobs. In this default scenario, SageMaker Debugger takes the losses and metrics from your training job and publishes them to SageMaker Metrics, allowing you to track these metrics in SageMaker Studio.
-You can also see the status of Rules you have enabled for your training job right in the Studio.
You can also see the status of Rules you have enabled for your training job right in the Studio. [Here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-visualization.html) are screenshots of that experience.
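The status of these Rule evaluations can also be checked programmatically; a minimal sketch using the SageMaker Python SDK (assuming `sagemaker_estimator` from the examples above; `rule_job_summary` is the SDK method we believe applies here):

```python
# Print the evaluation status of each rule attached to the training job
for summary in sagemaker_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], "-", summary["RuleEvaluationStatus"])
```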
-### TensorBoard Visualization +## TensorBoard Visualization If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts will automatically be generated for the tensors saved. You can then point your TensorBoard instance to that S3 location and review the visualizations for the tensors saved. + +## Example Notebooks + +We have a bunch of [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) here demonstrating different aspects of SageMaker Debugger. diff --git a/docs/sagemaker_api.md b/docs/sagemaker_api.md deleted file mode 100644 index 87a7c5325..000000000 --- a/docs/sagemaker_api.md +++ /dev/null @@ -1,3 +0,0 @@ -## SageMaker Debugger API in the SageMaker Python SDK - -TODO, document all the parameters accepted in HookParameters and CollectionParameters through the SageMaker Python SDK diff --git a/docs/tensorflow.md b/docs/tensorflow.md index f5f0e940c..869f068b5 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -3,11 +3,13 @@ ## Contents - [Support](#support) - [How to Use](#how-to-use) -- [Keras Example](#keras-example) -- [MonitoredSession Example](#monitored-session-example) -- [Estimator Example](#estimator-example) +- [tf.keras Example](#tfkeras) +- [MonitoredSession Example](#monitoredsession) +- [Estimator Example](#estimator) - [Full API](#full-api) + --- + ## Support ### Versions @@ -52,6 +54,10 @@ See the [Common API](api.md) page for details on how to do this. --- +## Examples + +We have three Hooks for different interfaces of TensorFlow. The following is needed to enable SageMaker Debugger on non Zero Script Change supported containers. Refer [SageMaker training](sagemaker.md) on how to use the Zero Script Change experience. + ## tf.keras ### Example ```python @@ -68,46 +74,6 @@ model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) model.evaluate(x_test, y_test, callbacks=[hook]) ``` -### KerasHook -In SageMaker, call `smd.KerasHook.create_from_json_file()`. - -In a non-SageMaker environment, use the following constructor. -```python -__init__( - out_dir, - export_tensorboard = False, - tensorboard_dir = None, - dry_run = False, - reduction_config = None, - save_config = None, - include_regex = None, - include_collections = None, - save_all = False, -) -``` -Initializes the hook. Pass this object as a callback to Keras' `model.fit(), model.evaluate(), model.evaluate()`. - -`out_dir` (str): Where to write the recorded tensors and metadata.\ -`export_tensorboard` (bool): Whether to use TensorBoard logs.\ -`tensorboard_dir` (str): Where to save TensorBoard logs.\ -`dry_run` (bool): If true, don't write any files.\ -`reduction_config` (ReductionConfig object): See the Common API page.\ -`save_config` (SaveConfig object): See the Common API page.\ -`include_regex` (list[str]): List of additional regexes to save.\ -`include_collections` (list[str]): List of collections to save.\ -`save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow. - - -```python -wrap_optimizer( - self, - optimizer: Union[tf.train.Optimizer, tf.keras.Optimizer] -) -``` -Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process. - -`optimizer` (Union[tf.train.Optimizer, tf.keras.Optimizer]): The optimizer. 
- --- ## MonitoredSession @@ -128,47 +94,6 @@ sess = tf.train.MonitoredSession(hooks=[hook]) sess.run([loss, ...]) ``` -### SessionHook -In SageMaker, call `smd.SessionHook.create_from_json_file()`. - -If in a non-SageMaker environment, use the following constructor. - -```python -__init__( - out_dir, - export_tensorboard = False, - tensorboard_dir = None, - dry_run = False, - reduction_config = None, - save_config = None, - include_regex = None, - include_collections= None, - save_all = False, - include_workers = "one" -) -``` - -Pass this object as a hook to tf.train.MonitoredSession's `run()` method. - -`out_dir` (str): Where to write the recorded tensors and metadata.\ -`export_tensorboard` (bool): Whether to use TensorBoard logs.\ -`tensorboard_dir` (str): Where to save TensorBoard logs.\ -`dry_run` (bool): If true, don't write any files.\ -`reduction_config` (ReductionConfig object): See the Common API page.\ -`save_config` (SaveConfig object): See the Common API page.\ -`include_regex` (list[str]): List of additional regexes to save.\ -`include_collections` (list[str]): List of collections to save.\ -`save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.\ -`include_workers` (str): Used for distributed training, can also be "all". - -```python -wrap_optimizer( - self, - optimizer: tf.train.Optimizer -) -``` -Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process. - --- ## Estimator @@ -188,47 +113,8 @@ hook.set_mode(mode=smd.modes.EVAL) estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook]) ``` -### EstimatorHook -In SageMaker, call `smd.EstimatorHook.create_from_json_file()`. - -If in a non-SageMaker environment, use the following constructor. - -```python -__init__( - out_dir, - export_tensorboard = False, - tensorboard_dir = None, - dry_run = False, - reduction_config = None, - save_config = None, - include_regex = None, - include_collections= None, - save_all = False, - include_workers = "one" -) -``` - -Pass this object as a hook to tf.train.MonitoredSession's `run()` method. - -`out_dir` (str): Where to write the recorded tensors and metadata.\ -`export_tensorboard` (bool): Whether to use TensorBoard logs.\ -`tensorboard_dir` (str): Where to save TensorBoard logs.\ -`dry_run` (bool): If true, don't write any files.\ -`reduction_config` (ReductionConfig object): See the Common API page.\ -`save_config` (SaveConfig object): See the Common API page.\ -`include_regex` (list[str]): List of additional regexes to save.\ -`include_collections` (list[str]): List of collections to save.\ -`save_all` (bool): Saves all tensors and collections. May be memory-intensive and slow.\ -`include_workers` (str): Used for distributed training, can also be "all". - -```python -wrap_optimizer( - self, - optimizer: tf.train.Optimizer -) -``` -Adds functionality to the optimizer object to log gradients. Returns the original optimizer and doesn't change the optimization process. - --- -See the [Common API](api.md) page for details about Collection, SaveConfig, and ReductionConfig.\ + +## Full API +See the [API for saving tensors](api.md) page for details about the Hooks, Collection, SaveConfig, and ReductionConfig. See the [Analysis](analysis.md) page for details about analyzing a training job.