# Detect Stalled Training and Stop Training Job Using SageMaker Debugger Rule
 
This notebook shows you how to use the `StalledTrainingRule` built-in rule. The rule can stop your training job when it detects that the job has been inactive for a specified period of time. This helps you monitor training job status and avoid paying for idle resources.

## How `StalledTrainingRule` Works

Amazon SageMaker Debugger captures tensors that you want to watch from training jobs on [AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) or your local machine. If you use one of the Debugger-integrated Deep Learning Containers, you don't need to make any changes to your training script to use the built-in rules. For information about Debugger-supported SageMaker frameworks and versions, see [Debugger-supported framework versions for zero script change](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change).

The Debugger `StalledTrainingRule` watches tensor updates from your training job. If no new tensors are written to the default S3 URI within the threshold period, the rule triggers the `StopTrainingJob` API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger `StalledTrainingRule` to watch the `losses` pre-built tensor collection.
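
As a rough illustration of the stall check described above, before moving on to the estimator setup: the decision comes down to comparing the time since the last tensor update against the threshold. The following is a minimal sketch of that comparison, not Debugger's actual implementation. The function `is_stalled` and its arguments are hypothetical names used only for illustration.

```python
def is_stalled(last_update_ts, now_ts, threshold_seconds):
    """Return True when no tensor update has been seen for longer than the threshold.

    Hypothetical helper: it only illustrates the comparison the rule is
    described as making, using plain timestamps in seconds.
    """
    return (now_ts - last_update_ts) > threshold_seconds


# Last tensor written at t=1000 s, with a threshold of 120 seconds
print(is_stalled(1000, 1060, 120))  # 60 s without updates
print(is_stalled(1000, 1200, 120))  # 200 s without updates
```

With a threshold of 120 seconds, the first call reports no stall and the second reports a stall.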

### Import SageMaker Python SDK

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow
print(sagemaker.__version__)

### Import SageMaker Debugger classes for rule configuration

In [None]:
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

### Create a unique training job prefix
You must specify a unique prefix so that `StalledTrainingRule` can identify the exact training job to monitor and stop when it detects a stall.
If multiple training jobs share the same prefix, the rule may react to other training jobs. If the rule cannot find the exact training job name for the provided prefix, it falls back to safe mode and does not stop the training job.
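
To see why the prefix should be unique, consider a simple prefix match over job names. This sketch is an assumption about the general idea of prefix matching, not the rule's internal lookup; `jobs_matching_prefix` and the job names below are made up for illustration.

```python
def jobs_matching_prefix(job_names, prefix):
    """Return every job name that starts with the given prefix."""
    return [name for name in job_names if name.startswith(prefix)]


# Made-up job names for illustration only
job_names = [
    "smdebug-stalled-demo-1600000000-2020-09-13-12-26-40-123",
    "smdebug-stalled-demo-1600000500-2020-09-13-12-35-00-456",
]

# A generic prefix is ambiguous: it matches both jobs
print(jobs_matching_prefix(job_names, "smdebug-stalled-demo-"))

# A timestamped prefix narrows the match to a single job
print(jobs_matching_prefix(job_names, "smdebug-stalled-demo-1600000000"))
```

Appending a timestamp to the prefix, as the next cell does, keeps the match down to one job.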

The following code cell includes:
* a code line to create a unique `base_job_name_prefix`
* a stalled training job rule configuration object
* a SageMaker TensorFlow estimator configuration with the Debugger `rules` parameter to run the built-in rule

In [None]:
# Append current time to your training job name to generate a unique base_job_name_prefix
import time
base_job_name_prefix = 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure a StalledTrainingRule rule parameter object
stalled_training_job_rule = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "120", 
                "stop_training_on_fire": "True",
                "training_job_name_prefix": base_job_name_prefix
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]

# Configure a SageMaker TensorFlow estimator
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    train_instance_count=1,
    train_instance_type='ml.m5.4xlarge',
    entry_point='src/simple_stalled_training.py', # This sample script forces the training job to sleep for 10 minutes
    framework_version='1.15.0',
    py_version='py3',
    train_max_run=3600,
    ## Debugger-specific parameter
    rules = stalled_training_job_rule
)

In [None]:
estimator.fit(wait=False)

## Monitoring

Once you execute the `estimator.fit()` API, SageMaker initiates a training job in the background, and Debugger initiates a `StalledTrainingRule` rule evaluation job in parallel.
Because the training script has a couple of lines of code at the end that force the training job to stall for 10 minutes, the `RuleEvaluationStatus` for `StalledTrainingRule` changes to `IssuesFound` after about 2 minutes (the 120-second threshold) and triggers the `StopTrainingJob` API. The following code cells track the `TrainingJobStatus` until the `SecondaryStatus` returns `Stopped` or `Completed`.
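
When a job runs more than one rule, looking up the rule summary by name is safer than relying on its position in the list. The following sketch assumes the summaries are a list of dictionaries with the `RuleConfigurationName` and `RuleEvaluationStatus` keys used later in this notebook; `rule_status` is a hypothetical helper and the sample data is made up.

```python
def rule_status(summaries, rule_name):
    """Return the evaluation status of the named rule, or None if it is not listed."""
    for summary in summaries:
        if summary["RuleConfigurationName"] == rule_name:
            return summary["RuleEvaluationStatus"]
    return None


# Made-up summary in the same shape as the rule job summary entries
sample_summaries = [
    {"RuleConfigurationName": "StalledTrainingRule", "RuleEvaluationStatus": "IssuesFound"},
]
print(rule_status(sample_summaries, "StalledTrainingRule"))
```

In this notebook you could pass `estimator.latest_training_job.rule_job_summary()` as `summaries` once the training job has started.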

### Print the training job name

The following cell prints the training job name and retrieves the description of the training job running in the background.

In [None]:
job_name = estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

### Output the current job status

The following cell tracks the status of the training job until the `SecondaryStatus` changes to `Stopped` or `Completed`. While training, Debugger collects output tensors from the training job and monitors the training job with the rules.

In [None]:
import time

if description['TrainingJobStatus'] != 'Completed':
    while description['SecondaryStatus'] not in {'Stopped', 'Completed'}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status = description['TrainingJobStatus']
        secondary_status = description['SecondaryStatus']
        # Fetch the rule summary once per iteration instead of once per field
        rule_summary = estimator.latest_training_job.rule_job_summary()[0]
        print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}] | {} Rule Evaluation Status: {}'
            .format(primary_status, secondary_status,
                rule_summary["RuleConfigurationName"],
                rule_summary["RuleEvaluationStatus"]
            )
        )
        time.sleep(15)
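
The loop above exits only on `Stopped` or `Completed`, so a job that fails would keep polling. A more defensive variant, sketched below, also treats `Failed` as terminal and gives up after a timeout. This is an illustrative sketch: `wait_for_terminal_status`, `get_status`, and the canned statuses are hypothetical, and the status getter is passed in so the example runs without an AWS session.

```python
import time

# Assumed terminal values of TrainingJobStatus
TERMINAL_STATUSES = {"Completed", "Stopped", "Failed"}


def wait_for_terminal_status(get_status, poll_seconds=15, timeout_seconds=3600,
                             sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Job did not reach a terminal status in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned statuses and a no-op sleep so it finishes immediately
canned = iter(["InProgress", "InProgress", "Stopped"])
print(wait_for_terminal_status(lambda: next(canned), sleep=lambda seconds: None))
```

In this notebook, `get_status` could be `lambda: client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']`.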

## Conclusion

This notebook showed how you can use the Debugger `StalledTrainingRule` built-in rule for your training job to take action on rule evaluation status changes. To find more information about Debugger, see [Amazon SageMaker Debugger Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) and the [smdebug GitHub documentation](https://github.com/awslabs/sagemaker-debugger).