## Task 2: Use SageMaker Debugger

In this lab, you use Amazon SageMaker Debugger to detect, analyze, and receive alerts about bottlenecks, resource utilization rates, and other issues that occur during training jobs.

### Task 2.1: Set up the environment

Install packages and dependencies.

In [2]:
%%capture

#install updates
!apt-get update && apt-get install -y build-essential

**Note:** Packages can take as long as 5 minutes to install.

In [3]:
%%capture

#install-dependencies

%pip install pytest-cov
%pip install pytest-filter-subpackage
%pip install -U sagemaker
%pip install -U smdebug
%pip install -U shap
%pip install protobuf==3.19.4

In [4]:
#import libraries and set variable values

import sys
import boto3
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sagemaker
import sagemaker_datawrangler 
import shap

from mpl_toolkits.axes_grid1 import host_subplot

from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    FrameworkProfile,
    ProfilerConfig,
    ProfilerRule,
    Rule,
    rule_configs,
    TensorBoardOutputConfig
)

from sagemaker.estimator import Estimator
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.s3 import S3Uploader
from sagemaker.xgboost.estimator import XGBoost
from smdebug.core import modes
from smdebug.trials import create_trial

base_job_name = "lab-7-smdebugger-job"
bucket = sagemaker.Session().default_bucket()
bucket_path = "s3://{}".format(bucket)
prefix = "lab-7-smdebugger"
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
save_interval = 5

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
[2024-11-06 10:04:50.681 sagemaker-data-scienc-ml-t3-medium-26785b5a6dcec944732a404766e8:18 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None


Next, import the dataset.

In [5]:
#import-dataset
lab_test_data = pd.read_csv('adult_data_processed.csv')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 20)
lab_test_data.head()

   income  age  workclass  education  education_num  marital_status  \
0       0   39          1          2              2               1   
1       0   50          2          2              2               0   
2       0   38          0          0              0               2   
3       0   53          0          3              6               0   
4       0   28          0          2              2               0   

   occupation  relationship  race  sex  capital_gain  capital_loss  \
0           2             1     0    0          2174             0   
1           2             0     0    0             0             0   
2           0             1     0    0             0             0   
3           0             0     1    0             0             0   
4           3             4     1    1             0             0   

   hours_per_week  
0              40  
1              13  
2              40  
3              40  
4              40  

Split the dataset into training (70 percent), validation (20 percent), and test (10 percent) datasets. The training and validation datasets are used to create the model in this lab. You will not use the test dataset in this lab.

In [6]:
#split-dataset
train_data, validation_data, test_data = np.split(
    lab_test_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(lab_test_data)), int(0.9 * len(lab_test_data))],
)

train_data.to_csv('train_data.csv', index=False, header=False)
validation_data.to_csv('validation_data.csv', index=False, header=False)

feature_names = list(train_data.columns)[1:]
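
The cut points in the split can be checked on a small stand-in dataset. The following is a minimal sketch, not part of the lab notebook: the helper `split_by_cut_points` and the 100-row `toy_data` frame are hypothetical, and the sketch slices with `iloc` rather than `np.split`, using the same boundaries (70 percent and 90 percent of the rows).

```python
import pandas as pd


def split_by_cut_points(df, train_end_frac=0.7, validation_end_frac=0.9, seed=1729):
    """Shuffle df, then cut it at the two fractional boundaries."""
    shuffled = df.sample(frac=1, random_state=seed)
    train_end = int(train_end_frac * len(shuffled))
    validation_end = int(validation_end_frac * len(shuffled))
    return (
        shuffled.iloc[:train_end],                # first 70 percent
        shuffled.iloc[train_end:validation_end],  # next 20 percent
        shuffled.iloc[validation_end:],           # remaining 10 percent
    )


# Hypothetical 100-row stand-in for the lab dataset.
toy_data = pd.DataFrame({"income": [0, 1] * 50, "age": range(100)})
toy_train, toy_validation, toy_test = split_by_cut_points(toy_data)
print(len(toy_train), len(toy_validation), len(toy_test))
```

With 100 rows, the three parts hold 70, 20, and 10 rows. The second boundary is 0.9 rather than 0.2 because each cut point is measured from the start of the shuffled data.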

Now, upload the dataset to Amazon Simple Storage Service (Amazon S3).

In [7]:
#upload-dataset
sagemaker_session = sagemaker.Session()

train_path = S3Uploader.upload('train_data.csv', 's3://{}/{}'.format(bucket, prefix))
validation_path = S3Uploader.upload('validation_data.csv', 's3://{}/{}'.format(bucket, prefix))

train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(validation_path, content_type='text/csv')

data_inputs = {
    'train': train_input,
    'validation': validation_input
}

### Task 2.2: Modify the training script to enable SageMaker Debugger

You must modify the training script that you used in the previous lab to save tensors to a specified output S3 bucket, specify which tensors to save, and register debug hooks.

To train the model, you will need to configure the following:
- **Debugger Hook Parameters** to adjust save intervals of the output tensors in the training phases
- **Debugger Rule Object** to evaluate the saved output tensors

For **Debugger Hook Parameters**, you configure the **metrics**, **feature_importance**, **full_shap**, and **average_shap** built-in tensor collections to be captured during training. These are configured in the **collection_configs** for the **debugger_hook_config**. The **full_shap** and **average_shap** built-in tensor collections use SHAP (SHapley Additive exPlanations) values. SHAP explains a machine learning (ML) prediction by treating each feature value of a data instance as a player in a game where the prediction is the payout. It then indicates how to distribute the payout fairly among the features. Refer to [SHAP Baselines for Explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html) for more information about SHAP.
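
The game analogy can be made concrete with a tiny example. The following is a minimal sketch of exact Shapley values for two features, assuming a made-up payout table. The function `shapley_values` and the numbers in `toy_payout` are hypothetical and are not how SageMaker Debugger or the `shap` library computes values for a real model.

```python
from itertools import permutations


def shapley_values(players, payout):
    """Average each player's marginal contribution over all join orders."""
    totals = {player: 0.0 for player in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            # Marginal contribution of this player when joining the coalition.
            totals[player] += payout[with_player] - payout[coalition]
            coalition = with_player
    return {player: total / len(orders) for player, total in totals.items()}


# Hypothetical payouts (model outputs) for every subset of two features.
toy_payout = {
    frozenset(): 0.0,
    frozenset({"age"}): 10.0,
    frozenset({"hours_per_week"}): 20.0,
    frozenset({"age", "hours_per_week"}): 50.0,
}
toy_attribution = shapley_values(["age", "hours_per_week"], toy_payout)
print(toy_attribution)
```

In this toy game, `age` receives 20 and `hours_per_week` receives 30. The two shares add up to the full payout of 50, which is the sense in which the distribution is fair.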

For **Debugger Rule Object**, you configure the following **rule_configs** in **rules**:
- [Profiler Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-report.html#debugger-profiling-report): Runs rules for system bottleneck detections and autogenerates a profiling report.
- [XGBoost Training Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-report-xgboost.html): Runs a comprehensive XGBoost Report.
- [Overfit](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#overfit): Detects if your model is being overfit to the training data by comparing the validation and training losses.
- [Overtraining](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#overtraining): Detects if a model is being overtrained. After several training iterations on a well-behaved model (both training and validation loss decrease), the model approaches a minimum of the loss function and stops improving. If the model continues training, validation loss might start increasing because the model starts overfitting.
- [Loss Not Decreasing](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#loss-not-decreasing): Detects when the loss is not decreasing in value at an adequate rate. These losses must be scalars. 
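
To build intuition for what the loss-based rules look for, the following is a simplified sketch of two such checks. It is an illustration only and not the implementation of the built-in rules. The function names, the default thresholds, and the sample loss values are all hypothetical.

```python
def loss_stalled(losses, num_steps, min_decrease_percent=0.1):
    """True if the latest loss dropped by less than min_decrease_percent over num_steps."""
    if len(losses) <= num_steps:
        return False  # not enough history to compare yet
    earlier, latest = losses[-1 - num_steps], losses[-1]
    if earlier == 0:
        return True  # loss is already zero and cannot decrease further
    decrease_percent = 100.0 * (earlier - latest) / abs(earlier)
    return decrease_percent < min_decrease_percent


def looks_overfit(train_loss, validation_loss, ratio_threshold=0.1):
    """True if validation loss exceeds training loss by more than ratio_threshold."""
    return (validation_loss - train_loss) / train_loss > ratio_threshold


# Hypothetical scalar losses recorded at successive saved steps.
improving_losses = [0.60, 0.50, 0.42, 0.36, 0.31]
flat_losses = [0.60, 0.31, 0.31, 0.31, 0.31]

print(loss_stalled(improving_losses, num_steps=2))  # loss is still falling
print(loss_stalled(flat_losses, num_steps=2))       # loss has plateaued
print(looks_overfit(train_loss=0.20, validation_loss=0.35))
```

Both checks compare plain numbers, which is why the losses that the rules evaluate must be scalars.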

In [8]:
#enable-debugger
# Retrieve the container image
container = sagemaker.image_uris.retrieve(
    region=boto3.Session().region_name, 
    framework='xgboost', 
    version='1.5-1'
)
# Set up the estimator
xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    base_job_name=base_job_name,
    instance_count=1, 
    instance_type='ml.m5.4xlarge',
    #Set the hyperparameters
    hyperparameters= {
        "max_depth": "5",
        "eta": "0.2",
        "gamma": "4",
        "min_child_weight": "6",
        "subsample": "0.7",
        "objective": "binary:logistic",
        "num_round": "300",
    },
    sagemaker_session=sagemaker_session,
    max_run=1800,
    #Set the Debugger Hook Config
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=bucket_path,  # Required
        collection_configs=[  # For each of these, a new processing job will be run later in the lab
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="feature_importance", parameters={"save_interval": str(save_interval)},),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    #Set the Debugger Profiler Configuration
    profiler_config = ProfilerConfig(
        system_monitor_interval_millis=500,
        framework_profile_params=FrameworkProfile()

    ),
    #Configure the Debugger Rule Object
    rules = [
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
        Rule.sagemaker(rule_configs.create_xgboost_report()),  
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.overtraining()),
        Rule.sagemaker(rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            }
        )
    ]
)
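
The effect of `save_interval` on the amount of saved data can be estimated with simple arithmetic. The following sketch assumes that tensors are saved whenever the step number is divisible by the save interval and that each boosting round counts as one step. Both are assumptions made for illustration, and `saved_steps` is a hypothetical helper.

```python
def saved_steps(num_rounds, save_interval):
    """Steps whose tensors are saved, assuming a save when step % save_interval == 0."""
    return [step for step in range(num_rounds) if step % save_interval == 0]


# Values taken from the estimator configuration: num_round=300, save_interval=5.
example_steps = saved_steps(num_rounds=300, save_interval=5)
print(len(example_steps), example_steps[:3], example_steps[-1])
```

Under these assumptions, 60 of the 300 rounds are saved, at steps 0, 5, 10, and so on up to 295. A smaller interval gives the rules more data points to evaluate but writes more data to Amazon S3.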

### Task 2.3: Run the Debugger-enabled training job

Now, train the XGBoost model using the Debugger-enabled script. Training will take approximately 5–10 minutes to run. You can continue with the next task while the training job is running and monitor the job progress using SageMaker Debugger.

In [9]:
#train-model
xgb.fit(
    inputs = data_inputs
)

2024-11-06 10:11:44 Starting - Starting the training job...
2024-11-06 10:12:18 Downloading - Downloading input data
CreateXgboostReport: InProgress
Overfit: InProgress
Overtraining: InProgress
LossNotDecreasing: InProgress
ProfilerReport: InProgress
...
2024-11-06 10:12:47 Downloading - Downloading the training image...
  from pandas import MultiIndex, Int64Index
[2024-11-06 10:13:40.063 ip-10-2-113-73.us-west-2.compute.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-11-06 10:13:40.085 ip-10-2-113-73.us-west-2.compute.internal:7 INFO profiler_config_parser.py:111] Using config at /opt/ml/input/config/profilerconfig.json.
[2024-11-06:10:13:40:INFO] Imported framework sagemaker_xgboost_container.training
[2024-11-06:10:13:40:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2024-11-06:10:13:40:INFO] No GPUs detected (normal if no gpus installed)
...

### Task 2.4: Monitor the training job status

In SageMaker Studio, you can review the trial components, including all the SageMaker Debugger jobs that you started. In this lab, you created an **XGBoost Report** job, an **Overfit** job, an **Overtraining** job, and a **Loss Not Decreasing** job. Explore these in SageMaker Studio.

The next step will bring you to a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_7.ipynb** tab to the side or choose the **lab_7.ipynb** tab, and then from the toolbar, select **File** and **New View for Notebook**. You can now have the directions visible as you explore the Debugger jobs.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions.

1. Choose the **SageMaker Home** icon.
2. Choose **Experiments**.

SageMaker Studio opens the **Experiments** tab.

3. Choose **Unassigned runs**.
4. From the list, select the **Name** of the job, which has the Type **SageMakerTrainingJob**.

Details of the Training job are displayed.

5. On the left side, select the **Debugger** tab.

SageMaker Debugger provides the status of your training job, which you can monitor while the model training is running. When complete, you will see the status of any specified training issues.

The analysis is complete when all the **Status** lines display **No Issues Found** or **Issues Found**. The Debugger rules can take as long as 9 minutes to complete.

If issues are found, it means that there are problems you might want to fix in your model. Are there any issues found for the jobs? 

In this lab, you do not resolve any issues found. However, if you want to resolve issues found, you can address them with a combination of processing the dataset and retraining the model with adjusted hyperparameters.

6. When the analysis is complete, return to the notebook tab labeled **lab_7.ipynb**.
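
The completion condition described in step 5 can also be expressed in code. The following is a minimal sketch that works on a list of rule status dictionaries shaped like the entries of `DebugRuleEvaluationStatuses` in the training job description. The helper names and the sample statuses in `example_summaries` are hypothetical.

```python
FINISHED_STATUSES = {"NoIssuesFound", "IssuesFound"}


def analysis_complete(rule_summaries):
    """True when every rule has reported either NoIssuesFound or IssuesFound."""
    return all(
        summary["RuleEvaluationStatus"] in FINISHED_STATUSES
        for summary in rule_summaries
    )


def rules_with_issues(rule_summaries):
    """Names of the rules that reported IssuesFound."""
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary["RuleEvaluationStatus"] == "IssuesFound"
    ]


# Hypothetical statuses; real values come from the training job description.
example_summaries = [
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "Overtraining", "RuleEvaluationStatus": "InProgress"},
]
print(analysis_complete(example_summaries))
print(rules_with_issues(example_summaries))
```

In the notebook, a list of this shape should be available from `xgb.latest_training_job.describe()["DebugRuleEvaluationStatuses"]` after the training job has started.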

### Task 2.5: Perform post-training analysis

With SageMaker Debugger, you can create processing job logs in Amazon CloudWatch that you can use to configure custom alarms. Here, you print the location where the logs are stored for each rule evaluated.

In [10]:
#print-urls
def _get_rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    """Helper function to get the rule job name with correct casing"""
    return "{}-{}-{}".format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )


def _get_cw_url_for_rule_job(rule_job_name, region):
    return "https://{}.console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix".format(
        region, region, rule_job_name
    )


def get_rule_jobs_cw_urls(xgb):
    region = boto3.Session().region_name
    training_job = xgb.latest_training_job
    training_job_name = training_job.describe()["TrainingJobName"]
    rule_eval_statuses = training_job.describe()["DebugRuleEvaluationStatuses"]

    result = {}
    for status in rule_eval_statuses:
        if status.get("RuleEvaluationJobArn", None) is not None:
            rule_job_name = _get_rule_job_name(
                training_job_name, status["RuleConfigurationName"], status["RuleEvaluationJobArn"]
            )
            result[status["RuleConfigurationName"]] = _get_cw_url_for_rule_job(
                rule_job_name, region
            )
    return result


get_rule_jobs_cw_urls(xgb)

{'CreateXgboostReport': 'https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=lab-7-smdebugger-job-2024--CreateXgboostReport-f630f840;streamFilter=typeLogStreamPrefix',
 'Overfit': 'https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=lab-7-smdebugger-job-2024--Overfit-f9983b8d;streamFilter=typeLogStreamPrefix',
 'Overtraining': 'https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=lab-7-smdebugger-job-2024--Overtraining-8e916c5a;streamFilter=typeLogStreamPrefix',
 'LossNotDecreasing': 'https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=lab-7-smdebugger-job-2024--LossNotDecreasing-4dbbae84;streamFilter=typeLogStreamPrefix'}
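
The log stream prefix in each URL is built by truncating the training job name and the rule name to 26 characters each and appending the last 8 characters of the rule job ARN. The following sketch reproduces that composition on its own. The function `rule_job_name` restates the notebook helper, and the account ID and the middle of `example_arn` are hypothetical placeholders; only the final 8 characters matter here.

```python
def rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    """Compose the rule job name the same way the notebook helper does."""
    return "{}-{}-{}".format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )


# Hypothetical ARN; only its last 8 characters are used.
example_arn = "arn:aws:sagemaker:us-west-2:111122223333:processing-job/example-f630f840"
example_name = rule_job_name(
    "lab-7-smdebugger-job-2024-11-06-10-11-42-486",
    "CreateXgboostReport",
    example_arn,
)
print(example_name)
```

The first 26 characters of the training job name already end in a hyphen, which explains the double hyphen in the prefixes shown in the output above.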

SageMaker Debugger groups the tensors saved from your training job into collections. These include default collections such as weights, gradients, biases, and losses, in addition to any custom collections that you define. Generate a list of names and values for the saved tensors to determine which tensors to plot for further analysis.

In [None]:
#retrieve-names
trial = create_trial(xgb.latest_job_debugger_artifacts_path())
trial.tensor_names()

[2024-11-06 10:16:56.593 sagemaker-data-scienc-ml-t3-medium-26785b5a6dcec944732a404766e8:18 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-west-2-440570968020/lab-7-smdebugger-job-2024-11-06-10-11-42-486/debug-output
[2024-11-06 10:16:59.667 sagemaker-data-scienc-ml-t3-medium-26785b5a6dcec944732a404766e8:18 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.


In [None]:
#retrieve-values
trial.tensor("average_shap/f1").values()
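
To plot a tensor over time, it can help to pair each saved step with its value. The following is a small sketch of a hypothetical helper, `tensor_series`, that relies only on the `tensor()`, `steps()`, and `value()` calls of the smdebug trial API. It is not part of the lab notebook.

```python
def tensor_series(trial, tensor_name):
    """Return (steps, values) for one saved tensor, in step order."""
    tensor = trial.tensor(tensor_name)
    steps = list(tensor.steps())
    values = [tensor.value(step) for step in steps]
    return steps, values
```

In the notebook, you could call it as `tensor_series(trial, "average_shap/f1")` after the trial has been created, and then pass the two lists to a plotting function.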

In [None]:
#plot-tensors
shap_values = trial.tensor("full_shap/f10").value(trial.last_complete_step)
shap_no_base = shap_values[:, :-1]
shap_base_value = shap_values[0, -1]
shap.summary_plot(shap_no_base, plot_type="bar", feature_names=feature_names)

### Task 2.6: Access the SageMaker Debugger insights dashboard

1. Return to the **Experiments** tab.
2. On the left, select **Debugger**.
3. In the **Debugger insights** section, select the available **lab-7-smdebugger-job** from the **Training job name** list.

SageMaker Studio opens a new **Debugger insights** tab for the job and begins loading data.

SageMaker Debugger provides an overview of your model training performance on Amazon Elastic Compute Cloud (Amazon EC2) instances. Explore SageMaker Debugger in SageMaker Studio and examine details contained in the reports.

The **Systems Metrics** tab includes the following sections:

- **Resource utilization summary**
- **CPU utilization summary**
- **GPU Utilization summary**

The **Rules** tab includes the following **Insights**: 

- **BatchSize**
- **LowGPUUtilization**
- **CPUBottleneck**
- **GPUMemoryIncrease**
- **StepOutlier**
- **MaxInitializationTime**
- **IOBottleneck**
- **LoadBalancing**

Data will populate in the charts and tables if any issues were found.

4. You can download a Debugger report by choosing the <span style="background-color:#1a1b22; font-size:90%; color:#57c4f8; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#57c4f8; border-width:thin; border-style:solid; border-radius:2px; margin-right:5px; white-space:nowrap">Download report</span> near the top of the **Debugger insights** tab.

### Conclusion

Congratulations! You have used SageMaker Debugger to analyze, detect, and create alerts for potential issues in model training. You can now use the information generated from your reports and alerts to provide insight into efficiently training and creating a more effective model. The next lab focuses on using SageMaker Clarify to detect bias and provide explainability for model predictions.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.