Describe the bug
Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded
To reproduce
Train a framework-mode XGBoost estimator with the Debugger hook configured as below:
import time

import boto3
import sagemaker
from sagemaker.xgboost import XGBoost
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
from smexperiments.trial import Trial

hyperparams = {
    "max_depth": 5,
    "subsample": 0.8,
    "num_round": 600,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "silent": 0,
    "objective": "multi:softmax",
    "num_class": len(le.classes_),
    "smdebug_path": f"s3://{bucket}/{prefix}/debug",
    "smdebug_collections": "metrics,feature_importance",
}

save_interval = 5
entry_point_script = "xgboost_dest_prediction.py"

trial = Trial.create(
    trial_name="framework-mode-trial-{}".format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())),
    experiment_name=destination_prediction_experiment.experiment_name,
    sagemaker_boto_client=boto3.client("sagemaker"),
)

framework_xgb = XGBoost(
    entry_point=entry_point_script,
    role=sagemaker.get_execution_role(),
    framework_version="0.90-2",
    py_version="py3",
    hyperparameters=hyperparams,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    base_job_name="demo-xgboost-destination-prediction",
    sagemaker_session=sm_sess,
    # rules=debug_rules,
    use_spot_instances=True,
    max_run=3600,
    max_wait=3600,
    input_mode="File",
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=f"s3://{bucket}/{prefix}/debug",  # Required
        collection_configs=[
            CollectionConfig(
                name="metrics",
                parameters={"save_interval": str(save_interval)},
            )
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)

framework_xgb.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    experiment_config={
        "ExperimentName": destination_prediction_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)
Expected behavior
Tensors should be saved to the configured S3 debug path, and the LossNotDecreasing rule should be able to load them.
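For reference, this is roughly how I check whether anything was written under the debug prefix. The helper names are my own (illustrative only), and `bucket`/`prefix` are the same placeholders as in the repro snippet above:

```python
# Illustrative check: list whatever the training job wrote under the
# Debugger S3 output path. `bucket` and `prefix` are placeholders, as above.

def debug_key_prefix(prefix: str) -> str:
    # Mirrors the s3_output_path passed to DebuggerHookConfig above.
    return f"{prefix}/debug"

def list_debug_keys(bucket: str, prefix: str) -> list:
    import boto3  # imported lazily so debug_key_prefix stays dependency-free
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=debug_key_prefix(prefix))
    return [obj["Key"] for obj in resp.get("Contents", [])]
```

After the job above finishes, `list_debug_keys(bucket, prefix)` returns an empty list, which matches the rule error.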
Screenshots or logs
[{'RuleConfigurationName': 'LossNotDecreasing',
'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:990360540682:processing-job/demo-xgboost-destination-p-lossnotdecreasing-abb2296f',
'RuleEvaluationStatus': 'Error',
'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n File "evaluate.py", line 112, in _create_trials\n range_steps=(self.start_step, self.end_step))\n File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 20, in create_trial\n return LocalTrial(name=name, dirname=path, **kwargs)\n File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__\n self._load_collections()\n File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections\n _wait_for_collection_files(1) # wait for the first collection file\n File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files\n raise MissingCollectionFiles\nsmdebug.exceptions.MissingCollectionFiles: Trainin',
'LastModifiedTime': datetime.datetime(2020, 9, 18, 11, 6, 27, 290000, tzinfo=tzlocal())}]
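In case it's relevant: I wasn't sure whether, in framework (script) mode, the entry point itself still has to create the smdebug hook. This is a hedged sketch of what I understand that registration to look like, based on my reading of the smdebug docs; the try/except fallback is my own addition for local runs, and `dtrain` is assumed to exist in the script:

```python
# Sketch of hook registration inside the entry point (xgboost_dest_prediction.py).
# Not verified against this particular container version.
try:
    from smdebug.xgboost import Hook
    # SageMaker writes a JSON hook config into the container when
    # debugger_hook_config is set on the estimator; build the hook from it.
    hook = Hook.create_from_json_file()
    callbacks = [hook]
except Exception:  # smdebug missing or no hook config (e.g. a local run)
    callbacks = []

# The hook is then passed as an XGBoost training callback so tensors get emitted:
# bst = xgboost.train(hyperparams, dtrain, callbacks=callbacks)
```

If the hook does have to be created in the script, it would be good to have that called out in the framework-mode Debugger docs.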
System information
SageMaker Python SDK version: 2.6
Framework name (eg. PyTorch) or algorithm (eg. KMeans): XGBoost (framework/script mode)
Framework version: 0.90-2
Python version: 3.8
CPU or GPU: CPU
Custom Docker image (Y/N): N