Hello,
I am trying to train a model using the MXNet estimator. As soon as training starts, I see the following error when SageMaker tries to upload checkpoints:
  File "/opt/ml/code/translation.py", line 224, in __call__
    return super(NMTModel, self).__call__(src_seq, tgt_seq, src_valid_length, tgt_valid_length)
  File "/usr/local/lib/python3.6/site-packages/mxnet/gluon/block.py", line 756, in __call__
    hook(self, args)
  File "/usr/local/lib/python3.6/site-packages/smdebug/mxnet/hook.py", line 143, in forward_pre_hook
    self._increment_step()
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/hook.py", line 511, in _increment_step
    self._write_state()
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/hook.py", line 523, in _write_state
    if self.state_store.is_checkpoint_updated():
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/state_store.py", line 112, in is_checkpoint_updated
    cp_file_sizes = [os.path.getsize(file) for file in checkpoint_files]
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/state_store.py", line 112, in <listcomp>
    cp_file_sizes = [os.path.getsize(file) for file in checkpoint_files]
  File "/usr/local/lib/python3.6/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading'
[2020-07-31 05:38:53.405 ip-10-2-236-41.ec2.internal:144 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-07-31 05:38:54,756 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
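Judging from the trace alone, my guess is that the list of checkpoint files is built first and the `.sagemaker-uploading` file disappears (removed or renamed) before `os.path.getsize` reaches it. Below is a minimal sketch of how a size scan could tolerate that. It is not smdebug's actual code: `existing_file_sizes` is a hypothetical helper, and the missing path in the demo only stands in for the file named in the error.

```python
import os
import tempfile


def existing_file_sizes(paths):
    """Return {path: size in bytes}, skipping paths that no longer exist."""
    sizes = {}
    for path in paths:
        try:
            sizes[path] = os.path.getsize(path)
        except FileNotFoundError:
            # The file was removed or renamed after the list was built.
            continue
    return sizes


# Demo: one real file plus one path that was never created, standing in
# for a temporary file that vanished between listing and stat.
with tempfile.TemporaryDirectory() as ckpt_dir:
    kept = os.path.join(ckpt_dir, "metadata.json")
    with open(kept, "w") as f:
        f.write("{}")
    gone = kept + ".sagemaker-uploading"  # hypothetical vanished file
    demo_sizes = existing_file_sizes([kept, gone])
    print(demo_sizes)
```

With a plain list comprehension over the same two paths, the second `os.path.getsize` call raises `FileNotFoundError`, which matches the last frames of the trace.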
A few details about the job:
- I am using the MXNet estimator in a distributed setting
- I am using 4 p3.16xlarge instances