Sagemaker training smdebug/core/state_store.py FileNotFoundError #1791

@guptaanshul201989

Hello,

I am trying to train a model using the MXNet estimator. As soon as training starts, I see the following error when SageMaker tries to upload checkpoints:

  File "/opt/ml/code/translation.py", line 224, in __call__
    return super(NMTModel, self).__call__(src_seq, tgt_seq, src_valid_length, tgt_valid_length)
  File "/usr/local/lib/python3.6/site-packages/mxnet/gluon/block.py", line 756, in __call__
    hook(self, args)
  File "/usr/local/lib/python3.6/site-packages/smdebug/mxnet/hook.py", line 143, in forward_pre_hook
    self._increment_step()
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/hook.py", line 511, in _increment_step
    self._write_state()
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/hook.py", line 523, in _write_state
    if self.state_store.is_checkpoint_updated():
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/state_store.py", line 112, in is_checkpoint_updated
    cp_file_sizes = [os.path.getsize(file) for file in checkpoint_files]
  File "/usr/local/lib/python3.6/site-packages/smdebug/core/state_store.py", line 112, in <listcomp>
    cp_file_sizes = [os.path.getsize(file) for file in checkpoint_files]
  File "/usr/local/lib/python3.6/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading'
[2020-07-31 05:38:53.405 ip-10-2-236-41.ec2.internal:144 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-07-31 05:38:54,756 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

A few details about the job:

  1. I am using the MXNet estimator in a distributed setting
  2. I am using 4 p3.16xlarge instances
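
Reading the traceback, my guess is that this is a race: the list of checkpoint files is collected first, and by the time `os.path.getsize` runs, the temporary `metadata.json.sagemaker-uploading` file has already been renamed or removed. I have not confirmed this against the smdebug source, so treat it as an assumption. A minimal sketch of the kind of tolerant size lookup I would expect, where `safe_file_sizes` is a hypothetical helper and not part of smdebug:

```python
import os
from typing import Dict, Iterable


def safe_file_sizes(paths: Iterable[str]) -> Dict[str, int]:
    """Return sizes for the files that still exist, skipping ones that vanished.

    Hypothetical helper for illustration only; it is not smdebug's API.
    """
    sizes: Dict[str, int] = {}
    for path in paths:
        try:
            sizes[path] = os.path.getsize(path)
        except FileNotFoundError:
            # The file was listed but removed or renamed before stat() ran,
            # for example a temporary "*.sagemaker-uploading" file.
            continue
    return sizes
```

With something like this, a file that disappears between the directory listing and the size check would simply be left out of the result instead of aborting the training step.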
