Open
Description
Describe the bug
The logic that repacks the model artifact (notably used for MXNet) extracts it into a temp dir under /tmp. However, on a SageMaker notebook instance (classic), this partition is limited in size and cannot be resized, so creating a model fails when model.tar.gz is large.
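One way to confirm that the partition is the limiting factor is to compare the archive's uncompressed size with the free space in the directory it will be extracted into. The snippet below is a minimal sketch using only the Python standard library; the helper names `uncompressed_size` and `fits_in` are hypothetical and are not part of the SageMaker SDK.

```python
import shutil
import tarfile
import tempfile


def uncompressed_size(tar_path):
    """Return the total size in bytes of the regular files in a tar archive."""
    with tarfile.open(tar_path, mode="r:*") as archive:
        return sum(m.size for m in archive.getmembers() if m.isreg())


def fits_in(directory, tar_path):
    """Return True if the extracted archive should fit in `directory`.

    This is only an estimate: it ignores filesystem overhead and any
    extra files written next to the extracted model during repacking.
    """
    free_bytes = shutil.disk_usage(directory).free
    return uncompressed_size(tar_path) <= free_bytes


# Hypothetical usage: check the default temp dir before creating the model.
# fits_in(tempfile.gettempdir(), "model.tar.gz")
```

If `fits_in` returns False for the default temp dir but True for a directory on the EBS volume, the failure is most likely caused by the size of /tmp rather than by the archive itself.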
I propose allowing callers to customize the temp dir. That way, I can simply increase the EBS volume of my SageMaker notebook instance and then set the temp dir to /home/ec2-user/SageMaker/tmp/.
For now, my workaround is to modify sagemaker/utils.py directly and hardcode /home/ec2-user/SageMaker/tmp.
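As a rough illustration of the proposal, a temp-dir helper could accept an optional base directory and pass it to `tempfile.mkdtemp`. This is a sketch only, not the SDK's implementation: the function name `tmpdir` and its `directory` parameter are assumptions made for this example.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def tmpdir(suffix="", prefix="tmp", directory=None):
    """Create a temporary directory and remove it on exit.

    `directory` is a hypothetical parameter: when it is None, Python
    falls back to its default location (tempfile.gettempdir()).
    """
    if directory is not None:
        # Make sure the custom base directory exists before using it.
        os.makedirs(directory, exist_ok=True)
    tmp = tempfile.mkdtemp(suffix=suffix, prefix=prefix, dir=directory)
    try:
        yield tmp
    finally:
        # Clean up even if the body of the `with` block raised.
        shutil.rmtree(tmp, ignore_errors=True)


# Hypothetical usage with the EBS-backed path from this report:
# with tmpdir(directory="/home/ec2-user/SageMaker/tmp") as tmp:
#     ...  # extract and repack the model under tmp
```

Separately, Python's `tempfile` module chooses its default directory from the `TMPDIR` environment variable when it is set before the first temp file or directory is created. Whether that helps here depends on the SDK helper not passing an explicit directory, which is an assumption I have not verified.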
To reproduce
import logging

from sagemaker.mxnet import MXNetModel

# train_model_artifact, role, and sess are assumed to be defined earlier in the notebook.
model = MXNetModel(
    model_data=train_model_artifact,
    role=role,
    entry_point='inf_entrypoint.py',
    source_dir='../../src/sm_ngb',
    py_version="py3",
    framework_version="1.6.0",
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
)
model._create_sagemaker_model(instance_type='ml.m5.large')
Expected behavior
A new SageMaker model is created and is visible in the console.
Screenshots or logs
This is the stack trace:
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<timed eval> in <module>
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
186 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
187 """
--> 188 container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
189 self.name = self.name or utils.name_from_image(container_def["Image"])
190 enable_network_isolation = self.enable_network_isolation()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/mxnet/model.py in prepare_container_def(self, instance_type, accelerator_type)
155
156 deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 157 self._upload_code(deploy_key_prefix, self._is_mms_version())
158 deploy_env = dict(self.env)
159 deploy_env.update(self._framework_env_vars())
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
933 repacked_model_uri=repacked_model_data,
934 sagemaker_session=self.sagemaker_session,
--> 935 kms_key=self.model_kms_key,
936 )
937
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
486
487 with _tmpdir() as tmp:
--> 488 model_dir = _extract_model(model_uri, sagemaker_session, tmp)
489
490 _create_or_update_code_dir(
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in _extract_model(model_uri, sagemaker_session, tmp)
582 local_model_path = model_uri.replace("file://", "")
583 with tarfile.open(name=local_model_path, mode="r:gz") as t:
--> 584 t.extractall(path=tmp_model_dir)
585 return tmp_model_dir
586
~/anaconda3/envs/python3/lib/python3.6/tarfile.py in extractall(self, path, members, numeric_owner)
2008 # Do not set_attrs directories, as we will do that further down
2009 self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
-> 2010 numeric_owner=numeric_owner)
2011
2012 # Reverse sort directories.
~/anaconda3/envs/python3/lib/python3.6/tarfile.py in extract(self, member, path, set_attrs, numeric_owner)
2050 self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
2051 set_attrs=set_attrs,
-> 2052 numeric_owner=numeric_owner)
2053 except OSError as e:
2054 if self.errorlevel > 0:
~/anaconda3/envs/python3/lib/python3.6/tarfile.py in _extract_member(self, tarinfo, targetpath, set_attrs, numeric_owner)
2120
2121 if tarinfo.isreg():
-> 2122 self.makefile(tarinfo, targetpath)
2123 elif tarinfo.isdir():
2124 self.makedir(tarinfo, targetpath)
~/anaconda3/envs/python3/lib/python3.6/tarfile.py in makefile(self, tarinfo, targetpath)
2169 target.truncate()
2170 else:
-> 2171 copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
2172
2173 def makeunknown(self, tarinfo, targetpath):
~/anaconda3/envs/python3/lib/python3.6/tarfile.py in copyfileobj(src, dst, length, exception, bufsize)
250 if len(buf) < bufsize:
251 raise exception("unexpected end of data")
--> 252 dst.write(buf)
253
254 if remainder != 0:
OSError: [Errno 28] No space left on device
System information
- SageMaker Python SDK version: 1.72.1 (default installed in SageMaker notebook instance classic, conda environment python3)
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): mxnet
- Framework version: 1.6.0
- Python version: 3.6
- CPU or GPU: cpu
- Custom Docker image (Y/N): N
Additional context
N/A.