Allow repack model function to customize tempdir  #1956

@verdimrc

Describe the bug

The logic that repacks the model artifact (notably used for MXNet) creates its temp dir under /tmp. However, on a classic SageMaker notebook instance, this partition is limited in size and cannot be enlarged, so creating a model fails when model.tar.gz is large.

I propose allowing callers to customize the temp dir. That way, I can simply increase the EBS volume of my SageMaker notebook instance and then set the temp dir to /home/ec2-user/SageMaker/tmp/.
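To illustrate the proposal, here is a minimal sketch of a temp-dir context manager that accepts a caller-supplied parent directory. The name `custom_tmpdir` and its `directory` parameter are hypothetical, not part of the SDK; only `tempfile.mkdtemp(dir=...)` and `shutil.rmtree` from the standard library are relied on.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def custom_tmpdir(directory=None, suffix="", prefix="tmp"):
    """Create a temp dir under `directory` and delete it on exit.

    Hypothetical helper: `directory=None` falls back to the platform
    default (usually /tmp), so existing callers would be unaffected.
    """
    if directory is not None:
        # Make sure the caller-supplied parent exists before using it.
        os.makedirs(directory, exist_ok=True)
    tmp = tempfile.mkdtemp(suffix=suffix, prefix=prefix, dir=directory)
    try:
        yield tmp
    finally:
        # Clean up even if the body raised (e.g. a failed extraction).
        shutil.rmtree(tmp, ignore_errors=True)
```

With something like this, a caller on a notebook instance could pass `directory="/home/ec2-user/SageMaker/tmp"` so the extraction lands on the resizable EBS volume.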

Presently, my hack is to modify sagemaker/utils.py directly and hardcode /home/ec2-user/SageMaker/tmp.
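A less invasive workaround may be to redirect Python's own default temp location before creating the model. This sketch assumes the SDK's `_tmpdir()` helper creates its directory through the `tempfile` module without an explicit `dir` argument, which I have not verified for every SDK version; `use_tempdir` is a hypothetical helper name.

```python
import os
import tempfile


def use_tempdir(path):
    """Point the tempfile module at `path`; return the previous setting.

    Assumption: the code that needs more space creates its temp dir via
    tempfile (mkdtemp, TemporaryDirectory, ...) with the default location.
    """
    os.makedirs(path, exist_ok=True)
    previous = tempfile.tempdir
    # tempfile consults this module-level variable before TMPDIR or /tmp.
    tempfile.tempdir = path
    return previous
```

For example, calling `use_tempdir("/home/ec2-user/SageMaker/tmp")` at the top of the notebook, and restoring the returned value afterwards, avoids editing installed package files.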

To reproduce

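import logging

from sagemaker.mxnet import MXNetModel

# train_model_artifact, role, and sess are assumed to be defined earlier in the notebook.
model = MXNetModel(
    model_data=train_model_artifact,
    role=role,
    entry_point='inf_entrypoint.py',
    source_dir='../../src/sm_ngb',
    py_version="py3",
    framework_version="1.6.0",
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
)

# Fails with OSError (Errno 28) when model.tar.gz does not fit under /tmp.
model._create_sagemaker_model(instance_type='ml.m5.large')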

Expected behavior

A new SageMaker model will be created and visible from the console.

Screenshots or logs

This is the stack trace:

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
    186                 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
    187         """
--> 188         container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
    189         self.name = self.name or utils.name_from_image(container_def["Image"])
    190         enable_network_isolation = self.enable_network_isolation()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/mxnet/model.py in prepare_container_def(self, instance_type, accelerator_type)
    155 
    156         deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 157         self._upload_code(deploy_key_prefix, self._is_mms_version())
    158         deploy_env = dict(self.env)
    159         deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
    933                 repacked_model_uri=repacked_model_data,
    934                 sagemaker_session=self.sagemaker_session,
--> 935                 kms_key=self.model_kms_key,
    936             )
    937 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
    486 
    487     with _tmpdir() as tmp:
--> 488         model_dir = _extract_model(model_uri, sagemaker_session, tmp)
    489 
    490         _create_or_update_code_dir(

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in _extract_model(model_uri, sagemaker_session, tmp)
    582         local_model_path = model_uri.replace("file://", "")
    583     with tarfile.open(name=local_model_path, mode="r:gz") as t:
--> 584         t.extractall(path=tmp_model_dir)
    585     return tmp_model_dir
    586 

~/anaconda3/envs/python3/lib/python3.6/tarfile.py in extractall(self, path, members, numeric_owner)
   2008             # Do not set_attrs directories, as we will do that further down
   2009             self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
-> 2010                          numeric_owner=numeric_owner)
   2011 
   2012         # Reverse sort directories.

~/anaconda3/envs/python3/lib/python3.6/tarfile.py in extract(self, member, path, set_attrs, numeric_owner)
   2050             self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
   2051                                  set_attrs=set_attrs,
-> 2052                                  numeric_owner=numeric_owner)
   2053         except OSError as e:
   2054             if self.errorlevel > 0:

~/anaconda3/envs/python3/lib/python3.6/tarfile.py in _extract_member(self, tarinfo, targetpath, set_attrs, numeric_owner)
   2120 
   2121         if tarinfo.isreg():
-> 2122             self.makefile(tarinfo, targetpath)
   2123         elif tarinfo.isdir():
   2124             self.makedir(tarinfo, targetpath)

~/anaconda3/envs/python3/lib/python3.6/tarfile.py in makefile(self, tarinfo, targetpath)
   2169                 target.truncate()
   2170             else:
-> 2171                 copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
   2172 
   2173     def makeunknown(self, tarinfo, targetpath):

~/anaconda3/envs/python3/lib/python3.6/tarfile.py in copyfileobj(src, dst, length, exception, bufsize)
    250         if len(buf) < bufsize:
    251             raise exception("unexpected end of data")
--> 252         dst.write(buf)
    253 
    254     if remainder != 0:

OSError: [Errno 28] No space left on device
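To confirm that the temp partition is the one running out of space, a quick check along these lines can be run before creating the model. It is only a diagnostic sketch: `has_room` is a hypothetical helper, and the required size has to be estimated by the caller, since the extracted model is typically larger than the compressed model.tar.gz.

```python
import shutil
import tempfile


def has_room(required_bytes, path=None):
    """Return True if `path` has at least `required_bytes` free.

    `path` defaults to the directory tempfile would use (usually /tmp).
    """
    if path is None:
        path = tempfile.gettempdir()
    # disk_usage reports total, used, and free bytes for the partition.
    return shutil.disk_usage(path).free >= required_bytes
```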

System information

  • SageMaker Python SDK version: 1.72.1 (default installed in SageMaker notebook instance classic, conda environment python3)
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): mxnet
  • Framework version: 1.6.0
  • Python version: 3.6
  • CPU or GPU: cpu
  • Custom Docker image (Y/N): N

Additional context
N/A.
