Sagemaker uploading model artifacts without compressing and in a different directory #1895

@inderpartap

Describe the bug
I have a Keras model that is trained through an entry_point script, and I use the following code in that script to store the model artifacts.

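import argparse
import os

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
args, _ = parser.parse_known_args()
model_dir = args.model_dir

# ... (model definition and training omitted) ...

# Save the trained model in SavedModel format under <model_dir>/model/1
tf.keras.models.save_model(
    model,
    os.path.join(model_dir, 'model/1'),
    overwrite=True,
    include_optimizer=True,
)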

Ideally, model_dir should be /opt/ml/model, and SageMaker should automatically upload the contents of this folder to S3 as s3://<default_bucket>/<training_name>/output/model.tar.gz

When I run tf_estimator.fit({'training': training_input_path}), the training succeeds, but the CloudWatch logs show the following:

2020-09-16 02:49:12,458 sagemaker_tensorflow_container.training WARNING  No model artifact is saved under the path /opt/ml/model. Your training job will not save any model files to S3.
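
The warning suggests that the container found no files under /opt/ml/model when training finished. A minimal sketch of a check that could be called at the end of the entry_point script to see what was actually written locally is given here. `list_model_artifacts` is a hypothetical helper introduced for illustration and is not part of the SageMaker SDK; the fallback path /opt/ml/model is taken from the log line above.

```python
import os


def list_model_artifacts(local_model_dir):
    """Return the sorted relative paths of all files under local_model_dir."""
    found = []
    for root, _dirs, files in os.walk(local_model_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, local_model_dir))
    return sorted(found)


if __name__ == '__main__':
    # SM_MODEL_DIR is the environment variable already used in the script above.
    model_root = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    print('Files under', model_root, ':', list_model_artifacts(model_root))
```

An empty list here would be consistent with the warning and would indicate that the save call wrote somewhere other than the local model directory.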

Even so, SageMaker does store my model artifacts. The difference is that, instead of being stored as s3://<default_bucket>/<training_name>/output/model.tar.gz, they are stored uncompressed as s3://<default_bucket>/<training_name>/model/model/1/saved_model.pb, along with the variables and assets folders. As a result, the estimator.deploy() call fails because it cannot find the artifacts in the output/ directory in S3.
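
The observed prefix, s3://<default_bucket>/<training_name>/model, is what one would expect if args.model_dir held an S3 URI rather than /opt/ml/model. One possible explanation, not confirmed in this issue, is that the TensorFlow estimator passes its own model_dir hyperparameter to the script, overriding the SM_MODEL_DIR default. The sketch below works under that assumption: it reads the local path from a separate argument, so a `--model_dir` value supplied by the SDK cannot shadow it. The argument name `--sm-model-dir`, the function `resolve_local_model_dir`, and the example bucket name are illustrative choices, not taken from the issue.

```python
import argparse
import os


def resolve_local_model_dir(argv=None):
    """Return the local directory in which the model should be saved."""
    parser = argparse.ArgumentParser()
    # Assumption: the SDK may pass --model_dir as an S3 URI, so it is
    # accepted here but not used as the save location.
    parser.add_argument('--model_dir', type=str, default=None)
    # Separate argument for the local path, defaulting to SM_MODEL_DIR.
    parser.add_argument(
        '--sm-model-dir',
        type=str,
        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'),
    )
    args, _ = parser.parse_known_args(argv)
    return args.sm_model_dir


if __name__ == '__main__':
    # Simulated command line with an S3-valued --model_dir.
    demo_args = ['--model_dir', 's3://example-bucket/job/model', '--epochs', '20']
    print(resolve_local_model_dir(demo_args))
```

With this arrangement, the save call in the entry_point script would join 'model/1' onto the returned local path instead of onto args.model_dir.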

To reproduce
Estimator code:

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    entry_point='autoencoder-model.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='2.3.0',
    py_version='py37',
    debugger_hook_config=False,
    hyperparameters={'epochs': 20},
    source_dir='/home/ec2-user/SageMaker/model',
    subnets=['subnet-1', 'subnet-2'],
    security_group_ids=['sg-1', 'sg-1'],
)

tf_estimator.fit({'training': training_input_path})

System information

  • SageMaker Python SDK version: 2.6
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow (Keras)
  • Framework version: 2.3.0
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
