Attach and deploy doesn't create the endpoint #1379

@aninoy

Describe the bug
When I attach to a previously completed training job and deploy it, the deploy call fails the endpoint health check and the logs show errors like:
FileNotFoundError: [Errno 2] No such file or directory: 'nginx': 'nginx'

To reproduce
I'm using the Keras script mode example as the basis for my model, train, attach and deploy scripts.

Code for starting a training job:

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()
estimator = TensorFlow(base_job_name='training-job',
                       entry_point='model.py',
                       source_dir=source_dir,
                       output_path=output_path,
                       role=role,
                       framework_version='1.15.0',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=1,
                       train_instance_type=train_instance_type,
                       metric_definitions=keras_metric_definition)

train_channel = sagemaker.session.s3_input(train_input_path)
valid_channel = sagemaker.session.s3_input(validation_input_path)
test_channel = sagemaker.session.s3_input(test_input_path)

data_channels = {
    'train': train_channel, 
    'val': valid_channel,
    'test': test_channel
}
estimator.fit(inputs=data_channels)

Code for attaching and deploying:

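import argparse
import logging

import sagemaker as sm

# The original script's argument parser was not shown; these flag names and
# defaults are assumed from the args.* attributes used below.
parser = argparse.ArgumentParser()
parser.add_argument('--model-name', required=True)  # passed to attach() as the training job name
parser.add_argument('--endpoint-name', required=True)
parser.add_argument('--instance-type', required=True)
parser.add_argument('--instance-count', type=int, default=1)
args = parser.parse_args()

session = sm.session.Session(default_bucket=bucket_name)  # bucket_name is defined elsewhere
role = sm.get_execution_role(sagemaker_session=session)

estimator = sm.estimator.Estimator.attach(args.model_name)
logging.info("Deploying model %s", args.model_name)
predictor = estimator.deploy(initial_instance_count=args.instance_count,
                             instance_type=args.instance_type,
                             endpoint_name=args.endpoint_name,
                             update_endpoint=False)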

Expected behavior
The endpoint should be deployed and ready to use via the REST API or the SageMaker Python SDK's predict method.
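
For reference, a minimal sketch of what "ready to use via the REST API" could look like once the endpoint is healthy. The helper name `predict_via_runtime` is hypothetical, and the `{"instances": ...}` request body assumes the endpoint is served by TensorFlow Serving; a different serving container may expect another payload. The client argument is assumed to behave like `boto3.client("sagemaker-runtime")`.

```python
import json


def predict_via_runtime(runtime_client, endpoint_name, instances):
    """Send a JSON request to a SageMaker endpoint and return the parsed reply.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    The {"instances": ...} body is an assumption (TensorFlow Serving REST format).
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"instances": instances}),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

With real credentials this would be called as `predict_via_runtime(boto3.client("sagemaker-runtime"), args.endpoint_name, batch)`, where `batch` is a nested list matching the model's input shape.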

Screenshots or logs
[Screenshot: Screen Shot 2020-03-31 at 2 01 52 AM]
[Screenshot: Screen Shot 2020-03-31 at 2 10 05 AM]

System information

  • SageMaker Python SDK version: 1.51.4
  • Framework name: TensorFlow (custom Keras model)
  • Framework version: 1.15.0
  • Python version: 3.6 (script mode)
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N
  • Model Container Image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.0-gpu-py3
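
The container image listed above comes from the `tensorflow-training` repository. One possible reading, not confirmed in this report, is that the endpoint was created from the training image instead of a serving image, which would fit the missing `nginx` binary. The sketch below only parses an image URI and flags training repositories; the helper names are hypothetical, and the rule that training-only repositories end in `-training` is an assumption based on the `tensorflow-training` / `tensorflow-inference` naming.

```python
from typing import NamedTuple


class EcrImage(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_ecr_image(uri: str) -> EcrImage:
    """Split '<registry>/<repository>:<tag>' into its three parts."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.rpartition(":")
    if not registry or not repository or not tag:
        raise ValueError(f"unexpected image URI: {uri!r}")
    return EcrImage(registry, repository, tag)


def looks_like_training_image(uri: str) -> bool:
    # Assumption: training-only repositories end in "-training".
    return parse_ecr_image(uri).repository.endswith("-training")
```

Running `looks_like_training_image` on the image URI that the failing endpoint's model points to would show whether deploy picked up the training container.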
