Closed
Description
Describe the bug
When I attach to a previously completed training job and deploy it, the deployment fails the endpoint health check and the logs show errors like:
FileNotFoundError: [Errno 2] No such file or directory: 'nginx': 'nginx'
To reproduce
My model, training, attach, and deploy scripts are based on the Keras script mode example.
Code for starting a training job:
import sagemaker
from sagemaker.tensorflow import TensorFlow

# source_dir, output_path, hyperparameters, train_instance_type,
# keras_metric_definition and the *_input_path values are defined
# elsewhere in the script (not shown).
role = sagemaker.get_execution_role()
estimator = TensorFlow(base_job_name='training-job',
                       entry_point='model.py',
                       source_dir=source_dir,
                       output_path=output_path,
                       role=role,
                       framework_version='1.15.0',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=1,
                       train_instance_type=train_instance_type,
                       metric_definitions=keras_metric_definition)

# One S3 input channel per dataset split
train_channel = sagemaker.session.s3_input(train_input_path)
valid_channel = sagemaker.session.s3_input(validation_input_path)
test_channel = sagemaker.session.s3_input(test_input_path)
data_channels = {
    'train': train_channel,
    'val': valid_channel,
    'test': test_channel
}
estimator.fit(inputs=data_channels)
Code for attaching and deploying:
import argparse
import logging

import sagemaker as sm

# `args` (the parsed argparse namespace) and `bucket_name` are defined
# elsewhere in the script (not shown).
session = sm.session.Session(default_bucket=bucket_name)
role = sm.get_execution_role(sagemaker_session=session)

# Attach to the completed training job through the generic Estimator class
estimator = sm.estimator.Estimator.attach(args.model_name)
logging.info("Deploying model %s", args.model_name)
predictor = estimator.deploy(initial_instance_count=args.instance_count,
                             instance_type=args.instance_type,
                             endpoint_name=args.endpoint_name,
                             update_endpoint=False)
Expected behavior
The endpoint should be deployed and ready to use via the REST API or the SageMaker Python SDK's predict method.
Screenshots or logs
System information
- SageMaker Python SDK version: 1.51.4
- Framework name: TensorFlow (custom Keras model)
- Framework version: 1.15.0
- Python version: 3.6 (script mode)
- CPU or GPU: GPU
- Custom Docker image (Y/N): N
- Model Container Image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.0-gpu-py3
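One detail in the system information may be worth checking: the Model Container Image above points at the tensorflow-training repository, which is the image the training job ran in. It is an assumption, not something confirmed in this report, that this image does not ship nginx and that deploying it is what produces the FileNotFoundError. The sketch below only shows a quick way to read the repository name out of an ECR image URI. The helper names `image_repository` and `looks_like_training_image` are hypothetical, and the "-training" suffix check is a naming heuristic rather than anything the SDK does.

```python
import re

# Pattern for an ECR image URI:
#   <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def image_repository(image_uri):
    """Return the repository name from an ECR image URI (hypothetical helper)."""
    match = ECR_URI.match(image_uri)
    if match is None:
        raise ValueError("not an ECR image URI: %r" % image_uri)
    return match.group("repository")


def looks_like_training_image(image_uri):
    """Heuristic (assumption): repositories ending in '-training' hold training images."""
    return image_repository(image_uri).endswith("-training")


# Image URI copied from the system information in this report
uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.0-gpu-py3"
print(image_repository(uri))
print(looks_like_training_image(uri))
```

If the check flags the image as a training image, it may be worth comparing the result of attaching through the framework-specific TensorFlow estimator instead of the generic Estimator class, though whether that changes the deployed image has not been verified here.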