Description
Please fill out the form below.
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Local mode with BYOC Tensorflow example
- Framework Version: n/a
- Python Version: 3
- CPU or GPU: CPU
- Python SDK Version:
AttributeError: module 'sagemaker' has no attribute '__version__'
- Are you using a custom image: Yes, I'm trying to
Describe the problem
I'm working through the BYOC TF example https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.ipynb with some modifications, and the error message from running Docker locally isn't reported properly.
Specifically, if you run just this code in a notebook:
from sagemaker.estimator import Estimator
# `role` is assumed to be defined earlier in the notebook
hyperparameters = {'train-steps': 100}
instance_type = 'local'
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='tensorflow-cifar10-example:latest',
                      hyperparameters=hyperparameters)
estimator.fit('file:///tmp/cifar-10-data')
it fails, because I don't have a Docker image called tensorflow-cifar10-example:latest. I get a big stack trace, but the key problem is that the actual error message has been swallowed. The exception says:
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/kw/8b59cw0s1c74qm8vc3xnzx50bj1cn5/T/tmpjl2m87y5/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
That doesn't tell me what's wrong. But when I run the docker-compose command myself from the command line, I get:
$ docker-compose -f /private/var/folders/kw/8b59cw0s1c74qm8vc3xnzx50bj1cn5/T/tmpjl2m87y5/docker-compose.yaml up --build --abort-on-container-exit
Pulling algo-1-HNZFC (tensorflow-cifar10-example:latest)...
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.
Continue with the new image? [yN]y
Pulling algo-1-HNZFC (tensorflow-cifar10-example:latest)...
ERROR: pull access denied for tensorflow-cifar10-example, repository does not exist or may require 'docker login'
It looks like when docker-compose is invoked in local mode, stdout gets printed but stderr is swallowed. In the notebook I see what looks like the stdout part of the docker-compose output, but not stderr:
INFO:sagemaker:Creating training-job with name: tensorflow-cifar10-example-2018-09-26-01-13-11-189
Continue with the new image? [yN]Pulling algo-1-NX951 (tensorflow-cifar10-example:latest)...
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
What's missing from the notebook output is the pair of error messages:
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.
ERROR: pull access denied for tensorflow-cifar10-example, repository does not exist or may require 'docker login'
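One way the missing stderr could be surfaced, sketched here as an illustration rather than as the SDK's actual code: merge stderr into stdout when launching the subprocess, echo each line as it arrives, and include the captured output in the RuntimeError. The function name `run_and_report` is hypothetical, and I'm assuming the command is launched through Python's `subprocess` module.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, echo its combined stdout/stderr, and raise with that output on failure.

    Hypothetical helper for illustration; not part of the sagemaker SDK.
    """
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so error lines are not lost
        universal_newlines=True,   # decode output as text
    )
    lines = []
    for line in process.stdout:
        sys.stdout.write(line)  # stream output as it arrives, e.g. into the notebook
        lines.append(line)
    exit_code = process.wait()
    if exit_code != 0:
        # Keep the existing message format, but append what the command printed.
        raise RuntimeError(
            "Failed to run: %s, Process exited with code: %s\n%s"
            % (cmd, exit_code, "".join(lines))
        )
    return "".join(lines)
```

With something like this, the "pull access denied" line would appear both in the streamed output and in the exception text, instead of only the exit code.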
Minimal repro / logs
See above.