Poor error reporting of docker failures in local mode #405

@leopd

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Local mode with BYOC Tensorflow example
  • Framework Version: n/a
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version: AttributeError: module 'sagemaker' has no attribute '__version__'
  • Are you using a custom image: Yes, I'm trying to

Describe the problem

I'm working through a modified version of the BYOC TF example https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.ipynb and I see that the error message from running docker locally isn't reported properly.

Specifically, if you run just this code in a notebook:

from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 100}

instance_type = 'local'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='tensorflow-cifar10-example:latest',
                      hyperparameters=hyperparameters)

estimator.fit('file:///tmp/cifar-10-data')

it fails (because I don't have a docker image called tensorflow-cifar10-example:latest). I get a big stack trace, but the key problem is that the actual error message has been swallowed. The exception says RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/kw/8b59cw0s1c74qm8vc3xnzx50bj1cn5/T/tmpjl2m87y5/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1, which doesn't tell me what's wrong. But when I run the docker-compose command myself from the command line, I get:

$ docker-compose -f /private/var/folders/kw/8b59cw0s1c74qm8vc3xnzx50bj1cn5/T/tmpjl2m87y5/docker-compose.yaml up --build --abort-on-container-exit
Pulling algo-1-HNZFC (tensorflow-cifar10-example:latest)...
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Continue with the new image? [yN]y
Pulling algo-1-HNZFC (tensorflow-cifar10-example:latest)...
ERROR: pull access denied for tensorflow-cifar10-example, repository does not exist or may require 'docker login'

It looks like when docker-compose is invoked in local mode, stdout gets printed but stderr is swallowed. In the notebook I see what looks like the stdout part of docker-compose's output, but not the stderr part:

INFO:sagemaker:Creating training-job with name: tensorflow-cifar10-example-2018-09-26-01-13-11-189
Continue with the new image? [yN]Pulling algo-1-NX951 (tensorflow-cifar10-example:latest)...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)

The key thing missing from the notebook output is the two error messages:

ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.
ERROR: pull access denied for tensorflow-cifar10-example, repository does not exist or may require 'docker login'
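
For illustration, here is a minimal sketch of one way a subprocess wrapper could avoid swallowing stderr: merge stderr into stdout, stream the combined output, and include the tail of it in the raised exception. This is not the SDK's actual implementation. The function name run_and_report and the 20-line tail are assumptions made up for this example; only the standard-library subprocess calls are real.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, echo its combined stdout/stderr, and raise with that output on failure.

    Hypothetical helper for illustration; not part of the sagemaker SDK.
    """
    captured = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream as stdout
        universal_newlines=True,   # decode bytes to text
    ) as process:
        for line in process.stdout:
            sys.stdout.write(line)  # show output as it arrives
            captured.append(line)
        exit_code = process.wait()

    if exit_code != 0:
        # Keep the last lines of output so the real error is visible in the traceback.
        tail = "".join(captured[-20:])
        raise RuntimeError(
            "Failed to run: %s, Process exited with code: %s\n%s"
            % (cmd, exit_code, tail)
        )
    return "".join(captured)
```

Called with the docker-compose command from the exception above, a wrapper like this would be expected to put the "ERROR: pull access denied ..." line both in the notebook output and in the RuntimeError message, since docker-compose writes it to stderr.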

Minimal repro / logs

See above.
