
Running local training inside docker container (No /opt/ml/input/config/resourceconfig.json) #106

Closed
bee-keeper opened this issue Jun 4, 2019 · 7 comments

bee-keeper commented Jun 4, 2019

Hi there,

Given the following simplified setup, I can't get local training to work inside Docker. Outside of Docker it works fine, but when I run it from within the container it blows up with the error below.

docker-compose.yml

---
version: '2'

services:
  training:
    build:
      context: .
    stdin_open: true
    tty: true
    volumes:
      - ../:/root/code
      - $HOME/.aws:/root/.aws
      - /var/run/docker.sock:/var/run/docker.sock

I've tried this with the 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3 Docker image and also with other base images; the error is always the same.

I have a simple train.py script which gets mounted in /root/code. I run docker-compose up and, inside the container, run python train.py.

The output is:

root@8ce53fb18e5c:/root/code# python train.py 
Creating tmp6d4_wpt6_algo-1-nexl2_1 ... done
Attaching to tmp6d4_wpt6_algo-1-nexl2_1
algo-1-nexl2_1  | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
algo-1-nexl2_1  | changehostname.c: In function ‘gethostname’:
algo-1-nexl2_1  | changehostname.c:15:21: error: expected expression before ‘;’ token
algo-1-nexl2_1  |    const char *val = ;
algo-1-nexl2_1  |                      ^
algo-1-nexl2_1  | gcc: error: changehostname.o: No such file or directory
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | Reporting training FAILURE
algo-1-nexl2_1  | framework error: 
algo-1-nexl2_1  | Traceback (most recent call last):
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 47, in train
algo-1-nexl2_1  |     env = sagemaker_containers.training_env()
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 26, in training_env
algo-1-nexl2_1  |     resource_config=_env.read_resource_config(),
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 237, in read_resource_config
algo-1-nexl2_1  |     return _read_json(resource_config_file_dir)
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 193, in _read_json
algo-1-nexl2_1  |     with open(path, 'r') as f:
algo-1-nexl2_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-nexl2_1  | 
algo-1-nexl2_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
tmp6d4_wpt6_algo-1-nexl2_1 exited with code 2

My question is therefore: how do I run local training in a dockerised environment? Obviously the training itself runs in Docker, but I'd also like to dockerise the environment that the pre-training script runs in, in order to pin down dependencies, reduce onboarding time, etc.

ChoiByungWook (Contributor) commented

Hello @bee-keeper,

Do you think you can share the contents of train.py?

Based on the stack trace, the missing files such as resourceconfig.json are ones that SageMaker itself provides to the training container: https://docs.aws.amazon.com/sagemaker/latest/dg/API_ResourceConfig.html
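
For context, when a training job starts (including in local mode), that file is written into the container automatically; at a minimum it carries the current host name and the list of hosts in the job (the hostname below is illustrative), something along the lines of:

/opt/ml/input/config/resourceconfig.json

{
    "current_host": "algo-1",
    "hosts": ["algo-1"]
}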

I understand the need for fast iteration, and being able to debug outside of SageMaker enables that.

Once we get past this hurdle, I do recommend using the SageMaker Python SDK's local mode, which will run your Docker container on your localhost in an emulated environment. Iterations should be quicker, as you won't be waiting for instances to be provisioned.

Here is a PyTorch notebook showcasing how to use local mode: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/pytorch_cnn_cifar10/pytorch_local_mode_cifar10.ipynb
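
To give a sense of the pattern (the entry point and role below are placeholders, not values taken from the notebook), local mode with the PyTorch estimator boils down to roughly this:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='script.py',       # your training code, run inside the framework container
    role='<role_name>',            # placeholder IAM role
    framework_version='1.0.0',
    py_version='py3',
    train_instance_count=1,
    train_instance_type='local',   # 'local' runs the framework container on the local Docker daemon
)
estimator.fit()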

bee-keeper (Author) commented

Hi @ChoiByungWook

I am using local mode, but I want to run the launch script in a Docker container, and inside that container SageMaker starts another container to run the training. As I said, it works fine if I launch training from my host outside of Docker, but if I try to dockerise it I get the error I posted.

Could you please confirm that it's possible to dockerise the notebook/initial Python script which calls fit()?

ChoiByungWook (Contributor) commented

I think I am misunderstanding your statement regarding:

inside that container SageMaker starts another container to run the training

SageMaker with local mode should only run the Docker container you specify, not spin up a container within your container, unless that was your intention?

If that is your intention, then I would say it is possible; however, I'm not entirely sure I understand the use case or benefit.

bee-keeper (Author) commented

Yes, it is my intention to run training within a container. In my use case, the training script has various dependencies which would be better encapsulated in Docker rather than having all team members work with virtual envs on their host machines. I'd have thought the benefits of this would be obvious?

laurenyu (Contributor) commented Jun 8, 2019

Hi @bee-keeper, I was able to get a trivial local training job to run in a Docker container. There are probably better approaches, but here's what I did:

  1. started with the pre-built SageMaker MXNet image (it was one I happened to have downloaded already): 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet:1.4.0-cpu-py3
  2. bashed into the image: docker run -it --privileged -v /var/lib/docker --entrypoint bash d54dd07e344c. (make sure you have --privileged and -v /var/lib/docker)
  3. installed Docker following these instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-16-04
  4. started the Docker daemon: dockerd&
  5. installed the SageMaker Python SDK: pip install sagemaker
  6. made a very trivial "training" script: echo "print('hello world')" > script.py
  7. set my AWS credentials
  8. ran a training job using the Python SDK's Local Mode:
# python
Python 3.6.8 (default, Dec 24 2018, 19:24:27)           
[GCC 5.4.0 20160609] on linux   
Type "help", "copyright", "credits" or "license" for more information.
>>> from sagemaker.mxnet import MXNet
WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
>>> m = MXNet('script.py', role=<role_name>, framework_version='1.4.0', train_instance_count=1, train_instance_type='local', py_version='py3')
>>> m.fit()

bee-keeper (Author) commented

@laurenyu OK, thanks for the confirmation. I'm going to close this for now.

kenny-chen commented Dec 19, 2023

Try adding -v /tmp:/tmp when running your dev container.
estimator.fit(...) creates a temporary folder such as /tmp/tmp6p45ov_a/ to train the ML model, but if your dev container does not mount /tmp, the SageMaker training container it launches cannot see /tmp/tmp6p45ov_a/, and so resourceconfig.json appears to be missing.
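
Applied to the docker-compose.yml from the original post, that would mean adding one more entry to the training service's volumes (a sketch, untested):

    volumes:
      - ../:/root/code
      - $HOME/.aws:/root/.aws
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp:/tmp

That way, the temporary job directory the SDK writes under /tmp inside the dev container also exists at the same path on the host, which is where the host's Docker daemon resolves the bind mounts for the training container it launches.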
