
Running local training inside docker container (No /opt/ml/input/config/resourceconfig.json) #106

Closed
bee-keeper opened this issue Jun 4, 2019 · 7 comments

bee-keeper commented Jun 4, 2019

Hi there,

Given the following simplified setup, I can't get local training to work inside Docker. Outside of Docker it works fine, but when I run it from within the container it blows up with the error below.

docker-compose.yml

---
version: '2'

services:
  training:
    build:
      context: .
    stdin_open: true
    tty: true
    volumes:
      - ../:/root/code
      - $HOME/.aws:/root/.aws
      - /var/run/docker.sock:/var/run/docker.sock

I've tried this with the 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3 Docker image and also with other base images; the error is always the same.

I have a simple train.py script which gets mounted in /root/code. I run docker-compose up and, inside the container, run python train.py.

The output is:

root@8ce53fb18e5c:/root/code# python train.py 
Creating tmp6d4_wpt6_algo-1-nexl2_1 ... done
Attaching to tmp6d4_wpt6_algo-1-nexl2_1
algo-1-nexl2_1  | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
algo-1-nexl2_1  | changehostname.c: In function ‘gethostname’:
algo-1-nexl2_1  | changehostname.c:15:21: error: expected expression before ‘;’ token
algo-1-nexl2_1  |    const char *val = ;
algo-1-nexl2_1  |                      ^
algo-1-nexl2_1  | gcc: error: changehostname.o: No such file or directory
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | Reporting training FAILURE
algo-1-nexl2_1  | framework error: 
algo-1-nexl2_1  | Traceback (most recent call last):
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 47, in train
algo-1-nexl2_1  |     env = sagemaker_containers.training_env()
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 26, in training_env
algo-1-nexl2_1  |     resource_config=_env.read_resource_config(),
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 237, in read_resource_config
algo-1-nexl2_1  |     return _read_json(resource_config_file_dir)
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 193, in _read_json
algo-1-nexl2_1  |     with open(path, 'r') as f:
algo-1-nexl2_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-nexl2_1  | 
algo-1-nexl2_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
tmp6d4_wpt6_algo-1-nexl2_1 exited with code 2

My question is therefore: how do I run local training in a dockerised environment? Obviously the training itself runs in Docker, but I'd also like to dockerise the environment that the pre-training script runs in, in order to pin down dependencies, reduce onboarding time, etc.

ChoiByungWook (Contributor) commented

Hello @bee-keeper,

Do you think you can share the contents of train.py?

Based on the stack trace, the missing files such as resourceconfig.json are ones that SageMaker itself provides to the training container: https://docs.aws.amazon.com/sagemaker/latest/dg/API_ResourceConfig.html
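
For context, when a training job starts (including in local mode), that file is written into the container automatically; at a minimum it carries the current host name and the list of hosts in the job (the hostname below is illustrative), something along the lines of:

/opt/ml/input/config/resourceconfig.json

{
    "current_host": "algo-1",
    "hosts": ["algo-1"]
}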

I understand the need for fast iteration, and being able to debug outside of SageMaker enables that.

Once we get past this hurdle, I do recommend using the SageMaker Python SDK's local mode, which will run your Docker container on your localhost in an emulated environment. Iterations should be quicker, as you won't be waiting for instances to be provisioned.

Here is a PyTorch notebook showcasing how to use local mode: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/pytorch_cnn_cifar10/pytorch_local_mode_cifar10.ipynb
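
To give a sense of the pattern (the entry point and role below are placeholders, not values taken from the notebook), local mode with the PyTorch estimator boils down to roughly this:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='script.py',       # your training code, run inside the framework container
    role='<role_name>',            # placeholder IAM role
    framework_version='1.0.0',
    py_version='py3',
    train_instance_count=1,
    train_instance_type='local',   # 'local' runs the framework container on the local Docker daemon
)
estimator.fit()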

bee-keeper (Author) commented

Hi @ChoiByungWook

I am using local mode, but I want to run the launch script in a Docker container, and inside that container SageMaker starts another container to run the training. As I said, it works fine if I launch training from my host outside of Docker, but if I try to dockerise it I get the error I posted.

Could you please confirm that it's possible to dockerise the notebook/initial Python script which calls fit()?

ChoiByungWook (Contributor) commented

I think I am misunderstanding your statement regarding:

inside that container SageMaker starts another container to run the training

SageMaker with local mode should only run the Docker container you specify, not spin up a container within your container, unless that was your intention?

If that is your intention, then I would say it is possible; however, I'm not entirely sure I understand the use case or benefit.

bee-keeper (Author) commented

Yes, it is my intention to run training within a container. In my use case, the training script has various dependencies which would be better encapsulated in Docker rather than having all team members work with virtual envs on their host machines. I'd have thought the benefits of this would be obvious?

laurenyu (Contributor) commented Jun 8, 2019

Hi @bee-keeper, I was able to get a trivial local training job to run in a Docker container. There are probably better approaches, but here's what I did:

  1. started with the pre-built SageMaker MXNet image (it was one I happened to have downloaded already): 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-mxnet:1.4.0-cpu-py3
  2. bashed into the image: docker run -it --privileged -v /var/lib/docker --entrypoint bash d54dd07e344c. (make sure you have --privileged and -v /var/lib/docker)
  3. installed Docker following these instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-16-04
  4. started the Docker daemon: dockerd&
  5. installed the SageMaker Python SDK: pip install sagemaker
  6. made a very trivial "training" script: echo "print('hello world')" > script.py
  7. set my AWS credentials
  8. ran a training job using the Python SDK's Local Mode:
# python
Python 3.6.8 (default, Dec 24 2018, 19:24:27)           
[GCC 5.4.0 20160609] on linux   
Type "help", "copyright", "credits" or "license" for more information.
>>> from sagemaker.mxnet import MXNet
WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
>>> m = MXNet('script.py', role=<role_name>, framework_version='1.4.0', train_instance_count=1, train_instance_type='local', py_version='py3')
>>> m.fit()

bee-keeper (Author) commented

@laurenyu OK, thanks for the confirmation. I'm going to close this for now.

kenny-chen commented Dec 19, 2023

Try adding -v /tmp:/tmp when running your dev container.
estimator.fit(...) creates a temporary folder such as /tmp/tmp6p45ov_a/ to train the ML model, but if your dev container does not mount /tmp, the SageMaker training container it launches cannot see /tmp/tmp6p45ov_a/, and so resourceconfig.json appears to be missing.
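
Applied to the docker-compose.yml from the original post, that would mean adding one more entry to the training service's volumes (a sketch, untested):

    volumes:
      - ../:/root/code
      - $HOME/.aws:/root/.aws
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp:/tmp

That way, the temporary job directory the SDK writes under /tmp inside the dev container also exists at the same path on the host, which is where the host's Docker daemon resolves the bind mounts for the training container it launches.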
