Running local training inside docker container (No /opt/ml/input/config/resourceconfig.json) #106
Hi there,

Given the following simplified setup, I can't get local training to work inside Docker. Outside of Docker it works fine, but when I run within the container it blows up with the error below.

I've tried this with the `520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3` Docker image and also with other base images; the error is always the same. I have a simple `train.py` script which gets mounted in `/root/code`; I `docker-compose up` and, inside the container, run `python train.py`. The output is a stack trace complaining that `/opt/ml/input/config/resourceconfig.json` does not exist.

My issue is therefore: how do I run local training in a dockerised environment? Obviously the training itself is in Docker, but I'd also like to dockerise the environment that the pre-training script runs in. This is due to the need to pin down dependencies, reduce onboarding time, etc.
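The docker-compose.yml is not quoted in the thread; as a sketch, a file matching the setup described above might look like the following (the service name and mount details are assumptions, not the poster's actual file):

```yaml
# Hypothetical reconstruction of the setup described above -- not the
# poster's actual docker-compose.yml.
version: "3"
services:
  dev:
    # the SageMaker PyTorch image mentioned in the issue, used here as the
    # outer "driver" environment
    image: 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3
    volumes:
      - ./train.py:/root/code/train.py   # train.py mounted into /root/code
    working_dir: /root/code
    command: sleep infinity              # keep the container alive, then exec in
                                         # and run `python train.py`
```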
Hello @bee-keeper, do you think you can share the contents of train.py? Based on the stack trace provided, files such as resourceconfig.json are provided by SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/API_ResourceConfig.html

I understand the need for fast iteration, and debugging outside of SageMaker enables that. Once we're able to get around this hurdle, I do recommend using the SageMaker Python SDK's local mode, which will run your docker container on your localhost in an emulated environment. Iteration should be quicker, as you won't be waiting for instances to be provisioned. Here is a PyTorch notebook showcasing how to use local mode: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/pytorch_cnn_cifar10/pytorch_local_mode_cifar10.ipynb
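For reference, invoking local mode through the SDK looks roughly like this (a sketch: the role ARN and data path are placeholders, and parameter names vary across SDK versions; older releases used `train_instance_type`/`train_instance_count`):

```python
# Minimal local-mode sketch. The role ARN and data path are placeholders,
# and parameter names differ between SDK versions.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",      # the training script discussed in this issue
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder ARN
    framework_version="1.0.0",
    py_version="py3",
    instance_count=1,
    instance_type="local",       # "local" runs the training container on this machine
)

# file:// channels keep the data local instead of pulling it from S3
estimator.fit({"training": "file:///tmp/data"})
```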
I am using local mode, but I want to run the training driver in a Docker container, and inside that container SageMaker starts another container to run the training. As I said, it works fine if I run training from my host outside of Docker, but if I try to dockerise it I get the error I posted. Could you confirm it's possible to dockerise the notebook/initial Python script which calls the SageMaker SDK?
I think I am misunderstanding your statement. SageMaker with local mode should only run the docker container you specify, and not spin up a container within your container, unless that was your intention? If that is your intention, then I would say it is possible; however, I'm not too sure I understand the use case or benefit.
Yes, it is my intention to run training within a container. In my use case, the training script has various dependencies which are better encapsulated in Docker, rather than having all team members work in virtual envs on their host machines. I'd have thought the benefits of this would be obvious?
Hi @bee-keeper, I was able to get a trivial local training job to run in a Docker container. There are probably better approaches, but here's what I did:
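One common way to make this work, sketched below as an assumption rather than the exact steps from this comment, is docker-out-of-docker: mount the host's Docker socket so the SageMaker SDK inside the outer container talks to the host daemon, and the training container is launched as a sibling rather than a child. Because local mode stages generated files such as `resourceconfig.json` in temp directories and bind-mounts them into the training container at `/opt/ml/input/config`, those directories must exist at the same path for both the outer container and the host:

```yaml
# Sketch of one possible approach (not necessarily the steps from the
# comment above): share the host Docker daemon with the outer container.
version: "3"
services:
  dev:
    image: 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3
    volumes:
      - ./train.py:/root/code/train.py
      - /var/run/docker.sock:/var/run/docker.sock   # drive the host daemon
      - /tmp:/tmp   # local mode writes resourceconfig.json and friends under
                    # /tmp; the host daemon must resolve the same paths when
                    # it bind-mounts them into the training container
    working_dir: /root/code
    command: sleep infinity
```

The outer image also needs the docker and docker-compose clients installed, since local mode shells out to them; and because the bind-mount source paths in the compose file the SDK generates are interpreted by the host daemon, `/tmp` is shared 1:1 here.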
@laurenyu OK, thanks for this confirmation. I'm going to close for now.