Unable to use socket in containerized PyTorch app #89

Closed
duckontheweb opened this issue Mar 3, 2020 · 7 comments

@duckontheweb

I have been following the docs "Docker environment setup for Neuron" and "Run containerized neuron application" to set up a containerized app that uses the Inferentia chip.

I am able to get the neuron-rtd container running with a socket in /tmp/neuron_rtd_sock as described, but I had to change the permissions on that directory before the container could use the socket: chmod o+x /tmp/neuron_rtd_sock.

I tried using the trace_resnet50.py script described here to test whether a container could get access to the chip. I used the following Dockerfile and run command:

Dockerfile (built as pytorch-inf1 image)

FROM python:3.6

RUN pip install -U pip && \
    pip install torch-neuron "neuron-cc[tensorflow]" --extra-index-url https://pip.repos.neuron.amazonaws.com && \
    pip install  pillow==6.2.2 && \
    pip install torchvision==0.4.2 --no-deps

COPY trace_resnet50.py /src/trace_resnet50.py

CMD ["python", "/src/trace_resnet50.py"]

Docker run command

$ docker run -it --rm --env NEURON_RTD_ADDRESS=/sock/neuron.sock -v /tmp/neuron_rtd_sock/:/sock  pytorch-inf1
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/checkpoints/resnet50-19c8e357.pth
100.0%
/usr/local/lib/python3.6/site-packages/torch/jit/__init__.py:847: UserWarning: `optimize` is deprecated and has no effect. Use `with torch.jit.optimized_execution() instead
  warnings.warn("`optimize` is deprecated and has no effect. Use `with torch.jit.optimized_execution() instead")
INFO:Neuron:compiling module ResNet with neuron-cc
As
[E neuron_runtime.cpp:82] grpc server /sock/neuron.sock is unavailable. Is neuron-rtd running?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape
[E neuron_runtime.cpp:82] grpc server /sock/neuron.sock is unavailable. Is neuron-rtd running?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape
[E neuron_runtime.cpp:82] grpc server /sock/neuron.sock is unavailable. Is neuron-rtd running?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape
[E neuron_runtime.cpp:82] grpc server /sock/neuron.sock is unavailable. Is neuron-rtd running?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape
[E neuron_runtime.cpp:82] grpc server /sock/neuron.sock is unavailable. Is neuron-rtd running?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape

Here are the permissions on the socket directory:

$ ls -l /tmp
total 4
...
drwxr-xrwx 2 root     root      25 Mar  3 16:31 neuron_rtd_sock
...
$ ls -l /tmp/neuron_rtd_sock
total 0
srw-rw-rw- 1 root root 0 Mar  3 16:31 neuron.sock

Is there a step I'm missing to allow the app container to access that socket? I tried running the app with the same elevated privileges as the neuron-rtd container, but got the same results.

Thanks!
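
One way to narrow down an "is unavailable" error like the one above is to connect to the socket file directly from inside the app container. That separates filesystem problems (missing mount, permissions, nothing listening) from problems with how the client is given the address. The following is a minimal sketch using only the Python standard library; the function name probe_unix_socket and the returned labels are illustrative and not part of the Neuron SDK.

```python
import errno
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix domain socket and classify the outcome."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except FileNotFoundError:
        return "missing"      # path does not exist from this process's view
    except PermissionError:
        return "permission"   # directory not searchable or socket not writable
    except ConnectionRefusedError:
        return "refused"      # socket file exists but nothing is listening
    except OSError as exc:
        # Any other failure: report the symbolic errno name when available.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
    return "ok"
```

Calling probe_unix_socket("/sock/neuron.sock") inside the container gives a first hint: a result of "ok" suggests the mount and permissions are fine, so the address format passed to the client is worth checking next.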

@awsrjh
Contributor

awsrjh commented Mar 3, 2020

@duckontheweb, we will have a look and get back to you.

@awsrjh
Contributor

awsrjh commented Mar 3, 2020

We have updated the docs to add the unix: prefix to the socket address in the environment variable:

NEURON_RTD_ADDRESS=unix:/sock/neuron.sock

Please see
https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-container-tools/docker-example/README.md

Thanks for letting us know.
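
gRPC clients generally expect Unix domain socket targets to carry a unix: scheme, which is consistent with the fix above. As a hedged illustration, an app could normalize the variable before use. The helper names with_unix_scheme and rtd_address_from_env are hypothetical, and whether the Neuron runtime accepts other address forms is not covered in this thread.

```python
import os


def with_unix_scheme(address):
    """Prefix a bare absolute path with unix:; leave other forms untouched."""
    if address.startswith("unix:"):
        return address            # already in the expected form
    if address.startswith("/"):
        return "unix:" + address  # bare filesystem path, as in the original report
    return address                # anything else is passed through unchanged


def rtd_address_from_env(env=None):
    """Read NEURON_RTD_ADDRESS from an environment mapping and normalize it."""
    env = os.environ if env is None else env
    return with_unix_scheme(env.get("NEURON_RTD_ADDRESS", ""))
```

For the value from the original run command, with_unix_scheme("/sock/neuron.sock") yields "unix:/sock/neuron.sock". Setting the variable correctly in the docker run command, as the updated docs do, remains the simpler option.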

@awsrjh closed this as completed Mar 3, 2020
@duckontheweb
Author

Thanks, that worked!

I did still have to run chmod o+x /tmp/neuron_rtd_sock in order to run the neuron-rtd container. Is that expected (I didn't see anything about it in the docs) or should I be running my container differently?

@micwade-aws
Contributor

You are correct. We missed the chmod operation in the example. Pull request #91 will correct our tutorial to reflect this. Thank you!
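
For context on why the chmod matters: on Linux, resolving a path requires search (execute) permission on every directory along it. A process that is neither the owner nor in the group of /tmp/neuron_rtd_sock therefore needs o+x on that directory to reach neuron.sock, plus write permission on the socket itself to connect. Below is a small standard-library sketch for inspecting those bits on the host; the function and key names are illustrative.

```python
import os
import stat


def describe_socket_permissions(sock_path):
    """Report the permission bits that matter to a non-owner, non-group process.

    Must be run by a user who can already stat the path (for example root
    on the host); otherwise os.stat raises PermissionError.
    """
    dir_mode = os.stat(os.path.dirname(sock_path)).st_mode
    sock_mode = os.stat(sock_path).st_mode
    return {
        "dir_other_exec": bool(dir_mode & stat.S_IXOTH),     # o+x on the directory
        "is_socket": stat.S_ISSOCK(sock_mode),               # path really is a socket
        "sock_other_write": bool(sock_mode & stat.S_IWOTH),  # o+w on the socket
    }
```

With the listing shown earlier in the thread (drwxr-xrwx on the directory, srw-rw-rw- on neuron.sock), all three values would be True.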

@hsonetta

hsonetta commented Jul 9, 2021

@duckontheweb when I try to run the neuron-rtd container, I get the error below:

ubuntu@ip-192-168-19-125:~$ sudo docker run --device=/dev/neuron0 --cap-add IPC_LOCK -v /tmp/neuron_rtd_sock/:/sock -it neuron-rtd
nrtd[1]: [NRTD:nrtd_main] nrtd build using:1.1.1402.0
nrtd[1]: [NRTD:nrtd_main] nrtd build using:1.1.1402.0

sh: lspci: command not found
nrtd[1]: [TDRV:tdrv_init_mla_phase1] Could not open the device index:0

nrtd[1]: [TDRV:tdrv_init_mla_phase1] Could not open the device index:0

nrtd[1]: [TDRV:tdrv_destroy_one_mla] close device failed
nrtd[1]: [TDRV:tdrv_destroy_one_mla] close device failed

nrtd[1]: [TDRV:tdrv_destroy] TDRV not initialized
nrtd[1]: [TDRV:tdrv_destroy] TDRV not initialized

nrtd[1]: [NRTD:InitTongas] Failed to initialize devices, error:1
nrtd[1]: [NRTD:InitTongas] Failed to initialize devices, error:1

nrtd[1]: [NRTD:nrtd_main] Failed to initialize devices: , attempt: 1
nrtd[1]: [NRTD:nrtd_main] Failed to initialize devices: , attempt: 1

Did you face this error? If so, can you please point me in the right direction?

@duckontheweb
Author

To be honest, it's been so long that I don't recall. We were experimenting with the SDK at my previous employer but ended up not using it, so I'm not sure I'll be of much help. Sorry!

@john-heyer

If you're seeing this error, make sure you stop the neuron runtime running on your instance outside of the container!
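
A quick check for a host-level runtime that may be holding the device can be sketched by scanning /proc. This assumes a Linux host and that the daemon's command name is neuron-rtd, which is an assumption based on the container name used in this thread and is not confirmed here. find_processes is an illustrative helper, not part of any Neuron tooling.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return sorted PIDs whose command name equals `name` (Linux /proc layout)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited meanwhile or is not readable
    return sorted(pids)
```

Running find_processes("neuron-rtd") on the host (not inside the container) returns an empty list when no such process is present. A non-empty result suggests a host runtime should be stopped before starting the containerized one.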
