getting the wrong GPUs #422

Open
corcra opened this issue Jun 9, 2016 · 17 comments

@corcra

corcra commented Jun 9, 2016

I'm confused by, or failing at, writing qsub commands that request the correct resources.
For example, I ran:
qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared
and got this setup: (gpu-1-5 fwiw)

+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   32C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   31C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 30%   31C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 30%   30C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
|    3                  Not Supported                                         |
+-----------------------------------------------------------------------------+

The GPUs aren't shared, and aren't gtxtitans... what's going on here? I need both non-exclusive (shared) mode and gtxtitan (or at least something better than a GTX 680) to run TensorFlow, so this is problematic.
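For reference, a quick way to check what the scheduler actually handed you from inside the interactive session (assuming the installed nvidia-smi supports these query flags, which I haven't verified on these nodes):

nvidia-smi --query-gpu=name,compute_mode --format=csv

compute_mode should read Default for shared cards and Exclusive_Thread / Exclusive_Process for exclusive ones; the "E. Thread" column in the output above is the exclusive case.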

@nhgirija

nhgirija commented Jun 9, 2016

Try the active queue.

qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=4:shared:gtxtitans

@jchodera
Member

jchodera commented Jun 9, 2016

You shouldn't need to use the active queue---the constraints should still work. The active queue just has different priorities. Hm...

@corcra
Author

corcra commented Jun 9, 2016

Using the active queue didn't fix it.

Although, I just managed to get 'good' (aka conforming to my request) GPUs on gg06 and gg01. I included nodes=1:ppn=1 in the qsub call, although I don't see why that should be relevant...

@jchodera
Member

jchodera commented Jun 9, 2016

This worked correctly for me when I included nodes=1:ppn=4:

qsub -I -l walltime=04:00:00,nodes=1:ppn=4:gpus=4:shared:gtxtitan -l mem=4G -q gpu

I wonder why omitting the nodes=1:ppn=X gives incorrect resource requests...
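A possibly useful cross-check from inside the job (assuming this Torque build exports $PBS_GPUFILE, which I haven't confirmed on our setup): the GPU file lists exactly which devices the scheduler thinks it assigned, so it can be compared against what nvidia-smi shows.

echo $PBS_GPUFILE   # path to the file Torque writes with the job's GPU assignment
cat $PBS_GPUFILE    # entries of the form <hostname>-gpu<N>, one per allocated GPU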

@jchodera
Member

jchodera commented Jun 9, 2016

Oh, there are some problems with GPU spillover though. I was allocated gpu-2-14 and found the GPUs are tied up with something already:


[chodera@gpu-2-14 ~]$ nvidia-smi
Thu Jun  9 13:41:35 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |   5872MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   33C    P8    14W / 250W |   5771MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |     85MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |     85MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     23613    C   /usr/bin/python                               5854MiB |
|    1      5220    C   /usr/bin/python                               5683MiB |
|    1     23613    C   /usr/bin/python                                 68MiB |
|    2     23613    C   /usr/bin/python                                 68MiB |
|    3     23613    C   /usr/bin/python                                 68MiB |
+-----------------------------------------------------------------------------+

This seems to be a docker job that is using GPUs it didn't request, or that is still running after supposedly being killed by Torque:

1164      5107  0.0  0.0 153340  9356 pts/0    Sl+  Jun08   0:01 docker run -it -v /usr/lib64/libcuda.so:/usr/lib64/libcuda.so -v /usr/lib64/libcuda.so.1:/usr/lib64/libcuda.so.1 -v /usr/lib64/libcuda.so.352.39:/usr/lib64/libcuda.so.352.39 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidia4:/dev/nvidia4 --device /dev/nvidia5:/dev/nvidia5 --device /dev/nvidia6:/dev/nvidia6 --device /dev/nvidia7:/dev/nvidia7 --device /dev/nvidiactl:/dev/nvidiactl --env LD_LIBRARY_PATH=/opt/mpich2/gcc/eth/lib:/opt/gnu/gcc/4.8.1/lib64:/opt/gnu/gcc/4.8.1/lib:/opt/gnu/gmp/lib:/opt/gnu/mpc/lib:/opt/gnu/mpfr/lib:/usr/lib64/ --env CUDA_VISIBLE_DEVICES=1 -v /cbio/grlab/home/dresdnerg/software:/mnt/software -v /cbio/grlab/home/dresdnerg/projects/tissue-microarray-resnet/:/mnt/tma-resnet -it gmd:cudnn4 ipython

@jchodera
Member

jchodera commented Jun 9, 2016

It's impossible to tell who is/was running that docker job, but they are processing data in @gideonite's directory.

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

I will look in a moment. I have been on the road all morning.

@gideonite

I was running a docker container in an active session on gpu-2-14 but the process should have stopped using GPU resources sometime yesterday evening. Perhaps nvidia-smi is reporting memory which is "allocated but collectible," though I'm not sure that makes sense or is a valid state to be in.

I requested the node by running qsub -I -l nodes=1:gpus=1:gtxtitan:docker:shared -q active. Should I have done something different? @jchodera

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

I have a dim memory of seeing this before, where without the "nodes" stanza qsub does not do what is expected. I would need to locate the Git issue or Torque ticket that matches that part of my memory. As for the docker item, I'd have to investigate that as well if you feel the state of the card is wrong.

@corcra
Author

corcra commented Jun 9, 2016

The docker flag seems to be working fine!

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

For the gpu-2-14 docker and nvidia GPU resources item, lsof shows that this process (23613) still appears to have the nvidia devices open:

ipython   23613      root  mem       REG               0,32              35460148 /dev/nvidia2 (path dev=0,5, inode=21968)
ipython   23613      root  mem       REG               0,32              35460147 /dev/nvidia1 (path dev=0,5, inode=21673)
ipython   23613      root  mem       REG               0,32              35460146 /dev/nvidia0 (path dev=0,5, inode=21282)
ipython   23613      root  mem       REG               0,32              35460149 /dev/nvidia3 (path dev=0,5, inode=21979)

Those are docker processes (note the root part), and one way to narrow it down a bit, besides this being the only docker job on the system, is that the cwd is showing:

/proc/23613/cwd -> /tf_data

That's within the chroot.

And if we expand that docker instance a bit, we see that the ipython process is associated with it:

# docker ps
CONTAINER ID        IMAGE                                             COMMAND             CREATED             STATUS              PORTS                NAMES
b9d872ee4a3a        b.gcr.io/tensorflow/tensorflow:latest-devel-gpu   "/bin/bash"         2 days ago          Up 2 days           6006/tcp, 8888/tcp   thirsty_tesla       

 # docker top b9d872ee4a3a
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                18030               5210                0                   Jun06               pts/1               00:00:00            /bin/bash
root                23613               18030               28                  Jun07               pts/1               14:20:17            /usr/bin/python /usr/local/bin/ipython
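For the record, a more direct way to tie a host PID back to its container, rather than eyeballing docker top, is something along these lines (State.Pid is the PID of the container's main process as seen on the host, so here it should come back as 18030, the /bin/bash above, if I'm reading the output right):

docker ps -q | xargs docker inspect --format '{{.Name}} {{.State.Pid}}'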

So I guess the question here is: "why isn't this correct when, I believe, you've both requested "shared" mode for the nvidia cards?" Or do I misunderstand that aside?

@corcra please note this is NOT related to your item.

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

I have reproduced this as you folks have: leaving off "nodes=X" appears to result in this behavior. Now I'm trying to remember where I remember this from.

@corcra
Author

corcra commented Jun 9, 2016

Things pointing at /tf_data are mine; I currently have two docker jobs running... is this complicating matters?

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

No, I don't think it's complicating things. I think the syntax you used at the start of this basically doesn't work properly, and we've talked about it before. I'm just trying to locate that conversation.

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

Ah, we may have noted something similar in #275 and Adaptive assigned me a bug number after confirming a resource parsing error. Let me see if I can spot anything on that. I don't recall ever seeing that bug being fixed.

@tatarsky
Contributor

tatarsky commented Jun 9, 2016

They believe it's basically the same bug and that nodes=X is required at this time in that release. He is, however, checking where the bug number went for the developers to fix the one I reported many moons ago, as it seems to have fallen out of existence.

My preference with "enforced syntax" is that the parser should tell you it's wrong and not just "do something random" ;) I know I'm weird that way.

@tatarsky
Contributor

This is confirmed back in their bug system but not addressed. Please use nodes=X in qsub resource requests.
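To make the workaround concrete, the original request from the top of this thread would become something like the following (a sketch only; adjust ppn and add walltime/mem as needed):

qsub -I -q gpu -l nodes=1:ppn=1:gpus=4:gtxtitan:docker:shared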
