
[NOT URGENT] Test upgraded NVIDIA driver in gpu-2-6 #331

Open
jchodera opened this issue Oct 18, 2015 · 13 comments

@jchodera
Member

We are having some trouble using the FAH client application on gpu-2-8 (where the new GTX 980 cards were installed), and the advice we have received is to upgrade the 352.39 driver to 355.11 or later. Would it be possible to drain this node of GPU jobs and test the upgrade when feasible?

I believe the 355.11 driver is available here: http://www.nvidia.com/download/driverResults.aspx/90393/en-us

@tatarsky
Contributor

Are you sure you mean gpu-2-8? nvidia-smi there shows four GTX 680 cards:

GeForce GTX 680

The GTX 980 cards are in gpu-2-6.

Please confirm with nvidia-smi as well just to make sure we offline the correct node.
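
For a quick check, something like this prints the card model and driver version per GPU (a sketch; the --query-gpu flags should be available in drivers of this vintage):

# Print index, card model, and driver version for each GPU on the node
nvidia-smi --query-gpu=index,name,driver_version --format=csv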

@tatarsky
Contributor

Here's a snippet from the nodes file as well, showing the card types in that group of nodes:

gpu-2-4 np=32 gpus=4 batch gtx780ti nv352
gpu-2-5 np=32 gpus=4 batch gtxtitanx nv352
gpu-2-6 np=32 gpus=4 batch gtx980 nv352    <-------- I believe you want this node but double confirm with nvidia-smi
gpu-2-7 np=32 gpus=4 batch gtx680 nv352
gpu-2-8 np=32 gpus=4 batch gtx680 nv352
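
Assuming the usual Torque setup behind that nodes file, a node can be taken out of and returned to the pool with pbsnodes (a sketch, not necessarily the exact commands used here):

pbsnodes -o gpu-2-6   # mark the node offline so no new jobs start on it
pbsnodes gpu-2-6      # show its current state and any jobs still running
pbsnodes -c gpu-2-6   # clear the offline flag once testing is done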

@jchodera
Member Author

Yep, gpu-2-8 was a typo. I meant gpu-2-6.

@tatarsky
Contributor

I've placed a reservation on the GPU resources on gpu-2-6. When I see them come free I will update the driver.
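
One way to watch for the GPUs coming free is to poll nvidia-smi for compute processes (a sketch; assumes all GPU jobs show up as CUDA compute apps):

# Loop until no compute process remains on any GPU, then report
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    sleep 60
done
echo "GPUs idle on $(hostname); safe to update the driver"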

@tatarsky
Contributor

No GPU activity was seen. Updated driver.

+------------------------------------------------------+                       
| NVIDIA-SMI 355.11     Driver Version: 355.11         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 26%   36C    P0    47W / 180W |     14MiB /  4095MiB |      0%      Default |

I still have the reservation in place on the GPUs, however. Do you wish to test manually first, in case a rollback is needed?
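
For the record, a quick way to double-check which driver is actually loaded after an install like this (a sketch; the /proc node is standard for the NVIDIA kernel module):

cat /proc/driver/nvidia/version                               # kernel module version
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # per-GPU userspace view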

@jchodera
Member Author

jchodera commented Oct 18, 2015 via email

@tatarsky
Contributor

No problem. Reservation left in place for the GPUs; batch jobs are unaffected.

@tatarsky
Contributor

gpu-2-6 has been drained, per discussions elsewhere. Did that driver update work out? I can re-add it to the batch queue and re-issue the GPU-only reservation if desired.

@jchodera
Member Author

My apologies for not having much time to further debug. There appears to still be something weird going on with the GPU configuration. Will provide more info in the next email.

tatarsky changed the title from "[NOT URGENT] Test upgraded NVIDIA driver in gpu-2-8" to "[NOT URGENT] Test upgraded NVIDIA driver in gpu-2-6" on Oct 22, 2015
@tatarsky
Contributor

OK. I put the node back in the pool for batch work but stuck a 10-day reservation on the GPUs. Hope that's reasonable.

@tatarsky
Contributor

tatarsky commented Nov 2, 2015

I believe I need to renew the GPU reservation on this node. Done for another 10 days.

@jchodera
Member Author

jchodera commented Nov 2, 2015

Thanks. We're still chasing this down, and have replicated the issue on a local dev box. It seems to be 980-specific and related to driver versions.

@pgrinaway and @steven-albanese have been investigating on the local dev box.
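
A minimal sanity check for a card/driver combination while chasing something like this is the CUDA deviceQuery sample (a sketch; the samples path is an assumption and varies by CUDA install):

# Build and run deviceQuery to confirm the GTX 980 is visible to CUDA
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery   # should list the GTX 980 and end with "Result = PASS"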

@tatarsky
Contributor

tatarsky commented Nov 2, 2015

Fun! Noted.
