Different VAMPNET results from CPU/GPU training #220
Comments
Oh wow, that is quite different, and not even one of the GPU results looks good. I tried to reproduce this misbehavior with the same torch and deeptime versions, but to no avail; the results look okay to me. But then again, I don't have access to an A5000.
I tested both a GTX 980 Ti and an RTX 2080 with the same CUDA/conda environment and they worked perfectly fine :) As I don't have the privilege to restart the system, I tried different nodes equipped with A5000s in our cluster, and they failed consistently. I will see if it is a compatibility issue with PyTorch by testing other deep learning benchmarks, and I will get back to you if I find out what's wrong. BTW: the other VAMPnet notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/vampnets.ipynb) has no issue running on GPU.
Hmmm. My best guess is a driver version that is not fully compatible (yet) with A5000s. But really, I am not sure. Please do keep me posted! Thanks
Found the solution! The culprit is the TF32 tensor cores on new Ampere devices. I have to manually set `torch.backends.cuda.matmul.allow_tf32 = False` to get the full-precision results back. Related issue: https://github.com/pytorch/pytorch/labels/module%3A%20tf32
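For context on why TF32 changes the results: TF32 keeps float32's 8-bit exponent but truncates the mantissa from 23 to 10 bits before each tensor-core multiply. A minimal, torch-free sketch of that rounding (the helper name `round_to_tf32` is mine, not a PyTorch API) shows how coarse the inputs become:

```python
import struct

def round_to_tf32(x: float) -> float:
    """Truncate a float32 value's mantissa from 23 to 10 bits,
    mimicking the input rounding of NVIDIA's TF32 format."""
    # reinterpret the float32 bit pattern as an unsigned int
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    # zero the low 13 mantissa bits, leaving 10 of the original 23
    bits &= ~((1 << 13) - 1)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(round_to_tf32(1.0001))  # → 1.0 (the perturbation is lost entirely)
print(round_to_tf32(1.5))     # → 1.5 (exactly representable, unchanged)
```

With roughly 3 decimal digits of input precision per multiply, it is plausible that a loss function as sensitive as the VAMP score drifts compared to full float32 training on CPU.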
That is interesting. I should put up a warning in the vampnets documentation that this might be required. Good detective work! |
Does it make sense to set it to `False` by default in deeptime, or perhaps to provide a context manager that disables it temporarily?
I was thinking about that, too. I don't think setting it as a global default in the library is a good way to go about it, as it might affect the performance of other parts in a larger program (and silently so). I like the context manager idea! There is an issue with multi-threaded applications, but I don't think that is a big concern here.
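The context-manager idea could be sketched roughly as follows. This uses a stand-in flag object in place of `torch.backends.cuda.matmul` so the sketch is self-contained; it is not deeptime's actual API, just an illustration of the save/restore pattern (which is also where the multi-threading caveat comes from, since the flag is global state):

```python
from contextlib import contextmanager

class _MatmulFlags:
    # stand-in for torch.backends.cuda.matmul (hypothetical)
    allow_tf32 = True

matmul = _MatmulFlags()

@contextmanager
def tf32_disabled(backend=matmul):
    """Temporarily disable TF32 matmuls, restoring the previous
    setting on exit even if the body raises an exception."""
    previous = backend.allow_tf32
    backend.allow_tf32 = False
    try:
        yield
    finally:
        backend.allow_tf32 = previous

with tf32_disabled():
    assert matmul.allow_tf32 is False  # full float32 precision inside
assert matmul.allow_tf32 is True       # global state restored afterwards
```

Because the flag is process-global, two threads entering and leaving such a context at different times could stomp on each other's setting, which matches the multi-threading concern above.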
This might be a general problem for applications where precision does matter, for example with SO(3)-equivariant nets, etc. I am curious to see how things evolve.
In case you missed it, this TF32 setting now defaults to False as of PyTorch 1.12.
Thanks @davidgilbertson, I missed this indeed!
Describe the bug
I tried to run the ala2 notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/examples/ala2-example.ipynb) but ended up with quite different results for GPU vs. CPU training. CPU had a much higher success rate and flat training curves compared to GPU. I am wondering if this is common or if I made a mistake somewhere.
Results
I tested with 10 individual runs with the same parameters as in the tutorial notebook.
CPU
GPU
System
CPU: AMD EPYC 7551
GPU: RTX A5000
System: Ubuntu 20.04.1
Python 3.9
torch 1.11.0+cu113
deeptime '0.4.1+8.g38b0158.dirty' (main branch)