
Different VAMPNET results from CPU/GPU training #220

Closed
yuxuanzhuang opened this issue Mar 22, 2022 · 11 comments · Fixed by #222

Comments

@yuxuanzhuang
Contributor

Describe the bug
I tried to run the ala2 notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/examples/ala2-example.ipynb) but ended up with quite different results from GPU vs. CPU training: CPU training had a much higher success rate and flatter training curves than GPU training. I am wondering whether this is a known issue or whether I made a mistake somewhere.

Results
I did 10 individual runs with the same parameters as in the tutorial notebook.

  • CPU: [images: cpu_training, cpu_state]

  • GPU: [images: gpu_training, gpu_state]

System
CPU: AMD EPYC 7551
GPU: RTX A5000
System: Ubuntu 20.04.1
Python 3.9
torch 1.11.0+cu113
deeptime '0.4.1+8.g38b0158.dirty' (main branch)

@clonker
Member

clonker commented Mar 22, 2022

Oh wow, that is quite a difference, and not even one of the GPU results looks good. I tried to reproduce this misbehavior with the same torch and deeptime versions but to no avail; the results look okay to me. But then again, I don't have access to an A5000 right now. 🙂 Is it possible for you to try on different hardware? Also, have you tried rebooting the system? (Sounds silly, I know, but sometimes the GPU cache / driver gets a bit confused.)

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 23, 2022

I tested both a GTX 980 Ti and an RTX 2080 with the same CUDA/conda environment and they worked perfectly fine :) Since I don't have permission to restart the system, I tried different nodes equipped with A5000s in our cluster, and they failed consistently. I also tried torch.cuda.empty_cache(), but that didn't help either.

I will check whether this is a compatibility issue with PyTorch by testing other deep learning benchmarks, and I will get back to you once I find out what's wrong.

BTW: the other VAMPNet notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/vampnets.ipynb) runs fine on the GPU.

@clonker
Member

clonker commented Mar 23, 2022

Hmmm. My best guess is a driver version that is not (yet) fully compatible with the A5000, but I am really not sure. Please do keep me posted! Thanks

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 23, 2022

Found the solution! The culprit is the TF32 tensor cores on the new Ampere devices. I have to manually set torch.backends.cuda.matmul.allow_tf32 = False to get enough precision in the eigensolver for the Koopman matrix (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices).

Related issues: https://github.com/pytorch/pytorch/labels/module%3A%20tf32
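
For anyone running into the same problem, here is a minimal sketch of the workaround in plain PyTorch (nothing deeptime-specific); run it before training so the matrix products feeding the eigensolver stay in full fp32:

```python
import torch

# On Ampere GPUs (e.g. the RTX A5000), PyTorch <= 1.11 uses TF32 tensor cores
# for float32 matmuls by default. TF32 keeps only a 10-bit mantissa, which is
# too coarse for the eigensolver behind the VAMP score.
torch.backends.cuda.matmul.allow_tf32 = False

# If full fp32 precision is also needed in cuDNN convolutions, this flag can
# be disabled as well (not required for the fix above):
# torch.backends.cudnn.allow_tf32 = False
```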

@clonker
Member

clonker commented Mar 24, 2022

That is interesting. I should put a warning in the VAMPNet documentation that this might be required. Good detective work!

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 24, 2022

Does it make sense to set it to False by default in the codebase, given that it evidently affects more than just ala2? Or to add it as a context manager/decorator only around vamp_score, in case people gain performance elsewhere from this setting? I can take a stab at it.

@clonker
Member

clonker commented Mar 24, 2022

I was thinking about that, too. I don't think setting it as a global default in the library is a good way to go about it, as it might affect the performance of other parts of a larger program (and silently so). I like the context manager idea! There is an issue with multi-threaded applications, but I don't think that is a big concern here.
Looking forward to a PR!
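
To make the context-manager idea concrete, here is a minimal sketch (an illustration only, not necessarily what the eventual PR does; disable_tf32 is a hypothetical name):

```python
import contextlib

import torch


@contextlib.contextmanager
def disable_tf32():
    """Temporarily force full-precision fp32 matmuls (TF32 off)."""
    old = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        yield
    finally:
        # Restore the caller's setting. Note the flag is process-global, which
        # is the multi-threading caveat mentioned above.
        torch.backends.cuda.matmul.allow_tf32 = old


# Usage around a precision-sensitive computation, e.g. the score evaluation:
# with disable_tf32():
#     loss = -vamp_score(chi_t, chi_tau)  # hypothetical call for illustration
```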

@yuxuanzhuang
Contributor Author

It turns out TF32 needs to be disabled throughout training (and validation).

Results after the fix:
[image]

@clonker
Member

clonker commented Mar 24, 2022

This might be a general problem for applications where precision does matter, for example SO(3)-equivariant nets, etc. I am curious to see how things evolve.

@davidgilbertson

In case you missed it, this TF32 setting is now False by default as of PyTorch 1.12.
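
For reference, a quick way to check the defaults on a given install (small sketch; note that the cuDNN TF32 flag still defaults to True):

```python
import torch

print(torch.__version__)
print(torch.backends.cuda.matmul.allow_tf32)  # False by default since PyTorch 1.12
print(torch.backends.cudnn.allow_tf32)        # still True by default
```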

@clonker
Member

clonker commented Sep 2, 2022

Thanks @davidgilbertson, I had indeed missed this!
