
Different VAMPNET results from CPU/GPU training #220

Closed
yuxuanzhuang opened this issue Mar 22, 2022 · 11 comments · Fixed by #222

Comments

@yuxuanzhuang
Contributor

Describe the bug
I tried to run the ala2 notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/examples/ala2-example.ipynb) but ended up with quite different results from GPU vs. CPU training: CPU training had a much higher success rate and flatter training curves than GPU training. I am wondering whether this is a known issue or whether I made a mistake somewhere.

Results
I did 10 individual runs with the same parameters as in the tutorial notebook.

  • CPU: [images: cpu_training, cpu_state]

  • GPU: [images: gpu_training, gpu_state]

System
CPU: AMD EPYC 7551
GPU: RTX A5000
System: Ubuntu 20.04.1
Python 3.9
torch 1.11.0+cu113
deeptime '0.4.1+8.g38b0158.dirty' (main branch)

@clonker
Member

clonker commented Mar 22, 2022

Oh wow, that is quite a difference, and not even one of the GPU results looks good. I tried to reproduce this misbehavior with the same torch and deeptime versions but to no avail; the results look okay to me. But then again, I don't have access to an A5000 right now. 🙂 Is it possible for you to try on different hardware? Also, have you tried rebooting the system? (Sounds silly, I know, but sometimes the GPU cache / driver gets a bit confused.)

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 23, 2022

I tested both a GTX 980 Ti and an RTX 2080 with the same CUDA/conda environment and they worked perfectly fine :) Since I don't have permission to restart the system, I tried different nodes equipped with A5000s in our cluster, and they failed consistently. I also tried torch.cuda.empty_cache(), but that didn't help either.

I will check whether this is a compatibility issue with PyTorch by testing other deep learning benchmarks, and I will get back to you once I find out what's wrong.

BTW: the other VAMPNet notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/vampnets.ipynb) runs fine on the GPU.

@clonker
Member

clonker commented Mar 23, 2022

Hmmm. My best guess is a driver version that is not (yet) fully compatible with the A5000, but I am really not sure. Please do keep me posted! Thanks

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 23, 2022

Found the solution! The culprit is the TF32 tensor cores on the new Ampere devices. I have to manually set torch.backends.cuda.matmul.allow_tf32 = False to get enough precision in the eigensolver for the Koopman matrix (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices).

Related issues: https://github.com/pytorch/pytorch/labels/module%3A%20tf32
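
For anyone running into the same problem, here is a minimal sketch of the workaround in plain PyTorch (nothing deeptime-specific); run it before training so the matrix products feeding the eigensolver stay in full fp32:

```python
import torch

# On Ampere GPUs (e.g. the RTX A5000), PyTorch <= 1.11 uses TF32 tensor cores
# for float32 matmuls by default. TF32 keeps only a 10-bit mantissa, which is
# too coarse for the eigensolver behind the VAMP score.
torch.backends.cuda.matmul.allow_tf32 = False

# If full fp32 precision is also needed in cuDNN convolutions, this flag can
# be disabled as well (not required for the fix above):
# torch.backends.cudnn.allow_tf32 = False
```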

@clonker
Member

clonker commented Mar 24, 2022

That is interesting. I should put a warning in the VAMPNet documentation that this might be required. Good detective work!

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Mar 24, 2022

Does it make sense to set it to False by default in the codebase, given that it evidently affects more than just ala2? Or to add it as a context manager/decorator only around vamp_score, in case people gain performance elsewhere from this setting? I can take a stab at it.

@clonker
Member

clonker commented Mar 24, 2022

I was thinking about that, too. I don't think setting it as a global default in the library is a good way to go about it, as it might affect the performance of other parts of a larger program (and silently so). I like the context manager idea! There is an issue with multi-threaded applications, but I don't think that is a big concern here.
Looking forward to a PR!
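
To make the context-manager idea concrete, here is a minimal sketch (an illustration only, not necessarily what the eventual PR does; disable_tf32 is a hypothetical name):

```python
import contextlib

import torch


@contextlib.contextmanager
def disable_tf32():
    """Temporarily force full-precision fp32 matmuls (TF32 off)."""
    old = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        yield
    finally:
        # Restore the caller's setting. Note the flag is process-global, which
        # is the multi-threading caveat mentioned above.
        torch.backends.cuda.matmul.allow_tf32 = old


# Usage around a precision-sensitive computation, e.g. the score evaluation:
# with disable_tf32():
#     loss = -vamp_score(chi_t, chi_tau)  # hypothetical call for illustration
```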

@yuxuanzhuang
Contributor Author

It turns out TF32 needs to be disabled throughout training (and validation).

Results after the fix:
[image]

@clonker
Member

clonker commented Mar 24, 2022

This might be a general problem for applications where precision does matter, for example SO(3)-equivariant nets, etc. I am curious to see how things evolve.

@davidgilbertson

In case you missed it, this TF32 setting is now False by default as of PyTorch 1.12.
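
For reference, a quick way to check the defaults on a given install (small sketch; note that the cuDNN TF32 flag still defaults to True):

```python
import torch

print(torch.__version__)
print(torch.backends.cuda.matmul.allow_tf32)  # False by default since PyTorch 1.12
print(torch.backends.cudnn.allow_tf32)        # still True by default
```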

@clonker
Member

clonker commented Sep 2, 2022

Thanks @davidgilbertson, I had indeed missed this!
