foundations branch run_train.py fails with a foundation_model that is float64 #267

Closed
bernstei opened this issue Dec 22, 2023 · 3 comments
Labels: bug (Something isn't working)

@bernstei
Collaborator

Trying to fine-tune a float64 model (e.g. the MP medium model with an appropriate change to mace.tools.load_foundations to accommodate its different max_L, or the old MP large model after it has been converted to float64) fails with the error below.

Is this an issue with some fine-tuning-specific code that implicitly assumes another dtype, is it related to this known issue when training with PyTorch under torch.set_default_dtype(torch.float64), or is it something else?

Traceback (most recent call last):
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/cli/run_train.py", line 584, in <module>
    main()
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/cli/run_train.py", line 510, in main
    tools.train(
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/train.py", line 92, in train
    _, opt_metrics = take_step(
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/train.py", line 253, in take_step
    optimizer.step()
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper                                                                                   
    return wrapped(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper                                                                                     
    out = func(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad                                                                                    
    ret = func(self, *args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 163, in step                                                                                             
    adam(
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 311, in adam                                                                                             
    func(params,
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 474, in _multi_tensor_adam                                                                               
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype(
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype                                                          
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                          
    return func(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype                                                      
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding                                                          
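
For context, a minimal sketch of the kind of setup described above (not a verified reproducer; the stand-in model and optimizer settings are illustrative, not taken from this report):

```python
import torch

# Train in double precision, as in the report above.
torch.set_default_dtype(torch.float64)

# Stand-in for the float64 foundation model (in practice, the MACE model
# loaded from a checkpoint and converted with .double()).
model = torch.nn.Linear(8, 1).double()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# The failure in the traceback is raised inside optimizer.step(), when Adam
# groups parameter and state tensors by device and dtype.
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
```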
ilyes319 self-assigned this on Jan 9, 2024
ilyes319 added the bug (Something isn't working) label on Jan 9, 2024
@zakmachachi

I'm also getting this error when training from scratch. Is this perhaps a CUDA 12.1 (cu121) issue?

@ilyes319
Contributor

This is a PyTorch 2.1 issue that prevents training with float64. We need to bring back the warning in the README.
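
A possible workaround on PyTorch 2.1, not verified in this thread: since the traceback goes through _multi_tensor_adam, one can tell the optimizer to skip the multi-tensor ("foreach") code path:

```python
import torch

model = torch.nn.Linear(8, 1).double()  # stand-in for the float64 MACE model

# foreach=False falls back to the single-tensor Adam loop instead of the
# multi-tensor implementation that the traceback above passes through.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=False)
```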

@ilyes319
Contributor

Apparently this is fixed in PyTorch 2.2.
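
Until then, a simple guard before enabling double precision might look like this (a sketch; the version parsing is deliberately crude):

```python
import torch

# Warn on PyTorch 2.1.x, which is known to fail when training with float64.
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) == (2, 1):
    print("Warning: float64 training is broken on PyTorch 2.1; upgrade to >= 2.2.")

torch.set_default_dtype(torch.float64)
```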
