foundations branch run_train.py fails with a foundation_model that is float64 #267

Closed
bernstei opened this issue Dec 22, 2023 · 3 comments
Labels: bug (Something isn't working)

@bernstei
Collaborator

Trying to fine-tune a float64 model (e.g. the MP medium model with an appropriate change to mace.tools.load_foundations to accommodate its different max_L, or the old MP large model after it has been converted to float64) fails with the error below.

Is this an issue with some fine-tuning-specific code that implicitly assumes another dtype, is it related to this known issue when training with PyTorch under torch.set_default_dtype(torch.float64), or is it something else?

Traceback (most recent call last):
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/cli/run_train.py", line 584, in <module>
    main()
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/cli/run_train.py", line 510, in main
    tools.train(
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/train.py", line 92, in train
    _, opt_metrics = take_step(
  File "/home/cluster2/bernstei/src/work/MACE/mace_github/mace/tools/train.py", line 253, in take_step
    optimizer.step()
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper                                                                                   
    return wrapped(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper                                                                                     
    out = func(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad                                                                                    
    ret = func(self, *args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 163, in step                                                                                             
    adam(
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 311, in adam                                                                                             
    func(params,
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/adam.py", line 474, in _multi_tensor_adam                                                                               
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype(
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype                                                          
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                          
    return func(*args, **kwargs)
  File "/home/Software/python/system/torch/gpu/lib64/python3.9/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype                                                      
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding                                                          
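
For context, a minimal sketch of the kind of setup described above (not a verified reproducer; the stand-in model and optimizer settings are illustrative, not taken from this report):

```python
import torch

# Train in double precision, as in the report above.
torch.set_default_dtype(torch.float64)

# Stand-in for the float64 foundation model (in practice, the MACE model
# loaded from a checkpoint and converted with .double()).
model = torch.nn.Linear(8, 1).double()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# The failure in the traceback is raised inside optimizer.step(), when Adam
# groups parameter and state tensors by device and dtype.
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
```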
ilyes319 self-assigned this on Jan 9, 2024
ilyes319 added the bug (Something isn't working) label on Jan 9, 2024
@zakmachachi

I'm also getting this error when training from scratch. Is this perhaps a CUDA 12.1 (cu121) issue?

@ilyes319
Contributor

This is a PyTorch 2.1 issue that prevents training with float64. We need to bring back the warning in the README.
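
A possible workaround on PyTorch 2.1, not verified in this thread: since the traceback goes through _multi_tensor_adam, one can tell the optimizer to skip the multi-tensor ("foreach") code path:

```python
import torch

model = torch.nn.Linear(8, 1).double()  # stand-in for the float64 MACE model

# foreach=False falls back to the single-tensor Adam loop instead of the
# multi-tensor implementation that the traceback above passes through.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=False)
```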

@ilyes319
Contributor

Apparently this is fixed in PyTorch 2.2.
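
Until then, a simple guard before enabling double precision might look like this (a sketch; the version parsing is deliberately crude):

```python
import torch

# Warn on PyTorch 2.1.x, which is known to fail when training with float64.
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) == (2, 1):
    print("Warning: float64 training is broken on PyTorch 2.1; upgrade to >= 2.2.")

torch.set_default_dtype(torch.float64)
```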
