LAMMPS implementation terminates unexpectedly #87

Closed
zakmachachi opened this issue Mar 7, 2023 · 7 comments

@zakmachachi

Describe the bug
I have installed the LAMMPS implementation of MACE on Archer2 and compiled the potential as per the instructions, but I receive the following error when I run a small LAMMPS script:

Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.4


The following have been reloaded with a version change:
  1) cray-dsmml/0.1.4 => cray-dsmml/0.2.2
  2) cray-libsci/21.04.1.1 => cray-libsci/21.08.1.2
  3) cray-mpich/8.1.4 => cray-mpich/8.1.15
  4) craype/2.7.6 => craype/2.7.15
  5) gcc/10.2.0 => gcc/11.2.0


The following have been reloaded with a version change:
  1) cray-dsmml/0.2.2 => cray-dsmml/0.2.1        4) craype/2.7.15 => craype/2.7.10
  2) cray-fftw/3.3.8.9 => cray-fftw/3.3.8.11     5) gcc/11.2.0 => gcc/9.3.0
  3) cray-mpich/8.1.15 => cray-mpich/8.1.9
  
Currently Loaded Modules:
  1) craype-x86-rome         3) craype-network-ofi   5) epcc-setup-env     7) PrgEnv-cray/8.1.0   9) cray-dsmml/0.2.1       11) cray-mpich/8.1.9  13) cpe-cuda/21.09
  2) libfabric/1.11.0.4.71   4) bolt/0.8             6) load-epcc-module   8) cce/12.0.3         10) cray-libsci/21.08.1.2  12) craype/2.7.10
 
LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::NotImplementedError'
  what():  Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:30798 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:26815 [kernel]
QuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:929 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:726 [kernel]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:140 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:488 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:291 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]
ZeroTensor: fallthrough registered at ../aten/src/ATen/ZeroTensorFallback.cpp:90 [kernel]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradHIP: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradVE: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradMeta: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:16903 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:16890 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:482 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:324 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:743 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:189 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:484 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]

Exception raised from reportError at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:511 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2b7cd821156e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x11bc130 (0x2b7cde4de130 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x2476823 (0x2b7cdf798823 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x90 (0x2b7cdf9c2510 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x29b2dbe (0x2b7cdfcd4dbe in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: at::_ops::empty_strided::call(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x1bb (0x2b7cdfa0582b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1c19cd7 (0x2b7cdef3bcd7 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x177b (0x2b7cdf2a32cb in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2b6d60b (0x2b7cdfe8f60b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x2b7cdf6dce35 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x29b2b33 (0x2b7cdfcd4b33 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x2b7cdf6dce35 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x3d73aab (0x2b7ce1095aab in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3d73f1e (0x2b7ce1095f1e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x2b7cdf75d399 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0xc7 (0x2b7cdf29b6a7 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2d2ba69 (0x2b7ce004da69 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_device::call(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x1ba (0x2b7cdf8c48aa in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #18: torch::jit::Unpickler::readInstruction() + 0x1af0 (0x2b7ce20995d0 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #19: torch::jit::Unpickler::run() + 0x90 (0x2b7ce209a3b0 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Unpickler::parse_ivalue() + 0x18 (0x2b7ce209a508 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #21: torch::jit::readArchiveAndTensors(std::string const&, std::string const&, std::string const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::string const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) + 0x45a (0x2b7ce205799a in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x4d20507 (0x2b7ce2042507 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x4d23362 (0x2b7ce2045362 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #24: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x3a2 (0x2b7ce2046c82 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #25: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2b7ce204739b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #26: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2b7ce2047475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #27: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #28: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #29: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #30: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #31: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #32: __libc_start_main + 0xea (0x2b7cfc4e534a in /lib64/libc.so.6)
frame #33: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]

srun: error: nid002072: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3216084.0

To Reproduce
Steps to reproduce the behavior:

  1. I followed the install instructions found here
  2. My CMake settings are as follows:
(mace-venv) zem@ln03:/work/e89/e89/zem/lammps/build> cmake -DCMAKE_INSTALL_PREFIX=$(pwd) \
>       -DBUILD_MPI=ON \
>       -DBUILD_OMP=ON \
>       -DPKG_OPENMP=ON \
>       -DPKG_ML-MACE=ON \
>       -DCMAKE_PREFIX_PATH=$(pwd)/../../libtorch \
>       ../cmake
-- The CXX compiler identification is Clang 11.0.0
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.6/bin/CC
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.6/bin/CC -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.26.2") 
-- Running check for auto-generated files from make-based build system
-- Running in virtual environment: /mnt/lustre/a2fs-work3/work/e89/e89/zem/mace-venv
   Setting Python interpreter to: /mnt/lustre/a2fs-work3/work/e89/e89/zem/mace-venv/bin/python
-- Found MPI_CXX: /opt/cray/pe/craype/2.7.6/bin/CC (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Looking for C++ include omp.h
-- Looking for C++ include omp.h - found
-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE  found components:  CXX 
-- Found JPEG: /usr/lib64/libjpeg.so  
-- Found GZIP: /usr/bin/gzip  
-- Could NOT find FFMPEG (missing: FFMPEG_EXECUTABLE) 
Hello from ML-MACE.cmake.
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Found Torch: /work/e89/e89/zem/libtorch/lib/libtorch.so  
-- Looking for C++ include cmath
-- Looking for C++ include cmath - found
-- Generating style headers...
-- Generating package headers...
-- Generating lmpinstalledpkgs.h...
-- Could NOT find ClangFormat: Found unsuitable version "0.0", but required is at least "8.0" (found /opt/cray/pe/cce/11.0.4/cce-clang/x86_64/bin/clang-format)
-- The following tools and libraries have been found and configured:
 * Git
 * MPI
 * OpenMP
 * JPEG
 * Threads
 * Caffe2
 * Torch

-- <<< Build configuration >>>
   LAMMPS Version:   20221222
   Operating System: Linux SLES 15.1
   CMake Version:    3.10.2
   Build type:       RelWithDebInfo
   Install path:     /work/e89/e89/zem/lammps/build
   Generator:        Unix Makefiles using /usr/bin/gmake
-- Enabled packages: ML-MACE;OPENMP
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /opt/cray/pe/craype/2.7.6/bin/CC
      Type:          Clang
      Version:       11.0.0
      C++ Flags:     -O2 -g -DNDEBUG
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_GZIP;LMP_OPENMP
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:    
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     
-- MPI libraries:    ;
-- Configuring done
-- Generating done
-- Build files have been written to: /work/e89/e89/zem/lammps/build

Cheers,
Zak

@ilyes319
Contributor

ilyes319 commented Mar 7, 2023

Hey,

I am tagging @wcwitt, who will be able to help in more detail.

@ilyes319
Contributor

ilyes319 commented Mar 8, 2023

Hey,

Looking at the trace, it seems that you are trying to run a model compiled on GPU on the CPUs of Archer2. Could you check that the model you are loading was saved on the CPU? To do so, you could do:

  import torch

  model = torch.load(path, map_location="cpu")  # map_location matters on a CPU-only machine
  model_cpu = model.to("cpu")
  torch.save(model_cpu, path_cpu)

and then load the model_cpu through your setup.
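
To double-check, you could print the device of the parameters after loading (a minimal sketch; parameters() is the standard torch.nn.Module API):

  import torch

  model_cpu = torch.load(path_cpu, map_location="cpu")
  print({p.device for p in model_cpu.parameters()})  # should print {device(type='cpu')}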

@wcwitt
Collaborator

wcwitt commented Mar 8, 2023

That was my first guess too, based on the CUDA in the trace. Try @ilyes319's suggestion and we can go from there.

@zakmachachi, just a warning: if you have access to a decent GPU, I predict you will prefer using that for MD. LAMMPS on CPU can't really compete (yet) on performance for most use cases. Feel free to email me at wcw28@cam.ac.uk if you want to discuss any details you'd rather not post here.

@zakmachachi
Author

zakmachachi commented Mar 9, 2023

Hey, thanks for the swift reply both!

So I used this script to switch the model from GPU to CPU:

import torch

# Load the model on the CPU (map_location avoids needing a GPU at load time)
model = torch.load(
    path,
    map_location=torch.device("cpu"),
)
model_cpu = model.to("cpu")
torch.save(model_cpu, path_cpu)

And it worked! But now I get the following error:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  PytorchStreamReader failed locating file constants.pkl: file not found
Exception raised from valid at ../caffe2/serialize/inline_container.cc:177 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2af16ceb856e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2af16ce82f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x8e (0x2af175c3cc4e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::getRecordID(std::string const&) + 0x46 (0x2af175c3cdd6 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::getRecord(std::string const&) + 0x45 (0x2af175c3ce85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::readArchiveAndTensors(std::string const&, std::string const&, std::string const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::string const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) + 0xa5 (0x2af176cfe5e5 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4d20507 (0x2af176ce9507 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4d2324b (0x2af176cec24b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x3a2 (0x2af176cedc82 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #9: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2af176cee39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2af176cee475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #13: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #15: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #16: __libc_start_main + 0xea (0x2af19118c34a in /lib64/libc.so.6)
frame #17: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]
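
In hindsight, constants.pkl only exists inside a TorchScript archive: torch.save writes an eager-mode pickle, while torch::jit::load (which LAMMPS calls) expects a module saved through jit. A minimal sketch of the distinction, assuming model is the loaded MACE model and the filenames are placeholders:

  import torch
  from e3nn.util import jit
  from mace.calculators import LAMMPS_MACE

  torch.save(model, "model_cpu.pt")                        # eager pickle: torch::jit::load cannot find constants.pkl in this
  jit.compile(LAMMPS_MACE(model)).save("model-lammps.pt")  # TorchScript archive: this is what LAMMPS can load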

Some more info:

  • I trained the model on a GPU on a different cluster
  • I compiled the model on a GPU on that same cluster (to get the .pt file), converted it to CPU there using the script above, copied it over to Archer2, and received the error above
  • I then tried the same process, but converting to CPU on Archer2 instead, and still received the same error
  • So I went back to the cluster, took the .model file, and serialized it on Archer2 using the following script:
from e3nn.util import jit
import sys
import torch
from mace.calculators import LAMMPS_MACE

model_path = sys.argv[1]  # takes model name as command-line input
model = torch.load(model_path, map_location=torch.device("cpu"))
lammps_model = LAMMPS_MACE(model)
lammps_model_compiled = jit.compile(lammps_model)
lammps_model_compiled.save(model_path + "-lammps.pt")
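
For reference, a usage sketch of how I invoke it (the script filename is a placeholder):

  python compile_for_lammps.py C_MACE_GAP-17.model  # writes C_MACE_GAP-17.model-lammps.pt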

And got the following error from LAMMPS:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed because of errno 2 on fopen: , file path: /work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt
Exception raised from RAIIFile at ../caffe2/serialize/file_adapter.cc:21 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2ac9493e756e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2ac9493b1f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::string const&) + 0x124 (0x2ac952170634 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::FileAdapter::FileAdapter(std::string const&) + 0x2e (0x2ac95217068e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x5a (0x2ac95216eada in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x2a5 (0x2ac95321cb85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2ac95321d39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2ac95321d475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #9: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #10: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #13: __libc_start_main + 0xea (0x2ac96d6bb34a in /lib64/libc.so.6)
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]

srun: error: nid001643: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3227130.0
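
For reference, errno 2 is ENOENT: the .pt path named in the LAMMPS input does not exist as written. Note that the script above writes to model_path + "-lammps.pt", which may not match the C_MACE_GAP-17_CPU.pt filename LAMMPS is trying to open. A quick sanity check (a minimal sketch; the path is copied from the error message):

  import os
  import torch

  path = "/work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt"
  print(os.path.exists(path))               # False would explain errno 2
  torch.jit.load(path, map_location="cpu")  # succeeds only for a valid TorchScript archive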

I guess the obvious thing here is to train and compile on Archer2, but I chose the other cluster as it has some fancy RTX cards that were not running out of memory during training. Archer2 sadly does not have any GPUs, so training there is an issue.

@wcwitt Have you set up GPU MACE runs for LAMMPS? I think that could be an interesting approach, as all this swapping between CPU and GPU compilations is a bit messy on my side!

@ilyes319
Contributor

@wcwitt @zakmachachi Can we close this issue? Is there a fix somewhere?

@wcwitt
Collaborator

wcwitt commented Mar 29, 2023

We've been emailing about it, in combination with some other things. Let's leave it open for a bit longer and I'll post once it's ready to close.

@ilyes319
Contributor

Sure, thank you!

ilyes319 closed this as not planned on Jul 31, 2023