
[BUG] lmp raises "assert mapping is not None" with dpa2 model #3428

Closed
changxiaoju opened this issue Mar 7, 2024 · 2 comments · Fixed by #3657
@changxiaoju

Bug summary

I trained a dpa2 model using deepmd-kit-3.0.0a0/examples/water/dpa2/input_torch.json, with the training data from deepmd-kit-3.0.0a0/examples/water/data, froze it, and used frozen_model.pth to run DPMD with the input files in deepmd-kit-3.0.0a0/examples/water/lmp. All of the above was done with only the necessary changes to the running steps in the example files.
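For context, the workflow above roughly corresponds to the commands below (a sketch only; the exact options are in the job scripts inside the attached zip, and the --pt backend flag is my assumption for v3.0.0a0):

# train the DPA2 model with the PyTorch backend, using the example input
dp --pt train input_torch.json
# freeze the trained model into a TorchScript archive
dp --pt freeze -o frozen_model.pth
# run the LAMMPS example from examples/water/lmp against the frozen model
mpirun lmp -in in.lammps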

All the data and input files needed to reproduce the issue are provided in
water_test_inputs.zip

The LAMMPS error (water_test_inputs/lmp/slurm-9441.out) is:

OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 
OMP: Info #254: KMP_AFFINITY: pid 13722 tid 13722 thread 0 bound to OS proc set 0
4046
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 59, in forward_lower
    aparam: Optional[Tensor]=None,
    do_atomic_virial: bool=False) -> Dict[str, Tensor]:
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, )
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    model_predict = annotate(Dict[str, Tensor], {})
    torch._set_item(model_predict, "atom_energy", model_ret["energy"])
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 200, in forward_common_lower
    _31 = (self).input_type_cast(extended_coord0, None, fparam, aparam, )
    cc_ext, _32, fp, ap, input_prec, = _31
    atomic_ret = (self).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    model_predict = _29(atomic_ret, (self).atomic_output_def(), cc_ext, do_atomic_virial, )
    model_predict1 = (self).output_type_cast(model_predict, input_prec, )
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 264, in forward_common_atomic
    fparam: Optional[Tensor]=None,
    aparam: Optional[Tensor]=None) -> Dict[str, Tensor]:
    ret_dict = (self).forward_atomic(extended_coord, extended_atype, nlist, mapping, fparam, aparam, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return ret_dict
  def forward_atomic(self: __torch__.deepmd.pt.model.model.ener_model.EnergyModel,
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 284, in forward_atomic
      pass
    descriptor = self.descriptor
    _43 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _43
    fitting_net = self.fitting_net
  File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 54, in forward
      mapping0 = unchecked_cast(Tensor, mapping)
    else:
      ops.prim.RaiseException("AssertionError: ")
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      mapping0 = _2
    _15 = torch.view(mapping0, [nframes, nall])

Traceback of TorchScript, original code (most recent call last):
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/ener_model.py", line 73, in forward_lower
        do_atomic_virial: bool = False,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/make_model.py", line 206, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 103, in forward_common_atomic
            nlist = torch.where(pair_mask == 1, nlist, -1)
    
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 164, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 443, in forward
        g1 = self.g1_shape_tranform(g1)
        # mapping g1
        assert mapping is not None
        ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        mapping_ext = (
            mapping.view(nframes, nall).unsqueeze(-1).expand(-1, -1, g1.shape[-1])
RuntimeError: AssertionError: 

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
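As a side note, the serialized forward_lower code quoted in the traceback can be inspected directly from the frozen model; a minimal sketch, assuming frozen_model.pth is the TorchScript archive produced by freezing (as the traceback suggests):

import torch

# Load the frozen TorchScript model and print the serialized code that the
# traceback above refers to (the forward_lower entry point called by LAMMPS).
model = torch.jit.load("frozen_model.pth", map_location="cpu")
print(model.forward_lower.code)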

DeePMD-kit Version

DeePMD-kit v3.0.0a0

TensorFlow Version

torch Version: 2.1.2.post300

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

water_test_inputs.zip

Steps to Reproduce

(base) [juju@mgt workdir]$ cd water_test_inputs/dpa2/
(base) [juju@mgt dpa2]$ sbatch job.sbatch 
Submitted batch job 9442
(base) [juju@mgt dpa2]$ sbatch freeze.sbatch 
Submitted batch job 9443
(base) [juju@mgt dpa2]$ cp frozen_model.pth ../lmp
(base) [juju@mgt dpa2]$ cd ../lmp/
(base) [juju@mgt lmp]$ sbatch job.sbatch 
Submitted batch job 9444
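The LAMMPS job runs the water example input essentially unchanged; the relevant part is the pair-style setup, which is roughly the following (a sketch; the full in.lammps is in the zip):

pair_style      deepmd frozen_model.pth
pair_coeff      * *

The assertion error shown above is raised as soon as the pair style evaluates the frozen DPA2 model.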

Further Information, Files, and Links

No response

@changxiaoju changxiaoju added the bug label Mar 7, 2024
@njzjz njzjz added this to the v3.0.0 milestone Mar 9, 2024
@changxiaoju
Author

I've noticed that this issue has been open for some time now without any resolution or feedback. I'm quite interested in this matter and was wondering if there have been any developments?

@njzjz
Member

njzjz commented Mar 16, 2024

I've noticed that this issue has been open for some time now without any resolution or feedback. I'm quite interested in this matter and was wondering if there have been any developments?

The DPA2 model needs more information communicated between MPI ranks than the current API passes, so we plan to schedule a meeting to propose a new inference API.
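To illustrate the point with a toy sketch (not DeePMD-kit code; shapes and values are made up): the DPA2 descriptor needs the extended-to-local mapping so that per-atom features computed for local atoms can also be placed on ghost atoms, which corresponds to the mapping.view(...).expand(...) step visible in the traceback, followed here by a gather:

import torch

# One frame with 4 local atoms plus 2 ghost atoms; 8 features per atom.
nframes, nloc, nall, ng1 = 1, 4, 6, 8
g1 = torch.randn(nframes, nloc, ng1)          # features of local atoms only
# mapping: for every extended atom (local + ghost), the index of the local
# atom it is an image of; ghost atoms reuse their owner's features.
mapping = torch.tensor([[0, 1, 2, 3, 0, 1]])  # shape (nframes, nall)

mapping_ext = mapping.view(nframes, nall).unsqueeze(-1).expand(-1, -1, ng1)
g1_ext = torch.gather(g1, dim=1, index=mapping_ext)
print(g1_ext.shape)                           # torch.Size([1, 6, 8])

Without a mapping tensor this gather cannot be performed, which is why the descriptor stops at "assert mapping is not None" when LAMMPS calls the model without passing one.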
