pt: support dpa2 model parallel inference #3657

CaRoLZhangxy · 2024-04-08T11:31:11Z

No description provided.

CaRoLZhangxy · 2024-04-08T11:34:01Z

TODO: add unit tests

source/op/pt/comm.cc

deepmd/pt/model/descriptor/repformers.py

deepmd/pt/model/descriptor/dpa2.py

source/lib/include/neighbor_list.h

source/op/pt/comm.cc

source/op/pt/comm.cc

source/api_c/include/deepmd.hpp

source/lmp/pair_deepmd.cpp

source/op/pt/comm.cc

+
+    int** recvlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
+    // swap send and recv here
+    int* recvproc = sendproc_tensor.data_ptr<int>();


source/op/pt/comm.cc

+                            sendnum_tensor, recvnum_tensor, communicator_tensor,
+                            nlocal_tensor, nghost_tensor});
+    int** sendlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
+    int* sendproc = sendproc_tensor.data_ptr<int>();


source/op/pt/comm.cc

+                            nlocal_tensor, nghost_tensor});
+    int** sendlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
+    int* sendproc = sendproc_tensor.data_ptr<int>();
+    int* recvproc = recvproc_tensor.data_ptr<int>();


source/op/pt/comm.cc

+                       communicator_tensor, nlocal_tensor, nghost_tensor);
+}
+
+TORCH_LIBRARY_FRAGMENT(deepmd, m) { m.def("border_op", border_op); }


source/op/pt/comm.cc

iProzd

The docstring for comm_dict is missing in all the Python classes.

deepmd/pt/model/atomic_model/base_atomic_model.py

no need

chazeon · 2024-05-04T06:51:19Z

I am unable to compile comm.cc after this pull request. The error I get is

-> % cmake --build . --parallel 8
[  3%] Built target deepmd_ipi
[  5%] Built target gtest
[ 19%] Built target deepmd
[ 20%] Built target gtest_main
[ 21%] Built target gmock
[ 22%] Built target gmock_main
[ 28%] Built target deepmd_cc_test_no_backend
[ 46%] Built target runUnitTests_lib
[ 52%] Built target deepmd_cc
[ 73%] Built target deepmd_op
[ 74%] Linking CXX shared module libdeepmd_op_pt.so
[ 76%] Built target deepmd_c
[ 88%] Built target runUnitTests_cc
[ 89%] Built target deepmd_gromacs
[ 90%] Built target dp_ipi
Undefined symbols for architecture arm64:
  "long* at::TensorBase::data_ptr<long>() const", referenced from:
      Border::unpack_communicator(at::Tensor const&, int&) in comm.cc.o
ld: symbol(s) not found for architecture arm64
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
[ 99%] Built target runUnitTests_c
make[2]: *** [op/pt/CMakeFiles/deepmd_op_pt.dir/build.make:187: op/pt/libdeepmd_op_pt.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:549: op/pt/CMakeFiles/deepmd_op_pt.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

The error is related to these two lines:

https://github.com/CaRoLZhangxy/deepmd-kit/blob/beba142f64a51a9d6e65d6a621d30519679aa43d/source/op/pt/comm.cc#L323-L326

It seems that at::TensorBase::data_ptr<long>() is not defined in my libtorch; only instantiations for int, long long, etc. are exported.
I tried replacing long int with either long long or int, and both compile.
Is there a specific reason why long int is used here?

-> % nm -gU libtorch_cpu.dylib | grep TensorBase | llvm-cxxfilt | grep ::data_ptr
0000000002153398 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
0000000002153524 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
0000000002152738 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000021536b0 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000021539c8 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
000000000215383c T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
0000000002152bdc T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half>>() const
0000000002152ef4 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double>>() const
0000000002152d68 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float>>() const
000000000215320c T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
0000000002153ce0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
0000000002153b54 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
0000000002152108 T signed char* at::TensorBase::data_ptr<signed char>() const
0000000002153080 T bool* at::TensorBase::data_ptr<bool>() const
0000000002152a50 T double* at::TensorBase::data_ptr<double>() const
00000000021528c4 T float* at::TensorBase::data_ptr<float>() const
0000000002151f7c T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
0000000002152420 T int* at::TensorBase::data_ptr<int>() const
0000000002152294 T short* at::TensorBase::data_ptr<short>() const
00000000021525ac T long long* at::TensorBase::data_ptr<long long>() const

My environment is:

     active environment : deepmd-dev
    active env location : /Users/chazeon/miniforge3/envs/deepmd-dev
            shell level : 1
       user config file : /Users/chazeon/.condarc
 populated config files : /Users/chazeon/miniforge3/.condarc
                          /Users/chazeon/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=arm64
                          __osx=14.4.1=0
                          __unix=0=0
       base environment : /Users/chazeon/miniforge3  (writable)
      conda av data dir : /Users/chazeon/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/chazeon/miniforge3/pkgs
                          /Users/chazeon/.conda/pkgs
       envs directories : /Users/chazeon/miniforge3/envs
                          /Users/chazeon/.conda/envs
               platform : osx-arm64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.12 Darwin/23.4.0 OSX/14.4.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False

This might be related to: https://stackoverflow.com/questions/67584843/pytorch-tensordata-ptrlong-long-not-working-on-linux

njzjz · 2024-05-04T07:24:27Z

@chazeon could you send a PR?

In a 64-bit system, long and long long have the same size, which might be why long can be compiled on Linux. But it looks dangerous.
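
To make the point about size versus type concrete, here is a minimal sketch that uses only the standard library (no libtorch). The helper names `int64_is_long` and `int64_is_long_long` are hypothetical, and the final `static_assert` assumes a mainstream platform where `int64_t` aliases one of the two types:

```cpp
#include <cstdint>
#include <type_traits>

// long and long long are distinct types even when both are 64 bits wide,
// so templates instantiated for one are not instantiated for the other.
static_assert(!std::is_same<long, long long>::value,
              "long and long long are always different types");

// Hypothetical helpers: report which built-in type int64_t aliases here.
constexpr bool int64_is_long() {
  return std::is_same<std::int64_t, long>::value;
}

constexpr bool int64_is_long_long() {
  return std::is_same<std::int64_t, long long>::value;
}

// Assumption: int64_t aliases exactly one of the two on this platform.
static_assert(int64_is_long() != int64_is_long_long(),
              "int64_t should alias exactly one of long / long long");
```

If the two lists of exported symbols quoted in this thread are representative, `int64_is_long()` would be true on 64-bit Linux and `int64_is_long_long()` would be true on macOS.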

njzjz · 2024-05-04T07:46:21Z

The use of long here also needs to be reviewed:

deepmd-kit/source/api_cc/src/DeepPotPT.cc

Lines 199 to 213 in ebd809b

    
    if (!fparam.empty()) {
      fparam_tensor =
          torch::from_blob(const_cast<VALUETYPE*>(fparam.data()),
                           {1, static_cast<long int>(fparam.size())}, options)
              .to(device);
    }
    c10::optional<torch::Tensor> aparam_tensor;
    if (!aparam_.empty()) {
      aparam_tensor = torch::from_blob(
                          const_cast<VALUETYPE*>(aparam_.data()),
                          {1, lmp_list.inum,
                           static_cast<long int>(aparam_.size()) / lmp_list.inum},
                          options)
                          .to(device);
    }
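
As a rough illustration of what casting to `int64_t` instead of `long int` could look like for the shape arguments above, the following sketch builds the three-element shape with the standard library only. The helper `per_atom_shape` is hypothetical and is not part of deepmd-kit or libtorch; it assumes `n_atoms` is positive:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from deepmd-kit): builds the shape
// {1, n_atoms, n_values / n_atoms} with int64_t elements, the element
// type that at::IntArrayRef uses.
// Precondition (assumed): n_atoms > 0.
inline std::array<std::int64_t, 3> per_atom_shape(std::size_t n_values,
                                                  int n_atoms) {
  const auto total = static_cast<std::int64_t>(n_values);
  const auto atoms = static_cast<std::int64_t>(n_atoms);
  return {{1, atoms, total / atoms}};
}
```

For example, `per_atom_shape(12, 4)` yields `{1, 4, 3}`. Casting once to `int64_t` keeps the arithmetic in the same type on every platform.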

Following the discussion in #3657, this pull request replaces `long` and `long int` with `int64_t` in multiple places. The change improves portability across platforms and makes the intended integer width explicit.

The `int64_t` type was introduced in C++11 and is defined in the [`<cstdint>`](https://en.cppreference.com/w/cpp/header/cstdint) header. For historical reasons, the underlying type of `int64_t` is platform-specific: on 64-bit Linux it is `long`, whereas on macOS it is `long long`. Related codebases such as PyTorch and TensorFlow prefer `int64_t` over explicit `long` or `long long`. Consequently, their precompiled libraries export symbols only for `long` on Linux and only for `long long` on macOS. This is why `data_ptr<long int>()` fails to link on macOS.

## References

* https://stackoverflow.com/questions/67584843/pytorch-tensordata-ptrlong-long-not-working-on-linux
* https://github.com/llvm/llvm-project/blob/c7910ee1f0af64501bf068cdfec154ea359ff832/clang/test/Preprocessor/init.c

## Examples

### Example 1

`torch::from_blob`, which is used here, is defined as

```cpp
inline at::Tensor from_blob(
    void* data,
    at::IntArrayRef sizes,
    const at::TensorOptions& options = at::TensorOptions()) {
```

where `IntArrayRef` is defined as

```cpp
using IntArrayRef = c10::ArrayRef<int64_t>;
```

### Example 2

Dumping the symbols in `libtorch_cpu.dylib` on macOS:

```
-> % nm -gU libtorch_cpu.dylib | llvm-cxxfilt | grep TensorBase | grep ::data_ptr
0000000002153398 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
0000000002153524 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
0000000002152738 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000021536b0 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000021539c8 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
000000000215383c T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
0000000002152bdc T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half>>() const
0000000002152ef4 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double>>() const
0000000002152d68 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float>>() const
000000000215320c T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
0000000002153ce0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
0000000002153b54 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
0000000002152108 T signed char* at::TensorBase::data_ptr<signed char>() const
0000000002153080 T bool* at::TensorBase::data_ptr<bool>() const
0000000002152a50 T double* at::TensorBase::data_ptr<double>() const
00000000021528c4 T float* at::TensorBase::data_ptr<float>() const
0000000002151f7c T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
0000000002152420 T int* at::TensorBase::data_ptr<int>() const
0000000002152294 T short* at::TensorBase::data_ptr<short>() const
00000000021525ac T long long* at::TensorBase::data_ptr<long long>() const
```

Dumping the symbols in `libtorch_cpu.so` on Linux:

```
-> % nm -gU libtorch_cpu.so | c++filt | grep TensorBase | grep ::data_ptr
00000000031ec0d0 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
00000000031ec2f0 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
00000000031ec730 T c10::Float8_e4m3fnuz* at::TensorBase::data_ptr<c10::Float8_e4m3fnuz>() const
00000000031ec510 T c10::Float8_e5m2fnuz* at::TensorBase::data_ptr<c10::Float8_e5m2fnuz>() const
00000000031eb030 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000031ec950 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000031ecd80 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
00000000031ecb70 T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
00000000031eb660 T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half> >() const
00000000031eba80 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double> >() const
00000000031eb870 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float> >() const
00000000031ebeb0 T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
00000000031ed1c0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
00000000031ecfa0 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
00000000031ea7f0 T signed char* at::TensorBase::data_ptr<signed char>() const
00000000031ebca0 T bool* at::TensorBase::data_ptr<bool>() const
00000000031eb450 T double* at::TensorBase::data_ptr<double>() const
00000000031eb240 T float* at::TensorBase::data_ptr<float>() const
00000000031ea5d0 T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
00000000031eac10 T int* at::TensorBase::data_ptr<int>() const
00000000031ed5e0 T unsigned int* at::TensorBase::data_ptr<unsigned int>() const
00000000031eae20 T long* at::TensorBase::data_ptr<long>() const
00000000031ed7f0 T unsigned long* at::TensorBase::data_ptr<unsigned long>() const
00000000031eaa00 T short* at::TensorBase::data_ptr<short>() const
00000000031ed3d0 T unsigned short* at::TensorBase::data_ptr<unsigned short>() const
```

## Summary by CodeRabbit

- **Refactor**
  - Improved data type consistency across various components for handling larger data sizes more reliably.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
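
The linking behaviour described above can be imitated without libtorch. The sketch below uses a hypothetical `ToyTensor` class whose `data_ptr<T>()` template is declared for every `T` but defined only for `std::int64_t`, in the same spirit as a library that ships a fixed set of instantiations. It is an analogy under that assumption, not PyTorch's actual implementation:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor type (not the real at::TensorBase).
class ToyTensor {
 public:
  explicit ToyTensor(std::vector<std::int64_t> values)
      : values_(std::move(values)) {}

  // Declared for any T, but only the specialization below has a body.
  template <typename T>
  T* data_ptr();

 private:
  std::vector<std::int64_t> values_;
};

// The only definition provided: the int64_t specialization.
template <>
inline std::int64_t* ToyTensor::data_ptr<std::int64_t>() {
  return values_.data();
}

// Portable accessor: asks for int64_t, whichever built-in type that is.
inline std::int64_t first_element(ToyTensor& tensor) {
  return *tensor.data_ptr<std::int64_t>();
}

// By contrast, tensor.data_ptr<long>() would resolve to the definition
// above only on platforms where int64_t is long; elsewhere it would
// compile but leave an undefined symbol at link time.
```

Requesting `data_ptr<std::int64_t>()` follows the alias on each platform, which is the effect the replacement in this pull request aims for.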

CaRoLZhangxy added 5 commits April 7, 2024 05:03

init

ae0f799

Merge branch 'devel' of https://github.com/deepmodeling/deepmd-kit in…

96c9309

…to dis

init

bd1927f

fix

8350372

finish

28ae599

github-actions bot added Python Core OP C++ LAMMPS C labels Apr 8, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

1afd8fc

for more information, see https://pre-commit.ci

CaRoLZhangxy requested review from njzjz, iProzd and wanghan-iapcm April 8, 2024 11:33

njzjz linked an issue Apr 8, 2024 that may be closed by this pull request

[BUG] _lmp raise "assert mapping is not None" with dpa2 model_ #3428

Closed

njzjz requested changes Apr 8, 2024

source/op/pt/comm.cc (outdated)

source/op/pt/comm.cc (outdated)

source/op/pt/comm.cc (outdated)

wanghan-iapcm reviewed Apr 9, 2024

deepmd/pt/model/descriptor/repformers.py

deepmd/pt/model/descriptor/dpa2.py

source/lib/include/neighbor_list.h

source/op/pt/comm.cc

CaRoLZhangxy added 3 commits April 15, 2024 02:11

Merge branch 'devel' of https://github.com/deepmodeling/deepmd-kit in…

7f6632a

…to dis

Merge branch 'devel' of https://github.com/deepmodeling/deepmd-kit in…

29d1bec

…to dis

use google cuda define

2a7db1e

github-actions bot added the Examples label Apr 17, 2024

CaRoLZhangxy and others added 6 commits April 17, 2024 06:14

update forward api

6af0d63

remove frozen model

3020781

[pre-commit.ci] auto fixes from pre-commit.com hooks

420868f

for more information, see https://pre-commit.ci

Merge branch 'dis' of https://github.com/CaRoLZhangxy/deepmd-kit into…

c779828

… dis

be able to compile without mpi

7591dd3

[pre-commit.ci] auto fixes from pre-commit.com hooks

3d0f14d

for more information, see https://pre-commit.ci

github-advanced-security bot found potential problems Apr 17, 2024

CaRoLZhangxy added the Test CUDA Trigger test CUDA workflow label Apr 26, 2024

github-actions bot removed the Test CUDA Trigger test CUDA workflow label Apr 26, 2024

reset test.yml

273a446

njzjz added the Test CUDA Trigger test CUDA workflow label Apr 26, 2024

github-actions bot removed the Test CUDA Trigger test CUDA workflow label Apr 26, 2024

njzjz approved these changes Apr 26, 2024

coderabbitai bot mentioned this pull request Apr 26, 2024

Replace string search with API call in freeze function #3713

Open

CaRoLZhangxy requested a review from wanghan-iapcm April 27, 2024 06:45

wanghan-iapcm approved these changes Apr 27, 2024

iProzd reviewed Apr 27, 2024

deepmd/pt/model/atomic_model/base_atomic_model.py

add doc str in python

beba142

CaRoLZhangxy requested a review from iProzd April 28, 2024 04:48

iProzd approved these changes Apr 28, 2024

wanghan-iapcm added this pull request to the merge queue Apr 29, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024

wanghan-iapcm added this pull request to the merge queue Apr 29, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024

njzjz added this pull request to the merge queue Apr 29, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024

wanghan-iapcm added this pull request to the merge queue Apr 30, 2024

wanghan-iapcm removed this pull request from the merge queue due to a manual request Apr 30, 2024

njzjz added this pull request to the merge queue Apr 30, 2024

Merged via the queue into deepmodeling:devel with commit d0fe13c Apr 30, 2024
48 checks passed

chazeon mentioned this pull request May 4, 2024

Replacing long int type with int64_t #3739

Merged
