pt: support dpa2 model parallel inference #3657

Merged
merged 104 commits into from Apr 30, 2024

Conversation

CaRoLZhangxy
Collaborator

No description provided.

@CaRoLZhangxy
Collaborator Author

TODO: add unit tests

@njzjz njzjz linked an issue Apr 8, 2024 that may be closed by this pull request
source/op/pt/comm.cc (outdated, resolved)
source/op/pt/comm.cc (outdated, resolved)
source/op/pt/comm.cc (outdated, resolved)
deepmd/pt/model/descriptor/repformers.py (resolved)
deepmd/pt/model/descriptor/dpa2.py (resolved)
source/lib/include/neighbor_list.h (resolved)
source/op/pt/comm.cc (resolved)
source/op/pt/comm.cc (fixed)
source/op/pt/comm.cc (fixed)
source/api_c/include/deepmd.hpp (fixed)
source/lmp/pair_deepmd.cpp (fixed)

int** recvlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
// swap send and recv here
int* recvproc = sendproc_tensor.data_ptr<int>();

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable recvproc is not used.
sendnum_tensor, recvnum_tensor, communicator_tensor,
nlocal_tensor, nghost_tensor});
int** sendlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
int* sendproc = sendproc_tensor.data_ptr<int>();

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable sendproc is not used.
nlocal_tensor, nghost_tensor});
int** sendlist = reinterpret_cast<int**>(sendlist_tensor.data_ptr());
int* sendproc = sendproc_tensor.data_ptr<int>();
int* recvproc = recvproc_tensor.data_ptr<int>();

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable recvproc is not used.
source/op/pt/comm.cc (fixed)
communicator_tensor, nlocal_tensor, nghost_tensor);
}

TORCH_LIBRARY_FRAGMENT(deepmd, m) { m.def("border_op", border_op); }

Check notice

Code scanning / CodeQL

Unused static function Note

Static function TORCH_LIBRARY_FRAGMENT_init_deepmd_2 is unreachable (
TORCH_LIBRARY_FRAGMENT_static_init_deepmd_2
must be removed at the same time)
source/op/pt/comm.cc (fixed)
@CaRoLZhangxy CaRoLZhangxy added the Test CUDA Trigger test CUDA workflow label Apr 26, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Apr 26, 2024
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Apr 26, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Apr 26, 2024
Collaborator

@iProzd iProzd left a comment

The docstring for comm_dict is missing in all the Python classes.

@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Apr 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Apr 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@njzjz njzjz added this pull request to the merge queue Apr 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 29, 2024
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Apr 30, 2024
@wanghan-iapcm wanghan-iapcm removed this pull request from the merge queue due to a manual request Apr 30, 2024
@njzjz njzjz added this pull request to the merge queue Apr 30, 2024
Merged via the queue into deepmodeling:devel with commit d0fe13c Apr 30, 2024
48 checks passed
@chazeon
Contributor

chazeon commented May 4, 2024

I am unable to compile comm.cc after this pull request. The error I get is

-> % cmake --build . --parallel 8
[  3%] Built target deepmd_ipi
[  5%] Built target gtest
[ 19%] Built target deepmd
[ 20%] Built target gtest_main
[ 21%] Built target gmock
[ 22%] Built target gmock_main
[ 28%] Built target deepmd_cc_test_no_backend
[ 46%] Built target runUnitTests_lib
[ 52%] Built target deepmd_cc
[ 73%] Built target deepmd_op
[ 74%] Linking CXX shared module libdeepmd_op_pt.so
[ 76%] Built target deepmd_c
[ 88%] Built target runUnitTests_cc
[ 89%] Built target deepmd_gromacs
[ 90%] Built target dp_ipi
Undefined symbols for architecture arm64:
  "long* at::TensorBase::data_ptr<long>() const", referenced from:
      Border::unpack_communicator(at::Tensor const&, int&) in comm.cc.o
ld: symbol(s) not found for architecture arm64
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
[ 99%] Built target runUnitTests_c
make[2]: *** [op/pt/CMakeFiles/deepmd_op_pt.dir/build.make:187: op/pt/libdeepmd_op_pt.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:549: op/pt/CMakeFiles/deepmd_op_pt.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

The error is related to these two lines:

https://github.com/CaRoLZhangxy/deepmd-kit/blob/beba142f64a51a9d6e65d6a621d30519679aa43d/source/op/pt/comm.cc#L323-L326

It seems at::TensorBase::data_ptr<T>() is not instantiated for long in my libtorch, only for int, long long, and so on.
I tried replacing long int with either long long or int, and both variants build.
Is there a specific reason why long int is used here?

-> % nm -gU libtorch_cpu.dylib | grep TensorBase | llvm-cxxfilt | grep ::data_ptr
0000000002153398 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
0000000002153524 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
0000000002152738 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000021536b0 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000021539c8 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
000000000215383c T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
0000000002152bdc T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half>>() const
0000000002152ef4 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double>>() const
0000000002152d68 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float>>() const
000000000215320c T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
0000000002153ce0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
0000000002153b54 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
0000000002152108 T signed char* at::TensorBase::data_ptr<signed char>() const
0000000002153080 T bool* at::TensorBase::data_ptr<bool>() const
0000000002152a50 T double* at::TensorBase::data_ptr<double>() const
00000000021528c4 T float* at::TensorBase::data_ptr<float>() const
0000000002151f7c T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
0000000002152420 T int* at::TensorBase::data_ptr<int>() const
0000000002152294 T short* at::TensorBase::data_ptr<short>() const
00000000021525ac T long long* at::TensorBase::data_ptr<long long>() const

My environment is:

     active environment : deepmd-dev
    active env location : /Users/chazeon/miniforge3/envs/deepmd-dev
            shell level : 1
       user config file : /Users/chazeon/.condarc
 populated config files : /Users/chazeon/miniforge3/.condarc
                          /Users/chazeon/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=arm64
                          __osx=14.4.1=0
                          __unix=0=0
       base environment : /Users/chazeon/miniforge3  (writable)
      conda av data dir : /Users/chazeon/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/chazeon/miniforge3/pkgs
                          /Users/chazeon/.conda/pkgs
       envs directories : /Users/chazeon/miniforge3/envs
                          /Users/chazeon/.conda/envs
               platform : osx-arm64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.12 Darwin/23.4.0 OSX/14.4.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False

This might be related to: https://stackoverflow.com/questions/67584843/pytorch-tensordata-ptrlong-long-not-working-on-linux

@njzjz
Member

njzjz commented May 4, 2024

@chazeon could you send a PR?

On 64-bit Linux and macOS, long and long long have the same size, which might be why the code using long builds on Linux. But relying on that looks dangerous.

@njzjz
Member

njzjz commented May 4, 2024

The use of long here also needs to be reviewed:

if (!fparam.empty()) {
  fparam_tensor =
      torch::from_blob(const_cast<VALUETYPE*>(fparam.data()),
                       {1, static_cast<long int>(fparam.size())}, options)
          .to(device);
}
c10::optional<torch::Tensor> aparam_tensor;
if (!aparam_.empty()) {
  aparam_tensor = torch::from_blob(
                      const_cast<VALUETYPE*>(aparam_.data()),
                      {1, lmp_list.inum,
                       static_cast<long int>(aparam_.size()) / lmp_list.inum},
                      options)
                      .to(device);
}

github-merge-queue bot pushed a commit that referenced this pull request May 5, 2024
Following the discussion in #3657, this pull request replaces `long`
and `long int` with `int64_t` in multiple places. The change aims to
improve compatibility across platforms and to make the intended
integer width explicit.

The `int64_t` type was standardized in C++11 and is defined in the
[`<cstdint>`](https://en.cppreference.com/w/cpp/header/cstdint) header.
For historical reasons, the underlying type of `int64_t` is
platform-specific: on 64-bit Linux it is a typedef for `long`,
whereas on macOS it is a typedef for `long long`.
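
The platform dependence described above can be checked at compile time. The following is a minimal sketch using only the standard library; the names `kInt64IsLong`, `kInt64IsLongLong`, and `int64_underlying_name` are made up for illustration and do not appear in DeePMD-kit or PyTorch.

```cpp
#include <cstdint>
#include <type_traits>

// long and long long are distinct types under the language rules,
// even on platforms where both are 64 bits wide.
static_assert(!std::is_same<long, long long>::value,
              "long and long long are never the same type");

// int64_t is a typedef; which type it names is platform-specific.
constexpr bool kInt64IsLong = std::is_same<std::int64_t, long>::value;
constexpr bool kInt64IsLongLong =
    std::is_same<std::int64_t, long long>::value;

// Name of the type behind int64_t on the build platform.
inline const char* int64_underlying_name() {
  if (kInt64IsLong) return "long";
  if (kInt64IsLongLong) return "long long";
  return "other";  // not expected on mainstream platforms
}
```

Because the two types are distinct, a template instantiated for one of them provides no symbol for the other, even when their sizes match.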

Related codebases such as PyTorch and TensorFlow use `int64_t`
rather than spelling out `long` or `long long`.
Consequently, their precompiled libraries export only the `long`
symbols on Linux and only the `long long` symbols on macOS.

For these reasons, code that calls `data_ptr<long int>()` compiles but fails to link on macOS.
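
The mechanism can be illustrated with a toy class. This is a sketch under stated assumptions: `ToyTensor` and `first_element` are hypothetical stand-ins, not the real `at::TensorBase`, and the accessor is non-const for brevity. The member template is declared for every `T` but defined only for `int64_t`, which mirrors how a precompiled library ships a fixed set of instantiations.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor class (not at::TensorBase).
class ToyTensor {
 public:
  explicit ToyTensor(std::vector<std::int64_t> values)
      : storage_(std::move(values)) {}

  // Declared for every T, but defined only by the specialization below.
  template <typename T>
  T* data_ptr();

 private:
  std::vector<std::int64_t> storage_;
};

// The only definition provided: the int64_t one.
template <>
inline std::int64_t* ToyTensor::data_ptr<std::int64_t>() {
  return storage_.data();
}

// Portable accessor: always requests the int64_t instantiation.
inline std::int64_t first_element(ToyTensor& t) {
  return t.data_ptr<std::int64_t>()[0];
}
```

In this sketch, requesting `data_ptr<long>()` where `int64_t` is `long long` (or `data_ptr<long long>()` where it is `long`) would compile against the declaration but leave an undefined symbol at link time, which is the kind of failure reported in the discussion.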

## References

*
https://stackoverflow.com/questions/67584843/pytorch-tensordata-ptrlong-long-not-working-on-linux
*
https://github.com/llvm/llvm-project/blob/c7910ee1f0af64501bf068cdfec154ea359ff832/clang/test/Preprocessor/init.c

## Examples

### Example 1

The function used here, `torch::from_blob`, is declared as

```cpp
inline at::Tensor from_blob(
    void* data,
    at::IntArrayRef sizes,
    const at::TensorOptions& options = at::TensorOptions()) {
```
where `IntArrayRef` is defined as
```cpp
using IntArrayRef = c10::ArrayRef<int64_t>;
```
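
Since the shape elements are `int64_t`, size values are best cast to `int64_t` rather than `long int`. A minimal sketch, assuming a hypothetical helper `aparam_shape` (not part of DeePMD-kit) that builds a `{1, n_atoms, values_per_atom}` shape from a flat buffer length:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds the shape {1, n_atoms, n_values / n_atoms}
// with int64_t elements, matching the element type of IntArrayRef.
inline std::vector<std::int64_t> aparam_shape(std::size_t n_values,
                                              int n_atoms) {
  assert(n_atoms > 0);  // the division below requires a non-zero count
  return {1, static_cast<std::int64_t>(n_atoms),
          static_cast<std::int64_t>(n_values) / n_atoms};
}
```

The cast to `int64_t` behaves the same on Linux and macOS, whereas a cast to `long int` only matches the library's element type where `int64_t` happens to be `long`.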

### Example 2

Dumping the symbols in `libtorch_cpu.dylib` on macOS

```
-> % nm -gU libtorch_cpu.dylib | llvm-cxxfilt | grep TensorBase | grep ::data_ptr
0000000002153398 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
0000000002153524 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
0000000002152738 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000021536b0 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000021539c8 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
000000000215383c T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
0000000002152bdc T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half>>() const
0000000002152ef4 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double>>() const
0000000002152d68 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float>>() const
000000000215320c T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
0000000002153ce0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
0000000002153b54 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
0000000002152108 T signed char* at::TensorBase::data_ptr<signed char>() const
0000000002153080 T bool* at::TensorBase::data_ptr<bool>() const
0000000002152a50 T double* at::TensorBase::data_ptr<double>() const
00000000021528c4 T float* at::TensorBase::data_ptr<float>() const
0000000002151f7c T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
0000000002152420 T int* at::TensorBase::data_ptr<int>() const
0000000002152294 T short* at::TensorBase::data_ptr<short>() const
00000000021525ac T long long* at::TensorBase::data_ptr<long long>() const
```

Dumping the symbols in `libtorch_cpu.so` on Linux

```
-> % nm -gU libtorch_cpu.so | c++filt | grep TensorBase | grep ::data_ptr 
00000000031ec0d0 T c10::Float8_e5m2* at::TensorBase::data_ptr<c10::Float8_e5m2>() const
00000000031ec2f0 T c10::Float8_e4m3fn* at::TensorBase::data_ptr<c10::Float8_e4m3fn>() const
00000000031ec730 T c10::Float8_e4m3fnuz* at::TensorBase::data_ptr<c10::Float8_e4m3fnuz>() const
00000000031ec510 T c10::Float8_e5m2fnuz* at::TensorBase::data_ptr<c10::Float8_e5m2fnuz>() const
00000000031eb030 T c10::Half* at::TensorBase::data_ptr<c10::Half>() const
00000000031ec950 T c10::qint8* at::TensorBase::data_ptr<c10::qint8>() const
00000000031ecd80 T c10::qint32* at::TensorBase::data_ptr<c10::qint32>() const
00000000031ecb70 T c10::quint8* at::TensorBase::data_ptr<c10::quint8>() const
00000000031eb660 T c10::complex<c10::Half>* at::TensorBase::data_ptr<c10::complex<c10::Half> >() const
00000000031eba80 T c10::complex<double>* at::TensorBase::data_ptr<c10::complex<double> >() const
00000000031eb870 T c10::complex<float>* at::TensorBase::data_ptr<c10::complex<float> >() const
00000000031ebeb0 T c10::BFloat16* at::TensorBase::data_ptr<c10::BFloat16>() const
00000000031ed1c0 T c10::quint2x4* at::TensorBase::data_ptr<c10::quint2x4>() const
00000000031ecfa0 T c10::quint4x2* at::TensorBase::data_ptr<c10::quint4x2>() const
00000000031ea7f0 T signed char* at::TensorBase::data_ptr<signed char>() const
00000000031ebca0 T bool* at::TensorBase::data_ptr<bool>() const
00000000031eb450 T double* at::TensorBase::data_ptr<double>() const
00000000031eb240 T float* at::TensorBase::data_ptr<float>() const
00000000031ea5d0 T unsigned char* at::TensorBase::data_ptr<unsigned char>() const
00000000031eac10 T int* at::TensorBase::data_ptr<int>() const
00000000031ed5e0 T unsigned int* at::TensorBase::data_ptr<unsigned int>() const
00000000031eae20 T long* at::TensorBase::data_ptr<long>() const
00000000031ed7f0 T unsigned long* at::TensorBase::data_ptr<unsigned long>() const
00000000031eaa00 T short* at::TensorBase::data_ptr<short>() const
00000000031ed3d0 T unsigned short* at::TensorBase::data_ptr<unsigned short>() const
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Improved data type consistency across various components for handling
larger data sizes more reliably.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Successfully merging this pull request may close these issues.

[BUG] _lmp raise "assert mapping is not None" with dpa2 model_
5 participants