dynamically load op library in C++ interface #1384

Merged: 12 commits, Jan 21, 2022

@njzjz
Member

njzjz commented Dec 26, 2021

In this PR, the C++ interface loads the OP library dynamically at run time, just as the Python interface does, so it no longer needs to be linked against it. I have therefore also removed the CMAKE_LINK_WHAT_YOU_USE flag. Note that this requires RPATH to be set, which we have already done.

Refer: https://discuss.tensorflow.org/t/how-to-load-custom-op-from-c/5748
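
To illustrate what "load at run time instead of linking" means, here is a minimal Python sketch using `ctypes`, which hands the request to the platform's dynamic loader (`dlopen` on Linux). It is not the code in this PR; the helper name `try_load` and the library name in the usage part are hypothetical.

```python
import ctypes


def try_load(library_name):
    """Try to load a shared library at run time; return (handle, error)."""
    try:
        # ctypes.CDLL asks the platform loader to locate and map the library
        # using the loader's normal search rules (RPATH, LD_LIBRARY_PATH, ...).
        return ctypes.CDLL(library_name), None
    except OSError as exc:
        # A missing library is reported as an error value instead of a crash.
        return None, str(exc)


if __name__ == "__main__":
    # Hypothetical library name, used only for illustration.
    handle, error = try_load("libhypothetical_op_library.so")
    if handle is None:
        print("load failed:", error)
    else:
        print("loaded:", handle)
```

Because nothing is resolved at link time, a library that cannot be found only shows up when the load call runs, which is why the loader's search path matters for this approach.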

@codecov-commenter

codecov-commenter commented Dec 26, 2021

Codecov Report

Merging #1384 (1142420) into devel (70c0e73) will decrease coverage by 11.25%.
The diff coverage is n/a.

❗ Current head 1142420 differs from pull request most recent head de00e04. Consider uploading reports for the commit de00e04 to get more accurate results

@@             Coverage Diff             @@
##            devel    #1384       +/-   ##
===========================================
- Coverage   75.53%   64.28%   -11.26%     
===========================================
  Files          91        5       -86     
  Lines        7505       14     -7491     
===========================================
- Hits         5669        9     -5660     
+ Misses       1836        5     -1831     
Impacted Files Coverage Δ
deepmd/common.py
deepmd/descriptor/se.py
deepmd/descriptor/se_a.py
deepmd/descriptor/se_r.py
deepmd/descriptor/se_t.py
deepmd/entrypoints/freeze.py
deepmd/entrypoints/main.py
deepmd/env.py
deepmd/fit/dipole.py
deepmd/fit/ener.py
... and 76 more


Review thread on source/install/test_cc.sh (outdated, resolved)
@wanghan-iapcm
Collaborator

@denghuilu please check if it works.

@denghuilu
Member

An error occurs during the MD run. I'm not sure what's going on; I'll check it this afternoon.

root lmp $ git branch
* dynamically-load-op-library
root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps 
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:       dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:      dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
*** Error in `/root/denghui/lammps/src/lmp_mpi': free(): invalid pointer: 0x0000000002314268 ***

@denghuilu
Member

@njzjz After adding $deepmd_root/lib to LD_LIBRARY_PATH, the MD run completes successfully.

root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps 
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:       dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:      dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
*** Error in `/root/denghui/lammps/src/lmp_mpi': free(): invalid pointer: 0x0000000002094268 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7f78ad80c329]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(+0x1ee7d)[0x7f78c978fe7d]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd15load_op_libraryEv+0x104)[0x7f78c97905c4]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot4initERKSsRKiS2_+0x80)[0x7f78c97869a0]
/root/denghui/lammps/src/lmp_mpi[0x86ed77]
/root/denghui/lammps/src/lmp_mpi[0x40c5b2]
/root/denghui/lammps/src/lmp_mpi[0x414441]
/root/denghui/lammps/src/lmp_mpi[0x4147ed]
/root/denghui/lammps/src/lmp_mpi[0x409088]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f78ad7ad555]
/root/denghui/lammps/src/lmp_mpi[0x409fe8]
======= Memory map: ========
00400000-0099d000 r-xp 00000000 fd:01 23862972                           /root/denghui/lammps/src/lmp_mpi
00b9c000-00b9d000 r--p 0059c000 fd:01 23862972                           /root/denghui/lammps/src/lmp_mpi
00b9d000-00ba0000 rw-p 0059d000 fd:01 23862972                           /root/denghui/lammps/src/lmp_mpi
00ba0000-00ba5000 rw-p 00000000 00:00 0 
00bd0000-020b1000 rw-p 00000000 00:00 0                                  [heap]
7f7890000000-7f7890021000 rw-p 00000000 00:00 0 
7f7890021000-7f7894000000 ---p 00000000 00:00 0 
7f78973ec000-7f7897bed000 rw-p 00000000 00:00 0 
7f7897bed000-7f7897bf3000 r-xp 00000000 fd:01 22545198                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_sm.so
7f7897bf3000-7f7897df2000 ---p 00006000 fd:01 22545198                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_sm.so
7f7897df2000-7f7897df3000 r--p 00005000 fd:01 22545198                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_sm.so
7f7897df3000-7f7897df5000 rw-p 00006000 fd:01 22545198                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_sm.so
7f7898000000-7f7898021000 rw-p 00000000 00:00 0 
7f7898021000-7f789c000000 ---p 00000000 00:00 0 
7f789c1f8000-7f789c22f000 r-xp 00000000 fd:01 22545204                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_rdma.so
7f789c22f000-7f789c42e000 ---p 00037000 fd:01 22545204                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_rdma.so
7f789c42e000-7f789c42f000 r--p 00036000 fd:01 22545204                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_rdma.so
7f789c42f000-7f789c430000 rw-p 00037000 fd:01 22545204                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_rdma.so
7f789c430000-7f789c440000 r-xp 00000000 fd:01 22545206                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_ucx.so
7f789c440000-7f789c63f000 ---p 00010000 fd:01 22545206                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_ucx.so
7f789c63f000-7f789c640000 r--p 0000f000 fd:01 22545206                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_ucx.so
7f789c640000-7f789c649000 rw-p 00010000 fd:01 22545206                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_ucx.so
7f789c649000-7f789c667000 r-xp 00000000 fd:01 22545202                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_pt2pt.so
7f789c667000-7f789c866000 ---p 0001e000 fd:01 22545202                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_pt2pt.so
7f789c866000-7f789c867000 r--p 0001d000 fd:01 22545202                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_pt2pt.so
7f789c867000-7f789c868000 rw-p 0001e000 fd:01 22545202                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_osc_pt2pt.so
7f789c868000-7f789c86b000 r-xp 00000000 fd:01 22545174                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_sync.so
7f789c86b000-7f789ca6b000 ---p 00003000 fd:01 22545174                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_sync.so
7f789ca6b000-7f789ca6c000 r--p 00003000 fd:01 22545174                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_sync.so
7f789ca6c000-7f789ca6d000 rw-p 00004000 fd:01 22545174                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_sync.so
7f789ca6d000-7f789ca79000 r-xp 00000000 fd:01 22545164                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_basic.so
7f789ca79000-7f789cc79000 ---p 0000c000 fd:01 22545164                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_basic.so
7f789cc79000-7f789cc7a000 r--p 0000c000 fd:01 22545164                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_basic.so
7f789cc7a000-7f789cc7b000 rw-p 0000d000 fd:01 22545164                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_basic.so
7f789cc7b000-7f789cc8c000 r-xp 00000000 fd:01 22545176                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_tuned.so
7f789cc8c000-7f789ce8b000 ---p 00011000 fd:01 22545176                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_tuned.so
7f789ce8b000-7f789ce8c000 r--p 00010000 fd:01 22545176                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_tuned.so
7f789ce8c000-7f789ce8d000 rw-p 00011000 fd:01 22545176                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_tuned.so
7f789d09c000-7f789d0a0000 r-xp 00000000 fd:01 22545166                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_inter.so
7f789d0a0000-7f789d29f000 ---p 00004000 fd:01 22545166                   /root/denghui/openmpi-4.0.6/lib/openmpi/mca_coll_inter.so
[VM-0-4-centos:11397] *** Process received signal ***
[VM-0-4-centos:11397] Signal: Aborted (6)
[VM-0-4-centos:11397] Signal code:  (-6)
[VM-0-4-centos:11397] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f78adb68630]
[VM-0-4-centos:11397] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f78ad7c1387]
[VM-0-4-centos:11397] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f78ad7c2a78]
[VM-0-4-centos:11397] [ 3] /lib64/libc.so.6(+0x78f67)[0x7f78ad803f67]
[VM-0-4-centos:11397] [ 4] /lib64/libc.so.6(+0x81329)[0x7f78ad80c329]
[VM-0-4-centos:11397] [ 5] /root/denghui/deepmd_root/lib/libdeepmd_cc.so(+0x1ee7d)[0x7f78c978fe7d]
[VM-0-4-centos:11397] [ 6] /root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd15load_op_libraryEv+0x104)[0x7f78c97905c4]
[VM-0-4-centos:11397] [ 7] /root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot4initERKSsRKiS2_+0x80)[0x7f78c97869a0]
[VM-0-4-centos:11397] [ 8] /root/denghui/lammps/src/lmp_mpi[0x86ed77]
[VM-0-4-centos:11397] [ 9] /root/denghui/lammps/src/lmp_mpi[0x40c5b2]
[VM-0-4-centos:11397] [10] /root/denghui/lammps/src/lmp_mpi[0x414441]
[VM-0-4-centos:11397] [11] /root/denghui/lammps/src/lmp_mpi[0x4147ed]
[VM-0-4-centos:11397] [12] /root/denghui/lammps/src/lmp_mpi[0x409088]
[VM-0-4-centos:11397] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f78ad7ad555]
[VM-0-4-centos:11397] [14] /root/denghui/lammps/src/lmp_mpi[0x409fe8]
[VM-0-4-centos:11397] *** End of error message ***
I'm in 1
Not found: libdeepmd_op.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node VM-0-4-centos exited on signal 6 (Aborted).
--------------------------------------------------------------------------
root lmp $ export LD_LIBRARY_PATH=$deepmd_root/lib:$LD_LIBRARY_PATH
root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps 
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:       dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:      dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
I'm in 1
I'm in 2
2022-01-15 22:52:12.662306: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-15 22:52:12.662708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:12.674188: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:12.675272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:13.352641: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:13.353758: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:13.354812: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-15 22:52:13.355875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31006 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0
  >>> Info of model(s):
  using   1 model(s): frozen_model.pb 
  rcut in model:      6
  ntypes in model:    2

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Neighbor list info ...
  update every 10 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
Per MPI rank memory allocation (min/avg/max) = 3.908 | 3.908 | 3.908 Mbytes
Step PotEng KinEng TotEng Temp Press Volume 
       0   -29949.687    8.1472669    -29941.54          330   -10315.294    1927.3176 
     100   -29949.774    8.2396323   -29941.535     333.7412    -17369.19    1927.3176 
     200   -29949.918    8.3734367   -29941.545    339.16087   -14767.494    1927.3176 
     300   -29949.426    7.8874137   -29941.539     319.4748   -10014.159    1927.3176 
     400   -29949.966    8.4216707   -29941.544    341.11455   -15011.137    1927.3176 
     500   -29949.534    7.9793362   -29941.554    323.19807    -19278.38    1927.3176 
     600   -29950.089    8.5298607    -29941.56    345.49672   -10833.846    1927.3176 
     700    -29950.03    8.4502146    -29941.58    342.27071   -10410.325    1927.3176 
     800   -29949.216    7.6218502   -29941.594    308.71832   -19800.675    1927.3176 
     900   -29949.528    7.9217608   -29941.606    320.86602   -14158.487    1927.3176 
    1000    -29949.78    8.1663933   -29941.614     330.7747   -19542.602    1927.3176 
Loop time of 4.72924 on 1 procs for 1000 steps with 192 atoms

Performance: 9.135 ns/day, 2.627 hours/ns, 211.451 timesteps/s
98.0% CPU use with 1 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 4.5636     | 4.5636     | 4.5636     |   0.0 | 96.50
Neigh   | 0.13815    | 0.13815    | 0.13815    |   0.0 |  2.92
Comm    | 0.015456   | 0.015456   | 0.015456   |   0.0 |  0.33
Output  | 0.0029724  | 0.0029724  | 0.0029724  |   0.0 |  0.06
Modify  | 0.0064512  | 0.0064512  | 0.0064512  |   0.0 |  0.14
Other   |            | 0.002629   |            |       |  0.06

Nlocal:        192.000 ave         192 max         192 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        2152.00 ave        2152 max        2152 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      41092.0 ave       41092 max       41092 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 41092
Ave neighs/atom = 214.02083
Neighbor list builds = 100
Dangerous builds not checked
Total wall time: 0:00:07
root lmp $ 
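
The `export LD_LIBRARY_PATH=...` workaround in the session above can also be applied per process. The following is a small sketch, assuming the run is launched from Python; the helper name `with_prepended_library_dir` is hypothetical, and the directory is the install prefix shown in the log.

```python
import os


def with_prepended_library_dir(env, directory):
    """Return a copy of env with directory placed first in LD_LIBRARY_PATH."""
    new_env = dict(env)
    existing = new_env.get("LD_LIBRARY_PATH", "")
    # Keep only non-empty parts so that no empty entry is produced; the
    # loader interprets an empty entry as the current working directory.
    parts = [directory] + [p for p in existing.split(":") if p]
    new_env["LD_LIBRARY_PATH"] = ":".join(parts)
    return new_env


if __name__ == "__main__":
    env = with_prepended_library_dir(os.environ, "/root/denghui/deepmd_root/lib")
    print(env["LD_LIBRARY_PATH"])
    # The resulting mapping could be passed as env=... to subprocess.run.
```

Note that this sketch also drops empty entries that were already present, so it is not a byte-for-byte equivalent of the shell `export` line.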

@njzjz
Member Author

njzjz commented Jan 15, 2022

@denghuilu Can you check if RPATH is set?

@njzjz
Member Author

njzjz commented Jan 15, 2022

I think rpath should have already been set here:

NNP_LIB=" -Wl,--no-as-needed -l@LIB_DEEPMD_CC@@variant_name@ -ltensorflow_cc -ltensorflow_framework -Wl,-rpath=$TF_RPATH -Wl,-rpath=$DEEPMD_ROOT/lib"

@denghuilu
Member

root lmp $ ldd /root/denghui/lammps/src/lmp_mpi
        linux-vdso.so.1 =>  (0x00007ffe391e6000)
        libdeepmd_cc.so => /root/denghui/deepmd_root/lib/libdeepmd_cc.so (0x00007fc0acc4b000)
        libtensorflow_cc.so.2 => /root/denghui/tensorflow_root/lib/libtensorflow_cc.so.2 (0x00007fc09389e000)
        libtensorflow_framework.so.2 => /root/denghui/tensorflow_root/lib/libtensorflow_framework.so.2 (0x00007fc091d85000)
        libmpi.so.40 => /root/denghui/openmpi-4.0.6/lib/libmpi.so.40 (0x00007fc091a6f000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fc091767000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fc091465000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fc09124f000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc091033000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fc090c65000)
        libdeepmd.so => /root/denghui/deepmd_root/lib/libdeepmd.so (0x00007fc090a2a000)
        libdeepmd_op_cuda.so => /root/denghui/deepmd_root/lib/libdeepmd_op_cuda.so (0x00007fc0904be000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fc0902ba000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fc0900b2000)
        libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fc08fe8c000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc0ace7b000)
        libopen-rte.so.40 => /root/denghui/openmpi-4.0.6/lib/libopen-rte.so.40 (0x00007fc08fbd5000)
        libopen-pal.so.40 => /root/denghui/openmpi-4.0.6/lib/libopen-pal.so.40 (0x00007fc08f8c5000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007fc08f6af000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fc08f4ac000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fc08f296000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00007fc08f091000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00007fc08ee40000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007fc08ec3b000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007fc08ea23000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fc08e7fd000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fc08e5ed000)

@denghuilu
Member

I have no idea what's going on. The devel branch works fine.

@denghuilu
Member

The build now fails at the link step:

mpicxx -g -O3 -std=c++11 main.o -L/root/denghui/tensorflow_root/lib -L/root/denghui/tensorflow_root/lib -L/root/denghui/deepmd_root/lib     -L. -llammps_mpi -Wl,--no-as-needed -ldeepmd_cc -ltensorflow_cc -ltensorflow_framework -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/deepmd_root/lib -rpath=/root/denghui/deepmd_root/lib      -o ../lmp_mpi
g++: error: unrecognized command line option ‘-rpath=/root/denghui/deepmd_root/lib’
make[1]: *** [../lmp_mpi] Error 1
make[1]: Leaving directory `/root/denghui/lammps/src/Obj_mpi'
make: *** [mpi] Error 2
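
The error above points at the final bare `-rpath=...` argument: the g++ driver does not recognize it, whereas the earlier `-Wl,-rpath=...` arguments are forwarded to the linker. As a sketch only (the helper name `rpath_flags` is hypothetical and is not part of the build scripts), flags of the accepted form could be generated like this:

```python
def rpath_flags(directories):
    """Build rpath flags in the -Wl, form that g++ forwards to the linker."""
    flags = []
    seen = set()
    for directory in directories:
        if directory in seen:
            continue  # skip duplicates, as seen in the command line above
        seen.add(directory)
        flags.append("-Wl,-rpath=" + directory)
    return flags


if __name__ == "__main__":
    dirs = [
        "/root/denghui/tensorflow_root/lib",
        "/root/denghui/tensorflow_root/lib",
        "/root/denghui/deepmd_root/lib",
    ]
    print(" ".join(rpath_flags(dirs)))
```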

@njzjz njzjz force-pushed the dynamically-load-op-library branch from aea3d5b to a3f8d95 Compare January 16, 2022 15:39
@njzjz
Member Author

njzjz commented Jan 16, 2022

I'll take another look at it.

@njzjz
Member Author

njzjz commented Jan 16, 2022

@denghuilu I rechecked 0f61527 with a freshly downloaded and compiled LAMMPS, and it ran without problems even without setting LD_LIBRARY_PATH.

@njzjz
Member Author

njzjz commented Jan 16, 2022

@denghuilu Can you run the following command on your build?

(base) [jz748@localhost lmp]$ readelf -d /home/jz748/codes/deepmd-kit/dp/lib/libdeepmd_cc.so | head -20
Dynamic section at offset 0x3ac88 contains 38 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_cc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd_op_cuda.so]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgomp.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libdeepmd_cc.so]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN:/home/jz748/codes/deepmd-kit/dp/lib]
 0x000000000000000c (INIT)               0xb000
 0x000000000000000d (FINI)               0x347b8
 0x0000000000000019 (INIT_ARRAY)         0x3bb68

As you can see, the rpath is set correctly on my side.

@denghuilu
Member

Here's the output:

root denghui $ readelf -d /root/denghui/deepmd_root/lib/libdeepmd_cc.so | head -20

Dynamic section at offset 0x2ec68 contains 37 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_cc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd_op_cuda.so]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgomp.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libdeepmd_cc.so]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN:/root/denghui/tensorflow_root/lib]
 0x000000000000000c (INIT)               0x90f0
 0x000000000000000d (FINI)               0x28e0c
 0x0000000000000019 (INIT_ARRAY)         0x22eac8
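
To compare the two `readelf -d` outputs above, the RPATH/RUNPATH entry can be extracted and `$ORIGIN` expanded to the directory that contains the library. The sketch below assumes the `readelf` text format shown in this thread; the function name `library_search_dirs` is hypothetical.

```python
import posixpath
import re

# Matches lines such as:
#  0x000000000000000f (RPATH)   Library rpath: [$ORIGIN:/some/dir]
_ENTRY = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def library_search_dirs(readelf_output, object_path):
    """Return {tag: [dirs]} from `readelf -d` text, expanding $ORIGIN."""
    # $ORIGIN stands for the directory containing the object itself.
    origin = posixpath.dirname(object_path)
    result = {}
    for line in readelf_output.splitlines():
        match = _ENTRY.search(line)
        if match:
            tag, value = match.groups()
            result[tag] = [d.replace("$ORIGIN", origin) for d in value.split(":") if d]
    return result


SAMPLE = """\
 0x000000000000000e (SONAME)             Library soname: [libdeepmd_cc.so]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN:/root/denghui/tensorflow_root/lib]
"""

if __name__ == "__main__":
    print(library_search_dirs(SAMPLE, "/root/denghui/deepmd_root/lib/libdeepmd_cc.so"))
```

The same function handles the `(RUNPATH)` form, so it can be pointed at the output for a LAMMPS binary as well.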

@njzjz
Member Author

njzjz commented Jan 17, 2022

Could you also check the LAMMPS binary?

(base) [jz748@localhost src]$ readelf -d lmp_serial | head -20

Dynamic section at offset 0x676d90 contains 32 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd_cc.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_cc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [/home/jz748/codes/deepmd-kit/dp/lib]
 0x000000000000000c (INIT)               0x408000
 0x000000000000000d (FINI)               0x97b788
 0x0000000000000019 (INIT_ARRAY)         0xa778e8
 0x000000000000001b (INIT_ARRAYSZ)       80 (bytes)
 0x000000000000001a (FINI_ARRAY)         0xa77938
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000004 (HASH)               0x400378
 0x000000006ffffef5 (GNU_HASH)           0x400c58
 0x0000000000000005 (STRTAB)             0x402b28

@denghuilu
Member

readelf -d lmp_mpi | head -20

Dynamic section at offset 0x59cd98 contains 33 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd_cc.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_cc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.40]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/root/denghui/tensorflow_root/lib:/root/denghui/deepmd_root/lib:/root/denghui/openmpi-4.0.6/lib]
 0x000000000000000c (INIT)               0x407948
 0x000000000000000d (FINI)               0x8b9614
 0x0000000000000019 (INIT_ARRAY)         0xb9cd38
 0x000000000000001b (INIT_ARRAYSZ)       80 (bytes)
 0x000000000000001a (FINI_ARRAY)         0xb9cd88
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x400298
root src $ cd ..
root lammps $ cd ..
root denghui $ cd deepmd-kit/examples/water/lmp/
root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps 
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:       dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-60-g0f61527-dirty
  source branch:      dynamically-load-op-library
  source commit:      0f61527
  source commit at:   2022-01-09 06:24:45 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
*** Error in `/root/denghui/lammps/src/lmp_mpi': free(): invalid pointer: 0x0000000002903268 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7ff180a63329]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(+0x1eead)[0x7ff19c9e6ead]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd15load_op_libraryEv+0x104)[0x7ff19c9e75f4]
/root/denghui/deepmd_root/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot4initERKSsRKiS2_+0x5d)[0x7ff19c9dd9ed]
/root/denghui/lammps/src/lmp_mpi[0x86ed77]
/root/denghui/lammps/src/lmp_mpi[0x40c5b2]
/root/denghui/lammps/src/lmp_mpi[0x414441]
Not found: libdeepmd_op.so: cannot open shared object file: No such file or directory
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff180a04555]
/root/denghui/lammps/src/lmp_mpi[0x409fe8]

@njzjz
Member Author

njzjz commented Jan 17, 2022

@denghuilu Please set the following environment variable before running LAMMPS:

export LD_DEBUG=libs

It will give the following information:

      7598:     find library=libdeepmd_op.so [0]; searching
      7598:      search path=/home/jz748/codes/deepmd-kit/dp/lib:/home/jz748/codes/deepmd-kit/dp/lib/.          (RPATH from file /home/jz748/codes/deepmd-kit/dp/lib/libtensorflow_cc.so.2)
      7598:       trying file=/home/jz748/codes/deepmd-kit/dp/lib/libdeepmd_op.so
      7598:
      7598:
      7598:     calling init: /home/jz748/codes/deepmd-kit/dp/lib/libdeepmd_op.so
      7598:

We can see the search path it tries.
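
When the LD_DEBUG=libs output is long, it can help to pull out only the candidates the loader tried for one library. The following is a minimal sketch, not part of DeePMD-kit: the helper names tried_files and directory_was_searched are hypothetical, and it assumes the log has already been captured as text in the "trying file=" format shown above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Collect every candidate path the loader reported for `lib` in
// LD_DEBUG=libs output, i.e. the lines containing "trying file=".
// Hypothetical helper for illustration only.
std::vector<std::string> tried_files(const std::string& ld_debug_log,
                                     const std::string& lib) {
  const std::string marker = "trying file=";
  std::vector<std::string> out;
  std::istringstream in(ld_debug_log);
  std::string line;
  while (std::getline(in, line)) {
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) continue;
    std::string path = line.substr(pos + marker.size());
    // Trim trailing whitespace such as spaces, tabs, or '\r'.
    const std::size_t end = path.find_last_not_of(" \t\r");
    if (end == std::string::npos) continue;
    path.erase(end + 1);
    // Keep only candidates whose file name is exactly `lib`.
    const std::size_t slash = path.find_last_of('/');
    const std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    if (name == lib) out.push_back(path);
  }
  return out;
}

// True if the loader tried `dir`/`lib` according to the log.
bool directory_was_searched(const std::string& ld_debug_log,
                            const std::string& lib,
                            const std::string& dir) {
  const std::string wanted = dir + "/" + lib;
  for (const std::string& p : tried_files(ld_debug_log, lib)) {
    if (p == wanted) return true;
  }
  return false;
}
```

Calling directory_was_searched with the captured log, "libdeepmd_op.so", and the DeePMD-kit lib directory shows whether that directory was consulted at all.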

@denghuilu
Member

 build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
      2236:
      2236:     calling init: /root/denghui/openmpi-4.0.6/lib/openmpi/mca_topo_treematch.so
      2236:
      2236:
      2236:     calling init: /root/denghui/openmpi-4.0.6/lib/openmpi/mca_topo_basic.so
      2236:
      2236:     find library=libdeepmd_op.so [0]; searching
      2236:      search path=           (RPATH from file /root/denghui/tensorflow_root/lib/libtensorflow_cc.so.2)
      2236:      search path=/root/denghui/tensorflow_root/lib          (RUNPATH from file /root/denghui/lammps/src/lmp_mpi)
      2236:       trying file=/root/denghui/tensorflow_root/lib/libdeepmd_op.so
      2236:      search path=/root/denghui/openmpi-4.0.6/lib            (RUNPATH from file /root/denghui/lammps/src/lmp_mpi)
      2236:       trying file=/root/denghui/openmpi-4.0.6/lib/libdeepmd_op.so
      2236:      search path=/usr/local/cuda-11.0/lib64:tls/x86_64:tls:x86_64:          (LD_LIBRARY_PATH)
      2236:       trying file=/usr/local/cuda-11.0/lib64/libdeepmd_op.so
      2236:       trying file=tls/x86_64/libdeepmd_op.so
      2236:       trying file=tls/libdeepmd_op.so
      2236:       trying file=x86_64/libdeepmd_op.so
      2236:       trying file=libdeepmd_op.so
      2236:      search cache=/etc/ld.so.cache
      2236:      search path=/lib64/tls/x86_64:/lib64/tls:/lib64/x86_64:/lib64:/usr/lib64/tls:/usr/lib64                (system search path)
      2236:       trying file=/lib64/tls/x86_64/libdeepmd_op.so
      2236:       trying file=/lib64/tls/libdeepmd_op.so
      2236:       trying file=/lib64/x86_64/libdeepmd_op.so
      2236:       trying file=/lib64/libdeepmd_op.so
      2236:       trying file=/usr/lib64/tls/libdeepmd_op.so
      2236:       trying file=/usr/lib64/libdeepmd_op.so
      2236:
*** Error in `/root/denghui/lammps/src/lmp_mpi': free(): invalid pointer: 0x0000000003b1e1a8 ***

@denghuilu
Member

$deepmd_root/lib is not within the search path.

@njzjz
Member Author

njzjz commented Jan 18, 2022

fabdac9 should fix it.

@denghuilu
Member

The same error...

root lmp $ git log | head
commit fabdac91d42004409042c87456930797fdc39880
Author: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Date:   Mon Jan 17 21:35:27 2022 -0500

    add the absolute path of library directory to cc rpath

commit e988c401f4432082ef2fca0a06d15aebef75be67
Author: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Date:   Mon Jan 17 14:19:31 2022 -0500
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-62-gfabdac9-dirty
  source branch:       dynamically-load-op-library
  source commit:      fabdac9
  source commit at:   2022-01-17 21:35:27 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-62-gfabdac9-dirty
  source branch:      dynamically-load-op-library
  source commit:      fabdac9
  source commit at:   2022-01-17 21:35:27 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
     21609:
     21609:     calling init: /root/denghui/openmpi-4.0.6/lib/openmpi/mca_topo_treematch.so
     21609:
     21609:
     21609:     calling init: /root/denghui/openmpi-4.0.6/lib/openmpi/mca_topo_basic.so
     21609:
     21609:     find library=libdeepmd_op.so [0]; searching
     21609:      search path=           (RPATH from file /root/denghui/tensorflow_root/lib/libtensorflow_cc.so.2)
     21609:      search path=/root/denghui/tensorflow_root/lib          (RUNPATH from file /root/denghui/lammps/src/lmp_mpi)
     21609:       trying file=/root/denghui/tensorflow_root/lib/libdeepmd_op.so
     21609:      search path=/root/denghui/openmpi-4.0.6/lib            (RUNPATH from file /root/denghui/lammps/src/lmp_mpi)
     21609:       trying file=/root/denghui/openmpi-4.0.6/lib/libdeepmd_op.so
     21609:      search path=/usr/local/cuda-11.0/lib64:tls/x86_64:tls:x86_64:          (LD_LIBRARY_PATH)
     21609:       trying file=/usr/local/cuda-11.0/lib64/libdeepmd_op.so
     21609:       trying file=tls/x86_64/libdeepmd_op.so
     21609:       trying file=tls/libdeepmd_op.so
     21609:       trying file=x86_64/libdeepmd_op.so
     21609:       trying file=libdeepmd_op.so
     21609:      search cache=/etc/ld.so.cache
     21609:      search path=/lib64/tls/x86_64:/lib64/tls:/lib64/x86_64:/lib64:/usr/lib64/tls:/usr/lib64                (system search path)
     21609:       trying file=/lib64/tls/x86_64/libdeepmd_op.so
     21609:       trying file=/lib64/tls/libdeepmd_op.so
     21609:       trying file=/lib64/x86_64/libdeepmd_op.so
     21609:       trying file=/lib64/libdeepmd_op.so
     21609:       trying file=/usr/lib64/tls/libdeepmd_op.so
     21609:       trying file=/usr/lib64/libdeepmd_op.so
     21609:
*** Error in `/root/denghui/lammps/src/lmp_mpi': free(): invalid pointer: 0x0000000003112268 ***

@njzjz
Member Author

njzjz commented Jan 18, 2022

Ok, I'll take a look...

@njzjz
Member Author

njzjz commented Jan 18, 2022

@denghuilu I finally reproduced it by adding -Wl,--enable-new-dtags. This option is enabled by default in newer toolchains.

In that case, the linker writes RUNPATH instead of RPATH:

 0x000000000000001d (RUNPATH)            Library runpath: [/root/denghui/tensorflow_root/lib:/root/denghui/deepmd_root/lib:/root/denghui/openmpi-4.0.6/lib]

See https://stackoverflow.com/a/43703445/9567349 and https://stackoverflow.com/a/52020177/9567349.

Adding the -Wl,--disable-new-dtags flag resolves the issue, but I am looking for a more reasonable solution...

@njzjz
Member Author

njzjz commented Jan 18, 2022

OK, I give up looking for other solutions... Adding -Wl,--disable-new-dtags should work (this flag has been available since 2000). @denghuilu Could you also check which tag dp_ipi uses, RPATH or RUNPATH, and whether it works? CMake's default behavior is unclear.
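
Checking which tag a binary carries comes down to looking for an (RPATH) or (RUNPATH) entry in the output of readelf -d. A small sketch of that check follows, assuming the readelf text has already been captured as a string; the enum PathTag and the function path_tag are hypothetical names used only for illustration. If both tags are present, it reports RUNPATH, since glibc's loader ignores DT_RPATH when DT_RUNPATH exists.

```cpp
#include <sstream>
#include <string>

// Which search-path tag a dynamic section carries.
enum class PathTag { None, Rpath, Runpath };

// Scan the text printed by `readelf -d <binary>` for an (RPATH) or
// (RUNPATH) entry. Hypothetical helper for illustration only.
PathTag path_tag(const std::string& readelf_dynamic) {
  std::istringstream in(readelf_dynamic);
  std::string line;
  bool has_rpath = false;
  while (std::getline(in, line)) {
    // RUNPATH wins: the loader ignores RPATH when RUNPATH is present.
    if (line.find("(RUNPATH)") != std::string::npos) return PathTag::Runpath;
    if (line.find("(RPATH)") != std::string::npos) has_rpath = true;
  }
  return has_rpath ? PathTag::Rpath : PathTag::None;
}
```

Feeding it the readelf -d output of lmp_mpi or dp_ipi tells at a glance whether --disable-new-dtags actually took effect at link time.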

@denghuilu
Member

Nothing changed after setting -Wl,--disable-new-dtags:

mpicxx -g -O3 -std=c++11 main.o -L/root/denghui/tensorflow_root/lib -L/root/denghui/tensorflow_root/lib -L/root/denghui/deepmd_root/lib     -L. -llammps_mpi -Wl,--no-as-needed -ldeepmd_cc -ltensorflow_cc -ltensorflow_framework -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/deepmd_root/lib -Wl,--disable-new-dtags      -o ../lmp_mpi
size ../lmp_mpi
   text    data     bss     dec     hex filename
5882685   12560   17384 5912629  5a3835 ../lmp_mpi
make[1]: Leaving directory `/root/denghui/lammps/src/Obj_mpi'
root src $ readelf -d lmp_mpi | head -20

Dynamic section at offset 0x59cd98 contains 33 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdeepmd_cc.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_cc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libtensorflow_framework.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.40]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/root/denghui/tensorflow_root/lib:/root/denghui/deepmd_root/lib:/root/denghui/openmpi-4.0.6/lib]
 0x000000000000000c (INIT)               0x407948
 0x000000000000000d (FINI)               0x8b9614
 0x0000000000000019 (INIT_ARRAY)         0xb9cd38
 0x000000000000001b (INIT_ARRAYSZ)       80 (bytes)
 0x000000000000001a (FINI_ARRAY)         0xb9cd88
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x400298
root src $ cat Makefile.package
# Settings for libraries used by specific LAMMPS packages
# this file is auto-edited when those packages are included/excluded

PKG_INC =     -std=c++11 -DHIGH_PREC  -DLAMMPS_VERSION_NUMBER=20210929 -I/root/denghui/tensorflow_root/include -I/root/denghui/tensorflow_root/include -I/root/denghui/deepmd_root/include/ 
PKG_PATH =    -L/root/denghui/tensorflow_root/lib -L/root/denghui/tensorflow_root/lib -L/root/denghui/deepmd_root/lib
PKG_LIB =     -Wl,--no-as-needed -ldeepmd_cc -ltensorflow_cc -ltensorflow_framework -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/tensorflow_root/lib -Wl,-rpath=/root/denghui/deepmd_root/lib -Wl,--disable-new-dtags
PKG_CPP_DEPENDS = 
PKG_LINK_DEPENDS = 

PKG_SYSINC =  
PKG_SYSLIB =  
PKG_SYSPATH = 

@denghuilu
Member

@njzjz Here's my environment:

CentOS Linux release 7.9.2009 (Core)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
cuda-11.0, V11.0.194
openmpi-4.0.6
python 3.7.0
tensorflow-gpu-2.6.0
cmake version 3.21.3
LAMMPS (29 Sep 2021)

@njzjz
Member Author

njzjz commented Jan 19, 2022

I have no idea why, but note that ld is the actual linker that gcc calls.

This flag is added by OpenMPI, see open-mpi/ompi#1089

@njzjz
Member Author

njzjz commented Jan 19, 2022

If mpicxx adds the flag, I don't think we can override it though.

@njzjz
Member Author

njzjz commented Jan 19, 2022

In de00e04, I call dlopen from our own library instead of using TF's function. @denghuilu I think it will also work with RUNPATH.
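
As a rough illustration of the approach (not the actual code from de00e04), a library can load the op library itself with dlopen. According to the dlopen(3) manual, a name without a slash is resolved using the RPATH or RUNPATH of the calling object, so a call made from a library whose own RUNPATH lists the install directory can find the file there. The function name load_library and the flags RTLD_NOW | RTLD_GLOBAL are assumptions made for this sketch.

```cpp
#include <dlfcn.h>

#include <string>

// Hypothetical helper, not the function added in de00e04.
// Tries to load `name` with dlopen and returns an empty string on
// success, or the loader's error message on failure.
std::string load_library(const std::string& name) {
  dlerror();  // clear any stale error state
  // The flags are an assumption for this sketch.
  void* handle = dlopen(name.c_str(), RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    const char* msg = dlerror();
    return msg != nullptr ? std::string(msg)
                          : std::string("dlopen failed: ") + name;
  }
  // The handle is deliberately never closed, so the loaded library
  // stays mapped for the lifetime of the process.
  return std::string();
}
```

A caller would invoke load_library("libdeepmd_op.so") once during initialization and report the returned message if it is not empty. On glibc versions before 2.34, the program also has to be linked with -ldl.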

Member

@denghuilu denghuilu left a comment

It's my mistake. After recompiling DeePMD-kit, everything works fine.

root lmp $ git log | head
commit de00e04206b93bf87e9c4b64a097266455ccb015
Author: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Date:   Wed Jan 19 04:52:07 2022 -0500

    dlopen from dp lib but not TF

commit 3362f99b014259d25b981ad6fa04fe26e5ed3873
Author: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Date:   Wed Jan 19 04:14:55 2022 -0500

root lmp $ echo $LD_LIBRARY_PATH
/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:
root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps 
LAMMPS (29 Sep 2021)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  192 atoms
  read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /root/denghui/deepmd_root
  source:             v2.0.2-66-gde00e04-dirty
  source branch:       dynamically-load-op-library
  source commit:      de00e04
  source commit at:   2022-01-19 04:52:07 -0500
  surpport model ver.:1.1 
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
  set tf intra_op_parallelism_threads: 0
  set tf inter_op_parallelism_threads: 0
  >>> Info of lammps module:
  use deepmd-kit at:  /root/denghui/deepmd_root
  source:             v2.0.2-66-gde00e04-dirty
  source branch:      dynamically-load-op-library
  source commit:      de00e04
  source commit at:   2022-01-19 04:52:07 -0500
  build float prec:   double
  build with tf inc:  /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
  build with tf lib:  /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
2022-01-21 09:20:11.789581: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-21 09:20:11.789987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:11.801557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:11.802645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.474975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.476106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.477162: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.478219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31006 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0
  >>> Info of model(s):
  using   1 model(s): frozen_model.pb 
  rcut in model:      6
  ntypes in model:    2

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Neighbor list info ...
  update every 10 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
      attributes: full, newton on
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
Per MPI rank memory allocation (min/avg/max) = 3.908 | 3.908 | 3.908 Mbytes
Step PotEng KinEng TotEng Temp Press Volume 
       0   -29944.158    8.1472669   -29936.011          330    37078.187    1927.3176 
     100   -29943.989    7.9877789   -29936.001    323.54004    27603.467    1927.3176 
     200   -29943.349    7.3418604   -29936.007    297.37751    32879.887    1927.3176 
     300   -29944.262    8.2516105   -29936.011    334.22637    27118.163    1927.3176 
     400   -29944.503    8.4884408   -29936.014    343.81903    26527.481    1927.3176 
     500   -29944.535     8.514281   -29936.021    344.86568    40825.342    1927.3176 
     600   -29944.479    8.4484458   -29936.031    342.19906    26730.448    1927.3176 
     700    -29944.57    8.5090059   -29936.061    344.65201    27365.977    1927.3176 
     800   -29943.903    7.8286542   -29936.074    317.09479    34878.898    1927.3176 
     900   -29944.711    8.6057383   -29936.106     348.5701    34243.605    1927.3176 
    1000   -29944.493    8.3574289   -29936.136    338.51248    34715.817    1927.3176 
Loop time of 4.74255 on 1 procs for 1000 steps with 192 atoms

Performance: 9.109 ns/day, 2.635 hours/ns, 210.857 timesteps/s
97.5% CPU use with 1 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 4.576      | 4.576      | 4.576      |   0.0 | 96.49
Neigh   | 0.13852    | 0.13852    | 0.13852    |   0.0 |  2.92
Comm    | 0.015699   | 0.015699   | 0.015699   |   0.0 |  0.33
Output  | 0.0030058  | 0.0030058  | 0.0030058  |   0.0 |  0.06
Modify  | 0.0065539  | 0.0065539  | 0.0065539  |   0.0 |  0.14
Other   |            | 0.002799   |            |       |  0.06

Nlocal:        192.000 ave         192 max         192 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        2066.00 ave        2066 max        2066 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      40898.0 ave       40898 max       40898 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 40898
Ave neighs/atom = 213.01042
Neighbor list builds = 100
Dangerous builds not checked
Total wall time: 0:00:07

@wanghan-iapcm wanghan-iapcm merged commit 7068698 into deepmodeling:devel Jan 21, 2022
@njzjz njzjz mentioned this pull request Mar 11, 2022
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Mar 11, 2022
The library type was changed from SHARED to MODULE in deepmodeling#1384.

Fixes errors in conda-forge/deepmd-kit-feedstock#31
@njzjz njzjz deleted the dynamically-load-op-library branch March 11, 2022 22:30
wanghan-iapcm pushed a commit that referenced this pull request Mar 12, 2022
The library type was changed from SHARED to MODULE in #1384.

Fixes errors in conda-forge/deepmd-kit-feedstock#31