Compiling SOK fails: tensorflow/core/kernels/gpu_device_array.h not found #973

Closed
kangna-qi opened this issue Feb 27, 2024 · 3 comments

kangna-qi commented Feb 27, 2024

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Ubuntu 20.04.2
  • DeepRec version or commit id: deeprec2310
  • Python version: 3.8
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.4.0
  • CUDA/cuDNN version: 11.7

Describe the problem
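I followed the SOK documentation to build it, and the build fails because tensorflow/core/kernels/gpu_device_array.h cannot be found. Looking under /tensorflow_core/include, that path is indeed missing, even though the header exists in the source tree. Error log: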

[ 87%] Building CUDA object experiment/CMakeFiles/sok_experiment.dir/variable/impl/variable_base.cu.o
[ 88%] Building CXX object experiment/CMakeFiles/sok_experiment.dir/gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.cpp.o
In file included from /usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var.h:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.hpp:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.cpp:17:
/usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var_context.h:22: warning: "EIGEN_USE_GPU" redefined
   22 | #define EIGEN_USE_GPU
      | 
<command-line>: note: this is the location of the previous definition
In file included from /usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var.h:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.hpp:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.cpp:17:
/usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var_context.h:23:10: fatal error: tensorflow/core/kernels/gpu_device_array.h: No such file or directory
   23 | #include "tensorflow/core/kernels/gpu_device_array.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [experiment/CMakeFiles/sok_experiment.dir/build.make:398: experiment/CMakeFiles/sok_experiment.dir/gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var.h:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.hpp:31,
                 from /tmp/external/hugectr/sparse_operation_kit/experiment/lookup/kernels/embedding_collection.cc:1042:
/usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var_context.h:22: warning: "EIGEN_USE_GPU" redefined
   22 | #define EIGEN_USE_GPU
      | 
<command-line>: note: this is the location of the previous definition
In file included from /usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var.h:31,
                 from /gpu_train/DeepRec-deeprec2310/addons/sparse_operation_kit/core/adapter/lookup_adapter.hpp:31,
                 from /tmp/external/hugectr/sparse_operation_kit/experiment/lookup/kernels/embedding_collection.cc:1042:
/usr/local/lib/python3.8/dist-packages/tensorflow_core/include/tensorflow/core/framework/embedding/embedding_var_context.h:23:10: fatal error: tensorflow/core/kernels/gpu_device_array.h: No such file or directory
   23 | #include "tensorflow/core/kernels/gpu_device_array.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [experiment/CMakeFiles/sok_experiment.dir/build.make:146: experiment/CMakeFiles/sok_experiment.dir/lookup/kernels/embedding_collection.cc.o] Error 1
/usr/local/lib/python3.8/dist-packages/tensorflow_core/include/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/DisableStupidWarnings.h(74): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
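
The observation above (the header exists in the source tree but not in the installed package) can be confirmed with a short check. This is a minimal sketch: the function name `has_header` is hypothetical, and the example paths are copied from the error log, so they may differ in other environments.

```shell
# Hypothetical helper: report whether a header exists under an include root.
# Usage: has_header <include_root> <relative_header_path>
has_header() {
  if [ -f "$1/$2" ]; then
    echo "found: $2"
    return 0
  fi
  echo "missing: $2"
  return 1
}

# Example call, using the include root and header named in the log:
# has_header /usr/local/lib/python3.8/dist-packages/tensorflow_core/include \
#   tensorflow/core/kernels/gpu_device_array.h
```

Running the same check against the unpacked source directory (DeepRec-deeprec2310) would show whether the file is only missing from the installed include tree.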

Provide the exact sequence of commands / steps that you executed before running into the problem
Docker image: alideeprec/deeprec-release:deeprec2310-gpu-py38-cu116-ubuntu20.04

# Install OpenMPI
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar -xvf openmpi-4.0.1.tar.gz
cd openmpi-4.0.1
./configure --prefix=/usr/local/openmpi
make
make install
export MPI_HOME=/usr/local/openmpi # add the install path as an environment variable
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
 
# Install Horovod
HOROVOD_NCCL_LINK=SHARED HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
 
# Install Bazel
wget https://github.com/bazelbuild/bazel/releases/download/0.26.1/bazel-0.26.1-installer-linux-x86_64.sh
chmod +x bazel-0.26.1-installer-linux-x86_64.sh
./bazel-0.26.1-installer-linux-x86_64.sh
source /usr/local/lib/bazel/bin/bazel-complete.bash
 
# Build SOK
apt install libnuma-dev
wget https://github.com/DeepRec-AI/DeepRec/archive/refs/heads/deeprec2310.zip
unzip deeprec2310.zip
cd DeepRec-deeprec2310
./configure 
bazel --output_base /tmp build -j 16  -c opt --config=opt  //tensorflow/tools/pip_package:build_sok && ./bazel-bin/tensorflow/tools/pip_package/build_sok
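
Before starting the long Bazel build, it can help to confirm that the tools installed in the steps above are actually on PATH. The following is a minimal sketch; `require_cmd` is a hypothetical helper name, and the tool list in the example is an assumption based on the commands used above.

```shell
# Hypothetical helper: fail early if a required tool is not on PATH.
# Usage: require_cmd <command_name>
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
    return 0
  fi
  echo "not found on PATH: $1" >&2
  return 1
}

# Example: check the tools the steps above install or rely on.
# for tool in mpirun bazel unzip; do require_cmd "$tool" || exit 1; done
```

This only verifies that the commands resolve; it does not check versions or whether the installed TensorFlow headers are complete.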

Include any logs or source code that would be helpful to diagnose the problem.

candyzone (Collaborator) commented

@Mesilenceki

Mesilenceki (Contributor) commented

Hi kangna,
You can refer to the following link: Dockerfile

Mesilenceki (Contributor) commented

BTW, we also provide a ready-made image: alideeprec/deeprec-release:deeprec2310-gpu-py38-cu116-ubuntu20.04-hybridbackend
