A complete installation and usage guide #1

winston-li · 2021-11-01T07:43:28Z

Hi,
I read your great paper and was excited to give it a try. I followed README.md and executed "hydra_train.sh". However, it fails with ModuleNotFoundError: "No module named 'cyy_torch_cpp_extension.data_structure'". It looks like it needs the module "torch_cpp_extension", which in turn requires building CyyAlgorithmLib (repository cyyever/algorithm)? I got stuck and couldn't build it successfully (fatal error: #include in algorithm/src/alphabet/alphabet.hpp). Did I misunderstand some steps, or are the installation steps out of date?

Thanks.

poppopbean0903 · 2023-11-27T13:28:24Z

Hi,

Have you solved the above problem? I hit a similar error when running hydra_train.py. It fails with "ImportError: cannot import name 'SyncedTensorDict' from 'cyy_torch_algorithm.data_structure.synced_tensor_dict' (/home/amax/.local/lib/python3.11/site-packages/cyy_torch_algorithm/data_structure/synced_tensor_dict.py)".

Thanks.

cyyever · 2023-11-27T13:58:41Z

@poppopbean0903 You need to build a PyTorch extension as follows:

git clone --recursive git@github.com:cyyever/torch_cpp_extension.git
cd torch_cpp_extension
mkdir build && cd build
cmake -DBUILD_SHARED_LIBS=on ..
sudo make install
env cmake_build_dir=build python3 setup.py install --user

poppopbean0903 · 2023-11-28T07:57:51Z

Sorry to bother you,

I updated CMake to version 3.28. When I run "cmake -DBUILD_SHARED_LIBS=on ..", it keeps reporting errors about missing packages such as fmt, doctest, and spdlog, and I have to install them one by one. I'm wondering whether I missed a step beyond the ones you gave above, which would explain the endless missing dependencies.

I'm sorry, I know little about CMake and can't find the exact cause, so I'm asking you for more technical guidance. Many thanks.

The exact error looks like:
"CMake Error at python_binding/CMakeLists.txt:2 (find_package):
Could not find a package configuration file provided by "pybind11" with any
of the following names:

pybind11Config.cmake
pybind11-config.cmake

"

cyyever · 2023-11-28T08:05:09Z

@poppopbean0903 I submitted some fixes to disable building tests by default. You can git pull the new code and re-build.

poppopbean0903 · 2023-11-28T08:37:10Z

The new version still reports errors like "Could not find a package configuration file provided by "spdlog" with any of the following names", and dependencies such as pybind11 are still missing.

It seems the error arises at "include(cmake/all.cmake)". I'm wondering whether these errors are related to my conda environment or to my CMake version? Many thanks!

The complete error is
"
-- Could NOT find clang-tidy (missing: clang-tidy_BINARY)
-- Could NOT find run-clang-tidy (missing: run-clang-tidy_BINARY)
-- Could NOT find iwyu_tool (missing: iwyu_tool_BINARY)
CMake Warning at cmake/build_cache.cmake:9 (message):
no ccache found
Call Stack (most recent call first):
cmake/all.cmake:23 (include)
CMakeLists.txt:7 (include)

-- Caffe2: CUDA detected: 10.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.0
-- Found cuDNN: v7.6.5 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s): 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Build spdlog: 1.12.0
-- Build type: Debug
CMake Error at python_binding/CMakeLists.txt:2 (find_package):
Could not find a package configuration file provided by "pybind11" with any
of the following names:

pybind11Config.cmake
pybind11-config.cmake

Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set
"pybind11_DIR" to a directory containing one of the above files. If
"pybind11" provides a separate development package or SDK, be sure it has
been installed.

-- Configuring incomplete, errors occurred! "

cyyever · 2023-11-28T08:42:28Z

@poppopbean0903 I see. I am fixing it.

cyyever · 2023-11-28T08:55:11Z

@poppopbean0903 I added the missing pybind11 as a git sub-module. The easiest way to build is to remove the old package and follow the new steps:

git clone --recursive git@github.com:cyyever/torch_cpp_extension.git
cd torch_cpp_extension
mkdir build && cd build
cmake -DBUILD_SHARED_LIBS=on ..
cmake --build . --config release
cd ..
env cmake_build_dir=build python3 setup.py install --user
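
After the steps above finish, one quick sanity check is to ask Python whether it can locate the installed packages. The sketch below is only an illustration and not part of the project: the helper name is_importable is hypothetical, and the module names are the ones that appear in the error messages earlier in this thread. It only checks that the packages can be found, not that every symbol inside them imports cleanly.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate module_name on the current path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


# Module names taken from the error messages in this thread.
for name in ("cyy_torch_cpp_extension", "cyy_torch_algorithm"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

Run it with the same interpreter that was used for "python3 setup.py install --user"; a different interpreter may report "missing" even after a successful install.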

poppopbean0903 · 2023-11-28T11:28:28Z

Thanks a lot, the above issue is solved. But another problem has come up: CMake seems to be creating 'torch_library' twice, although I didn't create it explicitly and I have cleaned the build directory.
Many thanks !

"CMake Error at /home/pami/anaconda3/lib/python3.6/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:40 (add_library):
add_library cannot create target "torch_library" because another target
with the same name already exists. The existing target is an interface
library created in source directory "/home/DiskA/torch_cpp_extension".
See documentation for policy CMP0002 for more details."

cyyever · 2023-11-28T11:30:55Z

@poppopbean0903 You need Python 3.11 and torch >= 2.1 for this to work. This specific PyTorch error comes from an older version that I didn't test.
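
A small check of the interpreter and torch versions can catch this kind of mismatch before running CMake. The sketch below is an assumption-laden illustration: version_tuple and meets_requirements are hypothetical helpers, and treating Python 3.11 as a minimum (rather than an exact requirement) is my reading of the comment above.

```python
import re
import sys


def version_tuple(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(python_version: tuple[int, int], torch_version: str) -> bool:
    """Check against the minimums quoted in this thread (assumed to be lower bounds)."""
    return python_version >= (3, 11) and version_tuple(torch_version) >= (2, 1)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this interpreter")
    else:
        ok = meets_requirements(sys.version_info[:2], torch.__version__)
        print("requirements met" if ok else "Python or torch is too old")
```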

poppopbean0903 · 2023-11-28T12:51:13Z

Thanks a lot, I think that may be the crux. I tried updating torch to 2.0 but failed because of my older CUDA version (10.0). Sorry to bother you with such a technical problem, but is it possible to run your code with an older version? Updating CUDA on the server has caused serious problems before, so I'd rather avoid it if there is another solution. Or is it possible to install torch >= 2.1 with CUDA 10.0? (As far as I know, it is impossible.) I'm very sorry to bother you with this kind of problem, but I really want to get your code working. Thank you very much.

cyyever · 2023-11-30T01:09:08Z

@poppopbean0903 Why not try it in a CUDA Docker container? Indeed, the code relies heavily on new APIs in the latest PyTorch for better performance. I will build a Docker image for your convenience.

poppopbean0903 · 2023-11-30T13:08:04Z

Thanks a lot, I've installed python=3.11 and torch = 2.1. Sorry to keep disturbing you, but I still run into the following problem. It seems some dependencies are missing, and it looks related to the CMake version. My CMake version is 3.24.1, which is higher than the required 3.20?

"CMake Error at cmake/all.cmake:1 (cmake_policy):
An attempt was made to set the policy version of CMake to "3.25.0" which is
greater than this version of CMake. This is not allowed because the
greater version may have new policies not known to this CMake. You may
need a newer CMake version to build this project.
Call Stack (most recent call first):
CMakeLists.txt:7 (include)

-- Could NOT find clang-tidy (missing: clang-tidy_BINARY)
-- Could NOT find clang-apply-replacements (missing: clang-apply-replacements_BINARY)
-- Could NOT find run-clang-tidy (missing: run-clang-tidy_BINARY)
-- Could NOT find iwyu_tool (missing: iwyu_tool_BINARY)
CMake Warning at cmake/build_cache.cmake:9 (message):
no ccache found
Call Stack (most recent call first):
cmake/all.cmake:23 (include)
CMakeLists.txt:7 (include)

CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.24/Modules/CMakeDetermineCUDACompiler.cmake:277 (message):
CMAKE_CUDA_ARCHITECTURES must be non-empty if set.
Call Stack (most recent call first):
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:47 (enable_language)
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:87 (include)
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:25 (find_package)
-- Configuring incomplete, errors occurred! "

cyyever · 2023-11-30T13:31:22Z

The log shows that python3.8 was used. You can try removing the CMake version requirement. Find the relevant lines with

grep 'cmake_policy(VERSION 3.25.0)' -r third_party cmake

and remove them. But I think it is better to use the Docker image, which I will deliver soon.
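
For anyone without grep at hand, the same search can be sketched in Python. This is only an illustration under stated assumptions: find_policy_lines is a hypothetical helper, the directory names third_party and cmake come from the grep command above, and the script only lists matches so that the lines can be reviewed before anything is deleted.

```python
from pathlib import Path

# The exact text searched for by the grep command in this thread.
PATTERN = "cmake_policy(VERSION 3.25.0)"


def find_policy_lines(roots, pattern=PATTERN):
    """Return a list of (path, line_number, line) for lines containing pattern."""
    hits = []
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # Skip directories that do not exist in this checkout.
        for path in sorted(root_path.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # Unreadable file; ignore it.
            for number, line in enumerate(text.splitlines(), start=1):
                if pattern in line:
                    hits.append((path, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the root of the torch_cpp_extension checkout.
    for path, number, line in find_policy_lines(["third_party", "cmake"]):
        print(f"{path}:{number}: {line}")
```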

poppopbean0903 · 2023-11-30T13:49:22Z

Oh, that's great! Looking forward to your Docker image, many thanks!

cyyever · 2023-11-30T17:10:48Z

@poppopbean0903
If you want to use CUDA, make sure that the host NVIDIA driver is >= 545.29.06 and that a recent (edge) Docker is configured with the CUDA runtime; see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

docker pull cyyever/aaai_hydra:latest
sudo docker run --gpus all  -it --rm  aaai_hydra:latest bash

If setting up the CUDA runtime turns out to be too hard, it is fine to try CPU training and just use

docker pull cyyever/aaai_hydra:latest
sudo docker run -it --rm  aaai_hydra:latest bash

My code will detect that CUDA is unavailable and fall back to the CPU (practical if you have a powerful CPU).

Either way, once you are inside the Docker container, try

cd  /root/aaai_hydra
env PYTHONPATH=/root/opt/python/lib/python3.11/site-packages /root/opt/python/bin/python3   lean_hydra_train.py --config-name mnist.yaml
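
The CPU fallback mentioned above follows a common PyTorch pattern, which can be checked from inside the container before starting a long run. The sketch below is not the project's detection code: pick_device is a hypothetical helper, and treating a missing torch install as CPU-only is an assumption made to keep the example self-contained.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: no torch means no GPU training either.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"training would run on: {pick_device()}")
```

If this prints "cpu" in a container started with "--gpus all", the host driver or container runtime setup is the likely place to look.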

poppopbean0903 · 2023-12-02T03:32:24Z

Thank you very much!! But after I successfully pulled the image and ran the docker run command, I got this error:

"Unable to find image 'aaai_hydra:latest' locally
docker: Error response from daemon: pull access denied for aaai_hydra, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

The first pull printed: Status: Downloaded newer image for cyyever/aaai_hydra:latest
docker.io/cyyever/aaai_hydra:latest.

After hitting the error, I checked that the image exists by running docker pull again, which printed:
"latest: Pulling from cyyever/aaai_hydra
Digest:sha256:033e84fb07e447eb8aec80092827f673b7c48df5760e9f5abe5acf58065d3a11
Status: Image is up to date for cyyever/aaai_hydra:latest
docker.io/cyyever/aaai_hydra:latest".

It seems I've successfully pulled the image? But when I try docker run, it looks like I have no permission to access it?

cyyever · 2023-12-02T05:02:39Z

Use 'sudo docker image list' to find out the right image name. I can't help much here, you should be familiar with Docker operations.

poppopbean0903 · 2023-12-04T15:32:09Z

Many thanks!! I've successfully run the code. But there are a few errors that seem related to your library:

1. When I run hydra_train.py with use hessian = True, an error occurs on line 263 of hydra_hook.py: "'dict' object has no attribute 'cpu'". The original code is test_gradient = test_gradient.cpu().
2. After commenting out that line, it reports "'cyy_torch_cpp_extension.data_structure.SyncedTenso' object has no attribute 'tensor_dict'" at line 277, where the original code is tensor_dict.tensor_dict.flush(True).
3. In general it also warns "found inf in AMP, scale is tensor(65536., device='cuda:0')". What does "AMP" mean?
4. What does "lean" in lean_hydra_train stand for? Sorry, I can't recall the corresponding part of the paper.

The following are some questions about how to use your code correctly. It would be a great help if you are willing to offer some advice ^-^ (it would save me a lot of time). But it's fine to ignore them; I shouldn't be bothering you with this.

First, I want to save the hypergradients of each sample. Does the tensor_dict variable at line 270 of hydra_hook.py hold all the hypergradients? It seems to be empty when I run hydra_train.py. Also, do you have any advice on saving these hypergradients properly, given their special type: which is better, torch.save, joblib, or something else?

Second, I need to be able to relate each hypergradient to its corresponding data sample for later use, rather than having only hypergradients with indices and no knowledge of the corresponding data. Is the index fixed every time I load the data? If so, will loading the data by the same steps be enough?

Thank you so much for your assistance and your time.

cyyever · 2023-12-04T16:57:22Z

@poppopbean0903 1 and 2 are due to recent code refactors and I will fix them soon. 3 refers to PyTorch automatic mixed precision (https://pytorch.org/docs/stable/amp.html), a way to accelerate training. 4 is our optimization to speed up hyper-gradient computation; it was not mentioned in the paper.
I will check the results to ensure that the resulting dict contains the influence values of samples.
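
As background on the "found inf in AMP" warning: mixed-precision training usually multiplies the loss by a scale factor, and when the scaled gradients overflow to inf or NaN the optimizer step is skipped and the scale is reduced. The sketch below is a pure-Python toy model of that idea, not the PyTorch implementation and not this project's code. ToyLossScaler is a hypothetical class, and its default values mirror the documented defaults of PyTorch's GradScaler (initial scale 65536, growth factor 2.0, backoff factor 0.5, growth interval 2000). Reading the warning as "one step was skipped because of an overflow" is an assumption on my part.

```python
import math


class ToyLossScaler:
    """Pure-Python model of dynamic loss scaling (not the PyTorch implementation)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0  # Consecutive steps without inf/NaN gradients.

    def step(self, grads) -> bool:
        """Return True if the optimizer step would be applied, False if skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: shrink the scale and skip this update.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


scaler = ToyLossScaler(growth_interval=2)
print(scaler.step([0.1, float("inf")]), scaler.scale)  # step skipped, scale halved
print(scaler.step([0.1, 0.2]), scaler.scale)           # clean step, scale unchanged
print(scaler.step([0.3, 0.1]), scaler.scale)           # second clean step, scale doubled
```

Under this model, an occasional warning at the initial scale is expected early in training; warnings on every step would point at a real numerical problem instead.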

cyyever · 2023-12-05T05:53:32Z

@poppopbean0903 I pushed the latest image with all the fixes.

poppopbean0903 · 2023-12-08T07:36:11Z

Thanks! But when I run your code with MNIST, it reports "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!" at line 235 of cyy_torch_xai/hydra/hydra_hook.py during the second epoch. It was called from line 66 of hydra_sgd_hook.py. When I checked the devices of the instance gradient and the hypergradient, both turned out to be None.

cyyever · 2023-12-08T07:39:00Z

No worries, I noticed the error and will push a new image immediately.

cyyever · 2023-12-08T15:04:53Z

@poppopbean0903 Updated
