Build for hip gpu backends #392

Merged
merged 15 commits into from Nov 8, 2022

Conversation

williamberman
Collaborator

@williamberman williamberman commented Oct 30, 2022

Issue: #345

Testing the build

FF_GPU_BACKEND=hip_rocm ./docker/build.sh

Current status

hip_rocm builds e2e with a few changes to legion and flexflow source

Small source modifications for build

Miscellaneous small changes to the source to get the build working; these should be OK to merge.

Move tools to top level directory

We glob for files under src to collect the source files for the flexflow
target. Moving tools to the top-level directory prevents the tools
source files from accidentally being added to the flexflow target's
source files.
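
To illustrate why the directory move matters, here is a minimal sketch of the effect of a recursive glob. It is written in Python rather than CMake, and the directory layouts and file names are hypothetical, chosen only to show how anything placed under src is picked up by the glob while a top-level tools directory is not.

```python
import tempfile
from pathlib import Path


def collect_sources(src_dir):
    """Recursively collect .cc files under src_dir, as a recursive glob would."""
    return sorted(
        p.relative_to(src_dir.parent).as_posix() for p in src_dir.rglob("*.cc")
    )


def make_tree(root, files):
    """Create empty files (and their parent directories) under root."""
    for name in files:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


with tempfile.TemporaryDirectory() as tmp:
    before = Path(tmp) / "before"
    after = Path(tmp) / "after"
    # Hypothetical layouts: the file names are illustrative only.
    make_tree(before, ["src/runtime/model.cc", "src/tools/substitution_to_dot.cc"])
    make_tree(after, ["src/runtime/model.cc", "tools/substitution_to_dot.cc"])
    sources_before = collect_sources(before / "src")
    sources_after = collect_sources(after / "src")

print(sources_before)  # the tool source is swept into the library sources
print(sources_after)   # only the library source remains
```

With tools outside src, the glob no longer needs an exclusion rule to keep tool entry points out of the library target.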

Change substitution_to_dot from cuda_add_executable to add_executable.
When building with hip_rocm, we don't have cuda available and shouldn't
need to build with it for substitution_to_dot as the target does

Remaining:

  • hip_rocm backend builds
  • cuda backend builds
  • Do changes to legion source have to be merged in
  • Is switching from miopen.h to miopen/miopen.h header acceptable
  • Fix out of date hip kernels to match kernel headers
  • Feedback on hip kernel changes
  • update Dockerfile to conditionally install hip dependencies
  • Document Docker build changes

Misc

An additional note on the Legion change: I don't know whether the const_cast that removes the volatile qualifier is sound in that context. I mainly added it to get Legion compiling with the changed build config.

@williamberman williamberman force-pushed the cmake-gpu-backends branch 19 times, most recently from f4ba864 to a6073c1 Compare November 3, 2022 01:38
@williamberman williamberman force-pushed the cmake-gpu-backends branch 9 times, most recently from 0ad6d3a to 633f7a9 Compare November 3, 2022 17:20
@goliaro
Collaborator

goliaro commented Nov 8, 2022

One more small thing -- it would be really good to run shellcheck before merging this, since we're modifying bash files. We should probably add a CI workflow for this later too

@lockshaw
Collaborator

lockshaw commented Nov 8, 2022

Good idea. #432

@williamberman
Collaborator Author

williamberman commented Nov 8, 2022

The CI is currently broken because the Python setup.py script reads the config.linux script as a string and attempts to parse the values set in it, so the config script can't be treated as a regular shell script. We added some standard bash conventions to the config script so that it can have default values and read overrides from the environment, which breaks the ad hoc variable parsing.

This will need to be fixed before merging.
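
As a rough illustration of the breakage, the sketch below contrasts a naive line-based parser with one that understands the bash `${NAME:-default}` form. This is not the actual setup.py code; the config excerpt, the variable names other than FF_GPU_BACKEND, and both function names are assumptions made for the example.

```python
import re

# Hypothetical excerpt of a config script using bash default-value syntax.
CONFIG_TEXT = '''
FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda}
BUILD_TYPE=${BUILD_TYPE:-Release}
MAX_DIM=5
'''

ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")
DEFAULT_EXPANSION = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*):-(.*)\}$")


def parse_naive(text):
    """Treat every NAME=value line as a literal string assignment."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    return values


def parse_with_defaults(text, env):
    """Like parse_naive, but resolve the ${NAME:-default} form against env."""
    values = {}
    for name, raw in parse_naive(text).items():
        expansion = DEFAULT_EXPANSION.match(raw)
        if expansion:
            # bash falls back to the default when the variable is unset or empty
            raw = env.get(expansion.group(1)) or expansion.group(2)
        values[name] = raw
    return values


naive = parse_naive(CONFIG_TEXT)
resolved = parse_with_defaults(CONFIG_TEXT, {"FF_GPU_BACKEND": "hip_rocm"})
print(naive["FF_GPU_BACKEND"])     # the unexpanded shell expression
print(resolved["FF_GPU_BACKEND"])  # the override taken from the environment
```

A pattern-based parser like this only covers the one expansion form it was written for, so letting bash evaluate the script and reading back the resulting values may be the more robust direction.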

@eddy16112
Collaborator

eddy16112 commented Nov 8, 2022

Regarding the CI, if we cannot find an AMD GPU for it, we can test the code with FF_GPU_BACKEND=hip_cuda.

do not call sed to manually change config script

clone submodules in docker job
@GYDmedwin

Hi @williamberman . I am trying to build the project according to the method you gave, but the error message said that kfd could not access the project due to insufficient permission. It may be because of this problem that the compiling error of legion was reported. I guess it is because docker permission is not enough to access kfd during the compilation process. Can you give me some advice? Thank you.
The error message is as follows:
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:237: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o] Error 1
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:475: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/runtime.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:390: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/all] Error 2

@williamberman
Collaborator Author

Hey @GYDmedwin, I might need some additional information here :) I take it this is happening during the Docker container build? What's the full error, and where does it occur in the build process? Some more console output would be helpful. It sounds like you're running on a machine with an actual AMD GPU, if I'm reading your message correctly. FWIW, we're mainly merging this with the intention of targeting AMD, but we did not run it on an actual machine with an AMD GPU because of the particulars involved in finding one that supports ROCm.

auto-merge was automatically disabled November 8, 2022 05:48

Head branch was pushed to by a user without write access

@lockshaw lockshaw enabled auto-merge (squash) November 8, 2022 05:49
@GYDmedwin

GYDmedwin commented Nov 8, 2022

@williamberman Thank you for your reply. Yes, I tested this on a machine with an actual AMD GPU (MI100), and the errors occur during the Docker container build.
All the errors occur in the following step:
Step 18/19 : RUN mkdir -p build && cd build && ../config/config.linux && make -j $N_BUILD_CORES && ../config/config.linux && make install

And the main error's details are as follows:
-- Using the single-header code from /usr/FlexFlow/build/_deps/json-src/single_include/
-- FlexFlow MAX_DIM: 5
Traceback (most recent call last):
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
Traceback (most recent call last):
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
Traceback (most recent call last):
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
-- pybind11 v2.11.0 dev1
-- Found PythonInterp: /opt/conda/bin/python (found suitable version "3.9.13", minimum required is "3.6")
-- Found PythonLibs: /opt/conda/lib/libpython3.9.so

[ 10%] Building CXX object deps/legion/runtime/CMakeFiles/RealmRuntime.dir/realm/numa/numasysif.cc.o
Traceback (most recent call last):
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted

[ 38%] Building CXX object deps/legion/runtime/CMakeFiles/RealmRuntime.dir/realm/deppart/image_4_5.cc.o
Traceback (most recent call last):
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in <module>
    main()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
    target_list = readFromKFD()
  File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
    line = f.readline()
PermissionError: [Errno 1] Operation not permitted

This error occurred in several places, and the content was the same. I only showed three. I guess because of this error, legion compilation failed, and finally FlexFlow compilation failed.
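
The traceback shows the failure surfacing while a file is being read, so one quick way to narrow this down is to attempt the same kind of read from inside the container. The following is only a diagnostic sketch: the helper name is hypothetical, and the exact files that rocm_agent_enumerator opens under the KFD sysfs tree are an assumption that should be checked against the script itself.

```python
import tempfile
from pathlib import Path


def can_read_first_line(path):
    """Return (ok, reason) for an attempt to open path and read one line."""
    try:
        with open(path, "r") as f:
            f.readline()
    except FileNotFoundError:
        return False, "missing"
    except PermissionError:
        # Covers failures raised by open() as well as by the read itself.
        return False, "permission denied"
    except OSError as exc:
        return False, f"os error: {exc.strerror}"
    return True, "readable"


# Demonstration on throwaway files; on a real machine, pass the path that
# rocm_agent_enumerator reads (an assumption to verify in that script).
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "properties"
    present.write_text("example line\n")
    ok_result = can_read_first_line(present)
    missing_result = can_read_first_line(Path(tmp) / "does_not_exist")

print(ok_result)
print(missing_result)
```

Running the check both on the host and inside the build container would show whether the permission problem is specific to the Docker build environment.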

The details of another error message are as follows:
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:475: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/runtime.cc.o] Error 1
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:237: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o] Error 1
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:489: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1.cc.o] Error 1
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
make[1]: *** [CMakeFiles/Makefile2:390: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

@williamberman
Collaborator Author

OK, great, thank you for the extra details! I opened an issue here: #457. I think your best immediate bet would be to build with the Makefiles directly on your host system (not in Docker).

@lockshaw lockshaw merged commit 81304c8 into flexflow:master Nov 8, 2022
Development

Successfully merging this pull request may close these issues.

Document how to build with the hip_rocm backend on a machine without AMD GPUs
6 participants