Build for hip gpu backends #392

williamberman · 2022-10-30T18:16:51Z

Issue: #345

Testing the build

FF_GPU_BACKEND=hip_rocm ./docker/build.sh

Current status

hip_rocm builds e2e with a few changes to legion and flexflow source

Small source modifications for build

Miscellaneous small changes to the source to get the build working; these should be OK to merge.

Move tools to top level directory

We glob for files under src to get the source files for the flexflow
target. Moving tools to the top level directory prevents the tools
sourcefiles from accidentally being added to the flexflow target
source files.
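A rough illustration of the effect, emulating a recursive glob with find rather than CMake. The directory layouts and file names below are hypothetical and only meant to show why the location of tools matters:

```shell
#!/usr/bin/env bash
# Emulate "glob every source file under a root directory".
# The real build does this in CMake; find is only a stand-in here.
glob_sources() {
  # $1: root directory to glob under
  find "$1" -type f -name '*.cc' | sort
}

# Hypothetical layout with tools nested under src/.
make_nested_layout() {
  mkdir -p "$1/src/ops" "$1/src/tools"
  : > "$1/src/ops/concat.cc"
  : > "$1/src/tools/substitution_to_dot.cc"
}

# Hypothetical layout with tools moved to the top level.
make_toplevel_layout() {
  mkdir -p "$1/src/ops" "$1/tools"
  : > "$1/src/ops/concat.cc"
  : > "$1/tools/substitution_to_dot.cc"
}
```

Calling glob_sources on the src directory of the nested layout returns both files, while in the top-level layout it returns only the file under src/ops, so the tool source can no longer leak into the library target.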

Change substitution_to_dot from cuda_add_executable to add_executable.
When building with hip_rocm, we don't have cuda available and shouldn't
need to build with it for substitution_to_dot as the target does
Remaining:

hip_rocm backend builds
cuda backend builds
Do changes to legion source have to be merged in?
Is switching from miopen.h to miopen/miopen.h header acceptable?
Fix out of date hip kernels to match kernel headers
Feedback on hip kernel changes
update Dockerfile to conditionally install hip dependencies
Document Docker build changes

Misc

An additional note on the legion change: I don't know whether the const_cast to remove the volatile qualifier is sound in that context. I mainly added it to get legion compiling with the changed build config.

goliaro · 2022-11-08T01:05:23Z

One more small thing -- it would be really good to run shellcheck before merging this, since we're modifying bash files. We should probably add a CI workflow for this later too
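A minimal sketch of how such a check might be run locally, assuming shellcheck is installed and on PATH. The helper names are hypothetical. Scripts are found by shebang rather than by extension, since a config script may not end in .sh:

```shell
#!/usr/bin/env bash
# Print files under a directory whose first line is a sh or bash shebang.
find_shell_scripts() {
  # $1: directory to search (the .git directory is skipped)
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if head -n 1 "$f" | grep -Eq '^#!.*(/| )(ba)?sh( |$)'; then
      printf '%s\n' "$f"
    fi
  done
}

# Run shellcheck over every script found; skip quietly if it is missing.
# Assumes paths contain no whitespace, since xargs splits on it.
lint_all() {
  if ! command -v shellcheck >/dev/null 2>&1; then
    echo "shellcheck not installed; skipping" >&2
    return 0
  fi
  files=$(find_shell_scripts "$1")
  [ -z "$files" ] && return 0
  printf '%s\n' "$files" | xargs shellcheck
}
```

Running lint_all from the repository root would then report issues in any modified bash files before merging.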

lockshaw · 2022-11-08T01:07:28Z

Good idea. #432

williamberman · 2022-11-08T01:24:21Z

The CI is currently broken because the python setup.py script reads the config.linux script as a string and attempts to parse the values set in it, so config.linux can't be treated as a regular shell script. We added some standard bash conventions to the config script so it could have default values and read overrides from the environment, which breaks the ad hoc variable parsing.

Will need to fix before merging in
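To make the failure mode concrete, here is a small sketch. The one-line config and the default value cuda are assumptions for illustration, not the contents of the real config.linux:

```shell
#!/usr/bin/env bash
# Write a hypothetical one-line config that uses the
# "default unless overridden from the environment" convention.
write_demo_config() {
  # $1: path to write the demo config to
  cat > "$1" <<'EOF'
FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda}
EOF
}

# Ad hoc parsing: take the text to the right of '=' on the matching line.
# This yields the unexpanded shell expression, not a usable value.
parse_adhoc() {
  grep '^FF_GPU_BACKEND=' "$1" | cut -d= -f2-
}

# Shelling out: let the shell evaluate the script, then print the result.
# Runs in a subshell so the caller's environment is left untouched.
eval_config() {
  ( . "$1" && printf '%s\n' "$FF_GPU_BACKEND" )
}
```

With this config, parse_adhoc returns the literal text ${FF_GPU_BACKEND:-cuda}, whereas eval_config returns cuda by default and hip_rocm when FF_GPU_BACKEND=hip_rocm is exported. That is the motivation for evaluating the script instead of parsing it as a string.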

eddy16112 · 2022-11-08T02:36:59Z

Regarding the CI, if we cannot find an AMD GPU for the CI, we can test the code with FF_GPU_BACKEND=hip_cuda

GYDmedwin · 2022-11-08T04:59:40Z

Hi @williamberman. I am trying to build the project using the method you gave, but the error message says that KFD could not be accessed due to insufficient permissions. This may be what causes the legion compile error. I suspect docker does not have enough permissions to access KFD during compilation. Can you give me some advice? Thank you.
The error message is as follows:
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:237: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o] Error 1
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:475: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/runtime.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:390: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/all] Error 2

williamberman · 2022-11-08T05:39:35Z

Hey @GYDmedwin I might need some additional information here :) I take it this is happening during the docker container build? What's the full error and where does it occur in the build process (maybe some more console output would be helpful). It sounds like you're running on a machine with an actual amd gpu if I'm reading your message correctly. Fwiw, we're mainly merging this with the intention of targeting amd but did not run it on an actual machine with an amd gpu due to the particulars involved in finding one that supports rocm.

GYDmedwin · 2022-11-08T06:57:47Z

@williamberman Thank you for your reply. Yes, I tested this on a machine with an actual AMD MI100 GPU, and the errors occur during the docker container build.
All the errors occur in the following step:
Step 18/19 : RUN mkdir -p build && cd build && ../config/config.linux && make -j $N_BUILD_CORES && ../config/config.linux && make install

And the main error's details are as follows:
-- Using the single-header code from /usr/FlexFlow/build/_deps/json-src/single_include/
-- FlexFlow MAX_DIM: 5
Traceback (most recent call last):
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in
main()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
target_list = readFromKFD()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
Traceback (most recent call last):
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in
main()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
target_list = readFromKFD()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
Traceback (most recent call last):
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in
main()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
target_list = readFromKFD()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
line = f.readline()
PermissionError: [Errno 1] Operation not permitted
-- hip::amdhip64 is SHARED_LIBRARY
-- pybind11 v2.11.0 dev1
-- Found PythonInterp: /opt/conda/bin/python (found suitable version "3.9.13", minimum required is "3.6")
-- Found PythonLibs: /opt/conda/lib/libpython3.9.so

[ 10%] Building CXX object deps/legion/runtime/CMakeFiles/RealmRuntime.dir/realm/numa/numasysif.cc.o
Traceback (most recent call last):
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in
main()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
target_list = readFromKFD()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
line = f.readline()
PermissionError: [Errno 1] Operation not permitted

[ 38%] Building CXX object deps/legion/runtime/CMakeFiles/RealmRuntime.dir/realm/deppart/image_4_5.cc.o
Traceback (most recent call last):
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 257, in
main()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 241, in main
target_list = readFromKFD()
File "/opt/rocm-5.2.0/bin/rocm_agent_enumerator", line 200, in readFromKFD
line = f.readline()
PermissionError: [Errno 1] Operation not permitted

This error occurred in several places, and the content was the same. I only showed three. I guess because of this error, legion compilation failed, and finally FlexFlow compilation failed.

The details of another error message are as follows:
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:475: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/runtime.cc.o] Error 1
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:237: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o] Error 1
26 warnings and 1 error generated when compiling for gfx803.
make[2]: *** [deps/legion/runtime/CMakeFiles/LegionRuntime.dir/build.make:489: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1.cc.o] Error 1
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
26 warnings generated when compiling for host.
make[1]: *** [CMakeFiles/Makefile2:390: deps/legion/runtime/CMakeFiles/LegionRuntime.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

williamberman · 2022-11-08T07:35:08Z

Ok great, thank you for the extra details! I opened up an issue here: #457. I think your best immediate bet would be building with the Makefiles on your standard system (not in docker).

williamberman force-pushed the cmake-gpu-backends branch 19 times, most recently from f4ba864 to a6073c1 Compare November 3, 2022 01:38

williamberman force-pushed the cmake-gpu-backends branch 9 times, most recently from 0ad6d3a to 633f7a9 Compare November 3, 2022 17:20

williamberman added 4 commits November 7, 2022 15:56

system dependencies install instructions

beeb8eb

Ensure docker build script is called from FF_HOME

453b6f2

Add .dockerignore file to ignore build directories

9a1d868

Add new lines

bf96efc

williamberman force-pushed the cmake-gpu-backends branch from 6f584a1 to bf96efc Compare November 7, 2022 23:56

lockshaw enabled auto-merge (squash) November 7, 2022 23:58

lockshaw disabled auto-merge November 8, 2022 00:02

eddy16112 reviewed Nov 8, 2022

williamberman force-pushed the cmake-gpu-backends branch from 48f1521 to ef97c21 Compare November 8, 2022 03:34

williamberman added 2 commits November 7, 2022 19:58

Fix CI for changes in PR

049c3a1

do not call sed to manually change config script clone submodules in docker job

Change the python setup script to shell out to the config script instead of parsing it adhoc

a8c4ed4

williamberman force-pushed the cmake-gpu-backends branch from f7a1fee to a8c4ed4 Compare November 8, 2022 03:59

Update docs to note FF_GPU_BACKEND=hip_cuda is not supported

fdf422d

williamberman mentioned this pull request Nov 8, 2022

Add FF_GPU_BACKEND=hip_rocm to CI #433

Closed

lockshaw enabled auto-merge (squash) November 8, 2022 04:22

Fix path to mt5 dockerfile

a366c98

auto-merge was automatically disabled November 8, 2022 05:48
Head branch was pushed to by a user without write access

lockshaw enabled auto-merge (squash) November 8, 2022 05:49

williamberman mentioned this pull request Nov 8, 2022

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

Open

lockshaw merged commit 81304c8 into flexflow:master Nov 8, 2022

goliaro mentioned this pull request Nov 21, 2022

[Docker] - Refactor and automatic upload of containers to repo #486

Merged

1 task
