Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

williamberman · 2022-11-08T07:33:27Z

Follow up from: #392

This build has been tested on a non-amd machine targetting rocm. It has been reported that at least when using docker, the build fails. We should test this build directly on an amd machine and potentially integrate as a part of CI.

The amd gpus that aws provides do not support rocm

GYDmedwin · 2022-11-08T09:49:26Z

@williamberman Thanks for your advice! After I tested it on my machine, the problem with the kfd not being accessible was solved. I also discovered two new problems:

When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".
When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:

/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));
^~~~~~
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:992:7: note: in instantiation of function template specialization 'Legion::Deserializer::deserialize' requested here
deserialize(dom.dim);
^
/usr/include/string.h:43:14: note: candidate function not viable: no known conversion from 'volatile int *' to 'void *__restrict' for 1st argument
extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
^
/opt/rocm-5.2.0/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_device_functions.h:1050:32: note: candidate function not viable: call to __device__ function from __host__ function static inline __device__ void* memcpy(void* dst, const void* src, size_t size) {

williamberman · 2022-11-08T21:05:25Z

@GYDmedwin Thank you for following up with this, super helpful!

When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".

Very good catch, I can follow up with a fix.

When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));

For this, I am hoping that this is because we pushed a new legion commit and your legion submodule needs to be updated. If you are testing from flexflow tip, you can confirm this by running git status and seeing that your legion commit is different from what git expects. cd into legion and run git pull then check git status again. Let me know if that does fix this issue or if I'm off base.

Once again, super helpful reporting this and please keep following up with us with any build issues you have!

williamberman · 2022-11-08T21:17:26Z

@lockshaw I was thinking that this was an unnecessary use of cuda_add_executable instead of add_executable similar to as was previously done in tools. However it looks like many of the examples do have cuda code with no alternative hip code.

We have a few options,

throw an error during configure if attempting to build examples and the backend is hip
select examples that are not dependent on cuda to build when the backend is hip
add equivalent hip code for examples that depend on cuda

I'm not sure the extent flexflow is used with non-cuda backends. If hip is very frequently used and needs to be a first class citizen, we could go all the way to 3. If hip is not frequently used and we mainly just want to make sure we can build with it when necessary, we could do 1. 2 is a middle option. Let me know what you think and please tag anyone else who has additional context

GYDmedwin · 2022-11-09T02:12:27Z

@GYDmedwin Thank you for following up with this, super helpful!

When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".

Very good catch, I can follow up with a fix.

When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));

For this, I am hoping that this is because we pushed a new legion commit and your legion submodule needs to be updated. If you are testing from flexflow tip, you can confirm this by running git status and seeing that your legion commit is different from what git expects. cd into legion and run git pull then check git status again. Let me know if that does fix this issue or if I'm off base.

Once again, super helpful reporting this and please keep following up with us with any build issues you have!

Great, you are right, it has been successfully compiled on the real machine. Thank you so much!

williamberman · 2022-11-09T02:32:35Z

@GYDmedwin Thank you for reporting the build issues!

lockshaw · 2022-11-09T20:02:47Z

@lockshaw I was thinking that this was an unnecessary use of cuda_add_executable instead of add_executable similar to as was previously done in tools. However it looks like many of the examples do have cuda code with no alternative hip code.

We have a few options,
1. throw an error during configure if attempting to build examples and the backend is hip

2. select examples that are not dependent on cuda to build when the backend is hip

3. add equivalent hip code for examples that depend on cuda
I'm not sure the extent flexflow is used with non-cuda backends. If hip is very frequently used and needs to be a first class citizen, we could go all the way to 3. If hip is not frequently used and we mainly just want to make sure we can build with it when necessary, we could do 1. 2 is a middle option. Let me know what you think and please tag anyone else who has additional context

I'd say 1 is probably want we want to do for now, unless @eddy16112 or @jiazhihao disagree. Obviously 3 would be nice to reach in the long term, but I don't know if it's currently high priority enough to put resources toward (@williamberman can you create an issue for 3?)

williamberman · 2022-11-11T17:20:15Z

@lockshaw sorry missed this, agreed! Will do

lockshaw · 2023-06-22T06:39:33Z

Summary of the status of this by @williamberman:

this specific issue was that the examples were built with cuda_add_executable which isn't a valid cmake command when not building with cuda. I'm not sure what the equivalent hip cmake command is but I'd just replace it in that case.

For this specific issue I don't think you need anything particular from the machine

iirc very likely that we just added a change to not build the examples when targeting hip rocm

But also not sure if the refactor status impacts some of these build things - might already be fixed if you did so

Based on Will's description it sounds like this issue still exists, so I will remove the "out of date" flag

williamberman mentioned this issue Nov 8, 2022

Build for hip gpu backends #392

Merged

8 tasks

lockshaw added this to FlexFlow Development Nov 9, 2022

lockshaw assigned williamberman Nov 9, 2022

lockshaw moved this to Backlog in FlexFlow Development Nov 9, 2022

williamberman mentioned this issue Nov 11, 2022

Build examples when targeting hip_rocm gpu backend #470

Open

lockshaw added the out of date? Reports that need to be confirmed to still exist label Jun 16, 2023

lockshaw added tech debt Refactoring/improvement of existing code/abstractions and removed out of date? Reports that need to be confirmed to still exist labels Jun 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

williamberman commented Nov 8, 2022

GYDmedwin commented Nov 8, 2022

williamberman commented Nov 8, 2022

williamberman commented Nov 8, 2022

GYDmedwin commented Nov 9, 2022

williamberman commented Nov 9, 2022

lockshaw commented Nov 9, 2022 •

edited

Loading

williamberman commented Nov 11, 2022

lockshaw commented Jun 22, 2023 •

edited

Loading

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

Comments

williamberman commented Nov 8, 2022

GYDmedwin commented Nov 8, 2022

williamberman commented Nov 8, 2022

williamberman commented Nov 8, 2022

GYDmedwin commented Nov 9, 2022

williamberman commented Nov 9, 2022

lockshaw commented Nov 9, 2022 • edited Loading

williamberman commented Nov 11, 2022

lockshaw commented Jun 22, 2023 • edited Loading

lockshaw commented Nov 9, 2022 •

edited

Loading

lockshaw commented Jun 22, 2023 •

edited

Loading