Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing cmake gpu backend hip_rocm build on machine with AMD gpu #457

Open
williamberman opened this issue Nov 8, 2022 · 8 comments
Open
Assignees
Labels
tech debt Refactoring/improvement of existing code/abstractions

Comments

@williamberman
Copy link
Collaborator

Follow up from: #392

This build has been tested on a non-amd machine targetting rocm. It has been reported that at least when using docker, the build fails. We should test this build directly on an amd machine and potentially integrate as a part of CI.

The amd gpus that aws provides do not support rocm

@GYDmedwin
Copy link

@williamberman Thanks for your advice! After I tested it on my machine, the problem with the kfd not being accessible was solved. I also discovered two new problems:

  1. When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
    CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".

  2. When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:

/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));
^~~~~~
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:992:7: note: in instantiation of function template specialization 'Legion::Deserializer::deserialize' requested here
deserialize(dom.dim);
^
/usr/include/string.h:43:14: note: candidate function not viable: no known conversion from 'volatile int *' to 'void *__restrict' for 1st argument
extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
^
/opt/rocm-5.2.0/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_device_functions.h:1050:32: note: candidate function not viable: call to __device__ function from __host__ function static inline __device__ void* memcpy(void* dst, const void* src, size_t size) {

@williamberman
Copy link
Collaborator Author

@GYDmedwin Thank you for following up with this, super helpful!

When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".

Very good catch, I can follow up with a fix.

When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));

For this, I am hoping that this is because we pushed a new legion commit and your legion submodule needs to be updated. If you are testing from flexflow tip, you can confirm this by running git status and seeing that your legion commit is different from what git expects. cd into legion and run git pull then check git status again. Let me know if that does fix this issue or if I'm off base.

Once again, super helpful reporting this and please keep following up with us with any build issues you have!

@williamberman
Copy link
Collaborator Author

@lockshaw I was thinking that this was an unnecessary use of cuda_add_executable instead of add_executable similar to as was previously done in tools. However it looks like many of the examples do have cuda code with no alternative hip code.

We have a few options,

  1. throw an error during configure if attempting to build examples and the backend is hip
  2. select examples that are not dependent on cuda to build when the backend is hip
  3. add equivalent hip code for examples that depend on cuda

I'm not sure the extent flexflow is used with non-cuda backends. If hip is very frequently used and needs to be a first class citizen, we could go all the way to 3. If hip is not frequently used and we mainly just want to make sure we can build with it when necessary, we could do 1. 2 is a middle option. Let me know what you think and please tag anyone else who has additional context

@GYDmedwin
Copy link

@GYDmedwin Thank you for following up with this, super helpful!

When FF_GPU_BACKEND is hip_rocm and FF_BUILD_ALL_EXAMPLES is ON, the error message is as follows:
CMake Error at examples/cpp/ResNet/CMakeLists.txt:14 (cuda_add_executable): Unknown CMake command "cuda_add_executable".

Very good catch, I can follow up with a fix.

When FF_GPU_BACKEND is hip_rocm, error message on my amd machine is as follows:
/FlexFlow/deps/legion/runtime/legion/legion_utilities.h:872:7: error: no matching function for call to 'memcpy'
memcpy(&element, buffer+index, sizeof(T));

For this, I am hoping that this is because we pushed a new legion commit and your legion submodule needs to be updated. If you are testing from flexflow tip, you can confirm this by running git status and seeing that your legion commit is different from what git expects. cd into legion and run git pull then check git status again. Let me know if that does fix this issue or if I'm off base.

Once again, super helpful reporting this and please keep following up with us with any build issues you have!

Great, you are right, it has been successfully compiled on the real machine. Thank you so much!

@williamberman
Copy link
Collaborator Author

@GYDmedwin Thank you for reporting the build issues!

@lockshaw
Copy link
Collaborator

lockshaw commented Nov 9, 2022

@lockshaw I was thinking that this was an unnecessary use of cuda_add_executable instead of add_executable similar to as was previously done in tools. However it looks like many of the examples do have cuda code with no alternative hip code.

We have a few options,

1. throw an error during configure if attempting to build examples and the backend is hip

2. select examples that are not dependent on cuda to build when the backend is hip

3. add equivalent hip code for examples that depend on cuda

I'm not sure the extent flexflow is used with non-cuda backends. If hip is very frequently used and needs to be a first class citizen, we could go all the way to 3. If hip is not frequently used and we mainly just want to make sure we can build with it when necessary, we could do 1. 2 is a middle option. Let me know what you think and please tag anyone else who has additional context

I'd say 1 is probably want we want to do for now, unless @eddy16112 or @jiazhihao disagree. Obviously 3 would be nice to reach in the long term, but I don't know if it's currently high priority enough to put resources toward (@williamberman can you create an issue for 3?)

@williamberman
Copy link
Collaborator Author

@lockshaw sorry missed this, agreed! Will do

@lockshaw lockshaw added the out of date? Reports that need to be confirmed to still exist label Jun 16, 2023
@lockshaw
Copy link
Collaborator

lockshaw commented Jun 22, 2023

Summary of the status of this by @williamberman:

this specific issue was that the examples were built with cuda_add_executable which isn't a valid cmake command when not building with cuda. I'm not sure what the equivalent hip cmake command is but I'd just replace it in that case.

For this specific issue I don't think you need anything particular from the machine

iirc very likely that we just added a change to not build the examples when targeting hip rocm

But also not sure if the refactor status impacts some of these build things - might already be fixed if you did so

Based on Will's description it sounds like this issue still exists, so I will remove the "out of date" flag

@lockshaw lockshaw added tech debt Refactoring/improvement of existing code/abstractions and removed out of date? Reports that need to be confirmed to still exist labels Jun 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tech debt Refactoring/improvement of existing code/abstractions
Projects
No open projects
Status: Backlog
Development

No branches or pull requests

3 participants