ROCm support #3462
base: main
Conversation
iotamudelta commented on May 20, 2024 (edited)
- add hipify at configure time
- ROCm specific code paths behind USE_ROCM guards
- support for wavefront 32 (Navi) and 64 (MI)
- use builtins to match inline PTX
- support C API on ROCm
- support Python API on ROCm
Squashed commit message:
* add hipify at configure time
* ROCm specific code paths behind USE_ROCM guards
* support for wavefront 32 (Navi) and 64 (MI)
* use builtins to match inline PTX
* support C API on ROCm
* disable Python API on ROCm for now
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: hubertlu-tw <hubertlu@amd.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: Michael Pittard <michael.pittard@amd.com>
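To illustrate two of the bullet points above ("ROCm specific code paths behind USE_ROCM guards" and "use builtins to match inline PTX"), here is a hedged sketch, not taken from this PR: reading the lane ID via a HIP builtin instead of NVIDIA inline PTX, which the HIP compiler cannot parse.

```cpp
// Sketch only (not code from this PR). CUDA code often reads the lane id
// via inline PTX; HIP provides the __lane_id() builtin instead, which is
// valid for both wavefront 32 (Navi) and wavefront 64 (MI) devices.
__device__ int getLaneId() {
#ifdef USE_ROCM
    return __lane_id();                           // HIP builtin
#else
    int laneId;
    asm("mov.u32 %0, %%laneid;" : "=r"(laneId));  // inline PTX on CUDA
    return laneId;
#endif
}
```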
This supersedes #3126 and addresses #3231. As discussed in #3231, @ItsPitt (Michael Pittard) and @iotamudelta (myself) can maintain the ROCm backend. AMD can provide two MI200-class servers for CI under an arrangement similar to the PyTorch CI.
Given the size of the work and the difficulty of reviewing it, we'd be more than happy to split it up into smaller PRs based on feedback on what such PRs should encompass. Thanks!
Assuming this is WIP because the Python tests are not there yet. Otherwise I can write them, but that could take some time.
@mdouze thanks a lot for having a look! Yes, we can figure out the Python support prior to getting it merged. Are there any other immediate issues you see that we should be working on? Do you prefer getting this all merged as one big PR or split up (if so: how)? Again, thanks a lot!
We are looking into compiling this in the CI.
@mdouze we sorted the Python support out. I am not sure about the Windows CI target failing after my latest merge. @ramilbakhshyiev, anything we can help with w.r.t. getting this into CI? Thanks!
@iotamudelta We will be trying this soon; our plan is to use a g4ad instance from AWS. Once we try it on that, we will most likely come back with some feedback.
@ramilbakhshyiev I'm not sure g4ad will work: according to https://aws.amazon.com/ec2/instance-types/ , it features an AMD Radeon Pro V520 with the RDNA1 architecture, which is not supported by ROCm (see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html ).
@iotamudelta Is there any plan to support it? AWS is a widely used platform, and g4ad seems to be its only AMD GPU instance type, right?
@ramilbakhshyiev I do not know about support plans. ROCm does support RDNA2 and RDNA3 GPUs.
Based on some very high-level research, it seems that some users still get ROCm running on g4ad even though it's unsupported by AMD. Do you have any opinion on using that to build this in our CI? If we can't build and test on g4ad, then I believe our only option is to use CI for builds only.
@ramilbakhshyiev it's hard to answer this; a CI build is a must-have, I think. If we can get started with that, it'd be a huge step forward. We can try to test on the AWS instance, but we cannot guarantee correctness, stability, or performance on it. As stated earlier, we could also use two of the PyTorch CI MI200-class machines, but it sounds like that would be too much effort to integrate?
@iotamudelta Hey, apologies for missing that earlier offer; I lost that context when picking this up. I think getting AMD to provide backends would work great, and I think we can support the integration. I have glanced through #3231 and PyTorch but did not figure out how AMD servers register runners with GitHub. Any chance you can point me to that? I assume you're using PyTorch's GH App to retrieve API tokens and register runners.
@ItsPitt I introduced the I did add a step to call
@ramilbakhshyiev It looks like it cannot find the HIP CMake module. You should be able to resolve this by adding the following to the CMake file:
The line above
Understood, sounds good!
Also, here is the documentation for including FindHIP; it has a few more examples: https://github.com/ROCm/HIP/tree/master/samples/2_Cookbook/12_cmake_hip_add_executable#including-findhip-cmake-module-in-the-project
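For illustration only (the exact snippet from the comment above was lost in extraction; the paths assume a default ROCm install under /opt/rocm): ROCm ships FindHIP.cmake outside CMake's default module search path, so it has to be added explicitly before find_package can locate it.

```cmake
# Make the ROCm-provided CMake modules (FindHIP.cmake etc.) visible.
# /opt/rocm is the conventional install prefix; adjust for custom installs.
list(APPEND CMAKE_MODULE_PATH "/opt/rocm/hip/cmake")
find_package(HIP REQUIRED)
```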
Thanks @ItsPitt! I tried a few versions and it didn't solve the problem. Upon some investigation, I was only able to find
Made it to the next step but it failed then:
@ramilbakhshyiev I did some digging and was able to repro this. You just need to add
Thanks @ItsPitt! It started building but now fails because it cannot find libm; I'm looking into it right now:
Hey @ItsPitt, I want to provide an update.
@ramilbakhshyiev I have not run into that issue before. However, I am setting
@ItsPitt @iotamudelta I have some updates. I have been able to build this PR in CI, but with some workarounds (some are pretty brute force): link to full diff. This includes a rebase onto main, but here are the main files to pay attention to:
Can you please review and let me know what you think? It's not ideal, but at least we have a way to compile the ROCm version. Main questions:
If this is OK, I can check with the team whether we can proceed to rebase and review this PR, merge it in, and then add the compile-only steps.
@ramilbakhshyiev Glad to hear it's building!
This makes sense. We could maybe enable gpu automatically if ROCm is enabled to avoid having to enable both. But this doesn't have to be done now.
Moving this to
I believe there's been some movement. I'll find out more details and get back to you on this.
I'll look into this and get you an answer. Of course, if the testing code tries to make a device call as some kind of validation during building, this would cause an issue. Not sure that's what's going on here, just something to check.
When ROCm installs, it makes the symbolic link This also helps for installing multiple versions of ROCm:
These look to be system libs and not part of ROCm.
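As general background for the multiple-versions point above (an assumption about the standard ROCm layout, not taken from the elided comment): ROCm installs into versioned directories alongside a version-agnostic symlink, so a build can pin a particular release through CMAKE_PREFIX_PATH.

```cmake
# Pin the build to one ROCm release; several can coexist side by side.
# /opt/rocm-6.0.2 is a hypothetical versioned install; /opt/rocm is the
# conventional symlink to the default release.
list(APPEND CMAKE_PREFIX_PATH "/opt/rocm-6.0.2" "/opt/rocm")
find_package(hipBLAS REQUIRED)
```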
@ItsPitt ack and thank you! I am currently trying to figure out the build errors with the testing code; it actually seems to be failing to link against glib and also can't find one of the ROCm libs (I assume). Question on that: I assume AMD's tooling integrates into the CMake/make setup, is that correct? Are there any recommended ways, specific to the ROCm/HIP setup, to debug why make does not find the libraries it's looking for?
@ramilbakhshyiev Do you know which ROCm lib(s) it was failing to link against? It is correct. Out of curiosity, do you know which version of CMake you're using? It might be best if we start synchronizing dependency versions to eliminate factors between the working Docker image and the conda build.
@ItsPitt We use CMake
I tried with CMake 3.24 and got the same error. I saw a commit on the Dockerfile you provided lowering the CMake version from 3.29 to 3.24; what was the reason for doing that? Meanwhile I am trying to match more versions to see if it helps.
I think I got a working build. I will clean things up, test it again, and update you with what I needed to change.
@ramilbakhshyiev Great to hear! As for the CMake question: I hadn't fully validated the new version of CMake in my Docker image, so I rolled it back. No specific issue; it can more than likely be bumped once everything's fully working.
faiss/gpu/CMakeLists.txt (outdated)

    if (USE_ROCM)
      find_package(HIP REQUIRED)
      find_package(hipBLAS REQUIRED)
      target_link_libraries(faiss_gpu PRIVATE hip::host roc::hipblas)
      target_compile_options(faiss_gpu PRIVATE)
    else()
- Fix indentation.
- Can we please move the find_package statements to the top-level CMakeLists.txt, guarded by USE_ROCM?
- For target_link_libraries, can we please use the condition syntax like we do for RAFT? E.g. $<$<BOOL:${USE_ROCM}>:hip::host>
@@ -365,8 +365,8 @@ void StandardGpuResourcesImpl::initializeForDevice(int device) {

        // Our code is pre-built with and expects warpSize == 32, validate that
        FAISS_ASSERT_FMT(
    -           prop.warpSize == 32,
    +           prop.warpSize == 32 || prop.warpSize == 64,
                "Device id %d does not have expected warpSize of 32",
Is this ROCm specific? If so, can we allow 64 only for ROCm?
We have both wavefront 32 (e.g., Navi) and wavefront 64 (e.g., MI250) devices, so this offers support for both.
It sounds like Nvidia is 32 only and ROCm is 32 or 64. Should we lock it accordingly in code?
If that is desired, I could rework that assert using a ROCm flag to only allow a warpSize of 64 (and 32) on ROCm devices. It shouldn't be an issue at all!
Yeah, I think I would do that.
faiss/gpu/test/CMakeLists.txt (outdated)

    if(USE_ROCM)
      find_package(HIP REQUIRED)
      target_link_libraries(faiss_gpu_test_helper PUBLIC faiss gtest hip::host)
    else()
      find_package(CUDAToolkit REQUIRED)
Same as above:
- Can we rely on the top-level find_package call?
- Can we use conditional inclusion like we do with RAFT?
faiss/gpu/test/CMakeLists.txt (outdated)

    if(USE_ROCM)
      faiss_gpu_test(TestGpuDistance.hip)
      faiss_gpu_test(TestGpuSelect.hip)
    else()
      faiss_gpu_test(TestGpuDistance.cu)
      faiss_gpu_test(TestGpuSelect.cu)
Can we use $HIP_OR_CU here?
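What the reviewer is suggesting, sketched out (using the GPU_EXT_PREFIX name proposed later in this review; illustrative only, the PR's exact spelling may differ):

```cmake
# Select the GPU source-file extension once, then reuse it, instead of
# duplicating every faiss_gpu_test() call inside an if/else.
if(USE_ROCM)
  set(GPU_EXT_PREFIX "hip")
else()
  set(GPU_EXT_PREFIX "cu")
endif()

faiss_gpu_test(TestGpuDistance.${GPU_EXT_PREFIX})
faiss_gpu_test(TestGpuSelect.${GPU_EXT_PREFIX})
```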
faiss/gpu/test/CMakeLists.txt (outdated)

    if (USE_ROCM)
      target_link_libraries(demo_ivfpq_indexing_gpu
        PRIVATE faiss gtest_main hip::host)
    else()
      target_link_libraries(demo_ivfpq_indexing_gpu
        PRIVATE faiss gtest_main CUDA::cudart)
    endif()
Can we use the conditional syntax here too, please?
faiss/python/CMakeLists.txt (outdated)

    if(USE_ROCM)
      find_package(HIP REQUIRED)
      target_link_libraries(swigfaiss PRIVATE hip::host)
      target_link_libraries(swigfaiss_avx2 PRIVATE hip::host)
      target_link_libraries(swigfaiss_avx512 PRIVATE hip::host)
    else()
Can we use conditional statements here and elsewhere in this file too?
faiss/gpu/CMakeLists.txt (outdated)

    @@ -197,6 +197,16 @@ function(generate_ivf_interleaved_code)
         "64|2048|8"
     )

    + if(USE_ROCM)
    +   set(CU_OR_HIP "hip")
Could you name it something like GPU_EXT_PREFIX?
@ItsPitt @iotamudelta I found a workaround to build the tests in our CI environment, so I think we should be good for a build-only setup for now. I left some initial comments on the PR. Can you please rebase as well? Once we address the comments, I will run the PR by the team and we will figure out what else needs to happen so we can merge it. After that, I will publish the PR to enable the build-only CI.
Here's a successful build with testing enabled: https://github.com/facebookresearch/faiss/actions/runs/10085927308/job/27887614699?pr=3622
@ramilbakhshyiev I'll work on these changes and get back to you.
Sounds great, thank you!
@ramilbakhshyiev I believe I addressed all issues. If there are more or I missed any, just let me know! Fixed indentation.
@ItsPitt That's great, thank you so much! Could you resolve conflicts and rebase one more time? I will import this and send it to the team for review as soon as you do.
@ramilbakhshyiev rebased and merged, hopefully correctly. Thanks for working on this!
@ramilbakhshyiev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.