Unroll loop in lookup_2_lanes #3364
This pull request is for a patch to faiss/utils/simdlib_emulated.h. The patch improves the performance of the bench_ivf_fastscan.py workload by unrolling the for loop in lookup_2_lanes so that the j and j + 16 iterations are done in the same pass, which eliminates the if (j < 16) statement. The loop is then unrolled further so that the j, j + 1, j + 16, and j + 17 iterations are done together. The code change reduces the execution time on Power 10 to about 45% of the original execution time.
@carll99 As of your PR, I have several notes:
OK, commenting out the original code and then adding the change is easy to do.
I would be happy to introduce simd_ppc64.h but need better direction on how exactly you want this done. Thanks. |
https://github.com/facebookresearch/faiss/blob/main/faiss/utils/simdlib.h needs to be modified in the following way
. Please feel free to let me know if you need any further assistance for that. |
Thanks for the discussion and the example showing what you have in mind; that is very helpful. In the example, as you say, the entire file is copied into the arch-specific include file. My first thought is that this gives a lot of duplicated code that has to be updated whenever a fix is made, which can be a bit of a maintenance headache. Copying the whole file probably makes more sense when there are multiple architecture-specific changes. In my case, I am thinking it might be better to put only the PPC version of lookup_2_lanes in the PPC-specific file, with something like #if defined(POWERPC_XYZ) #include <faiss/utils/simdlib_emulated_ppc64.h> #else. That way, updates to the other code in the file would not need to be replicated. At some point, if there are enough architecture-specific changes, it might make sense to replicate the entire file. My preference would be not to replicate a lot of code unnecessarily at this point. From your message, I think the above approach would be acceptable? Please let me know if it is, and I will just move lookup_2_lanes to the PPC file for now. Thanks.
@carll99 @mdouze what would be your opinion on this topic? |
OK, I will make a complete simdlib_ppc64.h file, not a problem. I was looking to see how AVX2 and aarch64 get defined. I see the various uses, but I do not see where they are defined; I cannot find anything in the faiss source tree. I have worked on some projects where this type of thing is determined by the make config stuff. Faiss uses cmake, which I am not that familiar with yet. Perhaps cmake can check the platform and set these defines?
@carll99 most of the defines are located in https://github.com/facebookresearch/faiss/blob/main/faiss/impl/platform_macros.h |
@carll99 you may also use https://godbolt.org to ensure that a compiler reacts properly on defines, if needed. Godbolt provides GCC for PowerPC compiler option as well |
I agree with @alexanderguzhva, please add a PPC specific implementation. |
I have updated the patch. I put the #if define in simdlib_emulated.h to either include the new simdlib_emulated_ppc64.h file or use the original simdlib_emulated.h file. The new simdlib_emulated_ppc64.h file has the original loop commented out, followed by the new loop implementation. I believe this is how you want it. It has been tested with Linux and GCC, and with AIX and the XLC clang compiler. It all seems to work fine. I am wondering what the procedure is in the FAISS project for posting an updated patch via github. Should I close this pull request and create a new one for the updated patch? Or do I just push the patch to my github (I think I have to do a forced update to overwrite the previous commit) and continue with this pull request? Note, I am a little new to using github for submitting patches (I am used to posting patches via a mailing list, as for GCC and GDB), so I am not really sure what the FAISS process is with github. Thanks for the help with the patch, the PR, and github.
@carll99 please just continue your work here. Your previous commit may be overridden(by using |
Not sure how I managed to close the pull request when I was updating my copy of faiss. Re-opened the pull request. I updated my fork of faiss to the current faiss branch and pushed the new version of the unroll loop in lookup_2_lanes patch. The patch has been tested on Power 10 running Linux with the GCC version 13.2.0-5 and on Power 10 running AIX and the XLC clang compiler using the bench_ivf_fastscan.py workload. The benchmark generates the same results, just faster. Please let me know if this version of the patch is acceptable or if there are any additional changes needed. Thank you for all your help and patience. |
Ah, I see, I didn't put the if define at the right level of the include files. Updated the patch to put the #if define into faiss/utils/simdlib.h not faiss/utils/simdlib_emulated.h. Hopefully I got it all correct this time. Thanks. |
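For readers following the #if discussion, the compile-time selection can be sketched as below. This is a minimal illustration, not the contents of faiss/utils/simdlib.h: the function name simd_backend and the returned labels are hypothetical, and the only macros used are the compiler-predefined __AVX2__, __aarch64__, and __PPC64__ that the thread mentions.

```cpp
// Hypothetical helper: reports which branch of an #if / #elif chain the
// compiler selected. In a real header each branch would hold an #include
// of the matching architecture-specific file instead of a return statement.
constexpr const char* simd_backend() {
#if defined(__AVX2__)
    return "avx2"; // x86-64 built with AVX2 enabled
#elif defined(__aarch64__)
    return "neon"; // 64-bit Arm
#elif defined(__PPC64__)
    return "ppc64"; // 64-bit Power; would pull in a file such as simdlib_ppc64.h
#else
    return "emulated"; // scalar fallback, as in simdlib_emulated.h
#endif
}
```

Because the chain ends in a plain #else, every platform gets exactly one implementation, and adding an architecture means adding one #elif branch rather than touching the fallback file.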
@carll99 , please change
into
And it will be good from my point of view after these two changes. |
Also, don't forget to fix ci/circleci formatting :) |
OK, I see where you are going with the #elif defined(PPC64). That is an easy fix. I should have figured out what you wanted earlier. Sorry. As for the comment "don't forget to fix ci/circleci formatting", I am not familiar with those terms. I tried googling them, but that wasn't very helpful. Is the comment about the formatting of the commit log or of the comments in the source code? Formatting of commit logs seems to be very project specific. If there is a document on formatting code and commit logs that the project follows, a link to it would be great. Again, thanks for the help with the patch.
@carll99 please read about |
The current loop in function lookup_2_lanes in faiss/utils/simdlib_emulated.h goes from 0 to 31. It has an if statement that does one assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel, the if (j < 16) is eliminated and the number of loop iterations is cut in half. The loop is then unrolled to a depth of 2 for both the j < 16 and the j >= 16 halves. This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release. Removing the if (j < 16) statement and unrolling the loop removes branch cycle stalls and register dependencies on instruction issue. As a result, the unrolled code is able to issue instructions earlier, reducing the total number of cycles required to execute the function. This patch makes a copy of faiss/utils/simdlib_emulated.h and names it faiss/utils/simdlib_ppc64.h. The new file has the new version of lookup_2_lanes. The new file gets included in faiss/utils/simdlib.h if the define __PPC64__ is set by the GCC compiler on Linux or the XLC clang compiler on AIX.
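The transformation described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: Bytes32, pick, and both function names are hypothetical stand-ins for the emulated 32-byte vector type, and the rule that an index byte with its high bit set yields 0 is assumed here from the usual byte-shuffle semantics rather than taken from the description.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the emulated 32-byte vector (two 16-byte lanes).
using Bytes32 = std::array<uint8_t, 32>;

// Baseline shape: 32 iterations, each with a branch choosing the lane.
Bytes32 lookup_2_lanes_baseline(const Bytes32& table, const Bytes32& idx) {
    Bytes32 c{};
    for (int j = 0; j < 32; j++) {
        if (idx[j] & 0x80) {
            c[j] = 0; // assumed: high bit set selects zero
        } else {
            uint8_t i = idx[j] & 15;
            if (j < 16) {
                c[j] = table[i]; // lower lane
            } else {
                c[j] = table[16 + i]; // upper lane
            }
        }
    }
    return c;
}

// One lookup within the lane that starts at lane_base (0 or 16).
static inline uint8_t pick(const Bytes32& table, int lane_base, uint8_t index) {
    return (index & 0x80) ? 0 : table[lane_base + (index & 15)];
}

// Unrolled shape: each pass handles j, j + 1, j + 16 and j + 17, so the
// lane is known statically and the (j < 16) test disappears.
Bytes32 lookup_2_lanes_unrolled(const Bytes32& table, const Bytes32& idx) {
    Bytes32 c{};
    for (int j = 0; j < 16; j += 2) {
        c[j] = pick(table, 0, idx[j]);
        c[j + 1] = pick(table, 0, idx[j + 1]);
        c[j + 16] = pick(table, 16, idx[j + 16]);
        c[j + 17] = pick(table, 16, idx[j + 17]);
    }
    return c;
}
```

The four assignments in each pass write to distinct output bytes and read from independent inputs, which is what lets the processor issue them without waiting on one another. The loop runs 8 times instead of 32.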
The pull request seems to get closed every time I resync with main and throw out my old patch before I push the updated patch. Reopening. |
I updated the patch again per the comments about how I should be doing the #if define. I also ran the clang format commands:
and didn't see any messages about the format being wrong. Please let me know if there are any additional issues. Retested the patch on Power 10 to verify it builds and runs as expected. Thanks. |
lgtm |
OK, Thanks. So what happens next? I am guessing someone will pull my change into mainline and then close the pull request? |
@carll99 some guys from Meta should accept your PR and it will be merged into the baseline |
@junjieqi could you please take a look? Thanks |
@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: The current loop goes from 0 to 31. It has an if statement that does one assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel, the if (j < 16) is eliminated and the number of loop iterations is cut in half. The loop is then unrolled to a depth of 2 for both the j < 16 and the j >= 16 halves. This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release. Removing the if (j < 16) statement and unrolling the loop removes branch cycle stalls and register dependencies on instruction issue. As a result, the unrolled code is able to issue instructions earlier, reducing the total number of cycles required to execute the function. Pull Request resolved: facebookresearch#3364 Reviewed By: kuarora Differential Revision: D56455690 Pulled By: mdouze fbshipit-source-id: 490a17a40d9d4439b1a8ea22e991e706d68fb2fa