Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unroll loop in lookup_2_lanes #3364

Closed
wants to merge 1 commit into from
Closed

Conversation

carll99
Copy link

@carll99 carll99 commented Apr 11, 2024

The current loop goes from 0 to 31. It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half.

Then unroll the loop for the j < 16 and the j >=16 to a depth of 2.

This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release.

The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function.

@carll99
Copy link
Author

carll99 commented Apr 11, 2024

The pull request if for a patch to faiss/utils/simdlib_emulated.h. The patch improves the performance of the bench_ivf_fastscan.py workload by unrolling the for loop to eliminate the if (j < 16) statement by doing the j and j + 16 iterations in parallel. The loop is then further unrolled to do the j, j+1, j+16 and j+17 iterations in parallel. The code change reduces the execution time on Power 10 to about 45% of the original execution time.

@alexanderguzhva
Copy link
Contributor

@carll99
Would it be possible to introduce simd_ppc64.h instead?
Basically, there are news on the street that SIMD-related patches for AIX are being worked on as well. Also, I will propagate this changes for AIX into Milvus soon. So, this simd_ppc64.h is going to happen anyways, sooner or later.

As of your PR, I have several notes:

  1. could you please keep the baseline in the commented form, in the way it is done here
    // // baseline version
    . The reason is that the code for the baseline version is clear and short.
  2. Faiss supports multiple architectures, so I wonder whether your changes may hurt some other platform
  3. I've checked the change in godbolt (https://godbolt.org/z/K9bM98TY3) and it takes some time to see why the execution time got better tbh

@carll99
Copy link
Author

carll99 commented Apr 11, 2024

OK, commenting out the original code and then adding the change is easy to do.
The patch was tried on AIX and showed similar performance improvement. I have not tested it on other architectures.

Would it be possible to introduce simd_ppc64.h instead?
Yes, but I am not sure exactly what you want me to do?

  • Do you want simd_ppc64.h to be included into simdlib_emulated.h?
  • The function lookup_2_lanes is the key loop so I would think that all architectures would need it so putting it in a ppc specific include file seems wrong. Not clear that we want to have two versions of the function, i.e. one in simd_ppc64.h and simdlib_emulated.h?

I would be happy to introduce simd_ppc64.h but need better direction on how exactly you want this done. Thanks.

@alexanderguzhva
Copy link
Contributor

alexanderguzhva commented Apr 12, 2024

@carll99

https://github.com/facebookresearch/faiss/blob/main/faiss/utils/simdlib.h needs to be modified in the following way

#if defined(__AVX512F__)

#include <faiss/utils/simdlib_avx2.h>
#include <faiss/utils/simdlib_avx512.h>

#elif defined(__AVX2__)

#include <faiss/utils/simdlib_avx2.h>

#elif defined(__aarch64__)

#include <faiss/utils/simdlib_neon.h>

#elif defined(POWERPC_XYZ)

#include <faiss/utils/simdlib_ppc64.h>

#else

// emulated = all operations are implemented as scalars
#include <faiss/utils/simdlib_emulated.h>

#endif

.
As of one, you may just literally copy-paste simdlib_emulated.h into a new simdlib_ppc64.h and integrate your change on top of it. If you check simdlib_avx2.h, for example, you'll see that it completely reimplements everything from simdlib_emulated.h, so there is no way that both of the headers will be used simultaneosly. Although, simdlib_avx512.h is a bit different story because it is unfinished.

Please feel free to let me know if you need any further assistance for that.

@carll99
Copy link
Author

carll99 commented Apr 12, 2024

Thanks for the discussion and the example showing what you are thinking that is very helpful. In the example as you say, the entire file is copied into the arch specific include file. My first thought is that gives a lot of duplicated code that needs to be updated when some fix is made. It can be a bit of a maintenance headache. That probably makes a lot more sense when there are multiple architectural changes. In my case, I am thinking it might be better to just put the PPC version of the lookup_2_lane in the ppc specific file with something like

#if defined(POWERPC_XYZ)

#include <faiss/utils/simdlib_emulated_ppc64.h>

#else
// The very important operation that everything relies on
simd32uint8 lookup_2_lanes(const simd32uint8& idx) const {
simd32uint8 c;
for (int j = 0; j < 32; j++) {
if (idx.u8[j] & 0x80) {
c.u8[j] = 0;
} else {
uint8_t i = idx.u8[j] & 15;
if (j < 16) {
c.u8[j] = u8[i];
} else {
c.u8[j] = u8[16 + i];
}
}
}
return c;
}
#endif

That way updates to the other stuff in the file will not need to be replicated. At some point if there are enough architecture specific changes it might make sense to replicate the entire file. My preference would be to not replicate a lot of stuff unnecessarily at this point.

From your message, I think the above approach would be acceptable? Please if it is and I will just move lookup_2_lanes to the PPC file for now. Thanks.

@alexanderguzhva
Copy link
Contributor

alexanderguzhva commented Apr 12, 2024

@carll99
simdlib is not something that gets changed. Very-very unlikely.
So, I personally would definitely prefer having a dedicated simdlib_ppc64.h file, especially given that there will be more updates to PowerPC-related code. The duplicate code is not a big deal in this particular case.

@mdouze what would be your opinion on this topic?

@carll99
Copy link
Author

carll99 commented Apr 12, 2024

OK, I will make a complete simdlib_ppc64.h file, not a problem.

I was looking to see how AVX2 and aarch64 get defined... I see the various uses but I am not seeing where they get defined? I am not seeing anything in the faiss source tree. I have worked on some projects where this type of thing is determined as by the make config stuff. Faiss is using cmake, I am not that familiar yet with cmake, perhaps cmake can check the platform and set these defines?

@alexanderguzhva
Copy link
Contributor

alexanderguzhva commented Apr 12, 2024

@carll99 most of the defines are located in https://github.com/facebookresearch/faiss/blob/main/faiss/impl/platform_macros.h
Certain defines, such as __aarch64__ are provided by a C++ compiler. There should be some compiler-provided define for the Power architecture as well.

@alexanderguzhva
Copy link
Contributor

@carll99 you may also use https://godbolt.org to ensure that a compiler reacts properly on defines, if needed. Godbolt provides GCC for PowerPC compiler option as well

@mdouze
Copy link
Contributor

mdouze commented Apr 15, 2024

I agree with @alexanderguzhva, please add a PPC specific implementation.
The simdlib_emulated.h file has to remain as simple as possible to serve as a documentation.
Altivec was the first type of SIMD programming that I worked with so I'd love to see it supported ;-)

@carll99
Copy link
Author

carll99 commented Apr 15, 2024

I have updated the patch. I Put the #if define in simdlib_emulated.h to either include the new simdlib_emulated_ppc64.h file or use the original simdlib_emulated.h file. The new simdlib_emulated_ppc64.h file has the original loop commented out followed by the new loop implementation. I believe this is how you want it. It has been tested with Linux and GCC and AIX with the XLC clang compiler. It all seems to work fine.

I am wondering what the procedure is for the FAISS project for posting an updated patch via github? Should I close out this pull request and then create a new one for the updated patch? Or do I just push the patch to my github (I think I have to do a forced update to overwrite the previous commit) and continue with this pull request. Note, I am a little new to using github for submitting patches (I am used to posting patches via a mailing list, GCC and GDB) so not really sure what the FAISS process is with github. Thanks for the help with patch, the PR and github.

@alexanderguzhva
Copy link
Contributor

@carll99 please just continue your work here. Your previous commit may be overridden(by using git push --force.

@carll99
Copy link
Author

carll99 commented Apr 16, 2024

Not sure how I managed to close the pull request when I was updating my copy of faiss. Re-opened the pull request.

I updated my fork of faiss to the current faiss branch and pushed the new version of the unroll loop in lookup_2_lanes patch.
This patch makes a copy of faiss/utils/simdlib_emulated.h and names it faiss/utils/simdlib_emulated_ppc64.h. The new file has the new version of lookup_2_lanes. The new included file is used if the define PPC64 is set by the GCC or XLC clang compiler.

The patch has been tested on Power 10 running Linux with the GCC version 13.2.0-5 and on Power 10 running AIX and the XLC clang compiler using the bench_ivf_fastscan.py workload. The benchmark generates the same results, just faster.

Please let me know if this version of the patch is acceptable or if there are any additional changes needed.

Thank you for all your help and patience.

@alexanderguzhva
Copy link
Contributor

@carll99

  1. Please keep simdlib_emulated.h unchanged.
  2. Please add #ifdef + #else into simdlib.h in the same way it is done for other platforms
    Thanks

@carll99
Copy link
Author

carll99 commented Apr 16, 2024

Ah, I see, I didn't put the if define at the right level of the include files. Updated the patch to put the #if define into faiss/utils/simdlib.h not faiss/utils/simdlib_emulated.h. Hopefully I got it all correct this time. Thanks.

@alexanderguzhva
Copy link
Contributor

Ah, I see, I didn't put the if define at the right level of the include files. Updated the patch to put the #if define into faiss/utils/simdlib.h not faiss/utils/simdlib_emulated.h. Hopefully I got it all correct this time. Thanks.

@carll99 , please change
1.

#else

// emulated = all operations are implemented as scalars
#if defined(__PPC64__)

into

#elif defined(__PPC64__)
  1. rename simdlib_emulated_ppc64.h to simdlib_ppc64.h

And it will be good from my point of view after these two changes.
Thanks

@alexanderguzhva
Copy link
Contributor

Also, don't forget to fix ci/circleci formatting :)

@carll99
Copy link
Author

carll99 commented Apr 16, 2024

OK, I see where you are going with the #elif defined(PPC64). That is an easy fix. I should have figured out what you wanted earlier. Sorry.

As for the comment "don't forget to fix ci/circleci formatting ", I am not familiar with those terms. I tried googling them but that wasn't very helpful. Is the comment talking about the formatting of the commit log or comments in the source code??? Formatting of commit logs seems to be very project specific. If there is some document for formatting code/commit logs that the projects follows, a link to that would be great.

Again, thanks for the help with the patch.

@alexanderguzhva
Copy link
Contributor

@carll99 please read about clang-format.
Basically, you need to go to the root of your faiss installation and run something like clang-format -i faiss/utils/simdlib.h and clang-format -i faiss/utils/simdlib_ppc64.h. This lets clang-format tool refactor provided files inplace according to faiss code style, and you may put these two changes files into git and update your PR. For example, check ab2b7f5 , clang-format was applied.
If you see that clang-format changed nothing in the files that you've changed, then it is good to go.

The current loop in function lookup_2_lanes infais/utils/simdlib_emulated.h
goes from 0 to 31.  It has an if statement to do an assignment for j < 16
and a different assignment for j >= 16.  By unrolling the loop to do the
j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and
the number of loop iterations is reduced in half.

Then unroll the loop for the j < 16 and the j >=16 to a depth of 2.

This change results in approximately a 55% reduction in the execution time
for the bench_ivf_fastscan.py workload on Power 10 when compiled with
CMAKE_INSTALL_CONFIG_NAME=Release.

The removal of the if (j < 16) statement and the unrolling of the loop
removes branch cycle stall and register dependencies on instruction issue.
The result is the unrolled code is able issue instructions earlier thus
reducing the total number of cycles required to execute the function.

This patch makes a copy of faiss/utils/simdlib_emulated.h and names it
faiss/utils/simdlib_ppc64.h.  The new file has the new version of
lookup_2_lanes.  The new included file is gets included in file
faiss/utils/simdlib.h if the define __PPC64__ is set by the GCC
compiler on Linux or the XLC clang compiler for AIX.
@carll99 carll99 reopened this Apr 17, 2024
@carll99
Copy link
Author

carll99 commented Apr 17, 2024

The pull request seems to get closed every time I resync with main and throw out my old patch before I push the updated patch. Reopening.

@carll99
Copy link
Author

carll99 commented Apr 17, 2024

I updated the patch again per the comments about how I should be doing the #if define. I also ran the clang format commands:

  clang-format -i faiss/utils/simdlib.h
  clang-format -i faiss/utils/simdlib_ppc64.h

and didn't see any messages about the format being wrong. Please let me know if there are any additional issues.

Retested the patch on Power 10 to verify it builds and runs as expected.

Thanks.

@alexanderguzhva
Copy link
Contributor

lgtm

@carll99
Copy link
Author

carll99 commented Apr 18, 2024

OK, Thanks. So what happens next? I am guessing someone will pull my change into mainline and then close the pull request?

@alexanderguzhva
Copy link
Contributor

@carll99 some guys from Meta should accept your PR and it will be merged into the baseline

@alexanderguzhva
Copy link
Contributor

@junjieqi could you please take a look? Thanks

@facebook-github-bot
Copy link
Contributor

@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mdouze merged this pull request in b2e91f6.

abhinavdangeti pushed a commit to blevesearch/faiss that referenced this pull request Jul 12, 2024
Summary:
The current loop goes from 0 to 31.  It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16.  By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half.

Then unroll the loop for the j < 16 and the j >=16 to a depth of 2.

This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release.

The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function.

Pull Request resolved: facebookresearch#3364

Reviewed By: kuarora

Differential Revision: D56455690

Pulled By: mdouze

fbshipit-source-id: 490a17a40d9d4439b1a8ea22e991e706d68fb2fa
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this pull request Oct 17, 2024
Summary:
The current loop goes from 0 to 31.  It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16.  By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half.

Then unroll the loop for the j < 16 and the j >=16 to a depth of 2.

This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release.

The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function.

Pull Request resolved: facebookresearch#3364

Reviewed By: kuarora

Differential Revision: D56455690

Pulled By: mdouze

fbshipit-source-id: 490a17a40d9d4439b1a8ea22e991e706d68fb2fa
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants