Unroll loop in lookup_2_lanes #3364

carll99 · 2024-04-11T20:17:45Z

The current loop goes from 0 to 31. It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half.

Then unroll the loop for the j < 16 and the j >=16 to a depth of 2.

This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release.

The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function.

carll99 · 2024-04-11T22:04:49Z

The pull request if for a patch to faiss/utils/simdlib_emulated.h. The patch improves the performance of the bench_ivf_fastscan.py workload by unrolling the for loop to eliminate the if (j < 16) statement by doing the j and j + 16 iterations in parallel. The loop is then further unrolled to do the j, j+1, j+16 and j+17 iterations in parallel. The code change reduces the execution time on Power 10 to about 45% of the original execution time.

alexanderguzhva · 2024-04-11T22:40:02Z

@carll99
Would it be possible to introduce simd_ppc64.h instead?
Basically, there are news on the street that SIMD-related patches for AIX are being worked on as well. Also, I will propagate this changes for AIX into Milvus soon. So, this simd_ppc64.h is going to happen anyways, sooner or later.

As of your PR, I have several notes:

could you please keep the baseline in the commented form, in the way it is done here

faiss/faiss/impl/HNSW.cpp

Line 574 in 40e8643

// // baseline version

. The reason is that the code for the baseline version is clear and short.
Faiss supports multiple architectures, so I wonder whether your changes may hurt some other platform
I've checked the change in godbolt (https://godbolt.org/z/K9bM98TY3) and it takes some time to see why the execution time got better tbh

carll99 · 2024-04-11T22:54:49Z

OK, commenting out the original code and then adding the change is easy to do.
The patch was tried on AIX and showed similar performance improvement. I have not tested it on other architectures.

Would it be possible to introduce simd_ppc64.h instead?
Yes, but I am not sure exactly what you want me to do?

Do you want simd_ppc64.h to be included into simdlib_emulated.h?
The function lookup_2_lanes is the key loop so I would think that all architectures would need it so putting it in a ppc specific include file seems wrong. Not clear that we want to have two versions of the function, i.e. one in simd_ppc64.h and simdlib_emulated.h?

I would be happy to introduce simd_ppc64.h but need better direction on how exactly you want this done. Thanks.

alexanderguzhva · 2024-04-12T13:54:01Z

@carll99

https://github.com/facebookresearch/faiss/blob/main/faiss/utils/simdlib.h needs to be modified in the following way

#if defined(__AVX512F__)

#include <faiss/utils/simdlib_avx2.h>
#include <faiss/utils/simdlib_avx512.h>

#elif defined(__AVX2__)

#include <faiss/utils/simdlib_avx2.h>

#elif defined(__aarch64__)

#include <faiss/utils/simdlib_neon.h>

#elif defined(POWERPC_XYZ)

#include <faiss/utils/simdlib_ppc64.h>

#else

// emulated = all operations are implemented as scalars
#include <faiss/utils/simdlib_emulated.h>

#endif

.
As of one, you may just literally copy-paste simdlib_emulated.h into a new simdlib_ppc64.h and integrate your change on top of it. If you check simdlib_avx2.h, for example, you'll see that it completely reimplements everything from simdlib_emulated.h, so there is no way that both of the headers will be used simultaneosly. Although, simdlib_avx512.h is a bit different story because it is unfinished.

Please feel free to let me know if you need any further assistance for that.

carll99 · 2024-04-12T17:49:46Z

Thanks for the discussion and the example showing what you are thinking that is very helpful. In the example as you say, the entire file is copied into the arch specific include file. My first thought is that gives a lot of duplicated code that needs to be updated when some fix is made. It can be a bit of a maintenance headache. That probably makes a lot more sense when there are multiple architectural changes. In my case, I am thinking it might be better to just put the PPC version of the lookup_2_lane in the ppc specific file with something like

#if defined(POWERPC_XYZ)

#include <faiss/utils/simdlib_emulated_ppc64.h>

#else
// The very important operation that everything relies on
simd32uint8 lookup_2_lanes(const simd32uint8& idx) const {
simd32uint8 c;
for (int j = 0; j < 32; j++) {
if (idx.u8[j] & 0x80) {
c.u8[j] = 0;
} else {
uint8_t i = idx.u8[j] & 15;
if (j < 16) {
c.u8[j] = u8[i];
} else {
c.u8[j] = u8[16 + i];
}
}
}
return c;
}
#endif

That way updates to the other stuff in the file will not need to be replicated. At some point if there are enough architecture specific changes it might make sense to replicate the entire file. My preference would be to not replicate a lot of stuff unnecessarily at this point.

From your message, I think the above approach would be acceptable? Please if it is and I will just move lookup_2_lanes to the PPC file for now. Thanks.

alexanderguzhva · 2024-04-12T19:40:27Z

@carll99
simdlib is not something that gets changed. Very-very unlikely.
So, I personally would definitely prefer having a dedicated simdlib_ppc64.h file, especially given that there will be more updates to PowerPC-related code. The duplicate code is not a big deal in this particular case.

@mdouze what would be your opinion on this topic?

carll99 · 2024-04-12T20:03:16Z

OK, I will make a complete simdlib_ppc64.h file, not a problem.

I was looking to see how AVX2 and aarch64 get defined... I see the various uses but I am not seeing where they get defined? I am not seeing anything in the faiss source tree. I have worked on some projects where this type of thing is determined as by the make config stuff. Faiss is using cmake, I am not that familiar yet with cmake, perhaps cmake can check the platform and set these defines?

alexanderguzhva · 2024-04-12T20:15:45Z

@carll99 most of the defines are located in https://github.com/facebookresearch/faiss/blob/main/faiss/impl/platform_macros.h
Certain defines, such as __aarch64__ are provided by a C++ compiler. There should be some compiler-provided define for the Power architecture as well.

alexanderguzhva · 2024-04-12T20:18:06Z

@carll99 you may also use https://godbolt.org to ensure that a compiler reacts properly on defines, if needed. Godbolt provides GCC for PowerPC compiler option as well

mdouze · 2024-04-15T17:29:29Z

I agree with @alexanderguzhva, please add a PPC specific implementation.
The simdlib_emulated.h file has to remain as simple as possible to serve as a documentation.
Altivec was the first type of SIMD programming that I worked with so I'd love to see it supported ;-)

carll99 · 2024-04-15T22:45:16Z

I have updated the patch. I Put the #if define in simdlib_emulated.h to either include the new simdlib_emulated_ppc64.h file or use the original simdlib_emulated.h file. The new simdlib_emulated_ppc64.h file has the original loop commented out followed by the new loop implementation. I believe this is how you want it. It has been tested with Linux and GCC and AIX with the XLC clang compiler. It all seems to work fine.

I am wondering what the procedure is for the FAISS project for posting an updated patch via github? Should I close out this pull request and then create a new one for the updated patch? Or do I just push the patch to my github (I think I have to do a forced update to overwrite the previous commit) and continue with this pull request. Note, I am a little new to using github for submitting patches (I am used to posting patches via a mailing list, GCC and GDB) so not really sure what the FAISS process is with github. Thanks for the help with patch, the PR and github.

alexanderguzhva · 2024-04-16T11:12:10Z

@carll99 please just continue your work here. Your previous commit may be overridden(by using git push --force.

carll99 · 2024-04-16T15:36:49Z

Not sure how I managed to close the pull request when I was updating my copy of faiss. Re-opened the pull request.

I updated my fork of faiss to the current faiss branch and pushed the new version of the unroll loop in lookup_2_lanes patch.
This patch makes a copy of faiss/utils/simdlib_emulated.h and names it faiss/utils/simdlib_emulated_ppc64.h. The new file has the new version of lookup_2_lanes. The new included file is used if the define PPC64 is set by the GCC or XLC clang compiler.

The patch has been tested on Power 10 running Linux with the GCC version 13.2.0-5 and on Power 10 running AIX and the XLC clang compiler using the bench_ivf_fastscan.py workload. The benchmark generates the same results, just faster.

Please let me know if this version of the patch is acceptable or if there are any additional changes needed.

Thank you for all your help and patience.

alexanderguzhva · 2024-04-16T16:58:41Z

@carll99

Please keep simdlib_emulated.h unchanged.
Please add #ifdef + #else into simdlib.h in the same way it is done for other platforms
Thanks

carll99 · 2024-04-16T20:07:57Z

Ah, I see, I didn't put the if define at the right level of the include files. Updated the patch to put the #if define into faiss/utils/simdlib.h not faiss/utils/simdlib_emulated.h. Hopefully I got it all correct this time. Thanks.

alexanderguzhva · 2024-04-16T23:08:44Z

Ah, I see, I didn't put the if define at the right level of the include files. Updated the patch to put the #if define into faiss/utils/simdlib.h not faiss/utils/simdlib_emulated.h. Hopefully I got it all correct this time. Thanks.

@carll99 , please change
1.

#else

// emulated = all operations are implemented as scalars
#if defined(__PPC64__)

into

#elif defined(__PPC64__)

rename simdlib_emulated_ppc64.h to simdlib_ppc64.h

And it will be good from my point of view after these two changes.
Thanks

alexanderguzhva · 2024-04-16T23:09:03Z

Also, don't forget to fix ci/circleci formatting :)

carll99 · 2024-04-16T23:44:33Z

OK, I see where you are going with the #elif defined(PPC64). That is an easy fix. I should have figured out what you wanted earlier. Sorry.

As for the comment "don't forget to fix ci/circleci formatting ", I am not familiar with those terms. I tried googling them but that wasn't very helpful. Is the comment talking about the formatting of the commit log or comments in the source code??? Formatting of commit logs seems to be very project specific. If there is some document for formatting code/commit logs that the projects follows, a link to that would be great.

Again, thanks for the help with the patch.

alexanderguzhva · 2024-04-17T01:04:26Z

@carll99 please read about clang-format.
Basically, you need to go to the root of your faiss installation and run something like clang-format -i faiss/utils/simdlib.h and clang-format -i faiss/utils/simdlib_ppc64.h. This lets clang-format tool refactor provided files inplace according to faiss code style, and you may put these two changes files into git and update your PR. For example, check ab2b7f5 , clang-format was applied.
If you see that clang-format changed nothing in the files that you've changed, then it is good to go.

The current loop in function lookup_2_lanes infais/utils/simdlib_emulated.h goes from 0 to 31. It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half. Then unroll the loop for the j < 16 and the j >=16 to a depth of 2. This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release. The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function. This patch makes a copy of faiss/utils/simdlib_emulated.h and names it faiss/utils/simdlib_ppc64.h. The new file has the new version of lookup_2_lanes. The new included file is gets included in file faiss/utils/simdlib.h if the define __PPC64__ is set by the GCC compiler on Linux or the XLC clang compiler for AIX.

carll99 · 2024-04-17T15:33:55Z

The pull request seems to get closed every time I resync with main and throw out my old patch before I push the updated patch. Reopening.

carll99 · 2024-04-17T15:36:32Z

I updated the patch again per the comments about how I should be doing the #if define. I also ran the clang format commands:

  clang-format -i faiss/utils/simdlib.h
  clang-format -i faiss/utils/simdlib_ppc64.h

and didn't see any messages about the format being wrong. Please let me know if there are any additional issues.

Retested the patch on Power 10 to verify it builds and runs as expected.

Thanks.

alexanderguzhva · 2024-04-17T17:56:30Z

lgtm

carll99 · 2024-04-18T15:21:10Z

OK, Thanks. So what happens next? I am guessing someone will pull my change into mainline and then close the pull request?

alexanderguzhva · 2024-04-18T15:38:19Z

@carll99 some guys from Meta should accept your PR and it will be merged into the baseline

alexanderguzhva · 2024-04-18T15:52:41Z

@junjieqi could you please take a look? Thanks

facebook-github-bot · 2024-04-23T06:10:24Z

@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-04-24T09:09:29Z

@mdouze merged this pull request in b2e91f6.

Summary: The current loop goes from 0 to 31. It has an if statement to do an assignment for j < 16 and a different assignment for j >= 16. By unrolling the loop to do the j < 16 and the j >= 16 iterations in parallel the if j < 16 is eliminated and the number of loop iterations is reduced in half. Then unroll the loop for the j < 16 and the j >=16 to a depth of 2. This change results in approximately a 55% reduction in the execution time for the bench_ivf_fastscan.py workload on Power 10 when compiled with CMAKE_INSTALL_CONFIG_NAME=Release. The removal of the if (j < 16) statement and the unrolling of the loop removes branch cycle stall and register dependencies on instruction issue. The result is the unrolled code is able issue instructions earlier thus reducing the total number of cycles required to execute the function. Pull Request resolved: facebookresearch#3364 Reviewed By: kuarora Differential Revision: D56455690 Pulled By: mdouze fbshipit-source-id: 490a17a40d9d4439b1a8ea22e991e706d68fb2fa

facebook-github-bot added the CLA Signed label Apr 11, 2024

carll99 force-pushed the main branch from 2065087 to 15f1943 Compare April 16, 2024 15:23

carll99 closed this Apr 16, 2024

carll99 force-pushed the main branch from 030a6dd to 366a814 Compare April 16, 2024 15:27

carll99 reopened this Apr 16, 2024

carll99 force-pushed the main branch from 6fc62fd to f1c2700 Compare April 16, 2024 20:04

carll99 force-pushed the main branch from f1c2700 to 9df52ab Compare April 17, 2024 15:25

carll99 closed this Apr 17, 2024

carll99 force-pushed the main branch from 9df52ab to 0169f29 Compare April 17, 2024 15:26

carll99 reopened this Apr 17, 2024

facebook-github-bot closed this in b2e91f6 Apr 24, 2024

facebook-github-bot added the Merged label Apr 24, 2024

alexanderguzhva mentioned this pull request Apr 24, 2024

[cherrypick] Unroll loop in lookup_2_lanes for PQFasfScan for PPC64 zilliztech/knowhere#524

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unroll loop in lookup_2_lanes #3364

Unroll loop in lookup_2_lanes #3364

carll99 commented Apr 11, 2024

carll99 commented Apr 11, 2024

alexanderguzhva commented Apr 11, 2024

carll99 commented Apr 11, 2024

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

carll99 commented Apr 12, 2024

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

carll99 commented Apr 12, 2024

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

alexanderguzhva commented Apr 12, 2024

mdouze commented Apr 15, 2024

carll99 commented Apr 15, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 17, 2024

carll99 commented Apr 17, 2024

carll99 commented Apr 17, 2024 •

edited

Loading

alexanderguzhva commented Apr 17, 2024

carll99 commented Apr 18, 2024

alexanderguzhva commented Apr 18, 2024

alexanderguzhva commented Apr 18, 2024

facebook-github-bot commented Apr 23, 2024

facebook-github-bot commented Apr 24, 2024

Unroll loop in lookup_2_lanes #3364

Unroll loop in lookup_2_lanes #3364

Conversation

carll99 commented Apr 11, 2024

carll99 commented Apr 11, 2024

alexanderguzhva commented Apr 11, 2024

carll99 commented Apr 11, 2024

alexanderguzhva commented Apr 12, 2024 • edited Loading

carll99 commented Apr 12, 2024

alexanderguzhva commented Apr 12, 2024 • edited Loading

carll99 commented Apr 12, 2024

alexanderguzhva commented Apr 12, 2024 • edited Loading

alexanderguzhva commented Apr 12, 2024

mdouze commented Apr 15, 2024

carll99 commented Apr 15, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

alexanderguzhva commented Apr 16, 2024

carll99 commented Apr 16, 2024

alexanderguzhva commented Apr 17, 2024

carll99 commented Apr 17, 2024

carll99 commented Apr 17, 2024 • edited Loading

alexanderguzhva commented Apr 17, 2024

carll99 commented Apr 18, 2024

alexanderguzhva commented Apr 18, 2024

alexanderguzhva commented Apr 18, 2024

facebook-github-bot commented Apr 23, 2024

facebook-github-bot commented Apr 24, 2024

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

alexanderguzhva commented Apr 12, 2024 •

edited

Loading

carll99 commented Apr 17, 2024 •

edited

Loading