Improve performance of IndexPQFastScan and IndexIVFPQFastScan on aarch64 and non-AVX2 devices #1812
Comments
That looks great! Thanks so much for working on this. Of course, we would welcome this as a PR.
Summary: related: #1812

This PR improves the performance of the contents of `simdlib_emulated.h`. `IndexPQFastScan` and `IndexIVFPQFastScan` become faster in non-AVX2 environments, e.g., 4x faster on SIFT1M.

This PR contains the following changes:
- Use a `template` parameter instead of `std::function` for the function argument of `unary_func` and `binary_func`
  - `std::function` hinders some optimizations, such as function inlining.
- Use `const T&` instead of `T` for function arguments of vector classes such as `simd16uint16`
  - The vector classes in `simdlib_emulated.h` store their data in an array member, so copying them at runtime is not cheap.
  - Passing by const lvalue reference avoids the copy.

Pull Request resolved: #1814
Reviewed By: beauby
Differential Revision: D27760072
Pulled By: mdouze
fbshipit-source-id: cbc5a14658d1960b24ce55a395e71c80998742dc
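The two changes described in this commit message can be illustrated with a minimal sketch. It is not the Faiss code: `EmuVec16`, `unary_func_erased`, `shift_right`, and `add` are hypothetical names chosen for illustration, and only `unary_func` and `binary_func` are named in the PR. The sketch assumes an emulated vector type that keeps its lanes in a plain array, as the commit message describes.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an emulated SIMD vector. Like the classes the
// commit message describes, it stores its 16 lanes in a plain array member.
struct EmuVec16 {
    uint16_t u16[16];
};

// "Before" style: the callable is type-erased through std::function and the
// vector is passed by value, so every call copies 32 bytes and goes through
// an indirection that is hard for the compiler to inline.
inline EmuVec16 unary_func_erased(
        EmuVec16 a,
        std::function<uint16_t(uint16_t)> f) {
    EmuVec16 c;
    for (int j = 0; j < 16; j++) {
        c.u16[j] = f(a.u16[j]);
    }
    return c;
}

// "After" style: the callable is a template parameter, so its concrete type
// is visible at the call site, and the vector is passed by const reference.
template <typename F>
inline EmuVec16 unary_func(const EmuVec16& a, F&& f) {
    EmuVec16 c;
    for (int j = 0; j < 16; j++) {
        c.u16[j] = f(a.u16[j]);
    }
    return c;
}

template <typename F>
inline EmuVec16 binary_func(const EmuVec16& a, const EmuVec16& b, F&& f) {
    EmuVec16 c;
    for (int j = 0; j < 16; j++) {
        c.u16[j] = f(a.u16[j], b.u16[j]);
    }
    return c;
}

// Lane-wise helpers built on the templated versions (hypothetical names).
inline EmuVec16 shift_right(const EmuVec16& a, int shift) {
    return unary_func(a, [shift](uint16_t x) { return uint16_t(x >> shift); });
}

inline EmuVec16 add(const EmuVec16& a, const EmuVec16& b) {
    return binary_func(
            a, b, [](uint16_t x, uint16_t y) { return uint16_t(x + y); });
}
```

Both styles compute the same lane-wise result. The difference is only in how much the compiler can see and how much data each call copies, which is why the gain shows up as speed rather than as a behavior change.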
…4 devices (#1815)

Summary: related: #1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.

The contents of this PR are as follows:
- Add `simdlib_neon.h`
  - `simdlib_neon.h` has a `simdlib`-compatible API implemented with Arm NEON intrinsics.
  - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven`, `getodd`, `getlow128`, and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h`.
- Port `geteven`, `getodd`, `getlow128`, and `gethigh128` to non-AVX2 environments.
  - These functions were implemented with AVX2 intrinsics, which prevented implementing `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
  - Now `simdlib_avx2.h`, `simdlib_emulated.h`, and `simdlib_neon.h` all provide these functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
  - With the change above, `compute_PQ_dis_tables_dsub2` no longer depends on AVX2-only implementations of `geteven` and the related functions.
  - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is a little faster than the current implementation, so it is enabled.
  - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than the current implementation, so it is not enabled in this PR.

Pull Request resolved: #1815
Reviewed By: beauby
Differential Revision: D27760259
Pulled By: mdouze
fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
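The header selection described in this commit message can be sketched as a compile-time switch. This is an illustrative sketch only: the commit message confirms the `__aarch64__` check, while the `__AVX2__` check and the order of the branches are assumptions, and `SKETCH_SIMD_BACKEND` and `simd_backend` are hypothetical names. The real `simdlib.h` would include a backend header in each branch rather than define a string.

```cpp
#include <string>

// Compile-time backend selection. Each branch stands in for including one
// of simdlib_avx2.h, simdlib_neon.h, or simdlib_emulated.h.
#if defined(__AVX2__) // assumption: compiler-defined macro when AVX2 is enabled
#define SKETCH_SIMD_BACKEND "avx2"
#elif defined(__aarch64__) // stated in the commit message
#define SKETCH_SIMD_BACKEND "neon"
#else // portable fallback
#define SKETCH_SIMD_BACKEND "emulated"
#endif

// Reports which backend this translation unit was compiled against.
inline std::string simd_backend() {
    return SKETCH_SIMD_BACKEND;
}
```

Because all three headers expose the same API, code written against `simdlib.h` does not need to know which branch was taken.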
Thank you for merging the PRs! We would be honored if you could include our names (@vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528) as contributors in
That's a good point, we don't have a dedicated place to thank Faiss contributors outside the core Faiss developers (other than https://github.com/facebookresearch/faiss/graphs/contributors).
From now on we will add CHANGELOG entries with the names of (external) contributors.
Thank you 😄
Summary
Dear @mdouze and all,
Currently, `IndexPQFastScan` and `IndexIVFPQFastScan` are optimized only for AVX2. Thus, the algorithms are not very fast in non-AVX2 environments.

Our team (@vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528) has accelerated the above two algorithms for the following two environments:
- Non-AVX2 environments: an optimized `simdlib_emulated.h`, enabling a 4x speedup on SIFT1M.
- aarch64: `simdlib_neon.h`, which is a NEON counterpart of `simdlib_avx2.h`.

If you are open to merging them, we plan to submit a PR for the above changes. What do you think? The changes are mainly in the above header files and do not significantly change the Faiss codebase.
Kind regards,
Platform
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-1038-aws aarch64)
Faiss version: fe7b061
Installed from: compiled by myself
Faiss compilation options:
cmake -DFAISS_ENABLE_GPU=OFF -DFAISS_OPT_LEVEL=aarch64 -DCMAKE_BUILD_TYPE=Release
Running on:
Interface:
Reproduction instructions
We only post the results. If you need more detailed information, please let me know.
- `original` is the current code
- `improved-emulated` is the result of optimizing `simdlib_emulated.h`, which is faster than `original`
- `neon` is the result with `simdlib_neon.h`, which is much faster than `original`

The above image illustrates the ratio of speedup.
- `neon` is approx. 60x faster than `original` (M=32, nbits=4, nprobe=16)
  - `original`: 4.5 ms
  - `neon`: 0.077 ms
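As a quick sanity check on the headline figure, the ratio of the two timings reported for M=32, nbits=4, nprobe=16 works out to:

$$\frac{4.5\ \text{ms}}{0.077\ \text{ms}} \approx 58.4$$

which is consistent with the "approx. 60x" quoted above.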