Summary
Dear @mdouze and all,
Currently, IndexPQFastScan and IndexIVFPQFastScan are optimized only for AVX2. As a result, the algorithms are not very fast in non-AVX2 environments.
Our team (@vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528) has accelerated the above two algorithms for the following two environments:
- For non-AVX2 environments (e.g., an old x86 computer without AVX2):
  - We optimized simdlib_emulated.h, enabling a 4x speedup on SIFT1M.
- For aarch64 (e.g., Raspberry Pi):
  - We implemented simdlib_neon.h, which is a NEON counterpart of simdlib_avx2.h.
  - This achieved 60x faster performance on SIFT1M.
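
To illustrate the kind of primitive such a header has to provide, here is a minimal sketch of a 16-entry byte table lookup, the operation at the core of 4-bit fast scan. It is not the code from our branch: the names Lanes16, lookup16, and lookup16_emulated are hypothetical and do not exist in faiss. On aarch64 the sketch assumes the NEON intrinsic vqtbl1q_u8 can serve as the table-lookup instruction; elsewhere it falls back to a scalar loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical illustration type: 16 byte lanes, i.e. one 128-bit register.
using Lanes16 = std::array<std::uint8_t, 16>;

// Scalar ("emulated") fallback: out[i] = table[codes[i] & 0x0F].
// Each code is a 4-bit PQ code selecting one of 16 table entries.
inline Lanes16 lookup16_emulated(const Lanes16& table, const Lanes16& codes) {
    Lanes16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = table[codes[i] & 0x0F];
    }
    return out;
}

// Same operation, using one NEON table-lookup instruction where available.
inline Lanes16 lookup16(const Lanes16& table, const Lanes16& codes) {
#if defined(__aarch64__)
    const uint8x16_t t = vld1q_u8(table.data());
    // Mask to the low 4 bits so both paths agree for any input byte.
    const uint8x16_t c = vandq_u8(vld1q_u8(codes.data()), vdupq_n_u8(0x0F));
    Lanes16 out{};
    vst1q_u8(out.data(), vqtbl1q_u8(t, c));
    return out;
#else
    return lookup16_emulated(table, codes);
#endif
}
```

Both functions return the same result for any input, so the scalar version can serve as a reference when testing a SIMD path.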
If you would be willing to merge these changes, we plan to submit a PR. What do you think? The changes are mainly in the above header files and do not significantly change the faiss codebase.
Kind regards,
Platform
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-1038-aws aarch64)
Faiss version: fe7b061
Installed from: compiled by myself
Faiss compilation options: cmake -DFAISS_ENABLE_GPU=OFF -DFAISS_OPT_LEVEL=aarch64 -DCMAKE_BUILD_TYPE=Release
Running on:
Interface:
Reproduction instructions
We post only the results here. If you need more detailed information, please let me know.

- Evaluated on an AWS EC2 ARM instance (c6g.4xlarge)
- original is the current code
- improved-emulated is the result of optimizing simdlib_emulated.h, which is faster than original
- neon is the result of using simdlib_neon.h, which is much faster than original

The image above shows the speedup ratio.
- In the best case, neon is approx. 60x faster than original (M=32, nbits=4, nprobe=16):
  - original: 4.5 ms
  - neon: 0.077 ms