Improve performance of IndexPQFastScan and IndexIVFPQFastScan on aarch64 and non-AVX2 devices #1812

Closed

matsui528 opened this issue Apr 8, 2021 · 6 comments
@matsui528

Summary

Dear @mdouze and all,

Currently, IndexPQFastScan and IndexIVFPQFastScan are optimized only for AVX2, so these algorithms are not particularly fast in non-AVX2 environments.

Our team (@vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528) has accelerated the above two algorithms for the following two environments:

  • for non-AVX2 environments (e.g., older x86 computers without AVX2)
    • We optimized simdlib_emulated.h, achieving a 4x speedup on SIFT1M.
  • for aarch64 (e.g., Raspberry Pi)
    • We implemented simdlib_neon.h, which is a NEON counterpart of simdlib_avx2.h.
    • This achieves roughly 60x faster performance on SIFT1M.
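
For reference, here is a minimal sketch of how IndexPQFastScan is typically built and queried from C++ (the dimensions and PQ parameters below are illustrative, not our exact benchmark setup):

```cpp
#include <faiss/IndexPQFastScan.h>

#include <random>
#include <vector>

int main() {
    int d = 128;       // vector dimension (e.g., SIFT descriptors)
    size_t M = 32;     // number of PQ sub-quantizers
    size_t nbits = 4;  // bits per sub-quantizer code (FastScan uses 4)
    size_t nb = 10000; // database size
    size_t nq = 10;    // number of queries

    // Generate random data just to make the sketch self-contained.
    std::mt19937 rng(123);
    std::uniform_real_distribution<float> dist(0.f, 1.f);
    std::vector<float> xb(nb * d), xq(nq * d);
    for (auto& v : xb) v = dist(rng);
    for (auto& v : xq) v = dist(rng);

    faiss::IndexPQFastScan index(d, M, nbits);
    index.train(nb, xb.data()); // train the PQ codebooks
    index.add(nb, xb.data());   // encode and store the database vectors

    size_t k = 5;
    std::vector<float> distances(nq * k);
    std::vector<faiss::Index::idx_t> labels(nq * k);
    index.search(nq, xq.data(), k, distances.data(), labels.data());
    return 0;
}
```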

If you are open to merging these changes, we plan to submit PRs for them. What do you think? The changes are mostly confined to the header files above and do not significantly change the Faiss codebase.

Kind regards,

Platform

OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-1038-aws aarch64)

Faiss version: fe7b061

Installed from: compiled by myself

Faiss compilation options: cmake -DFAISS_ENABLE_GPU=OFF -DFAISS_OPT_LEVEL=aarch64 -DCMAKE_BUILD_TYPE=Release

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

We only post the results. If you need more detailed information, please let me know.
[Figure 1: query time of each configuration]

  • Evaluated on an AWS EC2 ARM instance (c6g.4xlarge)
  • original is the current code
  • improved-emulated is the result after optimizing simdlib_emulated.h; it is faster than original
  • neon is the result with simdlib_neon.h; it is much faster than original

[Figure 2: ratio of speedup over original]

  • In the best case, neon is approx. 60x faster than original (M=32, nbits=4, nprobe=16)
    • original: 4.5 ms
    • neon: 0.077 ms
@mdouze
Contributor

mdouze commented Apr 12, 2021

That looks great! Thanks so much for working on this. Of course, we would welcome this as a PR.

@matsui528
Author

Glad to hear that! Could you review the following two PRs when you're available? Best,
#1814 #1815

facebook-github-bot pushed a commit that referenced this issue Apr 16, 2021
Summary:
related: #1812

This PR improves the performance of the code in `simdlib_emulated.h`.
`IndexPQFastScan` and `IndexIVFPQFastScan` become faster in non-AVX2 environments, e.g., 4x faster on SIFT1M.
This PR contains the following changes:

- Use a `template` parameter instead of `std::function` for the `unary_func` and `binary_func` arguments (see the sketch after this list)
    - `std::function` hinders optimizations such as function inlining
- Use `const T&` instead of `T` for vector-class arguments such as `simd16uint16`
    - The vector classes in `simdlib_emulated.h` store their data in an array member, so copying them is not cheap.
    - Passing by const lvalue reference avoids the copy.
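
The following is a simplified sketch of these two changes, using a hypothetical stand-in for an emulated vector class rather than the actual simdlib code:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an emulated vector class: the data lives in a
// plain array, so copying the object copies the whole array.
struct simd16uint16_like {
    uint16_t data[16];
};

// Before: the element-wise operation is a std::function and the vectors are
// taken by value, so each call copies both arguments and the callable is
// usually not inlined.
inline simd16uint16_like binary_func_before(
        simd16uint16_like a,
        simd16uint16_like b,
        std::function<uint16_t(uint16_t, uint16_t)> f) {
    simd16uint16_like c;
    for (int i = 0; i < 16; i++) {
        c.data[i] = f(a.data[i], b.data[i]);
    }
    return c;
}

// After: the callable is a template parameter (inlinable) and the vectors are
// passed by const lvalue reference, avoiding the copies.
template <typename F>
inline simd16uint16_like binary_func_after(
        const simd16uint16_like& a,
        const simd16uint16_like& b,
        F&& f) {
    simd16uint16_like c;
    for (int i = 0; i < 16; i++) {
        c.data[i] = f(a.data[i], b.data[i]);
    }
    return c;
}
```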

Pull Request resolved: #1814

Reviewed By: beauby

Differential Revision: D27760072

Pulled By: mdouze

fbshipit-source-id: cbc5a14658d1960b24ce55a395e71c80998742dc
facebook-github-bot pushed a commit that referenced this issue Apr 16, 2021
…4 devices (#1815)

Summary:
related: #1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are as follows:

- Add `simdlib_neon.h`
    - `simdlib_neon.h` provides a `simdlib`-compatible API implemented with Arm NEON intrinsics (see the sketch after this list).
    - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven`, `getodd`, `getlow128`, and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h`.
- Port `geteven`, `getodd`, `getlow128`, and `gethigh128` to non-AVX2 environments.
    - These functions were implemented with AVX2 intrinsics, which prevented `compute_PQ_dis_tables_dsub2` from being implemented for non-AVX2 environments.
    - Now `simdlib_avx2.h`, `simdlib_emulated.h`, and `simdlib_neon.h` all provide these functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
    - The change above makes `compute_PQ_dis_tables_dsub2` independent of the AVX2-only helpers.
    - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is slightly faster than the current implementation, so it is enabled.
        - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than the current implementation, so we have not enabled it in this PR.
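
As a hedged illustration of the general idea (not the literal contents of `simdlib_neon.h`, whose types and signatures may differ), a 256-bit simdlib-style vector can be held as two 128-bit NEON registers, with low/high accessors in the spirit of `getlow128` / `gethigh128` and element-wise operations built from NEON intrinsics:

```cpp
// Compiles only on aarch64.
#include <arm_neon.h>

// Hypothetical 256-bit vector backed by two 128-bit NEON registers.
struct simd16uint16_neon {
    uint16x8x2_t data; // two halves of 8 x uint16 each
};

// Hypothetical 128-bit vector.
struct simd8uint16_neon {
    uint16x8_t data;
};

// Lower / upper 128-bit halves, analogous to getlow128 / gethigh128.
inline simd8uint16_neon getlow128(const simd16uint16_neon& v) {
    return {v.data.val[0]};
}
inline simd8uint16_neon gethigh128(const simd16uint16_neon& v) {
    return {v.data.val[1]};
}

// Element-wise addition of two 256-bit vectors using NEON intrinsics.
inline simd16uint16_neon add(
        const simd16uint16_neon& a,
        const simd16uint16_neon& b) {
    simd16uint16_neon c;
    c.data.val[0] = vaddq_u16(a.data.val[0], b.data.val[0]);
    c.data.val[1] = vaddq_u16(a.data.val[1], b.data.val[1]);
    return c;
}
```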

Pull Request resolved: #1815

Reviewed By: beauby

Differential Revision: D27760259

Pulled By: mdouze

fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
@matsui528
Author

Thank you for merging the PRs! We would be honored if you could include our names (@vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528) as contributors in README.md, CHANGELOG.md, or somewhere else.

@mdouze
Contributor

mdouze commented Apr 21, 2021

That's a good point; we don't have a dedicated place to thank Faiss contributors outside the core Faiss developers (other than https://github.com/facebookresearch/faiss/graphs/contributors).
@beauby, where could we do that?

@mdouze
Contributor

mdouze commented May 16, 2021

From now on we will add CHANGELOG entries with the names of (external) contributors.

@matsui528
Author

Thank you 😄
