Improve IndexPQFastScan and IndexIVFPQFastScan performance for aarch64 devices #1815
Closed
vorj wants to merge 7 commits into facebookresearch:master
Contributor
Looks good! Unfortunately there is no continuous build for ARM yet.
Contributor
@mdouze has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot pushed a commit that referenced this pull request on May 20, 2021

Summary: related: #1815, #1880

The `vshl` / `vshr` intrinsics of Arm NEON require an immediate (compile-time constant) value as the shift parameter. However, GCC's implementations of those intrinsics also accept a run-time value. The current faiss implementation depends on this, so compilers that enforce the requirement, such as Clang, cannot build faiss for aarch64. This PR fixes the issue, so faiss with this PR applied can be built with Clang for aarch64 machines such as the M1 Mac.

Pull Request resolved: #1882
Reviewed By: beauby
Differential Revision: D28465563
Pulled By: mdouze
fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
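One common way to satisfy an immediate-only shift is to dispatch a run-time shift amount to a compile-time template parameter. The block below is a minimal sketch of that idea, not the code from #1882: `u16x8`, `shl_imm`, and `shl` are hypothetical names, and the NEON path is only compiled on aarch64 (a scalar loop stands in elsewhere).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical 8-lane vector type used only for this sketch.
using u16x8 = std::array<uint16_t, 8>;

// Shift every lane left by a compile-time constant N. On aarch64 this maps
// to vshlq_n_u16, whose shift operand must be an immediate.
template <int N>
u16x8 shl_imm(const u16x8& v) {
    static_assert(N >= 0 && N < 16, "shift must be in [0, 15]");
    u16x8 out{};
#if defined(__aarch64__) && defined(__ARM_NEON)
    vst1q_u16(out.data(), vshlq_n_u16(vld1q_u16(v.data()), N));
#else
    // Scalar stand-in for non-NEON targets.
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = static_cast<uint16_t>(v[i] << N);
    }
#endif
    return out;
}

// Run-time entry point: the switch turns the run-time value into one of
// sixteen compile-time instantiations, so no intrinsic ever sees a variable.
inline u16x8 shl(const u16x8& v, int shift) {
    switch (shift) {
        case 0: return shl_imm<0>(v);
        case 1: return shl_imm<1>(v);
        case 2: return shl_imm<2>(v);
        case 3: return shl_imm<3>(v);
        case 4: return shl_imm<4>(v);
        case 5: return shl_imm<5>(v);
        case 6: return shl_imm<6>(v);
        case 7: return shl_imm<7>(v);
        case 8: return shl_imm<8>(v);
        case 9: return shl_imm<9>(v);
        case 10: return shl_imm<10>(v);
        case 11: return shl_imm<11>(v);
        case 12: return shl_imm<12>(v);
        case 13: return shl_imm<13>(v);
        case 14: return shl_imm<14>(v);
        case 15: return shl_imm<15>(v);
        default: throw std::out_of_range("shift must be in [0, 15]");
    }
}
```

Calling `shl(v, 3)` with a run-time `3` selects `shl_imm<3>`, so the same source should build with both GCC and Clang. The cost is one branch per call and sixteen small instantiations.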
facebook-github-bot pushed a commit that referenced this pull request on Mar 23, 2023

Summary: related: #1916, #1979, #2009, #2013, #2195, #2210

After my PR #1815 was merged, `-DCMAKE_BUILD_TYPE=Debug` stopped working on aarch64, and many people have hit the problem (sorry for the inconvenience). This PR fixes it.

### Details:

Using function pointers to intrinsics in a run-time context causes link errors. `-DCMAKE_BUILD_TYPE=Release` has kept working because the compiler's optimizer can propagate the function pointers as constants and collapse them. When the pointers are not collapsed, however, the link errors occur, which is why `-DCMAKE_BUILD_TYPE=Debug` has been unavailable. To prevent the link errors, I've explicitly moved the function pointers to intrinsics from a run-time context to a compile-time context.

Pull Request resolved: #2768
Reviewed By: mdouze
Differential Revision: D44296147
Pulled By: alexanderguzhva
fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
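The difference between the two contexts can be sketched in portable C++. This is an illustration under assumptions, not the code from #2768: `add_one` is a hypothetical stand-in for an intrinsic (a real NEON intrinsic may have no out-of-line definition to link against), and `map_runtime` / `map_static` are made-up names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane vector type used only for this sketch.
using u16x8 = std::array<uint16_t, 8>;

// Stand-in for an intrinsic. Unlike this ordinary function, a real
// intrinsic is typically always-inlined and may not exist as a symbol.
inline uint16_t add_one(uint16_t x) {
    return static_cast<uint16_t>(x + 1);
}

// Run-time context: the callee arrives as a function-pointer argument.
// Unless the optimizer folds the pointer away, the address of the callee
// has to exist at link time.
inline u16x8 map_runtime(const u16x8& v, uint16_t (*f)(uint16_t)) {
    u16x8 out{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = f(v[i]);
    }
    return out;
}

// Compile-time context: the callee is a template parameter, so each
// instantiation names it directly and the call can be resolved without
// relying on optimization passes.
template <uint16_t (*F)(uint16_t)>
u16x8 map_static(const u16x8& v) {
    u16x8 out{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = F(v[i]);
    }
    return out;
}
```

With an ordinary function both forms link and give the same result, for example `map_static<add_one>(v) == map_runtime(v, add_one)`. The second form is the one that stays safe when the callee is an intrinsic and the build is unoptimized.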
mdouze added a commit that referenced this pull request on Mar 23, 2023

* fix windows test (#2775)

Summary: Pull Request resolved: #2775
Reviewed By: algoriddle
Differential Revision: D44210010
fbshipit-source-id: b9b620a4b0a874e09ee2f6082ff0f9463716fdf4

* faiss/utils/simdlib_avx2.h: avoid C++20 ambiguous overloaded operator (#2772)

Summary: Pull Request resolved: #2772

Resolves errors from overloaded ambiguous operators:

```
faiss/utils/partitioning.cpp:283:34: error: ISO C++20 considers use of overloaded operator '==' (with operand types 'faiss::simd16uint16' and 'faiss::simd16uint16') to be ambiguous despite there being a unique best viable function [-Werror,-Wambiguous-reversed-operator]
```

Reviewed By: alexanderguzhva, meyering
Differential Revision: D44186458
fbshipit-source-id: 0257fa0aaa4fe74c056bef751591f5f7e5357c9d

* Fix Debug Build on aarch64 (#2768)

Summary: related: #1916, #1979, #2009, #2013, #2195, #2210

After my PR #1815 was merged, `-DCMAKE_BUILD_TYPE=Debug` stopped working on aarch64, and many people have hit the problem (sorry for the inconvenience). This PR fixes it.

### Details:

Using function pointers to intrinsics in a run-time context causes link errors. `-DCMAKE_BUILD_TYPE=Release` has kept working because the compiler's optimizer can propagate the function pointers as constants and collapse them. When the pointers are not collapsed, however, the link errors occur, which is why `-DCMAKE_BUILD_TYPE=Debug` has been unavailable. To prevent the link errors, I've explicitly moved the function pointers to intrinsics from a run-time context to a compile-time context.

Pull Request resolved: #2768
Reviewed By: mdouze
Differential Revision: D44296147
Pulled By: alexanderguzhva
fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891

---------

Co-authored-by: Matthijs Douze <matthijs@meta.com>
Co-authored-by: Jeff Palm <palmje@meta.com>
Co-authored-by: Y.Imaizumi <40021161+vorj@users.noreply.github.com>
BZO95 added a commit to BZO95/faiss that referenced this pull request on Apr 10, 2025

…4 devices (#1815)

Summary: related: facebookresearch/faiss#1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.

The contents of this PR are below:

- Add `simdlib_neon.h`
  - `simdlib_neon.h` provides a `simdlib`-compatible API implemented with Arm NEON intrinsics.
  - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven`, `getodd`, `getlow128`, and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h`.
- Port `geteven`, `getodd`, `getlow128`, and `gethigh128` for non-AVX2 environments.
  - These functions were implemented with AVX2 intrinsics, which prevented implementing `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
  - Now `simdlib_avx2.h`, `simdlib_emulated.h`, and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
  - The change above means `compute_PQ_dis_tables_dsub2` no longer depends on AVX2-only implementations of `geteven` and the related functions.
  - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is a little faster than the current implementation, so it is enabled.
  - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than the current implementation, so we have not enabled it in this PR.

Pull Request resolved: facebookresearch/faiss#1815
Reviewed By: beauby
Differential Revision: D27760259
Pulled By: mdouze
fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
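The header selection mentioned in the summary above could be sketched as a preprocessor chain. The summary only confirms the `__aarch64__` check for `simdlib_neon.h`. The `__AVX2__` check, the branch order, and the `kSimdBackend` constant are assumptions made for this illustration, and each branch records a name instead of including a real header.

```cpp
#include <cassert>
#include <cstring>

// In a real simdlib.h each branch would #include the matching backend
// header (simdlib_avx2.h, simdlib_neon.h, simdlib_emulated.h). Here each
// branch only records which backend would be chosen.
#if defined(__AVX2__)          // assumed check for the AVX2 backend
constexpr const char* kSimdBackend = "avx2";
#elif defined(__aarch64__)     // check named in the PR summary
constexpr const char* kSimdBackend = "neon";
#else                          // portable fallback
constexpr const char* kSimdBackend = "emulated";
#endif
```

Because the choice is made by the preprocessor, callers can use a single `simdlib` API and the backend costs nothing at run time.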
related: #1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.

The contents of this PR are below:

- Add `simdlib_neon.h`
  - `simdlib_neon.h` provides a `simdlib`-compatible API implemented with Arm NEON intrinsics.
  - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven`, `getodd`, `getlow128`, and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h`.
- Port `geteven`, `getodd`, `getlow128`, and `gethigh128` for non-AVX2 environments.
  - These functions were implemented with AVX2 intrinsics, which prevented implementing `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
  - Now `simdlib_avx2.h`, `simdlib_emulated.h`, and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
  - The change above means `compute_PQ_dis_tables_dsub2` no longer depends on AVX2-only implementations of `geteven` and the related functions.
  - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is a little faster than the current implementation, so it is enabled.
  - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than the current implementation, so we have not enabled it in this PR.
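To give a feel for what porting the four helpers to non-AVX2 environments involves, here is a scalar sketch. It is not the code from this PR: the `f32x8` type is hypothetical, and the lane layout is an assumption that mirrors how AVX2 shuffles work within each 128-bit half (four floats) of a 256-bit register.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for an 8-lane float vector.
using f32x8 = std::array<float, 8>;

// Assumed layout: even lanes of a and b, taken per 128-bit half:
// a0 a2 b0 b2 | a4 a6 b4 b6
inline f32x8 geteven(const f32x8& a, const f32x8& b) {
    return f32x8{{a[0], a[2], b[0], b[2], a[4], a[6], b[4], b[6]}};
}

// Assumed layout: odd lanes of a and b, taken per 128-bit half:
// a1 a3 b1 b3 | a5 a7 b5 b7
inline f32x8 getodd(const f32x8& a, const f32x8& b) {
    return f32x8{{a[1], a[3], b[1], b[3], a[5], a[7], b[5], b[7]}};
}

// Assumed layout: low 128 bits of a followed by low 128 bits of b.
inline f32x8 getlow128(const f32x8& a, const f32x8& b) {
    return f32x8{{a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]}};
}

// Assumed layout: high 128 bits of a followed by high 128 bits of b.
inline f32x8 gethigh128(const f32x8& a, const f32x8& b) {
    return f32x8{{a[4], a[5], a[6], a[7], b[4], b[5], b[6], b[7]}};
}
```

A NEON version would express the same permutations with vector intrinsics. Element-by-element copying like this gives up the SIMD advantage, which fits the observation above that the emulated variant was slower and therefore not enabled.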