Skip to content

Improve IndexPQFastScan and IndexIVFPQFastScan performance for aarch64 devices#1815

Closed
vorj wants to merge 7 commits intofacebookresearch:masterfrom
matsui528:simdlib_neon
Closed

Improve IndexPQFastScan and IndexIVFPQFastScan performance for aarch64 devices#1815
vorj wants to merge 7 commits intofacebookresearch:masterfrom
matsui528:simdlib_neon

Conversation

@vorj
Copy link
Copy Markdown
Contributor

@vorj vorj commented Apr 12, 2021

related: #1812

This PR improves the performance of IndexPQFastScan and IndexIVFPQFastScan on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are below:

  • Add simdlib_neon.h
    • simdlib_neon.h has simdlib compatible API, and they are implemented with Arm NEON intrinsics.
    • simdlib.h includes simdlib_neon.h if __aarch64__ is defined.
  • Move geteven , getodd , getlow128 , and gethigh128 from distances_simd.cpp to simdlib_avx2.h .
  • Port geteven , getodd , getlow128 , and gethigh128 for non-AVX2 environments.
    • These codes are implemented with AVX2 intrinsics, so they have prevented to implement compute_PQ_dis_tables_dsub2 for non-AVX2 environments.
    • Now simdlib_avx2.h , simdlib_emulated.h , and simdlib_neon.h all have those functions.
  • Enable compute_PQ_dis_tables_dsub2 on aarch64
    • Above change makes compute_PQ_dis_tables_dsub2 independent from geteven and so on.
    • compute_PQ_dis_tables_dsub2 implemented with simdlib_neon.h is little faster than current implementation, so enabling that.
      • In contrast, compute_PQ_dis_tables_dsub2 implemented with simdlib_emulated.h is slower than current implementation, so we have not enabled it in our PR.

@mdouze
Copy link
Copy Markdown
Contributor

mdouze commented Apr 14, 2021

Looks good! Unfortunately there is no continuous build for ARM yet.
But it should be fine to integrate in Faiss.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mdouze has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mdouze merged this pull request in fe777d8.

facebook-github-bot pushed a commit that referenced this pull request May 20, 2021
Summary:
related: #1815,  #1880

`vshl` / `vshr` of ARM NEON requires immediate (compiletime constant) value as shift parameter.
However, the implementations of those intrinsics on GCC can receive runtime value.
Current faiss implementation depends on this, so some correct-behavioring compilers like Clang can't build faiss for aarch64.
This PR fix this issue; thus faiss applied this PR can be built with Clang for aarch64 machines like M1 Mac.

Pull Request resolved: #1882

Reviewed By: beauby

Differential Revision: D28465563

Pulled By: mdouze

fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
@vorj vorj mentioned this pull request Mar 17, 2023
facebook-github-bot pushed a commit that referenced this pull request Mar 23, 2023
Summary:
related: #1916, #1979, #2009, #2013, #2195, #2210

After my PR #1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: #2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
mdouze added a commit that referenced this pull request Mar 23, 2023
* fix windows test (#2775)

Summary: Pull Request resolved: #2775

Reviewed By: algoriddle

Differential Revision: D44210010

fbshipit-source-id: b9b620a4b0a874e09ee2f6082ff0f9463716fdf4

* faiss/utils/simdlib_avx2.h: avoid C++20 ambiguous overloaded operator (#2772)

Summary:
Pull Request resolved: #2772

Resolves errors from overloaded ambiguous operators:
```
faiss/utils/partitioning.cpp:283:34: error: ISO C++20 considers use of overloaded operator '==' (with operand types 'faiss::simd16uint16' and 'faiss::simd16uint16') to be ambiguous despite there being a unique best viable function [-Werror,-Wambiguous-reversed-operator]
```

Reviewed By: alexanderguzhva, meyering

Differential Revision: D44186458

fbshipit-source-id: 0257fa0aaa4fe74c056bef751591f5f7e5357c9d

* Fix Debug Build on aarch64 (#2768)

Summary:
related: #1916, #1979, #2009, #2013, #2195, #2210

After my PR #1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: #2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891

---------

Co-authored-by: Matthijs Douze <matthijs@meta.com>
Co-authored-by: Jeff Palm <palmje@meta.com>
Co-authored-by: Y.Imaizumi <40021161+vorj@users.noreply.github.com>
BZO95 added a commit to BZO95/faiss that referenced this pull request Apr 10, 2025
…4 devices (#1815)

Summary:
related: facebookresearch/faiss#1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are below:

- Add `simdlib_neon.h`
    - `simdlib_neon.h` has `simdlib` compatible API, and they are implemented with Arm NEON intrinsics.
    - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven` , `getodd` , `getlow128` , and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h` .
- Port `geteven` , `getodd` , `getlow128` , and `gethigh128` for non-AVX2 environments.
    - These codes are implemented with AVX2 intrinsics, so they have prevented to implement `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
    - Now `simdlib_avx2.h` , `simdlib_emulated.h` , and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
    - Above change makes `compute_PQ_dis_tables_dsub2` independent from `geteven` and so on.
    - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is little faster than current implementation, so enabling that.
        - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than current implementation, so we have not enabled it in our PR.

Pull Request resolved: facebookresearch/faiss#1815

Reviewed By: beauby

Differential Revision: D27760259

Pulled By: mdouze

fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
BZO95 added a commit to BZO95/faiss that referenced this pull request Apr 10, 2025
Summary:
related: facebookresearch/faiss#1815,  facebookresearch/faiss#1880

`vshl` / `vshr` of ARM NEON requires immediate (compiletime constant) value as shift parameter.
However, the implementations of those intrinsics on GCC can receive runtime value.
Current faiss implementation depends on this, so some correct-behavioring compilers like Clang can't build faiss for aarch64.
This PR fix this issue; thus faiss applied this PR can be built with Clang for aarch64 machines like M1 Mac.

Pull Request resolved: facebookresearch/faiss#1882

Reviewed By: beauby

Differential Revision: D28465563

Pulled By: mdouze

fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
BZO95 added a commit to BZO95/faiss that referenced this pull request Apr 10, 2025
Summary:
related: facebookresearch/faiss#1916, facebookresearch/faiss#1979, facebookresearch/faiss#2009, facebookresearch/faiss#2013, facebookresearch/faiss#2195, facebookresearch/faiss#2210

After my PR facebookresearch/faiss#1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: facebookresearch/faiss#2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
samanthawaters8882michaeldonovan added a commit to samanthawaters8882michaeldonovan/faiss that referenced this pull request Oct 12, 2025
…4 devices (#1815)

Summary:
related: facebookresearch/faiss#1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are below:

- Add `simdlib_neon.h`
    - `simdlib_neon.h` has `simdlib` compatible API, and they are implemented with Arm NEON intrinsics.
    - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven` , `getodd` , `getlow128` , and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h` .
- Port `geteven` , `getodd` , `getlow128` , and `gethigh128` for non-AVX2 environments.
    - These codes are implemented with AVX2 intrinsics, so they have prevented to implement `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
    - Now `simdlib_avx2.h` , `simdlib_emulated.h` , and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
    - Above change makes `compute_PQ_dis_tables_dsub2` independent from `geteven` and so on.
    - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is little faster than current implementation, so enabling that.
        - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than current implementation, so we have not enabled it in our PR.

Pull Request resolved: facebookresearch/faiss#1815

Reviewed By: beauby

Differential Revision: D27760259

Pulled By: mdouze

fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
samanthawaters8882michaeldonovan added a commit to samanthawaters8882michaeldonovan/faiss that referenced this pull request Oct 12, 2025
Summary:
related: facebookresearch/faiss#1815,  facebookresearch/faiss#1880

`vshl` / `vshr` of ARM NEON requires immediate (compiletime constant) value as shift parameter.
However, the implementations of those intrinsics on GCC can receive runtime value.
Current faiss implementation depends on this, so some correct-behavioring compilers like Clang can't build faiss for aarch64.
This PR fix this issue; thus faiss applied this PR can be built with Clang for aarch64 machines like M1 Mac.

Pull Request resolved: facebookresearch/faiss#1882

Reviewed By: beauby

Differential Revision: D28465563

Pulled By: mdouze

fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
samanthawaters8882michaeldonovan added a commit to samanthawaters8882michaeldonovan/faiss that referenced this pull request Oct 12, 2025
Summary:
related: facebookresearch/faiss#1916, facebookresearch/faiss#1979, facebookresearch/faiss#2009, facebookresearch/faiss#2013, facebookresearch/faiss#2195, facebookresearch/faiss#2210

After my PR facebookresearch/faiss#1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: facebookresearch/faiss#2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 8, 2026
…4 devices (facebookresearch#1815)

Summary:
related: facebookresearch#1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are below:

- Add `simdlib_neon.h`
    - `simdlib_neon.h` has `simdlib` compatible API, and they are implemented with Arm NEON intrinsics.
    - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven` , `getodd` , `getlow128` , and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h` .
- Port `geteven` , `getodd` , `getlow128` , and `gethigh128` for non-AVX2 environments.
    - These codes are implemented with AVX2 intrinsics, so they have prevented to implement `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
    - Now `simdlib_avx2.h` , `simdlib_emulated.h` , and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
    - Above change makes `compute_PQ_dis_tables_dsub2` independent from `geteven` and so on.
    - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is little faster than current implementation, so enabling that.
        - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than current implementation, so we have not enabled it in our PR.

Pull Request resolved: facebookresearch#1815

Reviewed By: beauby

Differential Revision: D27760259

Pulled By: mdouze

fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 8, 2026
Summary:
related: facebookresearch#1815,  facebookresearch#1880

`vshl` / `vshr` of ARM NEON requires immediate (compiletime constant) value as shift parameter.
However, the implementations of those intrinsics on GCC can receive runtime value.
Current faiss implementation depends on this, so some correct-behavioring compilers like Clang can't build faiss for aarch64.
This PR fix this issue; thus faiss applied this PR can be built with Clang for aarch64 machines like M1 Mac.

Pull Request resolved: facebookresearch#1882

Reviewed By: beauby

Differential Revision: D28465563

Pulled By: mdouze

fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 8, 2026
Summary:
related: facebookresearch#1916, facebookresearch#1979, facebookresearch#2009, facebookresearch#2013, facebookresearch#2195, facebookresearch#2210

After my PR facebookresearch#1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: facebookresearch#2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 16, 2026
…4 devices (facebookresearch#1815)

Summary:
related: facebookresearch#1812

This PR improves the performance of `IndexPQFastScan` and `IndexIVFPQFastScan` on aarch64 devices, e.g., 60x faster on an AWS Arm instance with the SIFT1M dataset.
The contents of this PR are below:

- Add `simdlib_neon.h`
    - `simdlib_neon.h` has `simdlib` compatible API, and they are implemented with Arm NEON intrinsics.
    - `simdlib.h` includes `simdlib_neon.h` if `__aarch64__` is defined.
- Move `geteven` , `getodd` , `getlow128` , and `gethigh128` from `distances_simd.cpp` to `simdlib_avx2.h` .
- Port `geteven` , `getodd` , `getlow128` , and `gethigh128` for non-AVX2 environments.
    - These codes are implemented with AVX2 intrinsics, so they have prevented to implement `compute_PQ_dis_tables_dsub2` for non-AVX2 environments.
    - Now `simdlib_avx2.h` , `simdlib_emulated.h` , and `simdlib_neon.h` all have those functions.
- Enable `compute_PQ_dis_tables_dsub2` on aarch64
    - Above change makes `compute_PQ_dis_tables_dsub2` independent from `geteven` and so on.
    - `compute_PQ_dis_tables_dsub2` implemented with `simdlib_neon.h` is little faster than current implementation, so enabling that.
        - In contrast, `compute_PQ_dis_tables_dsub2` implemented with `simdlib_emulated.h` is slower than current implementation, so we have not enabled it in our PR.

Pull Request resolved: facebookresearch#1815

Reviewed By: beauby

Differential Revision: D27760259

Pulled By: mdouze

fbshipit-source-id: 5df6168ac35ae0174bedf04508dbaf19f11fab3f
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 16, 2026
Summary:
related: facebookresearch#1815,  facebookresearch#1880

`vshl` / `vshr` of ARM NEON requires immediate (compiletime constant) value as shift parameter.
However, the implementations of those intrinsics on GCC can receive runtime value.
Current faiss implementation depends on this, so some correct-behavioring compilers like Clang can't build faiss for aarch64.
This PR fix this issue; thus faiss applied this PR can be built with Clang for aarch64 machines like M1 Mac.

Pull Request resolved: facebookresearch#1882

Reviewed By: beauby

Differential Revision: D28465563

Pulled By: mdouze

fbshipit-source-id: e431dfb3b27c9728072f50b4bf9445a3f4a5ac43
dimitraseferiadi pushed a commit to dimitraseferiadi/SuCo that referenced this pull request Mar 16, 2026
Summary:
related: facebookresearch#1916, facebookresearch#1979, facebookresearch#2009, facebookresearch#2013, facebookresearch#2195, facebookresearch#2210

After my PR facebookresearch#1815 had been merged `-DCMAKE_BUILD_TYPE=Debug` has been invalid on aarch64, and many people have been hit the problem. (sorry to inconvenience...)
This PR fixes this.

### Details:

Using the function pointers of intrinsics on run-time context causes the link errors.
`-DCMAKE_BUILD_TYPE=Release` has been available because compiler optimizer can propagate and collapse the function pointers as constant.
However, when the pointers doesn't collapsed the link errors occurred, so `-DCMAKE_BUILD_TYPE=Debug` has been unavailable.
To prevent the link errors, I've replaced the function pointers of intrinsics on run-time context to on compile-time context explicitly.

Pull Request resolved: facebookresearch#2768

Reviewed By: mdouze

Differential Revision: D44296147

Pulled By: alexanderguzhva

fbshipit-source-id: 81fa013c5e05a486b6b82cb85d76eeefdefca891
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants