HNSW: add prune_headroom to avoid O(n^2) pruning/locking, headroom test (#4847)
Closed
ddrcoder wants to merge 1 commit into
Contributor
@ddrcoder are there any possible side effects, such as a different search time? It looks quite suspicious that none were mentioned for this optimization.
ddrcoder added a commit to ddrcoder/faiss that referenced this pull request on Mar 12, 2026
…st (facebookresearch#4847)

Summary:
Particularly when building large graphs (10M+), insertion performance often plummets in the presence of highly contended nodes. More time is spent doing repeated `O(d*M^2)` pruning, and more time is spent waiting for locks:

```name=Before
[6.2%] +624,960 vecs / +10.1s = +62,165 vecs/s
[12.5%] +625,152 vecs / +8.3s = +75,316 vecs/s
[18.8%] +625,152 vecs / +8.0s = +78,244 vecs/s
[25.0%] +625,152 vecs / +8.6s = +72,813 vecs/s
[31.3%] +625,024 vecs / +8.6s = +72,796 vecs/s
[37.5%] +624,960 vecs / +9.2s = +68,038 vecs/s
[43.8%] +624,960 vecs / +9.9s = +63,074 vecs/s
[50.0%] +624,960 vecs / +9.3s = +67,249 vecs/s
[56.3%] +624,960 vecs / +9.6s = +64,865 vecs/s
[62.5%] +624,960 vecs / +9.8s = +63,622 vecs/s
[68.8%] +624,960 vecs / +10.1s = +62,016 vecs/s
[75.0%] +624,960 vecs / +10.3s = +60,543 vecs/s
[81.3%] +624,960 vecs / +23.3s = +26,858 vecs/s
[87.5%] +624,960 vecs / +12.0s = +51,891 vecs/s
[93.8%] +624,960 vecs / +11.4s = +55,053 vecs/s
[100.%] +624,960 vecs / +11.9s = +52,318 vecs/s
[done] 10,000,000 vecs / 177.9s = 56,220 vecs/s average
```

By providing a bit more headroom, much useful time is spent actually constructing the graph, leading to

```name=After
Building IDMap,HNSW64 with 192 threads...
[ 6.2%] +624,960 vecs / +4.0s = +154,359 vecs/s
[12.5%] +625,152 vecs / +3.9s = +159,162 vecs/s
[18.8%] +625,152 vecs / +4.2s = +150,183 vecs/s
[25.0%] +625,152 vecs / +4.8s = +130,363 vecs/s
[31.3%] +625,024 vecs / +5.3s = +118,817 vecs/s
[37.5%] +624,960 vecs / +4.7s = +133,585 vecs/s
[43.8%] +624,960 vecs / +5.5s = +113,630 vecs/s
[50.0%] +624,960 vecs / +4.8s = +131,124 vecs/s
[56.3%] +624,960 vecs / +4.8s = +129,847 vecs/s
[62.5%] +624,960 vecs / +5.0s = +124,726 vecs/s
[68.8%] +624,960 vecs / +5.3s = +117,075 vecs/s
[75.0%] +624,960 vecs / +5.3s = +118,652 vecs/s
[81.3%] +624,960 vecs / +8.1s = +77,497 vecs/s
[87.5%] +624,960 vecs / +6.3s = +99,823 vecs/s
[93.8%] +624,960 vecs / +6.0s = +103,604 vecs/s
[100.%] +624,960 vecs / +5.9s = +106,680 vecs/s
[done] 10,000,000 vecs / 89.9s = 111,278 vecs/s average
```

The increased pruning has minimal effect on recall, particularly with the default value of `prune_headroom = 0.2`.

Reviewed By: mdouze

Differential Revision: D94442964
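To make the headroom idea concrete, here is a toy model of a single highly contended neighbor list. It is a sketch under an assumption that the commit message does not spell out: that a prune pass trims a full list down to roughly `capacity * (1 - headroom)` links, so the next few back-links can be appended without pruning again. The function `count_prunes` and its parameters are hypothetical names for illustration, not Faiss APIs.

```python
import math


def count_prunes(num_inserts: int, capacity: int, headroom: float) -> int:
    """Count expensive prune passes on one contended neighbor list (toy model)."""
    # Assumed rule (not taken from the Faiss source): a prune keeps only
    # floor(capacity * (1 - headroom)) links, leaving free slots behind.
    keep = max(1, math.floor(capacity * (1.0 - headroom)))
    size = 0
    prunes = 0
    for _ in range(num_inserts):
        if size < capacity:
            # A free slot exists: append the new link, no pruning needed.
            size += 1
        else:
            # The list is full: run one prune pass over the candidates.
            prunes += 1
            size = keep
    return prunes


if __name__ == "__main__":
    for h in (0.0, 0.2):
        print(f"headroom={h}: {count_prunes(1000, 64, h)} prune passes")
```

In this toy model, with a capacity of 64 and 1,000 incoming links, a headroom of 0 prunes on every insert once the list is full (936 passes), while a headroom of 0.2 prunes only once every 14 inserts (67 passes). The real build also involves lock waits and distance computations that this sketch ignores.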
cc91eaf to 651d20d
651d20d to 6586bbc
Contributor
Author
@alexanderguzhva in my tests, I've found that the difference in recall is on par with the random variation from run to run, with headroom=0.2 edging out the baseline if anything.
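A comparison like the one described in this reply can be sketched as follows. This is not the author's test harness: `recall_at_k` and `within_run_noise` are hypothetical helper names, and the two-sigma threshold is an arbitrary assumption. The idea is to repeat the build several times per setting, compute recall for each run, and check whether the difference in mean recall is smaller than the spread between baseline runs.

```python
from statistics import mean, stdev


def recall_at_k(found_ids, true_ids):
    """Fraction of true neighbors recovered, pooled over all queries."""
    hits = 0
    total = 0
    for found, truth in zip(found_ids, true_ids):
        hits += len(set(found) & set(truth))
        total += len(truth)
    return hits / total


def within_run_noise(baseline_recalls, candidate_recalls, num_sigmas=2.0):
    """True if the mean recall difference is within the baseline's run-to-run spread.

    Needs at least two baseline runs, since stdev is undefined for one sample.
    """
    spread = stdev(baseline_recalls)
    diff = mean(candidate_recalls) - mean(baseline_recalls)
    return abs(diff) <= num_sigmas * spread
```

Here `baseline_recalls` would hold one recall value per build without headroom and `candidate_recalls` one per build with `prune_headroom = 0.2`; the search parameters should be held fixed across both so that only the graph construction differs.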
6586bbc to 5a2513e
5a2513e to b89e7a8
ddrcoder added a commit to ddrcoder/faiss that referenced this pull request on Mar 18, 2026
b89e7a8 to 688a019
688a019 to 65502ce
Contributor
This pull request has been merged in 748c031.
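Summary:
Particularly when building large graphs (10M+), insertion performance often plummets in the presence of highly contended nodes. More time is spent doing repeated `O(d*M^2)` pruning, and more time is spent waiting for locks. In the "Before" log quoted in the commit message above, a 10,000,000-vector build took 177.9s, an average of 56,220 vecs/s.

By providing a bit more headroom, much more of the build time is spent actually constructing the graph. In the "After" log (IDMap,HNSW64 with 192 threads), the 10,000,000-vector build took 89.9s, an average of 111,278 vecs/s.

The increased pruning has minimal effect on recall, particularly with the default value of `prune_headroom = 0.2`.

Reviewed By: mdouze
Differential Revision: D94442964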
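As a quick check on the size of the improvement, the totals and the slowest chunk can be compared directly. The figures below are copied from the "Before" and "After" logs in the commit message; the variable names are only for illustration.

```python
# Totals from the "[done]" lines of the two logs.
before_total_s = 177.9
after_total_s = 89.9

# Slowest progress chunk in each log (the 81.3% step in both runs).
before_worst_chunk_s = 23.3
after_worst_chunk_s = 8.1

overall_speedup = before_total_s / after_total_s
worst_chunk_speedup = before_worst_chunk_s / after_worst_chunk_s

print(f"overall build speedup: {overall_speedup:.2f}x")
print(f"slowest-chunk speedup: {worst_chunk_speedup:.2f}x")
```

The overall build is roughly 1.98x faster, and the slowest chunk improves by roughly 2.9x, which suggests the stall near the 81.3% mark is reduced but not eliminated.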