HNSW: add prune_headroom to avoid O(n^2) pruning/locking, headroom test (#4847) by ddrcoder · Pull Request #4847 · facebookresearch/faiss

ddrcoder · 2026-02-27T01:14:50Z

Summary:

Particularly when building large graphs (10M+), insertion performance often plummets in the presence of highly contended nodes. More time is spent doing repeated O(d*M^2) pruning, and more time is spent waiting for locks:

 [6.2%] +624,960 vecs / +10.1s = +62,165 vecs/s
[12.5%] +625,152 vecs /  +8.3s = +75,316 vecs/s
[18.8%] +625,152 vecs /  +8.0s = +78,244 vecs/s
[25.0%] +625,152 vecs /  +8.6s = +72,813 vecs/s
[31.3%] +625,024 vecs /  +8.6s = +72,796 vecs/s
[37.5%] +624,960 vecs /  +9.2s = +68,038 vecs/s
[43.8%] +624,960 vecs /  +9.9s = +63,074 vecs/s
[50.0%] +624,960 vecs /  +9.3s = +67,249 vecs/s
[56.3%] +624,960 vecs /  +9.6s = +64,865 vecs/s
[62.5%] +624,960 vecs /  +9.8s = +63,622 vecs/s
[68.8%] +624,960 vecs / +10.1s = +62,016 vecs/s
[75.0%] +624,960 vecs / +10.3s = +60,543 vecs/s
[81.3%] +624,960 vecs / +23.3s = +26,858 vecs/s
[87.5%] +624,960 vecs / +12.0s = +51,891 vecs/s
[93.8%] +624,960 vecs / +11.4s = +55,053 vecs/s
[100.%] +624,960 vecs / +11.9s = +52,318 vecs/s
[done] 10,000,000 vecs / 177.9s = 56,220 vecs/s average

With a bit more headroom, far more of the time is spent actually constructing the graph, leading to:

Building IDMap,HNSW64 with 192 threads...
[ 6.2%] +624,960 vecs / +4.0s = +154,359 vecs/s
[12.5%] +625,152 vecs / +3.9s = +159,162 vecs/s
[18.8%] +625,152 vecs / +4.2s = +150,183 vecs/s
[25.0%] +625,152 vecs / +4.8s = +130,363 vecs/s
[31.3%] +625,024 vecs / +5.3s = +118,817 vecs/s
[37.5%] +624,960 vecs / +4.7s = +133,585 vecs/s
[43.8%] +624,960 vecs / +5.5s = +113,630 vecs/s
[50.0%] +624,960 vecs / +4.8s = +131,124 vecs/s
[56.3%] +624,960 vecs / +4.8s = +129,847 vecs/s
[62.5%] +624,960 vecs / +5.0s = +124,726 vecs/s
[68.8%] +624,960 vecs / +5.3s = +117,075 vecs/s
[75.0%] +624,960 vecs / +5.3s = +118,652 vecs/s
[81.3%] +624,960 vecs / +8.1s =  +77,497 vecs/s
[87.5%] +624,960 vecs / +6.3s =  +99,823 vecs/s
[93.8%] +624,960 vecs / +6.0s = +103,604 vecs/s
[100.%] +624,960 vecs / +5.9s = +106,680 vecs/s
[done] 10,000,000 vecs / 89.9s = 111,278 vecs/s average
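For a quick read of the two summary lines, the arithmetic can be spelled out as below. This is just a sketch that parses the two `[done]` lines as quoted above; `parse_done` is a helper name invented for this example, not something from the benchmark tooling.

```python
import re

# The two summary lines, copied verbatim from the logs above.
before = "[done] 10,000,000 vecs / 177.9s = 56,220 vecs/s average"
after = "[done] 10,000,000 vecs / 89.9s = 111,278 vecs/s average"


def parse_done(line):
    """Return (vectors, seconds, vectors_per_second) from a [done] line."""
    m = re.search(r"([\d,]+) vecs / ([\d.]+)s = ([\d,]+) vecs/s", line)
    vecs, secs, rate = m.groups()
    return int(vecs.replace(",", "")), float(secs), int(rate.replace(",", ""))


_, t_before, r_before = parse_done(before)
_, t_after, r_after = parse_done(after)

speedup_time = t_before / t_after   # ratio of wall-clock build times
speedup_rate = r_after / r_before   # ratio of reported average throughput

print(f"wall-clock speedup: {speedup_time:.2f}x")
print(f"throughput ratio:   {speedup_rate:.2f}x")
```

Both ratios come out at roughly 2x for this 10M-vector, 192-thread build.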

The increased pruning has minimal effect on recall, particularly with the default value of prune_headroom = 0.2.
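To illustrate why headroom helps, here is a toy model of a single contended node. It rests on an assumption that is not confirmed by the text above: that when a full neighbor list (M entries) overflows, it is pruned down to roughly `M * (1 - prune_headroom)` entries instead of back to M, so the next few link requests are cheap appends. The pruning rule here (keep the closest entries) is a stand-in for the real HNSW neighbor-selection heuristic, and `count_prunes` is a made-up name, not a Faiss function.

```python
import random


def count_prunes(num_links: int, M: int, headroom: float, seed: int = 0) -> int:
    """Count expensive prune passes on one hot node receiving num_links links."""
    rng = random.Random(seed)
    # Assumed prune target: leave headroom * M free slots after each prune.
    keep = max(1, int(M * (1.0 - headroom)))
    neighbors = []  # we only track each neighbor's distance to the hot node
    prunes = 0
    for _ in range(num_links):
        dist = rng.random()
        if len(neighbors) < M:
            neighbors.append(dist)  # cheap path: a slot is free
            continue
        # Expensive path: the list is full, so re-select the neighbors.
        # Stand-in for the diversity heuristic: keep the `keep` closest.
        neighbors.append(dist)
        neighbors.sort()
        del neighbors[keep:]
        prunes += 1
    return prunes


if __name__ == "__main__":
    for h in (0.0, 0.2):
        print(f"headroom={h:.1f}: {count_prunes(10_000, 64, h)} prune passes")
```

In this model, with no headroom every link request after the list fills triggers a prune, while with headroom 0.2 and M=64 only about one request in fourteen does. Each prune would also be done under the node's lock in a concurrent build, which is where the reduced contention would come from under this reading.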

Reviewed By: mdouze

Differential Revision: D94442964

meta-codesync · 2026-02-27T01:14:58Z

@ddrcoder has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94442964.

alexanderguzhva · 2026-02-27T16:36:40Z

@ddrcoder are there any possible side effects, such as a different search time? It looks quite suspicious that none were mentioned for this optimization.

ddrcoder · 2026-03-12T18:17:12Z

@alexanderguzhva in my tests, I've found that the difference in recall is on par with the random variation from run to run, with headroom=0.2 edging out the baseline if anything.

> buck2 run @mode/opt //faiss/benchs:bench_hnsw_prune_headroom
...
Generating synthetic dataset: d=128, nb=50000, nq=10000
Computing ground truth for k=10
HNSW32(prune_headroom=0.00): build_time=0.24s, ndis_search=12636.1
HNSW32(prune_headroom=0.00): build_time=0.19s, ndis_search=12630.4
HNSW32(prune_headroom=0.00): build_time=0.33s, ndis_search=12631.3
HNSW32(prune_headroom=0.04): build_time=0.11s, ndis_search=12620.3
HNSW32(prune_headroom=0.04): build_time=0.20s, ndis_search=12609.8
HNSW32(prune_headroom=0.04): build_time=0.14s, ndis_search=12622.2
HNSW32(prune_headroom=0.08): build_time=0.25s, ndis_search=12598.5
HNSW32(prune_headroom=0.08): build_time=0.28s, ndis_search=12600.5
HNSW32(prune_headroom=0.08): build_time=0.20s, ndis_search=12588.5
HNSW32(prune_headroom=0.12): build_time=0.61s, ndis_search=12595.7
HNSW32(prune_headroom=0.12): build_time=0.21s, ndis_search=12582.4
HNSW32(prune_headroom=0.12): build_time=0.18s, ndis_search=12586.9
HNSW32(prune_headroom=0.16): build_time=0.17s, ndis_search=12570.2
HNSW32(prune_headroom=0.16): build_time=0.17s, ndis_search=12557.2
HNSW32(prune_headroom=0.16): build_time=0.18s, ndis_search=12586.5
HNSW32(prune_headroom=0.20): build_time=0.09s, ndis_search=12554.4
HNSW32(prune_headroom=0.20): build_time=0.21s, ndis_search=12553.0
HNSW32(prune_headroom=0.20): build_time=0.18s, ndis_search=12553.9

k= 1   | ef= 16 | ef= 32 | ef= 64 | ef=128 | ef=256
---------------------------------------------------
h=0.00 | 0.9003 | 0.9231 | 0.9354 | 0.9395 | 0.9413
h=0.00 | 0.9016 | 0.9254 | 0.9356 | 0.9394 | 0.9413
h=0.00 | 0.9006 | 0.9242 | 0.9347 | 0.9394 | 0.9416
h=0.04 | 0.9005 | 0.9252 | 0.9358 | 0.9400 | 0.9414
h=0.04 | 0.9035 | 0.9268 | 0.9353 | 0.9402 | 0.9413
h=0.04 | 0.9029 | 0.9259 | 0.9346 | 0.9392 | 0.9413
h=0.08 | 0.9044 | 0.9270 | 0.9352 | 0.9396 | 0.9414
h=0.08 | 0.9050 | 0.9271 | 0.9359 | 0.9398 | 0.9414
h=0.08 | 0.9000 | 0.9247 | 0.9349 | 0.9400 | 0.9415
h=0.12 | 0.9007 | 0.9263 | 0.9356 | 0.9397 | 0.9414
h=0.12 | 0.9025 | 0.9257 | 0.9354 | 0.9400 | 0.9415
h=0.12 | 0.9035 | 0.9275 | 0.9364 | 0.9401 | 0.9416
h=0.16 | 0.9021 | 0.9259 | 0.9359 | 0.9399 | 0.9415
h=0.16 | 0.9019 | 0.9250 | 0.9354 | 0.9402 | 0.9415
h=0.16 | 0.9021 | 0.9254 | 0.9354 | 0.9399 | 0.9415
h=0.20 | 0.8989 | 0.9248 | 0.9350 | 0.9400 | 0.9414
h=0.20 | 0.9034 | 0.9252 | 0.9359 | 0.9399 | 0.9416
h=0.20 | 0.9019 | 0.9267 | 0.9356 | 0.9399 | 0.9416

k=10   | ef= 16 | ef= 32 | ef= 64 | ef=128 | ef=256
---------------------------------------------------
h=0.00 | 0.8709 | 0.9102 | 0.9292 | 0.9370 | 0.9401
h=0.00 | 0.8708 | 0.9107 | 0.9289 | 0.9373 | 0.9401
h=0.00 | 0.8705 | 0.9101 | 0.9288 | 0.9374 | 0.9400
h=0.04 | 0.8705 | 0.9094 | 0.9283 | 0.9372 | 0.9401
h=0.04 | 0.8710 | 0.9103 | 0.9287 | 0.9371 | 0.9400
h=0.04 | 0.8708 | 0.9096 | 0.9284 | 0.9371 | 0.9401
h=0.08 | 0.8716 | 0.9101 | 0.9285 | 0.9371 | 0.9401
h=0.08 | 0.8725 | 0.9104 | 0.9288 | 0.9371 | 0.9403
h=0.08 | 0.8711 | 0.9093 | 0.9290 | 0.9372 | 0.9400
h=0.12 | 0.8701 | 0.9098 | 0.9286 | 0.9372 | 0.9400
h=0.12 | 0.8711 | 0.9099 | 0.9292 | 0.9372 | 0.9402
h=0.12 | 0.8707 | 0.9099 | 0.9291 | 0.9373 | 0.9400
h=0.16 | 0.8705 | 0.9097 | 0.9286 | 0.9370 | 0.9401
h=0.16 | 0.8714 | 0.9095 | 0.9289 | 0.9372 | 0.9401
h=0.16 | 0.8697 | 0.9091 | 0.9285 | 0.9373 | 0.9401
h=0.20 | 0.8693 | 0.9099 | 0.9288 | 0.9371 | 0.9402
h=0.20 | 0.8711 | 0.9098 | 0.9289 | 0.9373 | 0.9400
h=0.20 | 0.8711 | 0.9104 | 0.9289 | 0.9372 | 0.9401
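The "on par with run-to-run variation" reading can be checked with a few lines of Python. This is only a sketch: the numbers are copied by hand from the output above (the `ndis_search` log lines and the k=1, ef=16 column), only h=0.00 and h=0.20 are compared, and `summarize` is a helper name made up here.

```python
from statistics import mean

# ndis_search per run, copied from the benchmark log above.
ndis = {
    0.00: [12636.1, 12630.4, 12631.3],
    0.20: [12554.4, 12553.0, 12553.9],
}

# Recall at k=1, ef=16 per run, copied from the first table above.
recall_k1_ef16 = {
    0.00: [0.9003, 0.9016, 0.9006],
    0.20: [0.8989, 0.9034, 0.9019],
}


def summarize(runs):
    """Return (mean, spread), where spread is max - min across the runs."""
    return mean(runs), max(runs) - min(runs)


base_recall, base_spread = summarize(recall_k1_ef16[0.00])
head_recall, head_spread = summarize(recall_k1_ef16[0.20])
recall_delta = head_recall - base_recall

# Relative change in distance computations per search (negative = fewer).
ndis_change = mean(ndis[0.20]) / mean(ndis[0.00]) - 1.0

print(f"recall delta (h=0.20 - h=0.00): {recall_delta:+.4f}")
print(f"run-to-run spread: {base_spread:.4f} (h=0.00), {head_spread:.4f} (h=0.20)")
print(f"ndis_search change: {ndis_change:+.2%}")
```

On these numbers the recall difference is smaller than the spread between repeated runs, and `ndis_search` is slightly lower with headroom, which speaks to the search-time question as far as distance computations go.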

meta-codesync · 2026-03-18T20:42:52Z

This pull request has been merged in 748c031.

meta-cla Bot added the CLA Signed label Feb 27, 2026

meta-codesync Bot added fb-exported meta-exported labels Feb 27, 2026

ddrcoder force-pushed the export-D94442964 branch from cc91eaf to 651d20d Compare March 12, 2026 17:50

ddrcoder force-pushed the export-D94442964 branch from 651d20d to 6586bbc Compare March 12, 2026 18:15

ddrcoder force-pushed the export-D94442964 branch from 6586bbc to 5a2513e Compare March 12, 2026 18:17

ddrcoder force-pushed the export-D94442964 branch from 5a2513e to b89e7a8 Compare March 12, 2026 19:09

mnorris11 added the to-benchmark label Mar 17, 2026

meta-codesync Bot changed the title ~~HNSW: add prune_headroom to avoid O(n^2) pruning/locking~~ HNSW: add prune_headroom to avoid O(n^2) pruning/locking, headroom test (#4847) Mar 18, 2026

ddrcoder force-pushed the export-D94442964 branch from b89e7a8 to 688a019 Compare March 18, 2026 15:19

ddrcoder force-pushed the export-D94442964 branch from 688a019 to 65502ce Compare March 18, 2026 15:25

meta-codesync Bot closed this in 748c031 Mar 18, 2026

facebook-github-tools Bot added the Merged label Mar 18, 2026

ddrcoder wants to merge 1 commit into facebookresearch:main from ddrcoder:export-D94442964
