HNSW: add prune_headroom to avoid O(n^2) pruning/locking, headroom test (#4847)

Closed
ddrcoder wants to merge 1 commit into facebookresearch:main from ddrcoder:export-D94442964

@ddrcoder ddrcoder commented Feb 27, 2026

Summary:

Particularly when building large graphs (10M+), insertion performance often plummets in the presence of highly contended nodes. More time is spent doing repeated O(d*M^2) pruning, and more time is spent waiting for locks:

 [6.2%] +624,960 vecs / +10.1s = +62,165 vecs/s
[12.5%] +625,152 vecs /  +8.3s = +75,316 vecs/s
[18.8%] +625,152 vecs /  +8.0s = +78,244 vecs/s
[25.0%] +625,152 vecs /  +8.6s = +72,813 vecs/s
[31.3%] +625,024 vecs /  +8.6s = +72,796 vecs/s
[37.5%] +624,960 vecs /  +9.2s = +68,038 vecs/s
[43.8%] +624,960 vecs /  +9.9s = +63,074 vecs/s
[50.0%] +624,960 vecs /  +9.3s = +67,249 vecs/s
[56.3%] +624,960 vecs /  +9.6s = +64,865 vecs/s
[62.5%] +624,960 vecs /  +9.8s = +63,622 vecs/s
[68.8%] +624,960 vecs / +10.1s = +62,016 vecs/s
[75.0%] +624,960 vecs / +10.3s = +60,543 vecs/s
[81.3%] +624,960 vecs / +23.3s = +26,858 vecs/s
[87.5%] +624,960 vecs / +12.0s = +51,891 vecs/s
[93.8%] +624,960 vecs / +11.4s = +55,053 vecs/s
[100.%] +624,960 vecs / +11.9s = +52,318 vecs/s
[done] 10,000,000 vecs / 177.9s = 56,220 vecs/s average

By providing a bit more headroom, much more of the time is spent actually constructing the graph, leading to:

Building IDMap,HNSW64 with 192 threads...
[ 6.2%] +624,960 vecs / +4.0s = +154,359 vecs/s
[12.5%] +625,152 vecs / +3.9s = +159,162 vecs/s
[18.8%] +625,152 vecs / +4.2s = +150,183 vecs/s
[25.0%] +625,152 vecs / +4.8s = +130,363 vecs/s
[31.3%] +625,024 vecs / +5.3s = +118,817 vecs/s
[37.5%] +624,960 vecs / +4.7s = +133,585 vecs/s
[43.8%] +624,960 vecs / +5.5s = +113,630 vecs/s
[50.0%] +624,960 vecs / +4.8s = +131,124 vecs/s
[56.3%] +624,960 vecs / +4.8s = +129,847 vecs/s
[62.5%] +624,960 vecs / +5.0s = +124,726 vecs/s
[68.8%] +624,960 vecs / +5.3s = +117,075 vecs/s
[75.0%] +624,960 vecs / +5.3s = +118,652 vecs/s
[81.3%] +624,960 vecs / +8.1s =  +77,497 vecs/s
[87.5%] +624,960 vecs / +6.3s =  +99,823 vecs/s
[93.8%] +624,960 vecs / +6.0s = +103,604 vecs/s
[100.%] +624,960 vecs / +5.9s = +106,680 vecs/s
[done] 10,000,000 vecs / 89.9s = 111,278 vecs/s average
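For a quick read of the two runs, the `[done]` summary lines can be compared directly. The snippet below is only a convenience sketch: the regular expression and the helper name `parse_summary` are illustrative and not part of Faiss or of the benchmark script, and the two strings are copied from the logs above.

```python
import re

# Summary lines copied verbatim from the two build logs above.
BEFORE = "[done] 10,000,000 vecs / 177.9s = 56,220 vecs/s average"
AFTER = "[done] 10,000,000 vecs / 89.9s = 111,278 vecs/s average"

# Hypothetical pattern for the "[done]" line format shown in the logs.
SUMMARY = re.compile(r"\[done\] ([\d,]+) vecs / ([\d.]+)s = ([\d,]+) vecs/s average")


def parse_summary(line):
    """Return (vector count, elapsed seconds, vectors per second) from a [done] line."""
    match = SUMMARY.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    vecs = int(match.group(1).replace(",", ""))
    seconds = float(match.group(2))
    rate = int(match.group(3).replace(",", ""))
    return vecs, seconds, rate


before_vecs, before_s, before_rate = parse_summary(BEFORE)
after_vecs, after_s, after_rate = parse_summary(AFTER)

# Both runs insert the same number of vectors, so the ratios are comparable.
time_speedup = before_s / after_s
rate_speedup = after_rate / before_rate

print(f"wall-clock speedup: {time_speedup:.2f}x")
print(f"throughput speedup: {rate_speedup:.2f}x")
```

Both ratios come out just under 2x, which matches the drop from 177.9s to 89.9s reported in the logs.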

The increased pruning has minimal effect on recall, particularly with the default value of prune_headroom = 0.2.
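To build intuition for why headroom reduces repeated pruning, here is a toy model in Python. It is a sketch under loud assumptions, not the Faiss implementation: it assumes a prune pass fires whenever a link arrives at a full neighbor list, and that each pass leaves `int(capacity * (1 - headroom))` links. The function name `count_prunes` and the parameter `capacity` are hypothetical, and the real pruning heuristic may keep fewer links than this model does.

```python
def count_prunes(num_links: int, capacity: int, headroom: float) -> int:
    """Count prune passes for one node that receives num_links incoming links.

    Toy model (assumption, not the Faiss code): a prune pass fires when a link
    arrives at a full list, and the pass leaves
    keep = max(1, int(capacity * (1 - headroom))) links in the list.
    """
    keep = max(1, int(capacity * (1.0 - headroom)))
    size = 0
    prunes = 0
    for _ in range(num_links):
        if size < capacity:
            size += 1      # free slot: append the link without pruning
        else:
            prunes += 1    # full list: one prune pass over the candidates
            size = keep    # headroom leaves capacity - keep free slots
    return prunes


# A heavily contended node that receives 1000 links, with 64 neighbor slots.
for h in (0.0, 0.2):
    print(f"headroom={h:.1f}: {count_prunes(1000, 64, h)} prune passes")
```

In this model, with no headroom every link after the list fills triggers a prune pass, while with headroom 0.2 a pass is needed only about once every 14 links. Each pass is the `O(d*M^2)` step from the summary, run while holding the node's lock, so fewer passes also means less lock contention.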

Reviewed By: mdouze

Differential Revision: D94442964

@meta-cla meta-cla Bot added the CLA Signed label Feb 27, 2026

meta-codesync Bot commented Feb 27, 2026

@ddrcoder has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94442964.

@alexanderguzhva

@ddrcoder are there any possible side effects, such as a different search time? It looks quite suspicious that none were mentioned for this optimization.

ddrcoder added a commit to ddrcoder/faiss that referenced this pull request Mar 12, 2026
…st (facebookresearch#4847)

ddrcoder added a commit to ddrcoder/faiss that referenced this pull request Mar 12, 2026
…st (facebookresearch#4847)

@ddrcoder

@alexanderguzhva in my tests, I've found that the difference in recall is on par with the random variation from run to run, with headroom=0.2 edging out the baseline if anything.

> buck2 run @mode/opt //faiss/benchs:bench_hnsw_prune_headroom
...
Generating synthetic dataset: d=128, nb=50000, nq=10000
Computing ground truth for k=10
HNSW32(prune_headroom=0.00): build_time=0.24s, ndis_search=12636.1
HNSW32(prune_headroom=0.00): build_time=0.19s, ndis_search=12630.4
HNSW32(prune_headroom=0.00): build_time=0.33s, ndis_search=12631.3
HNSW32(prune_headroom=0.04): build_time=0.11s, ndis_search=12620.3
HNSW32(prune_headroom=0.04): build_time=0.20s, ndis_search=12609.8
HNSW32(prune_headroom=0.04): build_time=0.14s, ndis_search=12622.2
HNSW32(prune_headroom=0.08): build_time=0.25s, ndis_search=12598.5
HNSW32(prune_headroom=0.08): build_time=0.28s, ndis_search=12600.5
HNSW32(prune_headroom=0.08): build_time=0.20s, ndis_search=12588.5
HNSW32(prune_headroom=0.12): build_time=0.61s, ndis_search=12595.7
HNSW32(prune_headroom=0.12): build_time=0.21s, ndis_search=12582.4
HNSW32(prune_headroom=0.12): build_time=0.18s, ndis_search=12586.9
HNSW32(prune_headroom=0.16): build_time=0.17s, ndis_search=12570.2
HNSW32(prune_headroom=0.16): build_time=0.17s, ndis_search=12557.2
HNSW32(prune_headroom=0.16): build_time=0.18s, ndis_search=12586.5
HNSW32(prune_headroom=0.20): build_time=0.09s, ndis_search=12554.4
HNSW32(prune_headroom=0.20): build_time=0.21s, ndis_search=12553.0
HNSW32(prune_headroom=0.20): build_time=0.18s, ndis_search=12553.9

k= 1   | ef= 16 | ef= 32 | ef= 64 | ef=128 | ef=256
---------------------------------------------------
h=0.00 | 0.9003 | 0.9231 | 0.9354 | 0.9395 | 0.9413
h=0.00 | 0.9016 | 0.9254 | 0.9356 | 0.9394 | 0.9413
h=0.00 | 0.9006 | 0.9242 | 0.9347 | 0.9394 | 0.9416
h=0.04 | 0.9005 | 0.9252 | 0.9358 | 0.9400 | 0.9414
h=0.04 | 0.9035 | 0.9268 | 0.9353 | 0.9402 | 0.9413
h=0.04 | 0.9029 | 0.9259 | 0.9346 | 0.9392 | 0.9413
h=0.08 | 0.9044 | 0.9270 | 0.9352 | 0.9396 | 0.9414
h=0.08 | 0.9050 | 0.9271 | 0.9359 | 0.9398 | 0.9414
h=0.08 | 0.9000 | 0.9247 | 0.9349 | 0.9400 | 0.9415
h=0.12 | 0.9007 | 0.9263 | 0.9356 | 0.9397 | 0.9414
h=0.12 | 0.9025 | 0.9257 | 0.9354 | 0.9400 | 0.9415
h=0.12 | 0.9035 | 0.9275 | 0.9364 | 0.9401 | 0.9416
h=0.16 | 0.9021 | 0.9259 | 0.9359 | 0.9399 | 0.9415
h=0.16 | 0.9019 | 0.9250 | 0.9354 | 0.9402 | 0.9415
h=0.16 | 0.9021 | 0.9254 | 0.9354 | 0.9399 | 0.9415
h=0.20 | 0.8989 | 0.9248 | 0.9350 | 0.9400 | 0.9414
h=0.20 | 0.9034 | 0.9252 | 0.9359 | 0.9399 | 0.9416
h=0.20 | 0.9019 | 0.9267 | 0.9356 | 0.9399 | 0.9416

k=10   | ef= 16 | ef= 32 | ef= 64 | ef=128 | ef=256
---------------------------------------------------
h=0.00 | 0.8709 | 0.9102 | 0.9292 | 0.9370 | 0.9401
h=0.00 | 0.8708 | 0.9107 | 0.9289 | 0.9373 | 0.9401
h=0.00 | 0.8705 | 0.9101 | 0.9288 | 0.9374 | 0.9400
h=0.04 | 0.8705 | 0.9094 | 0.9283 | 0.9372 | 0.9401
h=0.04 | 0.8710 | 0.9103 | 0.9287 | 0.9371 | 0.9400
h=0.04 | 0.8708 | 0.9096 | 0.9284 | 0.9371 | 0.9401
h=0.08 | 0.8716 | 0.9101 | 0.9285 | 0.9371 | 0.9401
h=0.08 | 0.8725 | 0.9104 | 0.9288 | 0.9371 | 0.9403
h=0.08 | 0.8711 | 0.9093 | 0.9290 | 0.9372 | 0.9400
h=0.12 | 0.8701 | 0.9098 | 0.9286 | 0.9372 | 0.9400
h=0.12 | 0.8711 | 0.9099 | 0.9292 | 0.9372 | 0.9402
h=0.12 | 0.8707 | 0.9099 | 0.9291 | 0.9373 | 0.9400
h=0.16 | 0.8705 | 0.9097 | 0.9286 | 0.9370 | 0.9401
h=0.16 | 0.8714 | 0.9095 | 0.9289 | 0.9372 | 0.9401
h=0.16 | 0.8697 | 0.9091 | 0.9285 | 0.9373 | 0.9401
h=0.20 | 0.8693 | 0.9099 | 0.9288 | 0.9371 | 0.9402
h=0.20 | 0.8711 | 0.9098 | 0.9289 | 0.9373 | 0.9400
h=0.20 | 0.8711 | 0.9104 | 0.9289 | 0.9372 | 0.9401
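One way to check the "on par with run-to-run variation" reading is to compare the difference in means against the spread within a setting. The sketch below uses only numbers copied from the output above (the `h=0.00` and `h=0.20` rows, `ef=16` column, plus `ndis_search`). The helper `summarize` is illustrative, and with three runs per setting this is a rough sanity check rather than a statistical test.

```python
from statistics import mean

# Three runs per setting, copied from the benchmark output above.
ndis_search = {
    0.00: [12636.1, 12630.4, 12631.3],
    0.20: [12554.4, 12553.0, 12553.9],
}
recall_k1_ef16 = {
    0.00: [0.9003, 0.9016, 0.9006],
    0.20: [0.8989, 0.9034, 0.9019],
}
recall_k10_ef16 = {
    0.00: [0.8709, 0.8708, 0.8705],
    0.20: [0.8693, 0.8711, 0.8711],
}


def summarize(runs):
    """Return (mean, max - min) for a list of repeated measurements."""
    return mean(runs), max(runs) - min(runs)


for name, table in (("k=1", recall_k1_ef16), ("k=10", recall_k10_ef16)):
    base_mean, base_spread = summarize(table[0.00])
    head_mean, head_spread = summarize(table[0.20])
    delta = head_mean - base_mean
    spread = max(base_spread, head_spread)
    print(f"{name} ef=16: delta={delta:+.4f}, run-to-run spread={spread:.4f}")

# Relative change in distance computations per search, a proxy for search cost.
ndis_drop = 1.0 - mean(ndis_search[0.20]) / mean(ndis_search[0.00])
print(f"ndis_search reduction at h=0.20: {ndis_drop:.2%}")
```

For both `k=1` and `k=10` the shift in mean recall is smaller than the spread between repeated runs, and `ndis_search` is roughly 0.6% lower at `h=0.20`, which suggests search cost is not worse on this synthetic dataset.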

ddrcoder added a commit to ddrcoder/faiss that referenced this pull request Mar 12, 2026
…st (facebookresearch#4847)

ddrcoder added a commit to ddrcoder/faiss that referenced this pull request Mar 12, 2026
…st (facebookresearch#4847)

@meta-codesync meta-codesync Bot changed the title from "HNSW: add prune_headroom to avoid O(n^2) pruning/locking" to "HNSW: add prune_headroom to avoid O(n^2) pruning/locking, headroom test (#4847)" Mar 18, 2026
ddrcoder added a commit to ddrcoder/faiss that referenced this pull request Mar 18, 2026
…st (facebookresearch#4847)

…st (facebookresearch#4847)

Summary:
Pull Request resolved: facebookresearch#4847

meta-codesync Bot commented Mar 18, 2026

This pull request has been merged in 748c031.
