
Don't store graph offsets for HNSW graph #536


Merged
5 commits merged into apache:hnsw on Jan 10, 2022

Conversation

@mayya-sharipova (Contributor) commented Dec 13, 2021

Currently we store, for each node, an offset into the graph
neighbours file from which to read that node's neighbours.
We also load these offsets onto the heap.

Instead of storing offsets, this patch calculates them when needed.
This saves both the heap space and the disk space that the offsets occupied.

To make offsets calculable:

  1. we write neighbours as Int instead of the current VInt, so that
    offsets are predictable (some extra space here)
  2. for each node we allocate ((maxConn + 1) * 4) bytes for
    storing the node's neighbours, where "maxConn" is the maximum number
    of connections a node can have. If a node has fewer than maxConn
    neighbours, we add padding to fill the leftover space (some extra
    space here). In big graphs most nodes have maxConn neighbours, so
    there should not be much wasted space.
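To make this concrete, here is a minimal sketch (my illustration, not the actual patch code) of reading a node's neighbours via a computed offset. It assumes each fixed-size slot stores one Int for the neighbour count followed by up to maxConn neighbour ids plus padding; `graphStart` is a hypothetical variable marking where the neighbour data begins in the file.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Sketch only: fixed-size records make each node's offset computable.
class NeighbourReader {
  // Assumed layout per node: [count:int][up to maxConn neighbour ids:int][padding]
  static int[] readNeighbours(IndexInput in, long graphStart, int maxConn, int node)
      throws IOException {
    long bytesPerNode = (maxConn + 1L) * Integer.BYTES;  // (maxConn + 1) * 4
    in.seek(graphStart + node * bytesPerNode);  // offset computed, not stored
    int count = in.readInt();                   // actual number of neighbours
    int[] neighbours = new int[count];
    for (int i = 0; i < count; i++) {
      neighbours[i] = in.readInt();             // fixed-width Int, not VInt
    }
    return neighbours;                          // any padding is simply not read
  }
}
```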

@msokolov (Contributor) commented:
I seem to remember that when I checked (you can use the -fanout parameter to KnnGraphTester, IIRC), most nodes were not fully populated; i.e., they had fewer than maxConn connections. Why? This seems counterintuitive, but I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices. Certainly it's appealing to avoid the extra offset/lookup data structure.

@mayya-sharipova (Contributor, Author) commented:
@msokolov Thanks for the initial review; it is good to know that we are OK with this idea. I will compare index sizes and also check the maxConn numbers.

@mayya-sharipova (Contributor, Author) commented:
@msokolov

I seem to remember that when I checked (you can use the -fanout parameter to KnnGraphTester, IIRC), most nodes were not fully populated; i.e., they had fewer than maxConn connections. Why? This seems counterintuitive

I have checked the fanout on two datasets using KnnGraphTester's -stats option, and most nodes turned out to be fully connected, though some are not:

glove-100-angular, M: 16
Graph level=0 size=1183514, Fanout min=1, mean=15.90, max=16

Fanout histogram (percentile in the first row, fanout at that percentile in the second):

| Percentile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanout | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

sift-128-euclidean, M: 16
Graph level=0 size=1000000, Fanout min=1, mean=15.52, max=16

| Percentile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanout | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

@mayya-sharipova (Contributor, Author) commented Dec 31, 2021

I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices

I've checked the index sizes, and the size actually increased by 4-5%:

glove-100-angular
Before the change: 517 MB; after the change: 542 MB

sift-128-euclidean
Before the change: 542 MB; after the change: 564 MB

With the proposed design, even though we save space by not storing offsets, we encode each node's neighbours as Int instead of the current VInt, which uses more disk space.
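For a sense of the Int-vs-VInt cost (my illustration, not code from the PR): Lucene's VInt takes 1-5 bytes depending on magnitude, while a plain Int always takes 4.

```java
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataOutput;

// Sketch: compare encoded sizes of the same neighbour ids as VInt vs Int.
public class IntVsVIntDemo {
  public static void main(String[] args) throws IOException {
    ByteBuffersDataOutput vints = new ByteBuffersDataOutput();
    ByteBuffersDataOutput ints = new ByteBuffersDataOutput();
    for (int neighbour : new int[] {3, 127, 128, 16_383, 1_000_000}) {
      vints.writeVInt(neighbour); // 1, 1, 2, 2, and 3 bytes respectively
      ints.writeInt(neighbour);   // always 4 bytes
    }
    // Prints "VInt: 9 bytes, Int: 20 bytes"
    System.out.println("VInt: " + vints.size() + " bytes, Int: " + ints.size() + " bytes");
  }
}
```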

On the upside:

  • we save heap memory, since we no longer need to load offsets. For an index with 1M docs that is approximately 1,000,000 nodes * 8 bytes per node = 8,000,000 bytes (~8 MB). That doesn't look like much, but it grows proportionally with more indices, more vector fields, and more docs; for a field with 100M docs it is about 800 MB (see the back-of-the-envelope sketch after the tables below).
  • there is no noticeable performance degradation from the extra step of calculating offsets:

glove-100-angular

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.496 | 3549.216 | 0.481 | 4027.582 |
| n_cands=20 | 0.560 | 3423.073 | 0.553 | 3369.245 |
| n_cands=40 | 0.635 | 2686.622 | 0.631 | 2633.146 |
| n_cands=80 | 0.708 | 1889.805 | 0.707 | 1890.202 |
| n_cands=120 | 0.747 | 1476.286 | 0.748 | 1451.970 |
| n_cands=200 | 0.790 | 1037.742 | 0.791 | 1013.580 |
| n_cands=400 | 0.840 | 607.183 | 0.841 | 572.152 |
| n_cands=600 | 0.865 | 433.513 | 0.865 | 402.504 |
| n_cands=800 | 0.880 | 341.052 | 0.881 | 320.057 |

sift-128-euclidean

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.747 | 3891.531 | 0.745 | 3926.015 |
| n_cands=20 | 0.817 | 3359.364 | 0.817 | 3365.934 |
| n_cands=40 | 0.889 | 2590.605 | 0.889 | 2568.544 |
| n_cands=80 | 0.944 | 1798.558 | 0.944 | 1806.776 |
| n_cands=120 | 0.964 | 1383.721 | 0.964 | 1425.713 |
| n_cands=200 | 0.983 | 973.862 | 0.983 | 1002.114 |
| n_cands=400 | 0.994 | 586.816 | 0.994 | 599.229 |
| n_cands=600 | 0.997 | 427.128 | 0.997 | 437.296 |
| n_cands=800 | 0.998 | 341.178 | 0.998 | 349.690 |
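The heap estimate from the first bullet above, as a back-of-the-envelope sketch (assuming one 8-byte offset per node, as the bullet does):

```java
// Sketch: heap formerly spent on the per-node offset array.
public class OffsetHeapEstimate {
  public static void main(String[] args) {
    long nodes = 1_000_000L;                       // 1M docs with vectors
    System.out.println(nodes * Long.BYTES);        // 8,000,000 bytes ~ 8 MB
    System.out.println(100_000_000L * Long.BYTES); // ~ 800 MB for a 100M-doc field
  }
}
```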

@mayya-sharipova (Contributor, Author) commented:
I've also run the comparison on a bigger dataset: deep-image-96-angular, with 10M docs.
M: 16; efConstruction: 500

Disk size before the change: 4.2 GB; after the change: 4.3 GB => a ~2% increase.
Not much effect on search performance:

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.726 | 1527.894 | 0.728 | 870.721 |
| n_cands=20 | 0.793 | 1350.206 | 0.794 | 1364.301 |
| n_cands=40 | 0.862 | 1053.906 | 0.862 | 1068.798 |
| n_cands=80 | 0.917 | 737.711 | 0.918 | 741.551 |
| n_cands=120 | 0.942 | 573.783 | 0.942 | 589.756 |
| n_cands=200 | 0.964 | 402.166 | 0.964 | 414.730 |
| n_cands=400 | 0.982 | 237.545 | 0.982 | 251.678 |
| n_cands=600 | 0.988 | 174.223 | 0.988 | 177.968 |
| n_cands=800 | 0.991 | 137.420 | 0.991 | 143.290 |

@msokolov (Contributor) commented Jan 5, 2022

Thanks for the thorough testing, @mayya-sharipova. I think we want to minimize heap usage, and the index size cost is small; basically we are trading on-heap for on-disk/off-heap, which is always a tradeoff we like. The search-time change seems like noise? So +1 from me.

Also, glad to see the fanout numbers are sane :)

mayya-sharipova merged commit cd9afac into apache:hnsw on Jan 10, 2022
mayya-sharipova deleted the hnsw-offsets branch on January 10, 2022 at 15:28
@jtibshirani (Member) commented:
@mayya-sharipova this looks like a nice improvement. We should make sure it's done in a backwards-compatible way, though, so we can still read vectors that were written in Lucene 9.0. Here's a guide for making index format changes: https://github.com/apache/lucene/blob/main/lucene/backward-codecs/README.md. I think we'll want to create a new class, Lucene91HnswVectorsFormat.
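For context, the shape such a class would take (a hypothetical sketch following the linked guide, not the eventual Lucene 9.1 code; the writer and reader classes named here are assumptions):

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical sketch of a new per-version vectors format.
public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
  public Lucene91HnswVectorsFormat() {
    super("Lucene91HnswVectorsFormat"); // SPI name; also registered in
    // META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return new Lucene91HnswVectorsWriter(state); // assumed writer for the new layout
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return new Lucene91HnswVectorsReader(state); // assumed reader for the new layout
  }
}
```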

@jtibshirani (Member) commented:
Oh sorry, I didn't see this was merged into the hnsw branch. Ignore my comment; I guess we'll handle the new format in another PR.

@mayya-sharipova (Contributor, Author) commented:
@jtibshirani Thanks for the guide on the format change; I will study it and follow it.
Indeed, this PR was merged into the hnsw branch, so we will do the format change there.

mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request on Jan 17, 2022:

Currently HNSW has only a single layer.
This patch makes the HNSW graph multi-layered.

This PR is based on the following PRs:
apache#250, apache#267, apache#287, apache#315, apache#536, apache#416

Main changes:
- Multiple layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat, with a new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter, is introduced to encode the graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching graphs built
in pre-9.1 versions. Lucene90HnswVectorsWriter is deleted.
- For backwards-compatibility tests, the previous Lucene90 graph reading
and writing logic was copied into the test files
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.

TODO: tests for KNN search on graphs built in pre-9.1 versions;
tests for merging indices of pre-9.1 + current versions.
mayya-sharipova added further commits referencing this pull request on Jan 17, Jan 25, Jan 27, and Jan 28, 2022, all with the same message as above.