
Don't store graph offsets for HNSW graph #536


Merged
5 commits merged into apache:hnsw on Jan 10, 2022

Conversation

@mayya-sharipova (Contributor) commented Dec 13, 2021

Currently we store, for each node, an offset into the graph
neighbours file from which to read that node's neighbours.
We also load these offsets onto the heap.

Instead of storing offsets, this patch calculates them when needed.
This saves both the heap space and the disk space that the offsets occupied.

To make offsets calculable:

  1. we write neighbours as Int instead of the current VInt, so that
    offsets are predictable (some extra space here)
  2. for each node we allocate ((maxConn + 1) * 4) bytes for
    storing the node's neighbours, where "maxConn" is the maximum number
    of connections a node can have. If a node has fewer than maxConn
    neighbours, we add padding to fill the leftover space (some extra
    space here). In big graphs most nodes have maxConn neighbours, so
    there should not be much wasted space.
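To make this concrete, here is a minimal sketch (my illustration, not the actual patch code) of reading a node's neighbours via a computed offset. It assumes each fixed-size slot stores one Int for the neighbour count followed by up to maxConn neighbour ids plus padding; `graphStart` is a hypothetical variable marking where the neighbour data begins in the file.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Sketch only: fixed-size records make each node's offset computable.
class NeighbourReader {
  // Assumed layout per node: [count:int][up to maxConn neighbour ids:int][padding]
  static int[] readNeighbours(IndexInput in, long graphStart, int maxConn, int node)
      throws IOException {
    long bytesPerNode = (maxConn + 1L) * Integer.BYTES;  // (maxConn + 1) * 4
    in.seek(graphStart + node * bytesPerNode);  // offset computed, not stored
    int count = in.readInt();                   // actual number of neighbours
    int[] neighbours = new int[count];
    for (int i = 0; i < count; i++) {
      neighbours[i] = in.readInt();             // fixed-width Int, not VInt
    }
    return neighbours;                          // any padding is simply not read
  }
}
```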

@msokolov (Contributor) commented:
I seem to remember that when I checked (you can use the -fanout parameter to KnnGraphTester, IIRC), most nodes were not fully populated; i.e., they had fewer than maxConn connections. Why? This seems counterintuitive, but I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices. Certainly it's appealing to avoid the extra offset/lookup data structure.

@mayya-sharipova (Contributor, Author) commented:
@msokolov Thanks for the initial review; it is good to know that we are OK with this idea. I will compare index sizes and also check the maxConn numbers.

@mayya-sharipova (Contributor, Author) commented:
@msokolov

I seem to remember that when I checked (you can use the -fanout parameter to KnnGraphTester, IIRC), most nodes were not fully populated; i.e., they had fewer than maxConn connections. Why? This seems counterintuitive

I have checked the fanout on two datasets using KnnGraphTester's -stats option, and most nodes turned out to be fully connected, though some are not:

glove-100-angular, M: 16
Graph level=0 size=1183514, Fanout min=1, mean=15.90, max=16

Fanout histogram (percentile in the first row, fanout at that percentile in the second):

| Percentile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanout | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

sift-128-euclidean, M: 16
Graph level=0 size=1000000, Fanout min=1, mean=15.52, max=16

| Percentile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fanout | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

@mayya-sharipova (Contributor, Author) commented Dec 31, 2021

I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices

I've checked the index sizes, and the size actually increased by 4-5%:

glove-100-angular
Before the change: 517 MB; after the change: 542 MB

sift-128-euclidean
Before the change: 542 MB; after the change: 564 MB

With the proposed design, even though we save space by not storing offsets, we encode each node's neighbours as Int instead of the current VInt, which uses more disk space.
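For a sense of the Int-vs-VInt cost (my illustration, not code from the PR): Lucene's VInt takes 1-5 bytes depending on magnitude, while a plain Int always takes 4.

```java
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataOutput;

// Sketch: compare encoded sizes of the same neighbour ids as VInt vs Int.
public class IntVsVIntDemo {
  public static void main(String[] args) throws IOException {
    ByteBuffersDataOutput vints = new ByteBuffersDataOutput();
    ByteBuffersDataOutput ints = new ByteBuffersDataOutput();
    for (int neighbour : new int[] {3, 127, 128, 16_383, 1_000_000}) {
      vints.writeVInt(neighbour); // 1, 1, 2, 2, and 3 bytes respectively
      ints.writeInt(neighbour);   // always 4 bytes
    }
    // Prints "VInt: 9 bytes, Int: 20 bytes"
    System.out.println("VInt: " + vints.size() + " bytes, Int: " + ints.size() + " bytes");
  }
}
```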

On the upside:

  • we save heap memory, since we no longer need to load offsets. For an index with 1M docs that is approximately 1,000,000 nodes * 8 bytes per node = 8,000,000 bytes (~8 MB). That doesn't look like much, but it grows proportionally with more indices, more vector fields, and more docs; for a field with 100M docs it is about 800 MB (see the back-of-the-envelope sketch after the tables below).
  • there is no noticeable performance degradation from the extra step of calculating offsets:

glove-100-angular

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.496 | 3549.216 | 0.481 | 4027.582 |
| n_cands=20 | 0.560 | 3423.073 | 0.553 | 3369.245 |
| n_cands=40 | 0.635 | 2686.622 | 0.631 | 2633.146 |
| n_cands=80 | 0.708 | 1889.805 | 0.707 | 1890.202 |
| n_cands=120 | 0.747 | 1476.286 | 0.748 | 1451.970 |
| n_cands=200 | 0.790 | 1037.742 | 0.791 | 1013.580 |
| n_cands=400 | 0.840 | 607.183 | 0.841 | 572.152 |
| n_cands=600 | 0.865 | 433.513 | 0.865 | 402.504 |
| n_cands=800 | 0.880 | 341.052 | 0.881 | 320.057 |

sift-128-euclidean

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.747 | 3891.531 | 0.745 | 3926.015 |
| n_cands=20 | 0.817 | 3359.364 | 0.817 | 3365.934 |
| n_cands=40 | 0.889 | 2590.605 | 0.889 | 2568.544 |
| n_cands=80 | 0.944 | 1798.558 | 0.944 | 1806.776 |
| n_cands=120 | 0.964 | 1383.721 | 0.964 | 1425.713 |
| n_cands=200 | 0.983 | 973.862 | 0.983 | 1002.114 |
| n_cands=400 | 0.994 | 586.816 | 0.994 | 599.229 |
| n_cands=600 | 0.997 | 427.128 | 0.997 | 437.296 |
| n_cands=800 | 0.998 | 341.178 | 0.998 | 349.690 |
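The heap estimate from the first bullet above, as a back-of-the-envelope sketch (assuming one 8-byte offset per node, as the bullet does):

```java
// Sketch: heap formerly spent on the per-node offset array.
public class OffsetHeapEstimate {
  public static void main(String[] args) {
    long nodes = 1_000_000L;                       // 1M docs with vectors
    System.out.println(nodes * Long.BYTES);        // 8,000,000 bytes ~ 8 MB
    System.out.println(100_000_000L * Long.BYTES); // ~ 800 MB for a 100M-doc field
  }
}
```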

@mayya-sharipova (Contributor, Author) commented:
I've also run the comparison on a bigger dataset: deep-image-96-angular, with 10M docs.
M: 16; efConstruction: 500

Disk size before the change: 4.2 GB; after the change: 4.3 GB => a ~2% increase.
Not much effect on search performance:

|  | baseline recall | baseline QPS | candidate recall | candidate QPS |
|---|---|---|---|---|
| n_cands=10 | 0.726 | 1527.894 | 0.728 | 870.721 |
| n_cands=20 | 0.793 | 1350.206 | 0.794 | 1364.301 |
| n_cands=40 | 0.862 | 1053.906 | 0.862 | 1068.798 |
| n_cands=80 | 0.917 | 737.711 | 0.918 | 741.551 |
| n_cands=120 | 0.942 | 573.783 | 0.942 | 589.756 |
| n_cands=200 | 0.964 | 402.166 | 0.964 | 414.730 |
| n_cands=400 | 0.982 | 237.545 | 0.982 | 251.678 |
| n_cands=600 | 0.988 | 174.223 | 0.988 | 177.968 |
| n_cands=800 | 0.991 | 137.420 | 0.991 | 143.290 |

@msokolov (Contributor) commented Jan 5, 2022

Thanks for the thorough testing, @mayya-sharipova. I think we want to minimize heap usage, and the index size cost is small; basically we are trading on-heap for on-disk/off-heap, which is always a tradeoff we like. The search-time change seems like noise? So +1 from me.

Also, glad to see the fanout numbers are sane :)

mayya-sharipova merged commit cd9afac into apache:hnsw on Jan 10, 2022
mayya-sharipova deleted the hnsw-offsets branch on January 10, 2022 at 15:28
@jtibshirani (Member) commented:
@mayya-sharipova this looks like a nice improvement. We should make sure it's done in a backwards-compatible way, though, so we can still read vectors that were written in Lucene 9.0. Here's a guide for making index format changes: https://github.com/apache/lucene/blob/main/lucene/backward-codecs/README.md. I think we'll want to create a new class, Lucene91HnswVectorsFormat.
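For context, the shape such a class would take (a hypothetical sketch following the linked guide, not the eventual Lucene 9.1 code; the writer and reader classes named here are assumptions):

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical sketch of a new per-version vectors format.
public final class Lucene91HnswVectorsFormat extends KnnVectorsFormat {
  public Lucene91HnswVectorsFormat() {
    super("Lucene91HnswVectorsFormat"); // SPI name; also registered in
    // META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return new Lucene91HnswVectorsWriter(state); // assumed writer for the new layout
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return new Lucene91HnswVectorsReader(state); // assumed reader for the new layout
  }
}
```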

@jtibshirani (Member) commented:
Oh sorry, I didn't see this was merged into the hnsw branch. Ignore my comment; I guess we'll handle the new format in another PR.

@mayya-sharipova (Contributor, Author) commented:
@jtibshirani Thanks for the guide on the format change; I will study it and follow it.
Indeed, this PR was merged into the hnsw branch, so we will do the format change there.

mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request on Jan 17, 2022:

Currently HNSW has only a single layer.
This patch makes the HNSW graph multi-layered.

This PR is based on the following PRs:
apache#250, apache#267, apache#287, apache#315, apache#536, apache#416

Main changes:
- Multiple layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat, with a new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter, is introduced to encode the graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching graphs built
in pre-9.1 versions. Lucene90HnswVectorsWriter is deleted.
- For backwards-compatibility tests, the previous Lucene90 graph reading
and writing logic was copied into the test files
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.

TODO: tests for KNN search on graphs built in pre-9.1 versions;
tests for merging indices of pre-9.1 + current versions.
mayya-sharipova added further commits referencing this pull request on Jan 17, Jan 25, Jan 27, and Jan 28, 2022, all with the same message as above.