Conversation
…where the node is present
vels to candidates pool in layer 0
… no dupes are submitted
…e in all the corresponding layers.
jkni
left a comment
There was a problem hiding this comment.
A few very minor comments, and one tiny bug I missed in the first pass of review.
jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndex.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/NodeQueue.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndex.java
Outdated
Show resolved
Hide resolved
| for (int i = 0; i < other.graph.getIdUpperBound(); i++) { | ||
| if (!other.graph.containsNode(i)) { | ||
| continue; | ||
| IntStream.range(0, other.graph.getIdUpperBound()).parallel().forEach(i -> { |
There was a problem hiding this comment.
Yes, confirming the change captured my intent.
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
| rerankedResults.setMaxSize(topK); | ||
|
|
||
| // add evicted results from the last call back to the candidates | ||
| var previouslyEvicted = evictedResults.size() > 0 ? new SparseBits() : Bits.NONE; |
There was a problem hiding this comment.
We use evictedResults, but not the previouslyEvicted bitset created/populated here (https://github.com/datastax/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L302). We used to use this bitset in resume (https://github.com/datastax/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L302), but we don't in the new hierarchical structure. I think we can keep evictedResults but don't need to spend the CPU to create/update previouslyEvicted (it's set but never read).
| static List<Integer> sortedNodes(GraphIndex h) { | ||
| var graphNodes = h.getNodes(); | ||
| static List<Integer> sortedNodes(GraphIndex h, int level) { | ||
| var graphNodes = h.getNodes(level); // TODO |
There was a problem hiding this comment.
This is marked TODO without explanation. Is there something remaining?
| builder.addGraphNode(1, ravv.getVector(1)); | ||
| builder.addGraphNode(2, ravv.getVector(2)); | ||
| var neighbors = builder.graph.getNeighbors(0); | ||
| var neighbors = builder.graph.getNeighbors(0, 0); // TODO |
There was a problem hiding this comment.
TODO? Something missing?
| var ordinal = getRandom().nextInt(graph.size()); | ||
| // first pass compares fused ADC's direct similarity to reranker's similarity, used for comparisons to a specific node | ||
| var neighbors = cachedOnDiskView.getNeighborsIterator(ordinal); | ||
| var neighbors = cachedOnDiskView.getNeighborsIterator(0, ordinal); // TODO |
There was a problem hiding this comment.
TODO? Something left to do?
tlwillke
left a comment
There was a problem hiding this comment.
Some TODOs remaining but seem minor. A major contribution. Will affect public interfaces but it is the path forward, so worth it!
This PR introduces a hybrid HNSW-DiskANN graph. From HNSW, we take the idea of using multiple layers. This adds robustness to particularly hard datasets. Each layer is a Vamana graph. The upper layers reside in memory while the base layer resides on disk (in DiskANN style). It also enable using a single layer. In that case, it is a plain Vamana graph.