Enhancement 11236 lazy compute similarity score #12480
Conversation
@benwtrent @zhaih please take a look when you get a chance! I couldn't figure out how to update the existing PR, so I created a new one.
Thank you so much for running the benchmark.

I have two optimizations you should apply. I suggest the following to re-measure:
- Use realistic vector sizes: you should be testing with at least 768 Float32 dimensions, so that the distance calculation has its typical expense.
- If performance still isn't improved (with my optimizations and on 768-dim vectors), I suggest using JFR to see where the time is spent; it could be that creating and deleting objects and adding/removing from the hashmap washes out any real improvement :/ (a profiling sketch follows below)
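For reference, a minimal sketch of capturing a recording programmatically with the `jdk.jfr` API (JDK 11+); `runMergeBenchmark` is a hypothetical stand-in for the workload under test, and starting a recording externally via `jcmd` works just as well:

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;
import java.nio.file.Path;

// Inside a method declared to throw Exception. The built-in "profile"
// configuration enables allocation and CPU sampling events.
try (Recording recording = new Recording(Configuration.getConfiguration("profile"))) {
  recording.start();
  runMergeBenchmark(); // hypothetical: the indexing/merge workload to profile
  // Dump events to disk; open the file in JDK Mission Control to see where
  // time and allocations actually go.
  recording.dump(Path.of("merge-profile.jfr"));
}
```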
```java
public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  scoringContext = new HashMap<>();
```
Pre-allocate to `maxSize`. We should avoid making the hashmap grow as we add things.
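A minimal sketch of that, assuming the default 0.75 load factor (a `HashMap` rehashes once its size exceeds capacity times load factor):

```java
public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  // Request enough capacity up front to hold maxSize entries without growing.
  scoringContext = new HashMap<>((int) Math.ceil(maxSize / 0.75f));
}
```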
We don't actually need a hashmap; I believe an array with length of `maxSize` is more than enough, as we're at most mapping `idx` to `ScoringFunction`, right?
@zhaih yes! You are correct. Just having an array of `maxSize` works, and `null` means there is no scoring function.
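A rough sketch of that shape (`ScoringFunction` is the type this PR already introduces; clearing the slot after use is my assumption):

```java
// Slot i holds the pending scoring function for index i; null means the
// score at that index has already been computed.
private final ScoringFunction[] scoringContext;

public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  scoringContext = new ScoringFunction[maxSize];
}
```

The lookup in the sort path then becomes a plain array read instead of a map probe:

```java
ScoringFunction pending = scoringContext[sortedNodeSize];
if (pending != null) {
  score[sortedNodeSize] = pending.calculateScore();
  scoringContext[sortedNodeSize] = null; // clear the slot once consumed
}
```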
```java
// Check if we need to compute score
if (scoringContext.containsKey(sortedNodeSize)) {
  score[sortedNodeSize] = scoringContext.get(sortedNodeSize).calculateScore();
  scoringContext.remove(sortedNodeSize);
}
```
You should simplify and only call the map once.
Suggested change:

```java
// Check if we need to compute score
ScoringFunction maybeScoringFunction = scoringContext.remove(sortedNodeSize);
if (maybeScoringFunction != null) {
  score[sortedNodeSize] = maybeScoringFunction.calculateScore();
}
```
Hi Jack, thanks for doing the benchmark; I think the count of scoring computations does indicate the optimization is worth trying.

To further squeeze out the performance and show the benefit, I think we need to optimize the data structure so that it doesn't create too much overhead during the graph build. I have several suggestions below; let me know what you think!

Also +1 to @benwtrent's suggestion about using 768-dim vectors for further testing.

Thanks again for the hard work!
```
@@ -111,6 +129,12 @@ public int[] sort() {
  private int insertSortedInternal() {
```
To further optimize the memory usage and eliminate potential GC overhead, I would suggest not storing `ScoringFunction` at all.

In theory, to compute a score we need three things: the `SimilarityFunction`, `node_emb_1`, and `node_emb_2`. The node embeddings can be obtained by calling `vectors.vectorValue(nodeId)` from `HnswGraphBuilder`, and the `SimilarityFunction` is also held by `HnswGraphBuilder`.

That means we don't even need to store the similarity function and the two byte arrays beforehand; we can let the `HnswGraphBuilder` pass in a `BiFunction<Integer, Integer, Float>` to perform the scoring when we need it, and inside `NeighborArray` we just need to hand that function the two nodes' ids. That way, we don't need to store extra context inside `NeighborArray`, and we also avoid holding on to a lot of `byte[]` for too long. (If you think about it, we're currently holding every byte/float array necessary for score computation until we compute the score or the graph is constructed; that's a huge GC load if your machine doesn't have enough memory.)
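A minimal sketch of the proposed shape; `similarityFunction`, `vectors`, and `vectorsCopy` are the fields `HnswGraphBuilder` already uses in this PR, while `setLazyScorer` is a hypothetical hook on `NeighborArray`:

```java
// The builder closes over its own data, so NeighborArray never stores
// vectors or similarity functions, only node ids and this callback.
BiFunction<Integer, Integer, Float> scorer =
    (node1, node2) -> {
      try {
        return similarityFunction.compare(
            (float[]) vectors.vectorValue(node1),
            (float[]) vectorsCopy.vectorValue(node2));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    };
neighbors.setLazyScorer(scorer); // hypothetical wiring
```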
I like this idea.

@zhaih instead of adding a function reference to the `NeighborArray#ctor`, why not pass the function to the `NeighborArray#sort()` method?
This obviates the need for any additional structures: if `score == -1`, calculate via the method provided to `sort()`. Though it probably won't be a `BiFunction<Integer, Integer, Float>` but instead a `Function<Integer, Float>`, as only the caller will know which is the "current" node for the neighbor array. Or even better, a custom functional interface that works on native values (`int -> float`); this way we don't box the integers and scoring results unnecessarily.
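A rough sketch of such an interface (all names here are mine, and `size` is assumed to be the current neighbor count); the caller fixes the "current" node, so the callback only needs the neighbor's id:

```java
// No Integer/Float boxing on the scoring hot path.
@FunctionalInterface
interface LazyScorer {
  float score(int neighborNode) throws IOException;
}

// Inside sort(LazyScorer scorer): materialize any scores not yet computed
// before the existing sorting logic runs.
private void computePendingScores(LazyScorer scorer) throws IOException {
  for (int i = 0; i < size; i++) {
    if (score[i] == -1f) { // "not computed" sentinel; refined to NaN below
      score[i] = scorer.score(node[i]);
    }
  }
}
```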
> why not pass the function to the NeighborArray#sort() method?

Yeah, +1.

> This obviates the need for any additional structures, if score==-1 calculate via the method provided to sort().

I have one concern with this: `-1` might be a valid score itself. But we can always keep an additional integer, like `noScoreIndex` or so, to indicate up to which index we haven't calculated the score (because those nodes will always sit at the front of the score array).

> Though it probably won't be BiFunction<Integer, Integer, Float> but instead a Function<Integer, Float> as only the caller will know which is the "current" node for the neighbor array. Or even better, a custom functional interface that works on native values int -> float

+1, thanks for the more detailed thinking!
Score "NaN" is probably best to cover our bases :)
Oh right, I forgot Java has `NaN`; yeah, then I guess `NaN` will be the best.
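As a quick illustration of why the check must go through `Float.isNaN` rather than `==` (standard IEEE-754 behavior, not PR-specific code; `score`, `node`, and `scorer` continue the sketch above):

```java
float sentinel = Float.NaN;
// NaN compares unequal to everything, including itself, so == cannot find it,
// and no genuine similarity score can ever collide with it (unlike -1f).
assert sentinel != sentinel;
if (Float.isNaN(score[i])) {
  score[i] = scorer.score(node[i]); // compute lazily only while still pending
}
```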
This is a wonderful thread, I am learning a lot! Let me take y'all's suggestions and see how much more performance we can gain. Again, really appreciate your help!

Here is a quick re-run of the benchmark (100-dim vectors) on the optimized code, with a 90%/10% split on document addition, comparing baseline -> old candidate -> optimized candidate. I will run the benchmark on 768-dim vectors later.
Thank you @Jackyrie2 for all the benchmarking! Since we have already made it lightweight enough (it requires no extra memory in `NeighborArray`) and the benchmark has shown mostly positive results (benchmarks do have quite a lot of noise, especially when run on a local machine; if you want, you could write a simple bash loop, run the evaluation 3-5 times, and use the average), I would say this PR is a good optimization. But I think it needs some more documentation, as `NeighborArray` is already complex enough, so I will try to carefully review it again soon. Thank you!
I left some small comments; overall it looks good to me!
```java
float[] vectorValue = null;
byte[] binaryValue = null;
switch (this.vectorEncoding) {
  case FLOAT32 -> vectorValue = (float[]) vectors.vectorValue(nodeOrd);
```
Let's take this part outside of the lambda to reduce the number of times we call `vectorValue`; this operation involves some seek and parse work on off-heap memory.
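A sketch of the hoisted version (variable names beyond the quoted ones are mine): resolve the current node's vector once, and let the lambda capture the result instead of re-reading it on every lazy evaluation:

```java
// One off-heap seek + parse per node, instead of one per deferred score.
float[] floatVector = null;
byte[] byteVector = null;
switch (this.vectorEncoding) {
  case FLOAT32 -> floatVector = (float[]) vectors.vectorValue(nodeOrd);
  case BYTE -> byteVector = (byte[]) vectors.vectorValue(nodeOrd);
}
// Lambdas can only capture effectively final locals, hence the copies.
final float[] capturedFloats = floatVector;
final byte[] capturedBytes = byteVector;
// The scorer lambda then reads capturedFloats/capturedBytes directly.
```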
```java
  case BYTE -> this.similarityFunction.compare(
      binaryValue, (byte[]) vectorsCopy.vectorValue(newNeighbor));
};
// float score =
```
Let's remove this commented-out code?
```java
if (Float.isNaN(tmpScore)) {
  tmpScore = scoringFunction.computeScore(tmpNode);
  System.out.println("Node: " + tmpNode + " Score: " + tmpScore);
```
remove this?
Ah sorry, I almost forgot:
LGTM, I don't think there's a need to make it Lucene 10 only? (we're not breaking any APIs)
lucene/CHANGES.txt (outdated)

```
@@ -90,6 +90,8 @@ Optimizations

* GITHUB#12408: Lazy initialization improvements for Facets implementations when there are segments with no hits
  to count. (Greg Miller)

* GITHUB##12371: enable lazy computation of similarity score during initializeFromGraph (Jack Wang)
```
Put it under Lucene 9.8?
```java
      vectorValue, (float[]) vectorsCopy.vectorValue(newNeighbor));
  case BYTE -> this.similarityFunction.compare(
      binaryValue, (byte[]) vectorsCopy.vectorValue(newNeighbor));
};
// we are not sure whether the previous graph contains
```
Update this comment to something like "We will compute those scores lazily when we need to pop out the non-diverse nodes"?
…ps://github.com/Jackyrie2/lucene into enhancement-11236-lazy-compute-similarity-score
@Jackyrie2 Seems the precommit fails?
Merged and backported
Description
This is an update to the previous PR #12371; a few issues were found:

- the `scoringContext` map instead of using size as index
- `scoringContext` was evaluated during the call to `sortInternal` instead of just checking whether the new index being sorted has a pre-computed score

Two new unit tests were added to demonstrate the bugs; this PR fixes the issues above.
Benchmarking Result
To measure any meaningful latency improvement, we first have to create one big index and several small indexes; then, once we force an index merge, the index writer will invoke `HnswGraphBuilder.initializeFromGraph`. `KnnGraphTester` was modified accordingly (a rough sketch of the merge trigger follows below).

From the benchmark results, we are calculating significantly fewer scores using the lazy-eval enhancement. However, the indexMergeTime did not decrease as expected.
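For context, a rough illustration of the merge trigger (the directory path and writer config are assumptions, not the tester's actual code):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Path;

// Open the index that already contains the big segment plus the small ones,
// then collapse everything into one segment; merging the HNSW graphs is what
// invokes HnswGraphBuilder.initializeFromGraph.
try (Directory dir = FSDirectory.open(Path.of("index-dir"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  writer.forceMerge(1);
}
```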