Enhancement 11236 lazy compute similarity score #12480

Merged

Conversation

@Jackyrie2 (Contributor) commented Aug 1, 2023

Description

This is an update to the previous PR #12371; a few issues were found:

  • the ordinal of newNode was used as the index into the scoringContext map, instead of using size as the index
  • the entire scoringContext was evaluated during the call to sortInternal, instead of just checking whether the new index being sorted has a pre-computed score

Two new unit tests were added to demonstrate the bugs; this PR fixes both issues. A sketch of the corrected bookkeeping follows.
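A minimal sketch of the corrected bookkeeping, with hypothetical names rather than the exact PR code (ScoringFunction, addDeferred, and the field layout are assumptions based on the review discussion below):

import java.util.HashMap;
import java.util.Map;

class NeighborArraySketch {
  interface ScoringFunction {
    float calculateScore();
  }

  final int[] node;
  final float[] score;
  // Maps the *slot index* in node[]/score[] to a deferred scoring function.
  final Map<Integer, ScoringFunction> scoringContext = new HashMap<>();
  int size;

  NeighborArraySketch(int maxSize) {
    node = new int[maxSize];
    score = new float[maxSize];
  }

  void addDeferred(int newNode, ScoringFunction scoringFunction) {
    node[size] = newNode;
    // First fix: key on size (the slot this node lands in), not on newNode's
    // ordinal, because the sort later looks pending scores up by array position.
    scoringContext.put(size, scoringFunction);
    size++;
  }
}

The second fix is the mirror image: sortInternal now checks only the slot currently being sorted against scoringContext instead of evaluating the whole map.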

Benchmarking Result

To measure any meaningful latency improvement, we first have to create one big index and several small ones; then, once we force an index merge, the index writer will invoke HnswGraphBuilder.initializeFromGraph. KnnGraphTester was modified as follows (a sketch of the sequence appears after the list):

  1. add the first 90% of the documents
  2. iw.commit()
  3. forceMerge into 1 segment
  4. set the merge policy to NoMergePolicy and add the rest of the documents
  5. set the merge policy back to LogDocMergePolicy
  6. forceMerge into 1 segment again <- this step is specifically captured in the benchmark, and I have verified in the logs that initializeFromGraph is called exactly once during it
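A rough sketch of that sequence with stock Lucene APIs (the addDocs helper, the field name, and the numDocs split are placeholders; the real changes live in KnnGraphTester):

import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.FSDirectory;

static void buildAndMerge(int numDocs) throws Exception {
  int split = (int) (numDocs * 0.9);
  try (FSDirectory dir = FSDirectory.open(Paths.get("knn-index"))) {
    try (IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig())) {
      addDocs(iw, 0, split); // 1. first 90% of the documents
      iw.commit();           // 2.
      iw.forceMerge(1);      // 3. one big segment holding one big HNSW graph
    }
    // 4. NoMergePolicy: the remaining docs stay in fresh, small segments.
    try (IndexWriter iw =
        new IndexWriter(dir, new IndexWriterConfig().setMergePolicy(NoMergePolicy.INSTANCE))) {
      addDocs(iw, split, numDocs);
    }
    // 5 + 6. Back to LogDocMergePolicy; this forced merge is the measured step,
    // in which the writer calls HnswGraphBuilder.initializeFromGraph once.
    try (IndexWriter iw =
        new IndexWriter(dir, new IndexWriterConfig().setMergePolicy(new LogDocMergePolicy()))) {
      iw.forceMerge(1);
    }
  }
}

// Placeholder: the real tester indexes documents carrying KNN vector fields.
static void addDocs(IndexWriter iw, int from, int to) throws Exception {
  for (int i = from; i < to; i++) {
    Document doc = new Document();
    doc.add(new KnnFloatVectorField("vector", new float[768])); // dummy vector
    iw.addDocument(doc);
  }
}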

From the benchmark results, we are computing significantly fewer scores with the lazy-evaluation enhancement. However, indexMergeTime did not decrease as expected.
[benchmark results image]

@Jackyrie2 (Contributor Author):
@benwtrent @zhaih please take a look when you get a chance! I couldn't figure out how to update the existing PR, so I created a new one.

@benwtrent (Member) left a comment:

Thank you so much for running the benchmark.

I have two optimizations you should apply. I suggest the following to re-measure:

  • Use realistic vector sizes: test with at least 768 Float32 dimensions so the distance calculation has its typical expense.
  • If performance still isn't improved (with my optimizations and on 768-dim vectors), I suggest utilizing JFR to see where the time is spent; a capture sketch follows this list. It could be that creating and deleting objects and adding/removing from the hashmap washes out any real improvement :/
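For instance, the measured merge step alone can be profiled with the JDK Flight Recorder API (a sketch; it assumes the iw writer from the tester flow above, and jdk.jfr ships with JDK 11+):

import java.nio.file.Path;
import jdk.jfr.Recording;

try (Recording recording = new Recording()) {
  recording.start();
  iw.forceMerge(1); // the step under investigation
  recording.stop();
  // Inspect with `jfr print merge.jfr` or JDK Mission Control to see whether
  // time goes to distance computation or to HashMap/object churn.
  recording.dump(Path.of("merge.jfr"));
}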


public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  scoringContext = new HashMap<>();

@benwtrent (Member):

Pre-allocate to maxSize. We should avoid making the hashmap grow as we add things.
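One way to apply the suggestion (a sketch: HashMap's default load factor is 0.75, so the initial capacity must exceed maxSize / 0.75 to guarantee no rehash while filling it):

public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  // Pre-sized: room for maxSize entries without any resize on the hot path.
  scoringContext = new HashMap<>((int) Math.ceil(maxSize / 0.75f), 0.75f);
}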

@zhaih (Contributor):

We don't actually need a hashmap; I believe an array of length maxSize is more than enough, since we're at most mapping idx to ScoringFunction, right?

@benwtrent (Member):

@zhaih yes! You are correct. Just having an array of maxSize works, and null means there is no scoring function. A sketch follows.
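A sketch of the array-based replacement (field and local names are assumptions):

private final ScoringFunction[] scoringContext; // index = slot in node[]/score[]

public NeighborArray(int maxSize, boolean descOrder) {
  node = new int[maxSize];
  score = new float[maxSize];
  this.scoresDescOrder = descOrder;
  scoringContext = new ScoringFunction[maxSize]; // null entry = nothing deferred
}

// During sorting, plain array indexing replaces every hashmap probe:
ScoringFunction deferred = scoringContext[sortedNodeSize];
if (deferred != null) {
  score[sortedNodeSize] = deferred.calculateScore();
  scoringContext[sortedNodeSize] = null; // computed; clear the slot
}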

Comment on lines 132 to 136
// Check if we need to compute score
if (scoringContext.containsKey(sortedNodeSize)) {
  score[sortedNodeSize] = scoringContext.get(sortedNodeSize).calculateScore();
  scoringContext.remove(sortedNodeSize);
}

@benwtrent (Member):

You should simplify and only probe the map once: Map#remove returns the previous value (or null), so the containsKey/get/remove triple collapses into a single call.

Suggested change
// Check if we need to compute score
if (scoringContext.containsKey(sortedNodeSize)) {
  score[sortedNodeSize] = scoringContext.get(sortedNodeSize).calculateScore();
  scoringContext.remove(sortedNodeSize);
}
// Check if we need to compute score
ScoringFunction maybeScoringFunction = scoringContext.remove(sortedNodeSize);
if (maybeScoringFunction != null) {
  score[sortedNodeSize] = maybeScoringFunction.calculateScore();
}

@zhaih (Contributor) left a comment:

Hi Jack, thanks for doing the benchmark; the count of scoring computations does indicate the optimization is worth pursuing.

To further squeeze out performance and show the benefit, I think we need to optimize the data structure so that it doesn't create too much overhead during the graph build.
I have several suggestions below; let me know what you think!

Also +1 to @benwtrent's suggestion about using 768-dim vectors for further testing.
Thanks again for the hard work!


@@ -111,6 +129,12 @@ public int[] sort() {
private int insertSortedInternal() {

@zhaih (Contributor):

To further optimize memory usage and eliminate potential GC overhead, I would suggest not storing ScoringFunction at all.
In theory, to compute a score we need 3 things: the SimilarityFunction, node_emb_1, and node_emb_2. The node embeddings can be obtained by calling vectors.vectorValue(nodeId) from HnswGraphBuilder, and the SimilarityFunction is also held by HnswGraphBuilder.
That means we don't even need to store the similarityFunction and the two byte arrays beforehand: HnswGraphBuilder can pass in a BiFunction<Integer, Integer, Float> that performs the scoring when we need it, and inside NeighborArray we just need to give the two nodes' ids to that function. That way we don't store extra context inside NeighborArray, and we also avoid holding on to a lot of byte[] for too long. (If you think about it, we're currently holding every byte/float array necessary for score computation until the score is computed or the graph is constructed; that's a huge GC load if your machine doesn't have enough memory.) A sketch follows.
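A sketch of this wiring, assuming the field names quoted elsewhere in the review (vectors, vectorsCopy, similarityFunction) and Float32 vectors:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.BiFunction;

// Built inside HnswGraphBuilder, which already owns everything needed;
// NeighborArray then stores only node ids, never vector data.
BiFunction<Integer, Integer, Float> scorer =
    (node1, node2) -> {
      try {
        return similarityFunction.compare(
            (float[]) vectors.vectorValue(node1),
            (float[]) vectorsCopy.vectorValue(node2));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    };

As the next comments point out, the boxed Integer/Float of BiFunction can be avoided with a primitive-specialized functional interface.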

@benwtrent (Member):

I like this idea.

@zhaih instead of adding a function reference to the NeighborArray#ctor, why not pass the function to the NeighborArray#sort() method?

@benwtrent (Member):

This obviates the need for any additional structures: if score == -1, calculate via the method provided to sort(). Though it probably won't be a BiFunction<Integer, Integer, Float> but instead a Function<Integer, Float>, as only the caller will know which is the "current" node for the neighbor array. Or even better, a custom functional interface that works on native values, int -> float; this way we don't box the integers and scoring results unnecessarily.

@zhaih (Contributor) commented Aug 1, 2023:

> why not pass the function to the NeighborArray#sort() method?

Yeah, +1.

> This obviates the need for any additional structures, if score==-1 calculate via the method provided to sort().

I have one concern with this: -1 might be a valid score itself. But we can always keep an additional integer, like noScoreIndex or so, to indicate up to which index we haven't calculated the score. (Because those nodes will always sit at the front of the score array.)

> Though it probably won't be BiFunction<Integer, Integer, Float> but instead a Function<Integer, Float> as only the caller will know which is the "current" node for the neighbor array. Or even better, a custom functional interface that works on native values int -> float

+1, thanks for thinking it through in more detail!

@benwtrent (Member):

Score "NaN" is probably best to cover our bases :)

@zhaih (Contributor):

Oh right, I forgot Java has NaN; yeah, then I guess NaN will be best. A sketch combining these ideas follows.
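Putting the thread's conclusions together, a sketch (the interface name and helper are hypothetical): a primitive int -> float functional interface handed to sort(), with NaN as the "not yet scored" sentinel, since -1 could be a legitimate score.

import java.io.IOException;

@FunctionalInterface
interface NeighborScorer {
  // Native int -> float: no Integer/Float boxing on the scoring hot path.
  float score(int neighborNode) throws IOException;
}

// Inside NeighborArray: resolve deferred scores before the actual sorting.
// Deferred (NaN-scored) entries always sit at the front of the array.
private void computeDeferredScores(NeighborScorer scorer) throws IOException {
  for (int i = 0; i < size; i++) {
    if (Float.isNaN(score[i])) {
      score[i] = scorer.score(node[i]);
    }
  }
}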

@Jackyrie2 (Contributor Author):

This is a wonderful thread, and I am learning a lot! Let me take y'all's suggestions and see how much more performance we can gain. Again, I really appreciate your help!

@zhaih linked an issue Aug 1, 2023 that may be closed by this pull request
@Jackyrie2 (Contributor Author) commented:
Here is a quick re-run of the benchmark (100-dim vectors) on the optimized code, with a 90% - 10% split on document addition:

Baseline -> old candidate -> optimized candidate:
  • 253,234,833 -> 265,153,500 -> 246,373,375
  • 3,845,039,583 -> 4,338,130,875 -> 3,674,674,583
  • 10,131,094,959 -> 9,869,441,083 -> 9,569,986,875

I will run the benchmark on 768 dim vectors later.

@Jackyrie2 (Contributor Author) commented:

Benchmark with 768-dimension vectors

[screenshot: benchmark results with 768-dimension vectors, 2023-08-07]

@zhaih (Contributor) commented Aug 7, 2023:

Thank you @Jackyrie2 for all the benchmarking! We have already made it lightweight enough (it requires no extra memory usage in NeighborArray), and the benchmark has shown mostly positive results. (Benchmarks do have quite a lot of noise, especially when run on a local machine; if you want, maybe write a simple bash loop, run the evaluation 3 - 5 times, and use the average.) I would say this PR is a good optimization.

But I think it does need some more documentation, as NeighborArray is already complex enough, so I will try to carefully review it again soon.

Thank you!

@zhaih (Contributor) left a comment:

I left some small comments; overall it looks good to me!

float[] vectorValue = null;
byte[] binaryValue = null;
switch (this.vectorEncoding) {
  case FLOAT32 -> vectorValue = (float[]) vectors.vectorValue(nodeOrd);

@zhaih (Contributor):

Let's move this part outside of the lambda to reduce the number of times we call vectorValue; this operation involves some seek-and-parse work on off-heap memory. A hoisting sketch follows.
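A sketch of the hoisting (types and field names come from the quoted hunk; NeighborScorer is the hypothetical interface sketched earlier, and the enclosing method is assumed to declare throws IOException): the current node's vector is resolved once, before the lambda is built, so each deferred score pays only one vectorValue call, for the neighbor.

float[] vectorValue = null;
byte[] binaryValue = null;
switch (this.vectorEncoding) {
  case FLOAT32 -> vectorValue = (float[]) vectors.vectorValue(nodeOrd);
  case BYTE -> binaryValue = (byte[]) vectors.vectorValue(nodeOrd);
}
// Effectively-final copies so the lambda can capture them.
final float[] currentVector = vectorValue;
final byte[] currentBinary = binaryValue;
NeighborScorer scorer =
    newNeighbor ->
        switch (this.vectorEncoding) {
          case FLOAT32 -> this.similarityFunction.compare(
              currentVector, (float[]) vectorsCopy.vectorValue(newNeighbor));
          case BYTE -> this.similarityFunction.compare(
              currentBinary, (byte[]) vectorsCopy.vectorValue(newNeighbor));
        };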

  case BYTE -> this.similarityFunction.compare(
      binaryValue, (byte[]) vectorsCopy.vectorValue(newNeighbor));
};
// float score =

@zhaih (Contributor):

Let's remove this commented-out code?


if (Float.isNaN(tmpScore)) {
  tmpScore = scoringFunction.computeScore(tmpNode);
  System.out.println("Node: " + tmpNode + " Score: " + tmpScore);

@zhaih (Contributor):

Remove this debug print?

@zhaih (Contributor) commented Aug 22, 2023:

Ah sorry, I almost forgot: don't forget to create a CHANGES.txt entry :)

@zhaih (Contributor) left a comment:

LGTM. I don't think there's a need to make it Lucene 10 only? (We're not breaking any APIs.)

@@ -90,6 +90,8 @@ Optimizations

* GITHUB#12408: Lazy initialization improvements for Facets implementations when there are segments with no hits
to count. (Greg Miller)

* GITHUB##12371: enable lazy computation of similarity score during initializeFromGraph (Jack Wang)

@zhaih (Contributor):

Put it under Lucene 9.8?

vectorValue, (float[]) vectorsCopy.vectorValue(newNeighbor));
case BYTE -> this.similarityFunction.compare(
binaryValue, (byte[]) vectorsCopy.vectorValue(newNeighbor));
};
// we are not sure whether the previous graph contains

@zhaih (Contributor):

Update this comment to something like: "We will compute those scores lazily when we need to pop out the non-diverse nodes"?

@zhaih (Contributor) commented Sep 1, 2023:

@Jackyrie2 It seems the precommit fails?

@zhaih merged commit 9fd45e3 into apache:main Sep 1, 2023
4 checks passed
@zhaih (Contributor) commented Sep 1, 2023:

Merged and backported

Successfully merging this pull request may close these issues:

  • Lazily compute similarity score when reuse the old HNSW graph