Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduced the Word2VecSynonymFilter #12169

Merged
merged 6 commits into from
Apr 24, 2023

Conversation

dantuzi
Copy link
Contributor

@dantuzi dantuzi commented Feb 24, 2023

If you want to expand your query/documents with synonyms in Apache Lucene, you need a predefined file containing the list of terms that share the same semantics.
It's not always easy to find a list of basic synonyms for a language and, even if you find it, this doesn’t necessarily match your contextual domain.
The term "daemon" in the domain of operating system articles is not a synonym of "devil" but it's closer to the term "process".

Word2Vec is a two-layer neural network that takes as input a text and outputs a vector representation for each word in the dictionary.
Two words with similar meanings are identified with two vectors close to each other.

This contribution integrates this technique with the text analysis pipeline. It automatically generates synonyms on the fly from a Word2Vec model generated using the library DL4J.
Please see our presentation at the Berlin Buzzwords conference: https://pretalx.com/bbuzz22/talk/UYZAUX/

For the reviewer:

Almost all contribution consists of new code

Issue

Word2Vec model to generate synonyms on the fly

@dantuzi dantuzi marked this pull request as draft February 24, 2023 08:38
@alessandrobenedetti alessandrobenedetti linked an issue Feb 24, 2023 that may be closed by this pull request
@dantuzi dantuzi marked this pull request as ready for review February 24, 2023 08:40
Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boostattribute must not be used in this way. Its documentation explicitly says in bold not to do this.

It also makes me ask the question, what happens when this filter is run at index-time?

@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
flagsAtt = ts.getAttribute(FlagsAttribute.class);
}

BoostAttribute boostAtt = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BoostAttribute is not for use in tokenstreams. it is for multitermquery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rmuir, thank you for notice it.
Actually the BoostAttribute is already used in the DelimitedBoostTokenFilter:

private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);

How about if we do some refactoring (and update the documentation) to use the BoostAttribute in both cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, wrong wrong. Remove any use of boostattribute outside of MultiTermQuery. It documents not to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to update my PR removing the BoostAttribute as you suggest

@tteofili
Copy link
Contributor

will have a look in the next few days 🙇 @alessandrobenedetti

@dantuzi
Copy link
Contributor Author

dantuzi commented Mar 1, 2023

@rmuir we did some tests at both query and index time.
We tried to index some documents using the following CustomAnalyzer which includes our Word2VecSynonymFilter and we verified the content of the index using Luke.

Analyzer analyzer = CustomAnalyzer.builder()
      .withTokenizer(ClassicTokenizerFactory.NAME)
      .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip")
      .addTokenFilter(FlattenGraphFilterFactory.NAME)
      .addTokenFilter(LowerCaseFilterFactory.NAME)
      .build();

Please find some screenshots in our presentation from slide 32
https://www.slideshare.net/SeaseLtd/word2vec-model-to-generate-synonyms-on-the-fly-in-apache-lucenepdf

@rmuir
Copy link
Member

rmuir commented Mar 1, 2023

I think you misunderstand the question. What happens to BoostAttribute at index-time? absolutely nothing.

@rmuir
Copy link
Member

rmuir commented Mar 1, 2023

From what I can tell, this probably shouldnt be an analyzer at all. Seems it only works at query-time and will simply do the wrong thing at index-time. The attempted boost manipulation by an analyzer is what gives you the hint that the design is wrong.

I think what it is doing is more like query expansion, where it formulates query directly. Then there is no need to mess around with any boostattribute or anything like that.

@alessandrobenedetti
Copy link
Contributor

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

Regarding the "should this be a token filter or something else", I don't see any particular difference with the regular Synonym token filter ( it's just a difference way of providing synonyms from a machine learned model rather than the explicit file).
The boost side of things (which should be optional) is designed for the query time usage only (if I remember correctly we don't have any kind of indexing time boost anymore nor there was an index time per term boost at all).
Not much difference from the weighted synonyms approach.

@dantuzi
Copy link
Contributor Author

dantuzi commented Mar 14, 2023

@rmuir we spent some time analyzing the code after your comments.

Index Time

At index time the BoostAttribute is ignored as you said but, it looks like we "waste" 4 bytes to store a float and an assignment for each term

Query Time

At query time, everything works as expected because in this PR, @alessandrobenedetti, @dsmiley, and @romseygeek agreed to use BoostAttribute as a container for a float boost to be used at query time.

That being said, independently of the Javadoc (that doesn't really explain why a public class can not be used), is there any concrete issue with using the BoostAttribute?

@mkhludnev
Copy link
Member

For matter of fact, PositionLengthAttribute is also ignored by indexer.

@rmuir
Copy link
Member

rmuir commented Mar 14, 2023

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

The problem is how attributes work internally. that's why fuzzy's internal boostattribute is public. It is really important that this only be used by fuzzyquery, it is only for multitermquery. That's why it is documented as such.

@rmuir
Copy link
Member

rmuir commented Mar 14, 2023

Here is what the documentation says:

/**
 * Add this {@link Attribute} to a {@link TermsEnum} returned by {@link
 * MultiTermQuery#getTermsEnum(Terms,AttributeSource)} and update the boost on each returned term.
 * This enables to control the boost factor for each matching term in {@link
 * MultiTermQuery#SCORING_BOOLEAN_REWRITE} or {@link TopTermsRewrite} mode. {@link FuzzyQuery} is
 * using this to take the edit distance into account.
 *
 * <p><b>Please note:</b> This attribute is intended to be added only by the TermsEnum to itself in
 * its constructor and consumed by the {@link MultiTermQuery.RewriteMethod}.
 *
 * @lucene.internal
 */

I am not going to leave the discussion for another time. This needs to be fixed.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 14, 2023

@rmuir Why is it important that BoostAttribute's usage be constrained to only MultiTermQuery?

@alessandrobenedetti
Copy link
Contributor

alessandrobenedetti commented Mar 14, 2023

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

The problem is how attributes work internally. that's why fuzzy's internal boostattribute is public. It is really important that this only be used by fuzzyquery, it is only for multitermquery. That's why it is documented as such.

I am not going to leave the discussion for another time. This needs to be fixed.

Hi @rmuir, first of all, I deeply appreciate the time you are taking to help us on this issue, thank you for that.
When I said "'ll leave the BoostAttribute discussion for another time" I meant today (I was on holiday last week and I had the chance to look at the code only today with @dantuzi and @ilariapet .
I appreciate the javaDoc comments, but as also @dsmiley mentioned, I would like to understand practical reasons to avoid the usage of the BoostAttribute in other areas.
It seems to me just a simple wrapper of a float value.
Having a JavaDoc comment to restrict the usage of a public class is not ideal, but if you can elaborate we are more than happy to improve the contribution.

@alessandrobenedetti
Copy link
Contributor

Just to let you know that if we don't get any concrete motivation on why the BoostAttribute should not be used, I'll proceed with an in-depth review and merge in the next few days.
Any additional review and comment is hugely welcome!

*
* @lucene.experimental
*/
public class Dl4jModelReader implements Word2VecModelReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of context, I worked with @dantuzi on deciding if implementing a custom reader(this one) or using dl4j as an imported library.
Multiple attempts were done to include dl4j as a library in Lucene, but the effort and impact was not worth it so we reverted to a simple custom reader class.

There are downsides for this of course, ma it's much more lightweight

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to avoid dependencies whenever practical

}

@Override
public RandomAccessVectorValues<float[]> copy() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this copy is meant to do a deep copy, I suspect we'll need to handle it differently, I am not sure it's copying the internal elements, but reusing them?
So a copy could end up adding elements to data structures used by the original Object?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not need to do a deep copy; the purpose of this method is to enable multiple concurrent accesses to the underlying data. Since this implementation doesn't have any temporary variable into which vectors are decoded (which could be overwritten), I think it's safe to simply return this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov I tried to implement your suggestion but it looks like the method HnswGraphBuilder::build doesn't want the same reference passed to the HnswGraphBuilder.create. [1]
To be honest I still don't understand why this check [2] is required

[1]

Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()
java.lang.IllegalArgumentException: Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()
	at __randomizedtesting.SeedInfo.seed([994075DD4398F0A4:E100BB05917EA0E6]:0)
	at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.HnswGraphBuilder.build(HnswGraphBuilder.java:165)
	at org.apache.lucene.analysis.synonym.word2vec.Word2VecSynonymProvider.<init>(Word2VecSynonymProvider.java:64)
	at org.apache.lucene.analysis.synonym.word2vec.TestWord2VecSynonymProvider.<init>(TestWord2VecSynonymProvider.java:39)

[2]

if (vectorsToAdd == this.vectors) {
throw new IllegalArgumentException(
"Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()");
}

Copy link
Contributor

@alessandrobenedetti alessandrobenedetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contribution is clean, simple to read, and very well-tested.
I left some minor considerations, that shoud be addressed.

I'll wait for some additional review and potential elaboration on the BoostAttribute and then merge.
This is just a first step for a new feature, so some future iterations are expected (for example I foresee the optional use of similarity as boost, the support for multi term synonyms, other models, ect

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this nice contribution! I think it needs a bit of work, and it would be helpful to explain the benefit somewhere. It's possible I missed some discussion about that sorry. Update: OK I see you have a summary in this PR :) I added some suggestions how to tighten up some implementations and a recommendation to eliminate some of the abstraction layers. I am traveling and might take a few days to respond -- please be patient, sorry!

*
* @lucene.experimental
*/
public class Dl4jModelReader implements Word2VecModelReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to avoid dependencies whenever practical

int vectorDimension = Integer.parseInt(headerValues[1]);

Word2VecModel model = new Word2VecModel(dictionarySize, vectorDimension);
reader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me this would read much clearer using a traditional while loop:

while ((line = reader.readLine()) != null) ...


private static final String MODEL_FILE_NAME_PREFIX = "syn0";

private final String word2vecModelFilePath;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pass in the path when it is only used in toString()? Can we choose between accepting a java.nio.Path that does its own open/close of the zip file and an (anonymous) InputStream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything comes from the Word2VecSynonymFilterFactory that implements ResourceLoaderAware. This interface provides us a org.apache.lucene.util.ResourceLoader and the possibility to obtain an anonymous InputStream.
I decided to pass also the model file path to enrich the Exception message and make the user's life easier.
BTW I don't have a strong opinion about this. I can easily remove that string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comment by removing the string word2vecModelFilePath and changing the exception message

return model;
}
}
throw new UnsupportedEncodingException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this exception is really intended for use with character set encodings only. Maybe IllegalArgumentException would fit better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use the library DL4J to train a model and we export it, we obtain a compressed zip file.
This zip contains multiple files but we are only interested in file syn0. The exception is thrown if the passed zip does not contain any syn0 file.
I guess IllegalArgumentException would fit

*
* @lucene.experimental
*/
public interface Word2VecModelReader extends Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have more than one implementation? I don't think this interface is necessary. Later we can always add it if we have multiple implementations and need to abstract. For now it's just extra stuff to maintain

TermAndBoost synonym = synonymBuffer.pollFirst();
clearAttributes();
restoreState(this.lastState);
termAtt.setEmpty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be cleaner if we used BytesTermAttribute

Copy link
Contributor Author

@dantuzi dantuzi Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea, I tried your suggestion and it is much cleaner but it doesn't pass the unit tests because the BytesTermAttribute contains a null ByteRef. I don't have much more time now to dedicate to this contribution. Feel free to change it once merged

* SynonymProvider constructor
*
* @param term we want to find the synonyms
* @param maxSynonymsPerTerm maximum number of result returned by the synonym search
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that we need this interface if its only use is in this one Dl4jWord2VecSynonymFilter. Can't we simply refer directly to the implementing class?

}

public BytesRef binaryValue(int targetOrd) throws IOException {
return data[targetOrd].getTerm();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect - the purpose of this method is to return some bytes representing the vector value. I think instead you ought to simply throw UnsupportedOperationException since this implementation will never be used in an indexing context where this method is required.

Copy link
Contributor Author

@dantuzi dantuzi Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see, this method is not @Override so this is not an implementation of RandomAccessVectorValues interface. This is a custom method used in our implementation

import org.apache.lucene.util.TermAndBoost;

/**
* Applies single-token synonyms from a Word2Vec trained network to an incoming {@link TokenStream}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reference to word2vec in here? We just use this abstract SynonymFilter that could conceivably be implemented using a thesaurus. That makes this more powerful than I think we are claiming it to be. Again, let's simply use the concrete classes throughout and call the classes Dl4jWord2VecXXX

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you suggested, I'm replacing the general SynonymFilter interface with the "concrete" corresponding class

@rmuir
Copy link
Member

rmuir commented Mar 22, 2023

Hi @rmuir, first of all, I deeply appreciate the time you are taking to help us on this issue, thank you for that.
When I said "'ll leave the BoostAttribute discussion for another time" I meant today (I was on holiday last week and I had the chance to look at the code only today with @dantuzi and @ilariapet .
I appreciate the javaDoc comments, but as also @dsmiley mentioned, I would like to understand practical reasons to avoid the usage of the BoostAttribute in other areas.
It seems to me just a simple wrapper of a float value.
Having a JavaDoc comment to restrict the usage of a public class is not ideal, but if you can elaborate we are more than happy to improve the contribution.

The class is public due to the way attributes are implemented. AttributesImpl classes must be public, there's no way to make them package private! Does this make sense? It is in the search package for the fuzzy query to use for this reason, with lots of warnings in its documentation about the usage.

The attribute is needed to manage the priority queue of fuzzy query's rewrite method. Because of this, there are some odd things about its behavior (e.g. allowing negative boosts). It is really not suitable for usage anywhere else in the system.

please address my concern and don't ignore it. i'm -1 on this PR being merged.

@alessandrobenedetti
Copy link
Contributor

Hi @rmuir, we definitely don't want to ignore improvement recommendations, rest assured.
Sorry if I am pedantic, I just want to understand why it shouldn't be used for additional things(on top of what it was originally designed to).
Do you have more info on the "odd things about its behavior"?
Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.
Thank you again for your time, it's genuinely much appreciated

@rmuir
Copy link
Member

rmuir commented Mar 22, 2023

Hi @rmuir, we definitely don't want to ignore improvement recommendations, rest assured.
Sorry if I am pedantic, I just want to understand why it shouldn't be used for additional things(on top of what it was originally designed to).
Do you have more info on the "odd things about its behavior"?
Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.
Thank you again for your time, it's genuinely much appreciated

The class documents that it should not be used for this purpose. I honestly feel the documentation is good: the problem is you didn't read it. Please don't make me repeat myself here.

To be clear, i'm -1 on the issue and this constitutes a veto. it can't be ignored, just fix the issue instead of merging the PR.

@rmuir
Copy link
Member

rmuir commented Mar 22, 2023

Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.

You don't understand. Some day, someone may want to add such safety as a check to the attribute, but they won't be able to because fuzzy query relies on such things. That's a great big reason this attribute should not be used for other purposes!

It was created to be applied to TermsEnum and not a tokenstream at all. By using an attribute on TermsEnum, we are able to handle this corner-case needed to accelerate top-N selection of fuzzyquery, without adding unnecessary bloat to TermsEnum itself.

@romseygeek
Copy link
Contributor

Would your worries be assuaged if we created a separate QueryBoostAttribute class @rmuir? Then QueryBuilder can use that, and we can add checks for negative boosts, javadocs that say it should only be used for query time analysis, etc.

@rmuir
Copy link
Member

rmuir commented Mar 22, 2023

Would your worries be assuaged if we created a separate QueryBoostAttribute class @rmuir? Then QueryBuilder can use that, and we can add checks for negative boosts, javadocs that say it should only be used for query time analysis, etc.

It would certainly be a LOT less offensive than using FuzzyQuery's BoostAtt?

But I still don't think analysis pipeline should be involved in this way, in query formulation. It is confusing in many ways, especially with boosts that will get silently dropped at index-time, confusing users. So I'd really rather us avoid analysis chain doing this stuff. it is mixing separate concerns in a very bad way: analysis chain is supposed to be about tokenization and boosts are supposed to be in the domain of higher level query parsers/MLT as it represents query intent (and also makes no sense at index time). See my comment above about whether this should even be an analyzer :)

I really see it closer to "query expansion" (e.g. closer in functionality to MLT).

@alessandrobenedetti
Copy link
Contributor

Ok, let me rephrase my question then:
Let's assume:

  • we don't care why that class was originally created and its JavaDocs comments
  • we don't intend to modify it

Quoting @dsmiley "Why is it important that BoostAttribute's usage be constrained to only MultiTermQuery?" (from the code is just containing a float payload).
What makes the BoostAttribute different from the org.apache.lucene.analysis.tokenattributes.PayloadAttribute for example? (aside from the different types of payload BytesRef vs float).
Let's just give a constructive explanation if you know, rather than citing again that JavaDoc :)

N.B. I am not a strenuous defender of using that class, to be clear, I want to do best in terms of readability and quality for the project, I just want my review to improve the work of the original contributors of this patch!

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nice addition to Lucene, thanks @alessandrobenedetti .
I left one comment regarding the multi-thread aspect. I don't think it works currently and that could lead to weird behaviour/bugs. The OnHeapHnswGraph does not handle concurrency so it's not possible to use a single instance with multiple threads. It would require some changes in the APIs to make it possible unless I missed something?

@dantuzi
Copy link
Contributor Author

dantuzi commented Apr 24, 2023

@romseygeek, @alessandrobenedetti and me found the bug and we are fixing it right now

dantuzi added a commit to SeaseLtd/lucene that referenced this pull request Apr 24, 2023
asfgit pushed a commit that referenced this pull request Apr 24, 2023
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
@dantuzi
Copy link
Contributor Author

dantuzi commented Apr 24, 2023

@romseygeek We pushed the fix to the main branch and we cherry-pick it to the branch 9.x.

@jimczi Thank you for your comments and suggestions, for the time being, we don't have enough capacity to work again on this contribution. Feel free to publish a PR with the improvement you think are useful and we will be happy to review it.

@jimczi
Copy link
Contributor

jimczi commented Apr 24, 2023

for the time being, we don't have enough capacity to work again on this contribution

It's not a suggestion it's a bug ;). As it is my suggestion would be to revert this change and to work on a fix. If I am right, this filter is not usable at all and would break in weird ways so please don't ignore it.

@msokolov
Copy link
Contributor

Thanks for calling this out @jimczi. Yeah, that's true -- the javadoc for HnswGraph is incorrect. It says " The graph may be searched by multiple threads concurrently, but updates are not thread-safe." but it this class doesn't actually provide any searching methods, only some methods for iteration that modify internal state, which are clearly not thread-safe.

We might be able to clean up the API to make it clearer what's going on; multiple people seem to be getting confused by this. But I agree this change as it stands is not usable.

Adding a clone() / copy() method to HnswGraph seems like a fairly unobtrusive way to support this. Javadoc definitely needs to be clearer.

@alessandrobenedetti
Copy link
Contributor

Hi @jimczi , @msokolov , I don't think it was @dantuzi intention to ignore you, I think he just misread your comment as a suggestion of improvement rather than a blocker.

I had Daniele and Ilaria extensively testing and benchmarking the contribution in the last few months, we left the PR open for various months and we thought it was in a good state to merge (being a new feature, not impacting existing works).

I really appreciate you spending the time to find this problem and we'll take a look in detail in 2 days (Wednesday 26th).
Given that, and not entering the technical merit until the 26th (I don't have the time right now to process your insights), is it worth reverting or we can simply fix it/commit a workaround as soon as we find it in the next few days?

@jimczi
Copy link
Contributor

jimczi commented Apr 25, 2023

Adding a clone() / copy() method to HnswGraph seems like a fairly unobtrusive way to support this. Javadoc definitely needs to be clearer.

I looked at this and I am a bit reluctant to add this method since the OnHeapHnswGraph is mutable. That would be confusing and could lead to the same issue we have in this PR. I'd say that in this current form the HnswGraph is an internal class that shouldn't be used outside of the indexing primitives.

I am also curious to hear what's the expectation for this filter. I checked some resources created with word2vec or GloVe and they output a large vocabulary size. With a single thread, building a graph of 1M vectors with hundreds of dimensions would take several minutes (16 minutes at 1000 docs/s). That seems overkill for an application to rebuild this graph on every restart. Could we instead build the resource offline (the hnsw graph) and just load it when the filter starts?

is it worth reverting or we can simply fix it/commit a workaround as soon as we find it in the next few days?

I don't have a strong feeling here but I am not sure we should add this filter in the common analysis module. As it stands it feels more experimental so the sandbox might be more appropriate?

@romseygeek
Copy link
Contributor

I'm planning on cutting a release branch for 9x tomorrow. Given the issues that Jim has identified, I think it's worth reverting at least from 9x?

asfgit pushed a commit that referenced this pull request Apr 25, 2023
asfgit pushed a commit that referenced this pull request Apr 25, 2023
@alessandrobenedetti
Copy link
Contributor

I agree @romseygeek and done that, we'll think about the other issues tomorrow on main!

SIMILARITY_FUNCTION,
hnswGraph,
null,
word2VecModel.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting this limit will actually cause inaccurate result, since this limit is actually limit on "total number of nodes you visited on all levels" so when you reaches the bottom level you're not guaranteed to be able to visit every possible node. (see update of the limit here and reset of visited nodes for each level here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I get it, can you elaborate a bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are referring to line 88 and 'word2VecModel.size()' @dantuzi

dantuzi added a commit to SeaseLtd/lucene that referenced this pull request Apr 27, 2023
This is a quick mitigation to avoid unexpected behaviour if multiple queries
are executed concurrently.

This commit wants to address some comments received in the PR apache#12169
dantuzi added a commit to SeaseLtd/lucene that referenced this pull request Apr 27, 2023
This is a quick mitigation to avoid unexpected behaviour if multiple queries
are executed concurrently.

This commit wants to address some comments received in the PR apache#12169
asfgit pushed a commit that referenced this pull request May 30, 2023
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
jpountz added a commit to jpountz/lucene that referenced this pull request Jun 13, 2023
PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
jpountz added a commit that referenced this pull request Jun 14, 2023
PR #12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
jpountz added a commit that referenced this pull request Jun 14, 2023
PR #12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request Jul 18, 2023
…dc8ca633e8bcf`) (#20)

* Add next minor version 9.7.0

* Fix SynonymQuery equals implementation (apache#12260)

The term member of TermAndBoost used to be a Term instance and became a
BytesRef with apache#11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.

* Fix MMapDirectory documentation for Java 20 (apache#12265)

* Don't generate stacktrace in CollectionTerminatedException (apache#12270)

CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.

* add missing changelog entry for apache#12260

* Add missing author to changelog entry for apache#12220

* Make query timeout members final in ExitableDirectoryReader (apache#12274)

There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.

* Update javadocs for QueryTimeout (apache#12272)

QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs needs updating.

* Make TimeExceededException members final (apache#12271)

TimeExceededException has three members that are set within its constructor and never modified. They can be made final.

* DOAP changes for release 9.6.0

* Add back-compat indices for 9.6.0

* `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283)

* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>

* Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285)

The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.

* [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197)

* GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of apache#11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correct delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: apache#11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288)

* Concurrent rewrite for KnnVectorQuery (apache#12160)


- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* adjusting for backport

---------

Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* toposort use iterator to avoid stackoverflow (apache#12286)

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt

* Fix test to compile with Java 11 after backport of apache#12286

* Update Javadoc for topoSortStates method after apache#12286 (apache#12292)

* Optimize HNSW diversity calculation (apache#12235)

* Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305)

* GITHUB-12291: Skip blank lines from stopwords list. (apache#12299)

* Wrap Query rewrite backwards layer with AccessController (apache#12308)

* Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315)

* Add multi-thread searchability to OnHeapHnswGraph (apache#12257)

* Fix backport error

* [MINOR] Update javadoc in Query class (apache#12233)

- add a few missing full stops
- update wording in the description of Query#equals method

* [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327)

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>

* Parallelize knn query rewrite across slices rather than segments (apache#12325)

The concurrent query rewrite for knn vectory query introduced with apache#12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.

* Optimize ConjunctionDISI.createConjunction (apache#12328)

This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.

* Update changes to be correct with ARM (it is called NEON there)

* GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332)

Preparing to reduce visibility of this class in a future release

* add BitSet.clear() (apache#12268)

# Conflicts:
#	lucene/CHANGES.txt

* Clenaup and update changes and synchronize with 9.x

* Update TestVectorUtilProviders.java (apache#12338)

* Don't generate stacktrace for TimeExceededException (apache#12335)

The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.

* Introduced the Word2VecSynonymFilter (apache#12169)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* Word2VecSynonymFilter constructor null check (apache#12169)

* Use thread-safe search version of HnswGraphSearcher (apache#12246)

Addressing comment received in the PR apache#12246

* Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235)
We observed this change was not ported previously from main in an old cherry-pick

* Fix searchafter high latency when after value is out of range for segment (apache#12334)

* Make memory fence in `ByteBufferGuard` explicit (apache#12290)

* Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320)

* Add updateDocuments API which accept a query (reopen) (apache#12346)

* GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo

This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions

* [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353)

# Conflicts:
#	lucene/CHANGES.txt

* feat: soft delete optimize (apache#12339)

* Better paging when random reads go backwards (apache#12357)

When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.

Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)
Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)


---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.

* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Word2Vec model to generate synonyms on the fly
10 participants