Introduced the Word2VecSynonymFilter #12169

dantuzi · 2023-02-24T08:35:39Z

If you want to expand your query/documents with synonyms in Apache Lucene, you need a predefined file containing the list of terms that share the same semantics.
It's not always easy to find a list of basic synonyms for a language and, even if you find it, this doesn’t necessarily match your contextual domain.
The term "daemon" in the domain of operating system articles is not a synonym of "devil" but it's closer to the term "process".

Word2Vec is a two-layer neural network that takes as input a text and outputs a vector representation for each word in the dictionary.
Two words with similar meanings are identified with two vectors close to each other.

This contribution integrates this technique with the text analysis pipeline. It automatically generates synonyms on the fly from a Word2Vec model generated using the library DL4J.
Please see our presentation at the Berlin Buzzwords conference: https://pretalx.com/bbuzz22/talk/UYZAUX/

For the reviewer:

Almost all contribution consists of new code

Issue

Word2Vec model to generate synonyms on the fly

rmuir

The boostattribute must not be used in this way. Its documentation explicitly says in bold not to do this.

It also makes me ask the question, what happens when this filter is run at index-time?

rmuir · 2023-02-24T11:10:52Z

lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java

@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
      flagsAtt = ts.getAttribute(FlagsAttribute.class);
    }

+    BoostAttribute boostAtt = null;


BoostAttribute is not for use in tokenstreams. it is for multitermquery.

Hi @rmuir, thank you for notice it.
Actually the BoostAttribute is already used in the DelimitedBoostTokenFilter:

lucene/lucene/analysis/common/src/java/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilter.java

Line 39 in 475fbd0

private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);

How about if we do some refactoring (and update the documentation) to use the BoostAttribute in both cases?

This is wrong, wrong wrong. Remove any use of boostattribute outside of MultiTermQuery. It documents not to do this.

I'm going to update my PR removing the BoostAttribute as you suggest

tteofili · 2023-02-27T11:31:45Z

will have a look in the next few days 🙇 @alessandrobenedetti

dantuzi · 2023-03-01T10:17:23Z

@rmuir we did some tests at both query and index time.
We tried to index some documents using the following CustomAnalyzer which includes our Word2VecSynonymFilter and we verified the content of the index using Luke.

Analyzer analyzer = CustomAnalyzer.builder()
      .withTokenizer(ClassicTokenizerFactory.NAME)
      .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip")
      .addTokenFilter(FlattenGraphFilterFactory.NAME)
      .addTokenFilter(LowerCaseFilterFactory.NAME)
      .build();

Please find some screenshots in our presentation from slide 32
https://www.slideshare.net/SeaseLtd/word2vec-model-to-generate-synonyms-on-the-fly-in-apache-lucenepdf

rmuir · 2023-03-01T11:13:24Z

I think you misunderstand the question. What happens to BoostAttribute at index-time? absolutely nothing.

rmuir · 2023-03-01T11:18:12Z

From what I can tell, this probably shouldnt be an analyzer at all. Seems it only works at query-time and will simply do the wrong thing at index-time. The attempted boost manipulation by an analyzer is what gives you the hint that the design is wrong.

I think what it is doing is more like query expansion, where it formulates query directly. Then there is no need to mess around with any boostattribute or anything like that.

alessandrobenedetti · 2023-03-10T13:03:09Z

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

Regarding the "should this be a token filter or something else", I don't see any particular difference with the regular Synonym token filter ( it's just a difference way of providing synonyms from a machine learned model rather than the explicit file).
The boost side of things (which should be optional) is designed for the query time usage only (if I remember correctly we don't have any kind of indexing time boost anymore nor there was an index time per term boost at all).
Not much difference from the weighted synonyms approach.

dantuzi · 2023-03-14T10:29:07Z

@rmuir we spent some time analyzing the code after your comments.

Index Time

At index time the BoostAttribute is ignored as you said but, it looks like we "waste" 4 bytes to store a float and an assignment for each term

Query Time

At query time, everything works as expected because in this PR, @alessandrobenedetti, @dsmiley, and @romseygeek agreed to use BoostAttribute as a container for a float boost to be used at query time.

That being said, independently of the Javadoc (that doesn't really explain why a public class can not be used), is there any concrete issue with using the BoostAttribute?

mkhludnev · 2023-03-14T10:51:11Z

For matter of fact, PositionLengthAttribute is also ignored by indexer.

rmuir · 2023-03-14T11:42:51Z

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

The problem is how attributes work internally. that's why fuzzy's internal boostattribute is public. It is really important that this only be used by fuzzyquery, it is only for multitermquery. That's why it is documented as such.

rmuir · 2023-03-14T11:46:29Z

Here is what the documentation says:

/**
 * Add this {@link Attribute} to a {@link TermsEnum} returned by {@link
 * MultiTermQuery#getTermsEnum(Terms,AttributeSource)} and update the boost on each returned term.
 * This enables to control the boost factor for each matching term in {@link
 * MultiTermQuery#SCORING_BOOLEAN_REWRITE} or {@link TopTermsRewrite} mode. {@link FuzzyQuery} is
 * using this to take the edit distance into account.
 *
 * <p><b>Please note:</b> This attribute is intended to be added only by the TermsEnum to itself in
 * its constructor and consumed by the {@link MultiTermQuery.RewriteMethod}.
 *
 * @lucene.internal
 */

I am not going to leave the discussion for another time. This needs to be fixed.

dsmiley · 2023-03-14T16:28:11Z

@rmuir Why is it important that BoostAttribute's usage be constrained to only MultiTermQuery?

alessandrobenedetti · 2023-03-14T17:03:09Z

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but javaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used, it smells bad practice, quite frankly.

The problem is how attributes work internally. that's why fuzzy's internal boostattribute is public. It is really important that this only be used by fuzzyquery, it is only for multitermquery. That's why it is documented as such.

I am not going to leave the discussion for another time. This needs to be fixed.

Hi @rmuir, first of all, I deeply appreciate the time you are taking to help us on this issue, thank you for that.
When I said "'ll leave the BoostAttribute discussion for another time" I meant today (I was on holiday last week and I had the chance to look at the code only today with @dantuzi and @ilariapet .
I appreciate the javaDoc comments, but as also @dsmiley mentioned, I would like to understand practical reasons to avoid the usage of the BoostAttribute in other areas.
It seems to me just a simple wrapper of a float value.
Having a JavaDoc comment to restrict the usage of a public class is not ideal, but if you can elaborate we are more than happy to improve the contribution.

alessandrobenedetti · 2023-03-21T09:55:09Z

Just to let you know that if we don't get any concrete motivation on why the BoostAttribute should not be used, I'll proceed with an in-depth review and merge in the next few days.
Any additional review and comment is hugely welcome!

alessandrobenedetti · 2023-03-21T10:49:52Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {


For the sake of context, I worked with @dantuzi on deciding if implementing a custom reader(this one) or using dl4j as an imported library.
Multiple attempts were done to include dl4j as a library in Lucene, but the effort and impact was not worth it so we reverted to a simple custom reader class.

There are downsides for this of course, ma it's much more lightweight

+1 to avoid dependencies whenever practical

alessandrobenedetti · 2023-03-21T10:58:12Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java

+  }
+
+  @Override
+  public RandomAccessVectorValues<float[]> copy() throws IOException {


if this copy is meant to do a deep copy, I suspect we'll need to handle it differently, I am not sure it's copying the internal elements, but reusing them?
So a copy could end up adding elements to data structures used by the original Object?

It does not need to do a deep copy; the purpose of this method is to enable multiple concurrent accesses to the underlying data. Since this implementation doesn't have any temporary variable into which vectors are decoded (which could be overwritten), I think it's safe to simply return this.

@msokolov I tried to implement your suggestion but it looks like the method HnswGraphBuilder::build doesn't want the same reference passed to the HnswGraphBuilder.create. [1]
To be honest I still don't understand why this check [2] is required

[1]

Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder() java.lang.IllegalArgumentException: Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder() at __randomizedtesting.SeedInfo.seed([994075DD4398F0A4:E100BB05917EA0E6]:0) at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.HnswGraphBuilder.build(HnswGraphBuilder.java:165) at org.apache.lucene.analysis.synonym.word2vec.Word2VecSynonymProvider.<init>(Word2VecSynonymProvider.java:64) at org.apache.lucene.analysis.synonym.word2vec.TestWord2VecSynonymProvider.<init>(TestWord2VecSynonymProvider.java:39)

[2]

lucene/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java

Lines 155 to 158 in 776149f

if (vectorsToAdd == this.vectors) {

throw new IllegalArgumentException(

"Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()");

}

alessandrobenedetti

The contribution is clean, simple to read, and very well-tested.
I left some minor considerations, that shoud be addressed.

I'll wait for some additional review and potential elaboration on the BoostAttribute and then merge.
This is just a first step for a new feature, so some future iterations are expected (for example I foresee the optional use of similarity as boost, the support for multi term synonyms, other models, ect

msokolov

Thanks for this nice contribution! I think it needs a bit of work, and it would be helpful to explain the benefit somewhere. ~~It's possible I missed some discussion about that sorry~~. Update: OK I see you have a summary in this PR :) I added some suggestions how to tighten up some implementations and a recommendation to eliminate some of the abstraction layers. I am traveling and might take a few days to respond -- please be patient, sorry!

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

msokolov · 2023-03-21T23:14:38Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {


+1 to avoid dependencies whenever practical

msokolov · 2023-03-21T23:16:55Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

+        int vectorDimension = Integer.parseInt(headerValues[1]);
+
+        Word2VecModel model = new Word2VecModel(dictionarySize, vectorDimension);
+        reader


to me this would read much clearer using a traditional while loop:

while ((line = reader.readLine()) != null) ...

msokolov · 2023-03-21T23:18:34Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

+
+  private static final String MODEL_FILE_NAME_PREFIX = "syn0";
+
+  private final String word2vecModelFilePath;


why pass in the path when it is only used in toString()? Can we choose between accepting a java.nio.Path that does its own open/close of the zip file and an (anonymous) InputStream?

Everything comes from the Word2VecSynonymFilterFactory that implements ResourceLoaderAware. This interface provides us a org.apache.lucene.util.ResourceLoader and the possibility to obtain an anonymous InputStream.
I decided to pass also the model file path to enrich the Exception message and make the user's life easier.
BTW I don't have a strong opinion about this. I can easily remove that string

Addressed comment by removing the string word2vecModelFilePath and changing the exception message

msokolov · 2023-03-21T23:21:04Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java

+        return model;
+      }
+    }
+    throw new UnsupportedEncodingException(


I think this exception is really intended for use with character set encodings only. Maybe IllegalArgumentException would fit better?

When we use the library DL4J to train a model and we export it, we obtain a compressed zip file.
This zip contains multiple files but we are only interested in file syn0. The exception is thrown if the passed zip does not contain any syn0 file.
I guess IllegalArgumentException would fit

msokolov · 2023-03-21T23:36:15Z

...nalysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModelReader.java

+ *
+ * @lucene.experimental
+ */
+public interface Word2VecModelReader extends Closeable {


Do we have more than one implementation? I don't think this interface is necessary. Later we can always add it if we have multiple implementations and need to abstract. For now it's just extra stuff to maintain

msokolov · 2023-03-21T23:38:30Z

...lysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymFilter.java

+      TermAndBoost synonym = synonymBuffer.pollFirst();
+      clearAttributes();
+      restoreState(this.lastState);
+      termAtt.setEmpty();


this would be cleaner if we used BytesTermAttribute

This is a good idea, I tried your suggestion and it is much cleaner but it doesn't pass the unit tests because the BytesTermAttribute contains a null ByteRef. I don't have much more time now to dedicate to this contribution. Feel free to change it once merged

msokolov · 2023-03-21T23:41:19Z

...ne/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/SynonymProvider.java

+   * SynonymProvider constructor
+   *
+   * @param term we want to find the synonyms
+   * @param maxSynonymsPerTerm maximum number of result returned by the synonym search


I don't see that we need this interface if its only use is in this one Dl4jWord2VecSynonymFilter. Can't we simply refer directly to the implementing class?

msokolov · 2023-03-21T23:43:29Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java

+  }
+
+  public BytesRef binaryValue(int targetOrd) throws IOException {
+    return data[targetOrd].getTerm();


This is incorrect - the purpose of this method is to return some bytes representing the vector value. I think instead you ought to simply throw UnsupportedOperationException since this implementation will never be used in an indexing context where this method is required.

As you can see, this method is not @Override so this is not an implementation of RandomAccessVectorValues interface. This is a custom method used in our implementation

msokolov · 2023-03-21T23:46:55Z

...lysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymFilter.java

+import org.apache.lucene.util.TermAndBoost;
+
+/**
+ * Applies single-token synonyms from a Word2Vec trained network to an incoming {@link TokenStream}.


I don't see any reference to word2vec in here? We just use this abstract SynonymFilter that could conceivably be implemented using a thesaurus. That makes this more powerful than I think we are claiming it to be. Again, let's simply use the concrete classes throughout and call the classes Dl4jWord2VecXXX

As you suggested, I'm replacing the general SynonymFilter interface with the "concrete" corresponding class

rmuir · 2023-03-22T10:19:08Z

Hi @rmuir, first of all, I deeply appreciate the time you are taking to help us on this issue, thank you for that.
When I said "'ll leave the BoostAttribute discussion for another time" I meant today (I was on holiday last week and I had the chance to look at the code only today with @dantuzi and @ilariapet .
I appreciate the javaDoc comments, but as also @dsmiley mentioned, I would like to understand practical reasons to avoid the usage of the BoostAttribute in other areas.
It seems to me just a simple wrapper of a float value.
Having a JavaDoc comment to restrict the usage of a public class is not ideal, but if you can elaborate we are more than happy to improve the contribution.

The class is public due to the way attributes are implemented. AttributesImpl classes must be public, there's no way to make them package private! Does this make sense? It is in the search package for the fuzzy query to use for this reason, with lots of warnings in its documentation about the usage.

The attribute is needed to manage the priority queue of fuzzy query's rewrite method. Because of this, there are some odd things about its behavior (e.g. allowing negative boosts). It is really not suitable for usage anywhere else in the system.

please address my concern and don't ignore it. i'm -1 on this PR being merged.

alessandrobenedetti · 2023-03-22T10:26:33Z

Hi @rmuir, we definitely don't want to ignore improvement recommendations, rest assured.
Sorry if I am pedantic, I just want to understand why it shouldn't be used for additional things(on top of what it was originally designed to).
Do you have more info on the "odd things about its behavior"?
Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.
Thank you again for your time, it's genuinely much appreciated

rmuir · 2023-03-22T10:33:46Z

Hi @rmuir, we definitely don't want to ignore improvement recommendations, rest assured.
Sorry if I am pedantic, I just want to understand why it shouldn't be used for additional things(on top of what it was originally designed to).
Do you have more info on the "odd things about its behavior"?
Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.
Thank you again for your time, it's genuinely much appreciated

The class documents that it should not be used for this purpose. I honestly feel the documentation is good: the problem is you didn't read it. Please don't make me repeat myself here.

To be clear, i'm -1 on the issue and this constitutes a veto. it can't be ignored, just fix the issue instead of merging the PR.

rmuir · 2023-03-22T10:36:48Z

Allowing negative numbers shouldn't be an issue, the vector similarity score is 0<=x<=1.

You don't understand. Some day, someone may want to add such safety as a check to the attribute, but they won't be able to because fuzzy query relies on such things. That's a great big reason this attribute should not be used for other purposes!

It was created to be applied to TermsEnum and not a tokenstream at all. By using an attribute on TermsEnum, we are able to handle this corner-case needed to accelerate top-N selection of fuzzyquery, without adding unnecessary bloat to TermsEnum itself.

romseygeek · 2023-03-22T10:49:11Z

Would your worries be assuaged if we created a separate QueryBoostAttribute class @rmuir? Then QueryBuilder can use that, and we can add checks for negative boosts, javadocs that say it should only be used for query time analysis, etc.

rmuir · 2023-03-22T11:01:45Z

Would your worries be assuaged if we created a separate QueryBoostAttribute class @rmuir? Then QueryBuilder can use that, and we can add checks for negative boosts, javadocs that say it should only be used for query time analysis, etc.

It would certainly be a LOT less offensive than using FuzzyQuery's BoostAtt?

But I still don't think analysis pipeline should be involved in this way, in query formulation. It is confusing in many ways, especially with boosts that will get silently dropped at index-time, confusing users. So I'd really rather us avoid analysis chain doing this stuff. it is mixing separate concerns in a very bad way: analysis chain is supposed to be about tokenization and boosts are supposed to be in the domain of higher level query parsers/MLT as it represents query intent (and also makes no sense at index time). See my comment above about whether this should even be an analyzer :)

I really see it closer to "query expansion" (e.g. closer in functionality to MLT).

alessandrobenedetti · 2023-03-22T11:03:00Z

Ok, let me rephrase my question then:
Let's assume:

we don't care why that class was originally created and its JavaDocs comments
we don't intend to modify it

Quoting @dsmiley "Why is it important that BoostAttribute's usage be constrained to only MultiTermQuery?" (from the code is just containing a float payload).
What makes the BoostAttribute different from the org.apache.lucene.analysis.tokenattributes.PayloadAttribute for example? (aside from the different types of payload BytesRef vs float).
Let's just give a constructive explanation if you know, rather than citing again that JavaDoc :)

N.B. I am not a strenuous defender of using that class, to be clear, I want to do best in terms of readability and quality for the project, I just want my review to improve the work of the original contributors of this patch!

jimczi

That's a nice addition to Lucene, thanks @alessandrobenedetti .
I left one comment regarding the multi-thread aspect. I don't think it works currently and that could lead to weird behaviour/bugs. The OnHeapHnswGraph does not handle concurrency so it's not possible to use a single instance with multiple threads. It would require some changes in the APIs to make it possible unless I missed something?

dantuzi · 2023-04-24T15:07:55Z

@romseygeek, @alessandrobenedetti and me found the bug and we are fixing it right now

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

dantuzi · 2023-04-24T15:48:10Z

@romseygeek We pushed the fix to the main branch and we cherry-pick it to the branch 9.x.

@jimczi Thank you for your comments and suggestions, for the time being, we don't have enough capacity to work again on this contribution. Feel free to publish a PR with the improvement you think are useful and we will be happy to review it.

jimczi · 2023-04-24T15:53:54Z

for the time being, we don't have enough capacity to work again on this contribution

It's not a suggestion it's a bug ;). As it is my suggestion would be to revert this change and to work on a fix. If I am right, this filter is not usable at all and would break in weird ways so please don't ignore it.

msokolov · 2023-04-24T17:24:17Z

Thanks for calling this out @jimczi. Yeah, that's true -- the javadoc for HnswGraph is incorrect. It says " The graph may be searched by multiple threads concurrently, but updates are not thread-safe." but it this class doesn't actually provide any searching methods, only some methods for iteration that modify internal state, which are clearly not thread-safe.

We might be able to clean up the API to make it clearer what's going on; multiple people seem to be getting confused by this. But I agree this change as it stands is not usable.

Adding a clone() / copy() method to HnswGraph seems like a fairly unobtrusive way to support this. Javadoc definitely needs to be clearer.

alessandrobenedetti · 2023-04-24T17:39:26Z

Hi @jimczi , @msokolov , I don't think it was @dantuzi intention to ignore you, I think he just misread your comment as a suggestion of improvement rather than a blocker.

I had Daniele and Ilaria extensively testing and benchmarking the contribution in the last few months, we left the PR open for various months and we thought it was in a good state to merge (being a new feature, not impacting existing works).

I really appreciate you spending the time to find this problem and we'll take a look in detail in 2 days (Wednesday 26th).
Given that, and not entering the technical merit until the 26th (I don't have the time right now to process your insights), is it worth reverting or we can simply fix it/commit a workaround as soon as we find it in the next few days?

jimczi · 2023-04-25T09:17:39Z

Adding a clone() / copy() method to HnswGraph seems like a fairly unobtrusive way to support this. Javadoc definitely needs to be clearer.

I looked at this and I am a bit reluctant to add this method since the OnHeapHnswGraph is mutable. That would be confusing and could lead to the same issue we have in this PR. I'd say that in this current form the HnswGraph is an internal class that shouldn't be used outside of the indexing primitives.

I am also curious to hear what's the expectation for this filter. I checked some resources created with word2vec or GloVe and they output a large vocabulary size. With a single thread, building a graph of 1M vectors with hundreds of dimensions would take several minutes (16 minutes at 1000 docs/s). That seems overkill for an application to rebuild this graph on every restart. Could we instead build the resource offline (the hnsw graph) and just load it when the filter starts?

is it worth reverting or we can simply fix it/commit a workaround as soon as we find it in the next few days?

I don't have a strong feeling here but I am not sure we should add this filter in the common analysis module. As it stands it feels more experimental so the sandbox might be more appropriate?

romseygeek · 2023-04-25T09:28:55Z

I'm planning on cutting a release branch for 9x tomorrow. Given the issues that Jim has identified, I think it's worth reverting at least from 9x?

This reverts commit 4deb000.

This reverts commit 1f4f2bf.

alessandrobenedetti · 2023-04-25T09:49:58Z

I agree @romseygeek and done that, we'll think about the other issues tomorrow on main!

zhaih · 2023-04-26T01:04:04Z

...sis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymProvider.java

+              SIMILARITY_FUNCTION,
+              hnswGraph,
+              null,
+              word2VecModel.size());


setting this limit will actually cause inaccurate result, since this limit is actually limit on "total number of nodes you visited on all levels" so when you reaches the bottom level you're not guaranteed to be able to visit every possible node. (see update of the limit here and reset of visited nodes for each level here)

not sure I get it, can you elaborate a bit?

I see, you are referring to line 88 and 'word2VecModel.size()' @dantuzi

This is a quick mitigation to avoid unexpected behaviour if multiple queries are executed concurrently. This commit wants to address some comments received in the PR apache#12169

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location.

PR #12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location.

…dc8ca633e8bcf`) (#20) * Add next minor version 9.7.0 * Fix SynonymQuery equals implementation (apache#12260) The term member of TermAndBoost used to be a Term instance and became a BytesRef with apache#11941, which means its equals impl won't take the field name into account. The SynonymQuery equals impl needs to be updated accordingly to take the field into account as well, otherwise synonym queries with same term and boost across different fields are equal which is a bug. * Fix MMapDirectory documentation for Java 20 (apache#12265) * Don't generate stacktrace in CollectionTerminatedException (apache#12270) CollectionTerminatedException is always caught and never exposed to users so there's no point in filling in a stack-trace for it. * add missing changelog entry for apache#12260 * Add missing author to changelog entry for apache#12220 * Make query timeout members final in ExitableDirectoryReader (apache#12274) There's a couple of places in the Exitable wrapper classes where queryTimeout is set within the constructor and never modified. This commit makes such members final. * Update javadocs for QueryTimeout (apache#12272) QueryTimeout was introduced together with ExitableDirectoryReader but is now also optionally set to the IndexSearcher to wrap the bulk scorer with a TimeLimitingBulkScorer. Its javadocs needs updating. * Make TimeExceededException members final (apache#12271) TimeExceededException has three members that are set within its constructor and never modified. They can be made final. * DOAP changes for release 9.6.0 * Add back-compat indices for 9.6.0 * `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283) * `ToParentBlockJoinQuery` Explain Support Score Mode --------- Co-authored-by: Marcus <marcuseagan@gmail.com> * Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285) The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden. * [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197) * GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840) Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher) Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com> Co-authored-by: Adrien Grand <jpountz@gmail.com> Backport of apache#11840 Changes from original: - Query keeps `rewrite(IndexReader)`, but it is now deprecated - VirtualMethod is used to correct delegate to the overridden methods - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. ------------------------------ ### Description Issue: apache#11838 #### Updated Proposal * Change signature of rewrite to `rewrite(IndexSearcher)` * How did I migrate the usage: * Use Intellij to do preliminary refactoring for me * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)` * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`) * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher. * Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288) * Concurrent rewrite for KnnVectorQuery (apache#12160) - Reduce overhead of non-concurrent search by preserving original execution - Improve readability by factoring into separate functions --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * adjusting for backport --------- Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * toposort use iterator to avoid stackoverflow (apache#12286) Co-authored-by: tangdonghai <tangdonghai@meituan.com> # Conflicts: # lucene/CHANGES.txt * Fix test to compile with Java 11 after backport of apache#12286 * Update Javadoc for topoSortStates method after apache#12286 (apache#12292) * Optimize HNSW diversity calculation (apache#12235) * Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305) * GITHUB-12291: Skip blank lines from stopwords list. (apache#12299) * Wrap Query rewrite backwards layer with AccessController (apache#12308) * Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315) * Add multi-thread searchability to OnHeapHnswGraph (apache#12257) * Fix backport error * [MINOR] Update javadoc in Query class (apache#12233) - add a few missing full stops - update wording in the description of Query#equals method * [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327) Leverage accelerated vector hardware instructions in Vector Search. Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously). Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21. --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Robert Muir <rmuir@apache.org> * Parallelize knn query rewrite across slices rather than segments (apache#12325) The concurrent query rewrite for knn vectory query introduced with apache#12160 requests one thread per segment to the executor. To align this with the IndexSearcher parallel behaviour, we should rather parallelize across slices. Also, we can reuse the same slice executor instance that the index searcher already holds, in that way we are using a QueueSizeBasedExecutor when a thread pool executor is provided. * Optimize ConjunctionDISI.createConjunction (apache#12328) This method is showing up as a little hot when profiling some queries. Almost all the time spent in this method is just burnt on ceremony around stream indirections that don't inline. Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min cost) makes this method far cheaper and easier to read. * Update changes to be correct with ARM (it is called NEON there) * GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332) Preparing to reduce visibility of this class in a future release * add BitSet.clear() (apache#12268) # Conflicts: # lucene/CHANGES.txt * Clenaup and update changes and synchronize with 9.x * Update TestVectorUtilProviders.java (apache#12338) * Don't generate stacktrace for TimeExceededException (apache#12335) The exception is package private and never rethrown, we can avoid generating a stacktrace for it. * Introduced the Word2VecSynonymFilter (apache#12169) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> * Word2VecSynonymFilter constructor null check (apache#12169) * Use thread-safe search version of HnswGraphSearcher (apache#12246) Addressing comment received in the PR apache#12246 * Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235) We observed this change was not ported previously from main in an old cherry-pick * Fix searchafter high latency when after value is out of range for segment (apache#12334) * Make memory fence in `ByteBufferGuard` explicit (apache#12290) * Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320) * Add updateDocuments API which accept a query (reopen) (apache#12346) * GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions * [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353) # Conflicts: # lucene/CHANGES.txt * feat: soft delete optimize (apache#12339) * Better paging when random reads go backwards (apache#12357) When reading data from outside the buffer, BufferedIndexInput always resets its buffer to start at the new read position. If we are reading backwards (for example, using an OffHeapFSTStore for a terms dictionary) then this can have the effect of re-reading the same data over and over again. This commit changes BufferedIndexInput to use paging when reading backwards, so that if we ask for a byte that is before the current buffer, we read a block of data of bufferSize that ends at the previous buffer start. Fixes apache#12356 * Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362) * Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249) * Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294) Backport incl JDK21 apijar file with java.util.Objects regenerated * remove relic in apijar folder caused by vector additions * Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324) * Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281) --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> * Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365) This commit enables the Panama Vector API for Java 21. The version of VectorUtilPanamaProvider for Java 21 is identical to that of Java 20. As such, there is no specific 21 version - the Java 20 version will be loaded from the MRJAR. * Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368) Signed-off-by: gashutos <gashutos@amazon.com> * Move TermAndBoost back to its original location. (apache#12366) PR apache#12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location. * GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> * hunspell (minor): reduce allocations when processing compound rules (apache#12316) (cherry picked from commit a454388) * hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323) there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation (cherry picked from commit 4bf1b94) * TestHunspell: reduce the flakiness probability (apache#12351) * TestHunspell: reduce the flakiness probability We need to check how the timeout interacts with custom exception-throwing checkCanceled. The default timeout seems not enough for some CI agents, so let's increase it. Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com> (cherry picked from commit 5b63a18) * This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376) --------- Signed-off-by: gashutos <gashutos@amazon.com> Co-authored-by: Alan Woodward <romseygeek@apache.org> Co-authored-by: Luca Cavanna <javanna@apache.org> Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Armin Braun <me@obrown.io> Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com> Co-authored-by: Marcus <marcuseagan@gmail.com> Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com> Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> Co-authored-by: tang donghai <tangdhcs@gmail.com> Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com> Co-authored-by: Greg Miller <gsmiller@gmail.com> Co-authored-by: Jerry Chin <metrxqin@gmail.com> Co-authored-by: Patrick Zhai <zhai7631@gmail.com> Co-authored-by: Andrey Bozhko <andybozhko@gmail.com> Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com> Co-authored-by: Robert Muir <rmuir@apache.org> Co-authored-by: Jonathan Ellis <jbellis@datastax.com> Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com> Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com> Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru> Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org> Co-authored-by: Ignacio Vera <ivera@apache.org> Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com> Co-authored-by: Chris Fournier <chris.fournier@shopify.com> Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com> Co-authored-by: Adrien Grand <jpountz@gmail.com> Co-authored-by: Elia Porciani <e.porciani@sease.io> Co-authored-by: Peter Gromov <peter@jetbrains.com>

Introduced the Word2VecSynonymFilter

2abb1ea

alessandrobenedetti requested review from alessandrobenedetti and tteofili February 24, 2023 08:36

dantuzi marked this pull request as draft February 24, 2023 08:38

alessandrobenedetti linked an issue Feb 24, 2023 that may be closed by this pull request

Word2Vec model to generate synonyms on the fly #12168

Closed

dantuzi marked this pull request as ready for review February 24, 2023 08:40

rmuir requested changes Feb 24, 2023

View reviewed changes

alessandrobenedetti reviewed Mar 21, 2023

View reviewed changes

alessandrobenedetti approved these changes Mar 21, 2023

View reviewed changes

msokolov reviewed Mar 21, 2023

View reviewed changes

jimczi reviewed Apr 24, 2023

View reviewed changes

dantuzi added a commit to SeaseLtd/lucene that referenced this pull request Apr 24, 2023

Introduced the Word2VecSynonymFilter - check null (apache#12169)

8250065

asfgit pushed a commit that referenced this pull request Apr 24, 2023

Word2VecSynonymFilter constructor null check (#12169)

4deb000

asfgit pushed a commit that referenced this pull request Apr 24, 2023

Introduced the Word2VecSynonymFilter (#12169)

53709cc

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

asfgit pushed a commit that referenced this pull request Apr 24, 2023

Word2VecSynonymFilter constructor null check (#12169)

afb5ed6

asfgit pushed a commit that referenced this pull request Apr 25, 2023

Revert "Word2VecSynonymFilter constructor null check (#12169)"

19925d4

This reverts commit 4deb000.

asfgit pushed a commit that referenced this pull request Apr 25, 2023

Revert "Introduced the Word2VecSynonymFilter (#12169)"

7e15a39

This reverts commit 1f4f2bf.

zhaih reviewed Apr 26, 2023

View reviewed changes

zhaih mentioned this pull request Apr 26, 2023

Optimize HNSW diversity calculation #12235

Merged

dantuzi mentioned this pull request Apr 27, 2023

Set word2vec getSynonyms method thread-safe #12246

Merged

zhaih mentioned this pull request May 9, 2023

Add multi-thread searchability to OnHeapHnswGraph #12257

Merged

asfgit pushed a commit that referenced this pull request May 30, 2023

Introduced the Word2VecSynonymFilter (#12169)

64b48b8

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

asfgit pushed a commit that referenced this pull request May 30, 2023

Word2VecSynonymFilter constructor null check (#12169)

34e5905

reta mentioned this pull request Jun 6, 2023

Update Apache Lucene to 9.7.0-snapshot-204acc3 opensearch-project/OpenSearch#7887

Merged

6 tasks

jpountz mentioned this pull request Jun 13, 2023

Move TermAndBoost back to its original location. #12366

Merged

dantuzi mentioned this pull request Jul 11, 2023

Introduce the similarity as boost functionality to the Word2VecSynonyFilter #12433

Draft

	if (vectorsToAdd == this.vectors) {
	throw new IllegalArgumentException(
	"Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()");
	}


		private static final String MODEL_FILE_NAME_PREFIX = "syn0";

		private final String word2vecModelFilePath;

Introduced the Word2VecSynonymFilter #12169

Introduced the Word2VecSynonymFilter #12169

Conversation

dantuzi commented Feb 24, 2023 • edited Loading

For the reviewer:

Issue

rmuir left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tteofili commented Feb 27, 2023

dantuzi commented Mar 1, 2023

rmuir commented Mar 1, 2023

rmuir commented Mar 1, 2023

alessandrobenedetti commented Mar 10, 2023

dantuzi commented Mar 14, 2023

Index Time

Query Time

mkhludnev commented Mar 14, 2023

rmuir commented Mar 14, 2023

rmuir commented Mar 14, 2023

dsmiley commented Mar 14, 2023

alessandrobenedetti commented Mar 14, 2023 • edited Loading

alessandrobenedetti commented Mar 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alessandrobenedetti left a comment

Choose a reason for hiding this comment

msokolov left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dantuzi Apr 5, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dantuzi Apr 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rmuir commented Mar 22, 2023

alessandrobenedetti commented Mar 22, 2023

rmuir commented Mar 22, 2023

rmuir commented Mar 22, 2023

romseygeek commented Mar 22, 2023

rmuir commented Mar 22, 2023

alessandrobenedetti commented Mar 22, 2023

jimczi left a comment

Choose a reason for hiding this comment

dantuzi commented Apr 24, 2023

dantuzi commented Apr 24, 2023

jimczi commented Apr 24, 2023

msokolov commented Apr 24, 2023

alessandrobenedetti commented Apr 24, 2023

jimczi commented Apr 25, 2023

romseygeek commented Apr 25, 2023

alessandrobenedetti commented Apr 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dantuzi commented Feb 24, 2023 •

edited

Loading

alessandrobenedetti commented Mar 14, 2023 •

edited

Loading

msokolov left a comment •

edited

Loading

dantuzi Apr 5, 2023 •

edited

Loading

dantuzi Apr 4, 2023 •

edited

Loading