Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GITHUB-11838 Add api to allow concurrent query rewrite #11840

Merged
merged 7 commits into from
Oct 19, 2022
Merged

Conversation

zhaih
Copy link
Contributor

@zhaih zhaih commented Oct 7, 2022

Description

Issue: #11838

Initial Proposal (discarded)

  • Added one public method to Query to allow passing in an executor, and default to just call single thread version
  • In IndexSearcher we always pass in the executor
  • I'm not sure whether we should add a parameter to control the behavior, I'm inclined to believe that users who passed in the executor are willing to make use of concurrent rewrite.
  • Rewritten ConstantScoreQuery so that the query executor will be passed into inner query.

Updated Proposal

  • Change signature of rewrite to rewrite(IndexSearcher)
  • How did I migrate the usage:
    • Use Intellij to do preliminary refactoring for me
    • For test usage, use searcher whenever is available, otherwise create one using newSearcher(reader)
    • For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using new IndexSearcher(reader), tried my best to avoid creating it recurrently (Especially in FieldQuery)
    • For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

@jpountz
Copy link
Contributor

jpountz commented Oct 11, 2022

We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?

@jpountz
Copy link
Contributor

jpountz commented Oct 11, 2022

(and should it replace the existing rewrite method instead of just adding another one?)

@zhaih
Copy link
Contributor Author

zhaih commented Oct 12, 2022

Hi @jpountz thank for taking a look!

We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?

I think it'd be a little weird to know that query rewriting requires an IndexSearcher from a user point of view? Also for people who're not using IndexSearcher provided by lucene (like we're in LinkedIn) it will be a bit inconvenience to create an IndexSearcher just for query rewriting?

(and should it replace the existing rewrite method instead of just adding another one?)

Yes I think replace is better, I previously wanted to make it into 9.x so don't want to change the API, are we able to make it in 9.x even if with the API change?

@jpountz
Copy link
Contributor

jpountz commented Oct 12, 2022

I don't think it'd be weird to require an IndexSearcher, Query#rewrite is essentially a way to improve caching and keep Query#createWeight simple. Given that Query#createWeight already creates an IndexSearcher, I'd assume that you already create an IndexSearcher even though you might not be using its search methods?

@zhaih
Copy link
Contributor Author

zhaih commented Oct 13, 2022

@jpountz You're right, our case is a bit complex since currently we're not even using Lucene's Query (but we're planning to in the future!) so I totally forgot the createWeight also takes an IndexSearcher.
Yeah now it makes sense to me to change that API to use IndexSearcher as well, I'll try to get one out these days. Thank you for the discussion!

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Patrick, I left some minor comments but the change looks good to me overall.

Change signature of FeatureFunction#rewrite and
MultiTermQuery.RewriteMethod#rewrite
@uschindler
Copy link
Contributor

I think the general idea looks well.
This should me Lucene 10 only, as the changes in query API may break lots of code downstream. We could theoretically add a backwards combatiility layer using VirtualMethod (says the policeman), if that's wanted, I can help. VirtualMethod utility class would be needed to figure out which of both methods (rewrite(IndexReader) or rewrite(IndexSearcher)) was overridden by a subclass downstream.

@jpountz
Copy link
Contributor

jpountz commented Oct 14, 2022

Agreed on making this Lucene 10.

As far as bw compat is concerned, I was thinking of updating Lucene 9 this way:

  • introduce Query#rewrite(IndexSearcher) and make it delegate to Query#rewrite(IndexReader)
  • keep existing implementations as they are, except possibly the few ones where we'd like to take advantage of the executor
  • deprecate Query#rewrite(IndexReader)
  • make IndexSearcher#rewrite(Query) use Query#rewrite(IndexSearcher)

@zhaih
Copy link
Contributor Author

zhaih commented Oct 15, 2022

Yeah I already made this one only on Lucene 10, I'll make a 9x PR according to Adrien's suggestion, it sounds not too complicated. (Thanks Uwe as well, good to know that we have a VirtualMethod class just for the back compatibility)

@zhaih
Copy link
Contributor Author

zhaih commented Oct 15, 2022

I tried created one but then realized it seems not working, because query rewriting are rewriting query from top of the tree to the bottom so once we delegate to rewrite(IndexReader) we're not able to call the indexSearcher one in the children.

I wonder if people are ok with this only happens after Lucene 10? Or @uschindler can you help evaluate using VirtualMethod? It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?

@uschindler
Copy link
Contributor

uschindler commented Oct 16, 2022

Hi @zhaih,
the problem you describe is exactly why VirtualMethod is there. It allows to figure out which of both methods was overridden by a subclass at runtime.
I can look into this next week. Actually it is long ago when we used VirtualMethod and there is currently no impementation actually using it. We used it mostly for Analysis when we did the big change from tokens to attributes and thats already 10 years ago. So I have to bring myself up to state, mainly because the whole thing is "not easy to understand" as there are various code paths that could be affected.
To say it short: changing method signatures of protected methods where you have no chance to enforce downstream classes to use new API are not easy. JDK avoids this, we have VirtualMethod :-)
We should maybe only merge this to main branch for now and then work separately on backport.

@uschindler
Copy link
Contributor

uschindler commented Oct 16, 2022

It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?

Yes, basically all classes that delegate rewrite to subqueries need to change to new API, so we still need to backport everything to 9.x and only detect cases where somebody uses the old api.

Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.

Let me think a few evening about it :-)

@jpountz
Copy link
Contributor

jpountz commented Oct 18, 2022

Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.

Using the same reasoning, I wonder if we'd solve 90% of the problem by propagating Query#rewrite(IndexSearcher) in BooleanQuery, ConstantScoreQuery and DisjunctionMaxQuery (the main queries that can be inner nodes in a query tree).

} else {
searcher = newSearcher(reader);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work? It looks like the reader is not instantiated in the setup, but in some protected helper methods that run after the setup? Should we instead modify the parent class to always instantiate an IndexSearcher when instantiating a reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works.. with flaws, since if one of the unit test alters the reader and the reader become non-null in this way searcher stays null. (Altho I personally don't like the pattern where a shared member will be altered within test)

Should we instead modify the parent class to always instantiate an IndexSearcher when instantiating a reader?

Yeah that's better, I changed accordingly, thanks!

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. About the backport to 9.x I will help soon, at moment I am not well due to COVID after a conference last week.

@zhaih
Copy link
Contributor Author

zhaih commented Oct 19, 2022

About the backport to 9.x I will help soon, at moment I am not well due to COVID after a conference last week.

Thanks Uwe! No hurries and take care!

@zhaih zhaih merged commit 6cde41c into apache:main Oct 19, 2022
benwtrent pushed a commit to benwtrent/lucene that referenced this pull request Mar 9, 2023
Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)
benwtrent added a commit that referenced this pull request May 12, 2023
…2197)

* GITHUB-11838 Change API to allow concurrent query rewrite (#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of #11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correct delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: #11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.
@javanna javanna added this to the 9.7.0 milestone May 26, 2023
javanna added a commit to elastic/elasticsearch that referenced this pull request May 31, 2023
Most relevant changes:

- add api to allow concurrent query rewrite (GITHUB-11838 Add api to allow concurrent query rewrite apache/lucene#11840)
- knn query rewrite (Concurrent rewrite for KnnVectorQuery apache/lucene#12160)
- Integrate the incubating Panama Vector API (Integrate the Incubating Panama Vector API  apache/lucene#12311)

As part of this commit I moved the ES codebase off of overriding or relying on the deprecated rewrite(IndexReader) method in favour of using rewrite(IndexSearcher) instead. For score functions, I went for not breaking existing plugins and create a new IndexSearcher whenever we rewrite a filter, otherwise we'd need to change the ScoreFunction#rewrite signature to take a searcher instead of a reader.

Co-authored-by: ChrisHegarty <christopher.hegarty@elastic.co>
hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request Jul 18, 2023
…dc8ca633e8bcf`) (#20)

* Add next minor version 9.7.0

* Fix SynonymQuery equals implementation (apache#12260)

The term member of TermAndBoost used to be a Term instance and became a
BytesRef with apache#11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.

* Fix MMapDirectory documentation for Java 20 (apache#12265)

* Don't generate stacktrace in CollectionTerminatedException (apache#12270)

CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.

* add missing changelog entry for apache#12260

* Add missing author to changelog entry for apache#12220

* Make query timeout members final in ExitableDirectoryReader (apache#12274)

There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.

* Update javadocs for QueryTimeout (apache#12272)

QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs needs updating.

* Make TimeExceededException members final (apache#12271)

TimeExceededException has three members that are set within its constructor and never modified. They can be made final.

* DOAP changes for release 9.6.0

* Add back-compat indices for 9.6.0

* `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283)

* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>

* Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285)

The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.

* [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197)

* GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of apache#11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correct delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: apache#11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288)

* Concurrent rewrite for KnnVectorQuery (apache#12160)


- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* adjusting for backport

---------

Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* toposort use iterator to avoid stackoverflow (apache#12286)

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt

* Fix test to compile with Java 11 after backport of apache#12286

* Update Javadoc for topoSortStates method after apache#12286 (apache#12292)

* Optimize HNSW diversity calculation (apache#12235)

* Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305)

* GITHUB-12291: Skip blank lines from stopwords list. (apache#12299)

* Wrap Query rewrite backwards layer with AccessController (apache#12308)

* Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315)

* Add multi-thread searchability to OnHeapHnswGraph (apache#12257)

* Fix backport error

* [MINOR] Update javadoc in Query class (apache#12233)

- add a few missing full stops
- update wording in the description of Query#equals method

* [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327)

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>

* Parallelize knn query rewrite across slices rather than segments (apache#12325)

The concurrent query rewrite for knn vectory query introduced with apache#12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.

* Optimize ConjunctionDISI.createConjunction (apache#12328)

This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.

* Update changes to be correct with ARM (it is called NEON there)

* GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332)

Preparing to reduce visibility of this class in a future release

* add BitSet.clear() (apache#12268)

# Conflicts:
#	lucene/CHANGES.txt

* Clenaup and update changes and synchronize with 9.x

* Update TestVectorUtilProviders.java (apache#12338)

* Don't generate stacktrace for TimeExceededException (apache#12335)

The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.

* Introduced the Word2VecSynonymFilter (apache#12169)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* Word2VecSynonymFilter constructor null check (apache#12169)

* Use thread-safe search version of HnswGraphSearcher (apache#12246)

Addressing comment received in the PR apache#12246

* Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235)
We observed this change was not ported previously from main in an old cherry-pick

* Fix searchafter high latency when after value is out of range for segment (apache#12334)

* Make memory fence in `ByteBufferGuard` explicit (apache#12290)

* Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320)

* Add updateDocuments API which accept a query (reopen) (apache#12346)

* GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo

This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions

* [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353)

# Conflicts:
#	lucene/CHANGES.txt

* feat: soft delete optimize (apache#12339)

* Better paging when random reads go backwards (apache#12357)

When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.

Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)
Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)


---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.

* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants