Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-thread searchability to OnHeapHnswGraph #12257

Merged
merged 1 commit into from
May 22, 2023

Conversation

zhaih
Copy link
Contributor

@zhaih zhaih commented Apr 30, 2023

Description

Implemented a searcher specific to address the thread-safety problem on searching on OnHeapHnswGraph

Idea: #12246 (comment)

Test

A test case comparing the search using 4 threads and new class with single thread and old one.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhaih , did you see the other PR that would also allow multi-threaded insertion? Adding multi-threading just for search would be misleading imo since this class exposes insertion too.

@zhaih
Copy link
Contributor Author

zhaih commented May 3, 2023

Hi @jimczi, thanks for the comment. Yeah I'm aware of that PR and also looking forward to that. However I decided to make this PR because:

  1. This can resolve the thread-safety issue faster, as I would imagine we will iterate on that PR a bit longer.
  2. I'm not sure whether the concurrent writable & readable HNSW graph will completely replace the current one, citing the author's own words:

With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent that went in).

I think the original impl still has it's own value in case where people don't necessarily need concurrent write, so have a thread-safe search solution may still be desirable? (Especially when it is quite easy to make one)

As for

Adding multi-threading just for search would be misleading imo since this class exposes insertion too.

Yes I agree, but I think it's more of a javadoc issue? If we documented it clearly that HNSW graph is not supposed to be written and read at the same time unless explicitly stated, maybe it is more acceptable?
Just like HashMap, ArrayList and tons of other data structures that are not supposed to be safe when written & read together but expose an insertion API, if we don't advertise it as "totally safe" then I don't think people should have that impression?

@alessandrobenedetti
Copy link
Contributor

Hi @zhaih , I think your idea can be a quick workaround, I like it, in its simplicity, I take the freedom to add a commit on top of it to make it more usable and compilation friendly, it's just an idea so feel free to revert the commit/ have discussion on top of it.

@alessandrobenedetti
Copy link
Contributor

@zhaih , @jimczi take a look to my last commit, it's intended as a draft, just to spark additional discussion, it's not meant in no way to override/diminish your work.
The rationale is to force users to choose the searcher depending on the graph they use, in this way, a non-thread safe execution will fail at compile time.

Tests should be green and if you like the idea, we shouldn't need the comment "*

Note: if you want to search {@link OnHeapHnswGraph} in a thread-safety manner, please
* consider using {@link OnHeapHnswGraphSearcher}"

@zhaih
Copy link
Contributor Author

zhaih commented May 5, 2023

Thanks @alessandrobenedetti for your idea, but sorry I am a bit against the OffHeapHnswSearcher as it isn't thread-safe either: in normal search we achieve it by creating a new copy of graph for each search (source).

I think what HnswGraphSearcher did is ok: it uses the API set that HnswGraph provided to do the search. And we definitely should warn users that it does not guarantee thread safety and it varies on different graph implementation: like if user put the future ConcurrentHnswGraph there it should be able to just work with multi-threading out of box.

@alessandrobenedetti
Copy link
Contributor

alessandrobenedetti commented May 8, 2023

Hi @zhaih, the main intent of my commits was not to make searching an OffHeap graph thread safe.
The main problem I see at the moment is that we delegate to a comment the responsibility of indicating to users 'the better searcher for OnHeap graphs'.
I looked around and when we use a graph searcher, as far as I saw, it's always known of we are dealing with an OffHeap or OnHeap version.
Hence the reason I made it explicit in the classes, so it won't be possible to have a choice: if you deal with the OnHeap you must use the OnHeapSearcher and if you deal with the OffHeap, you use the OffHeapSearcher.
For example with your original commits the bug we have on the SynonymProvider will still be there and no compilation error will tell us there's actually a better graphSearcher for the use case.
I hope this makes my contribution here clearer, and what I proposed is not meant to be the perfect solution but I believe we shouldn't ignore the problem and build a similar solution (happy to evaluate better approaches if any!)

@zhaih zhaih force-pushed the multiSearchHnsw branch 2 times, most recently from 2650d3a to 0c83a14 Compare May 9, 2023 04:20
@zhaih
Copy link
Contributor Author

zhaih commented May 9, 2023

@alessandrobenedetti thanks for explaining, altho I still don't like the idea of creating an OffHeapHnswGraphSearcher but use only APIs provided by HnswGraph, but to address your concern that user may not aware of there's a better searcher, I made a new change:
instead of making a new OnHeapHnswGraphSearch.search API I overloaded the old method so that when users are using OnHeapHnswGraph they will automatically use the thread-safe version, how do you think about that?

Copy link
Contributor

@alessandrobenedetti alessandrobenedetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this pull request, It's not invasive and I think it solves the problem of concurrently searching OnHeap HNSW graphs!
+1 from my side

@zhaih
Copy link
Contributor Author

zhaih commented May 9, 2023

Thanks @alessandrobenedetti , I'll wait a day to give @msokolov and @jimczi a chance to review before merging it!

@jimczi
Copy link
Contributor

jimczi commented May 9, 2023

Sorry but that doesn't make sense to me to work/review this PR while #12254 is under review. I also disagree with the goal here, the mutable hnsw graph is used internally by the indexer, making it multi-threaded just for search is an external requirement. The other PR started with the goal of speeding up the building by using multiple threads. That seems reasonable to me. Using multiple threads just for the sake of this token filter is a non-goal imo.
Can you follow the progress of #12254 and plug your changes there?

@zhaih
Copy link
Contributor Author

zhaih commented May 9, 2023

Sorry @jimczi I am not sure I get your idea.
Did you mean:

  1. We shouldn't make improvement to this OnHeapHnswGraph right now because the other PR is ongoing and it will likely replace this one so what we're doing here will be deprecated eventually.
    Or:
  2. We shouldn't make improvement to this OnHeapHnswGraph for reason that is not for our main use case: which is indexing.

I actually don't think this PR is that related to the other one, actually this one can be benefit to the other one as that one is currently using ThreadLocal internally to solve the problem this PR is solving. If this PR is checked in, I believe the other one might be able to abandon the ThreadLocal.

@jimczi
Copy link
Contributor

jimczi commented May 9, 2023

I think both 1 and 2 yes. If you think this PR will benefit the other PR then it should be discussed in #12254. We already have issues with indexing so adding new requirements outside of this use case should be avoided imo.
The thing I don't like with this change alone is that it makes it like if using the HNSW builder for a search use case is ok.
That's not right, nobody should use the HNSW builder as a static searcher. If we make it multi-threaded it should be to make the build faster. Search is exposed in the builder simply because it is used by the algorithm to create the graph.

@zhaih
Copy link
Contributor Author

zhaih commented May 9, 2023

I'm ok to just pause it here or eventually drop this PR as I'm definitely not the one who need this feature (at least for now). But I guess if @jimczi you want to stop people from using the OnHeapHnswGraph besides indexing, the better place is to stop (since it's merged then probably revert) this one #12169

That's not right, nobody should use the HNSW builder as a static searcher.

BTW we're using HnswGraph but not HnswGraphBuilder here, but if by builder you mean that graph has a public insertion API then yes it does.

@alessandrobenedetti
Copy link
Contributor

I still think this contribution to be valuable, as I don't like much the fact that OnHeapGraph is stateful.
But I agree that the other contribution is solving a very similar problem.
@zhaih would be ok for you to try to get involved in the other contribution, trying to align your intent with the guy in there?
If you don't have the time, I'll try to find some, by the end of the week, as it's a very interesting topic.

In regard to avoiding people using the OnHeapHnswHraph outside the builder, I suspect we'll need to restructure the code in some way, as simple comments won't prevent future people to not using public classes.
And by the way, I also agree that we can improve the Word2Vec synonym filter in the future, changing the OnHeap approach with an OffHeap one, but rather than reverting it (and very likely lose the contribution) I do believe that incremental updates are the key here.

@alessandrobenedetti
Copy link
Contributor

alessandrobenedetti commented May 10, 2023

Another option, to be honest, is also to merge this first, and then the other committer will have to resolve the conflicts.
This one is a very minimal change and it seems an improvement to me, so it doesn't seem to me like a massive problem for the other pull request to deal with.

Rephrasing, @jimczi , is there any reason you believe the contribution from @zhaih to not be an improvement?
And no builder is involved in here as far as I reviewed.

@zhaih
Copy link
Contributor Author

zhaih commented May 13, 2023

would be ok for you to try to get involved in the other contribution, trying to align your intent with the guy in there?

I'll for sure trying to review that PR, but that one is trying to introduce a new data structure while this one is actually trying to add a quick workaround for HnswGraph's API (which discourage thread-safe), there's no conflict from the beginning IMO.

is there any reason you believe the contribution from @zhaih to not be an improvement?

In my understanding @jimczi is trying to say this change some what encourages the use of OnHeapHnswGraph, which originally is only used to build the graph in memory and shouldn't be used outside.

I'm ok with that, but to be fair: this PR is not trying to promote the use of that, it is trying to solve a potential issue in the current trunk. And the real factor that encourages the use is:

  1. It is a public class, even without lucene.internal tag
  2. All related tools (Builder, Searcher) are public available
  3. There's already an example of using it being recently added and no people are against it.

I'm not sure what all the solutions are, but stopping this PR won't solve any of the problems above, and will only make the thread-safety problem stays in the trunk longer.

@alessandrobenedetti
Copy link
Contributor

I agree to your points @zhaih , thinking more about it, to me this is just a minor contribution and an improvement.

Although I value also Jim's insights, I don't think they are blockers and I would suggest to merge on monday (or when it pleases you Patrick) unless @jimczi has some pragmatic evidence this causes any regression.

The other PR can manage to solve the conflicts later if any, that alone shouldn't block improvements like this.

@zhaih zhaih merged commit 8a602b5 into apache:main May 22, 2023
4 checks passed
@zhaih zhaih deleted the multiSearchHnsw branch May 22, 2023 04:49
hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request Jul 18, 2023
…dc8ca633e8bcf`) (#20)

* Add next minor version 9.7.0

* Fix SynonymQuery equals implementation (apache#12260)

The term member of TermAndBoost used to be a Term instance and became a
BytesRef with apache#11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.

* Fix MMapDirectory documentation for Java 20 (apache#12265)

* Don't generate stacktrace in CollectionTerminatedException (apache#12270)

CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.

* add missing changelog entry for apache#12260

* Add missing author to changelog entry for apache#12220

* Make query timeout members final in ExitableDirectoryReader (apache#12274)

There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.

* Update javadocs for QueryTimeout (apache#12272)

QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs needs updating.

* Make TimeExceededException members final (apache#12271)

TimeExceededException has three members that are set within its constructor and never modified. They can be made final.

* DOAP changes for release 9.6.0

* Add back-compat indices for 9.6.0

* `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283)

* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>

* Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285)

The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.

* [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197)

* GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of apache#11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correct delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: apache#11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288)

* Concurrent rewrite for KnnVectorQuery (apache#12160)


- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* adjusting for backport

---------

Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* toposort use iterator to avoid stackoverflow (apache#12286)

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt

* Fix test to compile with Java 11 after backport of apache#12286

* Update Javadoc for topoSortStates method after apache#12286 (apache#12292)

* Optimize HNSW diversity calculation (apache#12235)

* Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305)

* GITHUB-12291: Skip blank lines from stopwords list. (apache#12299)

* Wrap Query rewrite backwards layer with AccessController (apache#12308)

* Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315)

* Add multi-thread searchability to OnHeapHnswGraph (apache#12257)

* Fix backport error

* [MINOR] Update javadoc in Query class (apache#12233)

- add a few missing full stops
- update wording in the description of Query#equals method

* [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327)

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>

* Parallelize knn query rewrite across slices rather than segments (apache#12325)

The concurrent query rewrite for knn vectory query introduced with apache#12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.

* Optimize ConjunctionDISI.createConjunction (apache#12328)

This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.

* Update changes to be correct with ARM (it is called NEON there)

* GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332)

Preparing to reduce visibility of this class in a future release

* add BitSet.clear() (apache#12268)

# Conflicts:
#	lucene/CHANGES.txt

* Clenaup and update changes and synchronize with 9.x

* Update TestVectorUtilProviders.java (apache#12338)

* Don't generate stacktrace for TimeExceededException (apache#12335)

The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.

* Introduced the Word2VecSynonymFilter (apache#12169)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* Word2VecSynonymFilter constructor null check (apache#12169)

* Use thread-safe search version of HnswGraphSearcher (apache#12246)

Addressing comment received in the PR apache#12246

* Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235)
We observed this change was not ported previously from main in an old cherry-pick

* Fix searchafter high latency when after value is out of range for segment (apache#12334)

* Make memory fence in `ByteBufferGuard` explicit (apache#12290)

* Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320)

* Add updateDocuments API which accept a query (reopen) (apache#12346)

* GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo

This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions

* [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353)

# Conflicts:
#	lucene/CHANGES.txt

* feat: soft delete optimize (apache#12339)

* Better paging when random reads go backwards (apache#12357)

When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.

Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)
Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)


---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.

* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants