
Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors #12281

Merged: 8 commits, Jun 13, 2023

Conversation

@jbellis (Contributor) commented May 10, 2023

This PR adds argument checking to constructors of fields and query, so all vector values must be finite.

(this let me figure out that it was a NaN causing problems)
@jbellis (Contributor, Author) commented Jun 9, 2023

(Rebased to main)

@benwtrent benwtrent self-requested a review June 9, 2023 22:15
@rmuir (Member) commented Jun 10, 2023

Sorry, I am opposed to this PR. Checking whether vectors are finite needs to be done at, e.g., index time (in the xxField class), not on every similarity operation. And widening intermediate calculations to double will kill the performance of the vectorized impl, which the scalar code should stay consistent with.

These functions are performance sensitive, not the place for this.

@uschindler (Contributor) left a review comment:

Hi,
there are multiple problems:

  • We have no benchmark. The code introduced here will most likely prevent the SIMD instructions introduced by Hotspot. This code is very performance sensitive!
  • The SIMD implementation in PanamaVectorUtilProvider does not use doubles. Its results already differ, and this would change the behaviour completely.
  • Checking for finite vectors belongs in the field and query constructors, not inside the similarity functions at query or indexing time.

I am strongly -1 on applying anything like that!

The big question here: what are "large dimension" vectors? If the cosine gets NaN, the vector must be large. As discussed in issue #11507 and PR #12306, there is a limit of 1024 dimensions in Lucene - for a good reason! If you changed the constant in the codec and indexed shitillions of dimensions, it is not Lucene's problem. The functions here are tested to behave fine with up to 1024 dimensions.

BTW, this PR/issue is an argument for not raising the limit in Lucene beyond 1024 or 2048. Each additional dimension adds more rounding error!

There is no need to make the cosine of identical vectors return exactly 1 (it can never be exactly one). This is for scoring and sorting documents; the actual value does not matter. This is also why we do not care that the new Panama Vector based implementation returns different values because of a different order of execution and rounding.

@benwtrent (Member) left a review comment:

Even with removing the checkFinite for user-friendly errors, I would like to see some benchmarking.

I honestly don't know if summing into a double will make things slower or not. But if it's slower at all, I would rather not bother with it.

This is a hot path, and we don't want to slow it down at all for exceptional cases.

    return r;
  }

  private static void checkFinite(float r, float[] a, float[] b, String optype) {

Inline review comment (Member): it is indeed the caller's responsibility to ensure the passed vectors have valid values. We cannot do validations here, even for exceptional cases.

@rmuir (Member) commented Jun 10, 2023

I would move the isFinite check to KnnFloatVectorField.setVectorValue.

I would fix the KnnFloatVectorField ctor to actually call setVectorValue, as currently the first value passed in bypasses all checks.

This way the check only happens once at index time, and not on every comparison.
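rmuir's suggestion can be sketched roughly as follows. This is a hypothetical minimal class, not the actual Lucene source; the class and method names merely mirror KnnFloatVectorField for illustration:

```java
// Hypothetical sketch: the constructor delegates to setVectorValue(), so the
// finite/null checks run exactly once at index time and are never bypassed.
class SketchKnnFloatVectorField {
  private float[] vector;

  SketchKnnFloatVectorField(float[] vector) {
    setVectorValue(vector); // ctor no longer bypasses the checks
  }

  void setVectorValue(float[] vector) {
    if (vector == null) {
      throw new IllegalArgumentException("vector must not be null");
    }
    for (int i = 0; i < vector.length; i++) {
      if (!Float.isFinite(vector[i])) {
        throw new IllegalArgumentException(
            "non-finite value at vector[" + i + "]=" + vector[i]);
      }
    }
    this.vector = vector;
  }

  float[] vectorValue() {
    return vector;
  }
}
```

With this shape, the similarity functions never need to validate their inputs, because no non-finite vector can reach the index in the first place.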

@jbellis (Contributor, Author) commented Jun 10, 2023

I'm pretty sure that a single check for isFinite is going to be negligible compared to the cost of doing the computation -- checking the vectors only happens if that single check fails. We "know" that they must be finite b/c of index-time check but since we only do the full check if we lost so much precision during the computation that we get NaN/Infinity back, it seems reasonable to me to double-check.

I'll put together a benchmark to see how much the double math slows things down. Is there a threshold at which we're okay paying a small performance cost for improved precision?

@uschindler (Contributor) replied, quoting the above:

I'm pretty sure that a single check for isFinite is going to be negligible compared to the cost of doing the computation -- checking the vectors only happens if that single check fails. We "know" that they must be finite b/c of index-time check but since we only do the full check if we lost so much precision during the computation that we get NaN/Infinity back, it seems reasonable to me to double-check.

I'll put together a benchmark to see how much the double math slows things down. Is there a threshold at which we're okay paying a small performance cost for improved precision?

Please do not forget the new vectorized code! The default provider is not the only one!

@jbellis (Contributor, Author) commented Jun 10, 2023

Thanks for the reminder. Is working with the default provider a reasonable first step, or is there something about the vectorized code that prevents this entirely?

@rmuir (Member) commented Jun 10, 2023, quoting uschindler:

Please do not forget the new vectorized code! The default provider is not the only one!

That's the issue. I don't care about the performance of the scalar one, but the semantics must match.

If we widen to double, we can only process half the elements at once: calling F2D on both sides and expanding to 64-bit vectors before multiplication/addition. This conversion is expensive, and expanding is the only choice (working in "parts" is too slow). To avoid heavy slowdowns, we would have to do tricks that make the code look like the integer versions.

I don't think we should do this: let's keep this stuff simple, it is too hot. The safety checks can be added elsewhere.

@rmuir (Member) commented Jun 10, 2023

Widening to double is guaranteed to be at least a 2x slowdown for the vector code, since only half the elements can be processed in the same time.
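For intuition, the trade-off can be shown in a scalar sketch. This is illustrative only (the real concern is the Panama-vectorized implementation, where 64-bit accumulation halves the number of lanes per SIMD register), and the class name is hypothetical:

```java
// Scalar sketch of the two accumulation strategies under discussion. The
// float version mirrors what the vectorized code can do with full lanes;
// the double version is what the PR originally proposed, which would halve
// the elements processed per register in a SIMD implementation.
class DotProductSketch {
  static float dotFloat(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i]; // all arithmetic stays 32-bit; products can overflow to Infinity
    }
    return sum;
  }

  static float dotDouble(float[] a, float[] b) {
    double sum = 0d;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i]; // widened: larger exponent range, but 64-bit values
    }
    return (float) sum;
  }
}
```

The two versions also differ semantically: with large components the float version can overflow intermediate products to Infinity (and mixed signs then yield NaN), while the double version absorbs the intermediate overflow. Keeping scalar and vectorized semantics consistent is exactly why widening only the scalar path is not an option.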

@jbellis (Contributor, Author) commented Jun 10, 2023

Okay, thanks. How about this version then, which just does the sanity check?

@uschindler (Contributor) commented Jun 10, 2023

Why do we need the sanity check here?

You can do the sanity check directly in VectorUtil; there is no need to do it in either provider. The code in VectorUtil also does the argument checks, so move all code that cannot be directly vectorized to the VectorUtil class. Since the checks just call isFinite on the return value, that is the correct place: check the provider's return value.

Maybe do it as an assertion only?

@rmuir (Member) commented Jun 10, 2023

The check belongs in KnnVectorField, not here.

@rmuir (Member) commented Jun 10, 2023

See comment: #12281 (comment)

@uschindler (Contributor) replied, quoting rmuir:

The check belongs in KnnVectorField, not here.

Yes and no. In KnnVectorField it needs to scan through all vector components to check that each one is finite.

The check here applies isFinite only to the return value of the calculation: if one component is infinite, the result of the calculation is infinite.

The code only scans through the vector to produce a good exception message.

@uschindler (Contributor) commented Jun 10, 2023

My proposal would be:

  • Revert all changes to all providers. The change to double is out of the discussion now. We have some rounding errors for large vectors, but that has nothing to do with finite checks. We are a search engine that calculates float scores. The implementations in Panama and the unvectorized default one differ anyway. For a vector search engine it is just important to return a score; whether it is precise does not matter so much, as it is only used for sorting.
  • Add a static method VectorUtil#checkFinite(float[] vector) that throws IllegalArgumentException (and does the missing null check for free). There's no need to move it to the providers; the autovectorizer can handle it.
  • Call this method from both ctors in KNNVectorField. The ctors are currently missing the checks that are mentioned in the javadocs. E.g., there's no null check, which is a bug.

@uschindler (Contributor) added:

Also change the title of this PR to be "add checks in KNNVectorField to only allow non-null, non-empty and finite vectors".

@uschindler (Contributor) left a review comment:

see my proposal

@jbellis (Contributor, Author) commented Jun 10, 2023

Thanks! I've made the proposed changes.

It looks like KNNFloatVectorField.createType is doing the null check for both constructors, so I added checkFinite there.

@rmuir (Member) commented Jun 11, 2023

This patch is still mixing concerns and calling nonsense checks in similarity functions. Is this a troll?

-1

@uschindler (Contributor) commented Jun 11, 2023

If the index no longer has infinite values (checked during indexing), there's really no reason to call isFinite from the similarity functions.

If there's a query vector, it should also be checked, but in the query's ctor:

@uschindler (Contributor) commented Jun 11, 2023

Finally, I just want to understand this sentence in the description:

Cosine of two equal vectors is exactly 1, but we're losing too much precision on large-dimension vectors and ending up with NaN.

How can that be? The cosine can only get infinite/NaN if one of the arguments is infinite. Do you have an example where it gets NaN with finite vectors (overflow?)

I agree with this PR if you make the query also check its input and remove the finiteness checks from the similarity functions.

P.S.: We also do not allow other queries to return scores of NaN; but, as proposed here, we check the inputs of queries (like invalid or negative boosts). We do not check for invalid float scores on each scored document, which is what you want to do here, and that's not acceptable.

I would agree to adding it as an assertion, to trigger bugs in our testing of (possibly new) queries. But for that it is enough to do assert Float.isFinite(result), nothing more (no analysis of the vector arguments).

@jbellis (Contributor, Author) commented Jun 11, 2023, quoting uschindler:

it is enough to do assert Float.isFinite(result), nothing more (no analysis of vector arguments).

I don't understand why we wouldn't want to include the analysis. This is a result that we expect "can't happen", so it will be extremely valuable to know which scenario it is:

  • A mistake was introduced into the validation of parameters
  • A new code path appeared that didn't do validation properly
  • Somehow the float32 math didn't work the way we thought it should

It is basically impossible to tell which scenario without knowing the vector arguments, and there is no cost to having that code on the happy path, since it is only called if the assert fails.
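The pattern jbellis describes relies on Java evaluating an assert's message expression only when the condition fails, so the expensive per-component analysis costs nothing on the happy path. A rough illustration, with hypothetical names (not Lucene's actual code):

```java
// Sketch: the per-component scan sits behind an assert's message expression,
// which Java evaluates only when the asserted condition is false.
class AssertDiagnosticsSketch {
  static float cosineChecked(float[] a, float[] b, float result) {
    // describeBadVectors(...) runs only if the result is non-finite
    assert Float.isFinite(result) : describeBadVectors(result, a, b);
    return result;
  }

  static String describeBadVectors(float result, float[] a, float[] b) {
    StringBuilder sb =
        new StringBuilder("non-finite result " + result + "; offending components:");
    for (int i = 0; i < a.length; i++) {
      if (!Float.isFinite(a[i])) sb.append(" a[").append(i).append("]=").append(a[i]);
      if (!Float.isFinite(b[i])) sb.append(" b[").append(i).append("]=").append(b[i]);
    }
    return sb.toString();
  }
}
```

With assertions disabled (the default outside tests), the whole statement is skipped, so even the isFinite check on the result vanishes from production code paths.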

@uschindler (Contributor) replied, quoting jbellis:

Here is the example vector: https://gist.github.com/jbellis/8a9c42ee8ecdf603021498deddfcc243

This is an example where the result is not expected. But how would throwing an exception help? A query searching for the exact same vector (cosine=1) should return a valid score. So actually this is a different issue.

The original PR did not have that vector - or was it force-pushed away? The test in your latest PR was just about a 3-dim vector with infinity/NaN.

I suspect that even with double vector math this can't be prevented. How does the query fail downstream?

@jbellis (Contributor, Author) commented Jun 11, 2023

I force-pushed it away. Here is the original version: https://github.com/jbellis/lucene/tree/hnsw-nan-og

I ordered the commits this way specifically so that it was clear that switching to double math solved the issue with exploding to NaN.

@uschindler (Contributor) commented Jun 12, 2023

Hi,
so this looks like two different issues:

  • Incorrect validation of query and indexed field contents.
  • Some vectors cause NaN when the cosine is calculated. I am not sure how this can happen, because the cosine should only be NaN when the argument of the square root is negative or NaN. To me this indeed looks like a rounding error. Maybe we should guard the cosine so the argument cannot become negative. This can only happen in special cases, so maybe Math.abs() is fine as a workaround. Actually the cosine should never be outside [-1 .. 1].

I think instead of throwing an exception (which would look crazy to the end user, as a query or the indexing process would suddenly fail although the input vectors are valid), we should make sure that rounding errors cannot make the argument of the square root negative.

Can we fix the cosine problem due to the negative sqrt argument in a separate PR and leave this one alone?

@uschindler (Contributor) commented Jun 12, 2023

I think I know why it happens. In both providers we calculate the cosine like this:

  return (float) (sum / Math.sqrt(norm1 * norm2));

When we calculate the argument for sqrt, it may get negative for one of these reasons:

  • norm1 or norm2 is negative
  • we don't cast both factors to double before multiplication, so the float's exponent overflows

The other similarity functions are not affected, but we can possibly add a guard here by casting both floats to double before multiplying them.

@uschindler (Contributor) added:

I debugged through it: the vector causes norm1 and norm2, as well as sum, to become Infinity. Infinity/Infinity results in NaN.

So it is not caused by sqrt. In general, you are right that this might happen with any vector if the exponent overflows while summing up the squares of the components.

I am not sure how to work around that. Maybe it should return 1 if the sqrt gets infinite?
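uschindler's finding is easy to reproduce with a scalar sketch of the cosine formula (illustrative only, not Lucene's actual implementation; the class name is hypothetical):

```java
// Sketch of the cosine formula under discussion. With a single large
// component like 1e31f, the float squares overflow to Infinity, so sum,
// norm1 and norm2 all become Infinity, and Infinity / Infinity is NaN.
class CosineOverflowSketch {
  static float cosine(float[] a, float[] b) {
    float sum = 0f, norm1 = 0f, norm2 = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];     // overflows to Infinity when |a[i] * b[i]| > Float.MAX_VALUE
      norm1 += a[i] * a[i];
      norm2 += b[i] * b[i];
    }
    return (float) (sum / Math.sqrt((double) norm1 * norm2));
  }
}
```

Note that the square root itself is innocent here: the NaN comes purely from dividing two values that both overflowed to Infinity during the float summation.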

@benwtrent (Member) left a review comment:

This is a solid change. We should not allow infinite floats, and I am surprised that we didn't have null checks already 🤦 .

Thank you for moving the "isFinite" checks and only verifying responses if assertions are enabled.

@uschindler (Contributor) commented Jun 12, 2023

I think the related issue found here - some vectors creating a NaN cosine (which happens when the floats' exponents are too large and the result gets infinite after multiplication) - is a separate one. I think we should open an issue.

This vector causes the assert to trigger:

  public void testCosineNaN() {
    final float[] v = new float[] { 1.E31f };
    assertEquals(1f, VectorUtil.cosine(v, v), DELTA);
  }

I noticed that I did not check explicitly for "empty" vectors. I may add a test for that.

@uschindler (Contributor) added:

I think we could decide to also disallow such vectors. If the square of one of its components gets infinite, it should maybe also be rejected. What do you think?

  float y = 1e31f * 1e31f;
  System.err.println(y); // prints "Infinity"

Maybe we change the finite check to use the square?

@uschindler (Contributor) commented Jun 12, 2023

Actually, with the current similarity methods we should make sure that for each vector component v, the following holds: Float.isFinite(v * v * vector.length).

This is quite easy to implement in the isFinite check!

In my opinion this is a good compromise. It will help with all the different similarity functions, as all of them generally sum up the squares of the components (or similar). If somebody indexes or queries a vector that's too large, we should reject it from the beginning, maybe with a better error message than currently.

Otherwise we would really need to expand to double, because it allows larger exponents. The issue here is not the precision of the float (that is fine); the problem is the limit on the exponent.

@uschindler (Contributor) added:

I modified the function like that, but have not yet committed:

  /**
   * Checks that a float vector has only finite components, and that the square of each
   * component multiplied by the vector dimension is finite.
   *
   * @param v the vector to check
   * @return the vector, for call-chaining
   * @throws IllegalArgumentException if any component of the vector violates the contract
   */
  public static float[] checkFinite(float[] v) {
    for (int i = 0; i < v.length; i++) {
      float f = v[i] * v[i] * v.length;
      if (!Float.isFinite(f)) {
        throw new IllegalArgumentException(
            "non-finite or too large (with respect to dimensions) value at vector[" + i + "]=" + v[i]);
      }
    }
    return v;
  }

What do you think?

@uschindler (Contributor) added:

The above check should auto-vectorize in Hotspot, so the check during indexing/searching should be cheap.

@benwtrent (Member) replied:

@uschindler I think this change (v[i] * v[i] * v.length) is getting complicated. It really makes me wonder whether we should do any complex infinity checking beyond isFinite(float). For example, this check would reject vectors that may be valid for squareDistance.

If a specific similarity has NaN issues, then a more complicated isFinite check should account for those. But this then opens the door to how KnnFloatVectorQuery would even know about the similarity used, or how that could be checked at all.

I would prefer doing the simple float value validation for now and opening an issue to see if we can get a better API for per-similarity infinity checks (e.g. squareDistance vs cosine), and whether we should have those at all.

@uschindler (Contributor) replied:

That would be my plan! Let's open a new issue and discuss it there.

So we should merge this PR for now. Before doing that, I will only add the "vector size > 0" check in the API, because it may be missing in some places. A zero-length vector also causes havoc.

@uschindler uschindler self-assigned this Jun 12, 2023
@uschindler uschindler added this to the 9.7.0 milestone Jun 12, 2023
@uschindler (Contributor) commented Jun 12, 2023

Hi, I added the dimension check to the constructor which uses a predefined field type. For the query it can't be done in the constructor, as we do not know the field type there. The query will fail later anyway, so an explicit check is not needed.

I think this is ready to be merged. @jbellis, we should open a separate issue to find a solution for vectors where the float math overflows to infinity, leading to infinite scores. I strongly disagree with using double math; we should maybe have some documentation or checks like those proposed above. We could also enforce the check, but that's a harder decision! So it should be a separate issue.

It is very easy to reproduce with a single-dimension vector.

@jbellis (Contributor, Author) commented Jun 12, 2023

SGTM, thank you for taking the lead on investigating the root cause of the NaNs!

  @@ -137,7 +137,12 @@ public KnnByteVectorField(String name, byte[] vector, FieldType fieldType) {
              + " using byte[] but the field encoding is "
              + fieldType.vectorEncoding());
        }
  -     fieldsData = Objects.requireNonNull(vector, "vector value must not be null");
  +     Objects.requireNonNull(vector, "vector value must not be null");
  +     if (vector.length != fieldType.vectorDimension()) {
Inline review comment (Member):

Good catch!

@uschindler uschindler merged commit 071461e into apache:main Jun 13, 2023
4 checks passed
asfgit pushed a commit that referenced this pull request Jun 13, 2023: "… non-empty and finite vectors (#12281)" (Co-authored-by: Uwe Schindler <uschindler@apache.org>)
@uschindler (Contributor) added:

I did not see any slowdowns in last night's @mikemccand benchmark caused by the check during indexing and on building the query.

hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request Jul 18, 2023