Distance measures for dense and sparse vectors #37947

Merged
merged 11 commits into from Feb 20, 2019

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented Jan 29, 2019

Introduce painless functions of
cosineSimilarity and dotProduct distance
measures for dense and sparse vector fields.

{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'])",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'])",
        "params": {
          "queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 140.8, "4545": 111.0}
        }
      }
    }
  }
}
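
For readers unfamiliar with the two measures, here is a minimal sketch of what dot product and cosine similarity compute for dense vectors. It is not the code from this PR: the class name `DenseVectorMath` and its methods are hypothetical, and rejecting vectors of different lengths is an assumption made only for the sketch.

```java
// Illustrative sketch only: DenseVectorMath is a hypothetical helper,
// not a class from the Elasticsearch code base.
final class DenseVectorMath {

    // Sum of element-wise products, accumulated in double precision.
    static double dotProduct(double[] a, double[] b) {
        if (a.length != b.length) {
            // Assumption for this sketch: mismatched dimensions are an error.
            throw new IllegalArgumentException(
                "vectors have different dimensions: " + a.length + " vs " + b.length);
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Dot product divided by the product of the two magnitudes.
    // A zero-magnitude vector yields NaN in this sketch.
    static double cosineSimilarity(double[] a, double[] b) {
        double magnitudes = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
        return dotProduct(a, b) / magnitudes;
    }

    public static void main(String[] args) {
        double[] queryVector = {4, 3.4, -1.2};
        double[] docVector = {4.5, 3.4, -1.2};
        System.out.println("dotProduct = " + dotProduct(queryVector, docVector));
        System.out.println("cosineSimilarity = " + cosineSimilarity(queryVector, docVector));
    }
}
```

Cosine similarity ranges from -1 to 1, so a script that returns it directly can produce a negative value.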

Closes #31615

Introduce painless functions of
cosineSimilarity and dotProduct distance
measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```
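
As a rough illustration of the sparse case, the sketch below computes a dot product between a query given as a map of dimension to value (as in the `queryVector` parameter above) and a document vector. The class `SparseVectorMath`, the method name, and the representation of the document vector as two parallel arrays sorted by dimension are all assumptions made for this example, not the PR's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: names and data layout are hypothetical.
final class SparseVectorMath {

    // docDims must be sorted ascending; docValues[i] belongs to docDims[i].
    static double dotProductSparse(Map<String, ? extends Number> queryVector,
                                   int[] docDims, float[] docValues) {
        // Sort the query entries by numeric dimension.
        TreeMap<Integer, Double> sortedQuery = new TreeMap<>();
        for (Map.Entry<String, ? extends Number> entry : queryVector.entrySet()) {
            sortedQuery.put(Integer.parseInt(entry.getKey()), entry.getValue().doubleValue());
        }
        double sum = 0;
        int d = 0; // cursor into the document dimensions
        for (Map.Entry<Integer, Double> entry : sortedQuery.entrySet()) {
            int dim = entry.getKey();
            // Skip document dimensions that the query does not contain.
            while (d < docDims.length && docDims[d] < dim) {
                d++;
            }
            if (d == docDims.length) {
                break; // no document dimensions left to match
            }
            if (docDims[d] == dim) {
                sum += entry.getValue() * docValues[d];
            }
        }
        return sum;
    }
}
```

Only dimensions present in both vectors contribute, so the cost is proportional to the number of non-zero entries rather than to the highest dimension index.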

Closes elastic#31615
@mayya-sharipova mayya-sharipova added the :Search/Ranking Scoring, rescoring, rank evaluation. label Jan 29, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

Contributor

@jpountz jpountz left a comment

I only had a quick look, one concern that I have is that we are leaking the internal representation of vector fields.

I believe we should instead expose vectors in scripts via a dedicated ScriptDocValues sub-class, like we are doing for dates for instance, or only give access to vector fields via functions, whose signature would look like dotProduct(queryVector, fieldName).

@@ -9,7 +9,8 @@ not exceed 500. The number of dimensions can be
 different across documents. A `dense_vector` field is
 a single-valued field.

-These vectors can be used for document scoring.
+These vectors can be used for
+{ref}/query-dsl-script-score-query.html#vector-functions[document scoring].
Contributor

Is there a reason not to use an internal link, e.g. <<vector-functions,document scoring>>?

Contributor Author

Thanks Adrien. I think we can use internal links only to reference within the same document. What I wanted to do here is reference a section of an external document.

Contributor

I'm still a bit confused, this is the same document, isn't it?

Contributor Author

@jpountz Sorry Adrien, I meant that inside one asciidoc doc dense-vector.asciidoc we want to reference a section of another asciidoc doc script-score-query.asciidoc.

We can indeed use a simpler format: <<query-dsl-script-score-query,document_scoring>>, but this will link to the whole document. As I understood after talking with the documentation team, the only way to link to a section of another doc is to use this full html link.

Contributor

ok

Contributor Author

@jpountz Sorry Adrien, please disregard my previous comments. I have followed your advice to use internal links and it looks like documentation CI passed.

this.queryVectorMagnitude = (float) Math.sqrt(dotProduct);
}

public float cosineSimilarity(BytesRef docVectorBR) {
Contributor

I would make these methods return a double. We only support floats at index time because of space constraints, but this isn't a problem here.

Contributor Author

@jpountz Thanks for the review, Adrien. I will change this to double. The main reason for float was that it is a document's score, and all other Scorers are returning floats.

@mayya-sharipova
Contributor Author

@jpountz Thanks for the initial review, Adrien. I have tried to address your comments and this PR is ready for the review when you have time:

  • I have made functions return double instead of float
  • I have modified the format of the functions from dotProduct(queryVector, doc['my_dense_vector'].value) to dotProduct(queryVector, doc['my_dense_vector'])
    I could not make them exactly as you suggested, dotProduct(queryVector, fieldName), inside the painless script.

About exposing vectors in scripts via a dedicated ScriptDocValues sub-class - this was already initially implemented through VectorScriptDocValues.java.

About leaking the internal representation of vector fields - I have made the getValue() method of VectorScriptDocValues package private, so that vector fields
are NOT accessible in scripts, sorting, or aggs outside our distance functions. Or are you concerned that vector values are returned as a part of the search response, as below?

"hits": [
      {
        "_index": "dindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0000001,
        "_source": {
          "my_text": "text2",
          "my_vector": [
            4.5,
            3.4,
            -1.2
          ]
        }
      },

@LiuGangR

LiuGangR commented Jan 31, 2019

How do I use cosineSimilarity?
I am using a snapshot built from your branch 'vector-field-query'.

It just tells me '"lang":"painless","caused_by":{"type":"illegal_argument_exception","reason":"Variable [my_feature] is not defined'

{"error":{"root_cause":[{"type":"script_exception","reason":"compile error","script_stack":["... (params.queryVector, doc[my_feature].value)"," ^---- HERE"],"script":"cosineSimilarity(params.queryVector, doc[my_feature].value)","lang":"painless"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test_index","node":"mZKn55wJSSi-vs3hMxocbQ","reason":{"type":"query_shard_exception","reason":"script_score: the script could not be loaded","index_uuid":"Q8lJHJLLRIatXSOY1_2UJg","index":"test_index","caused_by":{"type":"script_exception","reason":"compile error","script_stack":["... (params.queryVector, doc[my_feature].value)"," ^---- HERE"],"script":"cosineSimilarity(params.queryVector, doc[my_feature].value)","lang":"painless","caused_by":{"type":"illegal_argument_exception","reason":"Variable [my_feature] is not defined."}}}}],"caused_by":{"type":"script_exception","reason":"compile error","script_stack":["... (params.queryVector, doc[my_feature].value)"," ^---- HERE"],"script":"cosineSimilarity(params.queryVector, doc[my_feature].value)","lang":"painless","caused_by":{"type":"illegal_argument_exception","reason":"Variable [my_feature] is not defined."}}},"status":400}

and this is my mapping
{ "test_index": { "mappings": { "properties": { "my_feature": { "type": "dense_vector" } } } } }

Contributor

@jpountz jpountz left a comment

Thanks Mayya, I like this approach much more. I left some minor comments. One additional thing that would be nice to address would be to make sure that users get a nice error if they call the sparse functions on dense vectors or vice-versa, I have the feeling that users would get cryptic decoding errors if they do that with the current state of your PR?


@Override
public SortedBinaryDocValues getBytesValues() {
return null;
Contributor

can you throw an exception instead?

@@ -74,6 +74,108 @@ to be the most efficient by using the internal mechanisms.
--------------------------------------------------
// NOTCONSOLE

[[vector-functions]]
===== Distance functions for vector fields
These functions are used to calculate distances
Contributor

Let's maybe avoid mentioning "distance" since e.g. cosineSimilarity measures the similarity between two vectors rather than their distance?

// NOTCONSOLE

NOTE: If a document doesn't have a value for a vector field on which
a distance function is executed, 0 will be returned as a result.
Contributor

Let's also clarify what happens for dense vectors if they don't have the same number of dimensions?

public static int[] decodeSparseVectorDims(BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalStateException("A document doesn't have a value for a vector field!");
}
Contributor

Shouldn't this be an illegal argument exception?

int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
queryDims[i] = Integer.parseInt(dimValue.getKey());
queryValues[i] = dimValue.getValue().floatValue();
Contributor

s/floatValue/doubleValue/?

double dotProduct = 0;
int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
queryDims[i] = Integer.parseInt(dimValue.getKey());
Contributor

catch the NumberFormatException to return a more user-friendly exception?
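
One way to read this suggestion is sketched below: wrap the low-level parse failure in an exception whose message names the offending key. The class `SparseDimParser`, the method `parseDim`, and the message wording are hypothetical and only illustrate the idea.

```java
// Illustrative sketch only: SparseDimParser is a hypothetical helper.
final class SparseDimParser {

    // Parses one key of a sparse query vector, e.g. "4545" -> 4545.
    static int parseDim(String key) {
        try {
            return Integer.parseInt(key);
        } catch (NumberFormatException e) {
            // Report the failure in terms the user can act on, keeping the cause.
            throw new IllegalArgumentException(
                "Failed to parse sparse vector dimension [" + key
                    + "]: dimensions must be integers", e);
        }
    }
}
```

Note that `NumberFormatException` already extends `IllegalArgumentException`, so the gain here is the clearer message rather than a different exception family.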

// calculate docVector magnitude
double dotProduct = 0;
for (float value : docValues) {
dotProduct += value * value;
Contributor

cast one of the values to a double to have better accuracy and avoid overflows?
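
The point can be shown with a small sketch. In `value * value` both operands are floats, so the multiplication happens in float precision even when the result is added to a double accumulator; casting one operand widens the multiplication itself. The class and method names below are hypothetical.

```java
// Illustrative sketch only: MagnitudeDemo is a hypothetical helper.
final class MagnitudeDemo {

    // float * float is evaluated in float and can overflow to Infinity
    // before it is ever added to the double accumulator.
    static double sumOfSquaresFloatMultiply(float[] values) {
        double sum = 0;
        for (float value : values) {
            sum += value * value;
        }
        return sum;
    }

    // Casting one operand makes the multiplication happen in double.
    static double sumOfSquaresDoubleMultiply(float[] values) {
        double sum = 0;
        for (float value : values) {
            sum += (double) value * value;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] large = {1e20f};
        System.out.println(sumOfSquaresFloatMultiply(large));
        System.out.println(sumOfSquaresDoubleMultiply(large));
    }
}
```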


VectorDVAtomicFieldData(BinaryDocValues values) {
super();
this.values = values;
Contributor

let's take a LeafReader and a String field like other impls do and re-pull binary doc values each time, this way calling getScriptDocValues() multiple times on the same AtomicFieldData instance will work as expected

}

// package private access only for {@link ScoreScriptUtils}
BytesRef getValue() {
Contributor

let's call it something like getEncodedValue to clarify what it is about?

@jpountz
Contributor

jpountz commented Jan 31, 2019

@LiuGangR You need to put quotes around the field name.

@LiuGangR

LiuGangR commented Jan 31, 2019

@jpountz Thanks!
This query is working.
{ "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.queryVector, doc[\"my_feature\"])", "params": { "queryVector": [4, 3.4, -1.2] } } } } }

@LiuGangR

But there is a new problem:

script score function must not produce negative scores

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"script score function must not produce negative scores, but got: [-0.1967234265776135]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test_index","node":"mZKn55wJSSi-vs3hMxocbQ","reason":{"type":"illegal_argument_exception","reason":"script score function must not produce negative scores, but got: [-0.1967234265776135]"}}],"caused_by":{"type":"illegal_argument_exception","reason":"script score function must not produce negative scores, but got: [-0.1967234265776135]","caused_by":{"type":"illegal_argument_exception","reason":"script score function must not produce negative scores, but got: [-0.1967234265776135]"}}},"status":400}

@jpountz
Contributor

jpountz commented Jan 31, 2019

This is a good point, we should update examples so that they may only create positive scores, regardless of what vectors are indexed.

@LiuGangR

LiuGangR commented Feb 1, 2019

@jpountz
That is cool. Do you have a plan for which version will support that?

@jpountz
Contributor

jpountz commented Feb 1, 2019

@LiuGangR Hopefully 7.1.

@LiuGangR

LiuGangR commented Feb 1, 2019

@jpountz
Another question: if I want to search a 'dense_vector' field, is 'cosineSimilarity' the only way? And is there a default vector query?
Thanks!

@mayya-sharipova
Contributor Author

@LiuGangR yes, the only way to use dense_vector or sparse_vector in queries is through cosineSimilarity and dotProduct functions

@mayya-sharipova
Contributor Author

@jpountz Thanks Adrien for another review. I have addressed all your feedback except 1 comment, and this PR is ready for another round of review whenever you have time.

Unaddressed feedback:

One additional thing that would be nice to address would be to make sure that users get a nice error if they call the sparse functions on dense vectors or vice-versa, I have the feeling that users would get cryptic decoding errors if they do that with the current state of your PR?

Users can make two mistakes here:

  1. Provide a query vector in the wrong format. Here we have some safeguards for parseInt; otherwise the painless script engine will complain and I can't do anything (e.g. queryVector is expected to be a Map but an Array was provided).
  2. Provide a document vector in the wrong format (dense versus sparse). It looks like a user will not see failures here, but will see unexpected scores (either 0, or very large negative float numbers). The only way to prevent this is to use the first byte in the BytesRef as a special value that tells us whether the encoded vector is dense or sparse. What do you think?

About changing the encoding for vector fields, I was also thinking of possibly encoding the magnitude of a document vector, so that we don't have to calculate it each time. What do you think about this?
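
To make the two encoding ideas concrete, here is a hedged sketch of a byte layout with a leading type tag and a precomputed magnitude. Everything in it (the class `TaggedVectorEncoding`, the tag values, and the layout itself) is an assumption for illustration and is not the format Elasticsearch uses.

```java
import java.nio.ByteBuffer;

// Hypothetical encoding, for illustration only.
// Layout: [1 byte type tag][4 bytes magnitude][4 bytes per value]
final class TaggedVectorEncoding {
    static final byte DENSE = 0;
    static final byte SPARSE = 1;

    static byte[] encodeDense(float[] values) {
        double sumOfSquares = 0;
        for (float v : values) {
            sumOfSquares += (double) v * v;
        }
        ByteBuffer buffer = ByteBuffer.allocate(1 + Float.BYTES * (values.length + 1));
        buffer.put(DENSE);
        buffer.putFloat((float) Math.sqrt(sumOfSquares)); // computed once at index time
        for (float v : values) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    // Reads the stored magnitude without decoding the values.
    static float decodeMagnitude(byte[] encoded) {
        requireTag(encoded, DENSE);
        return ByteBuffer.wrap(encoded).getFloat(1);
    }

    static float[] decodeDenseValues(byte[] encoded) {
        requireTag(encoded, DENSE);
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        buffer.position(1 + Float.BYTES); // skip the tag and the magnitude
        float[] values = new float[buffer.remaining() / Float.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getFloat();
        }
        return values;
    }

    // Fails fast with a readable message instead of producing a garbage score.
    private static void requireTag(byte[] encoded, byte expected) {
        if (encoded.length == 0 || encoded[0] != expected) {
            throw new IllegalArgumentException(
                "encoded vector does not have the expected type tag [" + expected + "]");
        }
    }
}
```

With such a tag, calling a dense function on a sparse field could be rejected up front rather than silently returning an unexpected score.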

@LiuGangR

@jpountz
I built from source today, from the latest commit of vector-field-query, and used the data and search that succeeded with the previous build. But it failed.
This is the log:

{"error":{"root_cause":[{"type":"script_exception","reason":"runtime error","script_stack":["cosineSimilarity(params.queryVector, doc[\"my_feature\"])"," ^---- HERE"],"script":"cosineSimilarity(params.queryVector, doc[\"my_feature\"])","lang":"painless"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"test_index","node":"1FxxsdfiRfGh_O_3OAe6ZA","reason":{"type":"script_exception","reason":"runtime error","script_stack":["cosineSimilarity(params.queryVector, doc[\"my_feature\"])"," ^---- HERE"],"script":"cosineSimilarity(params.queryVector, doc[\"my_feature\"])","lang":"painless","caused_by":{"type":"class_cast_exception","reason":"class org.elasticsearch.index.query.VectorScriptDocValues cannot be cast to class org.apache.lucene.util.BytesRef (org.elasticsearch.index.query.VectorScriptDocValues is in unnamed module of loader java.net.FactoryURLClassLoader @46046c06; org.apache.lucene.util.BytesRef is in unnamed module of loader 'app')"}}}]},"status":400}

@wmelton

wmelton commented May 13, 2019

@mayya-sharipova - For clarification, does this native vector function use source values for the computations or the document values? I only ask because there seem to be performance degradations any time source values are accessed at query time for queries like this.

Only asking because the documentation seems to suggest the use of _source values - https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-script-score-query.html#vector-functions

Also - do you have any performance numbers you've run/tested? Someone mentioned this feature was being added and said a test with 5 Million documents with vectors of dim=300 took 5 seconds to return results, which seems like pretty anemic response times.

@mayya-sharipova
Contributor Author

@wmelton Answering your questions:

For clarification, does this native vector function use source values for the computations or the document values?

We use binary document values, we encode vectors as binaries during indexing, and decode them back to numeric vectors during search.

do you have any performance numbers you've run/tested?

No, we currently don't have any, but we plan to work on adding some benchmarks. Vector functions use linear scan over all matched docs, so the response time should increase linearly with the number of matched docs.
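
A linear scan of this kind can be sketched as follows: every matched document is scored once and only the best k are kept. The class `LinearScan`, the use of a dot product as the score, and the in-memory `double[][]` of document vectors are assumptions for the example, not how Elasticsearch stores or iterates documents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: a brute-force scan over in-memory vectors.
final class LinearScan {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns the indices of the k highest-scoring documents, best first.
    // Work grows linearly with docs.length because every document is scored.
    static List<Integer> topK(double[] query, double[][] docs, int k) {
        // Min-heap ordered by score; each element is {score, docIndex}.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((x, y) -> Double.compare(x[0], y[0]));
        for (int i = 0; i < docs.length; i++) {
            heap.add(new double[] {dot(query, docs[i]), i});
            if (heap.size() > k) {
                heap.poll(); // drop the current lowest score
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add((int) heap.poll()[1]);
        }
        Collections.reverse(result); // heap yields lowest first
        return result;
    }
}
```

Because each matched document costs one full vector comparison, narrowing the inner query is the main lever for keeping response times down.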

Also, I would like to note that vector fields are an experimental feature, and the APIs and the way vectors are indexed and encoded may change in a non-backward-compatible way.

@wmelton

wmelton commented May 19, 2019

Hi @mayya-sharipova -

Thank you for your responses.

Regarding "Vector functions use linear scan over all matched docs, so the response time should increase linearly with the number of matched docs." - I think taking the linear approach for this is a mistake, personally.

The pL2AP algorithm and Facebook's open source FAISS (Facebook AI Similarity Search) both highlight ways to parallelize the search space. I think implementing a linear search approach will be frustrating to the type of users who are actually the most likely to want to use the dense or sparse vector field type you are proposing to add.

@mayya-sharipova
Contributor Author

@wmelton Thanks for your comment. Indeed linear scan would not scale, and it is intended mostly to score a limited set of documents.

About FAISS library, the speed ups there are based on the hardware acceleration and approximate knn algorithms. We currently don't have plans to employ hardware acceleration, but we are exploring algorithms for approximate knn.

@ra1ski

ra1ski commented May 27, 2019

@mayya-sharipova
Hi! Is there any chance to use long dense vectors to compute cosine distance?
I have these kinds of vectors
[0.7831882238388062, 0.8473913073539734, 0.6641695499420166...]

with 200 floating point numbers

@ra1ski

ra1ski commented May 28, 2019

@mayya-sharipova
Hi! Is there any chance to use long dense vectors to compute cosine distance?
I have these kinds of vectors
[0.7831882238388062, 0.8473913073539734, 0.6641695499420166...]

with 200 floating point numbers

Your example above with "queryVector": [ 4.5, 3.4, -1.2] works fine, but when it comes to [0.7831882238388062, 0.8473913073539734, 0.6641695499420166...] vectors, I get an error:
"caused_by": { "type": "script_exception", "reason": "compile error", "script_stack": [ "cosineSimilarity(params.q ...", "^---- HERE" ], "script": "cosineSimilarity(params.queryVector, doc['vector'])", "lang": "painless", "caused_by": { "type": "illegal_argument_exception", "reason": "Unknown call [cosineSimilarity] with [2] arguments." } }

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 28, 2019

@ra1ski What do you mean by "long dense vectors"? Do you mean to use 200 dimensions? Yes, you can use up to 1024 dimensions. It should be fine.
I am not sure why you are experiencing this error. Can you provide the whole query? Also are you testing this against the current master?

@ra1ski

ra1ski commented May 29, 2019

@ra1ski What do you mean by "long dense vectors"? Do you mean to use 200 dimensions? Yes, you can use up to 1024 dimensions. It should be fine.
I am not sure why you are experiencing this error. Can you provide the whole query? Also are you testing this against the current master?

Yes, 200 dimensions.
I'm using it against 7.0.0. master, tried with 7.1.1 also

Here is the query

{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['vector'])",
        "params": {
          "queryVector": [0.7831882238388062, 0.8473913073539734, 0.6641695499420166, -0.7800988554954529, 0.6427151560783386, 0.8618375062942505, -0.7508959174156189, 0.8940073251724243, -0.8382183313369751, -0.8465797305107117, 0.8887408375740051, 0.8348124623298645, 0.7685972452163696, -0.8586599230766296, 0.7378193140029907, -0.7119467854499817, -0.8077011108398438, 0.8601088523864746, 0.8935535550117493, 0.6392208337783813, 0.8716743588447571, -0.7871374487876892, 0.6682323217391968, -0.8151301145553589, -0.8227899670600891, -0.7399943470954895, -0.897373378276825, 0.8426622152328491, 0.8269796371459961, 0.8424233198165894, 0.8509830236434937, -0.7777097821235657, 0.8377213478088379, 0.9059052467346191, 0.7352653741836548, -0.7400990128517151, -0.8934587240219116, -0.9130118489265442, -0.8574285507202148, -0.8946468234062195, 0.8552821278572083, 0.8763160705566406, -0.7989016771316528, -0.642711341381073, -0.7476733922958374, -0.8486865758895874, 0.8278630971908569, -0.8525271415710449, -0.8806391954421997, -0.6730614304542542, -0.881908118724823, 0.7430080771446228, 0.7847618460655212, 0.8260719180107117, -0.8224948644638062, -0.7607067823410034, 0.8367544412612915, 0.20206642150878906, 0.7692943215370178, -0.8679789304733276, -0.7517973780632019, -0.8642300367355347, -0.7322789430618286, -0.8890762329101562, -0.8113778829574585, -0.8182528614997864, -0.8263254165649414, 0.8806875944137573, -0.8628260493278503, 0.838936984539032, 0.8677369952201843, -0.776382565498352, 0.8289804458618164, 0.6592877507209778, -0.8425590395927429, -0.763074517250061, 0.8569432497024536, -0.7417001128196716, 0.8681409955024719, -0.8540714979171753, -0.8500930070877075, -0.8368064761161804, -0.8406449556350708, -0.8733716011047363, -0.8958595991134644, 0.8130819201469421, -0.8314911723136902, 0.8423287272453308, 0.8449920415878296, -0.8795095682144165, 0.7511520981788635, -0.8035956621170044, 0.7193001508712769, 0.7730565071105957, -0.857988178730011, 0.8187726140022278, 
0.831302285194397, 0.8996239900588989, -0.863531231880188, 0.8358138799667358, -0.8426796197891235, 0.8390976190567017, 0.7986222505569458, -0.8568884134292603, 0.8369844555854797, 0.8447090983390808, 0.8311792612075806, -0.8208156824111938, -0.7700560092926025, -0.784808874130249, -0.874031662940979, 0.8473763465881348, 0.8083603978157043, 0.8634394407272339, 0.8724079132080078, -0.7952577471733093, 0.5091663599014282, 0.656829833984375, -0.8029653429985046, -0.8171727061271667, 0.8314194679260254, -0.8559287190437317, 0.8022019267082214, 0.7917070388793945, -0.8446627855300903, -0.7673274278640747, 0.832277774810791, -0.8024963140487671, 0.9498147964477539, -0.7452983856201172, 0.8978539705276489, 0.8834426999092102, 0.8543949127197266, 0.8466156721115112, -0.8207280039787292, 0.8191858530044556, -0.8309515118598938, 0.7519159317016602, 0.8341091275215149, -0.8656532168388367, 0.8573458790779114, -0.8247408866882324, 0.9135391116142273, -0.8272571563720703, -0.8448845148086548, -0.8408781290054321, -0.8409822583198547, -0.842566967010498, 0.7356223464012146, 0.8904960751533508, 0.8448322415351868, -0.8642748594284058, 0.8605462908744812, 0.8045945167541504, -0.8715876340866089, -0.8079540133476257, -0.8474785089492798, -0.8472393155097961, 0.8432945013046265, -0.8253397941589355, 0.7905577421188354, 0.7081928253173828, 0.6722716093063354, 0.8101333379745483, -0.8465112447738647, 0.8858150243759155, 0.8352972269058228, -0.7904651761054993, -0.8659583330154419, -0.8847810626029968, -0.762391209602356, -0.7752716541290283, -0.7860286831855774, -0.8350412249565125, -0.8377161026000977, -0.8326281309127808, 0.6579743027687073, -0.8490581512451172, 0.7932018041610718, 0.7292879819869995, 0.8307300806045532, 0.8333244323730469, -0.7778127193450928, -0.8621459007263184, -0.8240952491760254, 0.8149698376655579, 0.8036678433418274, 0.7759568691253662, -0.8074528574943542, -0.8319423794746399, -0.685379683971405, -0.6155311465263367, 0.771338701248169, 0.7577664256095886, 
0.7837430238723755, -0.7604954838752747, 0.8120626211166382, -0.8959243893623352, -0.7081544995307922, 0.8636442422866821]
        }
      }
    }
  }
}

@mayya-sharipova
Contributor Author

@ra1ski Vector functions are available starting from 7.2

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jul 9, 2019
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to elastic#37947
mayya-sharipova added a commit that referenced this pull request Jul 11, 2019
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to #37947
@csenol

csenol commented Aug 6, 2019

Hi @mayya-sharipova. These two functions are not available in painless when sorting with a painless script. I think they are not available in the so-called Sort Context.

Is there a reason for that?

Example query

{
  "_source": {
    "excludes": [
      "*"
    ]
  },
  "from": 0,
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        {
          "match_all": {}
        }
      ]
    }
  },
  "sort": {
    "_script": {
      "order": "desc",
      "script": {
        "lang": "painless",
        "params": {
          "user_vector": [
            140,
            45
          ]
        },
        "source": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);"
      },
      "type": "number"
    }
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [
          "\nreturn dotProduct(params.user_ve ...",
          "        ^---- HERE"
        ],
        "script": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test",
        "node": "LwRebTTXS3CgSzdYSqpFFA",
        "reason": {
          "type": "script_exception",
          "reason": "compile error",
          "script_stack": [
            "\nreturn dotProduct(params.user_ve ...",
            "        ^---- HERE"
          ],
          "script": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);",
          "lang": "painless",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Unknown call [dotProduct] with [2] arguments."
          }
        }
      }
    ]
  },
  "status": 400
}

@csenol

csenol commented Aug 6, 2019

This limitation makes vectors a bit useless in script sorting. One could try to implement dotProduct or cosineSimilarity in painless, but that is not possible: the accessor for the encoded vector value is package private, which makes decoding the values impossible.

@mayya-sharipova
Contributor Author

@csenol Thanks for letting us know about this issue. Indeed, vector functions are not available in the Sort Script Context in 7.3. We have made a patch to add them to the Sort Context from 7.4.
But please be aware that vector functions on vectors with high dims can be quite slow, so you should limit the number of matched docs on which vector functions are applied by using more restrictive filters.

@csenol

csenol commented Aug 6, 2019

@mayya-sharipova Thanks a million for such a quick action. I wish it were a fix in 7.3.x instead of waiting for 7.4, but anyway, thanks a lot.

@SthPhoenix

Hi, @mayya-sharipova! Thanks for great work with vector scoring! In my setup it can sort 3m 512d vectors in ~1200ms, and in pair with LSH it can achieve around 100ms while scoring 120k top matches (10k per shard).

The only issue I've found so far is that when using dotProduct on normalized vectors, the score can fall in the range [-1, 1], which can cause an error with negative scores. Currently I'm working around it by normalizing the score to the range [0, 1] with the following: (1.0 + dotProduct(params.queryVector, doc["vector"]))/2.0
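
The arithmetic of this workaround can be checked with a short sketch. The class `ScoreNormalization` and its method name are hypothetical; the formula is the one quoted in the comment and assumes unit-length vectors, so that the dot product lies between -1 and 1.

```java
// Illustrative sketch only: ScoreNormalization is a hypothetical helper.
final class ScoreNormalization {

    // Maps a dot product of unit vectors from [-1, 1] to [0, 1].
    // Assumes the input really is within [-1, 1]; it is not clamped.
    static double toNonNegativeScore(double dotProduct) {
        return (1.0 + dotProduct) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(toNonNegativeScore(-1.0)); // opposite vectors
        System.out.println(toNonNegativeScore(0.0));  // orthogonal vectors
        System.out.println(toNonNegativeScore(1.0));  // identical vectors
    }
}
```

The mapping is monotonic, so the ranking of documents is the same as with the raw dot product; only the score values shift into the non-negative range.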

@mayya-sharipova
Contributor Author

@SthPhoenix Thanks for the info on your setup.

in pair with LSH

I was wondering which LSH implementation you are using?

@SthPhoenix

I'm using my fork of @alexklibisz elastik-nearest-neighbors plugin.
But my fork is a bit outdated compared to my local version.

@@ -119,8 +120,7 @@ public Query existsQuery(QueryShardContext context) {

     @Override
     public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
-        throw new UnsupportedOperationException(
-            "Field [" + name() + "] of type [" + typeName() + "] doesn't support sorting, scripting or aggregating");
+        return new VectorDVIndexFieldData.Builder(true);
Member

Hi @mayya-sharipova, I've been doing some digging to figure out why fielddataBuilder looks like it makes the dense vector field aggregatable, while docValueFormat says throw new UnsupportedOperationException("Field [" + name() + "] of type [" + typeName() + "] doesn't support docvalue_fields or aggregations");. Is that correct, or does docValueFormat need updating? I am totally fine with fixing it if needed, just trying to put all the pieces together first. ;)

@jtibshirani jtibshirani added :Search/Vectors Vector search and removed :Search/Ranking Scoring, rescoring, rank evaluation. labels Jul 21, 2022
Development

Successfully merging this pull request may close these issues.

Introduce vector field, vector query and rescoring based on them