Introduce vector field, vector query and rescoring based on them #31615

Closed
mayya-sharipova opened this issue Jun 27, 2018 · 21 comments

@mayya-sharipova
Contributor

commented Jun 27, 2018

Introduce a new field of type vector on which vector calculations can be done during the rescoring phase.

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}

Indexing

Allow only a single value per document
Allow to index both dense and sparse vectors?

Dense form:

PUT my_index/_doc/1
{
  "my_feature":   [11.5, 10.4, 23.0]
}

Sparse form (represented as list of dimension names and values for corresponding dimensions):

PUT my_index/_doc/1
{
  "my_feature": {"1": 11.5, "5": 10.5,  "101": 23.0}
}

Query and Rescoring

Introduce a special type of vector query:

"vector" : {
   "field" : "my_feature",
    "query_vector": {"1": 3, "5": 10.5,  "101": 12}
}

This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:

  1. If a document doesn't have a vector value for field, a score of 0 is returned.
  2. If a document does have a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
    dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}
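
For illustration only, the scoring rule above can be sketched outside Elasticsearch. This is a minimal Python sketch, not the proposed implementation: the function names, the hit layout (plain dicts carrying a `my_feature` entry), and the choice to return 0 for zero-magnitude vectors are assumptions, and the re-ranking simply replaces the order inside the window rather than combining scores as a real rescorer might.

```python
import math


def cosine_similarity(doc_vector, query_vector):
    """Cosine similarity between two sparse vectors given as {dimension: value} dicts.

    Returns 0.0 when the document has no vector. Returning 0.0 for a
    zero-magnitude vector is an assumption of this sketch.
    """
    if not doc_vector or not query_vector:
        return 0.0
    dot = sum(value * query_vector.get(dim, 0.0) for dim, value in doc_vector.items())
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)


def rescore(hits, query_vector, window_size=50):
    """Re-rank the first window_size hits by cosine similarity; keep the rest in order."""
    window, rest = hits[:window_size], hits[window_size:]
    window = sorted(
        window,
        key=lambda hit: cosine_similarity(hit.get("my_feature"), query_vector),
        reverse=True,
    )
    return window + rest
```

Documents without a `my_feature` value score 0 and therefore sink to the bottom of the window, which matches rule 1 above.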

Internal encoding

  1. Encoding of vectors:
    Internally both dense and sparse vectors are encoded as sorted hash?
    Thus dense array is transformed:
    [4, 12] -> {0: 4, 1: 12}
    Keys are sorted, so we can iterate over them instead of calculating hash

  2. What should be values in vectors?

    • floats?
    • smaller than floats? (lose some precision here, but smaller index size)
  3. Vectors are encoded as binaries.
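
To make the "sorted keys" idea concrete, here is a rough Python sketch under stated assumptions: `encode` and `dot_product` are hypothetical names, and the in-memory layout (two parallel lists ordered by dimension) is only one way to read the proposal. Because both vectors are sorted by dimension, the dot product can be computed with a two-pointer merge instead of hash lookups.

```python
def encode(vector):
    """Turn a dense list or a sparse {dimension: value} dict into two
    parallel lists (dims, values) sorted by dimension."""
    if isinstance(vector, dict):
        items = sorted((int(dim), float(value)) for dim, value in vector.items())
    else:
        # A dense array [4, 12] becomes dimensions 0, 1 with values 4.0, 12.0.
        items = [(i, float(value)) for i, value in enumerate(vector)]
    dims = [dim for dim, _ in items]
    values = [value for _, value in items]
    return dims, values


def dot_product(a, b):
    """Dot product of two encoded vectors via a two-pointer merge over sorted dims."""
    (dims_a, vals_a), (dims_b, vals_b) = a, b
    i = j = 0
    total = 0.0
    while i < len(dims_a) and j < len(dims_b):
        if dims_a[i] == dims_b[j]:
            total += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif dims_a[i] < dims_b[j]:
            i += 1
        else:
            j += 1
    return total
```

With the vectors from the examples above, `dot_product(encode({"1": 3, "5": 10.5, "101": 12}), encode({"1": 11.5, "5": 10.5, "101": 23.0}))` evaluates to 420.75.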

@elasticmachine

Collaborator

commented Jun 27, 2018

Pinging @elastic/es-search-aggs

@jpountz

Contributor

commented Jun 27, 2018

This query can only be used in the rescoring context.

If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation, QueryRescorer, but we can add more of them; see, e.g., https://github.com/elastic/elasticsearch/tree/master/plugins/examples/rescore). We might also want to give it a more explicit name like cosine_similarity?

@etienne1985

commented Jul 3, 2018

Hi, commenting here at @mayya-sharipova's invitation. Our use case is that we want to use ES to search for sentences whose meaning is similar to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would presumably be 100-300 most of the time. Cosine similarity would be my starting point for computing the similarity of embeddings.

@james-daily

commented Jul 3, 2018

Allow only a single value per document

Do you mean only one vector field per document or only one value for each field? It would be useful to allow more than one vector field per document for testing different embeddings, dimensionalities, etc. Something like:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "GloVe": {
          "type": "vector"
        },
        "word2vec": {
          "type": "vector"
        }
      }
    }
  }
}
@mayya-sharipova

Contributor Author

commented Jul 3, 2018

@james-daily Thanks for your feedback, James. Sorry, by "a single value per document" we meant a single value per field, so it will be possible to have several vector fields.

@djptek

commented Jul 27, 2018

Have you considered Manhattan distance as a cheaper alternative in terms of processing? It will not deliver the same results, but it can be comparable for ranking vectors while delivering higher throughput than Euclidean or cosine.
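
As a rough illustration of the measure being suggested, here is a minimal Python sketch; the function names and the `(doc_id, vector)` tuple layout are assumptions made for the example. Note that Manhattan distance is a distance (lower means closer) and depends on vector magnitude, whereas cosine similarity does not, so the two can rank documents differently.

```python
def manhattan_distance(a, b):
    """L1 distance between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_manhattan(docs, query_vector):
    """Sort (doc_id, vector) pairs from closest to farthest."""
    return sorted(docs, key=lambda doc: manhattan_distance(doc[1], query_vector))
```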

@jtibshirani

Member

commented Jul 27, 2018

In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:

  • Their use case also involves retrieving sentences or short paragraphs. Both the query and documents would be modelled using a sentence embedding (based on an RNN).
  • Vectors are dense and can have from 50 - 1000 dimensions, but are concentrated in the 200 - 300 range.
  • Ideally, cosine similarity would be applied to all documents when scoring (as opposed to just during a rescoring phase). In their use case, sentence retrieval is a component of a fairly general NLP pipeline, and they rely strongly on these sentence embeddings to understand synonyms/ textual similarity.
@mayya-sharipova

Contributor Author

commented Jul 27, 2018

@djp-search thanks for the suggestion, we will study Manhattan distance.

@jtibshirani thanks for another use-case

@mayya-sharipova mayya-sharipova self-assigned this Aug 13, 2018

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Aug 21, 2018

Vector query and cosine similarity
1. Dense vector

PUT dindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "dense_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT dindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : [ 0.5, 10, 6 ]
}

PUT dindex/_doc/2
{
  "my_text" : "text2",
  "my_vector" : [ 0.5, 10, 10]
}

GET dindex/_search
{
  "query" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": [ 0.5, 10, 10]
        }
    }
}

Result:
....
"hits": [
    {
        "_index": "dindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0000001,
        "_source": {
            "my_text": "text2",
            "my_vector": [
                0.5,
                10,
                10
            ]
        }
    },
    {
        "_index": "dindex",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.97016037,
        "_source": {
            "my_text": "text1",
            "my_vector": [
                0.5,
                10,
                6
            ]
        }
    }
]

2. Sparse vector

PUT sindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "sparse_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT sindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : {"1": 0.5, "99": -0.5,  "5": 1}
}

PUT sindex/_doc/2
{
  "my_text" : "text2",
  "my_vector" : {"103": 0.5, "4": -0.5,  "5": 1}
}

GET sindex/_search
{
  "query" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": {"99": -0.5,  "1": 0.5,  "5": 1}
        }
    }
}

Result:
"hits": [
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.99999994,
        "_source": {
            "my_text": "text1",
            "my_vector": {
                "1": 0.5,
                "99": -0.5,
                "5": 1
            }
        }
    },
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6666666,
        "_source": {
            "my_text": "text2",
            "my_vector": {
                "103": 0.5,
                "4": -0.5,
                "5": 1
            }
        }
    }
]

Search with filter:

GET sindex/_search
{
  "query": {
    "bool": {
      "must" : {
        "match": {
          "my_text": "text2"
        }
      },
      "should" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": {"99": -0.5,  "1": 0.5,  "5": 1}
        }
      }
    }
  }
}

Result:
"hits": [
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
            "my_text": "text2",
            "my_vector": {
                "103": 0.5,
                "4": -0.5,
                "5": 1
            }
        }
    }
]

3. Implementation details

3.1 Dense Vector
- BinaryDocValuesField
- byte array ->
    - integer (number of dimensions)
    - array of integers (encoded array of float values)

3.2 Sparse Vector
- BinaryDocValuesField
- byte array ->
    - integer (number of dimensions)
    - array of integers (encoded array of float values)
    - array of integers (array of integer dimensions)

Relates to elastic#31615
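
The byte layouts listed under "Implementation details" can be sketched as follows. This is a hedged Python illustration, not the code in the commit: the function names, the big-endian byte order, and the choice to store each float as its 32-bit IEEE 754 representation are assumptions, and the actual `BinaryDocValuesField` encoding may differ.

```python
import struct


def encode_dense(values):
    """Pack a dense vector as: int32 count, then one 32-bit float per dimension."""
    return struct.pack(f">i{len(values)}f", len(values), *values)


def decode_dense(data):
    """Inverse of encode_dense."""
    (count,) = struct.unpack_from(">i", data, 0)
    return list(struct.unpack_from(f">{count}f", data, 4))


def encode_sparse(vector):
    """Pack a sparse vector as: int32 count, 32-bit float values, then int32
    dimensions, with both arrays ordered by dimension."""
    items = sorted((int(dim), float(value)) for dim, value in vector.items())
    count = len(items)
    values = [value for _, value in items]
    dims = [dim for dim, _ in items]
    return struct.pack(f">i{count}f{count}i", count, *values, *dims)


def decode_sparse(data):
    """Inverse of encode_sparse; returns {dimension: value}."""
    (count,) = struct.unpack_from(">i", data, 0)
    values = struct.unpack_from(f">{count}f", data, 4)
    dims = struct.unpack_from(f">{count}i", data, 4 + 4 * count)
    return dict(zip(dims, values))
```

Under these assumptions a three-dimensional dense vector takes 16 bytes (4 for the count plus 3 × 4 for the values), and a sparse vector with three entries takes 28 bytes.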

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 6, 2018

Vector query and cosine similarity
Relates to elastic#31615
@softwaredoug

Contributor

commented Dec 21, 2018

Are there plans to use this to control matching as well, such as filtering documents in or out based on proximity (maybe some kind of distance) to the point being queried? Then it would be applicable outside a rescoring context.

@mayya-sharipova

Contributor Author

commented Dec 27, 2018

@softwaredoug We are still debating whether we should use this field for matching, as it may make queries slow. For now, the plan is to introduce two functions, cosineSimilarity and dotProduct, as part of the script score query. The idea is that these functions will be used for scoring after matching is already done.

@JnBrymn-EB

commented Dec 29, 2018

We've been discussing this a bit in the Relevant Search Slack. I'm hoping we can use this field for matching too.

  • Certainly matching with this field will be a little slower, but there aren't any real surprises here. For instance, normal search with posting lists, etc. executes in O(num_docs); this field will surely still be O(num_docs), right? And if it's slower, I bet it's not that much slower, is it? (Is it?)
  • The users of this field are likely to be the more sophisticated users who would more likely know the issues they are getting into.
  • Part of the nice value of using this field for matching is that presumably you would also be able to use it with other normal fields. For instance, I could have an index of "users" and I could say, "find me all users that are in San Francisco (geo search), that are most similar to this sample user (vector similarity)".
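
The combination described in the last bullet can be sketched as a plain linear scan. This is only an illustrative Python sketch under loud assumptions: the `city` equality check stands in for a real geo filter, and the document layout and function names are hypothetical. It does show why brute-force vector scoring over the matched set stays linear in the number of documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(docs, city, query_vector, size=10):
    """Keep docs that pass the filter, then rank them by cosine similarity.

    Both steps visit each document at most once, so the scan itself is
    O(num_docs) times the vector dimensionality, plus the sort.
    """
    matched = [doc for doc in docs if doc["city"] == city]
    matched.sort(key=lambda doc: cosine(doc["vector"], query_vector), reverse=True)
    return matched[:size]
```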
@ailurus1991

commented Jan 28, 2019

Hey guys, awesome job. By the way, has this feature been added in 7.0-alpha2? I'm testing dense vector rescoring, but I couldn't find the right way to query.
I've tried:

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

and I got:

"error":{"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [vector]","line":9,"col":24}],

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jan 29, 2019

Distance measures for dense and sparse vectors
Introduce painless functions of
cosineSimilarity and dotProduct distance
measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```

Closes elastic#31615
@mayya-sharipova

Contributor Author

commented Jan 29, 2019

@ailurus1991 Yes, you are right, currently there is no way to query vector fields.
We are working on introducing the ways through painless script functions.

@ailurus1991

commented Feb 5, 2019

@mayya-sharipova wow I see, great work!

mayya-sharipova added a commit that referenced this issue Feb 20, 2019

Distance measures for dense and sparse vectors (#37947)
Closes #31615

weizijun added a commit to weizijun/elasticsearch that referenced this issue Feb 20, 2019

Distance measures for dense and sparse vectors (elastic#37947)
Closes elastic#31615

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Feb 22, 2019

Distance measures for dense and sparse vectors (elastic#37947)
Closes elastic#31615
@psyapathy

commented Mar 6, 2019

@mayya-sharipova
Hello Mayya, thank you for your work!

I need help. I just installed the new Elasticsearch, created an index, and tried the mapping from your example:

{
 "properties": {
   "my_vector": {
     "type": "dense_vector"
    },
    "my_text" : {
      "type" : "keyword"
    }
  }
}

and I get an error:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "No handler for type [dense_vector] declared on field [my_vector]"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
    },
    "status": 400
}

Thank you in advance for your reply!

@mayya-sharipova

Contributor Author

commented Mar 7, 2019

@psyapathy What version of elasticsearch have you installed?

Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1.

@psyapathy

commented Mar 11, 2019

@mayya-sharipova Thank you for the reply!
That's happy and sad at the same time.
Is there an alternative still under development?

@ailurus1991

commented May 21, 2019

@mayya-sharipova Hi Mayya, I've installed ES 7.1 and successfully indexed documents with a dense vector mapping, but I couldn't find the right way to query in the documentation. Could you give me a hint?

@mayya-sharipova

Contributor Author

commented May 22, 2019

@ailurus1991 Sorry, this is a deficiency in our documentation. Scoring is available only from 7.2.
From 7.2, two functions will be available as part of script_score: cosineSimilarity and dotProduct.

@prem6667

commented Jun 26, 2019

@mayya-sharipova I just set up version 7.2, but neither function is there. I can see that the 7.x branch has these functions. Is there a way I can add them manually?

@mayya-sharipova

Contributor Author

commented Jun 27, 2019

@prem6667 Sorry, we have decided to move these functions to 7.3.
Adding these functions manually involves a non-trivial amount of work because, besides the Painless functions, we need to add classes for supporting doc and script values.
Also, please be aware that these features are still experimental and may change.
