Convert LSH hashes from using individual keyword fields to a single text field #42

Closed
alexklibisz opened this issue Feb 17, 2020 · 0 comments · Fixed by #43

Currently, LSH hashes are stored as individual keyword fields, e.g.:

{
  "vec_proc": {
    "1,1": "123",
    "1,2": "345",
    ...
  }
}

For the min-hash algorithm, the key "1,1" corresponds to "table 1, band 1". I stored the hashes this way to enable boolean matching queries against the individual fields.

It turns out that with sufficiently many fields, Elasticsearch rejects documents because the index exceeds its limit on the total number of fields, with exceptions like this:

elasticsearch.helpers.errors.BulkIndexError: ('1 document(s) failed to index.', [{'index': {'_index': 'elastiknn-auto-similarity_jaccard-lsh-27983-1581962133', '_type': '_doc', '_id': 'RWBKVHABgla2WqqUhhQK', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Limit of total fields [1000] in index [elastiknn-auto-similarity_jaccard-lsh-27983-1581962133] has been exceeded'}, 'data': {'dataset_index': 0, 'vec_raw': {'sparseBoolVector': {'trueIndices': [0, 2, 3, 5, 9, 18, 20, 22, 24, 25, 26, 27, 28, 41, 43, 44, 47, 50, 54, ... 15311, 15312], 'totalIndices': 27983}}}}}])
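
As a rough illustration of why this layout runs into the limit, the sketch below counts the keyword fields that one field per (table, band) pair would create. The function name and the table and band counts are hypothetical and chosen only for illustration; the limit of 1000 is the value reported in the exception above.

```python
# Hypothetical sketch: count the keyword fields produced by storing one
# field per (table, band) pair, as in the "vec_proc" layout above.

FIELD_LIMIT = 1000  # value reported in the exception above


def keyword_field_names(num_tables, num_bands):
    """Return field names like "1,1", "1,2", ... for every (table, band) pair."""
    return [
        f"{table},{band}"
        for table in range(1, num_tables + 1)
        for band in range(1, num_bands + 1)
    ]


# Illustrative parameters only; they are not taken from the issue.
names = keyword_field_names(num_tables=40, num_bands=30)
print(len(names), len(names) > FIELD_LIMIT)
```

Any other mapped fields in the index count toward the same limit, so in practice the layout fails somewhat before the hash fields alone reach 1000.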

After reading the docs a bit more, I should be able to get the same query semantics using a single text field with the boolean similarity:

DELETE testing

PUT testing

PUT testing/_mapping
{
  "properties": {
    "hashes": {
      "type": "text",
      "similarity": "boolean"
    }
  }
}

POST testing/_doc
{
  "hashes": "1,1,123 1,2,456 1,3,789"
}

GET testing/_search

GET testing/_search
{
  "query": {
    "match": {
      "hashes": {
        "query": "1,1,123 1,2,456 1,3,888"
      }
    }
  }
}

GET testing/_search
{
  "query": {
    "match": {
      "hashes": {
        "query": "1,3,888 1,1,123 1,2,456"
      }
    }
  }
}

Both searches return a score of 2 for the stored document: two of the three query terms match, and the order of the terms does not affect the score.
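
The behaviour of these queries can be approximated offline. The sketch below is a simplified model, not Elasticsearch's implementation: it assumes the analyzer keeps each `table,band,hash` token whole and splits only on whitespace, and that the boolean similarity contributes 1 for each distinct query term found in the document. The helper names `to_hash_text` and `boolean_match_score` are hypothetical.

```python
def to_hash_text(hashes):
    """Flatten {"table,band": "hash"} into one space-separated string."""
    return " ".join(f"{key},{value}" for key, value in hashes.items())


def boolean_match_score(stored_text, query_text):
    """Count distinct query tokens that also occur in the stored text."""
    stored_tokens = set(stored_text.split())
    query_tokens = set(query_text.split())
    return len(stored_tokens & query_tokens)


# Same values as the document indexed in the console example.
stored = to_hash_text({"1,1": "123", "1,2": "456", "1,3": "789"})

# The two queries from the console example differ only in term order.
for query in ("1,1,123 1,2,456 1,3,888", "1,3,888 1,1,123 1,2,456"):
    print(query, "->", boolean_match_score(stored, query))
```

Under these assumptions the score is just the number of (table, band, hash) triples shared by the query and the stored document, which is the same quantity the per-field boolean queries were computing.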
