Ngram/Edgengram filters don't work with keyword repeat filters #22478

Closed
gibrown opened this issue Jan 6, 2017 · 15 comments
Labels
>enhancement :Search Relevance/Analysis How text is split into tokens

Comments

@gibrown
Contributor

gibrown commented Jan 6, 2017

Elasticsearch version: 2.3.3

Plugins installed: [analysis-icu, analysis-smartcn, delete-by-query, lang-javascript, whatson, analysis-kuromoji, analysis-stempel, elasticsearch-inquisitor, head, langdetect, statsd]

Description of the problem including expected versus actual behavior:

I want to index edgengrams from 3 to 15 chars, but also keep the original token in the field. This is being used for search-as-you-type functionality. For both speed and relevancy reasons we've settled on 3 as the minimum number of chars that makes sense, but it leaves some gaps for non-whitespace-separated languages and for words like 'pi'.

I thought I could do this using keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edgengram filters. Maybe I'm doing it wrong, but I haven't come up with a workaround yet.

Steps to reproduce:

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        },
        "default": {
          "filter": [
            "icu_normalizer",
            "icu_folding"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "filter": {
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "string",
          "similarity": "BM25",
          "analyzer": "default",
          "fields": {
            "ngram": {
              "type": "string",
              "term_vector": "with_positions_offsets",
              "similarity": "BM25",
              "analyzer": "edgengram_analyzer",
              "search_analyzer": "default"
            },
            "word_count": {
              "type": "token_count",
              "analyzer": "default"
            }
          }
        }
      }
    }
  }
}

GET test_analyzer/_analyze 
{
  "analyzer": "edgengram_analyzer", 
  "text":     "Is this déjà vu?"
}

Output:

{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "dej",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    }
  ]
}

I'd expect to get the tokens: is, thi, this, dej, deja, vu

The problem gets worse when looking at non-whitespace languages where many characters are tokenized into one character per token.

I could search across multiple fields, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance if the user types in "hi ther" we should be able to match instances where the content had "hi there" and use that to boost those exact matches. We do this by adding a simple should clause:

            "bool": {
              "must": [
                {
                  "multi_match": {
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther",
                    "operator": "and",
                    "type": "cross_fields"
                  }
                }
              ],
              "should": [
                {
                  "multi_match": {
                    "type": "phrase",
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther"
                  }
                }
              ]
            }
          },
@dakrone
Member

dakrone commented Jan 6, 2017

I'd expect to get the tokens: is, thi, this, dej, deja, vu

Since you use the icu_tokenizer, your text is being split into four tokens:

{
  "tokens" : [ {
    "token" : "Is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

And then if you do the folding and keyword_repeat:

{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

If you then try to do edge ngrams for the is and vu terms, they are below the min_gram threshold of 3 in your configuration, so they are dropped.

If you want to keep the whitespace, perhaps inject a shingle token filter (with two shingles) in there so that is this becomes a token including the whitespace, which you can then analyze with edgengrams to get is , is t, is th, is this.
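
For illustration, a minimal sketch of that chain (untested; it keeps the 2.x edgeNGram filter name from the original report, and the index/filter names are made up):

PUT shingle_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "shingle_filter",
            "edgengram_filter"
          ]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  }
}

With output_unigrams enabled the single terms still go through the edgengram filter, while the two-word shingles ("is this", "this deja", ...) keep the space and produce prefixes like "is ", "is t", "is th".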

@clintongormley clintongormley added :Search Relevance/Analysis How text is split into tokens discuss labels Jan 9, 2017
@clintongormley

I've run into this same issue before. keyword_repeat only works for stemmers, but I wonder if this functionality should be extended to edge-ngrams.

@mikemccand what do you think?

@gibrown
Contributor Author

gibrown commented Jan 9, 2017

FWIW, my current workaround is to always use the lang-specific analysis field when I think I'm searching in a non-whitespace-separated language (but I don't really trust my lang detection), and I use the lang-specific field anytime the text is less than 3 chars, or if the trailing word is less than three chars (e.g. a search like "math pi").

Shingle tokens as a workaround will still have the same problem of not letting me have sub-3-char tokens, I think. I also suspect it would blow up the index size even more than including 1- and 2-char edgengrams would.

BTW, if we change this, can it be easily backported to 2.x? ;)

@gibrown
Contributor Author

gibrown commented Apr 20, 2017

Heh, and now we've found a case where my workarounds don't work: "Game of Thrones".

Any updates here? I guess I could just adjust to edgengrams starting from 1 char, but that seems likely to cause lots of inefficiencies.

Shingle tokens sound interesting (and might improve relevancy) but would also significantly increase index size.

@gibrown
Contributor Author

gibrown commented Apr 20, 2017

Another idea (for anyone following along): I could have one edgengrams field per language and then specify a language analyzer that has stop words for that language, as sketched below. That would fix the worst cases, but still not fix something like "pi".
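
A hedged sketch of what that per-language variant could look like for English (untested; the index, analyzer, and stop filter names are made up):

PUT test_analyzer_en
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer_en": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "english_stop",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        }
      }
    }
  }
}

Stop words like "is" would then never reach the edgengram filter, but short non-stop words like "pi" are still lost.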

@mayya-sharipova
Contributor

@gibrown Can you please confirm what tokens you expect when you index "Is this déjà vu?"
Are you expecting ngrams (3-15) as well?

"Is t"
"Is th"
...
Can you index using the edge ngram tokenizer?
And if you need the original tokens as well, can you use another field for this?

keyword_repeat is specifically designed to be followed by some stem filter. It is not relevant for edge ngrams.
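
For reference, a rough sketch of the edge ngram tokenizer suggestion (untested; 6.x syntax, index and analyzer names made up), keeping the original tokens in the parent field and the prefixes in a subfield:

PUT edge_ngram_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}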

@mayya-sharipova
Contributor

cc @elastic/es-search-aggs

@gibrown
Contributor Author

gibrown commented Mar 20, 2018

For edgengrams on "Is this déjà vu?" I would only expect the following tokens:
"is","thi","this","dej","deja","vu"

"is t" and "is th" would not be in the index.

Can you index using the edge ngram tokenizer?

No, we are using the icu_tokenizer. We are doing indexing across all languages. Technically we should even be using special tokenization for Japanese, Korean, and Chinese so we can get the tokenization correct there.

Thanks for taking a look.

Our deployed workaround is to search both the edgengram field and an icu-tokenized field that doesn't have any ngrams. We do this with a multi_match query that uses cross_fields and AND as the operator, roughly as sketched below. It makes for a more expensive query, but it kinda works.
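
A sketch of that query shape against the mapping at the top of this issue (not our production field names):

GET test_analyzer/_search
{
  "query": {
    "multi_match": {
      "type": "cross_fields",
      "operator": "and",
      "query": "hi ther",
      "fields": [
        "my_text",
        "my_text.ngram"
      ]
    }
  }
}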

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 20, 2018

@gibrown If you found the workaround, would you mind if I close this issue?

@gibrown
Contributor Author

gibrown commented Mar 20, 2018

I still think that some way to index edgengrams from X-Y plus the original token would be a very worthwhile improvement. I would use it if it were available. I still think keyword_repeat is the closest approximation. My workaround breaks if I am trying to do a phrase match, for instance: "is this dej".

Technically what I would love is a clearer language that lets me have multiple flows for extracting tokens:

  • extract the original token
  • extract a stemmed version of the token
  • extract the edgengrams for a token.

This would let me do an AND match on multiple tokens as well as a phrase match. Having them in multiple fields has a number of drawbacks.

@romseygeek
Contributor

I've been doing some work on making branches possible in TokenStreams (see https://issues.apache.org/jira/browse/LUCENE-8273). If that were combined with a generalisation of KeywordRepeatFilter, we could build an analysis chain that looked something like:

KeywordRepeatFilter(none, stem, ngram) -> repeats each token three times with a different keyword set
if (keyword == stem) then apply Stemmer
if (keyword == ngram) then apply EdgeNGramFilter

@gibrown
Contributor Author

gibrown commented Jun 5, 2018

@romseygeek I love the idea of being able to have multiple paths for processing tokens. This would help in a lot of cases I've seen, I think.

It feels like the analysis syntax would need a bit more structure than it currently has to handle this sort of thing.

@nomoa
Contributor

nomoa commented Jun 6, 2018

We had exactly the same issue; the problem is that not all filters support the keyword attribute. We ended up adding a new token filter in a plugin we maintain to work around this limitation.
It would be great to have such support upstream (either by making all filters aware of the keyword attribute or by providing another way to really emit the original token).

@romseygeek
Contributor

Added in #31208
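
For anyone landing here later, a minimal sketch of how this could be wired up, assuming #31208 refers to the multiplexer token filter that shipped in 6.4 (untested; names are illustrative):

PUT multiplexer_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "ngram_multiplexer"
          ]
        }
      },
      "filter": {
        "ngram_multiplexer": {
          "type": "multiplexer",
          "preserve_original": true,
          "filters": [
            "edgengram_filter"
          ]
        },
        "edgengram_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  }
}

Each token is emitted once unchanged (preserve_original) and once through the edge ngram chain, so short tokens like "vu" survive alongside the prefixes.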

@gibrown
Contributor Author

gibrown commented Jun 25, 2018

Very excited that this is in 6.4. Thanks @romseygeek, nice work.
