Ngram/Edgengram filters don't work with keyword repeat filters #22478

Closed
gibrown opened this issue Jan 6, 2017 · 15 comments
Labels
>enhancement :Search Relevance/Analysis How text is split into tokens

Comments

@gibrown
Contributor

gibrown commented Jan 6, 2017

Elasticsearch version: 2.3.3

Plugins installed: [analysis-icu, analysis-smartcn, delete-by-query, lang-javascript, whatson, analysis-kuromoji, analysis-stempel, elasticsearch-inquisitor, head, langdetect, statsd]

Description of the problem including expected versus actual behavior:

I want to index edgengrams from 3 to 15 chars, but also keep the original token in the field. This is being used for search-as-you-type functionality. For both speed and relevancy reasons we've settled on 3 as the minimum number of chars that makes sense, but it leaves some gaps for non-whitespace-separated languages and for words like 'pi'.

I thought I could do this using keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edgengram filters. Maybe I'm doing it wrong, but I haven't come up with a workaround yet.

Steps to reproduce:

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        },
        "default": {
          "filter": [
            "icu_normalizer",
            "icu_folding"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "filter": {
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "string",
          "similarity": "BM25",
          "analyzer": "default",
          "fields": {
            "ngram": {
              "type": "string",
              "term_vector": "with_positions_offsets",
              "similarity": "BM25",
              "analyzer": "edgengram_analyzer",
              "search_analyzer": "default"
            },
            "word_count": {
              "type": "token_count",
              "analyzer": "default"
            }
          }
        }
      }
    }
  }
}

GET test_analyzer/_analyze 
{
  "analyzer": "edgengram_analyzer", 
  "text":     "Is this déjà vu?"
}

Output:

{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "dej",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    }
  ]
}

I'd expect to get the tokens: is, thi, this, dej, deja, vu

The problem gets worse when looking at non-whitespace languages where many characters are tokenized into one character per token.

I could search across multiple fields, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance if the user types in "hi ther" we should be able to match instances where the content had "hi there" and use that to boost those exact matches. We do this by adding a simple should clause:

            "bool": {
              "must": [
                {
                  "multi_match": {
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther",
                    "operator": "and",
                    "type": "cross_fields"
                  }
                }
              ],
              "should": [
                {
                  "multi_match": {
                    "type": "phrase",
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther"
                  }
                }
              ]
            }
          },
@dakrone
Member

dakrone commented Jan 6, 2017

I'd expect to get the tokens: is, thi, this, dej, deja, vu

Since you use the icu_tokenizer, your text is being split into four tokens:

{
  "tokens" : [ {
    "token" : "Is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

And then if you do the folding and keyword_repeat:

{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

If you then try to do edge ngrams for the is and vu terms, they are below the min_gram threshold of 3 in your configuration, so they are dropped.

If you want to keep the whitespace, perhaps inject a shingle token filter (with two shingles) in there so that is this becomes a token including the whitespace, which you can then analyze with edgengrams to get is , is t, is th, is this.
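
For illustration, a minimal sketch of that chain (untested; it keeps the 2.x edgeNGram filter name from the original report, and the index/filter names are made up):

PUT shingle_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "shingle_filter",
            "edgengram_filter"
          ]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  }
}

With output_unigrams enabled the single terms still go through the edgengram filter, while the two-word shingles ("is this", "this deja", ...) keep the space and produce prefixes like "is ", "is t", "is th".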

@clintongormley clintongormley added :Search Relevance/Analysis How text is split into tokens discuss labels Jan 9, 2017
@clintongormley

I've run into this same issue before. keyword_repeat only works for stemmers, but I wonder if this functionality should be extended to edge-ngrams.

@mikemccand what do you think?

@gibrown
Contributor Author

gibrown commented Jan 9, 2017

FWIW, my current workaround is to always use the lang-specific analysis field when I think I'm searching in a non-whitespace-separated language (but I don't really trust my lang detection), and I use the lang-specific field anytime the text is less than 3 chars, or if the trailing word is less than three chars (e.g. a search like "math pi").

Shingle tokens as a workaround will still have the same problem of not letting me have sub-3-char tokens, I think. I also suspect it would blow up the index size even more than including 1- and 2-char edgengrams would.

BTW, if we change this, can it be easily backported to 2.x? ;)

@gibrown
Contributor Author

gibrown commented Apr 20, 2017

Heh, and now we've found a case where my workarounds don't work: "Game of Thrones".

Any updates here? I guess I could just adjust to edgengrams starting from 1 char, but that seems likely to cause lots of inefficiencies.

Shingle tokens sound interesting (and might improve relevancy) but would also significantly increase index size.

@gibrown
Contributor Author

gibrown commented Apr 20, 2017

Another idea (for anyone following along): I could have one edgengrams field per language and then specify a language analyzer that has stop words for that language, as sketched below. That would fix the worst cases, but still not fix something like "pi".
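
A hedged sketch of what that per-language variant could look like for English (untested; the index, analyzer, and stop filter names are made up):

PUT test_analyzer_en
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer_en": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "english_stop",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        },
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        }
      }
    }
  }
}

Stop words like "is" would then never reach the edgengram filter, but short non-stop words like "pi" are still lost.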

@mayya-sharipova
Contributor

@gibrown Can you please confirm what tokens you expect when you index "Is this déjà vu?"
Are you expecting ngrams (3-15) as well?

"Is t"
"Is th"
...
Can you index using the edge ngram tokenizer?
And if you need the original tokens as well, can you use another field for this?

keyword_repeat is specifically designed to be followed by some stem filter. It is not relevant for edge ngrams.
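
For reference, a rough sketch of the edge ngram tokenizer suggestion (untested; 6.x syntax, index and analyzer names made up), keeping the original tokens in the parent field and the prefixes in a subfield:

PUT edge_ngram_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}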

@mayya-sharipova
Contributor

cc @elastic/es-search-aggs

@gibrown
Contributor Author

gibrown commented Mar 20, 2018

For edgengrams on "Is this déjà vu?" I would only expect the following tokens:
"is","thi","this","dej","deja","vu"

"is t" and "is th" would not be in the index.

Can you index using the edge ngram tokenizer?

No, we are using the icu_tokenizer. We are doing indexing across all languages. Technically we should even be using special tokenization for Japanese, Korean, and Chinese so we can get the tokenization correct there.

Thanks for taking a look.

Our deployed workaround is to search both the edgengram field and an icu-tokenized field that doesn't have any ngrams. We do this with a multi_match query that uses cross_fields and AND as the operator, roughly as sketched below. It makes for a more expensive query, but it kinda works.
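
A sketch of that query shape against the mapping at the top of this issue (not our production field names):

GET test_analyzer/_search
{
  "query": {
    "multi_match": {
      "type": "cross_fields",
      "operator": "and",
      "query": "hi ther",
      "fields": [
        "my_text",
        "my_text.ngram"
      ]
    }
  }
}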

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 20, 2018

@gibrown If you found the workaround, would you mind if I close this issue?

@gibrown
Contributor Author

gibrown commented Mar 20, 2018

I still think that some way to index edgengrams from X-Y plus the original token would be a very worthwhile improvement. I would use it if it were available. I still think keyword_repeat is the closest approximation. My workaround breaks if I am trying to do a phrase match, for instance: "is this dej".

Technically what I would love is a clearer language that lets me have multiple flows for extracting tokens:

  • extract the original token
  • extract a stemmed version of the token
  • extract the edgengrams for a token.

This would let me do an AND match on multiple tokens as well as a phrase match. Having them in multiple fields has a number of drawbacks.

@romseygeek
Contributor

I've been doing some work on making branches possible in TokenStreams (see https://issues.apache.org/jira/browse/LUCENE-8273). If that were combined with a generalisation of KeywordRepeatFilter, we could build an analysis chain that looked something like:

KeywordRepeatFilter(none, stem, ngram) -> repeats each token three times with a different keyword set
if (keyword == stem) then apply Stemmer
if (keyword == ngram) then apply EdgeNGramFilter

@gibrown
Contributor Author

gibrown commented Jun 5, 2018

@romseygeek I love the idea of being able to have multiple paths for processing tokens. This would help in a lot of cases I've seen, I think.

It feels like the analysis syntax would need a bit more structure than it currently has to handle this sort of thing.

@nomoa
Contributor

nomoa commented Jun 6, 2018

We had exactly the same issue; the problem is that not all filters support the keyword attribute. We ended up adding a new token filter in a plugin we maintain to work around this limitation.
It would be great to have such support upstream (either by making all filters aware of the keyword attribute or by providing another way to really emit the original token).

@romseygeek
Contributor

Added in #31208
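
For anyone landing here later, a minimal sketch of how this could be wired up, assuming #31208 refers to the multiplexer token filter that shipped in 6.4 (untested; names are illustrative):

PUT multiplexer_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "ngram_multiplexer"
          ]
        }
      },
      "filter": {
        "ngram_multiplexer": {
          "type": "multiplexer",
          "preserve_original": true,
          "filters": [
            "edgengram_filter"
          ]
        },
        "edgengram_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  }
}

Each token is emitted once unchanged (preserve_original) and once through the edge ngram chain, so short tokens like "vu" survive alongside the prefixes.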

@gibrown
Contributor Author

gibrown commented Jun 25, 2018

Very excited that this is in 6.4. Thanks @romseygeek, nice work.
