
token position problems of word_delimiter token filter #7391

Closed
hxuanji opened this issue Aug 22, 2014 · 7 comments

@hxuanji

hxuanji commented Aug 22, 2014

Hi, all

I have the following index setting:

{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 0,
            "analysis": {   
                "analyzer": {
                    "fielda_index": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": [ "words_delimiter", "icu_normalizer", "icu_folding"]
                    },
                    "fielda_search": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": ["dot_delimiter", "icu_normalizer", "icu_folding"]
                    }
                },
                "filter": {
                    "dot_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": false,
                        "preserve_original": true,
                        "split_on_numerics": true
                    },
                    "words_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": true,
                        "preserve_original": true,
                        "split_on_numerics": true
                    }

                }
            }
        }
    },
    "mappings": {
        "main": {
            "_source": {"enabled": true},
            "dynamic_date_formats": ["basic_date_time_no_millis"],
            "properties": {
                "name": { "type": "string", "index": "analyzed", "index_analyzer": "fielda_index", "search_analyzer": "fielda_search", "include_in_all": true}
            }
        }
    }
}

I ran the word "PowerShot" through the two analyzers; here are the results:

fielda_index:   PowerShot(1) Power(1) Shot(2)
fielda_search:  PowerShot(1)

The number in parentheses is the token position.
My question is: why is the token position of "Shot" 2? I would expect all tokens generated by the word_delimiter token filter to share the same position. Ideas?
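
You can reproduce these positions with the _analyze API (a quick check, assuming the settings above were created in an index named test):

curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_index&text=PowerShot&pretty"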

Because of this, I run into a problem when performing a match_phrase query.
As we know, the match_phrase query not only matches the tokens but also checks their positions.

So when I insert a document,

{"name": "Canon PowerShot D500"}

I cannot use the query

{"from": 0, "size": 100, "query":{"match_phrase": {"name":"Canon PowerShot D500"}}}

to find the document I just inserted, because the token positions do not match.

The token results of the two analyzers are:

fielda_index    Canon(1) PowerShot(2) Power(2) Shot(3) D500(4) D(4) 500(5)
fielda_search   Canon(1) PowerShot(2) D500(3) D(3) 500(4)

Note that fielda_search puts "D500" at position 3, while fielda_index puts it at position 4, so the desired document cannot be found.

The reproducible gist script is https://gist.github.com/hxuanji/b94d9c3514d7b08005d2

So is there a reason why the positions of the tokens generated by the word_delimiter filter behave like this?
Since the extra tokens generated by word_delimiter are just "expanded" forms of the original token, I would expect them to keep the original token's position. Am I misunderstanding something, or is there another reason?

Best,
Ivan

@clintongormley

Hi @hxuanji

You are, unfortunately, correct. The WDF does generate new positions, which breaks the token filter contract. This is how it works in Lucene, and there are currently no plans to change it.

You can't use phrase queries with WDF.

You may be able to achieve what you want with the pattern_capture token filter instead.

@hxuanji

hxuanji commented Aug 25, 2014

Hi @clintongormley,

I have another question about this. Assume I change the filter settings to:

"dot_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "([\\p{Ll}\\p{Lu}]+\\d*|\\d+)"
                       ]
                    },
                    "words_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "(\\p{Ll}+|\\p{Lt}+|\\p{Lu}+\\p{Ll}+|\\p{Lu}+)",
                          "(\\d+)"
                       ]
                    }

Now the token positions should all be the same.
If I index the document:

{"name": "942430__n.jpg"}

Its token results under the two analyzers are:

fielda_index    942430__n.jpg(1) 942430(1) n(1) jpg(1)
fielda_search   942430__n.jpg(1) 942430(1) n(1) jpg(1)

As we can see, the tokens are all at position 1.
But in this situation, when I run the query:

{"from": 0, "size": 100, "query":{"match": {"name":{"query":"942430__n.jpg", "operator" : "and"}}}}

why does the result include documents whose only token is "n", such as {"name": "n"}?

The reproducible gist: https://gist.github.com/hxuanji/8e58c0ffb391ced49439

Although I specified the "and" operator, it seems to only constrain the positions, not the tokens. Does that make sense?

It seems I have some misunderstanding about the matching rule of the "and" operator.

Thanks a lot.

@clintongormley

Hi @hxuanji

A trick for figuring out exactly what the query is doing is to use the validate-query API with the explain option:

curl -XPOST "http://localhost:9200/test/main/_validate/query?explain" -d'
{
  "query": {
    "match": {
      "name": {
        "query": "942430__n.jpg",
        "operator": "and"
      }
    }
  }
}'

This outputs:

     "explanation": "filtered(name:942430__n.jpg name:942430 name:n name:jpg)->cache(_type:main)"

So any one of the terms in the same position is allowed to match. The and operator doesn't affect "stacked" terms. The reason is that these terms are treated like synonyms: you require one of the synonyms to be at that position, but not all of them.
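
If you really need every sub-token to be present, one workaround (just a sketch, untested) is to spell the tokens out as explicit term queries inside a bool query's must clause, since each must clause has to match on its own:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "name": "942430" } },
        { "term": { "name": "n" } },
        { "term": { "name": "jpg" } }
      ]
    }
  }
}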

@hxuanji

hxuanji commented Aug 25, 2014

Hi, @clintongormley
I got it! Thanks for your help.

Ivan

@dklotz

dklotz commented Jan 29, 2015

@clintongormley I think this problem with the positions of the word_delimiter filter should be mentioned on the respective reference / guide pages... Just ran into the same thing.

@mikemccand

I am trying to fix this issue in Lucene: https://issues.apache.org/jira/browse/LUCENE-7619

It would mean you need to include WordDelimiterGraphFilter (once it's released) in your search-time analyzer.
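
For reference, a sketch of what the filter definition might look like in Elasticsearch once this lands, assuming it is exposed as a word_delimiter_graph filter type taking the same options as word_delimiter (the filter name words_delimiter_graph here is just illustrative):

"filter": {
    "words_delimiter_graph": {
        "type": "word_delimiter_graph",
        "generate_word_parts": true,
        "generate_number_parts": true,
        "split_on_case_change": true,
        "preserve_original": true,
        "split_on_numerics": true
    }
}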

@vsiv

vsiv commented May 15, 2017

WordDelimiterGraphFilter is now released and available in v5.4, FYI to those who stumble upon this thread. Thanks @mikemccand for this!!

V
