
token position problems of word_delimiter token filter #7391

Closed
hxuanji opened this issue Aug 22, 2014 · 7 comments

@hxuanji

hxuanji commented Aug 22, 2014

Hi, all

I have the following index setting:

{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 0,
            "analysis": {   
                "analyzer": {
                    "fielda_index": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": [ "words_delimiter", "icu_normalizer", "icu_folding"]
                    },
                    "fielda_search": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": ["dot_delimiter", "icu_normalizer", "icu_folding"]
                    }
                },
                "filter": {
                    "dot_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": false,
                        "preserve_original": true,
                        "split_on_numerics": true
                    },
                    "words_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": true,
                        "preserve_original": true,
                        "split_on_numerics": true
                    }

                }
            }
        }
    },
    "mappings": {
        "main": {
            "_source": {"enabled": true},
            "dynamic_date_formats": ["basic_date_time_no_millis"],
            "properties": {
                "name": { "type": "string", "index": "analyzed", "index_analyzer": "fielda_index", "search_analyzer": "fielda_search", "include_in_all": true}
            }
        }
    }
}

I ran the word "PowerShot" through the two analyzers; here are the results:

fielda_index:   PowerShot(1) Power(1) Shot(2)
fielda_search:  PowerShot(1)

The number in parentheses is the token position.
My question is: why is the token position of "Shot" 2? I would expect all tokens generated by the word_delimiter token filter to share the same position. Ideas?
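
You can reproduce these positions with the _analyze API (a quick check, assuming the settings above were created in an index named test):

curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_index&text=PowerShot&pretty"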

Because of this, I run into a problem when performing a match_phrase query.
As we know, the match_phrase query not only matches the tokens but also checks their positions.

So when I insert a document,

{"name": "Canon PowerShot D500"}

I cannot use the query

{"from": 0, "size": 100, "query":{"match_phrase": {"name":"Canon PowerShot D500"}}}

to find the document I just inserted, because the token positions do not match.

The token results of the two analyzers are:

fielda_index    Canon(1) PowerShot(2) Power(2) Shot(3) D500(4) D(4) 500(5)
fielda_search   Canon(1) PowerShot(2) D500(3) D(3) 500(4)

Note that fielda_search puts "D500" at position 3, while fielda_index puts it at position 4, so the desired document cannot be found.

The reproducible gist script is https://gist.github.com/hxuanji/b94d9c3514d7b08005d2

So is there a reason why the positions of the tokens generated by the word_delimiter filter behave like this?
Since the extra tokens generated by word_delimiter are just "expanded" forms of the original token, I would expect them to keep the original token's position. Am I misunderstanding something, or is there another reason?

Best,
Ivan

@clintongormley

Hi @hxuanji

You are, unfortunately, correct. The WDF does generate new positions, which breaks the token filter contract. This is how it works in Lucene, and there are currently no plans to change it.

You can't use phrase queries with WDF.

You may be able to achieve what you want with the pattern_capture token filter instead.

@hxuanji

hxuanji commented Aug 25, 2014

Hi @clintongormley,

I have another question about this. Assume I change the filter settings to:

"dot_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "([\\p{Ll}\\p{Lu}]+\\d*|\\d+)"
                       ]
                    },
                    "words_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "(\\p{Ll}+|\\p{Lt}+|\\p{Lu}+\\p{Ll}+|\\p{Lu}+)",
                          "(\\d+)"
                       ]
                    }

Now the token positions should all be the same.
If I index the document:

{"name": "942430__n.jpg"}

Its token results under the two analyzers are:

fielda_index    942430__n.jpg(1) 942430(1) n(1) jpg(1)
fielda_search   942430__n.jpg(1) 942430(1) n(1) jpg(1)

As we can see, the tokens are all at position 1.
But in this situation, when I run the query:

{"from": 0, "size": 100, "query":{"match": {"name":{"query":"942430__n.jpg", "operator" : "and"}}}}

why does the result include documents whose only token is "n", such as {"name": "n"}?

The reproducible gist: https://gist.github.com/hxuanji/8e58c0ffb391ced49439

Although I specified the "and" operator, it seems to only constrain the positions, not the tokens. Does that make sense?

It seems I have some misunderstanding about the matching rule of the "and" operator.

Thanks a lot.

@clintongormley

Hi @hxuanji

A trick for figuring out exactly what the query is doing is to use the validate-query API with the explain option:

curl -XPOST "http://localhost:9200/test/main/_validate/query?explain" -d'
{
  "query": {
    "match": {
      "name": {
        "query": "942430__n.jpg",
        "operator": "and"
      }
    }
  }
}'

This outputs:

     "explanation": "filtered(name:942430__n.jpg name:942430 name:n name:jpg)->cache(_type:main)"

So any one of the terms in the same position is allowed to match. The and operator doesn't affect "stacked" terms. The reason is that these terms are treated like synonyms: you require one of the synonyms to be at that position, but not all of them.
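
If you really need every sub-token to be present, one workaround (just a sketch, untested) is to spell the tokens out as explicit term queries inside a bool query's must clause, since each must clause has to match on its own:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "name": "942430" } },
        { "term": { "name": "n" } },
        { "term": { "name": "jpg" } }
      ]
    }
  }
}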

@hxuanji

hxuanji commented Aug 25, 2014

Hi, @clintongormley
I got it! Thanks for your help.

Ivan

@dklotz

dklotz commented Jan 29, 2015

@clintongormley I think this problem with the positions of the word_delimiter filter should be mentioned on the respective reference / guide pages... Just ran into the same thing.

@mikemccand

I am trying to fix this issue in Lucene: https://issues.apache.org/jira/browse/LUCENE-7619

It would mean you need to include WordDelimiterGraphFilter (once it's released) in your search-time analyzer.
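
For reference, a sketch of what the filter definition might look like in Elasticsearch once this lands, assuming it is exposed as a word_delimiter_graph filter type taking the same options as word_delimiter (the filter name words_delimiter_graph here is just illustrative):

"filter": {
    "words_delimiter_graph": {
        "type": "word_delimiter_graph",
        "generate_word_parts": true,
        "generate_number_parts": true,
        "split_on_case_change": true,
        "preserve_original": true,
        "split_on_numerics": true
    }
}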

@vsiv

vsiv commented May 15, 2017

WordDelimiterGraphFilter is now released and available in v5.4, FYI to those who stumble upon this thread. Thanks @mikemccand for this!!

V
