
Wildcard search with special characters: Differences between ES 2.x and 5.x #22989

Closed · mweibel opened this issue Feb 6, 2017 · 8 comments
Labels: >docs General docs changes · help wanted adoptme · :Search/Search Search-related issues that do not fall into other categories
mweibel commented Feb 6, 2017

I'm in the process of upgrading a project from ES 1.7 to ES 5.1 and noticed an integration test failing. After further testing I confirmed that the actual change happened between 2.x and 5.x.
Below is a minimal repro case.

When searching with a wildcard for "brö", I get the same results in both versions. When searching for "brö{"/)?", which gets escaped as "brö{\\"/)", I get 2 results in ES 2.x and no results in ES 5.x.

I searched the breaking-changes documentation for ES 2.x and ES 5.x but could not find anything related to this. I also looked for new query string settings that could help here, but couldn't find any.
I first opened a forum post with the same details: https://discuss.elastic.co/t/wildcard-search-with-special-characters-differences-between-es-1-7-and-es-5-1/73720/1
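The escaping step mentioned above follows Lucene's query_string rules: each reserved syntax character gets a backslash prefix. A minimal sketch (the helper name `escape_query_string` is my own; the character list is the documented query_string reserved set):

```python
# Sketch of query_string escaping: each Lucene reserved character is
# prefixed with a backslash. Helper name is hypothetical.
RESERVED = set('+-=&|><!(){}[]^"~*?:\\/')

def escape_query_string(text: str) -> str:
    """Backslash-escape characters that query_string treats as syntax."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)

print(escape_query_string('brö{"/)'))  # brö\{\"\/\)
```

Note that each backslash must then be escaped again when the query is embedded in a JSON request body, which is why the queries below contain doubled backslashes.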

How to reproduce

ES 5 index creation

curl -X PUT 'http://localhost:9200/testwildcard' -d '{
  "mappings": {
    "testwildcard": {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "search": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}'

ES 2.x index creation

curl -XPUT 'http://localhost:9200/testwildcard' -d '{
  "mappings": {
    "testwildcard": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "fields": {
            "search": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}'

Insert data

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/1' -d '{
 "name": "Brötchen"
}'

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/2' -d '{
 "name": "Frischbackbrötchen"
}'

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/3' -d '{
 "name": "Brot"
}'

works in both ES 2.x and ES 5.1:

curl -XPOST 'http://localhost:9200/testwildcard/_search' -d '{
  "explain": true,
  "query": {
    "query_string": {
      "query": "name.search:*brö*",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}'

doesn't work in ES 5.1 but works in ES 2.x:

curl -XPOST 'http://localhost:9200/testwildcard/_search' -d '{
  "explain": true,
  "query": {
    "query_string": {
      "query": "name.search:*brö\\{\\\\\\\"\\/\\)*",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}'

jpountz commented Feb 8, 2017

Good catch, the behaviour did indeed change: we now only apply the character-level filters of the analysis chain (like ASCII folding or lowercasing) to wildcard terms, which means we ignore the tokenizer and some token filters, such as synonym or stemming filters.

For most users this should be transparent and might even fix queries that did not work before. In this case, however, it causes the punctuation characters to remain in *brö\\{\\\\\\\"\\/\\)*, which means the query cannot match anything, since your analyzer discards punctuation.

It is unclear to me how name.search:*brö\\{\\\\\\\"\\/\\)* should be parsed, so I am tempted to just document this as a possible breaking change. I'll leave this issue open with the discuss label to see what others think.
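The difference described above can be illustrated with a toy model (function names and the regex-based "analyzer" are my own simplifications, not Elasticsearch internals): wildcard terms only get character-level normalization, while indexed text goes through the full chain, including a tokenizer that discards punctuation.

```python
import re
import unicodedata

def char_level_normalize(term: str) -> str:
    """Toy model of 5.x wildcard-term treatment: lowercase + ASCII-fold,
    but keep punctuation (no tokenizer is applied)."""
    folded = unicodedata.normalize('NFKD', term.lower())
    return ''.join(c for c in folded if not unicodedata.combining(c))

def full_analyze(text: str) -> list:
    """Toy model of full analysis: the tokenizer also strips punctuation,
    so indexed tokens never contain it."""
    return re.findall(r'\w+', char_level_normalize(text))

indexed = full_analyze('Brötchen')            # ['brotchen']
wildcard = char_level_normalize('brö{"/)')    # 'bro{"/)'
# The punctuation survives in the wildcard term but never appears in any
# indexed token, so *bro{"/)* cannot match.
print(indexed, wildcard)
```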


mweibel commented Feb 8, 2017

@jpountz thanks for your reply and confirmation :) I'd also vote for simply documenting this as a breaking change. Am I correct in assuming that the new but experimental normalizer feature could be used as an alternative (besides ensuring only valid characters get passed in the search string)?


jpountz commented Feb 8, 2017

Actually, the normalizer feature works on top of this normalization, which was originally designed to make query parsing "do the right thing" when analyzing partial terms for prefix, wildcard or fuzzy queries.


jimczi commented Feb 9, 2017

> I am unclear how name.search:brö\{\\\"\/\) should be parsed so I am tempted to just document this as a possible breaking change. I'll leave this issue open with the discuss label to see what others think.

I think the situation is not ideal, because the punctuation is removed by the tokenizer even though that could be considered part of the normalization phase. An alternative would be to add a char_filter that removes the punctuation, but that could break the tokenization...
Just an idea, but now that we have a normalizer type, maybe we could allow defining both a normalizer AND an analyzer for text fields only?
It would also make sense to restrict analyze_wildcard to a single leading and trailing wildcard. Today it is "correctly" applied to prefix queries but not to suffix queries. IMO, for complex wildcard queries we should document that this setting is not honored.
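The char_filter alternative mentioned above might look like the following index-settings body (a sketch only, with made-up names; as noted, stripping punctuation before the tokenizer runs can itself break tokenization). It uses the pattern_replace char filter with a Java-regex punctuation class:

```python
# Hypothetical settings body illustrating a punctuation-stripping
# char_filter; all names ("strip_punctuation", "no_punct") are made up.
settings = {
    "analysis": {
        "char_filter": {
            "strip_punctuation": {
                "type": "pattern_replace",
                "pattern": "[\\p{Punct}]",   # Java regex class used by ES
                "replacement": ""
            }
        },
        "analyzer": {
            "no_punct": {
                "type": "custom",
                "char_filter": ["strip_punctuation"],
                "tokenizer": "standard",
                "filter": ["lowercase"]
            }
        }
    }
}
```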

@clintongormley clintongormley added >docs General docs changes and removed discuss labels Feb 17, 2017

jpountz commented Feb 17, 2017

We agreed in FixitFriday to document this behaviour and to potentially adapt it based on feedback when we get some.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@mayya-sharipova

cc @elastic/es-search-aggs

We still need to better document the details of how analyze_wildcard works.

@cbuescher cbuescher added the help wanted adoptme label Nov 12, 2018

xtrembaker commented Nov 14, 2018

Hello all!

I'm currently using Elasticsearch v6.2.3, and I encountered a similar problem.

I have this mapping on my index:

{
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercaseasciifolding": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": false,
            "properties": {
                "uuid": {
                    "type": "text",
                    "index": false
                },
                "id": {
                    "type": "long",
                    "index": false
                },
                "account_id": {
                    "type": "integer",
                    "index": true
                },
                "budgea_category_id": {
                    "type": "integer",
                    "index": false
                },
                "budgea_id": {
                    "type": "text",
                    "index": false
                },
                "category_id": {
                    "type": "integer",
                    "index": true
                },
                "date": {
                    "type": "date",
                    "index": true
                },
                "rdate": {
                    "type": "date",
                    "index": true
                },
                "deleted_at": {
                    "type": "date",
                    "index": false
                },
                "updated_at": {
                    "type": "date",
                    "index": false
                },
                "created_at": {
                    "type": "date",
                    "index": false
                },
                "value": {
                    "type": "float",
                    "index": false
                },
                "type": {
                    "type": "text",
                    "index": false
                },
                "original_wording": {
                    "type": "text",
                    "index": false
                },
                "stemmed_wording": {
                    "type": "text",
                    "index": false
                },
                "wording": {
                    "type": "text",
                    "fields": {
                        "sort": {
                            "type": "keyword",
                            "normalizer": "lowercaseasciifolding"
                        },
                        "matching": {
                            "type": "text",
                            "analyzer": "french"
                        }
                    }
                }
            }
        }
    }
}

When running the following query, it matches:

{  
   "explain":false,
   "from":0,
   "size":100,
   "query":{  
      "bool":{  
         "filter":[  
            {  
               "term":{  
                  "account_id":"2590000"
               }
            },
            {  
               "range":{  
                  "rdate":[  
                     {  
                        "gte":"2018-05-13"
                     }
                  ]
               }
            }
         ],
         "must":[  
            {  
               "query_string":{  
                  "default_field":"wording",
                  "query":"Itunes.Com\\/bill",
                  "analyze_wildcard":"true"
               }
            }
         ]
      }
   },
   "sort":[  
      {  
         "rdate":{  
            "order":"desc"
         }
      },
      "_doc"
   ]
}

But with this other query, ES doesn't match any results (note the * wrapped around the query string):

{  
   "explain":false,
   "from":0,
   "size":100,
   "query":{  
      "bool":{  
         "filter":[  
            {  
               "term":{  
                  "account_id":"2590000"
               }
            },
            {  
               "range":{  
                  "rdate":[  
                     {  
                        "gte":"2018-05-13"
                     }
                  ]
               }
            }
         ],
         "must":[  
            {  
               "query_string":{  
                  "default_field":"wording",
                  "query":"*Itunes.Com\\/bill*",
                  "analyze_wildcard":"true"
               }
            }
         ]
      }
   },
   "sort":[  
      {  
         "rdate":{  
            "order":"desc"
         }
      },
      "_doc"
   ]
}

Can anyone help me? Is there a workaround to make this work?
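One common workaround for this class of problem (a sketch, assuming the field's analyzer splits on punctuation like "." and "/"): tokenize the user input yourself and AND together one wildcard clause per token, so each wildcard term can match an individual indexed token. The helper name is made up.

```python
import re

def tokenized_wildcard_query(field: str, needle: str) -> str:
    """Build a query_string query with one *token* wildcard per token,
    instead of wrapping the raw, punctuated input in a single wildcard."""
    tokens = re.findall(r'\w+', needle.lower())
    return ' AND '.join(f'{field}:*{tok}*' for tok in tokens)

print(tokenized_wildcard_query('wording', 'Itunes.Com/bill'))
# wording:*itunes* AND wording:*com* AND wording:*bill*
```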


jimczi commented May 27, 2019

Closing as duplicate of #25940

@jimczi jimczi closed this as completed May 27, 2019