
Wildcard search with special characters: Differences between ES 2.x and 5.x #22989

Closed · mweibel opened this issue Feb 6, 2017 · 8 comments
Labels: >docs General docs changes · help wanted adoptme · :Search/Search Search-related issues that do not fall into other categories
mweibel commented Feb 6, 2017

I'm in the process of upgrading a project from ES 1.7 to ES 5.1 and noticed an integration test failing. After further testing I confirmed that the actual change happened between 2.x and 5.x.
Below is a minimal repro case.

When searching with a wildcard for "brö", I get the same results in both versions. When searching for "brö{"/)?", which gets escaped as "brö{\\"/)", I get 2 results in ES 2.x and no results in ES 5.x.

I searched the breaking-changes documentation for ES 2.x and ES 5.x but could not find anything related to this. I also looked for new query string settings that could help here, but couldn't find any.
I first opened a forum post with the same details: https://discuss.elastic.co/t/wildcard-search-with-special-characters-differences-between-es-1-7-and-es-5-1/73720/1
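The escaping step mentioned above follows Lucene's query_string rules: each reserved syntax character gets a backslash prefix. A minimal sketch (the helper name `escape_query_string` is my own; the character list is the documented query_string reserved set):

```python
# Sketch of query_string escaping: each Lucene reserved character is
# prefixed with a backslash. Helper name is hypothetical.
RESERVED = set('+-=&|><!(){}[]^"~*?:\\/')

def escape_query_string(text: str) -> str:
    """Backslash-escape characters that query_string treats as syntax."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)

print(escape_query_string('brö{"/)'))  # brö\{\"\/\)
```

Note that each backslash must then be escaped again when the query is embedded in a JSON request body, which is why the queries below contain doubled backslashes.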

How to reproduce

ES 5 index creation

curl -X PUT 'http://localhost:9200/testwildcard' -d '{
  "mappings": {
    "testwildcard": {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "search": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}'

ES 2.x index creation

curl -XPUT 'http://localhost:9200/testwildcard' -d '{
  "mappings": {
    "testwildcard": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "fields": {
            "search": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}'

Insert data

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/1' -d '{
 "name": "Brötchen"
}'

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/2' -d '{
 "name": "Frischbackbrötchen"
}'

curl -XPUT 'http://localhost:9200/testwildcard/testwildcard/3' -d '{
 "name": "Brot"
}'

works in both ES 2.x and ES 5.1:

curl -XPOST 'http://localhost:9200/testwildcard/_search' -d '{
  "explain": true,
  "query": {
    "query_string": {
      "query": "name.search:*brö*",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}'

doesn't work in ES 5.1 but works in ES 2.x:

curl -XPOST 'http://localhost:9200/testwildcard/_search' -d '{
  "explain": true,
  "query": {
    "query_string": {
      "query": "name.search:*brö\\{\\\\\\\"\\/\\)*",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}'

jpountz commented Feb 8, 2017

Good catch, the behaviour did indeed change: we now only apply the character-level filters of the analysis chain (like ASCII folding or lowercasing) to wildcard terms, which means we ignore the tokenizer and some token filters, such as synonym or stemming filters.

For most users this should be transparent and might even fix queries that did not work before. In this case, however, it causes the punctuation characters to remain in *brö\\{\\\\\\\"\\/\\)*, which means the query cannot match anything, since your analyzer discards punctuation.

It is unclear to me how name.search:*brö\\{\\\\\\\"\\/\\)* should be parsed, so I am tempted to just document this as a possible breaking change. I'll leave this issue open with the discuss label to see what others think.
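The difference described above can be illustrated with a toy model (function names and the regex-based "analyzer" are my own simplifications, not Elasticsearch internals): wildcard terms only get character-level normalization, while indexed text goes through the full chain, including a tokenizer that discards punctuation.

```python
import re
import unicodedata

def char_level_normalize(term: str) -> str:
    """Toy model of 5.x wildcard-term treatment: lowercase + ASCII-fold,
    but keep punctuation (no tokenizer is applied)."""
    folded = unicodedata.normalize('NFKD', term.lower())
    return ''.join(c for c in folded if not unicodedata.combining(c))

def full_analyze(text: str) -> list:
    """Toy model of full analysis: the tokenizer also strips punctuation,
    so indexed tokens never contain it."""
    return re.findall(r'\w+', char_level_normalize(text))

indexed = full_analyze('Brötchen')            # ['brotchen']
wildcard = char_level_normalize('brö{"/)')    # 'bro{"/)'
# The punctuation survives in the wildcard term but never appears in any
# indexed token, so *bro{"/)* cannot match.
print(indexed, wildcard)
```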


mweibel commented Feb 8, 2017

@jpountz thanks for your reply and confirmation :) I'd also vote for simply documenting this as a breaking change. Am I correct in assuming that the new but experimental normalizer feature could be used as an alternative (besides ensuring only valid characters get passed in the search string)?


jpountz commented Feb 8, 2017

Actually, the normalizer feature works on top of this normalization, which was originally designed to make query parsing "do the right thing" when analyzing partial terms for prefix, wildcard or fuzzy queries.


jimczi commented Feb 9, 2017

> I am unclear how name.search:brö\{\\\"\/\) should be parsed so I am tempted to just document this as a possible breaking change. I'll leave this issue open with the discuss label to see what others think.

I think the situation is not ideal, because the punctuation is removed by the tokenizer even though that could be considered part of the normalization phase. An alternative would be to add a char_filter that removes the punctuation, but that could break the tokenization...
Just an idea, but now that we have a normalizer type, maybe we could allow defining both a normalizer AND an analyzer for text fields only?
It would also make sense to restrict analyze_wildcard to a single leading and trailing wildcard. Today it is "correctly" applied to prefix queries but not to suffix queries. IMO, for complex wildcard queries we should document that this setting is not honored.
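The char_filter alternative mentioned above might look like the following index-settings body (a sketch only, with made-up names; as noted, stripping punctuation before the tokenizer runs can itself break tokenization). It uses the pattern_replace char filter with a Java-regex punctuation class:

```python
# Hypothetical settings body illustrating a punctuation-stripping
# char_filter; all names ("strip_punctuation", "no_punct") are made up.
settings = {
    "analysis": {
        "char_filter": {
            "strip_punctuation": {
                "type": "pattern_replace",
                "pattern": "[\\p{Punct}]",   # Java regex class used by ES
                "replacement": ""
            }
        },
        "analyzer": {
            "no_punct": {
                "type": "custom",
                "char_filter": ["strip_punctuation"],
                "tokenizer": "standard",
                "filter": ["lowercase"]
            }
        }
    }
}
```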

@clintongormley clintongormley added >docs General docs changes and removed discuss labels Feb 17, 2017

jpountz commented Feb 17, 2017

We agreed in FixitFriday to document this behaviour and to potentially adapt it based on feedback when we get some.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@mayya-sharipova

cc @elastic/es-search-aggs

We still need to better document the details of how analyze_wildcard works.

@cbuescher cbuescher added the help wanted adoptme label Nov 12, 2018

xtrembaker commented Nov 14, 2018

Hello all!

I'm currently using Elasticsearch v6.2.3, and I encountered a similar problem.

I have this mapping on my index:

{
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercaseasciifolding": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": false,
            "properties": {
                "uuid": {
                    "type": "text",
                    "index": false
                },
                "id": {
                    "type": "long",
                    "index": false
                },
                "account_id": {
                    "type": "integer",
                    "index": true
                },
                "budgea_category_id": {
                    "type": "integer",
                    "index": false
                },
                "budgea_id": {
                    "type": "text",
                    "index": false
                },
                "category_id": {
                    "type": "integer",
                    "index": true
                },
                "date": {
                    "type": "date",
                    "index": true
                },
                "rdate": {
                    "type": "date",
                    "index": true
                },
                "deleted_at": {
                    "type": "date",
                    "index": false
                },
                "updated_at": {
                    "type": "date",
                    "index": false
                },
                "created_at": {
                    "type": "date",
                    "index": false
                },
                "value": {
                    "type": "float",
                    "index": false
                },
                "type": {
                    "type": "text",
                    "index": false
                },
                "original_wording": {
                    "type": "text",
                    "index": false
                },
                "stemmed_wording": {
                    "type": "text",
                    "index": false
                },
                "wording": {
                    "type": "text",
                    "fields": {
                        "sort": {
                            "type": "keyword",
                            "normalizer": "lowercaseasciifolding"
                        },
                        "matching": {
                            "type": "text",
                            "analyzer": "french"
                        }
                    }
                }
            }
        }
    }
}

When running the following query, it matches:

{  
   "explain":false,
   "from":0,
   "size":100,
   "query":{  
      "bool":{  
         "filter":[  
            {  
               "term":{  
                  "account_id":"2590000"
               }
            },
            {  
               "range":{  
                  "rdate":[  
                     {  
                        "gte":"2018-05-13"
                     }
                  ]
               }
            }
         ],
         "must":[  
            {  
               "query_string":{  
                  "default_field":"wording",
                  "query":"Itunes.Com\\/bill",
                  "analyze_wildcard":"true"
               }
            }
         ]
      }
   },
   "sort":[  
      {  
         "rdate":{  
            "order":"desc"
         }
      },
      "_doc"
   ]
}

But with this other query, ES doesn't match any results (note the * wrapped around the query string):

{  
   "explain":false,
   "from":0,
   "size":100,
   "query":{  
      "bool":{  
         "filter":[  
            {  
               "term":{  
                  "account_id":"2590000"
               }
            },
            {  
               "range":{  
                  "rdate":[  
                     {  
                        "gte":"2018-05-13"
                     }
                  ]
               }
            }
         ],
         "must":[  
            {  
               "query_string":{  
                  "default_field":"wording",
                  "query":"*Itunes.Com\\/bill*",
                  "analyze_wildcard":"true"
               }
            }
         ]
      }
   },
   "sort":[  
      {  
         "rdate":{  
            "order":"desc"
         }
      },
      "_doc"
   ]
}

Can anyone help me? Is there a workaround to make this work?
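One common workaround for this class of problem (a sketch, assuming the field's analyzer splits on punctuation like "." and "/"): tokenize the user input yourself and AND together one wildcard clause per token, so each wildcard term can match an individual indexed token. The helper name is made up.

```python
import re

def tokenized_wildcard_query(field: str, needle: str) -> str:
    """Build a query_string query with one *token* wildcard per token,
    instead of wrapping the raw, punctuated input in a single wildcard."""
    tokens = re.findall(r'\w+', needle.lower())
    return ' AND '.join(f'{field}:*{tok}*' for tok in tokens)

print(tokenized_wildcard_query('wording', 'Itunes.Com/bill'))
# wording:*itunes* AND wording:*com* AND wording:*bill*
```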


jimczi commented May 27, 2019

Closing as duplicate of #25940

@jimczi jimczi closed this as completed May 27, 2019