Regex in Completion Suggestion not working for Unicode (simple analyzer). #33305

Open
bbfsdev opened this issue Aug 31, 2018 · 6 comments
Labels
>bug :Search/Suggesters "Did you mean" and suggestions as you type Team:Search Meta label for search team

Comments

@bbfsdev

bbfsdev commented Aug 31, 2018

Elastic
Version: 6.3.0, Build: default/tar/424e937/2018-06-11T23:38:03.357887Z, JVM: 1.8.0_161
No plugins

Java
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS
Linux kk-dev 3.19.0-80-generic #88~14.04.1-Ubuntu SMP Fri Jan 13 14:54:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description:
Regex in Completion Suggestion not working for Unicode (simple analyzer).
Eventually I would like to use analyzers other than the simple analyzer.

Steps to reproduce:
PUT /example
{
"mappings": {
"_doc": {
"properties": {
"suggest_field": {
"type": "completion"
}
}
}
}
}

PUT example/_doc/0?refresh
{
"suggest_field" : {
"input": ["אחת שניים שלוש", "שניים שלוש ארבע"]
}
}

PUT example/_doc/1?refresh
{
"suggest_field" : {
"input": ["one two three", "two three four"]
}
}

// This one works well
GET example/_doc/_search
{
"suggest": {
"suggest_field": {
"regex": ".r.",
"completion": {
"field": "suggest_field"
}
}
}
}

// Result:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"suggest_field": [
{
"text": ".r.",
"offset": 0,
"length": 5,
"options": [
{
"text": "one two three",
"_index": "example",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"suggest_field": {
"input": [
"one two three",
"two three four"
]
}
}
}
]
}
]
}
}

// This one fails, i.e., returns empty options
GET example/_doc/_search
{
"suggest": {
"suggest_field": {
"regex": ".א.",
"completion": {
"field": "suggest_field"
}
}
}
}

{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"suggest_field": [
{
"text": ".א.",
"offset": 0,
"length": 5,
"options": []
}
]
}
}

Thanks!

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@astefan astefan added the :Search/Suggesters "Did you mean" and suggestions as you type label Aug 31, 2018
@bbfsdev
Author

bbfsdev commented Aug 31, 2018

Correcting the JSON requests, since Markdown broke them in the original post:

PUT /example
{
  "mappings": {
    "_doc": {
      "properties": {
        "suggest_field": {
          "type": "completion"
        }
      }
    }
  }
}

PUT example/_doc/0?refresh
{
    "suggest_field" : {
      "input": ["אחת שניים שלוש", "שניים שלוש ארבע"]
    }
}

PUT example/_doc/1?refresh
{
    "suggest_field" : {
      "input": ["one two three", "two three four"]
    }
}

// This request doesn't return any results but should
GET example/_doc/_search
{
  "suggest": {
    "suggest_field": {
      "regex": ".*א.*",
      "completion": {
        "field": "suggest_field"
      }
    }
  }
} 

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "suggest_field": [
      {
        "text": ".*א.*",
        "offset": 0,
        "length": 5,
        "options": []
      }
    ]
  }
}

// This request works well.
GET example/_doc/_search
{
  "suggest": {
    "suggest_field": {
      "regex": ".*r.*",
      "completion": {
        "field": "suggest_field"
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "suggest_field": [
      {
        "text": ".*r.*",
        "offset": 0,
        "length": 5,
        "options": [
          {
            "text": "one two three",
            "_index": "example",
            "_type": "_doc",
            "_id": "1",
            "_score": 1,
            "_source": {
              "suggest_field": {
                "input": [
                  "one two three",
                  "two three four"
                ]
              }
            }
          }
        ]
      }
    ]
  }
}

@jimczi jimczi added the >bug label Sep 3, 2018
@jimczi
Contributor

jimczi commented Sep 3, 2018

There is a missing conversion in the regex completion query: the input is treated as UTF-32 code points, but the data is indexed as UTF-8. This explains why it works only with basic Latin characters. This query is in Lucene, so I created https://issues.apache.org/jira/browse/LUCENE-8480, but I wonder how useful this query is for suggestions. Infix queries like the one in the example are slow: the suggester needs to visit every prefix that matches the regex, so in this case all suggestions are visited. I wonder if we should deprecate or remove the ability to perform regex completion.
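
To illustrate the mismatch described above, here is a minimal Python sketch. It is not the Lucene code; the helper names `code_points` and `utf8_bytes` are made up for this illustration. It only shows that a basic Latin character has the same value in both views, while a Hebrew character is a single code point but two UTF-8 bytes, so a matcher that compares code points against byte-indexed data cannot line up.

```python
def code_points(text):
    """One integer per Unicode code point (a UTF-32 style view of the text)."""
    return [ord(ch) for ch in text]


def utf8_bytes(text):
    """One integer per UTF-8 byte (a byte-oriented view of the same text)."""
    return list(text.encode("utf-8"))


for ch in ("r", "א"):
    # For "r" both views are [114]; for "א" they are [1488] and [215, 144].
    print(repr(ch), code_points(ch), utf8_bytes(ch))
```

The two views agree only for characters in the ASCII range, which is consistent with the Latin example matching and the Hebrew one returning no options.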

@bbfsdev
Author

bbfsdev commented Sep 4, 2018

My corpus is not large. Currently we are happy with the speed.

What I am trying to accomplish is to match options like:
[two f] => [two three four] (skipping words), which makes completion much better.

The query is parsed into a regex. For example, [two f] is used as follows: ^two.*f.*

@jimczi
Contributor

jimczi commented Sep 7, 2018

Ok thanks @bbfsdev . We've discussed the problems with the regex completion query and decided to add a limit to the number of prefixes that can be expanded from a completion query. So for instance .*a should fail if the number of matching prefixes is greater than the limit (we haven't decided what this limit should be). This is of course unrelated to the bug you opened, which needs a proper fix. Also unrelated: completion queries are all prefix-based, so you should replace a.*f.* with a.*f. The latter should perform better since it only needs to find prefixes that end with f.
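
A minimal sketch of the word-skipping query translation discussed in this thread, written in Python and following the advice to drop the trailing `.*`. The function names `escape_word` and `build_prefix_regex` are hypothetical, and the reserved-character list is an assumption based on the Elasticsearch regular expression syntax documentation, so verify it against the version in use. No leading `^` is emitted, on the assumption that Lucene regular expressions are implicitly anchored.

```python
# Characters assumed to be reserved in Lucene regular expression syntax.
# This list is an assumption for illustration; check the docs for your version.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_word(word):
    """Backslash-escape reserved characters so the word matches literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in word)


def build_prefix_regex(query):
    """Turn 'two f' into 'two.*f': the words in order, with anything between them.

    No trailing '.*' is added, because completion queries are prefix-based.
    """
    return ".*".join(escape_word(word) for word in query.split())


print(build_prefix_regex("two f"))
```

The resulting string would go in the `regex` field of the completion suggest request shown earlier in the thread.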

@nitinvavdiya

nitinvavdiya commented Aug 1, 2019

Any update or any alternative solutions on this issue?

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020