\W not equivalent to [^\w] in the pattern analyzer #895

Closed
clintongormley opened this issue Apr 30, 2011 · 1 comment

Comments

@clintongormley

According to the Java docs (http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html), \W should be equivalent to [^\w], but the pattern analyzer does not treat the two the same way:

curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "one":{
                    "type": "pattern",
                    "pattern":"\\W+"
                },
                "two":{
                    "type": "pattern",
                    "pattern":"[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=one' -d '123 type_1-type_4'
# {
#   "tokens" : [ {
#     "token" : "type",
#     "start_offset" : 4,
#     "end_offset" : 8,
#     "type" : "word",
#     "position" : 1
#   }, {
#     "token" : "type",
#     "start_offset" : 11,
#     "end_offset" : 15,
#     "type" : "word",
#     "position" : 2
#   } ]
# }

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=two' -d '123 type_1-type_4'
# {
#   "tokens" : [ {
#     "token" : "123",
#     "start_offset" : 0,
#     "end_offset" : 3,
#     "type" : "word",
#     "position" : 1
#   }, {
#     "token" : "type_1",
#     "start_offset" : 4,
#     "end_offset" : 10,
#     "type" : "word",
#     "position" : 2
#   }, {
#     "token" : "type_4",
#     "start_offset" : 11,
#     "end_offset" : 17,
#     "type" : "word",
#     "position" : 3
#   } ]
# }
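
As a cross-check outside Elasticsearch, the two patterns can be compared in a plain regex engine. The sketch below is only an approximation: it uses Python's re module with the re.ASCII flag as a stand-in for Java's default \w ([a-zA-Z_0-9]), and the helper name split_tokens is hypothetical. Under those assumptions both patterns split the sample text identically, which is what analyzer "two" returns above.

```python
import re


def split_tokens(pattern: str, text: str) -> list[str]:
    """Split text on a separator pattern and drop empty tokens.

    Hypothetical helper for illustration; not part of Elasticsearch or Lucene.
    """
    # re.ASCII restricts \w to [a-zA-Z0-9_], approximating Java's default \w.
    return [tok for tok in re.split(pattern, text, flags=re.ASCII) if tok]


text = "123 type_1-type_4"
tokens_one = split_tokens(r"\W+", text)     # pattern used by analyzer "one"
tokens_two = split_tokens(r"[^\w]+", text)  # pattern used by analyzer "two"

print(tokens_one)
print(tokens_two)
```

If the two printed lists match, the difference seen in the analyzer output is not coming from the regex semantics themselves.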
@clintongormley

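The bug is in Lucene's PatternAnalyzer (deprecated), which replaces the \W+ regex with an isLetter() check only. Digits and underscores are therefore treated as separators instead of word characters, which is why "123" is dropped and "type_1" is cut down to "type".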
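
To illustrate the effect, here is a minimal sketch of a letters-only tokenizer. It is not Lucene's code: the function name letter_only_tokens is hypothetical, and Python's str.isalpha() is assumed to be a close enough stand-in for Java's Character.isLetter() on this ASCII input.

```python
def letter_only_tokens(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start_offset, end_offset) for each run of letters.

    Hypothetical illustration of an isLetter()-style tokenizer; anything
    that is not a letter (digits, underscore, punctuation) ends a token.
    """
    tokens = []
    start = None  # start offset of the current letter run, if any
    for i, ch in enumerate(text):
        if ch.isalpha():
            if start is None:
                start = i
        elif start is not None:
            tokens.append((text[start:i], start, i))
            start = None
    if start is not None:  # flush a run that reaches the end of the text
        tokens.append((text[start:], start, len(text)))
    return tokens


result = letter_only_tokens("123 type_1-type_4")
print(result)
```

On the sample text this produces two "type" tokens with offsets 4-8 and 11-15, the same tokens and offsets that analyzer "one" reported.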

This will be fixed by switching to an analyzer that uses the PatternTokenFilter instead. Closing in favour of #6717.
