\W not equivalent to [^\w] in the pattern analyzer #895

clintongormley · 2011-04-30T11:56:52Z

According to the Java docs (http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html), \W should be equivalent to [^\w], but the pattern analyzer does not treat the two the same way. With \W+ the digits and underscores are dropped from the tokens; with [^\w]+ they are kept:

curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "one":{
                    "type": "pattern",
                    "pattern":"\\W+"
                },
                "two":{
                    "type": "pattern",
                    "pattern":"[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=one' -d '123 type_1-type_4'
# {
#   "tokens" : [ {
#     "token" : "type",
#     "start_offset" : 4,
#     "end_offset" : 8,
#     "type" : "word",
#     "position" : 1
#   }, {
#     "token" : "type",
#     "start_offset" : 11,
#     "end_offset" : 15,
#     "type" : "word",
#     "position" : 2
#   } ]
# }

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=two' -d '123 type_1-type_4'
# {
#   "tokens" : [ {
#     "token" : "123",
#     "start_offset" : 0,
#     "end_offset" : 3,
#     "type" : "word",
#     "position" : 1
#   }, {
#     "token" : "type_1",
#     "start_offset" : 4,
#     "end_offset" : 10,
#     "type" : "word",
#     "position" : 2
#   }, {
#     "token" : "type_4",
#     "start_offset" : 11,
#     "end_offset" : 17,
#     "type" : "word",
#     "position" : 3
#   } ]
# }
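
For comparison, the expected equivalence can be checked against a plain regex engine outside Elasticsearch. This is only a sketch: it uses Python's `re` module as a stand-in for `java.util.regex` (an assumption; for this ASCII input both treat `\w` as letters, digits and underscore), and the helper name `split_tokens` is hypothetical.

```python
import re

TEXT = "123 type_1-type_4"

def split_tokens(pattern, text):
    """Split text on the separator pattern and drop empty strings."""
    return [tok for tok in re.split(pattern, text) if tok]

# Same two separator patterns as analyzers "one" and "two" above.
one = split_tokens(r"\W+", TEXT)
two = split_tokens(r"[^\w]+", TEXT)

print(one)
print(two)
```

Both calls produce the same three tokens (123, type_1, type_4), which is what analyzer "two" returns and what analyzer "one" would be expected to return.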


clintongormley · 2014-07-03T19:12:18Z

The bug is in Lucene's PatternAnalyzer (deprecated), which maps the \W+ regex to an isLetter() check only, so digits and underscores are treated as separators rather than as word characters.
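
To illustrate why a letter-only check gives the output reported for the \W+ analyzer, here is a rough simulation. It is not Lucene's code: the function name `letter_tokens` is hypothetical, and Python's `str.isalpha()` is assumed to be close enough to Java's `Character.isLetter()` for this ASCII input.

```python
TEXT = "123 type_1-type_4"

def letter_tokens(text):
    """Return (token, start_offset, end_offset) for each maximal run of letters."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if ch.isalpha():
            # Open a new token at the first letter of a run.
            if start is None:
                start = i
        elif start is not None:
            # Any non-letter (digit, underscore, space, hyphen) ends the token.
            tokens.append((text[start:i], start, i))
            start = None
    if start is not None:
        tokens.append((text[start:], start, len(text)))
    return tokens

result = letter_tokens(TEXT)
print(result)
```

The simulation yields two "type" tokens at offsets 4-8 and 11-15, matching the response for the \W+ analyzer in the original report, while "123", "_1" and "_4" are lost.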

This will be fixed by switching to an analyzer which uses the PatternTokenFilter instead. Closing in favour of #6717.


clintongormley mentioned this issue Jul 3, 2014

PatternAnalyzer should use PatternTokenFilter instead #6717

Closed

clintongormley closed this as completed Jul 3, 2014
