Analysis: Add the pattern_capture token filter #3340

Closed
clintongormley opened this Issue Jul 16, 2013 · 0 comments

clintongormley commented Jul 16, 2013

The pattern_capture token filter, unlike the pattern tokenizer, emits a token for every capture group in the regular expression. Patterns are not anchored to the beginning and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.

For instance, a pattern like:

"(([a-z]+)(\d*))"

when matched against:

"abc123def456"

would produce the tokens: [ abc123, abc, 123, def456, def, 456 ]

If preserve_original is set to true then it would also emit the original token: abc123def456.
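The behaviour described above can be approximated with a few lines of Python. This is only a sketch: Python's `re` module stands in for the Java regex engine used by Lucene, the function name `pattern_capture` is made up for illustration, and skipping empty captures is an assumption based on the token list above.

```python
import re


def pattern_capture(token, pattern, preserve_original=False):
    """Emit one token per non-empty capture group, for every match of `pattern`.

    Hypothetical helper that mimics the filter on a single input token.
    """
    out = [token] if preserve_original else []
    # finditer() is unanchored, so the pattern can match several times.
    for match in re.finditer(pattern, token):
        for group in match.groups():
            # Skip groups that did not participate or captured nothing.
            if group:
                out.append(group)
    return out


tokens = pattern_capture("abc123def456", r"(([a-z]+)(\d*))")
with_original = pattern_capture(
    "abc123def456", r"(([a-z]+)(\d*))", preserve_original=True
)
print(tokens)
print(with_original)
```

The nested groups are what make the captures overlap here: the outer group yields `abc123`, while the inner groups yield `abc` and `123` from the same match.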

This is particularly useful for indexing text like camel-case code, e.g. stripHTML, where a user may search for "strip html" or "striphtml":

curl -XPUT localhost:9200/test/  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "code" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                  "(\\d+)"
               ]
            }
         },
         "analyzer" : {
            "code" : {
               "tokenizer" : "pattern",
               "filter" : [ "code", "lowercase" ]
            }
         }
      }
   }
}
'

When used to analyze the text "import static org.apache.commons.lang.StringEscapeUtils.escapeHtml", this emits the tokens: [ import, static, org, apache, commons, lang, stringescapeutils, string, escape, utils, escapehtml, escape, html ]
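To see where those tokens come from, the analyzer chain can be imitated in Python. This is a rough sketch under several assumptions: the `pattern` tokenizer is taken to split on `\W+`, ASCII character classes stand in for `\p{Ll}` and `\p{Lu}` (Python's `re` has no `\p{...}` support), a capture identical to the whole token is assumed not to be emitted a second time when the original is preserved, and the helper names are invented for this example.

```python
import re

# ASCII stand-ins for "(\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}+)" and "(\d+)".
CODE_PATTERNS = [r"([a-z]+|[A-Z][a-z]+|[A-Z]+)", r"(\d+)"]


def capture_filter(token, patterns, preserve_original=True):
    """Hypothetical stand-in for the pattern_capture filter on one token."""
    out = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            for index in range(1, match.re.groups + 1):
                start, end = match.span(index)
                if start == end:
                    continue  # empty or non-participating group
                if preserve_original and (start, end) == (0, len(token)):
                    continue  # same as the original token, already emitted
                out.append(token[start:end])
    return out


def analyze_code(text):
    """Tokenize on non-word characters, apply the filter, then lowercase."""
    words = [w for w in re.split(r"\W+", text) if w]
    tokens = []
    for word in words:
        tokens.extend(capture_filter(word, CODE_PATTERNS))
    return [t.lower() for t in tokens]


print(analyze_code(
    "import static org.apache.commons.lang.StringEscapeUtils.escapeHtml"
))
print(analyze_code("stripHTML"))
```

Lowercasing happens after the capture step, which is why the camel-case boundaries are still visible to the patterns.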

Another example is analyzing email addresses:

curl -XPUT localhost:9200/test/  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "email" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "(\\w+)",
                  "(\\p{L}+)",
                  "(\\d+)",
                  "@(.+)"
               ]
            }
         },
         "analyzer" : {
            "email" : {
               "tokenizer" : "uax_url_email",
               "filter" : [ "email", "lowercase",  "unique" ]
            }
         }
      }
   }
}
'

When the above analyzer is used on an email address like john-smith_123@foo-bar.com it would produce the following tokens: [john-smith_123@foo-bar.com, john, smith_123, smith, 123, foo, foo-bar.com, bar, com]

Multiple patterns are required to allow overlapping captures, but this also means that each pattern is less dense and easier to understand.
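The need for several patterns can be illustrated with a small Python sketch. It assumes Python's `re` behaves like the Java engine for these simple expressions, uses `[^\W\d_]+` as a stand-in for `\p{L}+`, and the `captures` helper is hypothetical. De-duplication here plays the role of the `unique` filter.

```python
import re


def captures(token, patterns):
    """Collect the first capture group of every match, dropping duplicates."""
    seen = []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            text = match.group(1)
            if text and text not in seen:
                seen.append(text)
    return seen


email = "john-smith_123@foo-bar.com"

# A single pattern never matches overlapping text, so "smith" is missing.
single = captures(email, [r"(\w+)"])

# Each extra pattern scans the token again and can overlap earlier matches.
multiple = captures(email, [r"(\w+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)"])

print(single)
print(multiple)
```

With only `(\w+)`, the text `smith_123` is consumed as one match, so neither `smith` nor `123` can be captured separately; the additional patterns recover them.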

Note: All tokens are emitted at the same position and with the same character offsets, so when combined with highlighting, the whole original token is highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight "<em>john-smith_123@foo-bar.com</em>", not "john-<em>smith</em>_123@foo-bar.com".
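The highlighting consequence follows directly from the shared offsets, as this Python sketch tries to show. The `Token` tuple, `capture_tokens` and `highlight` are all hypothetical and far simpler than a real highlighter; the only point is that every captured term carries the offsets of the original token.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start: int     # character offsets of the ORIGINAL token
    end: int
    position: int  # all captures share one position


def capture_tokens(text, start, position, patterns):
    """Emit the original token plus its captures, all with the same offsets."""
    end = start + len(text)
    terms = [text]
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            if match.group(1) and match.group(1) not in terms:
                terms.append(match.group(1))
    return [Token(t.lower(), start, end, position) for t in terms]


def highlight(source, tokens, query):
    """Wrap the offsets of the first token whose term equals the query."""
    for token in tokens:
        if token.term == query.lower():
            return (source[:token.start] + "<em>"
                    + source[token.start:token.end] + "</em>"
                    + source[token.end:])
    return source


email = "john-smith_123@foo-bar.com"
tokens = capture_tokens(email, 0, 0, [r"([^\W\d_]+)"])
print(highlight(email, tokens, "smith"))
```

Because the `smith` token records offsets 0 to 26 rather than 5 to 10, the highlighter has no way to mark only the matching substring.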

clintongormley added a commit that referenced this issue Jul 16, 2013

Added the "pattern_capture" token filter from Lucene 4.4
The XPatternCaptureGroupTokenFilter.java file can be removed once we
upgrade to Lucene 4.4.

This change required the addition of the commaDelimited flag to getAsArray()
to disable parsing strings as comma-delimited values.

Closes #3340
