Description
WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from that method's Javadoc:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
I think WhitespaceTokenizer should split on non-breaking spaces too. I am aware it's easy to work around, but why leave this trap in by default?
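To make the trap concrete, here is a minimal sketch using only the JDK. It prints how Character.isWhitespace and Character.isSpaceChar classify the three non-breaking spaces named in the Javadoc excerpt. The class name NbspWhitespaceDemo and the helper isWhitespaceIncludingNbsp are hypothetical names made up for this illustration; they are not part of Lucene or of any patch attached here.

```java
public class NbspWhitespaceDemo {

    // Hypothetical helper (an assumption for illustration, not Lucene API):
    // treats a code point as whitespace if either JDK predicate accepts it,
    // so the non-breaking spaces are included.
    public static boolean isWhitespaceIncludingNbsp(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // Plain space, tab, then the three non-breaking spaces from the Javadoc.
        int[] codePoints = {' ', '\t', 0x00A0, 0x2007, 0x202F};
        for (int cp : codePoints) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b combined=%b%n",
                    cp,
                    Character.isWhitespace(cp),
                    Character.isSpaceChar(cp),
                    isWhitespaceIncludingNbsp(cp));
        }
    }
}
```

Neither JDK predicate covers everything on its own: isWhitespace rejects the non-breaking spaces, while isSpaceChar rejects control characters such as tab. That is why the sketch combines the two.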
Migrated from LUCENE-6874 by David Smiley (@dsmiley), resolved Nov 14 2015
Attachments: icu-datasucker.patch, LUCENE_6874_jflex.patch, LUCENE-6874.patch, LUCENE-6874-chartokenizer.patch (versions: 3), LUCENE-6874-jflex.patch, unicode-ws-tokenizer.patch (versions: 3)
Linked issues: