WhitespaceTokenizer should tokenize on NBSP [LUCENE-6874] #7932

@asfimport

WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from that method's Javadoc:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')

Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
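
To make the excerpt concrete, here is a small plain-JDK check of how the three code points it names are classified. The class name `NbspWhitespaceCheck` and its helper methods are made up for illustration and are not part of Lucene; the point is only that `Character.isWhitespace` rejects these characters while `Character.isSpaceChar` accepts them.

```java
final class NbspWhitespaceCheck {
    // The three non-breaking spaces named in the Character.isWhitespace Javadoc.
    static final char[] NON_BREAKING = {'\u00A0', '\u2007', '\u202F'};

    // True only if Character.isWhitespace rejects every non-breaking space.
    static boolean allRejectedByIsWhitespace() {
        for (char c : NON_BREAKING) {
            if (Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }

    // True only if Character.isSpaceChar accepts every non-breaking space.
    static boolean allAcceptedByIsSpaceChar() {
        for (char c : NON_BREAKING) {
            if (!Character.isSpaceChar(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Print the classification of each code point under both predicates.
        for (char c : NON_BREAKING) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
        }
    }
}
```

So text such as "foo", NBSP, "bar" has no character that `Character.isWhitespace` recognizes, which is presumably why a tokenizer built on that predicate keeps it as a single token.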

I think WhitespaceTokenizer should tokenize on non-breaking spaces too. I am aware it's easy to work around, but why leave this trap in by default?
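
As a rough sketch of the kind of workaround meant here, the separator test can be widened to accept anything that either `Character.isWhitespace` or `Character.isSpaceChar` recognizes. This is plain Java, not Lucene's API and not the change in the attached patches; `NbspAwareSplitter`, `isSeparator`, and `split` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NbspAwareSplitter {
    // Hypothetical predicate: a code point separates tokens if either JDK
    // check matches, so NBSP, figure space, and narrow NBSP are included.
    static boolean isSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Splits text into maximal runs of non-separator code points.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp)) {
                // A separator ends the current token, if one is in progress.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        // Flush the trailing token.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this predicate, `NbspAwareSplitter.split("foo\u00A0bar baz")` yields the three tokens foo, bar, and baz. Inside Lucene, the negation of such a predicate could presumably serve as the `isTokenChar(int)` of a `CharTokenizer` subclass, though that is an assumption about how one might wire it up, not a description of the committed fix.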


Migrated from LUCENE-6874 by David Smiley (@dsmiley), resolved Nov 14 2015
Attachments: icu-datasucker.patch, LUCENE_6874_jflex.patch, LUCENE-6874.patch, LUCENE-6874-chartokenizer.patch (versions: 3), LUCENE-6874-jflex.patch, unicode-ws-tokenizer.patch (versions: 3)