WhitespaceTokenizer should tokenize on NBSP [LUCENE-6874]

WhitespaceTokenizer uses [Character.isWhitespace](http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-) to decide what is whitespace. Here's a pertinent excerpt from its Javadoc:

> It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')


Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
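
To make the quoted behavior concrete, here is a minimal JDK-only sketch that prints what `Character.isWhitespace` and `Character.isSpaceChar` report for the three non-breaking spaces named in the excerpt. The class name `NbspCheck` and its helper are hypothetical and exist only for this illustration; nothing here is Lucene code.

```java
public class NbspCheck {
    /** The three code points the Javadoc excerpt lists as non-breaking spaces. */
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    /** True when the JDK classifies cp as a Unicode space but not as "whitespace". */
    static boolean isSpaceButNotWhitespace(int cp) {
        return Character.isSpaceChar(cp) && !Character.isWhitespace(cp);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
    }
}
```

For all three code points `isWhitespace` is false while `isSpaceChar` is true, which is why a tokenizer built on `isWhitespace` alone keeps text joined by NBSP as a single token.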

I think WhitespaceTokenizer should tokenize on these non-breaking spaces as well. I am aware it's easy to work around, but why leave this trap in by default?
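
As a rough illustration of the kind of workaround alluded to above, the sketch below splits text on anything `Character.isWhitespace` accepts plus the three non-breaking spaces. It is plain JDK code, not the Lucene API and not any of the attached patches: `NbspAwareSplitter`, `isTokenChar`, and `tokenize` are hypothetical names chosen for this example. In Lucene itself, a similar predicate would presumably live in a `CharTokenizer` subclass's `isTokenChar(int)` override.

```java
import java.util.ArrayList;
import java.util.List;

public class NbspAwareSplitter {
    /** Hypothetical predicate: a token char is anything that is neither whitespace nor a no-break space. */
    static boolean isTokenChar(int cp) {
        boolean noBreakSpace = cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
        return !(Character.isWhitespace(cp) || noBreakSpace);
    }

    /** Splits text into maximal runs of token chars, iterating by code point. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // NBSP between "foo" and "bar", a regular space before "baz".
        System.out.println(tokenize("foo\u00A0bar baz")); // prints [foo, bar, baz]
    }
}
```

With `Character.isWhitespace` alone as the separator test, the same input would yield two tokens, with `foo` and `bar` still joined by the NBSP.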



---
Migrated from [LUCENE-6874](https://issues.apache.org/jira/browse/LUCENE-6874) by David Smiley (@dsmiley), resolved Nov 14 2015
Attachments: [icu-datasucker.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/icu-datasucker.patch), [LUCENE_6874_jflex.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/LUCENE_6874_jflex.patch), [LUCENE-6874.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/LUCENE-6874.patch), [LUCENE-6874-chartokenizer.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/LUCENE-6874-chartokenizer.patch) (versions: 3), [LUCENE-6874-jflex.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/LUCENE-6874-jflex.patch), [unicode-ws-tokenizer.patch](https://apache.github.io/lucene-jira-archive/attachments/LUCENE-6874/unicode-ws-tokenizer.patch) (versions: 3)
Linked issues:
- #6160
- #7937


