WhitespaceTokenizer supports Java whitespace, should also support Unicode whitespace [LUCENE-5096] #6160

@asfimport

Description

The whitespace tokenizer splits only on Java whitespace, as defined by Character.isWhitespace(char): http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

A useful improvement would be to also support Unicode whitespace, as defined by the White_Space property in the Unicode property list: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
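
To make the gap concrete, here is a minimal sketch comparing Character.isWhitespace with the White_Space property. The class UnicodeWhitespaceSketch and the method isUnicodeWhitespace are hypothetical names used only for illustration; they are not part of Lucene. The code point table is an assumption: it was copied by hand from the White_Space entries in PropList.txt (Unicode 6.3 and later) and should be checked against the current file.

```java
final class UnicodeWhitespaceSketch {

    // Hypothetical helper, not a Lucene API: returns true for code points that
    // carry the Unicode White_Space property. The ranges are assumed from
    // PropList.txt (Unicode 6.3 and later) and hard-coded for illustration.
    static boolean isUnicodeWhitespace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) // TAB, LF, VT, FF, CR
            || cp == 0x0020                   // SPACE
            || cp == 0x0085                   // NEXT LINE
            || cp == 0x00A0                   // NO-BREAK SPACE
            || cp == 0x1680                   // OGHAM SPACE MARK
            || (cp >= 0x2000 && cp <= 0x200A) // EN QUAD .. HAIR SPACE
            || cp == 0x2028                   // LINE SEPARATOR
            || cp == 0x2029                   // PARAGRAPH SEPARATOR
            || cp == 0x202F                   // NARROW NO-BREAK SPACE
            || cp == 0x205F                   // MEDIUM MATHEMATICAL SPACE
            || cp == 0x3000;                  // IDEOGRAPHIC SPACE
    }

    public static void main(String[] args) {
        // Code points where the two definitions are expected to disagree,
        // plus plain SPACE as a baseline where they agree.
        int[] samples = {0x0020, 0x0085, 0x00A0, 0x2007, 0x202F, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X java=%b unicode=%b%n",
                cp, Character.isWhitespace(cp), isUnicodeWhitespace(cp));
        }
    }
}
```

Under these assumptions, the two definitions differ in both directions: Java excludes the non-breaking spaces (U+00A0, U+2007, U+202F) and U+0085, while it accepts the separator controls U+001C through U+001F, which do not have the White_Space property.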


Migrated from LUCENE-5096 by Jörg Prante (@jprante), resolved Dec 08 2015
Environment:

all
