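The whitespace tokenizer splits only on Java whitespace, as defined by Character.isWhitespace(char): http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)
A useful improvement would be to also support Unicode whitespace, as defined by the White_Space property in the Unicode property list: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt. The Java definition excludes, for example, the no-break space U+00A0, which the Unicode list includes.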
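
To illustrate the gap between the two definitions, here is a minimal sketch in Java. The class name `UnicodeWhitespace` and the method `isUnicodeWhitespace` are hypothetical and not part of Lucene; the code point table is hand-copied from the White_Space entries in PropList.txt and should be checked against the Unicode version you target.

```java
// Hypothetical helper, not Lucene code: compares Java whitespace
// with the Unicode White_Space property.
final class UnicodeWhitespace {
    private UnicodeWhitespace() {}

    /**
     * Returns true if the code point has the Unicode White_Space property.
     * Assumption: the table below mirrors a recent PropList.txt.
     */
    static boolean isUnicodeWhitespace(int cp) {
        switch (cp) {
            case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D: // control whitespace
            case 0x0020: // SPACE
            case 0x0085: // NEXT LINE (NEL)
            case 0x00A0: // NO-BREAK SPACE
            case 0x1680: // OGHAM SPACE MARK
            case 0x2028: // LINE SEPARATOR
            case 0x2029: // PARAGRAPH SEPARATOR
            case 0x202F: // NARROW NO-BREAK SPACE
            case 0x205F: // MEDIUM MATHEMATICAL SPACE
            case 0x3000: // IDEOGRAPHIC SPACE
                return true;
            default:
                // EN QUAD through HAIR SPACE
                return cp >= 0x2000 && cp <= 0x200A;
        }
    }

    public static void main(String[] args) {
        // Sample code points where the two definitions agree or differ.
        int[] samples = {0x0020, 0x0085, 0x00A0, 0x2007, 0x202F, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X java=%b unicode=%b%n",
                    cp, Character.isWhitespace(cp), isUnicodeWhitespace(cp));
        }
    }
}
```

The two sets differ in both directions: the non-breaking spaces (U+00A0, U+2007, U+202F) and NEL (U+0085) are Unicode whitespace but not Java whitespace, while the separator controls U+001C through U+001F are Java whitespace but not in the Unicode list. A tokenizer that supports both would probably need to let the caller choose the definition.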
Migrated from LUCENE-5096 by Jörg Prante (@jprante), resolved Dec 08 2015