Zero Width unicode characters #18
Thank you very much for your contribution! After some research on the case, please help me understand this PR proposal better:
Again, thank you very much for your contribution, and looking forward to your thoughts and reasoning around those two very special cases in the Unicode world! |
Thanks for the quick response! I have to agree with you: for ZWNJ it makes more sense to prune it (clearly, I did not think this through long enough). For ZWSP, I thought it would behave like a hyphen: someone put it there to separate two tokens that belong together, but did not want the separation to be visible except, e.g., at the end of a line, where the text should break at this character (whereas the hyphen makes the connection between the two tokens visible). At least, that is how I interpreted what I found on the net, but I have to admit I did not dive deeply into the Unicode documents themselves. You can indeed easily argue that it should be treated as a (weird kind of) space that is simply not visible. (Whether you treat it as a space or as a hyphen, it may be a good idea to make this processing optional?)

Meanwhile, in our Web data adventures, we identified two more weird cases: U+FE0F and U+FEFF. I think for tokenization, U+FE0F should simply be ignored, but U+FEFF deserves some thought: it is supposed to keep two tokens together (as opposed to the ZWSP). If you do add an option for the ZWSP processing mode, then maybe include ZWNBSP in a similar (but orthogonal) way. See also: |
The functionality you are suggesting for ZWSP is already supported in Unicode by the U+00AD SHY (soft hyphen) character, also known as the syllable hyphen. At least from yet another description of the purpose of the Unicode zero-width characters, ZWSP would appear to be a very special space character that annoyingly is not supported by the

As to the deprecated use of the BOM as ZWNBSP (U+FEFF), that seems to be meant to be used the same way as NBSP is, but without a visible space. That role now belongs to the Word Joiner, U+2060, which is supposed to be used in non-Indo-European scripts, so it seems you would not want to split words at this character.

U+FE0F is a Variation Selector in Unicode. I don't understand how this character fits with the rest of this discussion?

In summary, it seems the correct default behavior would then be:
|
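To make the characters discussed above easier to compare, here is a minimal Python sketch that prints what the standard library's `unicodedata` module reports for each of them. The helper name `describe` and the `CHARS` list are illustrative only and are not part of syntok.

```python
import unicodedata

# Code points mentioned in this thread, plus NBSP (U+00A0) for comparison.
CHARS = ["\u00a0", "\u00ad", "\u200b", "\u200c", "\u2060", "\ufe0f", "\ufeff"]


def describe(ch):
    """Return (code point, Unicode name, general category, str.isspace())."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isspace(),
    )


for ch in CHARS:
    print(*describe(ch), sep="\t")
```

In this listing, only NBSP is reported as whitespace by `str.isspace()`; U+200B is a format character (category `Cf`), so splitting text on whitespace alone leaves it embedded inside a token.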
Agreed. Shall I modify my pull request accordingly? (PS: Please indeed ignore the mention of U+FE0F, that is only relevant for the NLP library we use syntok together with.) |
If you are interested in doing that, I would be glad to merge a fix for U+200B space handling into syntok. So, yes, please! 👍 |
Only U+200B should be processed, and treated as a space (not a hyphen).
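The behavior agreed on here can be illustrated with a small, self-contained sketch. It is not syntok's implementation: the `TOKEN` pattern and the `simple_tokens` helper are hypothetical and only show the idea that U+200B separates tokens like a space does, while U+200C is left untouched.

```python
import re

# Hypothetical pattern: a token is a run of characters that are neither
# regular whitespace nor the zero width space (U+200B).
TOKEN = re.compile(r"[^\s\u200b]+")


def simple_tokens(text):
    """Return (offset, token) pairs, splitting on whitespace and U+200B."""
    return [(m.start(), m.group()) for m in TOKEN.finditer(text)]


# U+200B splits "foo" and "bar" just as the regular space splits "bar" and "baz".
print(simple_tokens("foo\u200bbar baz"))
# → [(0, 'foo'), (4, 'bar'), (8, 'baz')]

# U+200C (ZWNJ) is not treated as a separator, so the word stays whole.
print(len(simple_tokens("co\u200cop")))
# → 1
```

Keeping the offsets alongside the tokens means the original text, including the invisible character, can still be recovered from the token positions.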
Done! |
Thank you very much! I will create a new syntok release. |
Thanks for the coaching! |
Done; deployed with version 1.3.3. Arjen, once more, thank you very much for your contribution, the interesting findings about missing Unicode support in syntok for some of the more esoteric parts of the standard, and the great discussion! |
Great library!
Using it on an NLP task we study, I ran into a problem processing text drawn from the Web (where you find a lot of weird stuff!).
Specifically, we want to split on \u200B and \u200C, known as zero width space (ZWSP) and zero width non-joiner (ZWNJ), respectively.
This pull request modifies the code to do that by adding these characters to hyphens_and_underscore (you may want to rename the variable to also refer to ZWSP if you decide to integrate the changes; I thought, first let's see if you like the proposal). I added examples of the desired behavior to the tests.
Background info: