This issue involves HTML-encoded characters such as &amp;amp; and &amp;lt;, which are split into separate tokens (for example &amp;, amp, and ;) instead of being kept together. This is not a problem with other tokenizers. E.g. the following text
**I'd like this &amp;amp; that &amp;lt; the other**
is parsed as
4 this this DT _ 3 dobj _ O
5 & & CC _ 4 cc _ O
6 amp amp NN _ 4 conj _ O
7 ; ; , pos2=: 3 punct _ O
8 that that DT pos2=WDT 3 dep _ O
9 & & CC pos2=NFP 8 cc _ O
10 lt lt JJ pos2=NN 8 conj _ O
11 ; ; : pos2=, 13 punct _ O
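
A possible workaround until this is handled in the tokenizer, sketched in Python for brevity: either decode the entities before tokenizing, or match them as single tokens. Both helper names and the regular expression below are hypothetical illustrations and are not part of NLP4J. The toy tokenizer only shows the expected grouping and makes no attempt to reproduce NLP4J's rules.

```python
import html
import re
from typing import List


def decode_entities(text: str) -> str:
    """Hypothetical pre-processing step: turn '&amp;' into '&', '&lt;' into '<', etc."""
    return html.unescape(text)


# Illustrative pattern only: an entity such as '&amp;' or '&#38;' is tried first,
# then a word (with an optional clitic like "'d"), then any single symbol.
TOKEN_RE = re.compile(r"&#?\w+;|\w+(?:'\w+)?|[^\w\s]")


def tokenize_keeping_entities(text: str) -> List[str]:
    """Hypothetical toy tokenizer that keeps each HTML entity as one token."""
    return TOKEN_RE.findall(text)


raw = "I'd like this &amp; that &lt; the other"
print(decode_entities(raw))
print(tokenize_keeping_entities(raw))
```

Decoding first changes character offsets relative to the original text, so the second approach may be preferable when token spans need to map back to the raw input.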
[This issue imported from https://github.com/emorynlp/nlp4j-tokenization/issues/9]
I am working on a comparison of tokenizers for microblog texts, and I am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).