I am working on a comparison of tokenizers for microblog texts, and am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).
The first involves text with fancy (curly) quotes, e.g. [ “@DevTheBarbie: ] | [ #Colorado’s ], which are being lumped into the same token as the Twitter tokens they precede or follow. The ’s in "#Colorado’s" is a possessive and should be a separate token. The same goes for the opening “ in "“@DevTheBarbie:".
The online demo (http://nlp.mathcs.emory.edu:8080/nlp4j/NLP4JServlet) is handling these correctly, however.
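To make the expected behaviour concrete, here is a minimal Python sketch of the split described above. It is only an illustration, not NLP4J's tokenizer logic: the helper name `split_twitter_token` and the regular expression are assumptions made up for this example, and the trailing colon is simply passed through as leftover text.

```python
import re

# Hypothetical helper, not part of NLP4J. It separates leading quote
# characters and a trailing possessive from an @mention or #hashtag.
TWITTER_TOKEN = re.compile(r"^([“”‘’\"']*)([@#]\w+)(['’]s)?(.*)$")


def split_twitter_token(token):
    """Split one whitespace-delimited token into the pieces expected above."""
    match = TWITTER_TOKEN.match(token)
    if match is None:
        # Not an @mention or #hashtag: leave the token untouched.
        return [token]
    lead, core, possessive, rest = match.groups()
    parts = list(lead)          # each leading quote becomes its own token
    parts.append(core)          # the @mention or #hashtag itself
    if possessive:
        parts.append(possessive)  # e.g. the curly-apostrophe possessive
    if rest:
        parts.append(rest)      # anything left over, e.g. a trailing colon
    return parts


if __name__ == "__main__":
    for sample in ["“@DevTheBarbie:", "#Colorado’s"]:
        print(sample, "->", split_twitter_token(sample))
```

With the two examples from this report, the sketch separates the opening “ from @DevTheBarbie and the ’s from #Colorado, which is the behaviour being asked for.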
I'm attaching the original input files, and the parses from NLP4J.
098.conll.txt
103.conll.txt
098.txt
103.txt