Tokenization of html UTF-8 chars #15

cakelly · 2016-09-01T15:19:32Z

[This issue imported from https://github.com/emorynlp/nlp4j-tokenization/issues/9]
I am working on a comparison of tokenizers for microblog texts, and am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

This issue involves HTML-encoded characters such as `&amp;` and `&lt;`, each of which is split into three separate tokens (`&`, the entity name, and `;`). This is not a problem with other tokenizers. E.g. the following text

**I'd like this `&amp;` that `&lt;` the other**

is parsed as

4   this    this    DT  _   3   dobj    _   O
5   &   &   CC  _   4   cc  _   O
6   amp amp NN  _   4   conj    _   O
7   ;   ;   ,   pos2=:  3   punct   _   O
8   that    that    DT  pos2=WDT    3   dep _   O
9   &   &   CC  pos2=NFP    8   cc  _   O
10  lt  lt  JJ  pos2=NN 8   conj    _   O
11  ;   ;   :   pos2=,  13  punct   _   O
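
Until the tokenizer handles these entities itself, one possible workaround is to decode them before the text reaches NLP4J. The sketch below is only an assumption about how such a pre-processing step could look: it runs outside NLP4J, uses Python's standard `html.unescape`, and the helper name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML character references (e.g. '&amp;' -> '&') in the raw text.

    Hypothetical pre-processing helper; it is not part of NLP4J.
    """
    return html.unescape(text)


raw = "I'd like this &amp; that &lt; the other"
clean = decode_entities(raw)
print(clean)  # I'd like this & that < the other
```

The decoded string can then be passed to the tokenizer, so that `&` and `<` arrive as single characters rather than as entity sequences.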
