
Minor tokenization issue with lowercase 'i' + contraction #26

Closed
NSchrading opened this issue Feb 15, 2015 · 3 comments
@NSchrading

spaCy correctly tokenizes capital "I" + contraction ('d, 'm, 'll, 've) e.g.:

from spacy.en import English
nlp = English()
tok = nlp("I'm")
print([x.lower_ for x in tok])

>>> ['i', "'m"]

but when the "I" is lowercase ("i"), it does not tokenize into two tokens:

from spacy.en import English
nlp = English()
tok = nlp("i'm")
print([x.lower_ for x in tok])

>>> ["i'm"]

Not a big deal, and this may be intentional, since we don't know whether the user meant a capital "I". Still, I can't think of any problem that would arise from tokenizing the lowercase version into two tokens as well.

@honnibal
Member

Thanks, this is a gap in the tokenization special-case data. I'll fix this. It should also handle stuff like "im".
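To illustrate what "special-case data" means here: a sketch (not spaCy's actual implementation) of a tokenizer that maps exact strings to fixed splits before falling back to a default rule. The `SPECIAL_CASES` table and `tokenize` function are hypothetical names for illustration; they show why "I'm" splits while "i'm" and "im" fall through until matching entries are added.

```python
# Hypothetical sketch of special-case tokenization, not spaCy's real code.
# Exact-match entries take priority; anything else passes through whole.
SPECIAL_CASES = {
    "I'm": ["I", "'m"],
    "I'll": ["I", "'ll"],
    "I'd": ["I", "'d"],
    "I've": ["I", "'ve"],
}

def tokenize(text):
    """Split on whitespace, then expand any special-cased word."""
    tokens = []
    for word in text.split():
        tokens.extend(SPECIAL_CASES.get(word, [word]))
    return tokens

# Closing the gap described above: add lowercase variants of each entry,
# plus non-apostrophe forms like "im".
for key, split in list(SPECIAL_CASES.items()):
    SPECIAL_CASES[key.lower()] = [t.lower() for t in split]
SPECIAL_CASES["im"] = ["i", "'m"]
```

With the extra entries in place, `tokenize("i'm")` yields `['i', "'m"]`, matching the capitalized behavior.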

@honnibal
Member

honnibal commented Mar 5, 2015

Fixed in version 0.70.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
