Tokens can be lemmatized into the empty string #719

mlehl88 · 2017-01-04T18:36:36Z

The token 's' is sometimes lemmatized into the empty string, for example when it is directly followed by an ellipsis:

>>> from spacy.en import English
>>> nlp = English()
>>> my_string = """s..."""
>>> tokens = nlp(my_string)
>>> for t in tokens:
...     print("token: %s, lemma: %s" % (t, t.lemma_))
... 
token: s, lemma: 
token: ..., lemma: ...

In contrast, this does not happen when 's' is not followed by an ellipsis:

>>> my_string = """s"""
>>> tokens = nlp(my_string)
>>> for t in tokens:
...     print("token: %s, lemma: %s" % (t, t.lemma_))
... 
token: s, lemma: s

This behaviour might cause unexpected results in downstream applications. One consequence is that textacy sometimes extracts the empty string as a keyword.
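Until the lemmatizer is fixed, downstream code can guard against empty lemmas itself. The following is only a minimal sketch of such a workaround, not part of spaCy or textacy: `safe_lemmas` is a hypothetical helper that takes plain (text, lemma) pairs and falls back to the token text whenever the lemma is empty or whitespace-only.

```python
from typing import Iterable, List, Tuple


def safe_lemmas(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one lemma per token.

    Hypothetical helper: if a lemma is empty or whitespace-only,
    the token's surface text is used instead.
    """
    return [lemma if lemma.strip() else text for text, lemma in pairs]


# The (text, lemma) pairs reported above for the input "s..."
observed = [("s", ""), ("...", "...")]
print(safe_lemmas(observed))  # ['s', '...']
```

With a parsed document, the pairs could be built as `((t.orth_, t.lemma_) for t in tokens)`. Whether falling back to the surface text is the right choice depends on the application; dropping such tokens entirely would be another option.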

Environment

Operating System: Mac OS X Version 10.10.5
Python Version Used: 3.6.0
spaCy Version Used: 1.5
Environment Information: Anaconda virtual environment
Spacy model: en-1.1.0

lock · 2018-05-09T01:39:07Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

mlehl88 mentioned this issue Jan 4, 2017

Empty string key term can be returned by textrank chartbeat-labs/textacy#58

Closed

ines added the bug and lang / en labels Jan 8, 2017

ines added this to the Update lemmatizer and morphology milestone Feb 18, 2017

ines added a commit that referenced this issue Mar 13, 2017

Add regression test for #719

46b17e5

honnibal closed this as completed in 413138d Mar 18, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
