Unicode trouble with lemma_ #32

rsomeon · 2015-03-10T20:16:08Z

from spacy.en import English
nlp = English()
nlp(u'me…')[0].lemma_

results in an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 2: unexpected end of data


syllog1sm · 2015-03-10T23:30:30Z

Is the encoding of your terminal/text file set to UTF8?
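One quick way to answer that question is to print the encodings Python itself reports. This is only a sketch using the standard library; `report_encodings` is a hypothetical helper name, not part of spaCy.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this session."""
    return {
        # Encoding used when printing to the terminal (may be None when piped).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Interpreter-wide default for implicit str/bytes conversions.
        "default": sys.getdefaultencoding(),
        # Encoding the locale suggests for text files.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(report_encodings().items()):
    print(name, value)
```

If any of these is not UTF-8, non-ASCII literals typed at the prompt may already be mangled before they reach the library.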

rsomeon · 2015-03-10T23:53:08Z

Fix and tests:
#33

NSchrading · 2015-03-16T06:23:55Z

I actually have this issue as well. You can test it with the word "fiancé"

s = "fiancé"
tok = nlp(s)
print(tok[0].lemma_)

  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5: unexpected end of data
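Both tracebacks are consistent with a UTF-8 byte string being cut off at the character count rather than the byte count. That is a guess from the error messages, not something confirmed in this thread. The pure-Python sketch below reproduces the same kind of failure without spaCy; `decode_truncated` is a hypothetical helper written only for illustration.

```python
def decode_truncated(text):
    """Encode text as UTF-8, keep only len(text) bytes, then decode.

    Assumption: this imitates a store that records the character count
    where the byte count is needed. It is not spaCy's actual code.
    """
    raw = text.encode("utf-8")
    # Non-ASCII characters take more than one byte, so this slice can
    # end in the middle of a multi-byte sequence.
    return raw[:len(text)].decode("utf-8")


for word in (u"me\u2026", u"fianc\u00e9"):
    try:
        decode_truncated(word)
    except UnicodeDecodeError as err:
        # Report where decoding stopped and why.
        print(err.start, err.reason)
```

Under Python 3 the reported positions (2 and 5) and the reason ("unexpected end of data") line up with the two tracebacks above, while plain ASCII input round-trips without error.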

syllog1sm · 2015-03-25T13:43:13Z

Sorry to leave this for so long. I'm working on securing a major contract that would ensure this project stays funded for a long time.

This is the first pull request to the code itself that I've wanted to merge, and I stalled on setting up the Contributors' License Agreement stuff.

I've adapted the Oracle Contributor's Agreement, and am using the signing process that Medium use, where you attach a file to the first pull request you make with a given GitHub username. This seems unambiguous enough.

I know that ignoring this for two weeks isn't the right way to make you feel like the project is worth bothering with, and I understand if you can't accept the CLA terms for whatever reason. But, if you still want to contribute this patch, please follow the steps here: https://github.com/honnibal/spaCy/blob/master/contributors/cla.md

rsomeon · 2015-03-26T10:28:23Z

It's me who should be apologetic, since I forgot about the dual-licensing. I should have just waited for you to come up with your own one or two line change, but the fix was so trivial I didn't have time to think twice.

I have read your CLA and you can consider it signed, and furthermore I don't want any kind of attribution. The downer is I'm not going to put my name in a pull request, because this account is a pseudonymous dumping ground for my sillier projects. We have already spent more effort talking about this than it takes to fix the bug, so my suggestion is you just commit your own fix :D Sorry for being difficult, and thanks for the library.

lock · 2018-05-09T18:31:52Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

rsomeon closed this as completed Mar 26, 2015

NSchrading mentioned this issue Apr 13, 2015

Unicode trouble with lemma_ still not fixed #51

Closed

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
