Unicode trouble with lemma_ #32

Closed
rsomeon opened this issue Mar 10, 2015 · 6 comments

Comments

@rsomeon

rsomeon commented Mar 10, 2015

from spacy.en import English
nlp = English()
nlp(u'me…')[0].lemma_

results in an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 2: unexpected end of data
@syllog1sm
Contributor

Is the encoding of your terminal/text file set to UTF8?
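A quick way to answer that question from inside Python is to print the encodings the interpreter has picked up. This is only a diagnostic sketch using the standard library; the helper name `encoding_report` is made up for illustration and has nothing to do with spaCy.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is using for I/O on this machine."""
    return {
        # Encoding of the terminal streams (None if a stream is not attached).
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for implicit str/bytes conversions.
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(encoding_report().items()):
    print("{0}: {1}".format(name, value))
```

If the terminal entries are not some spelling of UTF-8, non-ASCII input typed at the prompt may already be mangled before it reaches the library.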

@rsomeon
Author

rsomeon commented Mar 10, 2015

Fix and tests:
#33

@NSchrading

I actually have this issue as well. You can test it with the word "fiancé"

from spacy.en import English

# Same setup as in the original report
nlp = English()
s = "fiancé"
tok = nlp(s)
print(tok[0].lemma_)
  File "spacy/tokens.pyx", line 439, in spacy.tokens.Token.lemma_.__get__ (spacy/tokens.cpp:8854)
  File "spacy/strings.pyx", line 73, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1652)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5: unexpected end of data
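Both tracebacks in this thread end in "unexpected end of data", with the failing byte sitting exactly at the index equal to the word's character count minus one. One way to get the same messages without spaCy is to encode the word as UTF-8 and then keep only as many bytes as the word has characters. That is an assumption about the failure mode, not something read out of `spacy/strings.pyx`; the function name `decode_truncated` is hypothetical.

```python
# -*- coding: utf-8 -*-
# Assumption: the failure looks like UTF-8 bytes being cut off at the
# character count. This only reproduces the error messages; it is not
# taken from the spaCy source.


def decode_truncated(text):
    """Encode to UTF-8, keep only len(text) bytes, then decode again."""
    raw = text.encode("utf-8")
    # len(text) counts characters, but non-ASCII characters take more
    # than one byte, so this slice can cut a multi-byte sequence in half.
    return raw[:len(text)].decode("utf-8")


# u"me\u2026" is "me…" and u"fianc\u00e9" is "fiancé".
for word in (u"me\u2026", u"fianc\u00e9"):
    try:
        decode_truncated(word)
    except UnicodeDecodeError as err:
        print(err)
```

Pure ASCII input passes through unchanged, which would explain why the problem only shows up on words like "me…" and "fiancé".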

@syllog1sm
Contributor

Sorry to leave this for so long. I'm working on securing a major contract that would ensure this project stays funded for a long time.

This is the first pull request to the code itself that I've wanted to merge, and I stalled on setting up the Contributors' License Agreement stuff.

I've adapted the Oracle Contributor's Agreement, and am using the signing process that Medium use, where you attach a file to the first pull request you make with a given GitHub username. This seems unambiguous enough.

I know that ignoring this for two weeks isn't the right way to make you feel like the project is worth bothering with, and I understand if you can't accept the CLA terms for whatever reason. But, if you still want to contribute this patch, please follow the steps here: https://github.com/honnibal/spaCy/blob/master/contributors/cla.md

@rsomeon
Author

rsomeon commented Mar 26, 2015

It's me who should be apologetic, since I forgot about the dual-licensing. I should have just waited for you to come up with your own one or two line change, but the fix was so trivial I didn't have time to think twice.

I have read your CLA and you can consider it signed, and furthermore I don't want any kind of attribution. The downer is I'm not going to put my name in a pull request, because this account is a pseudonymous dumping ground for my sillier projects. We have already spent more effort talking about this than it takes to fix the bug, so my suggestion is you just commit your own fix :D Sorry for being difficult, and thanks for the library.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018