Codec Issue #4

Closed
jd155 opened this issue Jan 11, 2016 · 5 comments

@jd155

jd155 commented Jan 11, 2016

Hi @cjhutto

When I run the code from the NLTK tutorial (http://www.nltk.org/howto/sentiment.html) on using VADER, I get the error below. I worked out that I had to move the vader_lexicon.txt file into my NLTK sentiment folder, but that didn't solve this codec problem.

I have run the code with both Python 2 and Python 3.

Any ideas what I can do?

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-76d3725b79f2> in <module>()
     57 sentences.extend(tricky_sentences)
     58 
---> 59 sid = SentimentIntensityAnalyzer()
     60 
     61 for sentence in sentences:

//anaconda/lib/python3.5/site-packages/nltk/sentiment/vader.py in __init__(self, lexicon_file)
    200     def __init__(self, lexicon_file="vader_lexicon.txt"):
    201         self.lexicon_file = os.path.join(os.path.dirname(__file__), lexicon_file)
--> 202         self.lexicon = self.make_lex_dict()
    203 
    204     def make_lex_dict(self):

//anaconda/lib/python3.5/site-packages/nltk/sentiment/vader.py in make_lex_dict(self)
    208         lex_dict = {}
    209         with codecs.open(self.lexicon_file, encoding='utf8') as infile:
--> 210             for line in infile:
    211                 (word, measure) = line.strip().split('\t')[0:2]
    212                 lex_dict[word] = float(measure)

//anaconda/lib/python3.5/codecs.py in __next__(self)
    709 
    710         """ Return the next decoded line from the input stream."""
--> 711         return next(self.reader)
    712 
    713     def __iter__(self):

//anaconda/lib/python3.5/codecs.py in __next__(self)
    640 
    641         """ Return the next decoded line from the input stream."""
--> 642         line = self.readline()
    643         if line:
    644             return line

//anaconda/lib/python3.5/codecs.py in readline(self, size, keepends)
    553         # If size is given, we call read() only once
    554         while True:
--> 555             data = self.read(readsize, firstline=True)
    556             if data:
    557                 # If we're at a "\r" read one extra character (which might

//anaconda/lib/python3.5/codecs.py in read(self, size, chars, firstline)
    499                 break
    500             try:
--> 501                 newchars, decodedbytes = self.decode(data, self.errors)
    502             except UnicodeDecodeError as exc:
    503                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 0: invalid continuation byte
@besirkurtulmus

Hi @jd155,

Can you try making the strings Unicode? Try adding a u in front of each string literal.

e.g. example_str = u"some sentence"

Let me know if that works or not.
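
For reference, here is a small sketch of what the `u` prefix does. In Python 3 every string literal is already Unicode, so the prefix changes nothing there; in Python 2 a plain literal is a byte string and the prefixed one is a `unicode` object. The variable names are illustrative only.

```python
plain = "some sentence"
prefixed = u"some sentence"

# Python 3: both literals are str, so the prefix is a no-op.
# Python 2: plain is a byte string (str) and prefixed is unicode.
same_type = type(plain) is type(prefixed)
print(type(plain).__name__, type(prefixed).__name__, same_type)
```

Note that the prefix only affects string literals in your own script; it does not change how a file on disk is decoded.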

@jd155
Author

jd155 commented Feb 19, 2016

@besirkurtulmus Yes - this worked, thanks!

@ghost

ghost commented Feb 23, 2016

@jd155 Hi, could you tell me where you found vader_lexicon.txt? Thanks a lot!

@ghost

ghost commented Feb 23, 2016

@jd155 @besirkurtulmus Also, where did you encode the string? I get the same error, and it occurs at sid = SentimentIntensityAnalyzer(), before any sentence is processed. Thanks.
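
The traceback earlier in this thread places the failure inside make_lex_dict, while the lexicon file itself is being read. One way to check whether a local copy of the file is really UTF-8 is a sketch like the one below. The helper name `find_undecodable_lines` is an assumption for illustration and is not part of NLTK or vaderSentiment; the demo uses a throwaway temp file, not the real lexicon.

```python
import io
import os
import tempfile

def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, error_message) pairs for lines that fail to decode."""
    problems = []
    # Read raw bytes so that one bad line does not abort the whole scan.
    with io.open(path, "rb") as infile:
        for number, raw in enumerate(infile, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                problems.append((number, str(exc)))
    return problems

# Demo on a temporary file whose second line starts with a stray 0xde byte.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"good\t1.9\n\xdeword\t2.0\nbad\t-2.5\n")
os.close(fd)
result = find_undecodable_lines(demo_path)
os.remove(demo_path)
print(result)
```

To check a real installation, pass the path of the installed vader_lexicon.txt instead of the temp file. An empty list means every line decoded cleanly.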

@cjhutto
Owner

cjhutto commented Dec 13, 2016

Thanks! The new update (with a new pip install) has better compatibility support for Python 3 to address many of the encoding/decoding issues. The lexicon dictionary file has been encoded as UTF-8 by default (I hope) for better cross-OS performance. I've also implemented code to automatically detect where the dictionary file is installed (as long as you didn't change its location relative to where the actual "vaderSentiment.py" file got installed), so you no longer need to manually put a copy of the dictionary file next to your Python file (in the same folder). From what I can tell, my implementation should work across OSs.
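
A minimal sketch of the approach described in this comment: resolve the lexicon path relative to the module file and read it explicitly as UTF-8. The function names `lexicon_path_for` and `load_lexicon` are hypothetical, and the tab-separated word/score layout is taken from the traceback earlier in this thread; the actual vaderSentiment code may differ.

```python
import io
import os
import shutil
import tempfile

def lexicon_path_for(module_file, lexicon_name="vader_lexicon.txt"):
    """Build the path of a lexicon file sitting next to the given module file."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, lexicon_name)

def load_lexicon(path):
    """Parse a tab-separated lexicon (word, score, ...) into a dict."""
    lexicon = {}
    with io.open(path, encoding="utf-8") as infile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            word, measure = line.strip().split("\t")[0:2]
            lexicon[word] = float(measure)
    return lexicon

# Demo with a throwaway directory standing in for the install location.
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "vader_lexicon.txt"), "w", encoding="utf-8") as out:
    out.write(u"good\t1.9\t0.9\nbad\t-2.5\t0.7\n")
fake_module = os.path.join(demo_dir, "vaderSentiment.py")  # hypothetical module path
lexicon = load_lexicon(lexicon_path_for(fake_module))
shutil.rmtree(demo_dir)
print(lexicon)
```

Inside a real module, `__file__` would be passed as `module_file`, so the lookup follows the package wherever pip installs it.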

cjhutto closed this as completed on Dec 13, 2016