Codec Issue #4

jd155 · 2016-01-11T19:44:42Z

When I run the VADER code from the NLTK tutorial (http://www.nltk.org/howto/sentiment.html), I get the error below. I worked out that I had to move the vader_lexicon.txt file into my NLTK sentiment folder, but that didn't solve this codec problem.

I have run the code with both Python 2 and Python 3.

Any ideas what I can do?

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-76d3725b79f2> in <module>()
     57 sentences.extend(tricky_sentences)
     58 
---> 59 sid = SentimentIntensityAnalyzer()
     60 
     61 for sentence in sentences:

//anaconda/lib/python3.5/site-packages/nltk/sentiment/vader.py in __init__(self, lexicon_file)
    200     def __init__(self, lexicon_file="vader_lexicon.txt"):
    201         self.lexicon_file = os.path.join(os.path.dirname(__file__), lexicon_file)
--> 202         self.lexicon = self.make_lex_dict()
    203 
    204     def make_lex_dict(self):

//anaconda/lib/python3.5/site-packages/nltk/sentiment/vader.py in make_lex_dict(self)
    208         lex_dict = {}
    209         with codecs.open(self.lexicon_file, encoding='utf8') as infile:
--> 210             for line in infile:
    211                 (word, measure) = line.strip().split('\t')[0:2]
    212                 lex_dict[word] = float(measure)

//anaconda/lib/python3.5/codecs.py in __next__(self)
    709 
    710         """ Return the next decoded line from the input stream."""
--> 711         return next(self.reader)
    712 
    713     def __iter__(self):

//anaconda/lib/python3.5/codecs.py in __next__(self)
    640 
    641         """ Return the next decoded line from the input stream."""
--> 642         line = self.readline()
    643         if line:
    644             return line

//anaconda/lib/python3.5/codecs.py in readline(self, size, keepends)
    553         # If size is given, we call read() only once
    554         while True:
--> 555             data = self.read(readsize, firstline=True)
    556             if data:
    557                 # If we're at a "\r" read one extra character (which might

//anaconda/lib/python3.5/codecs.py in read(self, size, chars, firstline)
    499                 break
    500             try:
--> 501                 newchars, decodedbytes = self.decode(data, self.errors)
    502             except UnicodeDecodeError as exc:
    503                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 0: invalid continuation byte
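
The traceback shows the failure inside make_lex_dict, while vader_lexicon.txt is being read, rather than while the input sentences are processed. One way to check whether the file on disk is valid UTF-8 is sketched below. This assumes Python 3; find_utf8_error is a hypothetical helper name (not part of NLTK), and the demo writes a throwaway stand-in file rather than touching the real lexicon.

```python
import os
import tempfile


def find_utf8_error(path):
    """Return None if the file decodes as UTF-8.

    Otherwise return (byte_offset, byte_value, reason) for the first
    byte sequence that the UTF-8 decoder rejects.
    """
    with open(path, "rb") as infile:
        raw = infile.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        return exc.start, raw[exc.start], exc.reason
    return None


# Demo only: a stand-in file whose first byte is 0xde, followed by a
# byte that is not a valid UTF-8 continuation byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xdeA\t1.0\n")
    bad_path = tmp.name

result = find_utf8_error(bad_path)
os.remove(bad_path)
print(result)
```

If the helper returns a tuple for the real lexicon path, the file is not valid UTF-8 at that offset, and fetching a fresh copy of the lexicon or re-saving it as UTF-8 would be the things to try.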


besirkurtulmus · 2016-01-15T08:00:01Z

Hi @jd155,

Can you try making the strings Unicode? Try adding a u prefix in front of the string literals.

e.g. example_str = u"some sentence"

Let me know if that works or not.

jd155 · 2016-02-19T09:30:16Z

@besirkurtulmus Yes - this worked, thanks!

ghost · 2016-02-23T17:00:17Z

@jd155 Hi, could you tell me where you found vader_lexicon.txt? Thanks a lot!

ghost · 2016-02-23T17:06:59Z

@jd155 @besirkurtulmus Also, where did you encode the string? I get the same error, and it occurs at sid = SentimentIntensityAnalyzer(). Thanks.

cjhutto · 2016-12-13T17:06:36Z

Thanks! The new update (with a fresh pip install) has better compatibility support for Python 3 and addresses many of the encoding/decoding issues. The lexicon dictionary file is now encoded as UTF-8 by default (I hope) for better cross-OS behavior. I've also implemented code to automatically detect where the dictionary file is installed (as long as you didn't change its location relative to where the actual "vaderSentiment.py" file got installed), so you no longer need to manually put a copy of the dictionary file next to your Python file (in the same folder). From what I can tell, my implementation should work across OSs.
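
For readers who want to see the general idea, here is a minimal sketch of locating a lexicon file relative to a module and reading it with an explicit UTF-8 encoding. It is not the actual vaderSentiment code: lexicon_path_next_to and load_lexicon are hypothetical names, and the tab-separated "word, score" layout is assumed from the make_lex_dict lines in the traceback above.

```python
import io
import os
import shutil
import tempfile


def lexicon_path_next_to(module_file, lexicon_file="vader_lexicon.txt"):
    """Build the lexicon path in the same folder as the given module file."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, lexicon_file)


def load_lexicon(lexicon_path):
    """Read a tab-separated lexicon (word, score, ...) as UTF-8 text."""
    lex_dict = {}
    # io.open takes an explicit encoding on both Python 2 and Python 3.
    with io.open(lexicon_path, encoding="utf-8") as infile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            word, measure = line.strip().split("\t")[0:2]
            lex_dict[word] = float(measure)
    return lex_dict


# Demo only: a temporary folder stands in for the install location,
# and the two entries below are made-up sample data.
demo_dir = tempfile.mkdtemp()
fake_module = os.path.join(demo_dir, "my_module.py")
path = lexicon_path_next_to(fake_module)
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u"good\t1.9\t0.9\nbad\t-2.5\t0.7\n")

lexicon = load_lexicon(path)
shutil.rmtree(demo_dir)
print(lexicon)
```

Inside a real module, passing __file__ as module_file would resolve the path relative to wherever that module got installed, which is the same pattern the os.path.join(os.path.dirname(__file__), lexicon_file) line in the traceback uses.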

cjhutto closed this as completed Dec 13, 2016
