UnicodeDecodeError #3

shayanb · 2016-11-17T19:27:32Z

I can't seem to run this on a Mac. Are there any requirements not mentioned in setup.py?

🍺  python textrank.py summarize ./articles/3.txt
Traceback (most recent call last):
  File "textrank.py", line 219, in <module>
    cli()
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "textrank.py", line 214, in summarize
    summary = extractSentences(text)
  File "textrank.py", line 163, in extractSentences
    sentenceTokens = sent_detector.tokenize(text.strip())
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)


Attenborough-zz · 2016-12-02T14:08:45Z

This is a well-documented Python 2 issue with text encodings. You can work around it by reloading sys and changing the default encoding. Note also that if you create new text files and save them as UTF-8, you will be able to use this implementation on them.
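A minimal sketch of the point about saving files as UTF-8, assuming the file is written and read with `io.open`, which accepts an `encoding` argument in both Python 2.7 and Python 3. The file name `sample.txt` is hypothetical, and how textrank.py actually opens its input is not shown in this thread.

```python
import io
import os
import tempfile

# Sample text containing non-ASCII characters (curly quotes).
text = u"The text \u201cquoted\u201d is short."

# Hypothetical file name, created in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write the file as UTF-8 explicitly instead of relying on a default encoding.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Read it back with the same explicit encoding; the result is a
# unicode string rather than raw bytes.
with io.open(path, "r", encoding="utf-8") as handle:
    loaded = handle.read()
```

Passing the encoding explicitly on both sides means the decoding step does not depend on the interpreter's default.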

Reyeselda95 · 2017-05-12T08:37:41Z

If you install it with pip3 or run it with Python 3, it will work.
If you don't want to (or can't) use Python 3, you only need to put the following lines at the top of the textrank.py file:

# Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes at startup.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This happens because Python 2.7 has its default encoding set to ASCII, so it crashes when it tries to decode text as ASCII while the text is actually in UTF-8.
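The failure described above can be reproduced in isolation. This is a sketch, not the code path inside NLTK: the sample string and the helper `try_ascii` are made up for illustration. It decodes UTF-8 bytes as ASCII to trigger the same kind of error, then decodes them as UTF-8, which succeeds.

```python
from __future__ import print_function

# UTF-8 bytes, as they would come from reading a file in binary mode.
# The curly quote U+201C encodes to three bytes starting with 0xe2.
raw = u"The text \u201cquoted\u201d".encode("utf-8")


def try_ascii(data):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        return err
    return None


err = try_ascii(raw)           # ASCII cannot represent byte 0xe2
decoded = raw.decode("utf-8")  # decoding with the right codec works

print(err)
print(len(raw), len(decoded))  # byte count vs. character count
```

Decoding explicitly at the point where the bytes enter the program is an alternative to changing the interpreter-wide default encoding.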

I put it in setup.py so that the package installs with the change already made and the error doesn't appear.
