UnicodeDecodeError #3

Open
shayanb opened this issue Nov 17, 2016 · 2 comments

shayanb commented Nov 17, 2016

I can't seem to run this on a Mac. Are there any requirements not mentioned in setup.py?

🍺  python textrank.py summarize ./articles/3.txt
Traceback (most recent call last):
  File "textrank.py", line 219, in <module>
    cli()
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "textrank.py", line 214, in summarize
    summary = extractSentences(text)
  File "textrank.py", line 163, in extractSentences
    sentenceTokens = sent_detector.tokenize(text.strip())
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "TextRank-master/virtualenv/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

@Attenborough-zz

This is a well-documented Python 2 issue with text encodings. You can solve it by reloading sys and changing the default encoding. Note also that this implementation will work on any new text files you create, as long as you save them with UTF-8 encoding.

Reyeselda95 commented May 12, 2017

If you install it with pip3 or run it with Python 3, it will work.
If you don't want to (or can't) use Python 3, you only need to put the following lines at the top of the textrank.py file:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This happens because Python 2.7 uses ASCII as its default encoding, so it crashes when it tries to decode UTF-8 text as ASCII.

I put it in setup.py so that it installs with the change already made and the error doesn't pop up.
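
For anyone who would rather not change the interpreter-wide default, here is a minimal sketch of the same idea applied at read time: decode the file as UTF-8 explicitly, so the tokenizer receives unicode text instead of raw bytes. The helper name `read_text` and the sample sentence are hypothetical, not taken from textrank.py, and it is only an assumption that the 0xe2 byte in the traceback comes from a curly quote (0xe2 is the first byte of many UTF-8 punctuation characters).

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read a file and return decoded (unicode) text."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# A right single quotation mark (U+2019) becomes three bytes in UTF-8,
# the first of which is 0xe2, the byte named in the traceback.
raw = u"It\u2019s fine.".encode("utf-8")

# Decoding those bytes as ASCII reproduces the UnicodeDecodeError.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Write the bytes to a temporary file, then read them back as UTF-8.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "wb") as handle:
    handle.write(raw)
text = read_text(path)
os.remove(path)
```

Passing the result of something like `read_text` to the sentence tokenizer, rather than the raw bytes, should avoid the implicit ASCII decode; I haven't tested that against this repo, so treat it as a starting point.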
