UnicodeDecodeError: 'cp949' codec can't decode bytes #107

askemottelson opened this Issue Mar 23, 2016 · 3 comments



askemottelson commented Mar 23, 2016

I'm getting this error on some specific rtf files.

stack trace:

File "/Library/Python/2.7/site-packages/textract/parsers/", line 57, in process
return parser.process(filename, encoding, **kwargs)
File "/Library/Python/2.7/site-packages/textract/parsers/", line 45, in process
unicode_string = self.decode(byte_string)
File "/Library/Python/2.7/site-packages/textract/parsers/", line 64, in decode
return text.decode(result['encoding'])

e.g. attached rtf-file (zipped)



deanmalmgren commented Mar 24, 2017

Thank you for providing the example! I am pretty sure this is a chardet version problem. I was able to successfully extract the text from your file when I pip install chardet==2.1.1. I am going to pin chardet to that version until the issue is resolved; hopefully that fixes the issue for you!



deanmalmgren commented Mar 28, 2017

Bummer. Rolling chardet back to 2.1.1 will work with py2 but it does not work with py3. I'm going to leave this open until chardet/chardet#98 is resolved. This issue will serve as documentation of the workaround for py2 users in the meantime.
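For anyone who cannot pin chardet, here is a minimal sketch of guarding the decode step from the traceback above against a misdetected encoding. This is not textract's actual code: the function name `decode_with_fallback` and the fallback order are assumptions made for illustration, and the detected encoding is passed in as a plain string so the sketch does not depend on any particular chardet version.

```python
def decode_with_fallback(byte_string, detected_encoding,
                         fallbacks=("utf-8", "latin-1")):
    """Decode bytes, trying the detected encoding first, then fallbacks.

    Hypothetical helper: `detected_encoding` stands in for the value a
    detector such as chardet would report (it may be None or wrong).
    """
    candidates = [detected_encoding] if detected_encoding else []
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return byte_string.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the encoding name is unknown to Python.
            continue
    # Only reached if every candidate failed (for example, when the
    # caller passes custom fallbacks that exclude latin-1).
    return byte_string.decode("utf-8", errors="replace")


# A byte string that is invalid as cp949 and as UTF-8 but valid as latin-1.
text = decode_with_fallback(b"caf\xe9", "cp949")
```

Because latin-1 maps every byte to a character, the default fallback order never raises, but it can silently produce the wrong characters, so it is a last resort rather than a fix for the detection problem itself.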


mohammedyunus009 commented Oct 18, 2018

I was having the same error on Ubuntu. I installed unoconv:
sudo apt install unoconv
and used it to convert the doc files to docx (using exception handling).
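A rough sketch of what that conversion step might look like, assuming unoconv is on the PATH and is invoked as `unoconv -f docx <file>`. The helper names `build_unoconv_command` and `convert_to_docx` and the `executable` parameter are hypothetical, not part of textract or unoconv.

```python
import subprocess


def build_unoconv_command(path, executable="unoconv"):
    """Return the argument list for converting `path` to docx."""
    return [executable, "-f", "docx", path]


def convert_to_docx(path, executable="unoconv"):
    """Run the conversion; return True on success, False on failure."""
    command = build_unoconv_command(path, executable)
    try:
        subprocess.run(command, check=True,
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError:
        # The converter binary is not installed or not on the PATH.
        return False
    except subprocess.CalledProcessError:
        # The converter ran but exited with a non-zero status.
        return False
    return True
```

A caller could then try `convert_to_docx("report.doc")` and only hand the resulting docx file to textract when it returns True.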
