Error: decode() argument 1 must be string, not None when running textract.process #135

frostchick · 2017-03-17T09:31:28Z

I am running into trouble when calling convertToText on a UTF-8 file. The failing call is:
text = textract.process("1.pdf", method='pdfminer')

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "td-net.py", line 178, in close
    textData = convertToText(path,self.date) # convert pdf to text after download
  File "td-net.py", line 239, in convertToText
    text = textract.process("data/pdf/{1}/{0}.pdf".format(path,sDate), method='pdfminer')
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/__init__.py", line 58, in process
    return parser.process(filename, encoding, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 46, in process
    unicode_string = self.decode(byte_string)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 65, in decode
    return text.decode(result['encoding'])
TypeError: decode() argument 1 must be string, not None

deanmalmgren · 2017-03-20T18:33:22Z

That's odd. Looks like chardet could not determine an encoding for your file 1.pdf. Can you try running chardet 1.pdf to see what the output looks like?

I wonder if this is related to #133 somehow...
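
For readers following along, here is a minimal sketch of the failure mode described above, assuming the detection result is a chardet-style dict whose 'encoding' value is None when no encoding could be determined. The helper name decode_with_fallback and the choice of UTF-8 as a fallback are illustrative assumptions, not textract's actual code, and the sketch uses Python 3, where the TypeError message is worded slightly differently than in the Python 2.7 traceback.

```python
def decode_with_fallback(byte_string, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style detection dict.

    Hypothetical helper: if detection['encoding'] is None (or missing),
    fall back to the given encoding instead of passing None to decode().
    """
    encoding = detection.get("encoding") or fallback
    # errors="replace" keeps decoding from raising on stray bytes.
    return byte_string.decode(encoding, errors="replace")


# A chardet-style result for input whose encoding could not be determined.
undetected = {"encoding": None, "confidence": 0.0}
raw = "café".encode("utf-8")

try:
    # Mirrors the failing line: text.decode(result['encoding'])
    raw.decode(undetected["encoding"])
except TypeError as exc:
    print("reproduced:", type(exc).__name__)

# With the fallback, the same bytes decode without an exception.
print(decode_with_fallback(raw, undetected))
```

Whether a silent fallback is appropriate depends on the data: it avoids the crash but may hide a misdetected encoding.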

JermellB · 2017-03-21T16:15:10Z

This is exactly the problem I was having.

I uninstalled textract, reinstalled it from GitHub, and it worked. Really befuddled now... Maybe a failed pip --upgrade?

deanmalmgren · 2017-03-24T12:20:08Z

I just pinned chardet to 2.1.1 to address #107. I think this will likely address your issue as well. Try pulling the latest master from GitHub to see if that fixes it. I'm going to close this, but feel free to reopen if it remains a problem.
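
Since the fix hinges on which chardet version is actually installed, a quick check can rule out a stale or half-upgraded install. The sketch below is an assumption-laden illustration: installed_version and matches_pin are hypothetical helper names, and importlib.metadata requires Python 3.8 or newer, so it does not apply to the Python 2.7 environment in the traceback.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only when the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


# Report what is installed and whether it matches the pin from this thread.
print("chardet:", installed_version("chardet"))
print("matches pin 2.1.1:", matches_pin("chardet", "2.1.1"))
```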

marija-stanojevic · 2018-11-15T19:35:36Z

Hello,

I have this issue too. I went back to chardet 2.1.1 and now I get a different error: ModuleNotFoundError: No module named 'universaldetector', which happens because chardet 2.1.1 is too old. What should I do?

deanmalmgren mentioned this issue Mar 21, 2017

Fix for weird utf-8 chars. #137

Closed

deanmalmgren added bug enhancement and removed enhancement labels Mar 24, 2017

deanmalmgren mentioned this issue Mar 24, 2017

CP949 detected for what used to be ASCII chardet/chardet#98

Closed

deanmalmgren closed this as completed Mar 24, 2017
