I am having trouble when calling convertToText on a UTF-8 file:
text = textract.process("1.pdf", method='pdfminer')
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "td-net.py", line 178, in close
    textData = convertToText(path,self.date) # convert pdf to text after download
  File "td-net.py", line 239, in convertToText
    text = textract.process("data/pdf/{1}/{0}.pdf".format(path,sDate), method='pdfminer')
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/__init__.py", line 58, in process
    return parser.process(filename, encoding, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 46, in process
    unicode_string = self.decode(byte_string)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 65, in decode
    return text.decode(result['encoding'])
TypeError: decode() argument 1 must be string, not None
That's odd. Looks like chardet could not determine an encoding for your file 1.pdf. Can you try running chardet 1.pdf to see what the output looks like?
I just pinned chardet to 2.1.1 to address #107. I think this will likely address your issue as well. Try pulling from the latest master on github to see if that fixes it. I'm going to close this, but feel free to reopen if it remains a problem.
I have the same issue. I went back to chardet 2.1.1 and now get a different error: ModuleNotFoundError: No module named 'universaldetector', which happens because chardet 2.1.1 is too old. What should I do?
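One possible workaround that does not depend on the chardet version is to guard the decode step against an undetected encoding. The traceback shows that result['encoding'] was None when it reached decode(). Below is a minimal sketch, assuming the detection result is a chardet-style dict. The helper name decode_with_fallback and the UTF-8 fallback are hypothetical choices for illustration, not part of textract or chardet.

```python
def decode_with_fallback(byte_string, detection, fallback="utf-8"):
    """Decode bytes using a chardet-style result dict.

    `detection` is assumed to look like {'encoding': ..., 'confidence': ...}.
    If the encoding is missing or None, fall back to `fallback` instead of
    passing None to decode(), which is what raised the TypeError above.
    """
    encoding = detection.get("encoding") or fallback
    # errors="replace" keeps going on undecodable bytes rather than raising.
    return byte_string.decode(encoding, errors="replace")


# Stand-in for a detection result where no encoding could be guessed.
undetected = {"encoding": None, "confidence": 0.0}
text = decode_with_fallback(b"caf\xc3\xa9", undetected)
```

Whether UTF-8 is the right fallback depends on the PDFs being processed. For files in another encoding, the fallback would need to be changed accordingly.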