pdftotext isn't included on non-linux OSes #21

Merged
merged 5 commits into from Aug 5, 2014


deanmalmgren
Owner

Extracting PDFs doesn't work on Windows, because Windows doesn't come with pdftotext:

In [3]: textract.process("example.pdf")
---------------------------------------------------------------------------
ShellError                                Traceback (most recent call last)
<ipython-input-3-41fdbe49a77b> in <module>()
----> 1 textract.process("example.pdf")

c:\python27\lib\site-packages\textract\parsers\__init__.pyc in process(filename, **kwargs)
     24         raise exceptions.ExtensionNotSupported(ext)
     25
---> 26     return filetype_module.extract(filename, **kwargs)

c:\python27\lib\site-packages\textract\parsers\pdf.pyc in extract(filename, method, **kwargs)
      8     method = method or 'pdftotext'
      9     if method == 'pdftotext':
---> 10         return extract_pdftotext(filename)
     11     elif method == 'pdfminer':
     12         return extract_pdfminer(filename)

c:\python27\lib\site-packages\textract\parsers\pdf.pyc in extract_pdftotext(filename)
     17 def extract_pdftotext(filename):
     18     """Extract text from pdfs using the pdftotext command line utility."""
---> 19     pipe = run('pdftotext %(filename)s -' % locals())
     20     return pipe.stdout.read()
     21

c:\python27\lib\site-packages\textract\shell.pyc in run(command)
     16     # if pipe is busted, raise an error (unlike Fabric)
     17     if pipe.returncode != 0:
---> 18         raise exceptions.ShellError(pipe.returncode)
     19
     20     return pipe

ShellError: Command failed with exit code 1
In [4]: import pdftotext
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-4-46fa7238b159> in <module>()
----> 1 import pdftotext

ImportError: No module named pdftotext

Maybe require pdftotext or xpdf support? http://en.wikipedia.org/wiki/Pdftotext

@deanmalmgren
Owner

Thanks for posting this problem. I haven't done very thorough testing on non-Ubuntu OSes so I'm sure that things like this will come up.

One thing is clear: we should throw a better error if the pdftotext command doesn't exist and give instructions for installing it.
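As a minimal sketch of that idea (not textract's actual code), the check could happen before the command is ever run. The names `CommandNotFound`, `INSTALL_HINTS`, and `require_executable` are hypothetical, and the install hints are illustrative rather than exhaustive:

```python
import shutil


class CommandNotFound(Exception):
    """Hypothetical exception; not part of textract's actual API."""

    def __init__(self, command, hint):
        self.command = command
        self.hint = hint
        super().__init__(
            "The command %r was not found on PATH. %s" % (command, hint)
        )


# Illustrative install hints keyed by executable name (an assumption,
# not a complete list of supported platforms).
INSTALL_HINTS = {
    "pdftotext": (
        "On Ubuntu try `apt-get install poppler-utils`; "
        "on OS X with Homebrew try `brew install poppler`."
    ),
}


def require_executable(command):
    """Return the full path to `command`, or raise CommandNotFound."""
    path = shutil.which(command)  # None when the command is not on PATH
    if path is None:
        hint = INSTALL_HINTS.get(
            command, "Install it and make sure it is on PATH."
        )
        raise CommandNotFound(command, hint)
    return path
```

Calling `require_executable("pdftotext")` before shelling out would turn the opaque `ShellError: Command failed with exit code 1` into a message that names the missing command and suggests how to install it.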

I also have two related follow-up questions:

  1. This package uses a few command line shell commands to extract text from documents when it's convenient (pdftotext and antiword come to mind at the moment). Since I'm an Ubuntu/Mac user, I happen to know that most of these commands are easy to install with Ubuntu's package manager or Homebrew. Is there a similar package manager for Windows that installs things like this? It would be nice to include Windows instructions in the installation docs.
  2. In the particular case of PDFs, there is a pure Python implementation, pdfminer, which textract already supports as an alternative method. In this situation, it would probably make sense to fall back to that instead of the pdftotext default so that you don't even see an error in the first place. Does that make sense?

Thanks again for the comment. I'll post some code to get this fixed (or at least improved) ASAP and I look forward to your thoughts on 1 and 2 above.

@viktor-evdokimov

Just to confirm: OS X doesn't ship with pdftotext by default either, so you need to install it manually.

@aphexcx aphexcx changed the title from "pdftotext isn't included on Windows" to "pdftotext isn't included on non-linux OSes" Aug 4, 2014
@deanmalmgren
Owner

@ojosdegris @fabiantheblind @aphexcx I added some documentation for installing things on OSX that I believe is correct. Hopefully this will help make it easier to install on non-Ubuntu distributions.

@aphexcx I also made the error messages more helpful when an executable is not installed. For PDFs specifically, extraction now falls back to pdfminer, which should always be available because it is a Python package installed as one of textract's requirements.

Hopefully this fixes the issue. If there continue to be problems, let me know!
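For readers following along, the fallback described above might look roughly like the sketch below. It is not the merged code: `choose_pdf_method` is a hypothetical helper, the injectable `which` parameter exists only to make the choice easy to test, and `extract_pdfminer` assumes the modern `pdfminer.six` package and its `pdfminer.high_level.extract_text` function rather than whatever textract called at the time.

```python
import shutil
import subprocess


def extract_pdftotext(filename):
    """Extract text by shelling out to the pdftotext command line utility."""
    # Passing a list avoids shell quoting problems with spaces in filenames.
    output = subprocess.check_output(["pdftotext", filename, "-"])
    return output.decode("utf-8")  # assumes pdftotext's default UTF-8 output


def extract_pdfminer(filename):
    """Extract text with pdfminer (pure Python); assumes pdfminer.six."""
    # Imported lazily so the module loads even if pdfminer is missing.
    from pdfminer.high_level import extract_text
    return extract_text(filename)


def choose_pdf_method(method=None, which=shutil.which):
    """Pick an extraction method, preferring pdftotext when it is installed."""
    if method:
        return method  # an explicit choice always wins
    if which("pdftotext"):
        return "pdftotext"
    return "pdfminer"  # pure Python fallback, no external binary needed


def extract(filename, method=None):
    """Extract text from a PDF, falling back to pdfminer if needed."""
    method = choose_pdf_method(method)
    if method == "pdftotext":
        return extract_pdftotext(filename)
    if method == "pdfminer":
        return extract_pdfminer(filename)
    raise ValueError("unknown PDF extraction method: %r" % method)
```

With this shape, a user on Windows or OS X without pdftotext gets text back from `extract("example.pdf")` instead of a `ShellError`, while anyone who passes `method="pdftotext"` explicitly still sees the failure.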

deanmalmgren pushed a commit that referenced this pull request Aug 5, 2014
pdftotext isn't included on non-linux OSes
@deanmalmgren deanmalmgren merged commit 489f799 into master Aug 5, 2014
@deanmalmgren
Owner

With these changes, I'm going to close this issue for now but wouldn't be surprised if the OSX installation instructions could be improved.

@deanmalmgren deanmalmgren deleted the more-robust-operating-system branch August 12, 2014 14:59