pdftotext isn't included on non-linux OSes #21

deanmalmgren · 2014-08-05T16:21:25Z

Extracting PDFs doesn't work on Windows, because Windows doesn't ship with pdftotext:

In [3]: textract.process("example.pdf")
---------------------------------------------------------------------------
ShellError                                Traceback (most recent call last)
<ipython-input-3-41fdbe49a77b> in <module>()
----> 1 textract.process("example.pdf")

c:\python27\lib\site-packages\textract\parsers\__init__.pyc in process(filename, **kwargs)
     24         raise exceptions.ExtensionNotSupported(ext)
     25
---> 26     return filetype_module.extract(filename, **kwargs)

c:\python27\lib\site-packages\textract\parsers\pdf.pyc in extract(filename, method, **kwargs)
      8     method = method or 'pdftotext'
      9     if method == 'pdftotext':
---> 10         return extract_pdftotext(filename)
     11     elif method == 'pdfminer':
     12         return extract_pdfminer(filename)

c:\python27\lib\site-packages\textract\parsers\pdf.pyc in extract_pdftotext(filename)
     17 def extract_pdftotext(filename):
     18     """Extract text from pdfs using the pdftotext command line utility."""
---> 19     pipe = run('pdftotext %(filename)s -' % locals())
     20     return pipe.stdout.read()
     21

c:\python27\lib\site-packages\textract\shell.pyc in run(command)
     16     # if pipe is busted, raise an error (unlike Fabric)
     17     if pipe.returncode != 0:
---> 18         raise exceptions.ShellError(pipe.returncode)
     19
     20     return pipe

ShellError: Command failed with exit code 1

In [4]: import pdftotext
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-4-46fa7238b159> in <module>()
----> 1 import pdftotext

ImportError: No module named pdftotext

Maybe require pdftotext or xpdf support? http://en.wikipedia.org/wiki/Pdftotext

deanmalmgren · 2014-08-04T11:45:56Z

Thanks for posting this problem. I haven't done very thorough testing on non-Ubuntu OSes so I'm sure that things like this will come up.

One thing is clear: we should throw a better error if the pdftotext command doesn't exist and give instructions for installing it.
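
As a rough illustration of the "better error" idea above, here is a minimal sketch that checks for an executable before running it. The names `MissingExecutableError` and `require_executable` are hypothetical and are not part of textract, and `shutil.which` requires Python 3.3+ (the traceback above is from Python 2.7, where a different lookup would be needed).

```python
import shutil


class MissingExecutableError(RuntimeError):
    """Hypothetical error type; textract's real ShellError may differ."""


def require_executable(name, install_hint):
    """Return the full path to `name`, or raise with install instructions.

    `install_hint` is free-form text supplied by the caller, for example a
    pointer to the project's installation documentation.
    """
    path = shutil.which(name)  # None when `name` is not found on PATH
    if path is None:
        raise MissingExecutableError(
            "The command %r was not found on PATH. %s" % (name, install_hint)
        )
    return path
```

Calling `require_executable("pdftotext", "See the installation docs.")` before shelling out would turn the opaque "exit code 1" above into a message that names the missing tool.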

But I've got two follow-up questions on a related topic:

1. This package uses a few command-line tools to extract text from documents when it's convenient (pdftotext and antiword come to mind at the moment). Since I'm an Ubuntu/Mac user, I happen to know that most of these commands are easy to install using Ubuntu's package manager or Homebrew. Is there a similar package manager on Windows that installs things like this? It would be nice to include Windows instructions in the installation instructions.
2. In the particular case of PDFs, there is a pure Python implementation here. In this situation, it would probably make sense to fall back to that instead of the pdftotext default so that you don't even see an error in the first place. Does that make sense?

Thanks again for the comment. I'll post some code to get this fixed (or at least improved) ASAP and I look forward to your thoughts on 1 and 2 above.

viktor-evdokimov · 2014-08-04T15:49:46Z

Just to confirm: OS X doesn't have pdftotext installed by default either, so you need to install it manually.


deanmalmgren · 2014-08-05T16:25:11Z

@ojosdegris @fabiantheblind @aphexcx I added some documentation for installing things on OS X that I believe is correct. Hopefully this will make it easier to install on non-Ubuntu systems.

@aphexcx I also made the error messages more helpful when an executable is not installed. In the case of PDFs, the parser now falls back to pdfminer, which should always be available because it is a Python package installed as one of textract's requirements.
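
The fallback described above could look roughly like the following. This is only a sketch under stated assumptions, not the code that was merged: `extract_pdf`, its parameters, and `extract_with_pdftotext` are hypothetical names, and the pure-Python extractor is passed in as a plain callable so the example does not depend on any particular pdfminer API.

```python
import shutil
import subprocess


def extract_with_pdftotext(filename):
    """Run the pdftotext CLI and return its stdout as bytes."""
    # A list of arguments avoids shell quoting problems with odd filenames.
    # The trailing "-" tells pdftotext to write the text to stdout.
    return subprocess.check_output(["pdftotext", filename, "-"])


def extract_pdf(filename, cli=extract_with_pdftotext, fallback=None,
                executable="pdftotext"):
    """Use the CLI tool when it is installed; otherwise use `fallback`.

    `fallback` is any callable taking a filename (hypothetical parameter).
    """
    if shutil.which(executable) is not None:
        return cli(filename)
    if fallback is None:
        raise RuntimeError(
            "%s is not installed and no fallback extractor was given"
            % executable
        )
    return fallback(filename)
```

In textract, the fallback would presumably be the pdfminer-based `extract_pdfminer` function that appears in the traceback earlier in this thread, so a user without pdftotext never sees the shell error at all.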

Hopefully this fixes the issue. If there continue to be problems, let me know!


deanmalmgren · 2014-08-05T17:55:23Z

With these changes, I'm going to close this issue for now, but I wouldn't be surprised if the OS X installation instructions could be improved.

deanmalmgren added the cross-platform label Aug 4, 2014

aphexcx changed the title ~~pdftotext isn't included on Windows~~ pdftotext isn't included on non-linux OSes Aug 4, 2014

ff6347 mentioned this pull request Aug 5, 2014

Command failed with exit code 127 #26

Closed

Dean Malmgren added 4 commits August 5, 2014 10:34

made installation instructions more modular for different operating systems

cb73a25

ShellError now detects when an executable is not installed and gives a helpful error message pointing to the documentation

8970a76

added sensible fallback to pdfminer (a python package) for pdf_parser

1ca2a53

added note about installing python header files for osx

82794ed

moved the installation instructions into standalone url to make it easier to find them

769c864

deanmalmgren pushed a commit that referenced this pull request Aug 5, 2014

Merge pull request #21 from deanmalmgren/more-robust-operating-system

489f799

pdftotext isn't included on non-linux OSes

deanmalmgren merged commit 489f799 into master Aug 5, 2014

deanmalmgren mentioned this pull request Aug 5, 2014

lxml required, not in requirements or setup.py #19

Merged

deanmalmgren deleted the more-robust-operating-system branch August 12, 2014 14:59
