Commit
Merge pull request #25 from deanmalmgren/ps
.ps support
Dean Malmgren committed Aug 6, 2014
2 parents 1a7f05f + 7c5e1dd commit 00d03f5
Showing 8 changed files with 34 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -51,6 +51,8 @@ Currently supporting

* ``.pdf`` via `pdftotext <http://poppler.freedesktop.org/>`__ (default) or `pdfminer <https://euske.github.io/pdfminer/>`__

* ``.ps`` via `ps2text <http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm>`__

* ``.txt`` via python builtins

Please recommend other file types by either mentioning them on the
10 changes: 9 additions & 1 deletion docs/installation.rst
@@ -22,7 +22,7 @@ package manager before installing textract from pypi.

.. code-block:: bash
-    apt-get install python-dev libxml2-dev libxslt1-dev antiword poppler-utils
+    apt-get install python-dev libxml2-dev libxslt1-dev antiword poppler-utils pstotext
     pip install textract
@@ -40,6 +40,11 @@ source code with homebrew and installing textract from pypi.
brew link libxml2 libxslt
pip install textract
.. note::

ps2text is not currently a part of homebrew so ``.ps`` extraction
must be enabled by manually installing from source.

.. note::

Depending on how you have python configured on your system with
@@ -83,6 +88,9 @@ documenation about how to install the textract dependencies, please
required by the ``.pdf`` parser (there is a pure python fallback
that works if pdftotext isn't installed).

- `pstotext <http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm>`__
is required by the ``.ps`` parser.

2. Add a requirements file to the `requirements directory
<https://github.com/deanmalmgren/textract/tree/master/requirements>`__
of the project with the lower-cased name of your operating system
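
Because ``.ps`` extraction depends on an external ``pstotext`` binary that may be missing (for example on a homebrew setup), it can help to check for the binary before parsing. The following is a minimal sketch, not part of this commit; the helper name `ps_extraction_available` is hypothetical, and `shutil.which` assumes Python 3.3 or later.

```python
import shutil


def ps_extraction_available(command="pstotext"):
    """Return True if `command` can be found on the current PATH.

    Hypothetical helper for illustration; textract itself does not
    ship this function.
    """
    return shutil.which(command) is not None


if __name__ == "__main__":
    if ps_extraction_available():
        print("pstotext found; .ps extraction should work")
    else:
        print("pstotext not found; install it before parsing .ps files")
```

A check like this only confirms that an executable with that name is on the PATH; it does not verify the version or that the binary runs correctly.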
8 changes: 8 additions & 0 deletions docs/python_package.rst
@@ -81,6 +81,14 @@ textract.parsers.pptx_parser module
:undoc-members:
:show-inheritance:

textract.parsers.ps_parser module
---------------------------------

.. automodule:: textract.parsers.ps_parser
:members:
:undoc-members:
:show-inheritance:

textract.parsers.txt_parser module
----------------------------------

3 changes: 3 additions & 0 deletions requirements/debian
@@ -9,3 +9,6 @@ antiword

# parse pdfs
poppler-utils

# parse postscript files
pstotext
Binary file added tests/ps/example.ps
Binary file not shown.
2 changes: 2 additions & 0 deletions tests/run_functional_tests.sh
@@ -76,7 +76,9 @@ validate_example ${BASEDIR}/txt/little_bo_peep.txt 1c5fb4478d84c3b3296746e491895
validate_example ${BASEDIR}/html/snow-fall.html acc2d8c49094e56474006cab3d3768eb
validate_example ${BASEDIR}/html/what-we-do.html 1fb0263bf62317365cb30246d9e094be
validate_example ${BASEDIR}/eml/example.eml cb59a5fad8ed8b849e15d53449b1de3f
validate_example ${BASEDIR}/ps/example.ps bdd41be3e24d7ded69be1e5732f7c8fc
validate_example ${BASEDIR}/json/json_is_my_best_friend.json dc0503f1b5a213d67cc08829b12df99e
validate_example ${BASEDIR}/odt/i_heart_odt.odt f64b15c1acf5cebb1a91896221696da7

# exit with the sum of the status
exit ${EXIT_CODE}
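
The body of `validate_example` is not shown in this diff. Assuming it compares the MD5 digest of the extracted text against the expected value on each line, the comparison could be sketched as follows. The function name `md5_matches` is hypothetical and is not taken from the test script.

```python
import hashlib


def md5_matches(text, expected_hexdigest):
    """Return True if the MD5 hex digest of `text` equals the expected one.

    Accepts either bytes or str; str input is encoded as UTF-8 first.
    Hypothetical helper, assuming the functional tests compare digests
    of the extracted output.
    """
    if isinstance(text, str):
        text = text.encode("utf-8")
    return hashlib.md5(text).hexdigest() == expected_hexdigest.lower()


if __name__ == "__main__":
    # The MD5 digest of empty input is a well-known constant.
    print(md5_matches(b"", "d41d8cd98f00b204e9800998ecf8427e"))
```

A digest comparison is sensitive to every byte of output, so a change in the external tool's whitespace handling would fail the check even when the extracted words are identical.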
4 changes: 2 additions & 2 deletions textract/exceptions.py
@@ -19,8 +19,8 @@ def __init__(self, ext):
     def __str__(self):
         return self.render((
             'The filename extension %(ext)s is not yet supported by\n'
-            'textract. Please suggest this filename extension here:\n'
-            ' https://github.com/deanmalmgren/textract/issues'
+            'textract. Please suggest this filename extension here:\n\n'
+            ' https://github.com/deanmalmgren/textract/issues\n'
         ))


8 changes: 8 additions & 0 deletions textract/parsers/ps_parser.py
@@ -0,0 +1,8 @@
from ..shell import run


def extract(filename, **kwargs):
    """Extract text from postscript files using pstotext command.
    """
    pipe = run('pstotext %(filename)s' % locals())
    return pipe.stdout.read()
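
The parser above interpolates the filename directly into a command string, so a path containing spaces or shell metacharacters may not reach `pstotext` intact, depending on how `run` invokes the command (its implementation is not shown in this diff). As a hedged alternative sketch that bypasses textract's `run` helper, the command can be passed as an argument list to the standard library's `subprocess` module. The `build_command` helper is hypothetical.

```python
import subprocess


def build_command(filename):
    """Build the pstotext argument list for `filename`.

    Passing a list (rather than a single string) means no shell is
    involved, so the filename is handed to pstotext as one argument.
    """
    return ["pstotext", filename]


def extract(filename, **kwargs):
    """Extract text from a PostScript file by running pstotext.

    Sketch only: assumes pstotext is installed and writes the extracted
    text to standard output. Returns the raw output as bytes.
    """
    # check_output raises CalledProcessError if pstotext exits with a
    # non-zero status, and OSError if the binary cannot be found.
    return subprocess.check_output(build_command(filename))
```

Raising on a non-zero exit status makes a missing or corrupt input file surface as an exception instead of an empty result.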