Commit
Merge d56ae84 into 032eb9e
Dean Malmgren committed Jun 15, 2015
2 parents 032eb9e + d56ae84 commit 35af092
Showing 11 changed files with 46 additions and 5 deletions.
3 changes: 3 additions & 0 deletions docs/changelog.rst
@@ -11,6 +11,8 @@ latest changes in development for next release

.. THANKS FOR CONTRIBUTING; MENTION WHAT YOU DID IN THIS SECTION HERE!
* support for ``.rtf`` files (`#84`_)


1.2.0
-----
@@ -224,3 +226,4 @@ latest changes in development for next release
.. _#78: https://github.com/deanmalmgren/textract/issues/78
.. _#79: https://github.com/deanmalmgren/textract/issues/79
.. _#82: https://github.com/deanmalmgren/textract/issues/82
.. _#84: https://github.com/deanmalmgren/textract/issues/84
3 changes: 3 additions & 0 deletions docs/index.rst
@@ -74,6 +74,8 @@ file types by either mentioning them on the `issue tracker

* ``.ps`` via `ps2text`_

* ``.rtf`` via `unrtf`_

* ``.tiff`` via `tesseract-ocr`_

* ``.txt`` via python builtins
@@ -93,6 +95,7 @@ file types by either mentioning them on the `issue tracker
.. _ps2text: http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
.. _python-docx: https://python-docx.readthedocs.org/en/latest/
.. _python-pptx: https://python-pptx.readthedocs.org/en/latest/
.. _unrtf: http://www.gnu.org/software/unrtf/
.. _SpeechRecognition: https://pypi.python.org/pypi/SpeechRecognition/
.. _sox: http://sox.sourceforge.net/
.. _tesseract-ocr: https://code.google.com/p/tesseract-ocr/
6 changes: 3 additions & 3 deletions docs/installation.rst
@@ -22,7 +22,7 @@ package manager before installing textract from pypi.

.. code-block:: bash

-    apt-get install python-dev libxml2-dev libxslt1-dev antiword poppler-utils pstotext tesseract-ocr \
+    apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
     flac ffmpeg lame libmad0 libsox-fmt-mp3 sox
     pip install textract
@@ -45,7 +45,7 @@ pypi.
.. code-block:: bash

     brew cask install xquartz
-    brew install poppler antiword tesseract
+    brew install poppler antiword unrtf tesseract
     pip install textract
.. brew install libxml2 libxslt antiword poppler tesseract
@@ -103,7 +103,7 @@ documenation about how to install the textract dependencies, please
- `pstotext <http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm>`_
is required by the ``.ps`` parser.

- `tesseract-ocr <https://code.google.com/p/tesseract-ocr/>`_
- `tesseract-ocr <https://code.google.com/p/tesseract-ocr/>`_
is required by the ``.jpg``, ``.png`` and ``.gif`` parser.

- `sox <http://sox.sourceforge.net/>`_
3 changes: 3 additions & 0 deletions requirements/debian
@@ -7,6 +7,9 @@ libxslt1-dev
# parse word documents
antiword

# parse rtf documents
unrtf

# parse image files
tesseract-ocr

Binary file modified tests/mp3/standardized_text.mp3
Binary file not shown.
1 change: 1 addition & 0 deletions tests/rtf/raw_text.rtf

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions tests/rtf/raw_text.txt
@@ -0,0 +1,11 @@
I love word documents. They are lovely. They make me so happy I could smile. And that is why I wrote this package.

Sample text is hard. That is where "http://hipsum.co" comes in handy.

Semiotics church-key VHS, Truffaut cliche actually vegan. Cray Austin pop-up disrupt letterpress, kitsch fixie Cosby sweater cliche craft beer PBR&B. Gentrify cornhole Tonx McSweeney's, Shoreditch keffiyeh ethnic Marfa 90's kogi American Apparel. Shabby chic distillery church-key locavore beard, food truck chillwave sartorial deep v flannel authentic Tumblr narwhal kogi organic. Cred vegan jean shorts Banksy forage Neutra dreamcatcher, hashtag Bushwick polaroid pork belly flannel keytar Portland post-ironic. Cred hoodie vegan, food truck leggings Austin pour-over banjo trust fund before they sold out cray Intelligentsia plaid typewriter. Williamsburg XOXO plaid Carles Austin tofu.
Carles Tonx keffiyeh, leggings 90's lo-fi kogi viral semiotics Brooklyn biodiesel tousled bespoke kitsch. Vinyl Tonx art party Thundercats retro, viral asymmetrical artisan bicycle rights bitters master cleanse Kickstarter YOLO. Seitan street art semiotics twee skateboard, PBR&B VHS hashtag meh. Thundercats semiotics shabby chic forage single-origin coffee retro, 3 wolf moon iPhone mumblecore 90's trust fund Intelligentsia. Beard gluten-free seitan, VHS sartorial pork belly gastropub meh whatever authentic synth. Beard single-origin coffee irony fixie, before they sold out Pitchfork kitsch readymade. Helvetica butcher wayfarers, lomo artisan hashtag Brooklyn four loko fanny pack 90's mustache 8-bit.
Meh jean shorts selfies, crucifix selvage Helvetica Carles PBR Vice Banksy roof party master cleanse ugh PBR&B. Lo-fi freegan salvia photo booth, Wes Anderson skateboard Odd Future. Etsy art party Bushwick keffiyeh. Pork belly 3 wolf moon butcher mustache. YOLO raw denim lo-fi, hoodie gentrify Schlitz 8-bit sriracha Shoreditch retro brunch. Williamsburg farm-to-table beard, mlkshk Banksy fap kogi Etsy art party squid semiotics. XOXO church-key Pitchfork mlkshk irony tote bag.
Farm-to-table brunch tattooed hoodie keytar, literally selvage authentic trust fund deep v Thundercats Kickstarter narwhal locavore. Swag disrupt chambray, leggings shabby chic gastropub YOLO plaid hoodie Williamsburg Godard mixtape. Retro Godard keytar biodiesel, freegan paleo Etsy you probably haven't heard of them Pitchfork Schlitz readymade small batch cred. Pug trust fund paleo, 90's fixie typewriter next level banjo. Banksy occupy authentic master cleanse Bushwick fingerstache selfies, direct trade craft beer cliche +1 cray. Locavore four loko biodiesel Neutra chia mlkshk. Fanny pack YOLO Portland, mlkshk PBR&B single-origin coffee drinking vinegar 8-bit flannel gentrify stumptown pop-up.
Oh. You need a little dummy text for your mockup? How quaint.
I bet you are still using Bootstrap too

1 change: 1 addition & 0 deletions tests/rtf/standardized_text.rtf

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions tests/test_rtf.py
@@ -0,0 +1,7 @@
import unittest

import base


class DocTestCase(base.ShellParserTestCase, unittest.TestCase):
    extension = 'rtf'
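
The test class above only sets ``extension``, which suggests the shared base class derives everything else from it. The base class is not part of this diff, so the following is only a rough sketch of how such a convention could work. ``FixturePaths`` and its two methods are hypothetical names, not textract's actual ``base.ShellParserTestCase``. The only things taken from this commit are the fixture locations ``tests/rtf/raw_text.rtf`` and ``tests/rtf/raw_text.txt``.

```python
import os


class FixturePaths(object):
    """Hypothetical helper (NOT textract's real test base class)."""

    extension = ''  # subclasses override this, e.g. 'rtf'

    def raw_text_source(self):
        # Input document, e.g. tests/rtf/raw_text.rtf
        return os.path.join('tests', self.extension,
                            'raw_text.' + self.extension)

    def raw_text_expected(self):
        # Expected plain-text output, e.g. tests/rtf/raw_text.txt
        return os.path.join('tests', self.extension, 'raw_text.txt')


class RtfFixturePaths(FixturePaths):
    # Mirrors the one-line override used in tests/test_rtf.py.
    extension = 'rtf'


paths = RtfFixturePaths()
print(paths.raw_text_source())
print(paths.raw_text_expected())
```

Under this assumed convention, supporting a new file type in the tests needs only a fixture directory and a one-line subclass.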
4 changes: 2 additions & 2 deletions textract/__init__.py
@@ -1,3 +1,3 @@
-VERSION = "1.2.0"
-
 from .parsers import process
+
+VERSION = "1.2.0"
12 changes: 12 additions & 0 deletions textract/parsers/rtf_parser.py
@@ -0,0 +1,12 @@
from .utils import ShellParser


class Parser(ShellParser):
    """Extract text from rtf files using unrtf.
    """

    def extract(self, filename, **kwargs):
        # http://superuser.com/a/243089/126633
        stdout, stderr = self.run('unrtf --text "%(filename)s"' % locals())
        text_conversion = stdout.split('-'*17+'\n', 1)[-1]
        return text_conversion
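
The ``split`` call in ``extract`` is easier to follow in isolation. The sketch below reproduces only that expression and runs it on made-up input. ``strip_unrtf_header`` and ``SEPARATOR`` are illustrative names that do not appear in the commit, and the sample header lines are invented for the example rather than captured from a real ``unrtf`` run.

```python
# Same separator as in the parser: 17 dashes followed by a newline.
SEPARATOR = '-' * 17 + '\n'


def strip_unrtf_header(stdout):
    """Return the text after the first separator line.

    If no separator is present, the input is returned unchanged,
    because split() then yields a single-element list.
    """
    return stdout.split(SEPARATOR, 1)[-1]


# Invented sample input (assumption: a header block ends with the separator).
sample = (
    '### illustrative header line\n'
    '### another illustrative header line\n'
    + SEPARATOR +
    'I love word documents.\n'
)

body = strip_unrtf_header(sample)
no_header = strip_unrtf_header('no header here\n')
# maxsplit=1 means a later separator inside the body is preserved.
twice = strip_unrtf_header('h\n' + SEPARATOR + 'a\n' + SEPARATOR + 'b\n')

print(repr(body))
print(repr(no_header))
print(repr(twice))
```

Limiting the split to the first occurrence keeps any dashed rule later in the document intact, and taking the last element makes the no-separator case fall through without an error.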
