updated documentation to reflect the new -O functionality
Dean Malmgren committed Jan 31, 2015
1 parent 55f63fc commit ecadac8
Showing 2 changed files with 35 additions and 4 deletions.
30 changes: 30 additions & 0 deletions docs/python_package.rst
@@ -24,6 +24,34 @@ inferred using `chardet <https://github.com/chardet/chardet>`_)::
    import textract
    text = textract.process('path/to/file.extension', encoding='ascii')

.. _additional-options:

Additional options
------------------

Some parsers also accept additional options, which can be passed as keyword
arguments to the ``textract.process`` function. The table below lists the
options available for each type of parser:

====== ========= ============================================================
parser option    description
====== ========= ============================================================
gif    language  Specify `the language`_ for OCR-ing text with tesseract
jpg    language  Specify `the language`_ for OCR-ing text with tesseract
png    language  Specify `the language`_ for OCR-ing text with tesseract
pdf    language  For use when ``method='tesseract'``, specify `the language`_
tiff   language  Specify `the language`_ for OCR-ing text with tesseract
====== ========= ============================================================

As an example of using these additional options, you can extract text from a
Norwegian PDF using Tesseract OCR like this::

    text = textract.process(
        'path/to/norwegian.pdf',
        method='tesseract',
        language='nor',
    )


A look under the hood
---------------------
@@ -71,3 +99,5 @@ work.
:undoc-members:
:show-inheritance:


.. _the language: https://code.google.com/p/tesseract-ocr/downloads/list
9 changes: 5 additions & 4 deletions textract/cli.py
@@ -49,17 +49,18 @@ def get_parser():
     )
     parser.add_argument(
         '-m', '--method', default='',
-        help='specify a method of extraction for formats that support it',
+        help='Specify a method of extraction for formats that support it',
     )
     parser.add_argument(
         '-o', '--output', type=argparse.FileType('w'), default='-',
-        help='output raw text in this file',
+        help='Output raw text in this file',
     )
     parser.add_argument(
         '-O', '--option', type=str, action=AddToNamespaceAction,
         help=(
-            'add arbitrary options to various parsers of the form '
-            'KEYWORD=VALUE'
+            'Add arbitrary options to various parsers of the form '
+            'KEYWORD=VALUE. A full list of available KEYWORD options is '
+            'available at http://bit.ly/textract-options'
         ),
     )
     parser.add_argument(
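
The ``-O KEYWORD=VALUE`` flag relies on ``AddToNamespaceAction``, whose body is not part of this diff. The following is only a rough sketch of how an ``argparse`` action of that kind could work. The class name ``KeyValueToNamespace`` and the helper ``build_parser`` are hypothetical stand-ins, not textract's actual code, and the real action may behave differently.

```python
import argparse


class KeyValueToNamespace(argparse.Action):
    """Hypothetical stand-in for AddToNamespaceAction (not textract's code)."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Split on the first '=' only, so the value may itself contain '='.
        key, sep, value = values.partition('=')
        if not sep or not key:
            parser.error('expected KEYWORD=VALUE, got %r' % values)
        # Store the pair directly on the namespace, e.g. args.language.
        setattr(namespace, key, value)


def build_parser():
    # Hypothetical, trimmed-down parser: only the arguments needed here.
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    parser.add_argument('-m', '--method', default='')
    parser.add_argument(
        '-O', '--option', type=str, action=KeyValueToNamespace,
        help='KEYWORD=VALUE option forwarded to the parser',
    )
    return parser


args = build_parser().parse_args(
    ['-m', 'tesseract', '-O', 'language=nor', 'path/to/norwegian.pdf']
)
# Each parsed KEYWORD becomes an attribute that can be passed on as a kwarg.
options = vars(args)
```

Under these assumptions, a command line such as ``-m tesseract -O language=nor`` would leave ``language`` on the parsed namespace, mirroring the ``language='nor'`` keyword argument in the Python example from the documentation hunk.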