This package is organized to make it as easy as possible to add new extensions and support the continued growth and coverage of textract. For almost all applications, you will just have to do something like this:
import textract
text = textract.process('path/to/file.extension')
to obtain text from a document. You can also pass keyword arguments to textract.process
, for example, to use a particular method for parsing a pdf like this:
import textract
text = textract.process('path/to/a.pdf', method='pdfminer')
or to specify a particular output encoding (input encodings are inferred using chardet):
import textract
text = textract.process('path/to/file.extension', encoding='ascii')
Some parsers also enable additional options which can be passed in as keyword arguments to the textract.process
function. Here is a quick table of available options that are available to the different types of parsers:
parser | option | description |
---|---|---|
gif | language | Specify the language for OCR-ing text with tesseract |
jpg | language | Specify the language for OCR-ing text with tesseract |
png | language | Specify the language for OCR-ing text with tesseract |
language | For use when method='tesseract' , specify the language |
|
tiff | language | Specify the language for OCR-ing text with tesseract |
As an example of using these additional options, you can extract text from a Norwegian PDF using Tesseract OCR like this:
text = textract.process(
'path/to/norwegian.pdf',
method='tesseract',
language='nor',
)
When textract.process('path/to/file.extension')
is called, textract.process
looks for a module called textract.parsers.extension_parser
that also contains a Parser
.
textract.parsers.process
Importantly, the textract.parsers.extension_parser.Parser
class must inherit from textract.parsers.utils.BaseParser
.
textract.parsers.utils.BaseParser
Many of the parsers rely on command line utilities to do some of the parsing. For convenience, the textract.parsers.utils.ShellParser
class includes some convenience methods for streamlining access to the command line.
textract.parsers.utils.ShellParser
There are quite a few parsers included with textract
. Rather than elaborating all of them, here are a few that demonstrate how parsers work.
textract.parsers.epub_parser.Parser
textract.parsers.doc_parser.Parser