-
Notifications
You must be signed in to change notification settings - Fork 594
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Dean Malmgren
committed
Jul 7, 2014
1 parent
e9860b8
commit 184b313
Showing
13 changed files
with
70 additions
and
10 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,3 +6,6 @@ libxslt1-dev | |
|
||
# parse word documents | ||
antiword | ||
|
||
# parse pdfs | ||
poppler-utils |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,3 +3,4 @@ | |
argcomplete | ||
python-pptx | ||
python-docx | ||
pdfminer |
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
from ..shell import run | ||
from ..exceptions import UnknownMethod | ||
|
||
|
||
def extract(filename, method=None, **kwargs): | ||
"""Extract text from pdf files using ``method``. | ||
""" | ||
method = method or 'pdftotext' | ||
if method == 'pdftotext': | ||
return extract_pdftotext(filename) | ||
elif method == 'pdfminer': | ||
return extract_pdfminer(filename) | ||
else: | ||
raise UnknownMethod(method) | ||
|
||
|
||
def extract_pdftotext(filename): | ||
"""Extract text from pdfs using the pdftotext command line utility.""" | ||
pipe = run('pdftotext %(filename)s -' % locals()) | ||
return pipe.stdout.read() | ||
|
||
|
||
def extract_pdfminer(filename): | ||
"""Extract text from pdfs using pdfminer.""" | ||
pipe = run('pdf2txt.py %(filename)s' % locals()) | ||
return pipe.stdout.read() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters