Added option to pass language parameter to tesseract for non-English OCR documents #76

Merged
merged 1 commit into deanmalmgren:master on Jan 30, 2015

@anderser
Contributor

anderser commented Sep 15, 2014

Useful for extracting text from non-English documents. If the keyword argument ‘language’ is passed to the pdf parser, the image parser uses it when invoking tesseract. Tesseract language data can be downloaded from https://code.google.com/p/tesseract-ocr/downloads/list

Example use:

from textract.parsers.pdf_parser import Parser
norwegian_text = Parser().extract('scanned_nor.pdf', method='tesseract', language='nor')
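
For readers who want to see what "uses this when triggering tesseract" could look like, here is a minimal sketch, not the code from this pull request. It assumes the tesseract binary is on the PATH and supports `stdout` as the output base; `-l` is tesseract's standard language flag. The helper names `build_tesseract_command` and `ocr_image` are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, language=None):
    """Return the argument list for running tesseract on a single image.

    Hypothetical helper for illustration; not textract's actual code.
    """
    # "stdout" as the output base asks tesseract to print the recognized text.
    command = ["tesseract", image_path, "stdout"]
    if language is not None:
        # -l selects the language data to use, e.g. "nor" for Norwegian.
        command += ["-l", language]
    return command


def ocr_image(image_path, language=None):
    """Run tesseract and return the recognized text (assumes tesseract is installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

When no language is given the command is left untouched, so tesseract falls back to its own default language.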
@deanmalmgren

Owner

deanmalmgren commented Jan 30, 2015

@anderser I am really embarrassed I haven't done anything with this yet.

I'd really like to merge this into the project and, more generally, make it easy to support these kinds of options from the command line, too. textract's command line interface is very simple, which is great. It really only has a --method flag, used for example to pick a particular method for extracting text from pdfs.

I'd love to get your feedback on how we could make options like language='nor' available on the command line as well. What would you think of something like this:

prompt$ textract --method tesseract --option language=nor path/to/some.pdf

As other options get added for different types of parsers, people can just continue to use the --option flag to specify these things on the command line. Does this seem reasonable? Do you have any other suggestions for what the command line interface could look like?
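
One way the proposed interface could be wired up is sketched below. This is an illustration under assumptions, not textract's actual command line code: it uses argparse with a repeatable `--option` flag and splits each `key=value` pair into keyword arguments for a parser. The function names `build_parser` and `parse_options` are hypothetical.

```python
import argparse


def parse_options(pairs):
    """Turn ["language=nor", ...] into {"language": "nor", ...}."""
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        options[key] = value
    return options


def build_parser():
    """Build a hypothetical command line parser with a repeatable --option flag."""
    parser = argparse.ArgumentParser(prog="textract")
    parser.add_argument("filename")
    parser.add_argument("--method", default="")
    # action="append" lets --option be given several times.
    parser.add_argument("--option", action="append", default=[], metavar="KEY=VALUE")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    kwargs = parse_options(args.option)
    print(args.filename, args.method, kwargs)
```

Because the flag is generic, new parser-specific options would not need new command line switches; they arrive as extra entries in the keyword dictionary.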

@deanmalmgren deanmalmgren merged commit 8f7b32c into deanmalmgren:master Jan 30, 2015

1 check failed

continuous-integration/travis-ci The Travis CI build failed

deanmalmgren added a commit that referenced this pull request Jan 30, 2015

@deanmalmgren

Owner

deanmalmgren commented Jan 31, 2015

Thanks for adding the language option, @anderser. I added the command line functionality to get this to work so you can also specify language=nor on the command line using the --option/-O command line switch. I also updated the documentation and added this in the latest textract release.

Apologies once again for not getting this incorporated sooner. It's a great addition and I really appreciate it.

If you have any improvements to the docs or any test cases you'd like to add, that would be great!

@anderser

Contributor

anderser commented Feb 3, 2015

Sorry for my late answer. Good to know it has been merged, now I can finally leave my fork :) I think your suggestion for the command line interface is great. It looks flexible and easy to use.
Next time I'll remember to add docs as well. Keep up the good work with this great lib.
