Added option to pass language parameter to tesseract for non-English OCR documents #76
Useful for extracting text from non-English documents. If the keyword argument 'language' is passed to the PDF parser, the image parser passes it on when invoking tesseract. Tesseract language data can be downloaded from https://code.google.com/p/tesseract-ocr/downloads/list
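For readers following along, here is a minimal sketch of the idea, not the code from this pull request. It assumes the image parser shells out to the tesseract binary and forwards the language through tesseract's `-l` flag. The function name `build_tesseract_command` and its parameters are hypothetical, chosen only for illustration.

```python
def build_tesseract_command(image_path, output_base, language=None):
    """Build the argument list for a tesseract call.

    Hypothetical helper: the name and signature are assumptions, not
    part of textract. Only the `-l LANG` flag comes from tesseract itself.
    """
    command = ["tesseract", image_path, output_base]
    if language:
        # Forward the language code (e.g. "nor") only when one was given,
        # so the default behaviour is unchanged for English documents.
        command += ["-l", language]
    return command


if __name__ == "__main__":
    print(build_tesseract_command("page.png", "page", language="nor"))
```

The resulting list could be handed to `subprocess.check_call`, provided tesseract and the matching language data are installed.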
@anderser I am really embarrassed I haven't done anything with this yet.
I'd really like to merge this into the project and, more generally, make it easy to support these kinds of options from the command line, too. textract's command line interface is very simple, which is great. It really only has a
I'd love to get your feedback on how we could make options like
prompt$ textract --method tesseract --option language=nor path/to/some.pdf
As other options get added for different types of parsers, people can just continue to use the
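One way the `--option key=value` idea from the example command could be parsed is sketched below with `argparse`. This is an illustration under assumptions, not textract's actual command line code. The names `parse_options`, `build_parser`, and `parse_command_line` are hypothetical.

```python
import argparse


def parse_options(pairs):
    """Turn ['language=nor', ...] into {'language': 'nor', ...}."""
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        options[key] = value
    return options


def build_parser():
    # Hypothetical parser; flag names mirror the example command above.
    parser = argparse.ArgumentParser(prog="textract-sketch")
    parser.add_argument("filename")
    parser.add_argument("--method", default="")
    # action="append" lets --option be repeated, one KEY=VALUE per use.
    parser.add_argument("--option", action="append", default=[],
                        metavar="KEY=VALUE")
    return parser


def parse_command_line(argv):
    args = build_parser().parse_args(argv)
    return args.filename, args.method, parse_options(args.option)


if __name__ == "__main__":
    print(parse_command_line(
        ["--method", "tesseract", "--option", "language=nor",
         "path/to/some.pdf"]))
```

The parsed dictionary could then be passed to a parser as keyword arguments, so new parser-specific options need no new command line flags.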
Jan 30, 2015
Thanks for adding the language option, @anderser. I added the command line functionality to get this to work so you can also specify
Apologies once again for not getting this incorporated sooner. It's a great addition and I really appreciate it.
If you have any improvements to the docs or any test cases you'd like to add, that would be great!
Sorry for my late answer. Good to know it has been merged; now I can finally leave my fork :) I think your suggestion for the command line issue is great. It looks flexible and easy to use.