Support multi-page tesseract extraction of PDFs. #66

Merged
merged 1 commit into from Sep 8, 2014

@pudo
Contributor

pudo commented Aug 31, 2014

This will use pdftoppm (included in poppler) to split up a PDF into one image per page, then run OCR on each image and concatenate the result.
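The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`page_number`, `ocr_image`, `extract_pdf_with_ocr`) are hypothetical, and it assumes a poppler build whose `pdftoppm` supports `-png` and a `tesseract` binary on the `PATH`.

```python
import glob
import os
import re
import shutil
import subprocess
import tempfile


def page_number(path):
    """Return the page index that pdftoppm appended to the file name."""
    match = re.search(r"-(\d+)\.[a-z]+$", path)
    return int(match.group(1)) if match else 0


def ocr_image(image_path):
    """Run tesseract on one image and return the recognized text."""
    base = os.path.splitext(image_path)[0]
    # tesseract writes its result to <base>.txt
    subprocess.check_call(["tesseract", image_path, base])
    with open(base + ".txt", "rb") as handle:
        return handle.read().decode("utf-8")


def extract_pdf_with_ocr(pdf_path, ocr=ocr_image):
    """Split a PDF into page images, OCR each page, and join the text."""
    workdir = tempfile.mkdtemp()
    try:
        root = os.path.join(workdir, "page")
        # pdftoppm writes one image per page: <root>-<page number>.png
        subprocess.check_call(["pdftoppm", "-png", pdf_path, root])
        # Sort numerically so page 10 does not land before page 2.
        images = sorted(glob.glob(root + "-*.png"), key=page_number)
        return "\n".join(ocr(image) for image in images)
    finally:
        # Always clean up the temporary page images.
        shutil.rmtree(workdir)
```

The `ocr` parameter is only there so the per-page step can be swapped out, for example in tests that should not depend on tesseract being installed.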

I would like to make this so that the PDF extractor first tries pdftotext, and if no text can be extracted, defaults back to OCR. That would have some performance impact but be really thorough.
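One possible shape for that fallback, as a sketch only: try the text layer first and run OCR only when it comes back empty. The names `pdftotext_extract` and `extract_with_fallback` are assumptions made for illustration, and the OCR step is passed in as a callable rather than tied to any particular implementation.

```python
import subprocess


def pdftotext_extract(pdf_path):
    """Return the embedded text layer; '-' tells pdftotext to write to stdout."""
    output = subprocess.check_output(["pdftotext", pdf_path, "-"])
    return output.decode("utf-8", "replace")


def extract_with_fallback(pdf_path, text_extractor=pdftotext_extract,
                          ocr_extractor=None):
    """Prefer the text layer; fall back to OCR when no text is found."""
    text = text_extractor(pdf_path)
    # strip() also removes the form feeds that pdftotext emits between
    # pages, so an image-only PDF is treated as empty here.
    if text.strip():
        return text
    if ocr_extractor is None:
        raise ValueError("no text layer found and no OCR extractor supplied")
    return ocr_extractor(pdf_path)
```

With this arrangement the OCR cost is paid only for PDFs without a usable text layer, which is where the performance impact mentioned above would come from.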

@deanmalmgren

Owner

deanmalmgren commented Sep 5, 2014

Thanks for the contribution, @pudo. In my version of your branch (889dcae) I did some minor reconfiguring of the test to get it to work properly. Apologies if that was confusing at all.

Things still don't quite appear to be working, I'm guessing because there are some unicode differences on our machines for some reason. Here's what I get when I run textract -m tesseract tests/pdf/ocr_text.pdf [pastebin] which isn't consistent with what you have in tests/pdf/ocr_text.txt. If this output is satisfactory, I'm happy to overwrite and merge everything in; I just want to be sure this is what you expect and that there isn't some other problem. Just let me know how you want to proceed.

@deanmalmgren

Owner

deanmalmgren commented Sep 6, 2014

Looking at the differences between our extractions a little more closely with meld this morning, it is pretty clear that yours is significantly better. What version of tesseract are you using? Can you elaborate a bit more on your build environment?

Here's what I've got:

vagrant@dev:/vagrant$ tesseract -v
tesseract 3.02
vagrant@dev:/vagrant$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04 LTS
Release:    12.04
Codename:   precise
vagrant@dev:/vagrant$ uname -a
Linux dev 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

@pudo

Contributor

pudo commented Sep 6, 2014

Hey @deanmalmgren, thanks for your patience with my slightly broken PR. I'm surprised to learn that my extract is better; I'm using a very similar version (albeit on a different platform):

➜  ~  tesseract -v
tesseract 3.02.02
 leptonica-1.69
  libjpeg 8d : libpng 1.5.17 : libtiff 4.0.3 : zlib 1.2.5
➜  ~  uname -a
Darwin isaac.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

I wonder if testing against specific extraction results is the best way to do this, or if a Makefile that invokes system commands directly to generate the gold standards might be a better route to go?
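To make the idea concrete, here is a rough sketch of regenerating a gold standard by calling the system command directly, written in Python rather than as a Makefile. Everything here is hypothetical: the `GOLD_COMMANDS` mapping, the function names, and the choice of `pdftotext` as the reference command are assumptions for illustration, not part of the project's test suite.

```python
import os
import subprocess

# Hypothetical mapping from file extension to the system command whose
# output would serve as the gold standard; "{path}" marks the input file.
GOLD_COMMANDS = {
    ".pdf": ["pdftotext", "{path}", "-"],
}


def gold_command(path):
    """Build the argument list that produces the expected text for path."""
    ext = os.path.splitext(path)[1].lower()
    return [part.format(path=path) for part in GOLD_COMMANDS[ext]]


def regenerate_gold(path, run=subprocess.check_output):
    """Run the system command and store its output next to the input file."""
    expected = run(gold_command(path))
    target = os.path.splitext(path)[0] + ".txt"
    with open(target, "wb") as handle:
        handle.write(expected)
    return target
```

Gold standards produced this way would reflect whatever tool versions are installed locally, so they would be regenerated per environment instead of being compared against one checked-in result.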

Also, I'm curious if you have any thoughts on whether it would make sense to introduce a mode that chains pdftotext and tesseract - I would very much like that for my use case where documents come in both the text-PDF and image-PDF variety.

@deanmalmgren

Owner

deanmalmgren commented Sep 6, 2014

OK, if it's just a tesseract version issue, I'll just overwrite your version of ocr_text.txt with my version and merge this PR. The reason I'm going to do that is because the Vagrant development environment is identical to what builds on travis-ci, so it's easy to make sure things are working properly.

I like the idea of "chaining pdftotext and tesseract" to somehow make it seamless to extract text from pdfs, regardless of whether they are raw or scanned pdfs. In fact, I think this should be the default behavior. Want to add that in a separate PR or add it to this one?

I'm curious about your Makefile idea for testing. Want to mock something up for one or two filetypes and share it in a separate PR? I just refactored the testing framework and am definitely open to further improvements.

@pudo

Contributor

pudo commented Sep 8, 2014

Ok, great! If you could overwrite the ocr_text.txt, then I'll take a stab at the other two issues as soon as I can.

@deanmalmgren deanmalmgren merged commit a526c02 into deanmalmgren:master Sep 8, 2014

1 check failed

continuous-integration/travis-ci The Travis CI build failed

deanmalmgren added a commit that referenced this pull request Sep 8, 2014

@deanmalmgren

Owner

deanmalmgren commented Sep 8, 2014

Sounds good. I merged in your branch and am now using my version of ocr_text.txt. I look forward to checking out your "chaining" and "Makefile" ideas.

Thanks again for the contribution.
