Feature request: Processing of non-PDF sources #15

Open
bland328 opened this issue Jun 9, 2020 · 2 comments

bland328 commented Jun 9, 2020

I realize this may be thoroughly outside the intended scope of this project, but it would be wonderful if it would process not just PDF files, but also a variety of image files (tiff and jpg come to mind). Perhaps by passing them directly to tesseract-ocr and outputting the results as text files?

Thanks for the fantastic Unraid docker container, and for your consideration!

jo-me (Contributor) commented Nov 16, 2020

It could work almost out of the box.
According to the docs, ocrmypdf can process images:
https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#option-use-ocrmypdf-single-images-only

ocrmypdf-auto only needs to allow the extension to be processed. I added jpg to the extension list, but ocrmypdf failed on a picture I took with my phone due to invalid DPI values. I did not try it out further.
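To make the idea concrete, here is a minimal sketch of an extension allow-list plus an ocrmypdf command line that supplies a DPI for image inputs. The names `ALLOWED_EXTENSIONS`, `IMAGE_EXTENSIONS`, `is_processable`, and `build_ocrmypdf_command` are hypothetical and not taken from ocrmypdf-auto, and the fallback value of 300 DPI is an assumption. The `--image-dpi` option is an ocrmypdf flag for image inputs; whether it resolves the failure described above was not tested.

```python
from pathlib import Path

# Hypothetical allow-list; the real ocrmypdf-auto setting may be named differently.
ALLOWED_EXTENSIONS = {".pdf", ".jpg"}
IMAGE_EXTENSIONS = {".jpg"}


def is_processable(path):
    """Return True if the file's extension (case-insensitive) is on the allow-list."""
    return Path(path).suffix.lower() in ALLOWED_EXTENSIONS


def build_ocrmypdf_command(input_path, output_pdf, fallback_dpi=300):
    """Build an ocrmypdf argv list, adding --image-dpi for image inputs.

    fallback_dpi is an assumed value, not one recommended by the project.
    """
    command = ["ocrmypdf"]
    if Path(input_path).suffix.lower() in IMAGE_EXTENSIONS:
        # Tell ocrmypdf which resolution to assume for the image input.
        command += ["--image-dpi", str(fallback_dpi)]
    command += [str(input_path), str(output_pdf)]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a system where ocrmypdf is installed.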

The way to go would probably be to convert images with img2pdf first and then feed the resulting PDFs to ocrmypdf.
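A rough sketch of that two-step approach, assuming both the `img2pdf` and `ocrmypdf` command-line tools are installed and on the PATH. The function names `build_pipeline_commands` and `ocr_image` are made up for illustration, and this has not been tried against the phone picture mentioned above.

```python
import subprocess
import tempfile
from pathlib import Path


def build_pipeline_commands(image_path, intermediate_pdf, output_pdf):
    """Return the two argv lists: image -> PDF, then PDF -> OCRed PDF."""
    convert = ["img2pdf", str(image_path), "-o", str(intermediate_pdf)]
    ocr = ["ocrmypdf", str(intermediate_pdf), str(output_pdf)]
    return [convert, ocr]


def ocr_image(image_path, output_pdf):
    """Convert an image with img2pdf, then run ocrmypdf on the result."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The intermediate PDF lives in a temporary directory and is discarded afterwards.
        intermediate_pdf = Path(tmp_dir) / (Path(image_path).stem + ".pdf")
        for command in build_pipeline_commands(image_path, intermediate_pdf, output_pdf):
            # check=True raises CalledProcessError if either tool exits with an error.
            subprocess.run(command, check=True)
```

For example, `ocr_image("photo.jpg", "photo.pdf")` would run the two commands in order and stop at the first one that fails.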

cmccambridge (Owner) commented

An initial step toward this support is now available with @jo-me's latest updates. The image now supports the .jpg extension by passing a jpg file directly to ocrmypdf, though from @jo-me's experiments, it sounds like this is not sufficient for proper OCR in all cases.

I will keep this issue open to track the feature request and see whether it is reasonable to add img2pdf preprocessing in the container in a future update.
