Feature request: Processing of non-PDF sources #15

bland328 · 2020-06-09T15:36:50Z

I realize this may be thoroughly outside the intended scope of this project, but it would be wonderful if it could process not just PDF files but also a variety of image files (tiff and jpg come to mind). Perhaps by passing them directly to tesseract-ocr and outputting the results as text files?

Thanks for the fantastic Unraid docker container, and for your consideration!

jo-me · 2020-11-16T17:05:54Z

It could work almost out of the box.
According to the docs, ocrmypdf can process images:
https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#option-use-ocrmypdf-single-images-only

ocrmypdf-auto only needs to allow the extension to be processed. I added jpg to the extension list, but ocrmypdf failed on a picture I took with my phone because of invalid DPI values. I did not test it any further.

The way to go would probably be to use img2pdf for images and then feed them to ocrmypdf.
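
The two-step idea above can be sketched roughly as follows. This is only an illustration, not how ocrmypdf-auto is implemented: the function name `plan_commands`, the extension set, and the intermediate `.img2pdf.pdf` file name are all hypothetical. It only builds the `img2pdf` and `ocrmypdf` command lines; whether wrapping the image first actually avoids the invalid-DPI failure has not been tested here.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical extension list; the real list in ocrmypdf-auto may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff"}


def plan_commands(source: Path, out_dir: Path) -> list[list[str]]:
    """Return the command lines to run for one input file, in order."""
    suffix = source.suffix.lower()
    final_pdf = out_dir / (source.stem + ".pdf")

    if suffix == ".pdf":
        # PDFs go straight to ocrmypdf.
        return [["ocrmypdf", str(source), str(final_pdf)]]

    if suffix in IMAGE_EXTENSIONS:
        # Images are wrapped into a PDF by img2pdf first, then OCRed.
        wrapped = out_dir / (source.stem + ".img2pdf.pdf")
        return [
            ["img2pdf", str(source), "-o", str(wrapped)],
            ["ocrmypdf", str(wrapped), str(final_pdf)],
        ]

    # Anything else is skipped.
    return []


if __name__ == "__main__":
    for cmd in plan_commands(Path("incoming/photo.jpg"), Path("output")):
        print(" ".join(cmd))
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` so that a failure in the img2pdf step stops the OCR step from running on a missing file.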

cmccambridge · 2020-11-20T03:18:48Z

An initial step toward this support is now available with @jo-me's latest updates. The image now supports the .jpg extension by passing a jpg file directly to ocrmypdf, though from @jo-me's experiments, it sounds like this is not sufficient for proper OCR in all cases.

I will keep this issue open to track the feature request and see whether it is reasonable to add img2pdf preprocessing in the container in a future update.

cmccambridge mentioned this issue Nov 20, 2020

Upgrade packages #18

Merged
