I had been using pdfsandwich to create searchable PDFs from non-searchable PDFs.
However, it's a pain to collect all the dependencies if, for example, you don't
have root access. So I thought I would package them up with Julia's BinaryBuilder
to make installation simple. Unfortunately, I wasn't able to cross-compile
pdfsandwich itself. But since tesseract is doing the hard work anyway, I decided
to write the glue script myself. It turns out there are several of these already.
My implementation has likely diverged from that of pdfsandwich, since I don't
use ImageMagick's convert, which is one of the dependencies of pdfsandwich. The
job can be done very simply, e.g.:
- convert each page of the PDF to an image
- possibly clean it up with unpaper
- use tesseract to create a single-page searchable PDF
- combine the PDFs.
Since the job is this simple, I decided not to look at the source of pdfsandwich
when creating my implementation, so that I can stick to an MIT license, which is
the usual one in the Julia community.
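
The steps listed above can be sketched in Julia roughly as follows. This is only an illustration, not the code SearchablePDFs.jl actually uses. The function names (`rasterize_cmd`, `ocr_cmd`, `combine_cmd`, `make_searchable`) are hypothetical, and the sketch assumes that `pdftoppm` and `pdfunite` (both from Poppler) and `tesseract` are available on your PATH. The optional unpaper step is left out.

```julia
# Sketch only: hypothetical helpers, not the SearchablePDFs.jl implementation.
# Assumes `pdftoppm`, `pdfunite` (Poppler) and `tesseract` are on PATH.

# Step 1: render every page of `pdf` to PNG files named `<prefix>-<n>.png`.
rasterize_cmd(pdf, prefix; dpi=300) = `pdftoppm -r $dpi -png $pdf $prefix`

# Step 3: OCR one image; tesseract writes `<outbase>.pdf` with a text layer.
ocr_cmd(image, outbase) = `tesseract $image $outbase pdf`

# Step 4: concatenate the single-page PDFs into one file.
combine_cmd(pages, out) = `pdfunite $pages $out`

function make_searchable(input, output; dpi=300)
    mktempdir() do dir
        run(rasterize_cmd(input, joinpath(dir, "page"); dpi=dpi))
        # Assumption: pdftoppm zero-pads page numbers, so sorting the
        # file names lexicographically gives the page order.
        images = sort(filter(f -> endswith(f, ".png"), readdir(dir; join=true)))
        pages = map(images) do img
            base = first(splitext(img))
            run(ocr_cmd(img, base))
            base * ".pdf"
        end
        run(combine_cmd(pages, output))
    end
    return output
end
```

Building the commands in small functions keeps the external tool invocations easy to inspect or swap out, for example to insert a clean-up step between rasterizing and OCR.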
It more or less works on macOS (both Intel and Apple Silicon) and Linux.
Next steps:
- Allow choice of training data used for tesseract
- Look at what settings should be used for unpaper
- Robustify and test on more files
- Add better tests?
Usage from Julia:

    using SearchablePDFs
    file = ocr("test/test_rasterized.pdf")

Call

    using SearchablePDFs
    SearchablePDFs.comonicon_install()

to install a CLI script powered by Comonicon.jl to ~/.julia/bin/searchable.
Add the ~/.julia/bin folder to your PATH to be able to use searchable as an
executable.
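
If you are unsure whether that folder is already on your PATH, a quick check from the Julia REPL might look like the sketch below. It assumes the default location `~/.julia/bin` mentioned above. The variable names are just for illustration.

```julia
# Assumption: the CLI script was installed to the default ~/.julia/bin folder.
bindir = joinpath(homedir(), ".julia", "bin")

# PATH entries are separated by ';' on Windows and ':' elsewhere.
sep = Sys.iswindows() ? ';' : ':'
on_path = bindir in split(get(ENV, "PATH", ""), sep)

println(on_path ? "$bindir is already on PATH" :
                  "Add $bindir to PATH in your shell startup file")
```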