Dramatically improve deskew performance with leptonica #61

Closed
wants to merge 4 commits into from

jbarlow83
Collaborator

ImageMagick does a pitifully slow job of deskewing, so I looked around for other options. It turns out that Leptonica, the library Tesseract uses for image manipulation, has its own deskewing function, and it works great.

On a real benchmark for deskew + OCR on a mixed bag of files, I got 371 seconds as the baseline time, and 216 seconds when Leptonica handled the deskew -- meaning this improvement reduces the total processing time for a deskew+OCR job by about 40%. Looking at deskew without OCR as a standalone test, Leptonica's deskew absolutely runs circles around ImageMagick, getting the job done 30-40x faster.
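
For anyone checking the numbers, the arithmetic behind the "about 40%" figure can be reproduced from the two timings quoted above. This is only a worked calculation on those two values; nothing else about the benchmark is assumed.

```python
# Timings quoted in the benchmark above, in seconds.
baseline_s = 371      # deskew + OCR with ImageMagick doing the deskew
leptonica_s = 216     # deskew + OCR with Leptonica doing the deskew

saved_s = baseline_s - leptonica_s            # seconds saved per run
reduction_pct = 100.0 * saved_s / baseline_s  # share of total time removed
speedup = baseline_s / leptonica_s            # end-to-end speedup factor

print("saved %d s, %.1f%% less time, %.2fx faster overall"
      % (saved_s, reduction_pct, speedup))
```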

It also seems to produce better quality images. I found a few cases where ImageMagick introduces a halftone pattern into the background after rotation, for example.

Leptonica has a PyPI package (pylepthonica) but it's huge and out of date. So I wrote this custom tiny one.

More implementation notes in the commits.

Portability should be fine – it relies on Python's find_library to do the nasty work of finding the shared library it needs. Tested on OS X 10.9 and Ubuntu 13.10. I am not sure if it will work on platforms where liblept is compiled as 64-bit and Python is running as 32-bit. I understand if you want to delay this in the interest of stabilizing your v2.0 release.
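
As a rough illustration of the loading approach described above (not the actual code in this pull request), the lookup could be sketched as follows. The helper names `load_library` and `load_leptonica` are hypothetical; `find_library`, `CDLL`, and the Leptonica functions `pixRead` and `pixDeskew` are real, and the library name `lept` is what the linker knows liblept as.

```python
from ctypes import CDLL, c_char_p, c_int, c_void_p
from ctypes.util import find_library


def load_library(name):
    """Return a CDLL handle for `name`, or None if it cannot be found or loaded."""
    path = find_library(name)
    if path is None:
        return None
    try:
        return CDLL(path)
    except OSError:
        # Raised when the file exists but cannot be loaded, for example a
        # 64-bit library with a 32-bit Python interpreter.
        return None


def load_leptonica():
    """Load liblept and declare the two functions a deskew step would need."""
    lept = load_library("lept")
    if lept is not None:
        # PIX* results must be declared as pointers; ctypes defaults to int,
        # which would truncate the address on 64-bit platforms.
        lept.pixRead.argtypes = [c_char_p]
        lept.pixRead.restype = c_void_p
        lept.pixDeskew.argtypes = [c_void_p, c_int]
        lept.pixDeskew.restype = c_void_p
    return lept
```

Returning `None` instead of raising lets the caller fall back to another deskew path when Leptonica is unavailable.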

ImageMagick is still needed (just barely) for an invocation of identify somewhere.

Jim Barlow added 4 commits January 19, 2014 14:28
A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command --
around 30-40x faster.  Subjectively, the output also appears to this
contributor to be of higher quality.  The difference is the algorithm:
ImageMagick uses the complex Hough transform to find the skew angle, while
Leptonica uses a simpler method, Postl's variance of differential line sums --
conceptually, shear the image and check for straight horizontal lines.  In
this case simplicity wins.  Finding the skew angle is the bulk of the work.

Leptonica's author explains the advantages of his approach here:
http://www.leptonica.com/skew-measurement.html
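
To make the "shear and check for straight horizontal lines" idea concrete, here is a minimal pure-Python sketch of the scoring step. It is a simplified illustration, not Leptonica's implementation: it assumes Python 3, a binary image given as a list of rows of 0/1, an exhaustive scan over candidate angles, and an arbitrary sign convention. All function names are hypothetical.

```python
from math import radians, tan


def line_sum_score(image, angle_deg):
    """Sum of squared differences between adjacent row sums after a vertical shear."""
    height, width = len(image), len(image[0])
    slope = tan(radians(angle_deg))
    sums = [0] * height
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                # Shear: move each column up or down in proportion to its
                # distance from the image centre.
                y2 = y - round((x - width / 2) * slope)
                if 0 <= y2 < height:
                    sums[y2] += 1
    # Text rows aligned with the raster give sharp jumps between neighbours.
    return sum((a - b) ** 2 for a, b in zip(sums, sums[1:]))


def estimate_skew(image, candidates):
    """Return the candidate angle (degrees) with the highest score."""
    return max(candidates, key=lambda angle: line_sum_score(image, angle))


def synthetic_page(height, width, rows, skew_deg):
    """Build a test image of one-pixel-thick lines skewed by skew_deg."""
    slope = tan(radians(skew_deg))
    image = [[0] * width for _ in range(height)]
    for y0 in rows:
        for x in range(width):
            y = y0 + round((x - width / 2) * slope)
            if 0 <= y < height:
                image[y][x] = 1
    return image


page = synthetic_page(80, 200, [20, 40, 60], 3.0)
candidates = [i * 0.5 for i in range(-10, 11)]   # -5.0 to 5.0 degrees
print(estimate_skew(page, candidates))
```

The score peaks when each text line collapses into a single raster row, which is why no edge detection or voting is needed.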

Leptonica is the low-level library that Tesseract depends on.  Hence, this
project already depends on Leptonica.  Leptonica can read and write most
common image file types on its own.

Unfortunately its error handling is poor: it seldom returns any meaningful
error codes.  The best it manages is writing messages to stderr, which in
the context of a verbose script is just confusing since the error's source
is not indicated.  The problem is compounded by Tesseract's use of Leptonica,
which will produce exactly the same errors in some cases.  So we trap stderr
between calls to Leptonica and parse it for a few different types of error
message.
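
A minimal sketch of trapping stderr around a C library call is shown below. It is an illustration of the general technique, not the code in this commit: because the messages come from C, the file descriptor itself has to be redirected rather than `sys.stderr`. The names `capture_c_stderr` and `leptonica_errors` are hypothetical, and the "Error in " prefix is an assumption about the message format being matched.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_c_stderr():
    """Redirect file descriptor 2 to a temp file; expose the text afterwards."""
    result = {"text": ""}
    sys.stderr.flush()
    saved_fd = os.dup(2)                      # remember the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)              # C-level writes now land in tmp
        try:
            yield result
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)              # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            result["text"] = tmp.read().decode("utf-8", errors="replace")


def leptonica_errors(stderr_text):
    """Return captured lines that look like library error messages (assumed prefix)."""
    return [line for line in stderr_text.splitlines()
            if line.startswith("Error in ")]
```

Usage would wrap a single library call, for example `with capture_c_stderr() as captured: ...`, and then inspect `leptonica_errors(captured["text"])` once the block exits.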

leptonica.py is Python 2/3 compatible and set up to provide access to other
Leptonica functions as needed.  Of particular interest is its orientation
detection (including flip and rotation errors), which it does by comparing
text ascenders to descenders.

There is a PyPI "pylepthonica" package, but it is out of date by a few
years, and it implements all of Leptonica with Python wrappers -- so it is
massive, with one .py file at 2.5 MB.  This module is loosely inspired by
pylepthonica but is more modern and up to date, and contains only limited
functionality.

Leptonica does not interpret the .pbm/.pgm/.ppm extensions correctly.  However,
when asked to produce a .pnm file, it will produce the expected .pbm/.pgm/.ppm
file depending on the input.  So ask it to produce a .pnm and then
adjust the extension.
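
One way the "adjust the extension" step could look is sketched below. This is an assumption about the approach, not the code in this commit: it picks the new extension from the Netpbm magic number at the start of the file, and the helper name `rename_pnm_by_content` is hypothetical.

```python
import os

# Netpbm magic numbers: the first two bytes identify the sub-format.
_PNM_EXTENSIONS = {
    b"P1": ".pbm", b"P4": ".pbm",   # bitmap (ASCII, binary)
    b"P2": ".pgm", b"P5": ".pgm",   # graymap (ASCII, binary)
    b"P3": ".ppm", b"P6": ".ppm",   # pixmap (ASCII, binary)
}


def rename_pnm_by_content(pnm_path):
    """Rename a .pnm file to .pbm/.pgm/.ppm based on its magic number."""
    with open(pnm_path, "rb") as f:
        magic = f.read(2)
    extension = _PNM_EXTENSIONS.get(magic)
    if extension is None:
        raise ValueError("not a PNM file: %r" % pnm_path)
    new_path = os.path.splitext(pnm_path)[0] + extension
    os.rename(pnm_path, new_path)
    return new_path
```

Checking the content rather than trusting the name also catches the misnamed-PNG case, since a PNG does not start with a `P1` to `P6` magic number.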

And add a test case.
@jbarlow83
Collaborator Author

I discovered that using -c (clean) broke unpaper because Leptonica was actually generating PNG files with the extension .pbm/pgm/ppm. Tesseract did not have any trouble accepting these misnamed PNGs, so I did not notice. Unpaper, however, will not accept them. I implemented a workaround that convinces Leptonica to generate the desired output file that both Tesseract and Unpaper accept.

As a side note, I think it may be better to use TIFF as the intermediate format, since it can contain all of the other formats anyway, and all of the programs seem to accept it. Another benefit would be maintaining JPEG compression for large color images without expanding the whole thing to disk.

@fritz-hh
Owner

Thank you for your work. I had realized that deskew performance was not good, but I did not know that it was possible to improve it that much.
Indeed, I only plan to fix bugs until the release of v2. I plan to integrate these improvements (as well as auto rotate if possible) in v3.

@fritz-hh
Owner

It seems to be possible to get the skew angle with Tesseract v3.03 (using the option -psm 0).
(Tesseract probably uses Leptonica to compute the skew, so performance should be good too.)

I would propose to do the deskew using Tesseract in ocrmypdf v3.x.

I will keep this pull request open until the change mentioned above is implemented in v3.x.
