Dramatically improve deskew performance with leptonica #61
A few design notes:
Leptonica's deskew is far superior to ImageMagick's convert -deskew command -- around 30-40x faster. Subjectively the output appears to this contributor to be of higher quality as well. The difference is the algorithm: ImageMagick uses the complex Hough transform to find the skew angle, while Leptonica uses a simpler method, Postl's variance of differential line sums -- conceptually, shear the image and check for straight horizontal lines. In this case simplicity wins. Finding the skew angle is the bulk of the work. Leptonica's author explains the advantages of his approach here: http://www.leptonica.com/skew-measurement.html
Leptonica is the low-level library that Tesseract depends on. Hence, this project already depends on Leptonica.
Leptonica can read and write most common image file types on its own. Unfortunately its error handling is poor: it seldom returns any meaningful error codes. The best it manages is writing messages to stderr, which in the context of a verbose script is just confusing, since the source of the error is not indicated. The problem is compounded by Tesseract's use of Leptonica, which will produce exactly the same errors in some cases. So we trap stderr between calls to Leptonica and parse it for a few different types of error message.
leptonica.py is Python 2/3 compatible and set up to provide access to other Leptonica functions as needed. Of particular interest is its orientation detection (including flip and rotation errors), which it does by comparing text ascenders to descenders.
There is a PyPI "pyleptonica" package; however, it is out of date by a few years, and it implements all of Leptonica with Python wrappers -- so it is massive, with one .py file at 2.5 MB. This module is loosely inspired by pyleptonica but is more modern and up to date, and contains only limited functionality.
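For readers who want a picture of the stderr trapping described above, here is a minimal sketch of the general technique; it is not the code in leptonica.py. It redirects file descriptor 2 to a temporary file, so messages written by a C library are caught as well. The names `capture_stderr_fd` and `parse_errors` are hypothetical, and the `Error in <function>: <message>` line format is an assumption about what Leptonica prints.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_stderr_fd():
    """Redirect fd 2 to a temp file; the yielded list receives the captured text.

    Works at the file-descriptor level, so output from C code is captured too.
    """
    captured = []
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep a handle on the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            yield captured
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            captured.append(tmp.read().decode("utf-8", errors="replace"))


def parse_errors(text):
    """Extract (function, message) pairs from lines shaped like
    'Error in <function>: <message>' (an assumed message format)."""
    errors = []
    prefix = "Error in "
    for line in text.splitlines():
        if line.startswith(prefix):
            func, _, message = line[len(prefix):].partition(": ")
            errors.append((func, message))
    return errors
```

A caller would wrap each library call in `with capture_stderr_fd() as out:` and then inspect `parse_errors(out[0])`, which ties any message to the call that produced it.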
Leptonica does not interpret the .pbm/.pgm/.ppm extensions correctly. However, when asked to produce a .pnm file, it will produce the expected .pbm/.pgm/.ppm content depending on the input. So ask it to produce a .pnm and then adjust the extension. Also add a test case.
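The "adjust the extension" step could look roughly like the sketch below. This is an illustration rather than the code in this commit: the helper name `fix_pnm_extension` is hypothetical, and it picks the extension from the standard Netpbm magic numbers (P1/P4 for bitmap, P2/P5 for graymap, P3/P6 for pixmap) instead of from anything Leptonica reports. It uses `os.replace`, so it assumes Python 3.3 or later.

```python
import os

# Standard Netpbm magic numbers mapped to their conventional extensions.
PNM_MAGIC_TO_EXT = {
    b"P1": ".pbm", b"P4": ".pbm",  # bitmap (ASCII / binary)
    b"P2": ".pgm", b"P5": ".pgm",  # graymap
    b"P3": ".ppm", b"P6": ".ppm",  # pixmap
}


def fix_pnm_extension(path):
    """Rename a .pnm file to .pbm/.pgm/.ppm based on its magic number.

    Returns the new path. Raises ValueError if the file is not Netpbm data,
    which would also catch a misnamed PNG.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    ext = PNM_MAGIC_TO_EXT.get(magic)
    if ext is None:
        raise ValueError("not a Netpbm file: %r" % path)
    new_path = os.path.splitext(path)[0] + ext
    os.replace(path, new_path)
    return new_path
```

Checking the magic number rather than trusting the file name would also have flagged the misnamed PNG output early.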
I discovered that using -c (clean) broke unpaper because Leptonica was actually generating PNG files with the extension .pbm/pgm/ppm. Tesseract did not have any trouble accepting these misnamed PNGs, so I did not notice. Unpaper, however, will not accept them. I implemented a workaround that convinces Leptonica to generate output files that both Tesseract and Unpaper accept. As a side note, I think it may be better to use TIFF as the intermediate format, since it can contain all of the other formats anyway, and all of the programs seem to accept it. Another benefit would be maintaining JPEG compression for large color images without expanding the whole thing to disk.
Thank you for your work. I had realized that deskew performance was not good, but I did not know that it was possible to improve it that much.
It seems to be possible to get the skew angle with tesseract v3.03 (using the option -psm 0). I would propose to do the deskew using tesseract in ocrmypdf v3.x. I will keep this pull request open until the change mentioned above is implemented in v3.x.
ImageMagick does a pitifully slow job of deskewing, so I looked around for other options. It turns out that Leptonica, the library Tesseract uses for image manipulation, has its own deskewing function, and it works great.
On a real benchmark for deskew + OCR on a mixed bag of files, I got 371 seconds as the baseline time, and 216 seconds when Leptonica handled the deskew -- meaning this improvement will reduce the total processing time for a deskew+OCR job by about 40%. Looking at deskew without OCR as a standalone test, Leptonica's deskew absolutely runs circles around ImageMagick, getting the job done 30-40x faster.
It also seems to produce better quality images. I found a few cases where ImageMagick introduces a halftone pattern into the background after rotation, for example.
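To give a feel for why the Leptonica approach is cheap, here is a toy sketch of the idea from the design notes: project the foreground pixels along lines at a trial angle, then score how sharply adjacent line sums differ. It is a pure-Python illustration with hypothetical function names, not Leptonica's implementation, which works on packed binary images and refines the angle more cleverly than the brute-force sweep assumed here.

```python
import math
from collections import Counter


def line_sum_score(pixels, angle_deg):
    """Sum of squared differences between adjacent line sums at one trial angle.

    `pixels` is a list of (x, y) foreground coordinates. When the trial angle
    matches the text angle, each text line collapses into one bin and the
    score peaks.
    """
    if not pixels:
        return 0
    slope = math.tan(math.radians(angle_deg))
    sums = Counter(int(round(y - x * slope)) for x, y in pixels)
    lo, hi = min(sums), max(sums)
    # Missing bins read as 0 in a Counter, so empty rows are handled.
    return sum((sums[r + 1] - sums[r]) ** 2 for r in range(lo - 1, hi + 1))


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Brute-force sweep over trial angles; returns the best-scoring one."""
    pixels = list(pixels)
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda a: line_sum_score(pixels, a))


def synthetic_lines(angle_deg, width=200, baselines=(20, 40, 60, 80)):
    """Foreground pixels of a few one-pixel-thick lines drawn at `angle_deg`."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, int(round(y0 + x * slope)))
            for y0 in baselines for x in range(width)]


if __name__ == "__main__":
    print(estimate_skew(synthetic_lines(2.0)))
```

Each trial angle costs one pass over the foreground pixels plus a pass over the row bins, which is consistent with the point that this simple method has an advantage over a Hough transform.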
Leptonica has a PyPI package (pyleptonica) but it's huge and out of date. So I wrote this tiny custom one.
More implementation notes in the commits.
Portability should be fine – it relies on Python's find_library to do the nasty work of finding the shared library it needs. Tested on OS X 10.9 and Ubuntu 13.10. I am not sure if it will work on platforms where liblept is compiled as 64-bit and Python is running as 32-bit. I understand if you want to delay this in the interest of stabilizing your v2.0 release.
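A minimal sketch of that lookup, assuming the library is registered under the short name `lept` (from liblept); the helper name `load_shared_library` is hypothetical, and the module in this PR may handle failures differently. Only `ctypes.util.find_library` and `ctypes.CDLL` from the standard library are used.

```python
import ctypes
from ctypes.util import find_library


def load_shared_library(name):
    """Return a ctypes.CDLL for `name`, or None if it cannot be found or loaded.

    `name` is given without the 'lib' prefix or a file suffix, e.g. 'lept'.
    """
    path = find_library(name)
    if path is None:
        return None  # library not installed, or not on the search path
    try:
        return ctypes.CDLL(path)
    except OSError:
        # Found but not loadable, e.g. a 64-bit library with a 32-bit Python.
        return None


if __name__ == "__main__":
    lept = load_shared_library("lept")
    print("Leptonica loaded" if lept is not None else "Leptonica not available")
```

Catching `OSError` on load is one way to turn the 32-bit/64-bit mismatch mentioned above into a clean "not available" result instead of a crash.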
ImageMagick is still needed (just barely) for an invocation of identify somewhere.