Dramatically improve deskew performance with leptonica #61

jbarlow83 · 2014-01-20T03:24:00Z

ImageMagick does a pitifully slow job of deskewing, so I looked around for other options. It turns out that Leptonica, the library Tesseract uses for image manipulation, has its own deskewing function, and it works great.

On a real benchmark for deskew + OCR on a mixed bag of files, I got 371 seconds as the baseline time and 216 seconds when Leptonica handled the deskew, so this improvement reduces the total processing time for a deskew+OCR job by about 40%. Looking at deskew without OCR as a standalone test, Leptonica's deskew absolutely runs circles around ImageMagick, getting the job done 30-40x faster.

It also seems to produce better quality images. I found a few cases where ImageMagick introduces a halftone pattern into the background after rotation, for example.

Leptonica has a PyPI package (pyleptonica) but it's huge and out of date. So I wrote this custom tiny one.

More implementation notes in the commits.

Portability should be fine: it relies on Python's find_library to do the nasty work of finding the shared library it needs. Tested on OS X 10.9 and Ubuntu 13.10. I am not sure if it will work on platforms where liblept is compiled as 64-bit and Python is running as 32-bit. I understand if you want to delay this in the interest of stabilizing your v2.0 release.
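As a rough illustration of the find_library approach, here is a minimal sketch. The helper name `load_shared_library` is hypothetical and is not taken from the PR's leptonica.py; only `ctypes.util.find_library` and `ctypes.CDLL` are real standard-library calls. It also shows where a 32-bit/64-bit mismatch would most likely surface: as an `OSError` when the library is loaded.

```python
import ctypes
from ctypes.util import find_library


def load_shared_library(name):
    """Return a ctypes handle for the named shared library, or None.

    Hypothetical helper for illustration only.
    """
    # find_library takes the name without the "lib" prefix or suffix,
    # so "lept" would match liblept.so.* or liblept.dylib.
    path = find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        # Found but not loadable, for example an architecture mismatch
        # between the library and the running interpreter.
        return None


lept = load_shared_library("lept")
if lept is None:
    print("Leptonica shared library not found or not loadable")
```

Returning None instead of raising keeps the caller free to fall back to another deskew method when the library is unavailable.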

ImageMagick is still needed (just barely) for an invocation of identify somewhere.

A few design notes:

Leptonica's deskew is far superior to ImageMagick's convert -deskew command: it is around 30-40x faster, and subjectively the output appears to this contributor to be of higher quality as well.

The difference is the algorithm. ImageMagick uses the complex Hough transform to find the skew angle, while Leptonica uses a simpler method, Postl's variance of differential line sums. Conceptually: shear the image and check for straight horizontal lines. In this case simplicity wins. Finding the skew angle is the bulk of the work. Leptonica's author explains the advantages of his approach here: http://www.leptonica.com/skew-measurement.html

Leptonica is the low-level library that Tesseract depends on, so this project already depends on Leptonica. Leptonica can read and write most common image file types on its own.

Unfortunately its error handling is poor: it seldom returns any meaningful error codes. The best it manages is writing messages to stderr, which in the context of a verbose script is just confusing since the error's source is not indicated. The problem is compounded by Tesseract's use of Leptonica, which will produce exactly the same errors in some cases. So we trap stderr between calls to Leptonica and parse it for a few different types of error message.

leptonica.py is Python 2/3 compatible and set up to provide access to other Leptonica functions as needed. Of particular interest is its orientation detection (including flip and rotation errors), which it does by comparing text ascenders to descenders.

There is a PyPI "pyleptonica" package, however it is out of date by a few years, and it implements all of Leptonica with Python wrappers, so it is massive, with one .py file at 2.5 MB. This module is loosely inspired by pyleptonica but is more modern, up to date, and contains only limited functionality.

Leptonica does not interpret the .pbm/.pgm/.ppm extensions correctly. However, when asked to produce a .pnm file, it will produce the expected .pbm/pgm/ppm file depending on the input. So ask it to produce a .pnm and then adjust the extension. And add a test case.
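One way to "adjust the extension" is to read the Netpbm magic number at the start of the file and rename accordingly. This is a sketch under that assumption; the function name `fix_pnm_extension` is hypothetical and the PR's actual code may do this differently.

```python
import os

# Netpbm magic numbers (plain and raw variants) and their usual extensions.
PNM_EXTENSIONS = {
    b"P1": ".pbm", b"P4": ".pbm",  # bitmap
    b"P2": ".pgm", b"P5": ".pgm",  # graymap
    b"P3": ".ppm", b"P6": ".ppm",  # pixmap
}


def fix_pnm_extension(path):
    """Rename a .pnm file to .pbm/.pgm/.ppm based on its magic number.

    Hypothetical helper for illustration. Returns the new path.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    ext = PNM_EXTENSIONS.get(magic)
    if ext is None:
        raise ValueError("not a Netpbm file: %r" % (path,))
    new_path = os.path.splitext(path)[0] + ext
    os.rename(path, new_path)
    return new_path
```

Checking the magic number also catches the failure mode behind this fix: a PNG written under a Netpbm extension would raise ValueError here instead of being passed on silently.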

jbarlow83 · 2014-01-22T05:56:07Z

I discovered that using -c (clean) broke unpaper because Leptonica was actually generating PNG files with the extension .pbm/pgm/ppm. Tesseract did not have any trouble accepting these misnamed PNGs, so I did not notice. Unpaper, however, will not accept them. I implemented a workaround that convinces Leptonica to generate the desired output file that both Tesseract and Unpaper accept.

As a side note, I think it may be better to use TIFF as the intermediate format, since it can contain all of the other formats anyway, and all of the programs seem to accept it. Another benefit would be maintaining JPEG compression for large color images without expanding the whole thing to disk.

fritz-hh · 2014-01-23T18:39:16Z

Thank you for your work. I had realized that deskew performance was not good, but I did not know that it was possible to improve it that much.
Indeed, I only plan to fix bugs until release of v2. I plan to integrate these improvements (as well as auto rotate if possible) in v3.

fritz-hh · 2014-09-30T20:11:28Z

It seems to be possible to get the skew angle with tesseract v3.03 (using the option -psm 0).
(Tesseract probably uses Leptonica to compute the skew, so performance should be good too.)

I would propose doing the deskew with tesseract in ocrmypdf v3.x.

I will keep this pull request open until the change mentioned above is implemented in v3.x.

Jim Barlow added 4 commits January 19, 2014 14:28

Replace ImageMagick-convert with Leptonica

6703434

Fix a silly typo, and other minor cleanup

8cfbdaf

femifrak mentioned this pull request May 29, 2014

original images not kept unaltered #78

Closed

jbarlow83 closed this Jul 28, 2015

OCRmyPDF-issuebot mentioned this pull request Sep 14, 2015

original images not kept unaltered ocrmypdf/OCRmyPDF#8

Closed

jbarlow83 deleted the for-upstream/leptdeskew branch November 30, 2017 19:07
