A set of handy scripts to make the tesseract training process a bit easier.


by Derek Dohler,

A basic set of tools to make it easier to train Tesseract. No Cube training yet.

No installation is needed; just download and run the appropriate scripts.

-Tesseract 3.02 (everything except should work with 3.01)
-Python 2.6
-Pango and Cairo bindings for Python (needed to automatically generate training images)

Python Scripts:
-Takes a ground-truth text file and automatically generates image files from the text, for use in training Tesseract. Everything is hard-coded at the moment; there are no command-line options yet. Eventually I'd like to have this generate the boxfile too.
-Merges nearby boxes in a boxfile that result from Tesseract oversegmenting characters. This is a common error that Tesseract makes, and this script will quickly fix most instances of it.
-Changes a boxfile to match a ground-truth text file. Will abort and complain if the number of boxes doesn't match the number of characters in the file, so run this only after your boxes are in the right places.
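
To make the two boxfile operations concrete, here is a minimal sketch of the idea, not the code from this repository. It assumes the standard Tesseract boxfile line format (`<char> <left> <bottom> <right> <top> <page>`). The names `Box`, `merge_nearby`, `relabel` and the `max_gap` threshold are hypothetical, and so are the merge rule (consecutive boxes that overlap vertically and sit within `max_gap` pixels horizontally) and the choice to ignore whitespace when counting ground-truth characters.

```python
from collections import namedtuple

# One boxfile entry; coordinates are in pixels, origin at the bottom-left.
Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse '<char> <left> <bottom> <right> <top> <page>' into a Box."""
    char, left, bottom, right, top, page = line.split()
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box back into a boxfile line."""
    return "%s %d %d %d %d %d" % box


def merge_nearby(boxes, max_gap=1):
    """Merge consecutive boxes that look like one oversegmented character.

    Assumed rule: same page, vertical overlap, and a horizontal gap of at
    most max_gap pixels. The merged box keeps the first box's character,
    which may be wrong until the boxfile is relabeled.
    """
    merged = []
    for box in boxes:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.page == box.page
                and box.bottom < prev.top and prev.bottom < box.top
                and box.left - prev.right <= max_gap):
            merged[-1] = Box(prev.char,
                             min(prev.left, box.left),
                             min(prev.bottom, box.bottom),
                             max(prev.right, box.right),
                             max(prev.top, box.top),
                             prev.page)
        else:
            merged.append(box)
    return merged


def relabel(boxes, ground_truth):
    """Assign ground-truth characters to boxes, in order.

    Whitespace is skipped (an assumption: boxfiles have no boxes for
    spaces). Raises ValueError if the counts differ.
    """
    chars = [c for c in ground_truth if not c.isspace()]
    if len(chars) != len(boxes):
        raise ValueError("%d boxes but %d characters"
                         % (len(boxes), len(chars)))
    return [box._replace(char=c) for box, c in zip(boxes, chars)]
```

For example, an "m" that was split into "r" and "n" boxes one pixel apart would be merged into a single box by `merge_nearby`, and `relabel` would then give it the correct character from the ground-truth text. A small `max_gap` is deliberate here, since a larger value risks merging legitimately adjacent characters.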

Shell scripts:
-Uses ImageMagick to convert the PNG output from the image-generation script to TIFF files. Tesseract can read PNG files, but sometimes seems to prefer TIFF.
-Makes boxfiles from the generated images.
-Steps through the remaining training steps one by one. Needs to be in the same folder as your other scripts, and as the training files.
-All other scripts automate the steps necessary for Tesseract training, and are named appropriately.
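
For readers who want to see what the conversion and boxfile steps amount to, the sketch below builds the two command lines in Python. It is an illustration under assumptions, not a copy of the shell scripts here: it assumes ImageMagick's `convert` and the `tesseract ... batch.nochop makebox` invocation from the Tesseract 3.x training instructions, and the function names and the example file name are hypothetical.

```python
import os


def png_to_tiff_command(png_path):
    """Build the ImageMagick command that converts a PNG to a TIFF."""
    base, _ = os.path.splitext(png_path)
    return ["convert", png_path, base + ".tif"]


def makebox_command(tiff_path):
    """Build the Tesseract command that writes <base>.box for an image."""
    base, _ = os.path.splitext(tiff_path)
    return ["tesseract", tiff_path, base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    # Hypothetical training image name, following lang.font.expN.
    convert_cmd = png_to_tiff_command("eng.myfont.exp0.png")
    print(" ".join(convert_cmd))
    print(" ".join(makebox_command(convert_cmd[-1])))
```

Each returned list can be passed to `subprocess.call` to actually run the step, provided ImageMagick and Tesseract are installed and on the PATH.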

Suggested workflow:
4. and + manual editing until boxfile is correct.