A set of handy scripts to make the tesseract training process a bit easier.

tess_school
--------------
by Derek Dohler, dohler@gmail.com

A basic set of tools that make it easier to train Tesseract. Cube training is not supported yet.

Installation:
None, just download and run the appropriate scripts.

Dependencies:
-Tesseract 3.02 (everything except shape_clustering.sh should work with 3.01)
-Python 2.6
-ImageMagick
-Pango and Cairo bindings for Python (needed only to automatically generate training images)

Python Scripts:
-text2img.py: Takes a ground-truth text file and automatically generates image files from the text, for use in training Tesseract. Everything is hard-coded at the moment; there are no command-line options yet. Eventually I'd like to have this generate the boxfile too.

-merge_boxes.py: Merges nearby boxes in a boxfile that were produced when Tesseract oversegmented a character. This is a common Tesseract error, and this script will quickly fix most instances of it.

-align_boxfile.py: Changes a boxfile to match a ground-truth text file. Will abort and complain if the number of boxes doesn't match the number of characters in the file, so run this only after your boxes are in the right places.
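
As an illustration of the alignment step described above, here is a minimal sketch. It is not the code in align_boxfile.py; the function name align_boxes is hypothetical. It assumes the Tesseract 3 boxfile layout of one box per line ("symbol left bottom right top page") and assumes that whitespace in the ground truth has no box of its own.

```python
def align_boxes(box_lines, truth_text):
    """Relabel boxfile lines with the characters of a ground-truth text.

    Hypothetical helper for illustration only.
    """
    # Assumption: spaces and newlines never get a box, so skip them.
    truth_chars = [ch for ch in truth_text if not ch.isspace()]
    boxes = [line.split() for line in box_lines if line.strip()]

    # Abort and complain when the counts differ, as the README describes.
    if len(boxes) != len(truth_chars):
        raise ValueError("%d boxes but %d ground-truth characters"
                         % (len(boxes), len(truth_chars)))

    # Keep the coordinates and page number; replace only the symbol.
    return [" ".join([ch] + fields[1:])
            for ch, fields in zip(truth_chars, boxes)]


if __name__ == "__main__":
    lines = ["T 10 20 30 60 0", "c 32 20 50 45 0", "e 52 20 70 45 0"]
    for line in align_boxes(lines, "The"):
        print(line)
    # T 10 20 30 60 0
    # h 32 20 50 45 0
    # e 52 20 70 45 0
```

Because the relabeling is purely positional, one extra or missing box shifts every later label, which is why the box count has to be right first.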

Shell scripts:
-png2tif.sh: Uses ImageMagick to convert the PNG output from text2img.py to TIFF files. Tesseract can read PNG files, but sometimes seems to prefer TIFF.
-make_boxes.sh: Makes boxfiles from images generated by text2img.py.
-auto_train.sh: Steps through the remaining training steps one by one. It must be in the same folder as the other scripts and the training files. The other shell scripts each automate one of the steps needed for Tesseract training and are named accordingly.
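
For readers who would rather drive the PNG-to-TIFF conversion from Python, the sketch below builds one ImageMagick command per image. It is not a transcription of png2tif.sh. The function name conversion_commands and the ".tif" output extension are assumptions, and it relies on the classic ImageMagick "convert INPUT OUTPUT" invocation (newer ImageMagick releases also offer a "magick" command).

```python
import os


def conversion_commands(png_paths):
    """Return one ImageMagick command (as an argument list) per PNG file.

    Hypothetical helper: the output name is the input name with its
    extension swapped for ".tif".
    """
    commands = []
    for path in png_paths:
        base, _ext = os.path.splitext(path)
        commands.append(["convert", path, base + ".tif"])
    return commands


if __name__ == "__main__":
    for cmd in conversion_commands(["page1.png", "page2.png"]):
        print(" ".join(cmd))
```

Each argument list can be passed to subprocess.check_call to run the conversion, provided ImageMagick is installed and on the PATH.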

Suggested workflow:
1. text2img.py
2. png2tif.sh
3. make_boxes.sh
4. merge_boxes.py and align_boxfile.py, plus manual editing, until the boxfile is correct.
5. auto_train.sh
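
Step 4 usually begins by merging the fragments of oversegmented characters. The sketch below shows one way to do that; it is not the algorithm in merge_boxes.py. The function name merge_nearby, the max_gap threshold, and the rule of keeping the first fragment's symbol are all assumptions made for illustration. Boxes are (symbol, left, bottom, right, top, page) tuples in reading order.

```python
def merge_nearby(boxes, max_gap=1):
    """Merge consecutive boxes that sit almost side by side on one line.

    Hypothetical helper: two boxes are merged when they are on the same
    page, overlap vertically, and the horizontal gap between them is at
    most max_gap pixels.
    """
    merged = []
    for sym, left, bottom, right, top, page in boxes:
        if merged:
            p_sym, p_left, p_bottom, p_right, p_top, p_page = merged[-1]
            same_line = bottom < p_top and top > p_bottom
            close = left >= p_left and left - p_right <= max_gap
            if page == p_page and same_line and close:
                # Replace the previous box with the bounding box of both.
                # The symbol of the first fragment is kept; relabeling is
                # left to a later alignment step.
                merged[-1] = (p_sym, min(p_left, left), min(p_bottom, bottom),
                              max(p_right, right), max(p_top, top), page)
                continue
        merged.append((sym, left, bottom, right, top, page))
    return merged


if __name__ == "__main__":
    # An "m" split into two fragments, followed by a separate "a".
    boxes = [("r", 10, 20, 14, 40, 0),
             ("n", 15, 20, 24, 40, 0),
             ("a", 40, 20, 55, 40, 0)]
    print(merge_nearby(boxes, max_gap=2))
    # [('r', 10, 20, 24, 40, 0), ('a', 40, 20, 55, 40, 0)]
```

A threshold that is too generous will also merge genuinely separate narrow characters, so some manual checking of the result is still needed.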