The CIS language-aware OCR document error profiler


Source code for the language-aware OCR document error profiler. See the Profiler Manual for a description.


The profiler was originally written by Uli Reffle as part of his PhD thesis in computational linguistics at CIS during the IMPACT project (2008-2011).

It was further developed by Florian Fink at CIS as a CLARIN-D curation project (Kurationsprojekt).

Its underlying technology is described in the following publications:

Mihov, Stoyan, and Klaus U. Schulz. 2004. “Fast Approximate Search in Large Dictionaries.” Computational Linguistics 30 (4). MIT Press: 451–77.

Reffle, Ulrich. 2011. Algorithmen und Methoden zur dokumentenspezifischen Analyse historischer und OCR-erfasster Texte. Verlag Dr. Hut.

Reffle, Ulrich, and Christoph Ringlstetter. 2013. “Unsupervised Profiling of OCRed Historical Documents.” Pattern Recognition 46 (5): 1346–57. doi:

Schulz, Klaus U., and Stoyan Mihov. 2002. “Fast String Correction with Levenshtein Automata.” International Journal on Document Analysis and Recognition 5 (1). Springer: 67–85.
