OCR Evaluation Data
This repository contains data for evaluating Ocular, a historical document OCR system available at https://github.com/tberg12/ocular. It provides train/dev/test splits for several books, each with pre-extracted lines, along with gold transcriptions for the dev and test sets.
Subsets of these documents were used for testing in the following publications:
If you find any mistakes in the gold transcriptions in this repository, please let us know. We would like the transcriptions to be as accurate as possible.
Running Ocular with the data
To train and evaluate with this data, use the following options during font training:
-inputDocPath documents/BOOK/train -extractedLinesPath extractions/BOOK -evalInputDocPath documents/BOOK/test -evalExtractedLinesPath extractions/BOOK
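Since BOOK has to be substituted in four places, it can help to generate the options rather than type them by hand. The following is a minimal Python sketch, not part of Ocular: the function name `ocular_data_options` and the `eval_split` parameter are hypothetical, and evaluating on the dev split assumes the dev documents sit in `documents/BOOK/dev`, mirroring the test layout shown above.

```python
def ocular_data_options(book, eval_split="test"):
    """Build the data-related options listed above for one book.

    `book` replaces the BOOK placeholder. `eval_split` is a hypothetical
    convenience parameter: "test" reproduces the options exactly as given
    in this README, while "dev" assumes a documents/BOOK/dev directory.
    """
    if eval_split not in ("dev", "test"):
        raise ValueError("eval_split must be 'dev' or 'test'")
    return [
        "-inputDocPath", f"documents/{book}/train",
        "-extractedLinesPath", f"extractions/{book}",
        "-evalInputDocPath", f"documents/{book}/{eval_split}",
        "-evalExtractedLinesPath", f"extractions/{book}",
    ]


# Print the options as a single string, ready to append to the
# font-training command used to launch Ocular.
print(" ".join(ocular_data_options("BOOK")))
```

The list form is convenient if the training command is launched from a script, since each option and its value stay separate arguments; joining with spaces gives the same string as the line above.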
Retrieving image files
The page images themselves are not stored in this repository; only the pre-extracted lines are. However, every image file referenced in
documents can be found on the Primeros Libros website. The URLs for the images are formulaic and can be determined from the image filename. The filename template
corresponds to the URL template
For example, the image for filename
can be found at