brobertson/lacebuilder

Lacebuilder

Lacebuilder is a friendly command-line application that generates packages for Lace, an in-browser web application for editing OCR output into TEI. Point it to an image directory, the corresponding hOCR output directory, and a simple XML metadata file, and it produces .xar packages that can be installed in Lace through eXist-db's drag-and-drop package manager.

Features

  • Generates a base image package for all derived OCR runs, binarizing all images
  • Generates OCR output packages with the enhanced data used to make editing OCR easy in Lace, including word spellcheck status and dehyphenation
  • Automatically corrects the word bounding boxes of kraken hOCR output

Examples

lacebuilder offers two subcommands, packimages and packtexts, each with its own parameters. The parameters --outputdir and --metadatafile are common to both, so they must be given before the subcommand name. At present, the subcommands cannot be chained. Even to view a subcommand's --help, you must first supply these two shared parameters, thus:

lacebuilder --outputdir /tmp/ --metadatafile /tmp/myfile_meta.xml packtexts --help
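
Because the shared options must precede the subcommand, a small shell wrapper can keep that ordering in one place. The following is a minimal sketch and not part of lacebuilder: the function name lace and the variables LACE_OUT, LACE_META, and LACEBUILDER are hypothetical names chosen for illustration.

```shell
# Hypothetical wrapper (not part of lacebuilder): places the shared
# options before the subcommand, as the CLI requires.
# LACE_OUT and LACE_META are assumed variable names; LACEBUILDER lets you
# substitute another executable (for example, echo) for a dry run.
lace() {
    if [ "$#" -lt 1 ]; then
        echo "usage: lace SUBCOMMAND [ARGS...]" >&2
        return 2
    fi
    subcommand="$1"
    shift
    "${LACEBUILDER:-lacebuilder}" \
        --outputdir "$LACE_OUT" \
        --metadatafile "$LACE_META" \
        "$subcommand" "$@"
}
```

With LACE_OUT=/tmp/ and LACE_META=/tmp/myfile_meta.xml set, `lace packtexts --help` expands to the command shown above.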

Building an image package:

lacebuilder --outputdir /home/brucerob/ --metadatafile ~/Test_Lacebuilder/552464779_meta.xml packimages --imagedir ~/Test_Tarantella/test
outputdir: /home/brucerob/
generating image xar archive
Binarizing and compressing images
image archive of 111 images saved to /home/brucerob/552464779_images.xar

Building an hOCR output text package requires more information, because Lace uses it to store multiple OCR 'runs' of a given image set and, eventually, to search and compile runs that were completed using the same classifier:

lacebuilder --outputdir /home/brucerob/ --metadatafile ~/Test_Tarantella/552464779_meta.xml packtexts  --hocrdir ~/Test_Tarantella/test_hocr_out/ --classifier ~/Downloads/Kraken-Greek-Classifiers-and-Samples/porson_2020-10-10-11-54-25_best.mlmodel --imagexarfile ~/552464779_images.xar
dehyphenating
spellchecking
generating hocr xar
accuracy 91%, Greek acc. 91%; completed 00%, Greek completed 00%
total:  20669 ; total correct: 11369
writing this data to  /tmp/tmpo0_6nin6total.xml
text archive from date 2021-01-30-16-05-42 saved to /home/brucerob/552464779-2021-01-30-16-05-42-porson_2020-10-10-11-54-25_best-texts.xar
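
Judging only from the two sample runs above, the archive names appear to follow the patterns `<identifier>_images.xar` and `<identifier>-<timestamp>-<classifier name without extension>-texts.xar`. The sketch below reproduces those names so that a script can locate the files afterwards. The patterns are inferred from this output rather than from lacebuilder's source, and the function names are hypothetical.

```shell
# Naming patterns inferred from the sample output above, NOT taken from
# lacebuilder's source. Function names are hypothetical.

# images_xar_name IDENTIFIER
images_xar_name() {
    printf '%s_images.xar\n' "$1"
}

# texts_xar_name IDENTIFIER TIMESTAMP CLASSIFIER_PATH
texts_xar_name() {
    classifier=$(basename "$3")
    classifier=${classifier%.*}    # drop the final extension, e.g. .mlmodel
    printf '%s-%s-%s-texts.xar\n' "$1" "$2" "$classifier"
}
```

For example, `texts_xar_name 552464779 2021-01-30-16-05-42 porson_2020-10-10-11-54-25_best.mlmodel` yields the text archive name reported in the run above.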

Example Including Archive.org Files and Tesseract Processing

Here is a sequence of bash commands that convert a meta.xml file and zip archive of jp2 image files into Lace packages:

# Collect the Archive.org downloads (meta.xml and jp2 zip) in a working directory
mkdir /tmp/Pliny
cd /tmp/Pliny/
mv ~/Downloads/epistularumlibr00plin_* ./
unzip epistularumlibr00plin_jp2.zip
cd epistularumlibr00plin_jp2/
# Convert each JP2 page image to PNG, six jobs at a time
parallel -P 6 opj_decompress -i {} -o {.}.png ::: *jp2
mkdir epistularumlibr00plin_png
mv *png epistularumlibr00plin_png/
# Run Tesseract with the Latin model on each PNG, writing hOCR output
mkdir epistularumlibr00plin_hocr
parallel -P 6 "tesseract -l lat {} epistularumlibr00plin_hocr/{/.} hocr" ::: epistularumlibr00plin_png/*png
# Return to /tmp/Pliny, where the metadata file lives, so the relative paths below resolve
cd ..
lacebuilder --outputdir . --metadatafile epistularumlibr00plin_meta.xml packimages --imagedir epistularumlibr00plin_jp2/epistularumlibr00plin_png/
lacebuilder --outputdir . --metadatafile epistularumlibr00plin_meta.xml packtexts --imagexarfile epistularumlibr00plin_images.xar --hocrdir epistularumlibr00plin_jp2/epistularumlibr00plin_hocr/ --ocr-engine tesseract --classifier lat --verbose
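
Before packaging, it can be worth confirming that Tesseract produced one hOCR file per page image. The following is a minimal sketch, assuming the directory layout used above and relying on the fact that Tesseract's hocr config writes `<name>.hocr`. The function name check_hocr is hypothetical and not part of lacebuilder.

```shell
# check_hocr PNG_DIR HOCR_DIR
# Hypothetical helper (not part of lacebuilder).
# Prints the name of every PNG lacking a non-empty matching .hocr file
# and returns non-zero if any are missing.
check_hocr() {
    missing=0
    for png in "$1"/*.png; do
        [ -e "$png" ] || continue          # directory has no PNG files
        stem=$(basename "$png" .png)
        if [ ! -s "$2/$stem.hocr" ]; then  # absent or empty
            echo "missing hOCR for $stem.png"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Run it against the two directories, for instance `check_hocr epistularumlibr00plin_jp2/epistularumlibr00plin_png epistularumlibr00plin_jp2/epistularumlibr00plin_hocr`, and proceed to packtexts only if it succeeds.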

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
