OCR-D workflow configurations based on makefiles
Makefilization offers the following advantages:
- incremental builds (steps already processed for another configuration or in a failed run need not be repeated) and automatic dependencies (new files will force all their dependents to update)
- persistence of configuration and results
- encapsulation and ease of use
- sharing configurations and repeating experiments
- less writing effort, fast templating
- parallelization across workspaces
Nevertheless, there are also some disadvantages:
- depends on directories (fileGrps) as targets, which is hard to get correct under all circumstances
- must mediate between filesystem perspective (understood by make) and METS perspective
- make cannot handle target names with spaces in them (at all)
(This means that fileGrp directories must not have spaces. Local file paths may contain spaces though, if the respective processors support that.)
To install system dependencies for this package, run...
...in a privileged context for Ubuntu (like a Docker container).
Or equivalently, install the following packages:
Additionally, you must install ocrd itself along with its dependencies in the current shell environment. Moreover, depending on the specific configurations you want to use (i.e. the processors they contain), additional modules must be installed. See the OCR-D setup guide for instructions.
... if you are in a (Python) virtual environment. Otherwise specify the installation prefix directory via environment variable
Assuming $VIRTUAL_ENV/bin is in your PATH, you can then call:
cd WORKSPACE && make [OPTIONS] -f WORKFLOW-CONFIG.mk
# or
make -C WORKSPACE [OPTIONS] -f WORKFLOW-CONFIG.mk
... for processing a single workspace directory, or ...
ocrd-make [OPTIONS] -f WORKFLOW-CONFIG.mk WORKSPACE...
... for processing multiple workspaces at once (with the same interface as above).
- OPTIONS are the usual options controlling GNU make (e.g. -j for parallel processing).
- WORKFLOW-CONFIG.mk is one of the configuration makefiles that you find here or have created yourself.
- WORKSPACE is a directory with a
all (the default) for all such directories that we can
Calling workflows is possible from anywhere in your filesystem, but for the WORKFLOW-CONFIG.mk you may need to:
- either provide the *.mk configurations in the source directory at installation time (to ensure they are installed under the installation prefix and can always be found by file name only)
- or provide full paths at runtime (by absolute path name, or relative to the CWD).
(The previous version of ocrd-make tried to copy or symlink all makefiles to the runtime directory. You can still use those, but should remove the old
Workflows are processed like software builds: File groups (depending on one another) are the targets to be built in each workspace, and all workspaces are built recursively. A build is finished when all targets exist and none are older than their respective prerequisites (e.g. image files).
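As a rough illustration of the timestamp rule just described, the following shell sketch mimics how a target is judged out of date. It is not how make is implemented; the function name needs_rebuild and the file group names in the demo are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the timestamp rule: a target needs (re)building when it is
# missing or when any of its prerequisites is newer than it.
# needs_rebuild is a hypothetical helper, not part of this package.
needs_rebuild() {
    local target=$1; shift
    [ -e "$target" ] || return 0                 # missing target: build it
    local prereq
    for prereq in "$@"; do
        [ "$prereq" -nt "$target" ] && return 0  # stale target: rebuild it
    done
    return 1                                     # target is up to date
}

# Demo in a throwaway "workspace" with two illustrative file group directories.
ws=$(mktemp -d)
mkdir "$ws/OCR-D-IMG" "$ws/OCR-D-BIN"
touch -t 202001010000 "$ws/OCR-D-IMG"            # prerequisite, older
touch -t 202001020000 "$ws/OCR-D-BIN"            # target, newer
if needs_rebuild "$ws/OCR-D-BIN" "$ws/OCR-D-IMG"; then echo rebuild; else echo up-to-date; fi
touch -t 202001030000 "$ws/OCR-D-IMG"            # e.g. new images arrived
if needs_rebuild "$ws/OCR-D-BIN" "$ws/OCR-D-IMG"; then echo rebuild; else echo up-to-date; fi
rm -r "$ws"
```

Because the targets are directories, it is the directory timestamps that get compared, which is one reason why fileGrp targets are hard to get right under all circumstances.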
To run a configuration...
Activate the working environment (virtualenv) and change to the target directory.
Choose (or create) a workflow configuration makefile.
(Yes, you have to look inside and browse its rules!)
cd WORKSPACE && make [OPTIONS] -f WORKFLOW-CONFIG.mk
# or
make -C WORKSPACE [OPTIONS] -f WORKFLOW-CONFIG.mk
... for processing a single workspace directory, or ...
ocrd-make [OPTIONS] -f WORKFLOW-CONFIG.mk all
(The special target all (which is also the default goal) will search for all workspaces in the current directory recursively.) You can also run on a subset of workspaces by passing these as goals on the command line...
ocrd-make -f WORKFLOW-CONFIG.mk PATH/TO/WORKSPACE1 PATH/TO/WORKSPACE2 ...
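To preview which directories a recursive search might pick up, a sketch like the following can help. It assumes that a workspace is recognised by a mets.xml file in its directory (the OCR-D default METS name); find_workspaces is a hypothetical helper and not the search that ocrd-make itself performs.

```shell
#!/usr/bin/env bash
# List every directory below ROOT (default: current directory) that
# contains a mets.xml, one path per line.
# Assumption: a workspace is identified by its mets.xml file.
find_workspaces() {
    find "${1:-.}" -type f -name mets.xml -exec dirname {} \; | sort
}

# Demo on a throwaway directory tree with two workspaces and one other folder.
root=$(mktemp -d)
mkdir -p "$root/book1" "$root/series/book2" "$root/notes"
touch "$root/book1/mets.xml" "$root/series/book2/mets.xml"
find_workspaces "$root"
rm -r "$root"
```

The printed paths are the kind of arguments that can be passed as goals on the command line, provided they contain no spaces.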
To get help:
To get a short description of the chosen configuration:
[ocrd-]make -f CONFIGURATION.mk info
To see the command sequence that would be executed for the chosen configuration (in the format of
[ocrd-]make -f CONFIGURATION.mk show
To run a workflow server for the command sequence that would be executed for the chosen configuration (to be controlled via ocrd workflow client or HTTP):
[ocrd-]make -f CONFIGURATION.mk server
To create workspaces from directories which contain image files:
To get help for the import tool:
To perform various tasks via XSLT on PAGE-XML files (these all share the same options, including
page-add-nsprefix-pc # adds namespace prefix 'pc:'
page-remove-metadataitem # remove all MetadataItem entries
page-remove-dead-regionrefs # remove non-existing regionRefs
page-remove-empty-readingorder # remove empty ReadingOrder or groups
page-remove-words # remove all Word (and Glyph) entries
page-remove-glyphs # remove all Glyph entries
page-fix-coords # replace negative values in coordinates by zero
page-move-alternativeimage-below-page # try to push page-level AlternativeImage back to subsegments
page-textequiv-lines-to-regions # project text from TextLines to TextRegions (concat with LF in between)
page-textequiv-words-to-lines # project text from Words to TextLines (concat with spaces in between)
page-extract-lines # extract TextLine/TextEquiv/Unicode consecutively
page-extract-words # extract Word/TextEquiv/Unicode consecutively
page-extract-glyphs # extract Glyph/TextEquiv/Unicode consecutively
To perform the same transformations, but as a workspace processor:
ocrd-page-transform -P xsl page-remove-words.xsl
# or with a stylesheet of your own:
cat <<'EOF' > my-transform.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <xsl:output method="xml" standalone="yes" encoding="UTF-8" omit-xml-declaration="no"/>
  <xsl:template match="//pc:Word"/>
  <xsl:template match="node()|text()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|text()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
EOF
ocrd-page-transform -P xsl my-transform.xsl
To spawn a new configuration file, in the directory of the source repository, do:
Furthermore, you can add any options that make understands (see make --help or info make 'Options Summary'). For example,
- --dry-run to just simulate the run
- --question to just check whether anything needs to be built at all
- --silent to suppress echoing recipes
- --jobs to run on workspaces in parallel
- --max-load to set the maximum load level in parallel mode
- --always-make to consider all targets out-of-date (i.e. unconditionally rebuild)
- --old-file to consider some target up-to-date w.r.t. its prerequisites (i.e. unconditionally keep) but older than its dependents (i.e. unconditionally ignore)
- --new-file to consider some target newer than its dependents (i.e. unconditionally update them)
For example, to rebuild anything after the fileGrp OCR-D-BIN:
ocrd-make -f CONFIGURATION.mk -W OCR-D-BIN all
You can also use that pattern to specify any fileGrp other than the .DEFAULT_GOAL of your configuration as the overall target. For example, to build anything up to the fileGrp OCR-D-SEG-LINE:
ocrd-make -f CONFIGURATION.mk .DEFAULT_GOAL=OCR-D-SEG-LINE all
If you run make in the workspace directly instead of having ocrd-make do it recursively, then no all target exists and you can directly set the target fileGrp to replace
make -C WORKSPACE -f CONFIGURATION.mk -W OCR-D-BIN
make -C WORKSPACE -f CONFIGURATION.mk OCR-D-SEG-LINE
There are 2 special variables. To process only a subset of pages in all fileGrps, use PAGES. For example, to only consider pages PHYS_0005 through PHYS_0007:
ocrd-make -f CONFIGURATION.mk all PAGES=PHYS_0005..PHYS_0007
make -C WORKSPACE -f CONFIGURATION.mk PAGES=PHYS_0005..PHYS_0007
And to override the default (or configured) log levels for all processors and libraries, use LOGLEVEL. For example, to get debugging everywhere, do:
ocrd-make -f CONFIGURATION.mk all LOGLEVEL=DEBUG
make -C WORKSPACE -f CONFIGURATION.mk LOGLEVEL=DEBUG
To write new configurations, first choose a (sufficiently descriptive) makefile name, and spawn a new file for that:
make -C workflow-configuration NEW-CONFIGURATION.mk
(or copy from an existing configuration).
Next, edit the file to your needs: Write rules using file groups as prerequisites/targets in the normal GNU make syntax. The first target defined must be the default goal that builds the very last file group for that configuration, or else a variable .DEFAULT_GOAL pointing to that target must be set anywhere in the makefile.
Keep the comments and the include Makefile directive in the file.
Change/customize at least the info target, and the
Copy/paste rules from the existing configurations.
Define variables with the names of all target/prerequisite file groups, so rules and dependent targets can re-use them (and the names can be easily changed later).
Try to utilise the provided static pattern rule (which takes the target as output file group and the prerequisite as input file group) for all processing steps. The rule covers any OCR-D compliant processor with no more than 1 output file group. Use it by simply defining the target-specific variable TOOL (and optionally PARAMS or OPTIONS) and giving no recipe whatsoever.
When any of your processors use GPU resources, you must prevent races for GPU memory during parallel execution.
You can achieve this by simply setting GPU = 1 for that target when using the static pattern rule, or by using sem --id OCR-D-GPUSEM when writing your own recipes.
Alternatively, you can either prevent using GPUs globally by (un)setting CUDA_VISIBLE_DEVICES=, or prevent running parallel jobs (on multiple CPUs) by passing
INPUT = OCR-D-GT-SEG-LINE

$(INPUT):
	ocrd workspace find -G $@ --download
	ocrd workspace find -G OCR-D-IMG --download # just in case

# You can use variables for file group names to keep the rules brief:
BIN = $(INPUT)-BINPAGE

# This is how you use the pattern rule from Makefile (included below):
# The prerequisite will become the input file group,
# the target will become the output file group,
# the recipe will call the executable given by TOOL,
# also generating a JSON parameter file from PARAMS:
$(BIN): $(INPUT)
$(BIN): TOOL = ocrd-olena-binarize
$(BIN): PARAMS = "impl": "sauvola-ms-split"
# or equivalently:
$(BIN): OPTIONS = -P impl sauvola-ms-split

# You can also use the file group names directly:
OCR-D-OCR-TESS: $(BIN)
OCR-D-OCR-TESS: TOOL = ocrd-tesserocr-recognize
OCR-D-OCR-TESS: PARAMS = "textequiv_level": "glyph", "model": "frk+deu"
# or equivalently:
OCR-D-OCR-TESS: OPTIONS = -P textequiv_level glyph -P model frk+deu

# This uses more than 1 input file group and no output file group,
# which works with the standard recipe as well (but mind the ordering):
EVAL: $(INPUT) OCR-D-OCR-TESS
EVAL: TOOL = ocrd-cor-asv-ann-evaluate

# Because the first target in this file was $(BIN),
# we must override the default goal to be our desired overall target:
.DEFAULT_GOAL = EVAL

# ALWAYS necessary:
include Makefile
OCR-D ground truth
data_structure_text/dta repository, which includes both layout and text annotation down to the textline level, but very coarse segmentation, the following character error rate (CER) was measured:
Hence, it appears that consistently (across different OCRs) ...
- denoising with Ocropy (with noise_maxsize=3.0) does not help
- deskewing with Ocropy on the page level usually helps
- additional deskewing and flipping with Tesseract on the region level usually deteriorates results
- binarization with sauvola-ms-split is better than
However, this result is still preliminary. Both the processor implementations evolve and the GT annotations get fixed over time.
To make writing (and reading) configurations as simple as possible, they are expressed as rules operating on METS file groups (i.e. workspace-local). For convenience, the most common recipe pattern involving only 1 input and 1 output file group via some OCR-D CLI is available via static pattern rule, which merely takes the target-specific variables TOOL (the CLI executable) and optionally PARAMS (a JSON-formatted list of parameter assignments) or OPTIONS (a white-space separated list of parameter assignments). Custom rules are possible as well. If the makefile does not start with the overall target, it must specify its .DEFAULT_GOAL, so callers can run without knowledge of the target names.
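To make the roles of TOOL and PARAMS more tangible, here is a minimal shell sketch of how such a rule could be turned into a processor call. It only prints the command line instead of running it. The helper processor_command is hypothetical, and passing the JSON inline via -p (rather than through a generated parameter file) is a simplification assumed here; the shared Makefile may well do this differently.

```shell
#!/usr/bin/env bash
# Compose (but do not run) an OCR-D processor call from the pieces a rule
# provides: TOOL, input file group, output file group and PARAMS.
# Assumption: PARAMS becomes a JSON object simply by wrapping it in braces.
processor_command() {
    local tool=$1 input=$2 output=$3 params=$4
    printf "%s -I %s -O %s -p '{ %s }'\n" "$tool" "$input" "$output" "$params"
}

# Values taken from a binarization rule using ocrd-olena-binarize:
processor_command ocrd-olena-binarize \
    OCR-D-GT-SEG-LINE OCR-D-GT-SEG-LINE-BINPAGE \
    '"impl": "sauvola-ms-split"'
```

With OPTIONS instead of PARAMS, the -P key value pairs would presumably be appended to the command line as they are.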
Rules that are not configuration-specific (like the static pattern rule) are all shared by including a common Makefile at the end of configuration makefiles (which gets copied from workflow.mk at install time).
make always operates on the level of the workspace directory (i.e. only one at a time), where targets are fileGrps and the default goal is the maximum fileGrp.
For running entire collections of workspaces (possibly in parallel), recursive make has been abandoned in favour of ocrd-make. Its command-line interface looks like make, but the targets are workspaces and the default goal is all (which recursively finds all workspaces).
GPU vs CPU parallelism
When executing workflows in parallel across workspaces (with --jobs) on multiple CPUs, it must be ensured that not too many OCR-D processors which use GPU resources are running concurrently (to prevent over-allocation of GPU memory). Thus, make needs to know:
- which processors (have/want to) use GPU resources, and
- how many such processors can run in parallel.
It can then synchronize these processors with a semaphore. This is achieved by expanding the static pattern rule with a synchronisation mechanism (based on GNU parallel). Workflow configurations can use that by setting the target-specific variable GPU to a non-empty value for the respective rules. (Custom recipes will have to use sem --id OCR-D-GPUSEM.)
That way, races are prevented, but also GPUs cannot become the bottleneck: When all GPUs are busy, processors will fall back to CPU.
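For custom recipes, a wrapper along the following lines is one way to serialise GPU-bound calls through the shared semaphore. This is only a sketch: run_with_gpu_lock and the GPU_JOBS variable are hypothetical names, it merely makes callers wait for a free slot, and it does not reproduce the fallback to CPU described above.

```shell
#!/usr/bin/env bash
# Run a command while holding a slot of the GPU semaphore.
# GPU_JOBS (hypothetical) is the number of GPU-bound processors allowed
# to run at the same time; it defaults to 1.
run_with_gpu_lock() {
    if command -v sem >/dev/null 2>&1; then
        # sem ships with GNU parallel; --fg keeps the command in the foreground
        sem --id OCR-D-GPUSEM -j "${GPU_JOBS:-1}" --fg "$@"
    else
        "$@"   # GNU parallel not installed: run without synchronisation
    fi
}

# Stand-in for a GPU-bound processor call:
run_with_gpu_lock echo "recognition step finished"
```

Since the semaphore is identified by name, all recipes using the same --id compete for the same slots, even across separate make processes.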
workspace vs page parallelism
When executing workflows in parallel across workspaces (with --jobs) on multiple CPUs, it must be ensured that OCR-D processors do not use local multiprocessing facilities themselves (to prevent over-allocation of CPUs).
In the current state of affairs, OCR-D processors cannot be run in parallel across pages via multiprocessing. (At least, they are never implemented that way.) That may change in the future with a new OCR-D API. But still, many processors do already use libraries like OpenMP or OpenBLAS which use multiprocessing locally within pages. This can be controlled via environment variables like
This is achieved by exporting these variables to all recipes with a value of
-j is used, or half the number of physical CPUs (unless
NTHREADS is explicitly given) otherwise.
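The selection logic can be sketched in shell as follows. Several points are assumptions made for illustration: that the value under -j is 1, that an explicit NTHREADS only matters when -j is not used, and that OMP_NUM_THREADS and OPENBLAS_NUM_THREADS are among the exported variables. The helper default_nthreads is hypothetical.

```shell
#!/usr/bin/env bash
# Print the thread limit to hand to each processor.
#   $1  "yes" if make runs parallel jobs (-j), anything else otherwise
#   $2  number of physical CPUs
#   $3  explicit NTHREADS value (optional)
default_nthreads() {
    local parallel=$1 ncpus=$2 explicit=${3:-}
    if [ "$parallel" = yes ]; then
        echo 1                      # assumption: one thread per processor under -j
    elif [ -n "$explicit" ]; then
        echo "$explicit"            # NTHREADS explicitly given
    else
        local half=$(( ncpus / 2 ))
        [ "$half" -ge 1 ] || half=1 # never go below one thread
        echo "$half"
    fi
}

# Example: sequential run on a machine with 8 physical CPUs.
n=$(default_nthreads no 8)
export OMP_NUM_THREADS="$n" OPENBLAS_NUM_THREADS="$n"   # assumed variable names
echo "threads per processor: $n"
```

Taking the CPU count as an argument keeps the sketch portable, since counting physical (rather than logical) CPUs differs between systems.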