Provenance for OCRProcessing/Processing and Content #35
The current OCRProcessing statement is rather rudimentary: it does not allow an identifier for each ProcessingStep, so features in the recognition results cannot be linked to the particular step that produced them. For example, in our pipeline we frequently combine tesseract's page segmentation with ocropus's recognition, so TextLine elements come from one ProcessingStep and their text content comes from another.
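To make the request concrete, here is a rough sketch of how such links could be resolved if they existed. The `ID` attribute on each step and the `STEPREF` attribute on layout elements are hypothetical names chosen for illustration, not part of the current schema, and the fragment is simplified (no namespaces).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the step-level ID and the STEPREF attribute are
# illustrative assumptions, not existing schema features.
FRAGMENT = """
<alto>
  <Description>
    <OCRProcessing>
      <ocrProcessingStep ID="step.seg">
        <processingSoftware><softwareName>tesseract</softwareName></processingSoftware>
      </ocrProcessingStep>
      <ocrProcessingStep ID="step.rec">
        <processingSoftware><softwareName>ocropus</softwareName></processingSoftware>
      </ocrProcessingStep>
    </OCRProcessing>
  </Description>
  <Layout>
    <TextLine ID="line1" STEPREF="step.seg">
      <String CONTENT="Provenance" STEPREF="step.rec"/>
    </TextLine>
  </Layout>
</alto>
"""


def software_by_step(root):
    """Map each processing step's ID to the software name it declares."""
    return {
        step.get("ID"): step.findtext("processingSoftware/softwareName")
        for step in root.iter("ocrProcessingStep")
    }


def provenance(root):
    """Resolve every element carrying a STEPREF to the software behind it."""
    steps = software_by_step(root)
    return {
        el.get("ID") or el.get("CONTENT"): steps[el.get("STEPREF")]
        for el in root.iter()
        if el.get("STEPREF")
    }


root = ET.fromstring(FRAGMENT)
result = provenance(root)
print(result)
```

With references like these, the TextLine resolves to the segmentation step and its String to the recognition step, which is exactly the distinction the current statement cannot express.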
A particular use case is postprocessing: when a spell checker adds variants to String tags (something we'd also like to see supported), it may be unclear whether a given variant was produced by the recognition engine itself or by the spell checker.
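As a sketch of what per-variant provenance could look like, the snippet below groups the variants of one String by the step that produced them. The `Variant` element, the `STEPREF` attribute, and the step identifiers are all hypothetical and only serve to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <Variant> and STEPREF are assumed names, used here
# only to show variants tagged with the step that produced them.
STRING = """
<String CONTENT="recieve" STEPREF="step.rec">
  <Variant STEPREF="step.rec">receive</Variant>
  <Variant STEPREF="step.spell">receive</Variant>
  <Variant STEPREF="step.spell">relieve</Variant>
</String>
"""


def variants_by_step(string_el):
    """Group variant texts by the processing step they reference."""
    grouped = {}
    for variant in string_el.iter("Variant"):
        grouped.setdefault(variant.get("STEPREF"), []).append(variant.text)
    return grouped


grouped = variants_by_step(ET.fromstring(STRING))
print(grouped)
```

A consumer could then keep only the recognition engine's own variants, or only the spell checker's suggestions, depending on what it trusts.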