hocr-pdf: Use lxml.etree, iterate ocr_line > ocr_word #19

kba · 2016-09-14T11:29:57Z

This should not change existing behavior, but it additionally allows processing ocr_line elements that are not in the XHTML namespace and are not span elements.

In the long run, integrating the code from jbarlow83/OCRmyPDF would be useful. Making explicit the assumptions hocr-pdf makes about its hOCR input would help too.
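For illustration, here is a minimal sketch of line-then-word iteration that does not depend on the XHTML namespace or on the element being a span. It is not the code from this pull request. The helper names (`has_class`, `iter_lines_and_words`), the sample documents, and the choice to accept both `ocrx_word` and `ocr_word` as word classes are assumptions made for the example. It falls back to the standard library parser only so the sketch runs where lxml is not installed.

```python
try:
    from lxml import etree  # the parser this pull request switches to
except ImportError:
    # Fallback for the sketch only; the calls used below exist in both modules.
    import xml.etree.ElementTree as etree

# Hypothetical input: XHTML-namespaced, with one span line and one non-span line.
XHTML_SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 100 100">
  <span class="ocr_line" title="bbox 10 10 90 20">
    <span class="ocrx_word" title="bbox 10 10 40 20">Hello</span>
    <span class="ocrx_word" title="bbox 50 10 90 20">world</span>
  </span>
  <p class="ocr_line" title="bbox 10 30 90 40">
    <span class="ocrx_word" title="bbox 10 30 90 40">again</span>
  </p>
</div></body></html>"""

# Hypothetical input: no namespace at all, line is not a span.
PLAIN_SAMPLE = b"""<html><body>
<p class="ocr_line" title="bbox 1 2 3 4">
  <span class="ocrx_word" title="bbox 1 2 3 4">plain</span>
</p></body></html>"""


def has_class(element, name):
    """True if `name` is one of the element's space-separated classes."""
    return name in (element.get("class") or "").split()


def iter_lines_and_words(root):
    """Yield (line title, [word texts]) for every ocr_line under root.

    Elements are matched by class only, so the tag name and the
    namespace of the line element do not matter.
    """
    for line in root.iter():
        if not has_class(line, "ocr_line"):
            continue
        words = [
            "".join(word.itertext()).strip()
            for word in line.iter()
            # Assumption: accept either spelling of the word class.
            if has_class(word, "ocrx_word") or has_class(word, "ocr_word")
        ]
        yield line.get("title"), words


xhtml_lines = list(iter_lines_and_words(etree.fromstring(XHTML_SAMPLE)))
plain_lines = list(iter_lines_and_words(etree.fromstring(PLAIN_SAMPLE)))
```

Matching on the class attribute alone is one way to make the "ocr_line can be any element" assumption explicit. A real converter would also parse the `bbox` values from each `title` to position the text.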

Use lxml.etree, iterate ocr_line > ocr_word

64f3399

kba mentioned this pull request Sep 14, 2016

Ported to python dinosauria123/gcv2hocr#3

Merged

zuphilip mentioned this pull request Sep 14, 2016

Use lxml.etree, iterate ocr_line > ocr_word ocropus/hocr-tools#57

Merged

zuphilip closed this Sep 14, 2016
