PAGE XML contains TextEquiv with empty Unicode #1

Closed
stweil opened this issue Aug 6, 2022 · 2 comments

stweil commented Aug 6, 2022

The PAGE XML files contain many text regions whose TextEquiv has no text, and a few text lines with the same problem:

# Text regions without text:
% git grep "^                <Unicode></Unicode>" | wc -l
   17097
# Text lines without text:
% git grep "^                    <Unicode></Unicode>" | wc -l
       6
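
The grep counts above depend on the indentation of the files. As a rough alternative, the same count can be sketched by parsing the XML and looking at which element owns each empty Unicode. This is only an illustration: `count_empty_text_equivs`, `local_name` and the `SAMPLE` document are hypothetical names and data, not part of the repository or of any tool mentioned here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical minimal PAGE XML snippet, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode></Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode></Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def count_empty_text_equivs(xml_text):
    """Count empty TextEquiv/Unicode elements, grouped by the owning element."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for parent in root.iter():
        for equiv in parent:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            # Look for the Unicode child of this TextEquiv.
            unicode_el = next(
                (c for c in equiv if local_name(c.tag) == "Unicode"), None
            )
            # Missing text or whitespace-only text both count as empty.
            if unicode_el is not None and not (unicode_el.text or "").strip():
                counts[local_name(parent.tag)] += 1
    return counts


if __name__ == "__main__":
    for owner, n in sorted(count_empty_text_equivs(SAMPLE).items()):
        print(owner, n)
```

Matching on the local tag name is a deliberate simplification so that the sketch does not depend on a particular PAGE namespace version.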

When a region has text in its lines but none in the region-level TextEquiv, that text is lost when the PAGE XML file is converted to plain text using ocr-transform.
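
One possible workaround, sketched here under assumptions, is to fall back to the line-level text whenever the region-level Unicode is empty. This is not how ocr-transform works and not a proposed patch for it. The functions `direct_text`, `region_text` and `page_to_text` and the `SAMPLE` document are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal PAGE XML snippet, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>world</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode></Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="l3"><TextEquiv><Unicode>Region</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>Region text</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def direct_text(element):
    """Return the Unicode text of the element's own TextEquiv child, or ''."""
    for child in element:
        if local_name(child.tag) == "TextEquiv":
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    return sub.text or ""
    return ""


def region_text(region):
    """Prefer the region-level text; otherwise join the non-empty line texts."""
    own = direct_text(region)
    if own.strip():
        return own
    lines = [direct_text(c) for c in region if local_name(c.tag) == "TextLine"]
    return "\n".join(t for t in lines if t.strip())


def page_to_text(xml_text):
    """Return one text string per TextRegion, in document order."""
    root = ET.fromstring(xml_text)
    regions = [el for el in root.iter() if local_name(el.tag) == "TextRegion"]
    return [region_text(r) for r in regions]


if __name__ == "__main__":
    print("\n\n".join(page_to_text(SAMPLE)))
```

The sketch keeps the region-level text whenever it exists, so files that are already complete are not changed. It ignores reading order and treats nested regions as separate entries.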

stweil commented Aug 6, 2022

https://github.com/UB-Mannheim/Fibeln also has 11 files which contain text regions without text. Those files were also created using Transkribus, which suggests that this might be a general problem of that software.

tsmdt commented Sep 8, 2022
