page__text.xsl is not honoring the reading order #138
Comments
The XSL which is used for that conversion is too simple to handle more complex PAGE XML. I don't know whether a better XSL is available from other projects. Did you try whether
I would consider this a serious bug, not an enhancement.
It's both. PAGE XML is complex, so I would never expect a perfect tool which supports all of its features.
This is not an imperfection from leaving some features unsupported. Not honoring the reading order produces a wrong result for a lot of real-world PAGE XML files.
The texts in the XML file also look strange when I look at them with
The file in #138 (comment) was created (by an SBB contractor) using Aletheia and uses their encoding scheme, which relies on a lot of PUA characters and is based in part on MUFI (see https://www.primaresearch.org/www/assets/tools/Special%20Characters%20in%20Aletheia.pdf). So it's UTF-8, but with private characters. But encoding is an entirely different beast :-) (dinglehopper-extract gives different characters due to normalization, but that's not the issue here.)
Sorry, I did not see this earlier, but I had the exact same use case. It's not so difficult to properly handle PAGE reading order in XSLT 1.0. This was solved along with #151. You can even pass XSLT parameters for which hierarchy level you want to extract from (default is highest) or which separators to use for concatenation; see ocr-fileformat/xslt/page__text.xsl, lines 14 to 21 in 3e32ef6.
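For readers who want to see what "honoring the reading order" means in practice, here is a minimal Python sketch. It is not the XSLT from #151 and makes several assumptions: the function name `page_text_in_reading_order` and the `separator` parameter are hypothetical, the sample uses the 2019-07-15 PAGE namespace, the reading order is a single flat `OrderedGroup` of `RegionRefIndexed` entries, and text is taken at region level only. Nested or unordered groups are not handled.

```python
import xml.etree.ElementTree as ET


def page_text_in_reading_order(page_xml: str, separator: str = "\n") -> str:
    """Concatenate region-level text following ReadingOrder (hypothetical helper).

    Assumes one flat OrderedGroup; regions missing from it are appended
    afterwards in document order. Requires Python 3.8+ for the {*} wildcard.
    """
    root = ET.fromstring(page_xml)

    # Map each TextRegion id to its region-level text (direct TextEquiv child).
    texts = {}
    for region in root.findall(".//{*}TextRegion"):
        unicode_el = region.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None else None
        texts[region.get("id")] = text or ""

    # Collect region ids from the reading order, sorted by their index attribute.
    ordered_ids = []
    group = root.find(".//{*}ReadingOrder/{*}OrderedGroup")
    if group is not None:
        refs = group.findall("{*}RegionRefIndexed")
        refs.sort(key=lambda ref: int(ref.get("index")))
        ordered_ids = [ref.get("regionRef") for ref in refs]

    # Regions not referenced by the reading order keep their document order.
    seen = set(ordered_ids)
    remaining = [rid for rid in texts if rid not in seen]

    return separator.join(texts[rid] for rid in ordered_ids + remaining if rid in texts)


# Tiny made-up sample: document order is r_b, r_a, r_c; reading order says r_a, r_b.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro">
        <RegionRefIndexed index="1" regionRef="r_b"/>
        <RegionRefIndexed index="0" regionRef="r_a"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r_b"><TextEquiv><Unicode>second</Unicode></TextEquiv></TextRegion>
    <TextRegion id="r_a"><TextEquiv><Unicode>first</Unicode></TextEquiv></TextRegion>
    <TextRegion id="r_c"><TextEquiv><Unicode>unlisted</Unicode></TextEquiv></TextRegion>
  </Page>
</PcGts>"""

print(page_text_in_reading_order(SAMPLE))
```

A converter that simply walks the regions in document order would emit "second" before "first" on this sample, which is the kind of wrong result reported in this issue.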
Probably fixed in #151
page__text.xsl is not honoring the reading order in the PAGE-XML (pc:ReadingOrder), which gives completely false results. For this page, I get this text (shortened):

For comparison, dinglehopper-extract gives the correct text:

Image from the ZIP (converted to JPEG), for easier understanding: