TextBlocks and paragraphs #53
Comments
Yes, my understanding is that this would be fine, |
That is correct. Note that because ALTO is agnostic towards the semantic structure of a page, it does not really attempt to encode paragraphs, which are rather logical units. |
The main role of an OCR encoding is to make it possible to render the page as text. Now, paragraphs are certainly logical units, since they encode a semantic break. But they also have a "physical" role, since they indicate a "hard" newline, unlike the "soft" newlines at the end of textlines. Knowledge of this "hard" newline is critical to correctly rendering text. I don't mind calling textblocks "text blocks", rather than paragraphs, but I do think there should be a general agreement that a textblock is not simply "a block of text", as per the current definition, but rather "a block of text, separated from other blocks of text by a hard newline". |
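To make the rendering concern concrete, here is a minimal Python sketch of what a renderer would do if every TextBlock boundary were treated as a hard newline. That convention is an assumption for illustration, not something the ALTO schema states. The sample document, its IDs, and the `render` helper are hypothetical, and the sample omits attributes (such as coordinates) that a schema-valid ALTO file would carry.

```python
import xml.etree.ElementTree as ET

# ALTO v4 namespace; adjust for other schema versions.
ALTO = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"a": ALTO}

# Hypothetical, trimmed sample (not schema-valid: geometry attributes omitted).
SAMPLE = f"""<alto xmlns="{ALTO}">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine><String CONTENT="First"/><SP/><String CONTENT="line"/></TextLine>
          <TextLine><String CONTENT="continues"/><SP/><String CONTENT="here."/></TextLine>
        </TextBlock>
        <TextBlock ID="TB2">
          <TextLine><String CONTENT="Second"/><SP/><String CONTENT="paragraph."/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def render(alto_xml: str) -> str:
    """Render ALTO as text: soft newlines become spaces, block ends become newlines."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iter(f"{{{ALTO}}}TextBlock"):
        lines = []
        for line in block.findall("a:TextLine", NS):
            words = [s.get("CONTENT", "") for s in line.findall("a:String", NS)]
            lines.append(" ".join(words))
        # End of a TextLine is treated as a soft newline (joined with a space).
        blocks.append(" ".join(lines))
    # ASSUMPTION: end of a TextBlock is treated as a hard newline.
    return "\n".join(blocks)


print(render(SAMPLE))
```

Under this assumption the two lines of TB1 are merged and TB2 starts on a new line. If a paragraph is split across two TextBlocks (for example across columns), this renderer inserts a hard newline in the middle of it, which is exactly the ambiguity under discussion.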
Hi Assaf,
at first glance your statement seems OK, but looking at the details, there may be other reasons for separating one textblock from another, e.g. formatting, shape information, or other references made at this element level.
That's why the textblock is just a textblock and does not say anything about its relation to the surrounding elements.
As outlined by Clemens, this is done e.g. in the METS structure, where multiple textblocks can be bound together as paragraphs in a logical structure by the structMap. By the way, paragraphs can run across columns and even pages; because of this, it is ultimately not possible to describe them at the page description level.
In case of further questions do not hesitate to get back to us for samples or references of use-cases.
Regards,
Jo
Joachim Bauer
Senior System Engineer
CCS Content Conversion Specialists
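As an illustration of the METS approach described above, the following Python sketch reads a logical structMap in which each paragraph div points at one or more ALTO TextBlocks. The `mets:structMap`, `mets:div`, `mets:fptr`, `mets:seq` and `mets:area` elements and the `FILEID`, `BETYPE` and `BEGIN` attributes come from the METS schema. The `TYPE` values, file IDs, block IDs, and the `paragraphs` helper are hypothetical, and real profiles may organise this differently.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"

# Hypothetical structMap: PAR1 spans the last block of one ALTO file and
# the first block of the next, i.e. a paragraph crossing a page break.
STRUCT_MAP = f"""<mets:structMap xmlns:mets="{METS}" TYPE="LOGICAL">
  <mets:div TYPE="ARTICLE">
    <mets:div TYPE="PARAGRAPH" ID="PAR1">
      <mets:fptr>
        <mets:seq>
          <mets:area FILEID="ALTO0001" BETYPE="IDREF" BEGIN="P1_TB7"/>
          <mets:area FILEID="ALTO0002" BETYPE="IDREF" BEGIN="P2_TB1"/>
        </mets:seq>
      </mets:fptr>
    </mets:div>
    <mets:div TYPE="PARAGRAPH" ID="PAR2">
      <mets:fptr>
        <mets:area FILEID="ALTO0002" BETYPE="IDREF" BEGIN="P2_TB2"/>
      </mets:fptr>
    </mets:div>
  </mets:div>
</mets:structMap>"""


def paragraphs(struct_map_xml: str) -> dict:
    """Map each paragraph div ID to the ordered (file ID, block ID) pairs it covers."""
    root = ET.fromstring(struct_map_xml)
    result = {}
    for div in root.iter(f"{{{METS}}}div"):
        # ASSUMPTION: paragraphs are marked with TYPE="PARAGRAPH" in this profile.
        if div.get("TYPE") != "PARAGRAPH":
            continue
        result[div.get("ID")] = [
            (area.get("FILEID"), area.get("BEGIN"))
            for area in div.iter(f"{{{METS}}}area")
        ]
    return result


for par_id, blocks in paragraphs(STRUCT_MAP).items():
    print(par_id, blocks)
```

A renderer working from such a map would join the blocks listed under one paragraph with a soft break and emit a hard newline only between paragraph divs.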
On 19.10.2018 at 21:17, Assaf Urieli <notifications@github.com> wrote:
The main role of an OCR encoding is to make it possible to render the page as text. Now, paragraphs are certainly logical units, since they encode a semantic break. But they also have a "physical" role, since they indicate a "hard" newline, unlike the "soft" newlines at the end of textlines. Knowledge of this "hard" newline is critical to correctly rendering text.
I don't mind calling textblocks "text blocks", rather than paragraphs, but I do think there should be a general agreement that a textblock is not simply "a block of text", as per the current definition, but rather "a block of text, separated from other blocks of text by a hard newline".
I see your point about paragraphs crossing columns and pages, which means a single paragraph will necessarily be broken into multiple text blocks. Furthermore, if I understand correctly, Alto has no standard way of encoding the difference between a hard newline (at the end of a paragraph) and a soft newline (at the end of a text block forming part of a paragraph spanning several columns or pages). This is unfortunate, since we do have other indicators of the semantic structure, such as It seems a shame to have to rely on two meta-layers (METS + ALTO) to indicate the hard vs soft newline, which complicates things considerably. |
Hi Assaf, You have to keep in mind that ALTO has been designed to describe OCR output at page level, and not to encode the logical structure of a document. Consequently, we can't define an ALTO textblock as "a block of text, separated from other blocks of text by a hard newline", because OCR engines don't recognize hard newlines (they only try to). They detect lines and aggregates of lines, which sometimes turn out to be logical paragraphs. Best, |
In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested. |
The Alto4 xsd defines a TextBlock simply as "a block of text".
Take a document such as: https://archive.org/details/newyorktimes00unse/page/n1
Is it accepted usage to have one TextBlock per paragraph, and one ComposedBlock for each column, containing the paragraph TextBlocks?
If not, how are paragraphs marked?