TextBlocks and paragraphs #53

Closed
urieli opened this issue Oct 5, 2018 · 7 comments

@urieli

urieli commented Oct 5, 2018

The Alto4 xsd defines a TextBlock simply as "a block of text".

Take a document such as: https://archive.org/details/newyorktimes00unse/page/n1
Is it accepted usage to have one TextBlock per paragraph, and one ComposedBlock for each column, containing the paragraph TextBlocks?

If not, how are paragraphs marked?

@artunit
Member

artunit commented Oct 11, 2018

Yes, my understanding is that this would be fine: a ComposedBlock can be any grouping of TextBlocks. I believe ComposedBlock is most commonly used for identifying articles.

@cneud
Member

cneud commented Oct 19, 2018

That is correct. Note that ALTO is agnostic towards the semantic structure of a page, so it does not really attempt to encode paragraphs, which are logical rather than physical units.

@urieli
Author

urieli commented Oct 19, 2018

The main role of an OCR encoding is to make it possible to render the page as text. Now, paragraphs are certainly logical units, since they encode a semantic break. But they also have a "physical" role, since they indicate a "hard" newline, unlike the "soft" newlines at the end of textlines. Knowledge of this "hard" newline is critical to correctly rendering text.

I don't mind calling textblocks "text blocks", rather than paragraphs, but I do think there should be a general agreement that a textblock is not simply "a block of text", as per the current definition, but rather "a block of text, separated from other blocks of text by a hard newline".
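To make the rendering argument concrete, here is a minimal Python sketch of the convention proposed above: soft newlines inside a TextBlock are collapsed to spaces, and each TextBlock boundary is emitted as a hard newline. The sample fragment, the `render_text` helper, and the simplified markup (required attributes such as positions are omitted) are assumptions made for illustration. The ALTO schema itself does not guarantee that a TextBlock ends at a paragraph break.

```python
import xml.etree.ElementTree as ET

# ALTO v4 namespace; the fragment below is a simplified, hypothetical sample.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <ComposedBlock ID="col1">
      <TextBlock ID="p1">
        <TextLine><String CONTENT="First"/><SP/><String CONTENT="line"/></TextLine>
        <TextLine><String CONTENT="second"/><SP/><String CONTENT="line."/></TextLine>
      </TextBlock>
      <TextBlock ID="p2">
        <TextLine><String CONTENT="Next"/><SP/><String CONTENT="paragraph."/></TextLine>
      </TextBlock>
    </ComposedBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def render_text(alto_xml: str) -> str:
    """Render ALTO as text, ASSUMING every TextBlock ends with a hard newline."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iterfind(".//a:TextBlock", ALTO_NS):
        lines = []
        for line in block.iterfind("a:TextLine", ALTO_NS):
            words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)]
            lines.append(" ".join(words))
        # Soft newline: the end of a TextLine becomes a single space.
        blocks.append(" ".join(lines))
    # Hard newline: one per TextBlock boundary (the assumption under discussion).
    return "\n".join(blocks)


print(render_text(SAMPLE))
```

Under this assumption the two TextLines of the first block are joined into one line of output, and only the block boundary produces a line break.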

@Jo-CCS
Member

Jo-CCS commented Oct 22, 2018 via email

@urieli
Author

urieli commented Oct 29, 2018

I see your point about paragraphs crossing columns and pages, which means a single paragraph will necessarily be broken into multiple text blocks.
Thus, it is obvious that a hard newline is not a necessary condition for marking the end of a text block.
The question is: is a hard newline a sufficient condition for marking the end of a text block?
This would be helpful for me, as it would make it clear exactly where text blocks should end.

Furthermore, if I understand correctly, Alto has no standard way of encoding the difference between a hard newline (at the end of a paragraph) and a soft newline (at the end of a text block forming part of a paragraph spanning several columns or pages). This is unfortunate, since we do have other indicators of the semantic structure, such as IDNEXT.

It seems a shame to have to rely on two meta-layers (METS + ALTO) to indicate the hard vs soft newline, which complicates things considerably.
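As a rough illustration of what a single-layer approach might look like, the Python sketch below merges TextBlocks that are chained through the IDNEXT attribute into one paragraph each. Treating an IDNEXT link as "same paragraph continues" is an assumption of this sketch, not an agreed ALTO convention, and the sample fragment and the `merge_paragraphs` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical fragment: block b1 at the foot of column 1 continues in b2.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <ComposedBlock ID="col1">
      <TextBlock ID="b1" IDNEXT="b2">
        <TextLine><String CONTENT="A"/><String CONTENT="paragraph"/><String CONTENT="that"/></TextLine>
      </TextBlock>
    </ComposedBlock>
    <ComposedBlock ID="col2">
      <TextBlock ID="b2">
        <TextLine><String CONTENT="continues"/><String CONTENT="here."/></TextLine>
      </TextBlock>
      <TextBlock ID="b3">
        <TextLine><String CONTENT="A"/><String CONTENT="new"/><String CONTENT="one."/></TextLine>
      </TextBlock>
    </ComposedBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def block_text(block):
    """Join all String contents of a block with spaces (soft newlines dropped)."""
    return " ".join(s.get("CONTENT", "") for s in block.iterfind(".//a:String", NS))


def merge_paragraphs(alto_xml):
    """Return one string per paragraph, ASSUMING IDNEXT links a paragraph's parts."""
    root = ET.fromstring(alto_xml)
    blocks = list(root.iterfind(".//a:TextBlock", NS))
    by_id = {b.get("ID"): b for b in blocks if b.get("ID")}
    # Blocks that some other block points to are continuations, not paragraph starts.
    continuation_ids = {b.get("IDNEXT") for b in blocks if b.get("IDNEXT")}

    paragraphs = []
    for block in blocks:
        if block.get("ID") in continuation_ids:
            continue  # already covered by following the chain from its predecessor
        parts, seen, current = [], set(), block
        while current is not None and id(current) not in seen:
            seen.add(id(current))  # guards against a malformed cyclic chain
            parts.append(block_text(current))
            next_id = current.get("IDNEXT")
            current = by_id.get(next_id) if next_id else None
        paragraphs.append(" ".join(parts))
    return paragraphs


for paragraph in merge_paragraphs(SAMPLE):
    print(paragraph)
```

With such a convention, a block boundary without an IDNEXT link would read as a hard newline and a linked boundary as a soft one, all within the ALTO file.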

@cneud cneud self-assigned this Nov 29, 2018
@jpmoreux
Member

Hi Assaf,

You have to keep in mind that ALTO has been designed to describe OCR output at page level, and not to encode the logical structure of a document.

Consequently, we can't define an ALTO textblock as "a block of text, separated from other blocks of text by a hard newline", because OCR engines don't recognize hard newlines (they only try to). They detect lines and aggregates of lines, which sometimes turn out to be logical paragraphs.

Best,

@artunit
Member

artunit commented Sep 28, 2019

In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested.

@artunit artunit closed this as completed Sep 28, 2019