ALTO output: Missing <SP> tags between <String> tags #78

Open
jbarth-ubhd opened this issue Dec 22, 2017 · 24 comments

@jbarth-ubhd

Perhaps this is not an error.
Kind regards,
J. Barth

@kba
Collaborator

kba commented Dec 22, 2017

Can you provide sample data and how you ran the tool?

@zuphilip
Member

zuphilip commented Dec 22, 2017

I guess you output the ALTO files directly from ABBYY, because we don't yet provide a transformation from ABBYY to ALTO. Then this should be an example: https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/alto/417576986_0031.xml. AFAIK the <SP> stands for space, and the file does validate in this form.

@jbarth-ubhd
Author

jbarth-ubhd commented Dec 22, 2017

Yes, I'll try to find out if <SP> (=space) is really necessary between <String>s in ALTO.

@zuphilip
Member

I guess that it still validates without the SP tags. Moreover, most of the information (HPOS, WIDTH) can be calculated from the line above and below, but if the width of a space is important for some application, then it might be easier to have this data directly. I don't know what the VPOS information for a space says or whether it is also determined by some other values.

@jbarth-ubhd
Author

jbarth-ubhd commented Dec 22, 2017

On ALTO 2.1 .xsd it looks like this:

  <xsd:sequence maxOccurs="unbounded">
    <xsd:element name="String" type="StringType"/>
    <xsd:element name="SP" minOccurs="0"> ...
    </xsd:element>
  </xsd:sequence>

So strictly speaking it seems that <SP> is not necessary, but the <sequence> seems to imply it.

@zuphilip
Member

but the <sequence> seems to imply it.

Not sure. I only see here, that, if <SP> occurs, then it has to occur after a <String>.
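The reading above can be checked mechanically. The following is only a sketch of the content model as quoted from the .xsd fragment (a repeated `String` followed by an optional `SP`); it is not a schema validator, it ignores the hyphenation element mentioned in the ALTO documentation, and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix so the check works with or without one."""
    return tag.rsplit("}", 1)[-1]


def matches_quoted_sequence(text_line):
    """Return True if the children follow (String, SP?)* as in the quoted fragment."""
    # Encode each child as one letter: W for String, S for SP, ? for anything else.
    codes = "".join(
        {"String": "W", "SP": "S"}.get(local_name(child.tag), "?")
        for child in text_line
    )
    return re.fullmatch(r"(WS?)*", codes) is not None


without_sp = ET.fromstring(
    '<TextLine><String CONTENT="a"/><String CONTENT="b"/></TextLine>'
)
with_sp = ET.fromstring(
    '<TextLine><String CONTENT="a"/><SP/><String CONTENT="b"/></TextLine>'
)
leading_sp = ET.fromstring('<TextLine><SP/><String CONTENT="a"/></TextLine>')

for name, line in [("without SP", without_sp), ("with SP", with_sp), ("leading SP", leading_sp)]:
    print(name, matches_quoted_sequence(line))
```

Under this reading, a line with no `SP` at all is accepted, and only an `SP` that does not follow a `String` is rejected.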

@stweil
Member

stweil commented Nov 22, 2018

Here is an ALTO file generated with Tesseract (see tesseract-ocr/tesseract#2067). Another page was processed by ABBYY Finereader.

While ABBYY adds the <SP> tags, Tesseract (and ocr-fileformat) does not. As the <String> tags contain the surrounding box positions, and the distance between two text boxes can be calculated without additional information, that looks sufficient at first glance. But without the <SP> the DFG viewer does not separate the words!

I am not sure whether this is a bug of the DFG viewer (and Kitodo Presentation) or whether ALTO requires explicit tags for the whitespace between words. Perhaps @sebastian-meyer or @cneud know the answer?

@stweil
Member

stweil commented Nov 22, 2018

The ALTO documentation says "A TextBlock is divided into lines and those are divided into strings, spaces and hyphens". I don't read that as a strict requirement for spaces, and neither does the .xsd. It's clear that spaces are required if the strings are given without HPOS and WIDTH attributes, but I think they are redundant if those attributes are available.
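To illustrate the redundancy argument, here is a minimal sketch that derives the horizontal gap between neighbouring `String` boxes from `HPOS` and `WIDTH` alone. The sample line and its coordinates are invented, and `word_gaps` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def word_gaps(text_line):
    """Return the horizontal distance between each pair of adjacent String boxes."""
    strings = [child for child in text_line if local_name(child.tag) == "String"]
    gaps = []
    for left, right in zip(strings, strings[1:]):
        left_end = float(left.get("HPOS")) + float(left.get("WIDTH"))
        gaps.append(float(right.get("HPOS")) - left_end)
    return gaps


# Invented sample data: the third box starts before the second one ends.
line = ET.fromstring(
    "<TextLine>"
    '<String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>'
    '<String CONTENT="SP" HPOS="192" VPOS="50" WIDTH="25" HEIGHT="20"/>'
    '<String CONTENT="tags" HPOS="210" VPOS="50" WIDTH="40" HEIGHT="20"/>'
    "</TextLine>"
)
print(word_gaps(line))  # [12.0, -7.0]
```

A negative value simply means the two boxes overlap horizontally, which is information an explicit space element would have to express in some other way.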

@amitdo

amitdo commented Nov 23, 2018

The ALTO spec itself needs to clarify this issue.

@stweil
Member

stweil commented Nov 23, 2018

Clemens has created an issue for that: altoxml/schema#54 (thank you).

@cneud
Contributor

cneud commented Nov 23, 2018

Thanks for flagging this, I will put it on the agenda for our next ALTO board call which will be held November 29th.

@mittagessen

mittagessen commented Nov 24, 2018

To chip in, I've interpreted the standard to mean that the <SP><String> alternation is mandatory (sequence definition of <TextLine> contents) and that whitespace should never occur inside a <String>, and this is how I implemented it.

@stweil
Member

stweil commented Nov 24, 2018

If a <String> never contains whitespace, then <SP> is completely redundant. Does ALTO allow overlapping words in a row? If yes, does that require a separating space with negative width? :-)

@mittagessen

If a <String> never contains whitespace, then <SP> is completely redundant.

Why? Whitespace is a character like any other, and personally I would have chosen to encode it explicitly using <String> if the standard didn't heavily imply that you shouldn't do that. Of course, you can throw away the data and let people compute inter-word spacing implicitly from the word bounding boxes, but it isn't as if tesseract, kraken or any other sequence-classification-based OCR engine doesn't output a label for whitespace (and the boundaries of that activation can almost certainly differ from the boundaries of the activations of the adjacent letters). I'd rather not throw away metadata that some weird subdiscipline in the humanities that only the 8 people participating in it have ever heard about might need.

Does ALTO allow overlapping words in a row?

ALTO luckily allows overlapping elements, in contrast to PageXML.

@stweil
Member

stweil commented Nov 24, 2018

Then how would you encode two overlapping words if you are forced to put a <SP> between them?

@mittagessen

mittagessen commented Nov 24, 2018

Just have overlapping bounding boxes? Presumably there is still a reading order that determines the ordering of the <String> tags. But yeah, it helps that I decided a long time ago that words are a waaay too squishy concept and arbitrarily defined that anything bounded by whitespace is a separate word/segment for serialization purposes (not only for ALTO). Of course, if you want to encode a proper tokenization, this data model shouldn't be used. On the other hand, I'm of the firm conviction that starting to do that in a raw OCR serialization format is only going to lead to madness.

@cneud
Contributor

cneud commented Dec 13, 2018

Just to follow up - I'm afraid a quick resolution is not really around the corner... The issue was discussed in the last ALTO board call, with the core elements of the discussion summarized here.

While the general feeling was that the use of <SP> is not mandatory, some more research into ALTO's history is required to determine the original authors' exact intentions.

An expansion of the <SP> tag with a width attribute has been identified among board members as a possibility to create more useful future applications for the <SP> tag.

If one really wants to be on the safe side, the quick solution right now would be to indeed include <SP> in the output of any ALTO export implementation as it is also straightforward to remove in post-processing.
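A sketch of what "include it now, remove it later" could look like in post-processing. The helpers `add_sp` and `strip_sp` are hypothetical and do not reflect how Tesseract, ABBYY or ocr-fileformat work. The sketch assumes every `String` carries `HPOS`, `VPOS` and `WIDTH`; copying `VPOS` from the preceding `String` and clamping the width at zero for overlapping boxes are assumptions made here, not something the ALTO board has stated.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def add_sp(text_line):
    """Insert an SP element between every pair of directly adjacent String elements."""
    children = list(text_line)
    for child in children:
        text_line.remove(child)
    for child, following in zip(children, children[1:] + [None]):
        text_line.append(child)
        if following is None:
            continue
        if local_name(child.tag) == "String" and local_name(following.tag) == "String":
            left_end = float(child.get("HPOS")) + float(child.get("WIDTH"))
            # Assumption: overlapping boxes get a zero-width space, not a negative one.
            gap = max(0.0, float(following.get("HPOS")) - left_end)
            namespace = child.tag[: -len("String")]  # keeps a '{...}' prefix if present
            sp = ET.SubElement(text_line, namespace + "SP")
            sp.set("HPOS", str(left_end))
            sp.set("VPOS", child.get("VPOS"))  # assumption: reuse the left box's VPOS
            sp.set("WIDTH", str(gap))


def strip_sp(text_line):
    """Remove every SP element again, leaving the String elements untouched."""
    for child in list(text_line):
        if local_name(child.tag) == "SP":
            text_line.remove(child)


line = ET.fromstring(
    "<TextLine>"
    '<String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80"/>'
    '<String CONTENT="tags" HPOS="192" VPOS="50" WIDTH="40"/>'
    "</TextLine>"
)
add_sp(line)
print(ET.tostring(line, encoding="unicode"))
strip_sp(line)
print(ET.tostring(line, encoding="unicode"))
```

Because `add_sp` only acts where two `String` elements are direct neighbours, running it on a line that already has `SP` tags leaves that line unchanged.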

@mittagessen

ALTO's history is required to determine the original authors' exact intentions.

As a note, most of the character-based classification systems common at the time ALTO was originally specified didn't treat whitespace as a proper glyph, i.e. whitespace is just something bordered by other glyphs and is never seen by the classifier as such. This at least explains the existence of a separate <SP> tag.

@stweil
Member

stweil commented Dec 13, 2018

Thank you, @cneud, @mittagessen and the ALTO board.

As the current DFG viewer expects the <SP> tags, I think that programs like ocr-transform should produce them, too. Pull request tesseract-ocr/tesseract#2117 adds the tags to Tesseract's new ALTO output, so that output is now compatible with the DFG viewer.

@zuphilip
Member

zuphilip commented Dec 30, 2019

The addition of the <SP> should be handled upstream in the corresponding transformation. Currently, we use hocr2alto and page2alto. We can keep this issue here open as a reminder.

zuphilip changed the title from "ALTO output from Abbyy11r8 contains <SP> between <String>" to "ALTO output: Missing <SP> tags between <String> tags" on Dec 30, 2019

@filak
Contributor

filak commented Jan 2, 2020

According to the ALTO XSD, the SP tag is optional: minOccurs="0".

And I do not see a way to reliably calculate the HEIGHT/WIDTH/VPOS/HPOS attributes for the SP tag from the hOCR data.

IMHO, proper handling of the optional SP tag should be fixed in the DFG viewer.

@albig

albig commented Jan 3, 2020

If the <SP> is not mandatory, we have to "ignore" it in the styles of the fulltext view and always insert a space after a <String>.

This is what I've done in the DFG-Viewer styles now. Please have a look at the current master of the DFG-Viewer at test.dfg-viewer.de.

Please compare the example from above in current master and in version 5.0 of DFG-Viewer and report change requests.
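The idea of ignoring `SP` and always separating words can be sketched outside the viewer as well. The snippet below is not the DFG-Viewer stylesheet; it is a hypothetical `line_text` helper that joins the `CONTENT` of the `String` elements with single spaces, so that files with and without `SP` render the same text.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def line_text(text_line):
    """Join the CONTENT of all String children with one space, skipping SP entirely."""
    words = [
        child.get("CONTENT", "")
        for child in text_line
        if local_name(child.tag) == "String"
    ]
    return " ".join(words)


with_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/><SP/><String CONTENT="tags"/></TextLine>'
)
without_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/><String CONTENT="tags"/></TextLine>'
)
print(line_text(with_sp) == line_text(without_sp))
```

The trade-off is that any spacing information carried by an `SP` element is discarded, which is acceptable for a plain fulltext view but not for applications that need the width of a space.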

@filak
Contributor

filak commented Jan 3, 2020

@albig IMHO the second one seems better from user perspective - it is more readable/compact.

@sebastian-meyer

@albig IMHO the spacing looks better now (in master), but the linebreaks seem a bit random...
