Multiple Shape elements in one TextLine #42

Closed
jlerouge opened this issue Sep 9, 2016 · 9 comments

@jlerouge

jlerouge commented Sep 9, 2016

Hello,

Consider this change between ALTO v3.0 and v3.1:

    <xsd:element name="TextLine" maxOccurs="unbounded">
        <xsd:annotation>
            <xsd:documentation>A single line of text.</xsd:documentation>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:sequence>
                <xsd:sequence maxOccurs="unbounded">
+                   <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
                    <xsd:element name="String" type="StringType"/>
                    <xsd:element name="SP" type="SPType" minOccurs="0"/>
                </xsd:sequence>
                (...)
            </xsd:sequence>
            (...)
        </xsd:complexType>
    </xsd:element>

I guess this relates to the following entry in the v3.1 changelog:

2. Added support for using different shapes for the elements String, TextLine, all PageSpaceType elements and on all BlockType elements.

I see a problem here: multiple Shape elements can be direct children of a TextLine. According to the schema, the following constructions are allowed:

Ex. 1: No Shape element in the TextLine ✅

<TextLine>
    <String />
    <SP />
    <String />
    <SP />
    <String />
</TextLine>

Ex. 2: One Shape element at the beginning of the TextLine ✅

<TextLine>
    <Shape />
    <String />
    <SP />
    <String />
    <SP />
    <String />
</TextLine>

Ex. 3: One Shape element before each String element of the TextLine ❗

<TextLine>
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
</TextLine>

In the third case, which Shape element should be taken as the shape of the line?
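A minimal sketch of how a consumer could detect this ambiguity, using only Python's standard `xml.etree.ElementTree`. It assumes un-namespaced elements, as in the snippets above (real ALTO files declare a namespace, so the tag names would need qualifying), and the helper name `direct_shape_count` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Ex. 3 from above: one Shape before each String.
EX3 = """
<TextLine>
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
    <SP />
    <Shape />
    <String />
</TextLine>
"""


def direct_shape_count(textline):
    """Count Shape elements that are direct children of a TextLine element."""
    # A bare tag name in findall() matches direct children only,
    # so a Shape nested inside a String would not be counted.
    return len(textline.findall("Shape"))


line = ET.fromstring(EX3)
count = direct_shape_count(line)
ambiguous = count > 1  # more than one candidate shape for the line
print(count, ambiguous)
```

With more than one direct Shape child, a reader of the file has no rule for choosing the line's shape, which is the problem described here.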

I suggest that TextLine can have at most one Shape child element, at the beginning of the sequence, like this:

    <xsd:element name="TextLine" maxOccurs="unbounded">
        <xsd:annotation>
            <xsd:documentation>A single line of text.</xsd:documentation>
        </xsd:annotation>
        <xsd:complexType>
            <xsd:sequence>
+               <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
                <xsd:sequence maxOccurs="unbounded">
                    <xsd:element name="String" type="StringType"/>
                    <xsd:element name="SP" type="SPType" minOccurs="0"/>
                </xsd:sequence>
                (...)
            </xsd:sequence>
            (...)
        </xsd:complexType>
    </xsd:element>
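
To make the difference between the two content models concrete, they can be approximated as regular expressions over the sequence of child element names. This is only a sketch under stated assumptions: the parts elided as `(...)` in the schema are ignored, elements are un-namespaced as in the examples above, and the names `V31_MODEL`, `PROPOSED_MODEL`, `child_tags` and `accepts` are made up for illustration. It is not a substitute for validating against the XSD.

```python
import re
import xml.etree.ElementTree as ET

# v3.1: the optional Shape sits inside the repeated (String, SP?) sequence.
V31_MODEL = re.compile(r"((Shape )?String (SP )?)+")
# Proposal: one optional Shape, then the repeated (String, SP?) sequence.
PROPOSED_MODEL = re.compile(r"(Shape )?(String (SP )?)+")

EX1 = "<TextLine><String/><SP/><String/><SP/><String/></TextLine>"
EX2 = "<TextLine><Shape/><String/><SP/><String/><SP/><String/></TextLine>"
EX3 = ("<TextLine><Shape/><String/><SP/><Shape/><String/><SP/>"
       "<Shape/><String/></TextLine>")


def child_tags(xml_text):
    """Return the direct child tag names as one string, each followed by a space."""
    root = ET.fromstring(xml_text)
    return "".join(child.tag + " " for child in root)


def accepts(model, xml_text):
    """True if the whole child sequence matches the simplified content model."""
    return model.fullmatch(child_tags(xml_text)) is not None


for name, example in (("Ex. 1", EX1), ("Ex. 2", EX2), ("Ex. 3", EX3)):
    print(name, accepts(V31_MODEL, example), accepts(PROPOSED_MODEL, example))
```

Under these simplified models, Ex. 1 and Ex. 2 are accepted by both, while Ex. 3 is accepted by the v3.1 model and rejected by the proposed one.
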
@Jo-CCS
Member

Jo-CCS commented Mar 2, 2017

Former CR issue #22 Allow shape element usage (IMPACT)

@Jo-CCS
Member

Jo-CCS commented Mar 2, 2017

Hi jlerouge,

thanks for your post, and sorry for the delayed response.
I have reviewed your post once more, as it was not clear to me the first time. After a short discussion I noted the point, and I think I now see the difference in understanding.
The Shape on the TextLine is not supposed to describe the area of the following String element. The String itself will be described by a Shape as a child of the String, like so:

<TextLine>
  <String>
     <Shape />
  </String>
  <String>
     <Shape />
  </String>
</TextLine>

In historical papers, something that belongs to the text line from a content point of view may be placed elsewhere (e.g. even written on the outer border).
So a text line might be built out of multiple Shape elements. This is correct in that a text line might consist of two separate areas, whereas a single polygon would force all areas to be connected.

I do not remember whether we discussed this scenario specifically, but I also do not see a critical problem with it.

This is the same as on "PageSpaceType", which also has multiple sub-elements (maxOccurs="unbounded").
For BlockType and StringType it is different, as the enclosing sequence occurs at most once (maxOccurs="1").

We will discuss this once more with the board. Perhaps we will extend the annotation to prevent misunderstanding but keep the possibility described above.

@jukervin
Member

jukervin commented Mar 3, 2017

I think this is a mistake on our part and it should be fixed as proposed above.

@Jo-CCS
Member

Jo-CCS commented Mar 3, 2017

Definitely, as stated above, it is inconsistent across the different types.
The need to describe multiple shapes is, I think, rare, but limiting it to one will break backwards compatibility and will again force a major version change.

@evelienket
Member

I agree, this issue is a bug in the schema. But because allowing multiple shapes for a text line does not match the described use cases/specifications, I think we can fix it in the next minor release.

This bug brings me to the question of creating test cases. Usually, if I find a bug, I create a test case that can be run in a regression test. Would it be worthwhile to create test cases for ALTO as well?

@cneud
Member

cneud commented Sep 22, 2017

@evelienket Excellent idea about the test cases - perhaps we can use Schematron for this.

@evelienket
Member

evelienket commented Oct 16, 2017

There are four levels where the Shape element can be used: PageSpaceType, BlockType, TextLine, StringType.
It seems that the problem described occurs on PageSpaceType and TextLine.

Proposed fix for PageSpace:

<xsd:complexType name="PageSpaceType">
    <xsd:annotation>
        <xsd:documentation>A region on a page</xsd:documentation>
    </xsd:annotation>
    <xsd:sequence>
        <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="BlockGroup"/>
        </xsd:sequence>
    </xsd:sequence>
    ...
</xsd:complexType>

Proposed fix for TextLine:

<xsd:element name="TextLine" maxOccurs="unbounded">
    <xsd:annotation>
        <xsd:documentation>A single line of text.</xsd:documentation>
    </xsd:annotation>
    <xsd:complexType>
        <xsd:sequence>
            <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
            <xsd:sequence maxOccurs="unbounded">
                <xsd:element name="String" type="StringType"/>
                <xsd:element name="SP" type="SPType" minOccurs="0"/>
            </xsd:sequence>
            ...
        </xsd:sequence>
        ...
    </xsd:complexType>
</xsd:element>

The next step is to add XML files with correct and incorrect Shape elements, and a new version of the XSD.
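
As a rough illustration of what such regression cases could look like, here is a sketch in Python (standard library only). It does not validate against the XSD. It hand-codes the intended TextLine rule (at most one direct Shape child, and only as the first child) and runs it over a few inline fixtures. The fixture names, the `CASES` table and the function names are hypothetical, and elements are un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET


def textline_shape_ok(xml_text):
    """True if the TextLine has no direct Shape child, or exactly one as first child."""
    children = [child.tag for child in ET.fromstring(xml_text)]
    positions = [i for i, tag in enumerate(children) if tag == "Shape"]
    return positions in ([], [0])


# Hypothetical fixtures: (document, expected result under the fixed schema).
CASES = {
    "no_shape": ("<TextLine><String/><SP/><String/></TextLine>", True),
    "shape_first": ("<TextLine><Shape/><String/><SP/><String/></TextLine>", True),
    "shape_per_string": (
        "<TextLine><Shape/><String/><SP/><Shape/><String/></TextLine>", False),
    "shape_not_first": ("<TextLine><String/><Shape/><String/></TextLine>", False),
}


def run_cases():
    """Return the names of cases whose result differs from the expectation."""
    return [name for name, (xml_text, expected) in CASES.items()
            if textline_shape_ok(xml_text) != expected]


failures = run_cases()
print("failures:", failures)
```

In a real test suite the fixtures would be separate XML files and the check would be a schema validation run against each XSD version. The hand-written rule here only stands in for that step.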

@Jo-CCS
Member

Jo-CCS commented Jan 22, 2018

The fix is included in version 4.0 (now in draft status for public review).

@cneud
Member

cneud commented Apr 24, 2018

Fixed in v4.0.
