Process Result tracking (IMPACT) #27

Closed
Jo-CCS opened this issue Sep 10, 2014 · 3 comments

Jo-CCS commented Sep 10, 2014

Champion: Clemens Neudecker
Submitter: Impact
Submitted: 2013-02
Status: discussion


submitted - initial status when proposal is submitted

discussion - proposal is being discussed within the board

review - xsd code is being reviewed

accepted - proposal is accepted

rejected - proposal is rejected

draft - accepted proposal is in public commenting period

published - proposal is published in a schema version

Backwards compatible ??
To ALTO version ?

Purpose
Many software tools and human interactions are involved in the different steps of the digitisation process. Each of them may affect an ALTO file by making refinements or corrections. From our point of view it would be desirable to keep track of the changes and verifications made by the different agents involved in the digitisation process. This would allow a simple kind of document history and also give important information about the trustworthiness of the whole document. If, for example, everything was verified by a service provider, then we can assume that the quality of the document is very high. Storing the old values as well as the new ones would increase the file size tremendously.

Correction and Validation are possible outcomes of the same process.

Implementation
The ALTO schema already defines a <postProcessingStep> element. The intention of this element is to record details about the process steps that were carried out after the creation of the full text. The element is optional and not part of the actual page’s definition in ALTO.

In order to store information about the correction and verification process for individual text lines, words etc., the following elements are added to the <postProcessingStep> section:

• <processingStepType> stores the type of process step. It is a free text field, though IMPACT internal constraints require the element’s value to be set to “correction”.
• <processingResult> groups all elements regarding the result of the process. The element is repeatable; each <processingResult> represents a specific outcome of the process, recorded in its value attribute. This attribute may only contain two values: “corrected” or “verified”.
• <processedElements> wraps all elements that were processed with the result stated in the parent <processingResult> element’s value attribute.
• <pe> contains the ID value of an individual text line or word element. Elements that were not processed are not listed within <processedElements>.

Example:

<postProcessingStep ID="0003">      
  <processingDateTime>2012-05-26T09:34:00+02:00</processingDateTime>      
  <processingAgency>ACME Agency</processingAgency>     
  <processingStepDescription>Proofreading</processingStepDescription>     
  <processingStepSettings>Double keying required</processingStepSettings>     
  <processingSoftware>
   <softwareCreator>ACME Software Corp.</softwareCreator>           
   <softwareName>Proofer</softwareName>
   <softwareVersion>12.1</softwareVersion>
   <applicationDescription>Distributed proofreading software</applicationDescription>     
  </processingSoftware>
  <processingResult value="Proof reading performed">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
  <processingResult value="Uncorrected">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
</postProcessingStep>
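
As a rough illustration of how a consumer might read this structure, the sketch below collects the outcomes recorded per element ID. It is only a sketch under assumptions: the sample is a hypothetical, namespace-free fragment (real ALTO files declare a namespace), the helper name `outcomes_by_element` is invented for illustration, and the values "corrected" and "verified" are the two outcomes named in the proposal text rather than those in the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, namespace-free sample following the proposed structure.
SAMPLE = """
<postProcessingStep ID="0003">
  <processingStepType>correction</processingStepType>
  <processingResult value="corrected">
    <processedElements>
      <pe>P4_TB00003</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
  <processingResult value="verified">
    <processedElements>
      <pe>P4_TB00002</pe>
      <pe>P4_ST00004</pe>
    </processedElements>
  </processingResult>
</postProcessingStep>
"""


def outcomes_by_element(step):
    """Map each referenced element ID to the set of outcomes recorded for it."""
    outcomes = defaultdict(set)
    for result in step.findall("processingResult"):
        value = result.get("value", "")
        for pe in result.findall("processedElements/pe"):
            outcomes[(pe.text or "").strip()].add(value)
    return dict(outcomes)


step = ET.fromstring(SAMPLE)
tracking = outcomes_by_element(step)
```

An element that appears under several <processingResult> entries ends up with several outcomes, and an element that was never processed is simply absent from the mapping.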

Schema changes draft

Current schema, followed by the changed schema:

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation> 
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organizationlevel producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string"  minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.
        </xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>    
  </xsd:sequence>
</xsd:complexType>

Changed schema:

<xsd:complexType name="processingStepType">
  <xsd:annotation>
    <xsd:documentation>A processing step.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processingStepType" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Type of processing step</xsd:documentation>
      </xsd:annotation>
   </xsd:element>
    <xsd:element name="processingDateTime" type="dateTimeType" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Date or DateTime the image was processed.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingAgency" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>Identifies the organizationlevel producer(s) of the processed image.</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepDescription" type="xsd:string" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>An ordinal listing of the image processing steps performed. For example, "image despeckling."</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingStepSettings" type="xsd:string" minOccurs="0">
      <xsd:annotation>
        <xsd:documentation>A description of any setting of the processing application.
        For example, for a multi-engine OCR application this might include the
        engines which were used. Ideally, this description should be adequate so
        that someone else using the same application can produce identical
        results.
        </xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="processingSoftware" type="processingSoftwareType" minOccurs="0"/>
    <xsd:element name="processingResult" type="processingResultType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="processingResultType">
  <xsd:annotation>
    <xsd:documentation>List of processed elements.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence>
    <xsd:element name="processedElements" minOccurs="0" maxOccurs="unbounded">
      <xsd:annotation>
        <xsd:documentation>ID of processed element</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="pe" type="xsd:IDREF" minOccurs="1" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
  <xsd:attribute name="value" type="xsd:string"/>
</xsd:complexType>
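
Since the draft types the value attribute as xsd:string, the restriction to "corrected" or "verified" described above would not be enforced by the schema itself. The following is a minimal sketch of how a tool could check that restriction, and that every <pe> points at an existing ID, outside the schema. The function name `check_processing_results`, the `<alto>` wrapper, and the sample IDs are assumptions made for illustration, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# The two outcome values named in the proposal text.
ALLOWED_OUTCOMES = {"corrected", "verified"}


def check_processing_results(root):
    """Return a list of human-readable problems found in processingResult entries."""
    # Every ID attribute in the document is a valid reference target.
    known_ids = {el.get("ID") for el in root.iter() if el.get("ID")}
    problems = []
    for result in root.iter("processingResult"):
        value = result.get("value")
        if value not in ALLOWED_OUTCOMES:
            problems.append(f"unexpected outcome value: {value!r}")
        for pe in result.iter("pe"):
            ref = (pe.text or "").strip()
            if ref not in known_ids:
                problems.append(f"pe refers to unknown ID: {ref!r}")
    return problems


# Hypothetical document with one invalid outcome and one dangling reference.
doc = ET.fromstring("""
<alto>
  <postProcessingStep ID="0003">
    <processingResult value="Uncorrected">
      <processedElements><pe>P4_TB00002</pe><pe>P4_TB00009</pe></processedElements>
    </processingResult>
  </postProcessingStep>
  <TextBlock ID="P4_TB00002"/>
</alto>
""")
problems = check_processing_results(doc)
```

A schema-level alternative would be an enumeration on the value attribute, which the draft above does not define.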
@Jo-CCS Jo-CCS assigned Jo-CCS and bkgeig and unassigned Jo-CCS and bkgeig Sep 10, 2014
@jukervin jukervin changed the title 2013-02_IMPACT-proposal: (2) Process Result tracking Process Result tracking (IMPACT) Sep 10, 2014
@Jo-CCS Jo-CCS assigned cneud and unassigned Jo-CCS Dec 10, 2015

cneud commented Jun 15, 2016

Reviewing the original change request filed by the IMPACT project, it seems that two changes are requested:

  1. Add an attribute ID to the processingStepType. This is covered by "Add Processing to replace OCRProcessing" (#13).
  2. Add two attributes, CORRECTEDBY and VERIFIEDBY, to all elements. The attributes hold a list of references (using the ID attribute) to all processingStepType entries which have changed the original value.

Example:

<processingStep ID="ID005">
    <processingDateTime>2010-12-15T15:02:48</processingDateTime>
    <processingAgency>ACME Agency</processingAgency>
    <processingStepDescription>manual correction</processingStepDescription>
    <processingStepSettings>misc. settings</processingStepSettings>
    <processingSoftware>
        <softwareCreator>USAL</softwareCreator>
        <softwareName>Aletheia</softwareName>
        <softwareVersion>1.2.3</softwareVersion>
    </processingSoftware>
</processingStep>

<TextLine ID="ID069" STYLEREFS="ID007" BASELINE="1261" CORRECTEDBY="ID005" VPOS="1230" HPOS="260" HEIGHT="40" WIDTH="902">
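
To illustrate how such references could be followed back to the processing steps, here is a minimal sketch. It assumes the attribute holds a whitespace-separated list of IDs (the request only says "a list of references"), uses a hypothetical namespace-free sample with an invented second step ID006, and the helper name `resolve_steps` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two processing steps and one text line referencing them.
SAMPLE = """
<alto>
  <processingStep ID="ID005">
    <processingStepDescription>manual correction</processingStepDescription>
  </processingStep>
  <processingStep ID="ID006">
    <processingStepDescription>verification</processingStepDescription>
  </processingStep>
  <TextLine ID="ID069" CORRECTEDBY="ID005" VERIFIEDBY="ID005 ID006"/>
</alto>
"""


def resolve_steps(root, element, attribute):
    """Return descriptions of the processing steps referenced by `attribute`."""
    steps = {s.get("ID"): s for s in root.iter("processingStep")}
    # Assumption: references are separated by whitespace.
    refs = element.get(attribute, "").split()
    return [steps[r].findtext("processingStepDescription") for r in refs if r in steps]


root = ET.fromstring(SAMPLE)
line = root.find("TextLine")
corrected_by = resolve_steps(root, line, "CORRECTEDBY")
verified_by = resolve_steps(root, line, "VERIFIEDBY")
```

References to IDs that do not exist are silently skipped here; a stricter tool might report them instead.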

Justification:

"A lot of software tools and also human interactions are involved in different steps of the digitisation process. Each of them may affect an ALTO file by doing some refinements or corrections. From our point of view it would be desirable to keep track of the changes and verification done by the different agents which are involved in the digitisation process. This would allow a simple kind of a document history and gives also important information about the trustworthily of the whole document. If for example everything was verified by a service provider than we can asume that the quality of the document is very high. Storing the old values as well as the new ones would increase the filesize tremendously. Therefore we suggest to store only the information about what has been changed and by whom without keeping track of the changed values."


Jo-CCS commented Jun 16, 2016

A post-processing action such as a new layout analysis (as outlined in #36) will cause changes too large to track with such a method.
So the use case for such referencing might be quite limited, in my view.
But as you will lose the original text information, in a position responsible for long-term preservation storage I would not allow these files to be overwritten, and would keep a copy of them anyway.
From projects I did with national libraries, I even heard that it is not allowed to modify files in the repository at all; a new version is always stored instead.
So for me the question remains: what additional information do I get from this, and how can I use it?

On the other hand, it is a simple extension, is only for optional use, and does not cause a structural issue. I would just shorten the attribute names, which also helps prevent data issues (CORR= / VERIFIED=).

@cneud cneud mentioned this issue Jun 16, 2016
6 tasks

cneud commented Jun 16, 2016

Continued in #39.

@cneud cneud closed this as completed Jun 16, 2016