Vocabulary for ProcessingStepDescriptions #36

Closed
mittagessen opened this issue Feb 7, 2016 · 13 comments

@mittagessen
Contributor

One more from the wish list.

The nature of common *ProcessingStep elements (layout analysis, any kind of postcorrection) is only incompletely captured by MIX's change history and often seems to be out of scope of the MIX schema. It would therefore be beneficial to define an (optional?) vocabulary of possible processingStepDescription attribute values to increase interoperability between data sources.

Any comments?

@cneud
Member

cneud commented Feb 24, 2016

Thank you for your feedback/request. The whole processingStepType will be investigated in the light of this, in connection with issues #35, #27, #13.

For the vocabulary, allowing embedded use of elements from e.g. PREMIS or similar established/standardised vocabularies can be considered, but this will need a wider discussion as well.

@cneud cneud self-assigned this Feb 24, 2016
@cneud
Member

cneud commented Jun 14, 2016

Looking into this a little further, the scope is possibly tremendous: processingSteps can basically take any form, from image-related tasks to OCR corrections and all forms of linguistic post-processing, semantic enrichment and so forth. I am not sure it is really feasible to create a useful formal vocabulary of attribute values.
Sure, leaving this undefined limits interoperability, but on the other hand it gives ALTO the flexibility to also include processingStepDescriptions which are rare or edge cases, without requiring an adaptation of the schema whenever a new processingStepDescription has been identified.
What do others think about this? Are there examples of other schemata that cover this, e.g. for linguistic annotation?

@splet

splet commented Jun 15, 2016

If any kind of processing step description is allowed, I would suggest following the Semantic Web / Linked Data approach and at least requiring a URI/IRI as a reference for what is actually meant (there could be countless variants of image enhancement, segmentation, etc. methods).
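
Editorial illustration of the suggestion above, as a minimal Python sketch: a free-text description is only accepted if it comes with an absolute URI/IRI that says what is meant. The class name `StepDescription`, its fields, and the `example.org` address are hypothetical and not part of ALTO; the check is deliberately simplified and would reject schemes without an authority part, such as `urn:`.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


def is_absolute_uri(value: str) -> bool:
    """Simplified check: the value must have a scheme and an authority part."""
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)


@dataclass(frozen=True)
class StepDescription:
    """Hypothetical container: a human-readable label plus a required reference."""
    label: str      # free text, e.g. "image enhancement"
    reference: str  # URI/IRI identifying the exact method (assumed requirement)

    def __post_init__(self) -> None:
        # Reject descriptions whose reference is not an absolute URI/IRI.
        if not is_absolute_uri(self.reference):
            raise ValueError(
                f"reference must be an absolute URI/IRI, got {self.reference!r}"
            )


# Usage: the URI below is a placeholder, not a real vocabulary entry.
step = StepDescription("binarization", "https://example.org/vocab/binarization")
```

With a rule like this, two data sources may use different labels for the same method and still be matched on the reference.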

@splet splet closed this as completed Jun 15, 2016
@splet splet reopened this Jun 15, 2016
@splet

splet commented Jun 15, 2016

Sorry, closed this by accident (wrong button)...

@Jo-CCS
Member

Jo-CCS commented Jun 16, 2016

I agree with the proposal that it is beneficial to have predefined vocabularies. I think the way it was done in METS can also cover the "rare or edge cases" Clemens outlined. In METS, an additional option "OTHER" was always available. In that case the description can express the real details, while the predefined value still helps to classify the processing history so that machines can analyse it and cluster the information.
But I have concerns. The pre-, OCR- and post-processing information are all subelements of "OCRProcessing", so the sample case mentioned above only matches once #13 is signed off.

@cowboyMontana
Member

what is the point of processing step history? is it to serve as an audit trail (who did what and when did s/he do it?)? or is it to show what changes have been made to the file, that is, how is it different now than it was before the processing step? (subtle difference)

@jukervin
Member

Some kind of common vocabulary is needed at least for (image processing, OCR, proofreading etc.), which is a simpler issue than the

@cneud cneud mentioned this issue Jun 16, 2016
6 tasks
@cneud
Member

cneud commented Jun 16, 2016

Continued in #39.

@cneud cneud closed this as completed Jun 16, 2016
@Jo-CCS
Member

Jo-CCS commented Jul 28, 2016

Following up on yesterday's call, here are my first thoughts on a possible value list. As mentioned on the call, I have in mind the METS agent solution: a short list of top-level areas which can be used for filtering and analysis of the main parts, with all the remaining special processing types noted as "Other".

TextGeneration
TextAdaption / TextCorrection
LayoutGeneration
LayoutAdaption / LayoutCorrection
PreOperation
PostOperation
Other

This would cover the main areas of ALTO, "layout" and "text", specifically so that the processing information can be filtered to find out where the layout and the text come from. From my point of view, isolated image operations are not relevant for ALTO. Only as part of text and layout actions might it be of interest to record parameters used in the operation (like image conversion).
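
Editorial illustration of how such a short list with an "Other" fallback could be used for filtering, as a minimal Python sketch. The enum mirrors the proposal above; where two alternative names were offered ("TextAdaption / TextCorrection"), the sketch arbitrarily picks the "Correction" spelling. The `summarize` helper and the sample step names are hypothetical, and none of this is the vocabulary that was eventually adopted in the schema.

```python
from collections import Counter
from enum import Enum


class ProcessingCategory(Enum):
    """Top-level areas from the proposed value list (spelling choices assumed)."""
    TEXT_GENERATION = "TextGeneration"
    TEXT_CORRECTION = "TextCorrection"
    LAYOUT_GENERATION = "LayoutGeneration"
    LAYOUT_CORRECTION = "LayoutCorrection"
    PRE_OPERATION = "PreOperation"
    POST_OPERATION = "PostOperation"
    OTHER = "Other"

    @classmethod
    def _missing_(cls, value):
        # Any value outside the list is classified as "Other",
        # so rare or edge cases never require a change to the list.
        return cls.OTHER


def summarize(category_names):
    """Count processing steps per top-level category."""
    return Counter(ProcessingCategory(name) for name in category_names)


# Usage: "Deskew" is not in the list and is therefore counted under OTHER.
history = ["PreOperation", "TextGeneration", "TextCorrection", "Deskew"]
counts = summarize(history)
```

The free-text description would still carry the details of a step; the category only gives machines a small, stable set of values to cluster on.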

@Jo-CCS Jo-CCS reopened this Jul 28, 2016
@jpmoreux
Member

jpmoreux commented Sep 1, 2016

A naive comment: do we really want/need to repeat, in every ALTO file of a document, all the operations which were applied to this document? This kind of processing history is generally stored in the document manifest, at the highest level possible.

@Jo-CCS
Member

Jo-CCS commented Mar 30, 2017

In today's tech call we briefly discussed this feedback from Jean-Philip again, and we concluded that recording the history inside the ALTO file will be necessary whenever it is more granular than document level.
So whenever individual elements are affected by adaptations and re-processing, referencing those individual elements is useful, and doing so externally is difficult.

@Jo-CCS
Member

Jo-CCS commented Jan 22, 2018

The change for the processing history is included in the current draft schema version 4-0 for public review.
I consider this issue completed by the current change as well. Please reply to it if there are outstanding topics on this.

@cneud
Member

cneud commented Apr 24, 2018

Basic vocabulary included in v4.0.

7 participants