Vocabulary for ProcessingStepDescriptions #36

Closed
mittagessen opened this Issue Feb 7, 2016 · 13 comments

@mittagessen

One more from the wish list.

The nature of common *ProcessingStep elements (layout analysis, any kind of postcorrection) is only incompletely captured by MIX's change history and often seems to be out of scope of the MIX schema. It would therefore be beneficial to define an (optional?) vocabulary of possible processingStepDescription attribute values to increase interoperability between data sources.

Any comments?

Member

cneud commented Feb 24, 2016

Thank you for your feedback/request. The whole processingStepType will be investigated in the light of this, in connection with issues #35, #27, #13.

For the vocabulary, allowing embedded use of elements from e.g. PREMIS or similar established/standardised vocabularies can be considered, but this will need a wider discussion as well.

@cneud cneud self-assigned this Feb 24, 2016

Member

cneud commented Jun 14, 2016

Looking into this a little further, the scope is possibly tremendous: processingSteps can basically take any form, from image-related tasks to OCR corrections and all forms of linguistic post-processing, semantic enrichment, and so forth. I am not sure it is really feasible to create a useful formal vocabulary of attribute values.
Sure, leaving this undefined limits interoperability, but on the other hand it gives ALTO the flexibility to also include processingStepDescriptions that are rare or edge cases, without requiring an adaptation of the schema whenever a new processingStepDescription has been identified.
What do others think about this? Are there examples of other schemata that cover this, e.g. for linguistic annotation?

splet commented Jun 15, 2016

If any kind of processing step description is allowed, I would suggest following the Semantic Web / Linked Data approach and at least requiring a URI/IRI as a reference for what is actually meant (there could be countless variants of image enhancement, segmentation, etc. methods).
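
As a rough illustration of the URI/IRI requirement, here is a minimal Python sketch that checks whether a step description is an absolute URI rather than free text. The function name `is_absolute_uri` and the example identifiers are hypothetical; nothing here is part of ALTO or of any proposal in this thread.

```python
from urllib.parse import urlsplit


def is_absolute_uri(value: str) -> bool:
    """Return True if value has a scheme plus a network location or a path.

    Hypothetical helper: a loose syntactic check only, it does not
    verify that the URI resolves or belongs to a known vocabulary.
    """
    parts = urlsplit(value.strip())
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


# Example identifiers are made up for illustration.
descriptions = [
    "https://example.org/vocab/steps#deskew",
    "urn:example:steps:binarization",
    "deskew",
]
for description in descriptions:
    print(description, is_absolute_uri(description))
```

A check like this only distinguishes references from free-text labels; agreeing on which URIs to use would still need a shared vocabulary.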

@splet splet closed this Jun 15, 2016

@splet splet reopened this Jun 15, 2016

splet commented Jun 15, 2016

Sorry, closed this by accident (wrong button)...

Member

Jo-CCS commented Jun 16, 2016

I agree with the proposal that it is beneficial to have predefined vocabularies. I think the way this was done in METS can also cover the "rare or edge cases" Clemens outlined: METS always offered "OTHER" as an additional option. In that case the description can express the real details, while the classification still helps machines analyse the processing history and cluster the information.
But I have two concerns. The pre-, OCR-, and post-processing information are all subelements of "OCRProcessing". So the sample case mentioned above only matches once #13 is signed off.

Member

cowboyMontana commented Jun 16, 2016

what is the point of processing step history? is it to serve as an audit trail (who did what and when did s/he do it?)? or is it to show what changes have been made to the file, that is, how is it different now than it was before the processing step? (subtle difference)

Member

jukervin commented Jun 16, 2016

Some kind of common vocabulary is needed at least for image processing, OCR, proofreading, etc., which is a simpler issue than the

@cneud cneud referenced this issue Jun 16, 2016

Closed

Processing history #39

6 of 6 tasks complete

Member

cneud commented Jun 16, 2016

Continued in #39.

@cneud cneud closed this Jun 16, 2016

Member

Jo-CCS commented Jul 28, 2016

Continuing the discussion from yesterday's call, here are my first thoughts on a possible value list. As mentioned on the call, I have in mind the METS agent solution: a short list of top-level areas that can be used for filtering and analysis of the main parts, with all remaining special processing types noted as "Other".

TextGeneration
TextAdaption / TextCorrection
LayoutGeneration
LayoutAdaption / LayoutCorrection
PreOperation
PostOperation
Other

This would cover the main areas of ALTO, "layout" and "text", specifically so that the processing information can be filtered to show where the layout and text come from. In my view, isolated image operations are not relevant for ALTO. Only as part of text and layout actions might it be of interest to record the parameters used in the operation (like image conversion).
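
To make the idea concrete, the value list above could be sketched as a closed enumeration with an "Other" fallback, roughly as follows. This is an illustration in Python, not the schema itself: the class and function names are hypothetical, and where the list offers two alternative spellings (Adaption / Correction) the sketch arbitrarily picks "Correction".

```python
from enum import Enum


class ProcessingCategory(Enum):
    """Top-level areas from the proposed list (names are illustrative)."""
    TEXT_GENERATION = "TextGeneration"
    TEXT_CORRECTION = "TextCorrection"
    LAYOUT_GENERATION = "LayoutGeneration"
    LAYOUT_CORRECTION = "LayoutCorrection"
    PRE_OPERATION = "PreOperation"
    POST_OPERATION = "PostOperation"
    OTHER = "Other"


def classify(label: str) -> ProcessingCategory:
    """Map a label to a category, falling back to OTHER for unknown values."""
    try:
        return ProcessingCategory(label)
    except ValueError:
        return ProcessingCategory.OTHER


TEXT_CATEGORIES = {
    ProcessingCategory.TEXT_GENERATION,
    ProcessingCategory.TEXT_CORRECTION,
}


def text_related(labels: list[str]) -> list[str]:
    """Keep only the labels that say where the text comes from."""
    return [label for label in labels if classify(label) in TEXT_CATEGORIES]


# Made-up processing history for illustration.
history = ["PreOperation", "TextGeneration", "Deskew", "TextCorrection"]
print(text_related(history))
```

The fallback keeps rare or edge-case steps representable, while the fixed categories still allow simple filtering by machines.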

@Jo-CCS Jo-CCS reopened this Jul 28, 2016

Member

jpmoreux commented Sep 1, 2016

A naive comment: do we really want/need to repeat, in all the ALTO files of a document, all the operations that were applied to this document? This kind of processing history is generally stored in the document manifest, at the highest level possible.

Member

Jo-CCS commented Mar 30, 2017

In today's tech call we briefly discussed this feedback from Jean-Philip again and concluded that recording the history inside the ALTO file will be necessary when it is more granular than document level.
So whenever individual elements are affected by adaptations and re-processing, referencing the individual elements is useful, and doing so externally is difficult.

Member

Jo-CCS commented Jan 22, 2018

The change for the processing history is included in the current draft schema version 4-0 for public review.
I consider this issue completed by the current change as well. Please reply to it if there are outstanding topics on this.

Member

cneud commented Apr 24, 2018

Basic vocabulary included in v4.0.

@cneud cneud closed this Apr 24, 2018

@cneud cneud added 8 published and removed 7 public comment labels Apr 24, 2018
