CLARIN:EL formats update #341

Merged
bansp merged 2 commits into clarin-eric:formats from raspberryjoy:formats
Dec 6, 2024

@raspberryjoy
Contributor

CLARIN:EL formats update

New formats (not on lists) are also added. In time we will provide descriptions for those additional formats.
@bansp
Member

bansp commented Dec 6, 2024

Thanks, that's a lot of formats! Not all of them are going to be known to others, but we'll see what happens.

@bansp bansp merged commit 241122b into clarin-eric:formats Dec 6, 2024
@bansp
Member

bansp commented Dec 6, 2024

53 validation errors... :-)
I hope to manage to deal with them by search and replace.

@raspberryjoy
Contributor Author

Sorry for this,

Maybe I can do the XML again if you tell me what the errors are …

I followed the FIN-CLARIN XML as an example.

@bansp
Member

bansp commented Dec 6, 2024

I am a bit torn, given the amount of undescribed formats -- there are some methodological issues here that I am not sure we agree on, and it would be good to investigate that, because the result may not only improve this submission but also help others.
Also, placing machine learning models in the "language description" category is probably not the most fortunate move (though I think I can imagine the reasons for the choice). Further, I am not an expert on machine learning models at all, but isn't YAML a configuration specification language, which could be used for some aspects of the automation of machine learning tasks, but is not, in itself, a machine learning model format?
Also, and this is a random check, I am a bit worried about fXCESILSPVariant -- is this a syntactic variant of XCES, so, in effect, an XCES-derived format? Or a set of conventions for XCES, which would make it XCES with perhaps a comment containing a URL for the definition?
I am concerned with the misuse of the "Language Description" category for ML models -- but please do not take this as criticism of your decision. I think it was a trap of a sort, and I need to expand its description.

Let us work on this submission a bit before it is merged, OK? Perhaps by direct e-mailing or a zoom?

Let me try to pull this request out of the SIS source, but let the PR please stay in -- it will just accumulate your fixes and this way it will give you full credit for your work.

AH, and I'll have a look at the Fin recommendations, thanks for the hint!

@bansp
Member

bansp commented Dec 6, 2024

Hmm, it looks like GitHub has closed this particular PR, or rather hasn't re-opened it. So let's just repeat the procedure after we've exchanged on it, OK? Could you, at some convenient point (don't feel rushed), message me at banski in the domain ids-mannheim .de? (Trying to make it harder to harvest that address, although it's probably way too late for that.)

@raspberryjoy
Contributor Author

Sure!
I understand your concerns.
I think it will be better to have a discussion with Maria. Formats are not my area of expertise either, and a couple of team members collaborated on creating this file with the proposed recommendations.
We will be in touch via email.
Thanks!!

@bansp
Member

bansp commented Dec 6, 2024

I'm pasting the fragment below from my e-mail to Sasa, because it might be something I want to put somewhere into the wiki documentation:

My first methodological hint is: it's not so much about what data are processed at the institute or institutes that make up CLARIN:EL, but rather what data can be reasonably expected to be deposited in the repository, and out of those formats, which formats would the given institute

  • "just take and archive (with minimal fuss)" -- these are the recommended formats.
  • If the institute would be only mildly happy to see data in some formats, but it's OK because those formats can easily be converted for long-term archiving or for re-use, then this is the "acceptable" category.
  • Then there is the open-world assumption that one is unable to enumerate all the possible formats -- just, say, the most common 80% of those. The remaining 20% is a grey area, by default "acceptable" or...
  • sometimes so completely weird that they would qualify as "discouraged". There's no way to guess them all, but sometimes guesses can be made, and those formats that you definitely wouldn't like to get but get asked about sometimes, are "discouraged". It is natural that the last group is the least numerous, because
    • (a) users at large usually have some idea of format usefulness,
    • (b) popular tools help sometimes, when they produce e.g. annotations, and
    • (c) other centres have made their preferences known and those preferences can be accessed by users (and by those who create recommendation lists :-)).

Also, for the last point: when I say "weird", they may be

  • "weird as such", for example a Word Perfect 5.1 format, or plain text but encoded in EBCDIC
  • "weird in a combination of format+function", for example the newest .odt or a perfectly archival PDF/A when used to store annotations (or a dictionary).

In most of the cases, we don't enumerate those as discouraged, because we leave that to common sense and also to some general rules that are true of the entire repository and maybe the entire network, such as, in the case at hand:

  • "no obsolete, proprietary formats"
  • Unicode, encoded as UTF-8 or UTF-16 if needs be.
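
As a minimal sketch of what the Unicode rule means in practice for the EBCDIC case mentioned above, legacy text can be decoded and re-encoded as UTF-8 before deposit. The code page is an assumption: `cp037` is one common EBCDIC variant shipped with Python, and real data may need a different one. The function name `ebcdic_to_utf8` is hypothetical.

```python
# Sketch: re-encode legacy EBCDIC text as UTF-8 before deposit.
# Assumption: the source uses IBM code page 037 (Python codec "cp037");
# real EBCDIC data may use a different code page.

def ebcdic_to_utf8(raw: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes and return the same text encoded as UTF-8."""
    text = raw.decode(codepage)
    return text.encode("utf-8")


# Simulate a legacy EBCDIC file, then convert it.
legacy = "CLARIN".encode("cp037")
converted = ebcdic_to_utf8(legacy)
print(converted.decode("utf-8"))
```

The conversion itself is trivial; the hard part in real cases is usually finding out which code page the depositor's data actually uses.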

I'm very grateful for your highlighting the need for the domain for ML models -- it's high time the SIS had that.

Also, it is worth noting that "XML" as a recommended format really means, roughly, "we're going to accept anything" -- because XML can be all sorts of things, and one probably doesn't want to say that they want to see annotations provided as .docx or .odt (because these are compressed XML), etc.
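
To illustrate the "compressed XML" point, here is a small sketch that opens a ZIP container and lists the members that parse as well-formed XML. The archive is built in memory and is only a stand-in: the member name and its content are illustrative and do not form a valid .docx file. The helper name `xml_members` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a tiny stand-in archive in memory. The member name and content are
# illustrative only; this is NOT a valid .docx package.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", "<document><p>annotations?</p></document>")


def xml_members(archive_bytes: bytes) -> list[str]:
    """Return the names of archive members that parse as well-formed XML."""
    found = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            try:
                ET.fromstring(zf.read(name))
            except ET.ParseError:
                continue  # not XML (e.g. an embedded image)
            found.append(name)
    return found


print(xml_members(buf.getvalue()))
```

A bare "XML" recommendation would, taken literally, cover every such member, which is why it says very little about what the repository actually wants.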

@bansp
Member

bansp commented Dec 7, 2024

Here's one more thing. The first screenshot is from the test system on my desktop. It shows 57 formats that are not described yet. (The online version has 60, but thanks to @sfischer-uds, that number got reduced today, and the changes are not yet live.)

[Screenshot: Screenshot_20241207_004532]

The second screenshot is the state after adding the CLARIN:EL recommendations. 92 missing format descriptions.

[Screenshot]

And I am not trying to say that adding formats without a description is "bad", by no means. But I do fear a bit that it may unbalance the system at this still relatively early stage. So I'd be willing to work with you on reducing the number of referenced formats while preserving the information (provided that all the formats are really meant as formats for data depositions).

One way to tame high granularity in the SIS is the use of comments, and you have done that for CoNLL (although, again, I would ask: do you see / expect depositions in all the enumerated variants of CoNLL? Are they exactly as welcome as depositions in CoNLL-U?). So, perhaps, we could go for a reduction of the number of ML models (maybe subdividing them according to formal properties if that makes sense, or according to other grouping criteria that make sense in the field, in such a way that may be expected to be shared by other centres), and then the particular names of some models grouped this way would be enumerated in the comment. This does not mean forever, because maybe some of them are going to become more popular than others and will deserve a separate description file. And so on -- I'm just outlining one possibility.
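
The grouping idea can be sketched as follows: collapse several variants into one entry per family and enumerate the variants in a comment. The input pairs and the comment wording are hypothetical and only illustrate the shape of the reduction; they are not taken from the CLARIN:EL submission or from the SIS data model.

```python
from collections import defaultdict

# Hypothetical input: (family, variant) pairs. The names are illustrative
# and are not copied from the actual recommendations file.
variants = [
    ("CoNLL", "CoNLL-U"),
    ("CoNLL", "CoNLL-X"),
    ("CoNLL", "CoNLL 2009"),
    ("XCES", "XCES (CLARIN:EL variant)"),
]


def group_with_comments(pairs):
    """Collapse variants into one entry per family, listed in a comment."""
    families = defaultdict(list)
    for family, variant in pairs:
        families[family].append(variant)
    return {
        family: "Variants: " + ", ".join(sorted(names))
        for family, names in families.items()
    }


grouped = group_with_comments(variants)
for family, comment in grouped.items():
    print(f"{family}: {comment}")
```

Four referenced formats become two entries here, while the variant names survive in the comments and can later be promoted to separate descriptions if they prove popular.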

Similarly with the XCES group. Depending on what your customisation of XCES is precisely, it may be handled by a comment in the general XCES recommendation ("See [link] for the definition of a CLARIN:EL-specific variant").

I am going to define the creation of format descriptions as one of the SIS intern-level tasks. But we should be careful not to scare interns or nothing gets done :-) (This comes from one who once managed to scare an intern into disappearing from one day to the next, by assigning them an overambitious task...)

Ah, one more remark: in my local version, where I added the CLARIN:EL recommendations, I can see, in the Sanity Checker, the following repeated recommendation:
[Screenshot]

-- you can see that something is odd, above, and we can probably go down by 1 recommendation here, unless you meant to provide a different qualification in the two comments.
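
A check of this kind can be approximated outside the SIS as well. The sketch below counts (format id, domain) pairs and reports those that occur more than once. The element and attribute names (`format`, `id`, `domain`, `level`) and the sample values are assumptions made for illustration; the real structure is defined by the recommendation schema and may differ, and this is not how the Sanity Checker itself is implemented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed, simplified structure for illustration only; the real
# recommendation schema may use different element and attribute names.
SAMPLE = """
<recommendation>
  <format id="fJPEG"><domain>Image Source Data</domain><level>recommended</level></format>
  <format id="fJPEG"><domain>Image Source Data</domain><level>acceptable</level></format>
  <format id="fPNG"><domain>Image Source Data</domain><level>recommended</level></format>
</recommendation>
"""


def repeated_recommendations(xml_text):
    """Return (format id, domain) pairs that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        (fmt.get("id"), fmt.findtext("domain"))
        for fmt in root.iter("format")
    )
    return [key for key, n in counts.items() if n > 1]


print(repeated_recommendations(SAMPLE))
```

Whether a reported pair is a real duplicate or two deliberately different qualifications still has to be decided by the curator.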

One way or another, I do see a good chance that we can do some good work on these recommendations, and also on the SIS as a whole -- because I am going to add at least one more data domain, for ML models (not to mention a fix for the "Databases" superdomain). That is long overdue, but at last we have a clear use case.

@sfischer-uds sfischer-uds mentioned this pull request Dec 9, 2024
@bansp
Member

bansp commented Dec 13, 2024

The recommendations are now in a separate branch, so that it's easier to work on them step by step.

https://github.com/clarin-eric/standards/tree/EL

I suggest the following:

  • fix JPEG (need to decide which recommendation to delete or maybe change some values)
  • possibly fix CoNLL and XML (they appear too broad to be meaningful)
  • possibly prepare some even skeletal format descriptions (I'll gladly help), but first or in parallel:
  • let's please see what domains we need for ML-related data functions (my suggestion was something along the lines of 'data-preparation', 'model-training', 'model-exchange' -- if that makes sense, and if people wish to share/deposit data in all three functions)

Hint: some XML editors won't recognise the processing instruction at the top of the recommendations document (the one that identifies the schema). If you are editing in a clone of this repository (the EL branch), then you may want to manually indicate to the editor that the schema is at ../../schemas/recommendation.xsd. It is best to enable Schematron rules for extra hints.
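
For editors that ignore it, the schema location can also be read out of the document programmatically. This is a minimal sketch assuming the processing instruction is an `xml-model` instruction with an `href` pseudo-attribute; the sample document is made up, and only the schema path is taken from the hint above. The helper name `schema_hrefs` is hypothetical.

```python
import re
from xml.dom import minidom

# Made-up sample; it is ASSUMED that the schema is referenced through an
# xml-model processing instruction with an href pseudo-attribute.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-model href="../../schemas/recommendation.xsd"?>\n'
    b'<recommendation/>'
)


def schema_hrefs(xml_bytes):
    """Return href values of xml-model instructions placed before the root."""
    doc = minidom.parseString(xml_bytes)
    hrefs = []
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-model":
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                hrefs.append(match.group(1))
    return hrefs


print(schema_hrefs(SAMPLE))
```

The returned path is relative to the recommendations file, so it only resolves inside a clone that keeps the repository layout.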

@bansp bansp mentioned this pull request Dec 13, 2024
@raspberryjoy
Contributor Author

  • JPEG will be fixed
  • CoNLL: we don't expect to get all enumerated variants - but since they are all standardised and documented, there's no point in "discriminating" among them. If a corpus is already available in one of these variants, it can be accepted.
  • XML: I can see your point; I will remove it.

@bansp
Member

bansp commented Jan 14, 2025

As for CoNLL, you've basically repeated the definition of "acceptable" :-) , so would it be OK to change the level to that?

The recommendations also need the author / curator -- I can set that, but I'd need your decision if the curator is you, and if it's OK to reference your github profile, or maybe you'd prefer something else?

@raspberryjoy
Contributor Author

OK for CoNLL!
Yes, you can add me as the curator.

@bansp
Member

bansp commented Jan 15, 2025

Done on both counts. We need a while before it's uploaded to the production instance, though.

@bansp
Member

bansp commented Jan 15, 2025

Ahh, but the delay is good, because I now realise that I haven't talked to you about several issues:

  1. The description of the centre is very likely still "mine", in the sense that I think that I pasted a sensible link when no EL recommendations had been published yet, just to point the potential users at the centre repository.
  2. At https://www.clarin.gr/sites/default/files/CLARINELRecommendedFormats.pdf there is still a competing set of recommendations, most probably a subset of those in the SIS.
  3. Once the recommendations are fully published in the SIS, and you and your centre are satisfied, it makes a lot of sense to link back to them -- after all, you want to make your work useful to the users.

How about the following: could you perhaps take the prose fragments of the PDF linked above and put them into the <info> element, while also preserving the initial link (naturally -- because it points straight at the centre repository)? At that moment, the PDF is going to become fully redundant. And once the updated new content of the EL recommendations is online in the SIS, you could get the link to the PDF replaced with the link to https://standards.clarin.eu/sis/views/view-centre.xq?id=CLARIN:EL , and the task would be complete :-)

@raspberryjoy
Contributor Author

  1. Oh, ok ...
  2. We started our task with this pdf in order to create the formats file for the SIS.
  3. Yes, it makes sense to replace the pdf with the link to the SIS.

I will add the text to the <info> element (you are right...)

@bansp
Member

bansp commented Jan 15, 2025

Thanks for the new PR! I'll merge it and check if we need to introduce functionality for headings inside <info> -- interesting point.

@raspberryjoy
Contributor Author

Headings can be useful, but they can also mess up the style of the final page that is displayed to the user. You could allow, for example, only

if the CSS supports this heading, and disallow larger headings like h1, h2, h3, which could be reserved for the main menu, sidebar, and footer of the final webpage. But you have to test how it looks even with those smaller fonts.

@bansp
Member

bansp commented Jan 15, 2025

I only allowed h3 and h4, and only within the <info> element, because that's where curators can operate. The result is fairly neat, I think:

[Screenshot]
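
The h3/h4 restriction could also be checked on the curator's side before submitting. The sketch below is not the SIS implementation: it assumes un-namespaced element names, uses a made-up wrapper element `centre`, and simply reports any heading inside `<info>` other than h3 and h4. The helper name `disallowed_headings` is hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_HEADINGS = {"h3", "h4"}                  # levels named in the comment above
ALL_HEADINGS = {f"h{n}" for n in range(1, 7)}    # h1 .. h6

# Made-up sample; the wrapper element name is an assumption, and
# namespaces are ignored for simplicity.
SAMPLE = "<centre><info><h3>Scope</h3><h4>Text</h4><h2>Too big</h2></info></centre>"


def disallowed_headings(xml_text):
    """Return tags of headings inside <info> that are not h3 or h4."""
    root = ET.fromstring(xml_text)
    bad = []
    for info in root.iter("info"):
        for element in info.iter():
            if element.tag in ALL_HEADINGS and element.tag not in ALLOWED_HEADINGS:
                bad.append(element.tag)
    return bad


print(disallowed_headings(SAMPLE))
```

Schema validation should catch the same problem; a check like this is only a quick way to see it before opening a PR.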

@bansp
Member

bansp commented Jan 15, 2025

(That's still the local state, of course. On my desktop.)

@raspberryjoy
Contributor Author

Looks ok …

@bansp
Member

bansp commented Jan 15, 2025

Ah, actually, the lower two should be h4, shouldn't they? I'll fix that.

@raspberryjoy
Contributor Author

Yes, h4 is smaller than h3 :-)


2 participants