New formats (not on lists) are also added. In time we will provide descriptions for those additional formats.
|
Thanks, that's a lot of formats! Not all of them are going to be known to others, but we'll see what happens. |
|
53 validation errors... :-) |
|
Sorry for this. Maybe I can do the XML again if you tell me what the errors are … I followed the Fin-clarin XML as an example. |
|
I am a bit torn, given the number of undescribed formats -- there are some methodological issues here that I am not sure we agree on, and it would be good to investigate that, because the result may not only improve this submission but also help others. Let us work on this submission a bit before it is merged, OK? Perhaps by direct e-mailing or a zoom? Let me try to pull this request out of the SIS source, but please let the PR stay in -- it will just accumulate your fixes and this way it will give you full credit for your work. Ah, and I'll have a look at the Fin recommendations, thanks for the hint! |
|
Hmm, it looks like GitHub has closed this particular PR, or rather hasn't re-opened it. So let's just repeat the procedure after we've exchanged on it, OK? Could you, at some convenient point (don't feel rushed), message me at |
|
Sure! |
|
I'm pasting the fragment below from my e-mail to Sasa, because it might be something I want to put somewhere into the wiki documentation: My first methodological hint is: it's not so much about what data are processed at the institute or institutes that make up CLARIN:EL, but rather what data can be reasonably expected to be deposited in the repository, and out of those formats, which formats would the given institute
Also, for the last point: when I say "weird", they may be
In most of the cases, we don't enumerate those as discouraged, because we leave that to common sense and also to some general rules that are true of the entire repository and maybe the entire network, such as, in the case at hand:
I'm very grateful for your highlighting the need for the domain for ML models -- it's high time the SIS had that. Also, it is worth noting that "XML" as a recommended format really means, roughly, "we're going to accept anything" -- because XML can be all sorts of things, and one probably doesn't want to say that they want to see annotations provided as .docx or .odt (because these are compressed XML), etc. |
|
Here's one more thing. The first screenshot is from the test system on my desktop. It shows 57 formats that are not described yet. (The online version has 60, but thanks to @sfischer-uds, that number got reduced today, and the changes are not yet live.) The second screenshot is the state after adding the CLARIN:EL recommendations: 92 missing format descriptions. I am by no means trying to say that adding formats without a description is "bad". But I do fear a bit that it may unbalance the system at this still relatively early stage. So I'd be willing to work with you on reducing the number of referenced formats while preserving the information (provided that all the formats are really meant as formats for data depositions).
One way to tame high granularity in the SIS is the use of comments, and you have done that for CoNLL (although, again, I would ask: do you see / expect depositions in all the enumerated variants of CoNLL? Are they exactly as welcome as depositions in CoNLL-U?). So, perhaps, we could go for a reduction of the number of ML models (maybe subdividing them according to formal properties if that makes sense, or according to other grouping criteria that make sense in the field, in such a way that may be expected to be shared by other centres), and then the particular names of some models grouped this way would be enumerated in the comment. This does not mean forever, because maybe some of them are going to become more popular than others and will deserve a separate description file. And so on -- I'm just outlining one possibility.
Similarly with the XCES group. Depending on what your customisation of XCES is precisely, it may be handled by a comment in the general XCES recommendation ("See [link] for the definition of a CLARIN:EL-specific variant").
I am going to define the creation of format descriptions as one of the SIS intern-level tasks. But we should be careful not to scare interns, or nothing gets done :-) (This comes from one who once managed to scare an intern into disappearing from one day to the next, by assigning them an overambitious task...)
Ah, one more remark: in my local version, where I added the CLARIN:EL recommendations, I can see, in the Sanity Checker, the following repeated recommendation: -- you can see that something is odd, above, and we can probably go down by 1 recommendation here, unless you meant to provide a different qualification in the two comments.
One way or another, I do see a good chance that we can do some good work on these recommendations, and also on the SIS as a whole -- because I am going to add at least one more data domain, for ML models (not to mention a fix for the "Databases" superdomain). That is long overdue, but at last we have a clear use case. |
|
The recommendations are now in a separate branch, so that it's easier to work on them step by step. https://github.com/clarin-eric/standards/tree/EL I suggest the following:
Hint: some XML editors won't recognise the processing instruction at the top of the recommendations document (the one that identifies the schema). If you are editing in a clone of this repository (the EL branch), then you may want to manually indicate to the editor that the schema is at |
|
|
As for CoNLL, you've basically repeated the definition of "acceptable" :-), so would it be OK to change the level to that? The recommendations also need the author / curator -- I can set that, but I'd need your decision on whether the curator is you, and whether it's OK to reference your GitHub profile, or maybe you'd prefer something else? |
|
OK for CoNLL! |
|
Done on both counts. We need a while before it's uploaded to the production instance, though. |
|
Ahh, but the delay is good, because I now realise that I haven't talked to you about several issues:
How about the following: could you perhaps take the prose fragments of the PDF linked above and put them into the |
I will add the text to the (you are right...) |
|
Thanks for the new PR! I'll merge it and check if we need to introduce functionality for headings inside |
|
Headings can be useful, but they can also "mess up" the style of the final page that is displayed to the user. You could allow for example only if the CSS supports this heading, and not allow larger headings like h1, h2, h3, which could be reserved for the main menu, sidebar, and footer of the final webpage. But you have to test how it looks even with those smaller fonts. |
|
(That's still the local state, of course. On my desktop.) |
|
Looks ok … |
|
Ah, actually, the lower two should be h4, shouldn't they? I'll fix that. |
|
Yes h4 is smaller than h3 :-) |




CLARIN:EL formats update