Sanity checker: general discussion ticket #115

Open
bansp opened this issue Feb 16, 2022 · 1 comment
Labels
discussion hub Meant as target for individual action tickets. Please don't assign 'hub' tickets. sanity-checker stuff that should potentially be handled by the sanity checker

bansp commented Feb 16, 2022

This ticket will probably stay outside of milestones for a time, unless we see a straight path ahead to closing it. The previous ticket of this nature got turned into an action ticket (#60), so this one is now the place to gather ideas and turn them into separate action tickets.

1. Introduction: places to look for sanity checks

We have a somewhat distributed sanity-checking functionality now (and that happened in a way by design):

  1. the main sanity-checker page (small at the time of writing, but it's already been useful)
  2. list of missing formats (mentioned in the recommendations by ID but the IDs belong to not-yet-existent format-description files), under "Data Deposition Formats"
  3. list of existing format-description files that are not mentioned by any recommendation -- also under "Data Deposition Formats"
  4. list of format-description files that don't mention any file extensions -- under "File Extensions"
  5. list of format-description files that don't mention any media types -- under "Media Types"
  6. and, indirectly, the Statistics page. It is meant mainly for aggregating and visualizing the data content, but it might also point to some local insanities ;-), especially because, for now, it contains some meta-statistics that tell us more about the content of the SIS than about the individual centres and formats. Note also that this page has a dedicated discussion issue (contents of the "Statistics" tab #67). Maybe it should even get split later into something like "SIS statistics" and "Data Visualization", but let's ignore that in the present issue.
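
Items (2) and (3) in the list above boil down to two set differences between the format IDs used by recommendations and the IDs that have description files. A minimal sketch, assuming both sets of IDs have already been collected as plain strings (the function name and the sample IDs are made up for illustration and are not taken from the SIS code):

```python
def cross_check_format_ids(recommended_ids, described_ids):
    """Compare format IDs used by recommendations with IDs that have description files."""
    recommended = set(recommended_ids)
    described = set(described_ids)
    return {
        # mentioned in recommendations, but no format-description file exists (item 2)
        "missing_descriptions": sorted(recommended - described),
        # description file exists, but no recommendation mentions it (item 3)
        "unreferenced_descriptions": sorted(described - recommended),
    }


# Made-up IDs, for illustration only.
report = cross_check_format_ids(
    recommended_ids=["fPDF", "fTEI", "fCSV"],
    described_ids=["fPDF", "fTEI", "fXML"],
)
print(report)
```

Here `fCSV` would land in the "missing" list and `fXML` in the "unreferenced" list.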

This very ticket is meant for the content of the sanity-checker page.

2. Sanity-checker page

This page should eventually have structured logic and probably repeat some of the distributed information (which, given the modular structure of the SIS, should be trivial).

2.1. What it contains

| # | Context | Target | Check |
|---|---------|--------|-------|
| 1 | recommendation list | domain name | check if set, check if valid |
| 2 | recommendation list | recommendation level | check if set, check if valid |
| 3 | recommendation list | format ID + domain + level | check if repeated/'similar' |
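
The three checks above could be sketched roughly as follows. This is not the actual implementation: it assumes each recommendation has already been read into a dict with hypothetical keys `format`, `domain` and `level`, and the sets of allowed values are placeholders that would really come from the SIS data or the schema. Check 3 only covers exact repeats here, not 'similar' entries.

```python
from collections import Counter

# Placeholder value sets (assumptions); the real ones live in the SIS data or schema.
KNOWN_DOMAINS = {"Annotation", "Documentation"}
KNOWN_LEVELS = {"recommended", "acceptable", "discouraged"}


def check_recommendations(rows):
    """rows: list of dicts with 'format', 'domain' and 'level' keys (possibly missing or empty)."""
    problems = []
    for i, row in enumerate(rows):
        # checks 1 and 2: is the value set, and is it one of the known values?
        for key, allowed in (("domain", KNOWN_DOMAINS), ("level", KNOWN_LEVELS)):
            value = row.get(key)
            if not value:
                problems.append((i, f"{key} not set"))
            elif value not in allowed:
                problems.append((i, f"{key} not valid: {value!r}"))
    # check 3: the same format ID + domain + level triple occurring more than once
    triples = Counter((r.get("format"), r.get("domain"), r.get("level")) for r in rows)
    for triple, count in triples.items():
        if count > 1:
            problems.append((None, f"repeated {count}x: {triple}"))
    return problems
```

Each problem is reported as a pair of row index (or `None` for a list-wide finding) and a message, so the page could group findings per recommendation entry.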

2.2. What else it might contain (and/or how it can get arranged)

There are three main hubs of information that may either get edited or where something external may change 'spoiling' them in some way:

  1. format descriptions under data/formats/
  2. recommendations under data/recommendations/
  3. centre descriptions in data/centres.xml

We thus get three targets for sanity checks, and one should bear in mind that the middle one, recommendations, can be checked for internal coherence but also for whether the associations that it makes (between centres and formats and properties defined inside recommendations) are coherent.

2.2.1. Context: formats

  • items (4) and (5) from section 1 above (extensions, media types)
  • format families (this is a separate can of worms that requires at least one separate ticket)
  • stuff someone might forget to change after using one format description as a template for another:
    • repeated IDs,
    • repeated names, abbreviations
  • extId pointers?
  • check for similarity of format IDs across the entire set of formats (capitalization, hyphenation, etc... partial matching?)
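
The "repeated IDs" and "similar IDs" ideas above might look something like this. It is only a sketch: the normalisation rule (lower-casing and dropping hyphens, underscores, dots and whitespace) is an assumption about what "similar" should mean, and the function names are made up.

```python
import re
from collections import defaultdict


def normalise(format_id):
    """Lower-case the ID and drop hyphens, underscores, dots and whitespace."""
    return re.sub(r"[-_.\s]+", "", format_id).lower()


def group_similar_ids(format_ids):
    """Return groups of distinct IDs that collapse to the same normalised form."""
    groups = defaultdict(set)
    for fid in format_ids:
        groups[normalise(fid)].add(fid)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def repeated(values):
    """Return values that occur more than once, e.g. an ID left over from a template."""
    seen, dupes = set(), set()
    for value in values:
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return sorted(dupes)
```

The same `repeated` helper would work for names and abbreviations. Partial matching is not covered; something like `difflib.SequenceMatcher` from the standard library could be tried for that, at the cost of more false positives.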

2.2.2. Context: recommendations and the associations that they create

Internally to the list of recommendations
  • much of that is already handled in section 2.1 above (and implemented as of Feb 2022)
  • 2.1. also includes item (3), which is about ascribing properties to formats (or rather to format IDs)
  • new: check for similarity of format IDs across the entire set of recommendations (capitalization, hyphenation, etc.)
Associations defined by the recommendations
  • the association with centres is not yet handled (perhaps someone uses a recommendation file for one centre as a template for populating another one?)
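
One possible way to spot the template scenario is to compare the recommendation sets of every pair of centres and flag pairs that are identical or nearly so. A rough sketch, assuming the recommendations have been grouped per centre into collections of (format ID, domain, level) tuples; the function name, the Jaccard-overlap measure and the threshold are all assumptions made for illustration:

```python
from itertools import combinations


def suspicious_centre_pairs(recs_by_centre, threshold=1.0):
    """Return (centre_a, centre_b, overlap) for pairs whose recommendation sets
    have a Jaccard overlap of at least `threshold` (1.0 means identical sets)."""
    flagged = []
    for a, b in combinations(sorted(recs_by_centre), 2):
        set_a, set_b = set(recs_by_centre[a]), set(recs_by_centre[b])
        union = set_a | set_b
        if not union:
            continue  # two centres without any recommendations are not compared
        overlap = len(set_a & set_b) / len(union)
        if overlap >= threshold:
            flagged.append((a, b, overlap))
    return flagged
```

Identical sets may well be legitimate, so a hit here is only a hint for a human to look at, not an error.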

2.2.3. Context: centres

  • centres are associated with the CLARIN database via links -- should we check if these links are live?
  • we might want to check for repeated centres or repeated links in different centre elements
  • we might want to diagnose (somehow) the RI status of a centre
  • within CLARIN, we might want to check the sanity of status indicators, although part of that should be handled by the schema (separate ticket!)
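
The first two ideas (live links, repeated links) could be sketched as below. It assumes the centre IDs and their links have already been extracted from data/centres.xml into (centre ID, link) pairs; how exactly they are stored there is not assumed. All function names are hypothetical, and the `fetch` parameter is injectable so the logic can be tried without network access.

```python
import urllib.error
import urllib.request
from collections import defaultdict


def duplicate_links(centres):
    """centres: list of (centre_id, link) pairs. Return links shared by several centres."""
    by_link = defaultdict(list)
    for centre_id, link in centres:
        # crude normalisation: ignore surrounding whitespace and a trailing slash
        by_link[link.strip().rstrip("/")].append(centre_id)
    return {link: ids for link, ids in by_link.items() if len(ids) > 1}


def default_fetch(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        return None


def dead_links(centres, fetch=default_fetch):
    """Return (centre_id, link, status) for links that fail or answer with status >= 400."""
    dead = []
    for centre_id, link in centres:
        status = fetch(link)
        if status is None or status >= 400:
            dead.append((centre_id, link, status))
    return dead
```

Some servers reject HEAD requests, so a real check might need to fall back to GET before declaring a link dead.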

Please kindly add ideas / comments below.
Please don't assign this ticket; it probably doesn't make sense to put it into a milestone either.
Other, real action tickets should mention this one if they implement some of the above ideas.

@bansp bansp added discussion hub Meant as target for individual action tickets. Please don't assign 'hub' tickets. labels Feb 16, 2022
@bansp bansp pinned this issue Feb 20, 2022
@bansp bansp added this to the the milestone that isn't milestone Mar 21, 2023
@bansp bansp added the sanity-checker stuff that should potentially be handled by the sanity checker label Mar 21, 2023

bansp commented Mar 22, 2023

  • occasionally check the links from centres to the CL database, perhaps
  • repeat the checks that are done by Schematron (or maybe even by the schema), because a submitted document may theoretically escape the validation stage
