contents of the "Statistics" tab #67

Closed
bansp opened this issue Oct 17, 2021 · 11 comments
Labels
discussion homework stuff to think about actively between meetings SIS:functionality

Comments

@bansp
Member

bansp commented Oct 17, 2021

It's time to see what kind of statistics we could derive from the aggregation of the information, and I suggest that we get a proof-of-concept tab going asap (much like the mimeType tab).
The question is, what kind of statistics can we get going now, and what can we work towards?
Our 2019 goal was to see what formats are popular among the centres, relative to the function(s) they are expected to play. That is something that we can get straightforwardly:

  • group by domain, display the number of occurrences of each format (by format ID!), as provided in the recommendation files, and sorted alphabetically
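
A minimal sketch of this grouping in Python, assuming each recommendation can be reduced to a (domain, format ID) pair. The function name `count_formats_by_domain`, the sample domains, and the sample format IDs are made up for illustration; this is not the SIS implementation.

```python
from collections import Counter, defaultdict


def count_formats_by_domain(recommendations):
    """Return {domain: [(format_id, count), ...]}, format IDs sorted alphabetically."""
    counts = defaultdict(Counter)
    for domain, format_id in recommendations:
        counts[domain][format_id] += 1
    # Sort domains, and within each domain sort by format ID.
    return {
        domain: sorted(counter.items())
        for domain, counter in sorted(counts.items())
    }


# Hypothetical sample data: (domain, format ID) pairs.
sample = [
    ("Annotation", "fTEI"),
    ("Annotation", "fCoNLL-U"),
    ("Annotation", "fTEI"),
    ("Audio", "fWAV"),
]
by_domain = count_formats_by_domain(sample)
```

Counting by format ID rather than by display name keeps differently spelled names of the same format from being counted separately.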

Once we get the formal format family tree (or graph) going, we could also:

  • group by format family, display the number of occurrences of each format (by format ID!), as provided in the recommendation files, and sorted alphabetically

I guess that, with the information we already have (or can have, easily), we can also

  • dynamically compute the relevant KPI, the measure being "percentage of centres offering repository services that have published an overview of formats that can be processed in their repository".
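
The KPI above is a simple ratio, which could be sketched as follows. This assumes each centre record carries two boolean flags; the field names `offers_repository` and `has_format_overview` and the sample centres are hypothetical and do not reflect how the SIS stores centre data.

```python
def repository_format_kpi(centres):
    """Percentage of repository-offering centres that published a format overview.

    Returns None when no centre offers repository services (KPI undefined).
    """
    with_repository = [c for c in centres if c["offers_repository"]]
    if not with_repository:
        return None
    published = sum(1 for c in with_repository if c["has_format_overview"])
    return 100.0 * published / len(with_repository)


# Hypothetical sample data.
centres = [
    {"id": "A", "offers_repository": True, "has_format_overview": True},
    {"id": "B", "offers_repository": True, "has_format_overview": True},
    {"id": "C", "offers_repository": True, "has_format_overview": False},
    {"id": "D", "offers_repository": True, "has_format_overview": True},
    {"id": "E", "offers_repository": False, "has_format_overview": False},
]
kpi = repository_format_kpi(centres)
```

Centres that do not offer repository services are left out of the denominator, in line with the wording of the measure.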

What else? Please share your ideas.

@margaretha
Collaborator

margaretha commented Oct 19, 2021

Some basic statistics about the data (e.g. number of formats, domains, etc) have been added in 1eb2c3f

@bansp bansp removed the priority label Oct 19, 2021
@bansp
Member Author

bansp commented Oct 30, 2021

On the one hand, we can list some internal or technical statistics that basically reflect the work performed on extending the SIS: the number of format descriptions, the number of recorded file extensions or media types (not very useful, these two, but they do signal the data content of the SIS). The number of recommendations and recommendations per domain could also, arguably, fall in this category. Also the number of <centre> elements with certain attributes, such as @ri, @status, etc.

But then we perform a kind of epistemic step and assume that we've built a model of an aspect of CLARIN, and look for some statistics describing CLARIN. This is where we can count "the most popular formats per domain", and where we can compute the relevant KPI for centres where @ri = 'CLARIN', and, in time, maybe draw some cross-RI numbers. I would like to get to that second stage, because apart from providing interesting information, it also means a gentle incentive for centres to keep their information current and roughly in line with that provided by other centres (e.g. in terms of granularity of recommendations), or to provide the information at all (and thus affect the KPI).

@bansp bansp changed the title add a "Statistics" tab for formats contents of the "Statistics" tab Oct 30, 2021
@bansp bansp added this to the SIS v. 2.2.0 milestone Jan 7, 2022
@bansp bansp added the homework stuff to think about actively between meetings label Feb 2, 2022
@bansp
Member Author

bansp commented Feb 17, 2022

This is just to note that, as mentioned in the ticket referenced right above this comment, we may want to consider splitting this page into something like "SIS statistics" and "Data visualization". The latter is meant to provide RI-wide (and maybe cross-RI?) visualizations.

@bansp bansp modified the milestones: SIS v. 2.2.0, SIS v. 2.3.0 Feb 24, 2022
@bansp
Member Author

bansp commented Feb 24, 2022

We're moving this issue to the next milestone.

@bansp
Member Author

bansp commented Mar 22, 2022

Eliza has prepared a list-statistics.xq page where a beginning can be seen, ordered by domains and then by the numbers of recommendations in those domains.
We're going to have a settable threshold; what gets displayed is formats with the number of recommendations equal to the threshold and above.
And there will be another setting (it's an experiment!) for showing either all the formats that make the threshold or just the top three values of occurrences. (By that we mean that if formats are tied for the number of recommendations, we get to see all the formats that have the given value).
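
The two settings described above could be sketched roughly like this, for the formats of a single domain. The function name `select_formats`, its parameters, and the sample counts are assumptions made for illustration; the actual list-statistics.xq page may work differently.

```python
def select_formats(counts, threshold=1, top_values=None):
    """Filter {format_id: n_recommendations} by threshold and, optionally, top values.

    top_values limits the output to the N highest distinct counts, so formats
    tied on one of those counts are all kept.
    """
    kept = {f: n for f, n in counts.items() if n >= threshold}
    if top_values is not None:
        # Take the N highest distinct values, not the N highest-ranked formats.
        allowed = set(sorted(set(kept.values()), reverse=True)[:top_values])
        kept = {f: n for f, n in kept.items() if n in allowed}
    # Highest count first; ties ordered alphabetically by format ID.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical recommendation counts for one domain.
counts = {"fA": 5, "fB": 5, "fC": 3, "fD": 2, "fE": 2, "fF": 1}
above_threshold = select_formats(counts, threshold=3)
top_three = select_formats(counts, threshold=1, top_values=3)
```

With these sample counts, `top_three` holds five formats rather than three, because the ties on the highest and third-highest values are all kept.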

This was referenced Mar 23, 2022
@bansp
Member Author

bansp commented May 27, 2022

A lot from this ticket got implemented, though not everything (for one thing, this is a discussion ticket, so it should actually be a hub for action-oriented children, but there are only 24 hours in a... working day...).
I think I will prepare a separate quasi-milestone for discussion tickets (maybe that's what the "discussion" facility at GitHub is for now, but, again, 24hrs...). Crucially, I'm taking this ticket out of milestone 2.3.0. Probably to 2.4.0, while I ponder the sensibility of that extra "timeless" quasi-milestone.

@bansp bansp modified the milestones: SIS v. 2.3.0, SIS v. 2.4.0 May 27, 2022
@bansp
Member Author

bansp commented Oct 3, 2022

While Pooh keeps pondering, this ticket sneaks out of 2.4.0 and onwards, toward a brighter future.

@bansp bansp modified the milestones: SIS v. 2.4.0, SIS v. 2.5.0 Oct 3, 2022
@bansp bansp removed the homework stuff to think about actively between meetings label Jan 24, 2023
@bansp
Member Author

bansp commented Mar 3, 2023

#180 is a related issue.

@bansp bansp added the homework stuff to think about actively between meetings label Mar 3, 2023
@bansp bansp self-assigned this Mar 3, 2023
@bansp
Member Author

bansp commented Mar 3, 2023

Assigning it to myself for careful re-reading, to see if it can be closed.

@bansp
Member Author

bansp commented Mar 27, 2023

Ah, but remember that the issue is linked from https://clarin.ids-mannheim.de/standards/views/list-popular-formats.xq so when it's closed, the link should either go or get replaced by a link to something more permanent.

@bansp
Member Author

bansp commented Mar 28, 2023

Gathering some potentially interesting points from above:

I am going to close this issue, since we need some sort of closure on the goals that have been achieved :-) I'll take the remaining ideas to another ticket.

There is now a new issue dedicated to gathering more ideas on what it can mean for a format to be popular (#201), linked from the corresponding subpage.


2 participants