Proposal to add readme to schema.org #247

Open
moranegg opened this issue Jul 1, 2020 · 16 comments

@moranegg
Contributor

moranegg commented Jul 1, 2020

This issue is part of the SCIWG CodeMeta task force aiming to add all additional properties into schema.org.
Text proposal for adding the readme property in schema.org:

In modern software engineering having a readme file in the root of the source code project is good practice.
It is usually the place for the software description, and many guides exist on how to write a good README.
The metadata describing the software can gain a relevant property with a link to the readme file, which is already possible with the CodeMeta vocabulary.

We suggest using a PID for this property, which can be a URI or a SWHID (linking to an archived copy of the readme):
https://archive.softwareheritage.org/swh:1:cnt:ed2eedca46c719144d2485d1a2c3d25c21b2bcd3;origin=https://doi.org/10.5201/ipol.2018.236;visit=swh:1:snp:e0674ffb865529b05511808d1ee7ba5d72346009;anchor=swh:1:rev:fad7a0486bb7a7cfdbb1c28e28a64f2d3f5e0df9;path=/mlheIPOL/README.txt/

A proposal for the property's description:

   A link to the file named README in a software project, containing the description of the folder and its files. This link can point to an archived copy of the file via a persistent identifier (e.g., a SWHID).

Inspired by https://www.wikidata.org/wiki/Q539662

@arfon
Contributor

arfon commented Jun 1, 2021

This feels like a subset of documentation to me, so my comment in #245 (comment) applies here also. I'm not sure this should be added.

@dgarijo
Contributor

dgarijo commented Jun 3, 2021

I agree this is a type of documentation, but I do find this property very useful by itself. Software repos often have more than one kind of documentation, e.g., a readme and a readthedocs site with more extensive documentation. Sometimes they also have a wiki and an external page. Sometimes they may even be described in an external repository. To me, documentation helps to point to one or multiple documentation pages, i.e., the full documentation, while readme points to the more concise one.

Here I fully agree on the overlap in meaning between readme and documentation and even softwareHelp. I am just highlighting a use case where it is beneficial to distinguish both of them.

@mfenner

mfenner commented Jun 30, 2021

The Force11 Codemeta Task Force suggests using documentation, as it aligns with schema.org. The readme should be referred to in the description.

@dgarijo
Contributor

dgarijo commented Jun 30, 2021

@mfenner what do you mean by "readme" should be in the description? A readme file often contains installation instructions and usage examples, which go beyond a description of what a software component does.

@mfenner

mfenner commented Jun 30, 2021

We felt that readme is a special case of description, that we want to keep the number of codemeta properties small, and that we prefer documentation over readme, as this is a schema.org property.

@tmorrell
Contributor

PR implementing change is #260

@dgarijo
Contributor

dgarijo commented Jul 1, 2021

@mfenner @tmorrell,
I think #260 is problematic. Let me try to justify this:

@mfenner, is this decision based on any statistical analysis of common practice? There are plenty of repositories where the readme goes beyond a description of a repository (including license, citation, installation instructions, requirements...). For example, in GitHub, there are short descriptions of a repository, 2-3 sentences provided by the authors and usually shown in the top-right of the repo. Then you may have a longer description (which is part of the readme file, I agree), and then you may have separate documentation (e.g. in readthedocs). Having "readmes" as "descriptions" is conflating too many elements about the software under a single property.
I find the explanation ("we felt") insufficient. What if next month the group feels differently? Without a proper justification, these terms will keep changing meaning.

@tmorrell this issue is about adding readme in schema.org, not about deleting properties from codemeta. When creating a standard it is usually a very bad practice to delete properties, because people who may have started using codemeta will suddenly have incompatible representations. For example, let's say I have annotated hundreds of repositories adding "readme" in my current implementation by using codemeta. Now the codemeta version suddenly adds a breaking change. If there is an agreement that a property should no longer be part of the standard, it is usually kept with a deprecated tag. That way older versions will still work. Otherwise, people will not be keen to adopt a standard when a new version may break their hard work.

Finally, I answered in this thread responding to an open call for feedback. I feel like this feedback has not been taken into account. It looks like whatever the codemeta TF decided is what is put in the PR. If that's the process this community will follow, then why ask for feedback from the community?

@mfenner

mfenner commented Jul 1, 2021

@dgarijo thank you for your feedback. Let me first respond to the last sentence in your feedback. We spent a lot of time pushing a major update of the codemeta schema, including talking to many people. Your feedback is very valuable, but this does not necessarily mean we follow your feedback. In this case we spent a lot of time discussing this issue in a call yesterday, and also have feedback in the issue comments (from @arfon) that wants this to go in another direction. We are aware of the risk of removing properties, but the major driver is closer alignment with schema.org.

@dgarijo
Contributor

dgarijo commented Jul 1, 2021

@mfenner
I am not saying my feedback has to be followed to the letter, of course! But I think the use cases I have brought up have been dismissed without proper justification because the group "felt" like going in a certain direction. If there is an open forum which I have missed or these decisions have been tracked somewhere else, I am happy to read more to catch up (unfortunately, I am not part of the task force, but happy to join future calls).

Codemeta is already aligned with schema.org; as far as I understand, the effort is to push properties from codemeta into schema.org, right? Removing properties already affects part of my work, which is why I fear for codemeta adoption if removing properties is going to be a commonplace practice.

@tmorrell
Contributor

tmorrell commented Jul 1, 2021

I've closed my PR so there can be more discussion. I still think moving everyone to the standard schema.org property documentation makes more sense than having a custom readme property.

@dgarijo
Contributor

dgarijo commented Jul 1, 2021

ok, let's say that there is a repo, e.g.: https://github.com/tensorflow/tensorflow

They do have a readme: https://github.com/tensorflow/tensorflow/blob/master/README.md and they do have API documentation: https://www.tensorflow.org/api_docs/ They also have tutorials: https://www.tensorflow.org/tutorials/. And more resources (all these are linked in the readme)

Would the documentation point to all these resources independently? (i.e., 3 properties called documentation in the JSON LD)? I think researchers specifically target the readme because they know it's going to be brief, like a hub for resources. In this case I would see a benefit in having documentation point to https://www.tensorflow.org/api_docs/ and https://www.tensorflow.org/tutorials/, while having the readme on its own category. They are all types of documentation, that is why having readme as a subproperty may help.

@cboettig
Member

cboettig commented Jul 1, 2021

Since we all agree that readme is a type of documentation, it seems reasonable that someone writing a codemeta file would place a link to a README there. I think we all agree as well that there are 'different kinds of documentation', and perhaps there could be a clear use case to be made for a Documentation @type class that could provide the appropriate semantic differentiation between "docs" like https://www.tensorflow.org/api_docs/ and "tutorials" like https://www.tensorflow.org/tutorials/ as well as READMEs. But at this time, I think the use case for introducing a new class isn't sufficiently strong, and the common practices of the field are sufficiently varied as to make precise differentiation hard (i.e. we can all agree readme is 'documentation', but sometimes it is also a tutorial and sometimes it isn't...).

Meanwhile, I think placing readme in an array of documentation is more appropriate than leaving it in a separate term, which would seem to suggest it is not documentation at all.

Footnote: it's not 100% clear to me how schema.org/documentation is typed, but if these are proper URIs, then you could have:

...
"documentation": ["https://www.tensorflow.org/tutorials/", "https://github.com/tensorflow/tensorflow/blob/master/README.md"]

And then use these URIs as @id of a CreativeWork, e.g.:

"@id": "https://github.com/tensorflow/tensorflow/blob/master/README.md",
"@type": "CreativeWork",
"name": "Readme",
"description": "a very important type of documentation..."
...

So you can effectively already nest readme under documentation and provide quite a lot of additional context to distinguish it from other documentation without moving it outside of the documentation, right?
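
A minimal sketch of the layout described above, built as a Python dict and serialized with the standard `json` module. The `@context` URL, the top-level `SoftwareSourceCode` type, and the helper name `build_codemeta` are assumptions made for illustration; whether schema.org's documentation property actually accepts this range is the open question raised in this comment.

```python
import json

# URLs taken from the examples in this thread.
README_URL = "https://github.com/tensorflow/tensorflow/blob/master/README.md"
TUTORIALS_URL = "https://www.tensorflow.org/tutorials/"


def build_codemeta():
    """Return a codemeta-style record with the readme nested under documentation.

    Hypothetical helper: the context URL and top-level type are assumptions.
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": "tensorflow",
        "documentation": [
            # A bare URL: no extra context about what kind of document it is.
            TUTORIALS_URL,
            # A typed node: the same kind of link, but described as a CreativeWork.
            {
                "@id": README_URL,
                "@type": "CreativeWork",
                "name": "Readme",
            },
        ],
    }


print(json.dumps(build_codemeta(), indent=2))
```

Mixing bare URLs and typed nodes in one array keeps simple records simple while letting a producer add context only where it matters.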

(I wasn't at the meeting either, I'm just sharing my own thoughts here in the spirit of the discussion).

@dgarijo
Contributor

dgarijo commented Jul 1, 2021

@cboettig,
I think that having readme as a separate term does not necessarily imply it's not documentation.

For example, look at https://schema.org/Property. It inherits two properties: disambiguatingDescription and description. One is a subproperty of the other. It doesn't mean that a disambiguatingDescription is not a description. However, they are separated because conflating everything under description may not be as useful.

Having different types for documentation in an array as you suggest actually solves the problem too. This is similar to the proposal that @moranegg was suggesting, but having the types with classes is cleaner than having role-like properties (in my opinion), because it avoids reification. I can live with this solution :) But we would have to check if the range of documentation is compatible with a list of elements.

The disadvantage of using classes and types is that it's a little harder to consume (before I had only to ask for "readme", now I have to iterate over the "documentation" and find which of them is of type readme); and that in reality we are placing the label somewhere else (instead of properties they become classes). Plus we would have to formally define the types as subclasses of "CreativeWork".
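
To make the consumption cost concrete, here is a rough sketch comparing the two lookups. It assumes the record has already been parsed into a Python dict; the `Readme` type name and the fallback match on `name` are hypothetical, since no such class has been defined in codemeta or schema.org.

```python
def as_list(value):
    """Normalize a JSON-LD value that may be absent, a scalar, or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def readme_from_property(codemeta):
    """Lookup with a dedicated property: a single key access."""
    return as_list(codemeta.get("readme"))


def readme_from_documentation(codemeta):
    """Lookup with typed documentation entries: iterate and filter.

    'Readme' as an @type is a hypothetical class used only for illustration.
    """
    hits = []
    for entry in as_list(codemeta.get("documentation")):
        if not isinstance(entry, dict):
            continue  # bare URL strings carry no type information
        types = as_list(entry.get("@type"))
        names = [n.lower() for n in as_list(entry.get("name")) if isinstance(n, str)]
        if "Readme" in types or "readme" in names:
            hits.append(entry.get("@id"))
    return hits
```

The second function is not long, but every consumer has to agree on what marks an entry as a readme, which is why the types would need a formal definition.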

@cboettig
Member

cboettig commented Jul 1, 2021

I believe the JSON-LD spec allows basically anything to be array-valued, and it should produce sensible RDF triples. Not sure that schema.org adds any restrictions to that.

I certainly acknowledge the difficulties in consuming the data, but that is the blessing and the curse of adopting the linked data model. Any link can be expanded to information about that link. (even if readme were retained a separate predicate, I don't think a tool consuming codemeta to identify 'readme' links could exclude the possibility of finding such links under 'documentation' anyway, so personally I think this simplifies the logic.)

I think whether we need additional refinements of CreativeWork to describe different types of documentation ought to be use-case driven. For the moment, I think a working definition of README could be defined as a filename regex, which may be more robust than assuming every metadata provider explicitly annotates it as such.
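
A rough sketch of what such a filename-based working definition could look like. The pattern and the function name `looks_like_readme` are illustrative assumptions, not something codemeta specifies.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical pattern: "readme" in any case, with an optional single extension.
README_PATTERN = re.compile(r"^readme(\.[a-z0-9]+)?$", re.IGNORECASE)


def looks_like_readme(url):
    """Guess whether a documentation URL points at a README, by filename only."""
    filename = PurePosixPath(urlparse(url).path).name
    return bool(README_PATTERN.match(filename))


docs = [
    "https://www.tensorflow.org/tutorials/",
    "https://github.com/tensorflow/tensorflow/blob/master/README.md",
]
readmes = [url for url in docs if looks_like_readme(url)]
```

A heuristic like this needs no annotation from the metadata provider, but it only recognizes files whose names fit the pattern.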

@dgarijo
Contributor

dgarijo commented Jul 1, 2021

Answering per paragraph:

  1. You are right, lists are treated as multi-valued objects when converting to RDF, so it's fine.

  2. Sorry, I don't follow your argument here. If you have a readme property that is specific for readmes, then why would you be concerned about finding the same links under documentation? That would not be wrong. The term would help identify what is being looked for. It's like my example with disambiguatingDescription and description. Maybe both are the same, maybe one has additional descriptions. What matters is that I can quickly identify the one I want.

  3. Here I disagree. Having regex would not work in cases where the readme files are not called literally "readme" (or similar, depending on your regex). I think codemeta should provide clear guidelines on how the types would be annotated. Whether people end up just having links to files or proper CreativeWork extensions is up to them, but if they choose the latter, then the types should be clearly defined. Otherwise, the problem of defining properties for build instructions, readme, etc. is pushed to the adopters.

@moranegg
Contributor Author

See the new discussion to come to a consensus: #335


6 participants