
Automating metadata file maintenance (templated metadata with a metadata discovery and rendering service) #22

Open
jonathansick opened this issue Jun 6, 2017 · 10 comments

Comments

@jonathansick
Contributor

I'm approaching this working group from a software producer's perspective. At LSST we have a few hundred repositories on GitHub (https://github.com/lsst, https://github.com/lsst-dm, https://github.com/lsst-sqre, https://github.com/lsst-sims are our major GitHub organizations), and have a large group of people contributing to these repos. As much as possible, we rely on automation to move towards a continuous delivery ideal to ensure our code releases are reliable.

The idea of putting something like a CITATION file #2 or codemeta.json #4 in our repositories is great, and I think we're going that route. We especially like codemeta / JSON-LD because it means we can add LSST-specific metadata for our own internal purposes. At the same time, deploying codemeta.json at scale across all of our repositories could cause some maintenance challenges.

If LSST has 500 GitHub repositories, we'd have 500 codemeta.json files. And like documentation, it's sometimes difficult to rely on software developers in each project to keep that metadata accurate and up-to-date. For example, every time there is a new contributor we'd need to add them to codemeta.json. We might add a new code dependency, so we'd have to ensure the dependency metadata is up to date. Or at worst, every new commit on GitHub is in some sense a new release/version of the software for provenance purposes; it's not tractable to have a codemeta.json file committed to a repo reflect that sort of continuous versioning information.

A solution I'm interested in is combining codemeta.json metadata committed to a repository with metadata that's intrinsic to the repository itself. Things you can discover from a software repository are:

  • Name
  • Origin URL
  • Committers and reviewers (see also Mapping committers to authors #8)
  • Version (Git commit, tag, or version embedded in a setup.py file for example; see also How to managing versioning when citing software #16)
  • Date last modified
  • Dependencies
  • License
  • (and anything else that could be discovered from the source code, Git history, or a language-specific metadata file like Python's setup.py or Node.js's package.json)

Here's a system I'm envisioning:

  • Software repositories have a template metadata file. In that template metadata file, we put metadata that can't be discovered by any other means, like names of funding agencies, technical managers, and non-code contributors.
  • There's a web service (i.e., REST API) capable of generating a fully hydrated codemeta.json object on-demand for a Git repository at any Git ref. The web service inspects the Git repository for metadata and merges that metadata with the existing, manually maintained template metadata file.
  • When we make a code release, or even create a maintenance branch for a release on GitHub, we use the web service to render codemeta.json and commit that metadata into the Git repository/software distribution. Potentially the master branch could even carry the codemeta.json rendered from the latest release. This metadata rendering and committing happens automatically on the continuous integration server. (A minimal sketch of the discovery-and-merge step follows below.)
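
To make this concrete, here is a minimal sketch (in Python) of the discovery-and-merge step such a service might perform for one repository. The codemeta_template.json file name, the particular discovered fields, and the rule that human-entered values win over discovered ones are all assumptions for illustration, not a proposed standard.

```python
# Minimal sketch of the discovery + merge step such a service might perform.
# Assumes a local clone containing a hand-maintained "codemeta_template.json";
# file name, fields, and precedence are illustrative only.
import json
import subprocess
from pathlib import Path


def git(repo_path, *args):
    """Run a git command in the repository and return its stripped output."""
    return subprocess.check_output(
        ["git", "-C", str(repo_path), *args], text=True
    ).strip()


def discover_metadata(repo_path):
    """Collect metadata that is intrinsic to the Git repository itself."""
    return {
        "codeRepository": git(repo_path, "config", "--get", "remote.origin.url"),
        "version": git(repo_path, "describe", "--tags", "--always"),
        "dateModified": git(repo_path, "log", "-1", "--date=short", "--format=%cd"),
    }


def render_codemeta(repo_path):
    """Merge the hand-maintained template with discovered metadata."""
    template = json.loads((Path(repo_path) / "codemeta_template.json").read_text())
    merged = {"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}
    merged.update(discover_metadata(repo_path))
    merged.update(template)  # human-entered template values win over discovered ones
    return merged


if __name__ == "__main__":
    print(json.dumps(render_codemeta("."), indent=2))
```

In the envisioned system this logic would sit behind the REST API and be invoked by CI at release time rather than run by hand.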

In some ways, this is similar to how we're approaching software documentation. Combining code and its documentation in the same repository helps make a software product more self-contained from a developer's perspective and makes it easier to maintain versioned documentation. In the same way, a codemeta.json embedded in a repository is useful for maintaining versioned metadata. But we also rely on automation in a continuous integration service to help us produce, render, and validate the documentation (for example, generating an API reference by inspecting the code base and merging API signatures with human-written documentation strings).

I'm curious if others have thought about the maintenance of codemeta.json files at scale, and whether this approach is generally tractable?

A significant challenge is that the web service needs to know how to introspect the software. At LSST we have some non-standard practices for building software, so we'd need to implement a web service that knows about the LSST build system, in addition to standard Python PyPI packaging, for example.

A spin-off of this approach is a "linting" service that runs in continuous integration and identifies when metadata in codemeta.json is out of date. In this case, a developer would still maintain codemeta.json manually, but would be forced to resolve metadata discrepancies before merging a PR.
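
A rough sketch of what such a lint step could look like in CI follows; for simplicity it only checks the version field, and discover_metadata() here is a stand-in for whatever introspection the real service would implement.

```python
# Sketch of a CI "lint" step that fails when codemeta.json disagrees with
# metadata discovered from the repository itself.
import json
import subprocess
import sys
from pathlib import Path


def discover_metadata(repo_path="."):
    """Placeholder discovery step: only the version is checked here."""
    version = subprocess.check_output(
        ["git", "-C", repo_path, "describe", "--tags", "--always"], text=True
    ).strip()
    return {"version": version}


def lint_codemeta(repo_path="."):
    committed = json.loads((Path(repo_path) / "codemeta.json").read_text())
    problems = []
    for field, discovered in discover_metadata(repo_path).items():
        if committed.get(field) != discovered:
            problems.append(
                f"{field}: codemeta.json has {committed.get(field)!r}, "
                f"repository says {discovered!r}"
            )
    return problems


if __name__ == "__main__":
    issues = lint_codemeta()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)  # a non-zero exit blocks the PR in CI
```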

@astrofrog

This would be really useful for Astropy, which also has quite a few repositories, and where manual curation would be difficult. We should discuss this!

@astrofrog

I would advocate a slightly different approach: rather than having a full template file, we could have a configuration YAML file that includes, e.g., a list of fields to automatically update, and options such as how to add new contributors (e.g. adding missing contributors, always sorting authors in some way, etc.). There would then be a single JSON file (codemeta.json) that authors edit, while the YAML file would whitelist the fields that can be auto-updated.
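
To sketch what such a configuration might look like, here is an invented example (the field names are not a proposal, PyYAML is assumed for parsing, and only the whitelist behaviour is actually exercised):

```python
# Invented YAML configuration plus the whitelist logic it would drive.
import yaml

CONFIG = yaml.safe_load("""
auto_update_fields:        # fields the service is allowed to overwrite
  - version
  - dateModified
  - softwareRequirements
contributors:
  add_missing: true        # intent: append committers not yet listed
  sort_by: family-names    # intent: keep the author list deterministically ordered
""")


def apply_updates(codemeta, discovered, config=CONFIG):
    """Overwrite only whitelisted fields; everything else stays author-owned."""
    updated = dict(codemeta)
    for field in config["auto_update_fields"]:
        if field in discovered:
            updated[field] = discovered[field]
    return updated


if __name__ == "__main__":
    print(apply_updates({"version": "1.0.0", "name": "example"},
                        {"version": "1.1.0", "name": "should-not-change"}))
    # -> version is updated; name is preserved because it is not whitelisted
```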

@jonathansick
Contributor Author

I like your approach @astrofrog. The codemeta.json file in the repo would be fully rendered (updated on a per-release basis, say), while, as you say, the something.yaml file would hold the YAML structures that configure the codemeta generator/updating service. That's much less magic than having the service try its best to hydrate a partially filled-in codemeta object. 👍

I suppose some next steps are to more clearly write down some user stories, and start defining a syntax for the configuration YAML file.

@jonathansick
Contributor Author

We've created a repository to start developing this idea: https://github.com/codemeta-gen/metagen

I think our first task there is to begin documenting the YAML configuration files as a means of designing the system.

@arfon
Contributor

arfon commented Oct 25, 2017

👋 @jonathansick & @astrofrog - I've probably mentioned this to you in the past, but a couple of years back I started work on a RubyGem to do something similar to this: https://github.com/arfon/metamatter

I'd be happy to walk you through the library sometime if it's not obvious what it's doing.

@moranegg
Contributor

@jonathansick The specification you proposed for generating the codemeta.json file from intrinsic metadata, while minimizing the maintenance of the codemeta files (which is a problem even with only 10 files), is exactly what we should strive for when implementing the software citation workflow. @cboettig has developed a tool that generates codemeta.json for R packages: codemetar.

@arfon that's great! It's the first time I'm seeing this tool. We should reference it on the tools page; I think it should be more visible.

@jonathansick
Contributor Author

@arfon @moranegg Those both look great.

My application is generating metadata for the Python ecosystem that my project (@lsst) lives in. I also want a metadata generator that's extremely pluggable. LSST has a lot of unconventional software packaging and metadata sources, so being able to write plugins that extract and transform metadata is a high priority for me. R and Ruby place a higher barrier to entry for that than Python, for us at least.
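
For what it's worth, here is one hypothetical shape for that plugin interface: each extractor recognises one packaging convention and contributes whatever codemeta fields it can infer, so a site with unusual build tooling registers an extra extractor next to the generic ones. The names and division of labour are illustrative only.

```python
# Hypothetical plugin interface for metadata extractors.
from abc import ABC, abstractmethod
from pathlib import Path


class MetadataExtractor(ABC):
    @abstractmethod
    def handles(self, repo_path: Path) -> bool:
        """Return True if this extractor recognises the repository layout."""

    @abstractmethod
    def extract(self, repo_path: Path) -> dict:
        """Return a partial codemeta-style dict of whatever it can infer."""


class SetupPyExtractor(MetadataExtractor):
    """Generic extractor for conventional Python packages."""

    def handles(self, repo_path):
        return (repo_path / "setup.py").exists()

    def extract(self, repo_path):
        # A real implementation would parse the setup() call; elided here.
        return {"programmingLanguage": "Python"}


def gather(repo_path, plugins):
    """Run every applicable plugin and merge their contributions."""
    metadata = {}
    for plugin in plugins:
        if plugin.handles(repo_path):
            metadata.update(plugin.extract(repo_path))
    return metadata


if __name__ == "__main__":
    print(gather(Path("."), [SetupPyExtractor()]))
```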

For sure, though, I want to study what you're doing in metamatter and CodeMetar. Maybe I'll take you up on that walk-through at some point :)

@cboettig

cboettig commented Nov 4, 2017

@jonathansick et al. Really excellent thread here raising some important issues; in particular, developing a robust maintenance strategy for codemeta.json to avoid the metadata being invalid or out of date.

The approach we have taken with codemetar is to automate the creation of codemeta.json from available metadata that authors are (presumably) already maintaining elsewhere in their software. We see this in several steps:

  1. Map available metadata from the language's config/docs files: this is the basic motivation for the notion of the crosswalk table, e.g. for Python's distutils.

  2. Map available metadata from the GitHub repo. This can include information available from the GitHub API (https://codemeta.github.io/crosswalk/github/), with precedence given to the config files; in codemetar we also do things like scraping README badges to infer things like Travis URLs.

  3. Support manual updating of the codemeta.json. Authors can edit codemeta.json directly to add additional fields. By default, codemetar overwrites any field it can infer over the existing value, but preserves any existing field for which no inferred value is found. (The sketch after this list illustrates the crosswalk and this precedence rule.)
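
Here is a rough illustration, in Python rather than codemetar's actual R implementation, of the crosswalk in step 1 and the precedence rule in step 3. The field mapping is a toy subset, not the official crosswalk table.

```python
# Toy crosswalk from distutils/setup() keywords to codemeta properties,
# plus the precedence rule: inferred values win, but existing fields with
# no inferred counterpart are preserved.
CROSSWALK_DISTUTILS = {
    "name": "name",
    "version": "version",
    "url": "codeRepository",
    "license": "license",
}


def crosswalk(setup_kwargs, mapping=CROSSWALK_DISTUTILS):
    """Translate language-specific metadata keys into codemeta terms."""
    return {mapping[k]: v for k, v in setup_kwargs.items() if k in mapping}


def merge(existing, inferred):
    """Inferred values override existing ones; unmatched fields survive."""
    merged = dict(existing)
    merged.update({k: v for k, v in inferred.items() if v is not None})
    return merged


if __name__ == "__main__":
    existing = {"name": "mypkg", "version": "0.9", "funder": "Some Agency"}
    inferred = crosswalk({"name": "mypkg", "version": "1.0", "license": "BSD"})
    print(merge(existing, inferred))
    # -> version becomes "1.0"; funder survives because nothing was inferred for it
```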

I suppose we're basically doing the 'hydrate a partially filled codemeta.json' approach, as you so succinctly put it.

Ideally this would be wired into the release process, which would handle chicken-and-egg issues like updating the version in both the language docs and codemeta.json, pre-reserving a DOI and entering that into the JSON, uploading to the archive (Zenodo), tagging the release on GitHub, and uploading to the distribution mirror (CRAN in the R case). This part doesn't fully exist yet, since we cannot yet reserve versioned DOIs on Zenodo.
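
For concreteness, one hypothetical ordering of such a release driver is sketched below; every function is a placeholder and the DOI is a dummy string. The point is only that reserving the DOI first lets it be written into codemeta.json before the tree is tagged, archived, and distributed.

```python
# Hypothetical release ordering; every step is a placeholder.
def reserve_doi():
    return "10.5281/zenodo.0000000"  # dummy value standing in for a pre-reserved DOI


def write_metadata(version, doi):
    print(f"update language config files + codemeta.json with {version} and {doi}")


def tag_release(version):
    print(f"git tag v{version}")


def archive_release(doi):
    print(f"deposit the tagged tree under {doi} (e.g. on Zenodo)")


def upload_distribution():
    print("submit to the distribution mirror (CRAN, PyPI, ...)")


def release(version):
    doi = reserve_doi()            # 1. reserve the DOI up front
    write_metadata(version, doi)   # 2. stamp version + DOI into every metadata file
    tag_release(version)           # 3. tag the commit that carries that metadata
    archive_release(doi)           # 4. archive the tagged tree
    upload_distribution()          # 5. push to the distribution mirror


if __name__ == "__main__":
    release("1.2.0")
```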

It is far from obvious that this is the ideal strategy, but just thought I'd share what we'd done so far.

I like your idea of allowing users to enter metadata directly as YAML instead of JSON, since it's more user-friendly and has a well-defined automated mapping into JSON. (Of course, any valid JSON is also valid YAML.) This still leaves open the issues of how to resolve conflicts between metadata fields declared here and inferred fields, and how to avoid stale metadata (particularly versions and DOIs) slipping in here.

I think you're suggesting that the .yml / codemeta.json metadata would take precedence over any potentially inferred metadata field (or rather, that we ignore inferring metadata for the time being), which may indeed be a more robust strategy, so I'd love to hear more thoughts on this. In the case of R, the DESCRIPTION file is already a relatively rich metadata source that authors are usually pretty good at filling out (the packaging tools enforce a lot of this already), so it seemed natural to build on that rather than start from scratch, but that may not make sense in other contexts where relatively little good metadata is automatically or reliably available.

@danielskatz
Collaborator

I agree this is a great thread with some really good discussion of an important issue.

My take is:

As I mention in https://danielskatzblog.wordpress.com/2017/09/25/software-heritage-and-repository-metadata-a-software-citation-solution/, I would like to get to a point where authors who want credit provide the needed metadata, just like they create README, CONTRIBUTING, and LICENSE files. I was thinking that they would do this in a codemeta.json file, but a YAML version also seems fine; whatever we think is easier.

One question is which fields are properties of the template (organization) and which are properties of the repo (software) itself. I'm not sure there will be a single answer to this that covers both a project like LSST and a small lab project.

And I agree that, although almost everything there (except the authors) can be generated automatically, information the authors have provided should be used rather than what would be generated.

The author metadata is the only thing that I think cannot be generated automatically. The authors are not the set of GitHub contributors, and the set of GitHub contributors is not the authors, though there is likely some overlap. If the authors are not specified explicitly, I really think the best thing to do is to name the project as the author, rather than guess inaccurately.

To give a brief example, one case is a person who had the idea for a piece of code, got the funding to develop it, and designed it on a whiteboard, but never committed anything to the repo. This person likely should be an author.

On the other side, imagine an administrator who updates the license file, and thus is a committer, even without making any intellectual contribution to the software. This person likely should not be an author.

@cboettig

cboettig commented Nov 6, 2017

@danielskatz Great points about the author issues, I agree entirely that the GitHub commit records are not a good indicator of this for exactly the reasons you outline.

In the R community we are pretty used to specifying author information in the DESCRIPTION file, even along with author roles (e.g. a "contributor" in an R package is distinct from an "author" and is omitted from the author list generated by R's citation() function). Also, CRAN now officially supports the pattern we introduced in codemetar to list ORCID IDs.

Other language/distribution config files have somewhat similar support for identifying authors (e.g. Python distutils, npm package.json; see https://codemeta.github.io/crosswalk/), but maybe having people specify this directly in codemeta.json (or YAML) is easiest.
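
For example, a directly specified author entry in codemeta.json might look like the fragment below (names and the ORCID URL are invented; the author/contributor split mirrors the role distinction mentioned above):

```python
# Made-up codemeta fragment showing directly specified authorship metadata.
import json

codemeta_fragment = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@type": "Person",
            "givenName": "Ada",
            "familyName": "Example",
            "@id": "https://orcid.org/0000-0000-0000-0000",  # invented ORCID
        }
    ],
    "contributor": [
        {"@type": "Person", "givenName": "Grace", "familyName": "Sample"}
    ],
}

print(json.dumps(codemeta_fragment, indent=2))
```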
