Metadata Relationship and Functionality between Publisher and Zenodo repo #5

Closed
augustfly opened this issue Mar 9, 2015 · 6 comments


@augustfly

I'd like to open a discussion about the choice of metadata for establishing the relationship between the original blog publication and the Zenodo repository. At the bottom I mention functional issues related to capturing this metadata.

For example, in the GitHub=>Zenodo software exchange, the relationship between the original GitHub repository and the frozen Zenodo repo is encoded as an "IsSupplementTo" relationship:

<relatedIdentifiers>
<relatedIdentifier relationType="IsSupplementTo" relatedIdentifierType="URL">
https://github.com/henrikstranneheim/MIP/tree/v2.2.0
</relatedIdentifier>
</relatedIdentifiers>

As this is not a discussion of that workflow, I will reserve my concerns about the choice of "IsSupplementTo" in any case where 1:1 mirroring has occurred. However, there are many more semantically relevant relationships that could be used to encode the relationship between the original Publisher (blog post) and the preserved repo. The controlled list from DataCite Schema 3.1 (PDF) includes:

IsCitedBy
Cites
IsSupplementTo
IsSupplementedBy
IsContinuedBy
Continues
HasMetadata
IsMetadataFor
IsNewVersionOf
IsPreviousVersionOf
IsPartOf
HasPart
IsReferencedBy
References
IsDocumentedBy
Documents
IsCompiledBy
Compiles
IsVariantFormOf
IsOriginalFormOf
IsIdenticalTo
IsReviewedBy
Reviews
IsDerivedFrom
IsSourceOf

The appendix of the above PDF schema describes them in detail. I'd rather not pollute the discussion by inserting any opinion yet, but encourage others to think about this matter a bit.
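
To make the controlled list concrete, here is a minimal Python sketch that checks a candidate relationType against the values listed above and emits a `<relatedIdentifiers>` fragment in the same shape as the GitHub example. The function name and the blog URL are hypothetical, and the relation type in the usage line is only the one already shown for GitHub, not a recommendation.

```python
import xml.etree.ElementTree as ET

# relationType values copied from the DataCite Schema 3.1 list above.
RELATION_TYPES = {
    "IsCitedBy", "Cites", "IsSupplementTo", "IsSupplementedBy",
    "IsContinuedBy", "Continues", "HasMetadata", "IsMetadataFor",
    "IsNewVersionOf", "IsPreviousVersionOf", "IsPartOf", "HasPart",
    "IsReferencedBy", "References", "IsDocumentedBy", "Documents",
    "IsCompiledBy", "Compiles", "IsVariantFormOf", "IsOriginalFormOf",
    "IsIdenticalTo", "IsReviewedBy", "Reviews", "IsDerivedFrom",
    "IsSourceOf",
}


def related_identifier_xml(url, relation_type):
    """Return a <relatedIdentifiers> fragment linking a record to `url`.

    Hypothetical helper: it rejects anything outside the controlled list
    and does not pick a relation type for you.
    """
    if relation_type not in RELATION_TYPES:
        raise ValueError(f"not a DataCite 3.1 relationType: {relation_type!r}")
    parent = ET.Element("relatedIdentifiers")
    child = ET.SubElement(
        parent,
        "relatedIdentifier",
        {"relationType": relation_type, "relatedIdentifierType": "URL"},
    )
    child.text = url
    return ET.tostring(parent, encoding="unicode")


# Placeholder URL; the relation type is the one from the GitHub example.
print(related_identifier_xml("https://example.org/blog/my-post", "IsSupplementTo"))
```

Swapping the second argument for any other value in the list is all it takes to try out a different semantic choice, which is the part that is up for discussion here.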

Additionally, there are functional bits to consider about how to capture, preserve, and expose via citation this post<=>repo relationship. A few that come to mind are:

  • the Zenodo workflow needs to enable the Publisher to preserve the original URL (relatedIdentifierType="URL") of the post.
    • no such relationship was established for DF-M's post, the OP for our entire exercise, and it is not clear to me whether that was a result of the Zenodo ingest workflow or an oversight by @dfm 👋
  • ADS should/must/really ought to expose this relationship between an original blog post and the preserved resource.
    • There are plenty of examples where that relationship is internally curated by ADS (ASCL and CDS data to article links), but in this case the relationship can and should be captured at ingest when the Publisher is ready to publish.
    • this relationship should also be encoded in the full ADS record;
    • this relationship should also be encoded in the bibtex export. Why? I suspect that publishers will want to see or even reuse the semantic relationship between an informal document and the DOI-encoded repository where it is preserved. To put that another way, we want to know what it means when we see a DOI (it ain't just citable 📇)
@owlice

owlice commented Mar 9, 2015

On Mar 8, 2015, at 9:20 PM, August Muench notifications@github.com wrote:

the Zenodo workflow needs to enable the Publisher to preserve the original URL (relatedIdentifierType="URL") of the post.
there was no such relationship established by the OP to our entire exercise DF-M's post, and it is not clear to me if that was a result of the Zenodo ingest workflow or oversight by @dfm
ADS should/must/really ought to expose this relationship between an original blog post and the preserved resource.
There are plenty of examples where that relationship is internally curated by ADS (ASCL and CDS data to article links), but in this case the relationship can and should be captured at ingest when the Publisher is ready to publish.
this relationship should also be encoded in the full record ADS record;
The reason ASCL doesn’t pass the original URL (in the ASCL’s case, the code download site) to ADS is that the URL is likely to be ephemeral over time. I think this is likely to happen with blogs, too, over the long term. I was not involved in the exercise with DF-M’s post, but I suspect DF-M did not overlook the preservation of the original URL, given this exchange:

owlice @owlice 17 hours ago
@exoplaneteer What do you mean by "trusted repository"? @jonathansick @augustmuench @doug_burke @geerthub @arfon @ZENODO_ORG
Dan F-M @exoplaneteer 17 hours ago
@owlice I mean that I redesign my website every few months & delete posts… @jonathansick @augustmuench @doug_burke

It seems to me that if Zenodo is to act as the archive for these posts, it's better to let it actually do that, and if ADS is indexing posts in an archive… well, it’s indexing the archive, not the blogs themselves, and capturing/exposing the original URLs is capturing information that in 10, 15, 20 years’ time will likely be mostly worthless.

(Many distractions here; if this is a jumbled mess, let me know and I’ll try again tomorrow!)

Alice

@augustfly
Author

I don't see the logic of any of these points @owlice except that @dfm did include his original blog post URL in the description of the abstract. My point there was that the Zenodo ingest could have made the relationship explicit in the metadata, and probably should for this AIAC project. And that ADS or ASCL should expose this provenance for mirrored archives.

The reason why I do not follow any of your logic is that duplication and mirroring of resources is a huge problem if the provenance cannot be unpacked. Do you need me to prove that? To assert otherwise because links are ephemeral is nonsense; all links are ephemeral.

For an archive or a curated index to dereference the original resource for arguments about perpetuity is to substitute the repo or curated index as the valuable piece. They are valuable, just not more valuable than the original post/object, especially in the case of curated indices that do not capture the original object's content. When archives do capture the original content, they build from and preserve the original URL even if it is eventually broken. See the Wayback Machine.

Preserve the original link. I was looking for discussion about the semantics of preserving the provenance.

@owlice

owlice commented Mar 9, 2015

If the blog post has been deleted (rendering not just the link to it but the actual piece itself ephemeral), isn’t the archive then the valuable piece? (Isn't that the point of the archive?) And if the archive has the original URL, isn’t that preservation enough? I can understand pushing the original URL to Zenodo and having it exposed there; I do think that’s a good idea. I’m not following your reason for wanting ADS to hold it. That just strikes me as unnecessary unless ADS is going to actually archive the blog post (and if it’s going to do that, then why bother with Zenodo?!), but hey, I’m not the one you need to convince, and I’ll stop getting in the way of the discussion! (Sorry!!)

Does Zenodo build from the original URL?


@jonathansick
Member

@augustfly I just want to add that in PR #4 I describe the metadata that is associated with a Zenodo deposition. The full REST API is here. That's basically what we have to work with when we're adding materials to a Zenodo community.
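
As a rough illustration of where the original post URL could go in a deposition, here is a Python sketch that only builds a metadata payload and makes no network calls. The field names (`related_identifiers`, `identifier`, `relation`, `upload_type`, `publication_type`) and the lowerCamelCase relation spelling reflect my understanding of the Zenodo deposit metadata and are assumptions to check against the REST API documentation; the function names, the example values, and the relation chosen in the usage line are hypothetical.

```python
import json


def to_zenodo_relation(datacite_relation_type):
    """Lower-case the first letter, e.g. IsSupplementTo -> isSupplementTo.

    Assumption: Zenodo spells relation names in lowerCamelCase.
    """
    return datacite_relation_type[0].lower() + datacite_relation_type[1:]


def deposition_metadata(title, description, creator, post_url, relation_type):
    """Build a deposition metadata payload that records the original post URL."""
    return {
        "metadata": {
            # Assumed values for a blog post; verify against the API docs.
            "upload_type": "publication",
            "publication_type": "other",
            "title": title,
            "description": description,
            "creators": [{"name": creator}],
            # The post<=>repo relationship under discussion goes here.
            "related_identifiers": [
                {
                    "identifier": post_url,
                    "relation": to_zenodo_relation(relation_type),
                }
            ],
        }
    }


payload = deposition_metadata(
    title="Example blog post",
    description="Archived copy of a blog post.",
    creator="Doe, Jane",
    post_url="https://example.org/blog/my-post",  # placeholder URL
    relation_type="IsSupplementTo",  # illustrative only, not a recommendation
)
print(json.dumps(payload, indent=2))
```

Keeping the relation type as a parameter means the ingest workflow can adopt whatever this thread settles on without changing the payload structure.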

I'll also agree that the post<=>repo relationship is important, though I don't have the background to make an educated suggestion.

As a layperson and potential publisher, I'd be concerned that if Zenodo became the de facto place to see my blog post, then that would take some 'value' away from my blog. (Thankfully we're not dealing with publishers who are paid by ad clicks or page views). Also, the viewing experience on Zenodo itself probably won't be that great (?). A regular reader will probably want to be whisked as quickly as possible to the original web page where there are proper CSS layouts etc. for the blog post. So in that sense I see the content on Zenodo more as a backup than a place to go read the content.

@aaccomazzi

What Gus said. Basically, let's try to capture as much as possible about provenance/relationship early in the game. The DataCite 3.1 format (which is supported by Zenodo) supports both <alternateIdentifiers> and <relatedIdentifiers>. The original blog post URL should be in one of the two (I'd have to read the full spec to give a more educated guess but that's TL;DR for now). See example below from the DataCite website:

<alternateIdentifiers>
<alternateIdentifier alternateIdentifierType="URL">
http://schema.datacite.org/schema/meta/kernel-3.1/example/datacite-example-full-v3.1.xml
</alternateIdentifier>
</alternateIdentifiers>
<relatedIdentifiers>
<relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="citeproc+json" schemeURI="https://github.com/citation-style-language/schema/raw/master/csl-data.json">
http://data.datacite.org/application/citeproc+json/10.5072/example-full
</relatedIdentifier>
<relatedIdentifier relatedIdentifierType="arXiv" relationType="IsReviewedBy">arXiv:0706.0001</relatedIdentifier>
</relatedIdentifiers>
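
For anyone who wants to see what a consumer such as an indexer could pull out of these two elements, here is a small Python sketch using only the standard library. It wraps an abbreviated copy of the example above in a hypothetical `<resource>` root so the fragment parses, and it ignores the XML namespace that a full DataCite record would declare; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example above, wrapped in a root element.
FRAGMENT = """
<resource>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="URL">
      http://schema.datacite.org/schema/meta/kernel-3.1/example/datacite-example-full-v3.1.xml
    </alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="arXiv" relationType="IsReviewedBy">arXiv:0706.0001</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def list_identifiers(xml_text):
    """Return (alternates, related) found anywhere in the XML text.

    alternates: list of (type, value)
    related:    list of (relationType, identifierType, value)
    """
    root = ET.fromstring(xml_text.strip())
    alternates = [
        (el.get("alternateIdentifierType"), (el.text or "").strip())
        for el in root.iter("alternateIdentifier")
    ]
    related = [
        (
            el.get("relationType"),
            el.get("relatedIdentifierType"),
            (el.text or "").strip(),
        )
        for el in root.iter("relatedIdentifier")
    ]
    return alternates, related


alternates, related = list_identifiers(FRAGMENT)
for kind, value in alternates:
    print("alternate", kind, value)
for relation, kind, value in related:
    print("related", relation, kind, value)
```

Whichever of the two elements ends up holding the blog post URL, the lookup has the same shape, so the choice can be made on semantics rather than on tooling.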

I'm still not sure what the deposited content from the blog will look like in Zenodo, but if we are thinking that this is some kind of a pdf-ified version of the website, then I for one would want to see the original (assuming it's still live of course), not just Zenodo's "bad copy".

And just to be clear, there is no magic in all of this: Zenodo, DOIs, and metadata schemas just give us tools and technology that will help us do a good (maybe just decent) job at persisting some of this content. But that does not mean that anything at the end of a URL is flaky or bad and everything with a DOI is awesome. Since we can't solve the 404 problem of the web in general, all we can do is mitigate its impact through technology and social constructs.

@kelle
Member

kelle commented Mar 8, 2016

@kelle kelle closed this as completed Mar 8, 2016