can't do preston update on content-based URIs #125

Open
mielliott opened this issue Jun 8, 2021 · 1 comment

Comments

@mielliott
Collaborator

mielliott commented Jun 8, 2021

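preston update fails to dereference content-based URIs. The underlying HTTP client rejects the hash:// scheme ("hash protocol is not supported"), and the failure is logged as a warning with a stack trace: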

$ preston update 'hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66'
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<https://preston.guoda.bio> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <http://www.w3.org/ns/prov#startedAtTime> "2021-06-08T14:36:24.593Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <http://www.w3.org/ns/prov#wasStartedBy> <https://preston.guoda.bio> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/dcmitype/Software> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<https://doi.org/10.5281/zenodo.1410543> <http://purl.org/dc/terms/bibliographicCitation> "Jorrit Poelen, Icaro Alzuru, & Michael Elliott. 2019. Preston: a biodiversity dataset tracker (Version 0.0.1-SNAPSHOT) [Software]. Zenodo. http://doi.org/10.5281/zenodo.1410543"@en <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/dc/terms/description> "A biodiversity dataset graph archive."@en <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
<hash://sha256/6d924b3cc007cdb2fd78eab535dd9102563ebdddf4e0e30b00b50bde555f5e68> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> .
[main] WARN bio.guoda.preston.store.Archiver - failed to dereference [<hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66>]
org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at bio.guoda.preston.ResourcesHTTP.asInputStream(ResourcesHTTP.java:75)
        at bio.guoda.preston.ResourcesHTTP.asInputStream(ResourcesHTTP.java:55)
        at bio.guoda.preston.ResourcesHTTP.asInputStream(ResourcesHTTP.java:59)
        at bio.guoda.preston.store.DereferencerContentAddressed.dereference(DereferencerContentAddressed.java:19)
        at bio.guoda.preston.store.DereferencerContentAddressed.dereference(DereferencerContentAddressed.java:8)
        at bio.guoda.preston.store.Archiver.handleBlankVersion(Archiver.java:49)
        at bio.guoda.preston.store.VersionProcessor.on(VersionProcessor.java:28)
        at bio.guoda.preston.process.StatementsListenerEmitterAdapter.on(StatementsListenerEmitterAdapter.java:12)
        at bio.guoda.preston.cmd.CmdUpdate.processQueue(CmdUpdate.java:61)
        at bio.guoda.preston.cmd.CmdActivity.run(CmdActivity.java:124)
        at bio.guoda.preston.cmd.CmdActivity.run(CmdActivity.java:77)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:19)
Caused by: org.apache.http.HttpException: hash protocol is not supported
        at org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(DefaultRoutePlanner.java:89)
        at org.apache.http.impl.client.InternalHttpClient.determineRoute(InternalHttpClient.java:125)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        ... 16 more
<https://deeplinker.bio/.well-known/genid/aa0abea3-9689-3cc9-a467-ccfde0d5544f> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<https://deeplinker.bio/.well-known/genid/aa0abea3-9689-3cc9-a467-ccfde0d5544f> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <http://www.w3.org/ns/prov#generatedAtTime> "2021-06-08T14:36:25.373Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:c716b299-09e1-401d-8b7d-f5fd948694f6> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> <http://www.w3.org/ns/prov#used> <hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .
<hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66> <http://purl.org/pav/hasVersion> <https://deeplinker.bio/.well-known/genid/aa0abea3-9689-3cc9-a467-ccfde0d5544f> <urn:uuid:a2a5c80e-32fc-490d-9615-316e7f2e24bd> .

Use case: after finding records (lines) in datasets using preston grep, I thought it would be fun to hash each of the lines so that I could see which ones have identical content. preston update seemed like a convenient way to do this.

This ties into an older idea: verifying the reliability (as defined in Elliott et al. 2020) of Preston datasets by running preston update on an existing Preston-generated provenance log. We expected such a run to find exactly the same content, since it would dereference content-based identifiers instead of location-based ones.
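To make the verification idea concrete, here is a minimal Python sketch of the check. It is not Preston's implementation: the function names are hypothetical, and fetching the bytes for a hash URI (for example from a local content store) is assumed to happen elsewhere. The sketch only shows how a hash://sha256/ URI could be recognized as content-based and how dereferenced bytes could be compared against it.

```python
import hashlib
from urllib.parse import urlparse


def is_content_based(uri: str) -> bool:
    """Return True for hash:// URIs, which an HTTP client cannot fetch directly."""
    return urlparse(uri).scheme == "hash"


def verify_content(uri: str, content: bytes) -> bool:
    """Check that `content` hashes to the digest named in a hash://sha256/ URI.

    Only sha256 is handled here; other algorithms are out of scope for this sketch.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hash" or parsed.netloc != "sha256":
        raise ValueError(f"not a hash://sha256/ URI: {uri}")
    expected = parsed.path.lstrip("/")
    return hashlib.sha256(content).hexdigest() == expected
```

A reliability check would then call verify_content for each content-based identifier in the provenance log and report any mismatch.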

@mielliott
Collaborator Author

> Use case: after finding records (lines) in datasets using preston grep, I thought it would be fun to hash each of the lines so that I could see which ones have identical content. preston update seemed like a convenient way to do this.

I realize that identical records could also be found by comparing the text values that preston grep outputs for each line. But being able to get the hash of each line opens up all sorts of possibilities (e.g. preston grep, sketch, similar), and a hash URI tends to be much more concise than the full text of a record.
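As a rough illustration of the per-line hashing described above, here is a small Python sketch done outside of Preston. The function names are hypothetical, and stripping the trailing newline before hashing is an assumption; Preston may define a line's content differently.

```python
import hashlib
from collections import defaultdict


def line_hash_uri(line: str) -> str:
    """Return a hash://sha256/ URI for the UTF-8 bytes of one line."""
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return f"hash://sha256/{digest}"


def group_identical_lines(lines):
    """Map each hash URI to the indices of the lines that share that content."""
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        # Assumption: the trailing newline is not part of the record's content.
        groups[line_hash_uri(line.rstrip("\n"))].append(index)
    return dict(groups)
```

Any hash URI that maps to more than one index marks a set of records with identical content.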
