
Publishing a new version by a script (command line) #1480

Open
meliezer opened this issue Jun 17, 2020 · 19 comments

@meliezer
Contributor

Hello,
Would it be possible to add an option to create a new version after I overwrite the source files with a script, or simply because the database content has changed, perhaps via an API called with curl?
The HTTP response code would tell my script if something went wrong, and the details could then be found in the validation log.
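For illustration, the kind of call I have in mind could look like this (purely hypothetical; no such endpoint exists in the IPT today, and the resource name and URL are made up):

import requests

# Hypothetical sketch of the requested feature; the IPT does not expose this endpoint.
resp = requests.post(
    "https://my-ipt.example.org/api/resource/my_resource/publish",  # made-up URL
    auth=("ipt_user", "ipt_password"),
)
if resp.status_code != 200:
    print("Publication failed; see the validation log:", resp.text)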

Cheers,
Menashè

@sylvain-morin
Contributor

Hello @meliezer

I fully support this feature request.
On my side, I'm using a Python script to automatically publish and register a list of resources.

I list the resource IDs in a file (one ID per row), and the Python script makes the HTTP calls to log in and then publish or register each of them (roughly as in the sketch below):
https://github.com/gbiffrance/ipt-batch-import/blob/master/src/py/Automate-IPT-INPN.py

But I agree that a REST API would be better!
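A simplified sketch of that approach (the IPT base URL, endpoint paths and form fields here are placeholders; see the real script linked above for the actual calls):

import requests

IPT = "https://my-ipt.example.org"   # placeholder IPT base URL
session = requests.Session()

# Log in to the IPT (path and field names are placeholders)
session.post(IPT + "/login.do",
             data={"email": "user@example.org", "password": "secret"})

# One resource ID per row in the input file
with open("resource_ids.txt") as f:
    for resource_id in (line.strip() for line in f if line.strip()):
        # Trigger publication (or registration) of each resource; placeholder endpoint
        r = session.post(IPT + "/manage/publish.do", data={"r": resource_id})
        print(resource_id, r.status_code)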

timrobertson100 added this to the 2.5 milestone on Apr 28, 2021
@timrobertson100
Member

I'm not sure a full REST API for the IPT makes sense, as it would quickly become the GBIF registry, but having some management API for key features like the ones @meliezer and @sylmorin-gbif need would make the tool more useful for their workflows.

I suggest we keep the API small initially so that it can be included in a 2.5.1 or 2.5.2 release.

@meliezer
Contributor Author

Thank you @timrobertson100! Personally, I would like to:

  1. Upload source files, overwriting them if they already exist.
  2. Map them (automatic, simple mapping). If that is not possible, then only allow re-mapping after the mapping has first been defined through the web interface.
  3. Publish.
  4. Register.

The first two steps are the most important.

@abubelinha
Contributor

Perhaps this could also be possible?

  1. Update metadata (i.e. passing a JSON dictionary of metadata as the request body)

@MattBlissett
Member

Update metadata (i.e. passing a JSON dictionary of metadata as body request)

That relates to #955.

@abubelinha
Contributor

abubelinha commented Nov 17, 2022

I'm not sure a full REST API for the IPT makes sense, as it quickly becomes the GBIF registry

@timrobertson100: did you mean it would be easier to publish datasets directly to the registry just using its current API?

i.e., is there a way to avoid IPT hosting and keep our datasets and their metadata somewhere (e.g. a GitHub repository) that we can tell the GBIF registry to read from?
Is anyone using this approach? Any example scripts or pseudocode?

Thanks
@abubelinha

@MattBlissett
Member

If you have a Darwin Core Archive (or just an EML file for a metadata-only dataset) on a public URL, you can register it directly with GBIF.

https://github.com/gbif/registry/tree/dev/registry-examples/src/test/scripts has an example using Bash, which is obviously not production-ready in any way! There are probably 20 or so publishers registering datasets in this way.

Dataset metadata is read from the EML file within the DWCA. You need to keep track of what GBIF dataset UUID is assigned to datasets you have registered, so you can update them.

@abubelinha
Contributor

There are probably 20 or so publishers registering datasets in this way.

Thanks a lot! Is it possible to somehow search for / identify those publishers?
(If anyone publishes DwC-A files on GitHub, maybe they are sharing their publishing protocols/code too.)

@sylvain-morin
Contributor

sylvain-morin commented Nov 17, 2022

To sum up, we should:

  1. Put our DwC-A archives (zip files) on a web server.
     Each archive will be accessible at a direct URL like http://myhost/dwca12345.zip

  2. Register (once) this web server as an "installation", to get an installationKey.
     That means calling POST https://api.gbif.org/v1/installation
     with a body like the one at https://api.gbif.org/v1/installation/a957a663-2f17-415f-b1c8-5cf6398df8ed
     but with the installationType HTTP_INSTALLATION.

  3. If the dataset is a new one (not yet registered), we have two calls to make
     (a Python sketch of these calls is at the end of this comment):

a) POST https://api.gbif-uat.org/v1/dataset
with the following body:

{
  "publishingOrganizationKey": "$ORGANIZATION",
  "installationKey": "$INSTALLATION",
  "type": "OCCURRENCE",
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}

(copied from the scripts @MattBlissett mentioned, thanks!)

This creates the dataset on GBIF.org and returns the GBIF UUID of the dataset (let's call it aaaa-bbbb-cccc-dddd).

b) POST https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd/endpoint
with the following body:

{
  "type": "DWC_ARCHIVE",
  "url": "http://myhost/dwca12345.zip"
}

This tells GBIF.org how to access the DwC-A archive on our web server.

  4. If the dataset is already registered, we are doing an update.
     Let's say we overwrite the file on our web server (so the URL to access it stays the same).

We call PUT https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd
with the correct GBIF UUID and roughly the same body as in 3a).

Is that enough to trigger the update? Since the URL has not changed, we don't have to call the "endpoint" URL again, right?
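A minimal Python sketch of steps 3a) and 3b) (it assumes a GBIF account with publishing rights for the organization; the organization and installation keys are placeholders):

import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")   # account authorized for the publishing organization
ORGANIZATION = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
INSTALLATION = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# 3a) create the dataset with minimal metadata; the registry returns the new dataset UUID
dataset = {
    "publishingOrganizationKey": ORGANIZATION,
    "installationKey": INSTALLATION,
    "type": "OCCURRENCE",
    "title": "Example dataset registration",
    "description": "Registered with minimal metadata, overwritten once GBIF can access the file.",
    "language": "eng",
    "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
}
r = requests.post(f"{API}/dataset", json=dataset, auth=AUTH)
r.raise_for_status()
dataset_key = r.json()   # the GBIF UUID of the newly created dataset

# 3b) tell GBIF.org where the archive lives
endpoint = {"type": "DWC_ARCHIVE", "url": "http://myhost/dwca12345.zip"}
requests.post(f"{API}/dataset/{dataset_key}/endpoint", json=endpoint, auth=AUTH).raise_for_status()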

@sylvain-morin
Contributor

As written in @mike-podolskiy90's register.sh script:

Using a process like this, you should make sure you store the UUID GBIF assigns to your dataset, so you don't accidentally re-register existing datasets as new ones.

If our archives are named with stable internal UUIDs, we can even just rely on the GBIF.org API.

Calling http://api.gbif.org/v1/organization/1928bdf0-f5d2-11dc-8c12-b8a03c50a862/publishedDataset for your publishing organization gives you all the information needed to map your internal identifiers to the GBIF UUIDs (using the endpoints section).

That's what I do with some scripts to compare what is on GBIF.org with what is on my IPT after an update.
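Something along these lines (a sketch; it assumes your internal identifier appears in the archive URL registered as the dataset endpoint):

import requests

API = "https://api.gbif.org/v1"
ORG = "1928bdf0-f5d2-11dc-8c12-b8a03c50a862"

# Map internal IDs (taken from the registered endpoint URLs) to GBIF dataset UUIDs
mapping = {}
offset, limit = 0, 100
while True:
    page = requests.get(f"{API}/organization/{ORG}/publishedDataset",
                        params={"offset": offset, "limit": limit}).json()
    for dataset in page["results"]:
        for endpoint in dataset.get("endpoints", []):
            # e.g. http://myhost/dwca12345.zip -> internal id "dwca12345"
            internal_id = endpoint["url"].rstrip("/").split("/")[-1].removesuffix(".zip")
            mapping[internal_id] = dataset["key"]
    if page.get("endOfRecords", True):
        break
    offset += limit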

@sylvain-morin
Contributor

@timrobertson100 you told me to do this 2 years ago... but I love the IPT too much to abandon it :-)

I guess it's time for me to migrate to this solution; having 10K datasets on my IPT is becoming difficult to handle.

I'm just wondering if the migration will be easy.
For all my current IPT datasets, I would call:

  1. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
  "publishingOrganizationKey": "$ORGANIZATION",
  "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
  "type": "OCCURRENCE",
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}

to update the installationKey, and

  2. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/endpoint

{
  "type": "DWC_ARCHIVE",
  "url": "http://myhost/dwca12345.zip"
}

to update the URL endpoint of the archive.

Can we update the installationKey of an existing dataset?
Won't GBIF.org block me?

Which user should be used for these calls?
Can I do these operations with any account, or is there some specific registration needed for the account?

Currently I don't have to worry about this, since the IPT is the one making the registry calls.

@MattBlissett
Member

If there are no modifications to make you can call (with authentication) GET https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/crawl to request we re-crawl/interpret the dataset. Please don't do this with 10000 datasets at once -- for that many either run batches of about 200 and wait for them to complete, or just wait for the weekly crawl which will happen within 7 days anyway.

To migrate, you would only need to include the changed field:

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc
{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
}

(I think, haven't done this for a while.) Updates cause a crawl after 1 minute, in case there are more updates.

You should write to helpdesk@gbif.org to get authorization to make these requests. It's usually best to create a new institutional account on gbif.org for this. Create one on gbif-uat.org too, so you can test everything there first.
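A rough sketch of the batching idea (the dataset keys are assumed to already be collected in a list; instead of properly checking that a batch has finished crawling, it simply pauses between batches):

import time
import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")
dataset_keys: list[str] = []   # fill with the GBIF UUIDs of your datasets

BATCH = 200
for i in range(0, len(dataset_keys), BATCH):
    for key in dataset_keys[i:i + BATCH]:
        # request a re-crawl of the dataset, as described above
        requests.get(f"{API}/dataset/{key}/crawl", auth=AUTH)
    # crude pause between batches; better would be to poll until the batch has completed
    time.sleep(30 * 60)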

@sylvain-morin
Contributor

Thank you @MattBlissett.
I guess I will switch to this approach (I've just asked the GBIF France team for confirmation) before leaving, as it will ease maintenance.
I can certainly share my scripts with the community.

@abubelinha
Contributor

Wow ... tons of information today.
Thanks a lot @MattBlissett & @sylmorin-gbif ... are you planning to use Python for this too?

@sylvain-morin
Contributor

sylvain-morin commented Nov 30, 2022

Hi @MattBlissett,
I'm testing the migration, as we discussed above:
PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
}

Here is the result:

<ul>
	<li>Validation of [publishingOrganizationKey] failed: must not be null</li>
	<li>Validation of [title] failed: must not be null</li>
	<li>Validation of [type] failed: must not be null</li>
</ul>

So I added them:
PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE"
}

Here is the new result:

<ul>
	<li>Validation of [created] failed: must not be null</li>
	<li>Validation of [key] failed: must not be null</li>
	<li>Validation of [modified] failed: must not be null</li>
</ul>

I don't think having to supply the "created" or "modified" dates is normal...
But I did it anyway :)

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE",
   "key": "xxxxx",
   "created": "2022-11-25T16:18:46.134+00:00",
   "modified": "2022-11-25T16:18:46.134+00:00"
}

And the result is... 400 BAD REQUEST

Any idea?
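(A note for anyone hitting the same errors: one pattern that may avoid these "must not be null" validation failures is to GET the full dataset object, change only installationKey, and PUT the whole object back. This is an assumption, not something verified in this thread:)

import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")
key = "aaaaa-bbbbbbb-cccccc"

# Fetch the complete dataset object, change only what we need, and send it back whole.
# (Assumption: the registry expects the full entity on PUT, which would explain the
# "must not be null" errors above.)
dataset = requests.get(f"{API}/dataset/{key}").json()
dataset["installationKey"] = "NEW_WEB_SERVER_INSTALLATION_KEY"
r = requests.put(f"{API}/dataset/{key}", json=dataset, auth=AUTH)
print(r.status_code, r.text)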

@abubelinha
Contributor

I guess I will switch to this (I just asked confirmation to the GBIF France team) before leaving, so it will ease the maintenance.
I could share my scripts to the community for sure.

@sylvain-morin did you finally end up with a solution you can share?
I'd love to find a way to upload DwC-A files to a public repository (Zenodo, GitHub, whatever) and let the GBIF registry read directly from them whenever we publish updates.

@sylvain-morin
Contributor

I built a very simple Python app server to handle our needs at GBIF France.
https://github.com/gbiffrance/apt

In short:

  • you register this "APT" as a GBIF "HTTP installation"
  • you set the GBIF keys (publisher, installation, ...) in the APT config (via environment variables)
  • you define the folder where the datasets are stored (mounted as a Docker volume)
  • you use the POST endpoint to push a ZIP dataset to the APT

the POST endpoint will:

  • store the file in the folder you defined
  • register (or update) the dataset on gbif.org

It's really basic, but it has been handling more than 15,000 datasets for a year (https://www.gbif.org/installation/e44d0fd7-0edf-477f-aa82-50a81836ab46).

Our goal was to have a simple tool to handle the GBIF publication/update at the end of our dataset pipeline.
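A sketch of how pushing an archive to an APT instance might look from a client (the endpoint path POST /dataset/<id> here is a guess; check the repository for the actual API):

import requests

APT_URL = "https://apt.example.org"   # your APT instance (placeholder)
dataset_id = "dwca12345"              # your stable internal dataset identifier

# Push (create or update) a dataset archive; the APT stores the file and then
# registers or updates the dataset on gbif.org. The endpoint path is assumed.
with open("dwca12345.zip", "rb") as f:
    r = requests.post(f"{APT_URL}/dataset/{dataset_id}", data=f)
print(r.status_code, r.text)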

@abubelinha
Contributor

Oh great! Thanks a lot for the summarized explanation.
I suggest you add it to the repository (INSTALLATION.md or wherever you think is best).

I understand APT basically replicates the IPT behaviour, except that you must create the DwC-A files yourself beforehand and then use APT to both serve and register them (and their updates).
So this means the APT is expected to be up and running at all times, just like an IPT. Am I right?

This is great, but I am mostly interested in "serving" datasets from a different place (e.g. an institutional repository, or Zenodo), while using Python/APT only to register them.
I guess this might be possible by changing registry.py dataset_url(id):

def dataset_url(id):
    # your current code:
    # return CONFIG.APT_PUBLIC_URL + "/dataset/"+id 

    # use my own function to get dataset urls from wherever I store them (i.e. database, excel, ...):
    return get_remote_dataset_url(id)

The defaultserver.py post_dataset(id) would also have to be changed to upload the file to a repository instead of storing it on the APT server.

In such a scenario, would it be possible to use APT from a local machine (not accessible to the GBIF registry), so that I only run it when publishing, while Zenodo or my institution's repository takes care of keeping the DwC data source accessible online 24x7?
I encourage GBIF staff (@timrobertson100 @MattBlissett ...?) to give their opinions about this approach.

I suppose the main concern would be checking for a valid DwC-A file structure before uploading it to a public URL and registering it. But perhaps python-dwca-reader might do the trick (see the sketch below).
@sylvain-morin did you use any particular approach for that?
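For what it's worth, a minimal structural check with python-dwca-reader might look like this (just a sketch of the idea):

from dwca.read import DwCAReader

def looks_like_valid_dwca(path):
    """Very rough structural check: the archive opens and exposes EML metadata."""
    try:
        with DwCAReader(path) as dwca:
            # metadata is the parsed EML document (None if the archive has no metadata)
            return dwca.metadata is not None
    except Exception:
        return False

if looks_like_valid_dwca("dwca12345.zip"):
    print("archive looks OK, safe to upload and register")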

Of course, creating valid DwC-A files on your own might not be trivial (especially the metadata part)... but that is a different question.

@abubelinha
Contributor

I think I missed the "HTTP installation" role of APT in my previous message.

I guess both APT and IPT are expected to be accessible online, so they constitute a kind of index page for the datasets they serve.
So I would slightly change my question: could this "HTTP installation" be a simple HTML index page of those datasets?

In other words, can we just store (and keep updated) both the "installation" and its datasets on a static website?
(So we could use any repository that can keep them available online at permanent URLs.)
