
Publishing a new version by a script (command line) #1480

Open
meliezer opened this issue Jun 17, 2020 · 19 comments

@meliezer
Contributor

Hello,
Would it be possible to add an option to create a new version after I overwrite the source files with a script, or simply because the database content has changed, perhaps via an API called with curl?
The HTTP response code would tell my script if something went wrong, and the details could then be found in the validation log.
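For illustration, the kind of call I have in mind could look like this (purely hypothetical; no such endpoint exists in the IPT today, and the resource name and URL are made up):

import requests

# Hypothetical sketch of the requested feature; the IPT does not expose this endpoint.
resp = requests.post(
    "https://my-ipt.example.org/api/resource/my_resource/publish",  # made-up URL
    auth=("ipt_user", "ipt_password"),
)
if resp.status_code != 200:
    print("Publication failed; see the validation log:", resp.text)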

Cheers,
Menashè

@sylvain-morin
Contributor

Hello @meliezer

I fully support this feature request.
On my side, I'm using a Python script to automatically publish and register a list of resources.

I list the resource IDs in a file (one ID per row), and the Python script makes the HTTP calls to log in and then publish or register each of them (roughly as in the sketch below):
https://github.com/gbiffrance/ipt-batch-import/blob/master/src/py/Automate-IPT-INPN.py

But I agree that a REST API would be better!
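A simplified sketch of that approach (the IPT base URL, endpoint paths and form fields here are placeholders; see the real script linked above for the actual calls):

import requests

IPT = "https://my-ipt.example.org"   # placeholder IPT base URL
session = requests.Session()

# Log in to the IPT (path and field names are placeholders)
session.post(IPT + "/login.do",
             data={"email": "user@example.org", "password": "secret"})

# One resource ID per row in the input file
with open("resource_ids.txt") as f:
    for resource_id in (line.strip() for line in f if line.strip()):
        # Trigger publication (or registration) of each resource; placeholder endpoint
        r = session.post(IPT + "/manage/publish.do", data={"r": resource_id})
        print(resource_id, r.status_code)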

timrobertson100 added this to the 2.5 milestone on Apr 28, 2021
@timrobertson100
Member

I'm not sure a full REST API for the IPT makes sense, as it would quickly become the GBIF registry, but having some management API for key features like the ones @meliezer and @sylmorin-gbif need would make the tool more useful for their workflows.

I suggest we keep the API small initially so that it can be included in a 2.5.1 or 2.5.2 release.

@meliezer
Contributor Author

Thank you @timrobertson100! Personally, I would like to:

  1. Upload source files, overwriting them if they already exist.
  2. Map them (automatic, simple mapping). If that is not possible, then only allow re-mapping after the mapping has first been defined through the web interface.
  3. Publish.
  4. Register.

The first two steps are the most important.

@abubelinha
Contributor

Perhaps this could also be possible?

  1. Update metadata (i.e. passing a JSON dictionary of metadata as the request body)

@MattBlissett
Member

Update metadata (i.e. passing a JSON dictionary of metadata as body request)

That relates to #955.

@abubelinha
Contributor

abubelinha commented Nov 17, 2022

I'm not sure a full REST API for the IPT makes sense, as it quickly becomes the GBIF registry

@timrobertson100: did you mean it would be easier to publish datasets directly to the registry just using its current API?

i.e., is there a way to avoid IPT hosting and keep our datasets and their metadata somewhere (e.g. a GitHub repository) that we can tell the GBIF registry to read from?
Is anyone using this approach? Any example scripts or pseudocode?

Thanks
@abubelinha

@MattBlissett
Member

If you have a Darwin Core Archive (or just an EML file for a metadata-only dataset) on a public URL, you can register it directly with GBIF.

https://github.com/gbif/registry/tree/dev/registry-examples/src/test/scripts has an example using Bash, which is obviously not production-ready in any way! There are probably 20 or so publishers registering datasets in this way.

Dataset metadata is read from the EML file within the DWCA. You need to keep track of what GBIF dataset UUID is assigned to datasets you have registered, so you can update them.

@abubelinha
Contributor

There are probably 20 or so publishers registering datasets in this way.

Thanks a lot! Is it possible to somehow search for / identify those publishers?
(If anyone publishes DwC-A files on GitHub, maybe they are sharing their publishing protocols/code too.)

@sylvain-morin
Contributor

sylvain-morin commented Nov 17, 2022

To sum up, we should:

  1. Put our DwC-A archives (zip files) on a web server.
     Each archive will be accessible at a direct URL like http://myhost/dwca12345.zip

  2. Register (once) this web server as an "installation", to get an installationKey.
     That means calling POST https://api.gbif.org/v1/installation
     with a body like the one at https://api.gbif.org/v1/installation/a957a663-2f17-415f-b1c8-5cf6398df8ed
     but with the installationType HTTP_INSTALLATION.

  3. If the dataset is a new one (not yet registered), we have two calls to make
     (a Python sketch of these calls is at the end of this comment):

a) POST https://api.gbif-uat.org/v1/dataset
with the following body:

{
  "publishingOrganizationKey": "$ORGANIZATION",
  "installationKey": "$INSTALLATION",
  "type": "OCCURRENCE",
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}

(copied from the scripts @MattBlissett mentioned, thanks!)

This creates the dataset on GBIF.org and returns the GBIF UUID of the dataset (let's call it aaaa-bbbb-cccc-dddd).

b) POST https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd/endpoint
with the following body:

{
  "type": "DWC_ARCHIVE",
  "url": "http://myhost/dwca12345.zip"
}

This tells GBIF.org how to access the DwC-A archive on our web server.

  4. If the dataset is already registered, we are doing an update.
     Let's say we overwrite the file on our web server (so the URL to access it stays the same).

We call PUT https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd
with the correct GBIF UUID and roughly the same body as in 3a).

Is that enough to trigger the update? Since the URL has not changed, we don't have to call the "endpoint" URL again, right?
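A minimal Python sketch of steps 3a) and 3b) (it assumes a GBIF account with publishing rights for the organization; the organization and installation keys are placeholders):

import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")   # account authorized for the publishing organization
ORGANIZATION = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
INSTALLATION = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# 3a) create the dataset with minimal metadata; the registry returns the new dataset UUID
dataset = {
    "publishingOrganizationKey": ORGANIZATION,
    "installationKey": INSTALLATION,
    "type": "OCCURRENCE",
    "title": "Example dataset registration",
    "description": "Registered with minimal metadata, overwritten once GBIF can access the file.",
    "language": "eng",
    "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
}
r = requests.post(f"{API}/dataset", json=dataset, auth=AUTH)
r.raise_for_status()
dataset_key = r.json()   # the GBIF UUID of the newly created dataset

# 3b) tell GBIF.org where the archive lives
endpoint = {"type": "DWC_ARCHIVE", "url": "http://myhost/dwca12345.zip"}
requests.post(f"{API}/dataset/{dataset_key}/endpoint", json=endpoint, auth=AUTH).raise_for_status()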

@sylvain-morin
Contributor

As written in @mike-podolskiy90's register.sh script:

Using a process like this, you should make sure you store the UUID GBIF assigns to your dataset, so you don't accidentally re-register existing datasets as new ones.

If our archives are named with stable internal UUIDs, we can even just rely on the GBIF.org API.

Calling http://api.gbif.org/v1/organization/1928bdf0-f5d2-11dc-8c12-b8a03c50a862/publishedDataset for your publishing organization gives you all the information needed to map your internal identifiers to the GBIF UUIDs (using the endpoints section).

That's what I do with some scripts to compare what is on GBIF.org with what is on my IPT after an update.
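Something along these lines (a sketch; it assumes your internal identifier appears in the archive URL registered as the dataset endpoint):

import requests

API = "https://api.gbif.org/v1"
ORG = "1928bdf0-f5d2-11dc-8c12-b8a03c50a862"

# Map internal IDs (taken from the registered endpoint URLs) to GBIF dataset UUIDs
mapping = {}
offset, limit = 0, 100
while True:
    page = requests.get(f"{API}/organization/{ORG}/publishedDataset",
                        params={"offset": offset, "limit": limit}).json()
    for dataset in page["results"]:
        for endpoint in dataset.get("endpoints", []):
            # e.g. http://myhost/dwca12345.zip -> internal id "dwca12345"
            internal_id = endpoint["url"].rstrip("/").split("/")[-1].removesuffix(".zip")
            mapping[internal_id] = dataset["key"]
    if page.get("endOfRecords", True):
        break
    offset += limit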

@sylvain-morin
Contributor

@timrobertson100 you told me to do this 2 years ago... but I love the IPT too much to abandon it :-)

I guess it's time for me to migrate to this solution; having 10K datasets on my IPT is becoming difficult to handle.

I'm just wondering if the migration will be easy.
For all my current IPT datasets, I would call:

  1. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
  "publishingOrganizationKey": "$ORGANIZATION",
  "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
  "type": "OCCURRENCE",
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}

to update the installationKey, and

  2. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/endpoint

{
  "type": "DWC_ARCHIVE",
  "url": "http://myhost/dwca12345.zip"
}

to update the URL endpoint of the archive.

Can we update the installationKey of an existing dataset?
Won't GBIF.org block me?

Which user should be used for these calls?
Can I do these operations with any account, or is there some specific registration needed for the account?

Currently I don't have to worry about this, since the IPT is the one making the registry calls.

@MattBlissett
Member

If there are no modifications to make you can call (with authentication) GET https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/crawl to request we re-crawl/interpret the dataset. Please don't do this with 10000 datasets at once -- for that many either run batches of about 200 and wait for them to complete, or just wait for the weekly crawl which will happen within 7 days anyway.

To migrate, you would only need to include the changed field:

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc
{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
}

(I think, haven't done this for a while.) Updates cause a crawl after 1 minute, in case there are more updates.

You should write to helpdesk@gbif.org to get authorization to make these requests. It's usually best to create a new institutional account on gbif.org for this. Create one on gbif-uat.org too, so you can test everything there first.
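A rough sketch of the batching idea (the dataset keys are assumed to already be collected in a list; instead of properly checking that a batch has finished crawling, it simply pauses between batches):

import time
import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")
dataset_keys: list[str] = []   # fill with the GBIF UUIDs of your datasets

BATCH = 200
for i in range(0, len(dataset_keys), BATCH):
    for key in dataset_keys[i:i + BATCH]:
        # request a re-crawl of the dataset, as described above
        requests.get(f"{API}/dataset/{key}/crawl", auth=AUTH)
    # crude pause between batches; better would be to poll until the batch has completed
    time.sleep(30 * 60)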

@sylvain-morin
Contributor

Thank you @MattBlissett.
I guess I will switch to this approach (I've just asked the GBIF France team for confirmation) before leaving, as it will ease maintenance.
I can certainly share my scripts with the community.

@abubelinha
Contributor

Wow ... tons of information today.
Thanks a lot @MattBlissett & @sylmorin-gbif ... are you planning to use Python for this too?

@sylvain-morin
Contributor

sylvain-morin commented Nov 30, 2022

Hi @MattBlissett,
I'm testing the migration, as we discussed above:
PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
}

Here is the result:

<ul>
	<li>Validation of [publishingOrganizationKey] failed: must not be null</li>
	<li>Validation of [title] failed: must not be null</li>
	<li>Validation of [type] failed: must not be null</li>
</ul>

So I added them:
PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE"
}

Here is the new result:

<ul>
	<li>Validation of [created] failed: must not be null</li>
	<li>Validation of [key] failed: must not be null</li>
	<li>Validation of [modified] failed: must not be null</li>
</ul>

I don't think having to supply the "created" or "modified" dates is normal...
But I did it anyway :)

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE",
   "key": "xxxxx",
   "created": "2022-11-25T16:18:46.134+00:00",
   "modified": "2022-11-25T16:18:46.134+00:00"
}

And the result is... 400 BAD REQUEST

Any idea?
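(A note for anyone hitting the same errors: one pattern that may avoid these "must not be null" validation failures is to GET the full dataset object, change only installationKey, and PUT the whole object back. This is an assumption, not something verified in this thread:)

import requests

API = "https://api.gbif-uat.org/v1"
AUTH = ("gbif_user", "gbif_password")
key = "aaaaa-bbbbbbb-cccccc"

# Fetch the complete dataset object, change only what we need, and send it back whole.
# (Assumption: the registry expects the full entity on PUT, which would explain the
# "must not be null" errors above.)
dataset = requests.get(f"{API}/dataset/{key}").json()
dataset["installationKey"] = "NEW_WEB_SERVER_INSTALLATION_KEY"
r = requests.put(f"{API}/dataset/{key}", json=dataset, auth=AUTH)
print(r.status_code, r.text)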

@abubelinha
Contributor

I guess I will switch to this (I just asked confirmation to the GBIF France team) before leaving, so it will ease the maintenance.
I could share my scripts to the community for sure.

@sylvain-morin did you finally end up with a solution you can share?
I'd love to find a way to upload DwC-A files to a public repository (Zenodo, GitHub, whatever) and let the GBIF registry read directly from them whenever we publish updates.

@sylvain-morin
Contributor

I built a very simple Python app server to handle our needs at GBIF France.
https://github.com/gbiffrance/apt

In short:

  • you register this "APT" as a GBIF "HTTP installation"
  • you set the GBIF keys (publisher, installation, ...) in the APT config (via environment variables)
  • you define the folder where the datasets are stored (mounted as a Docker volume)
  • you use the POST endpoint to push a ZIP dataset to the APT

the POST endpoint will:

  • store the file in the folder you defined
  • register (or update) the dataset on gbif.org

It's really basic, but it has been handling more than 15,000 datasets for a year (https://www.gbif.org/installation/e44d0fd7-0edf-477f-aa82-50a81836ab46).

Our goal was to have a simple tool to handle the GBIF publication/update at the end of our dataset pipeline.
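A sketch of how pushing an archive to an APT instance might look from a client (the endpoint path POST /dataset/<id> here is a guess; check the repository for the actual API):

import requests

APT_URL = "https://apt.example.org"   # your APT instance (placeholder)
dataset_id = "dwca12345"              # your stable internal dataset identifier

# Push (create or update) a dataset archive; the APT stores the file and then
# registers or updates the dataset on gbif.org. The endpoint path is assumed.
with open("dwca12345.zip", "rb") as f:
    r = requests.post(f"{APT_URL}/dataset/{dataset_id}", data=f)
print(r.status_code, r.text)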

@abubelinha
Contributor

Oh great! Thanks a lot for the summarized explanation.
I suggest you add it to the repository (INSTALLATION.md or wherever you think is best).

I understand APT basically replicates the IPT behaviour, except that you must create the DwC-A files yourself beforehand and then use APT to both serve and register them (and their updates).
So this means the APT is expected to be up and running at all times, just like an IPT. Am I right?

This is great, but I am mostly interested in "serving" datasets from a different place (e.g. an institutional repository, or Zenodo), while using Python/APT only to register them.
I guess this might be possible by changing registry.py dataset_url(id):

def dataset_url(id):
    # your current code:
    # return CONFIG.APT_PUBLIC_URL + "/dataset/"+id 

    # use my own function to get dataset urls from wherever I store them (i.e. database, excel, ...):
    return get_remote_dataset_url(id)

The defaultserver.py post_dataset(id) would also have to be changed to upload the file to a repository instead of storing it on the APT server.

In such a scenario, would it be possible to use APT from a local machine (not accessible to the GBIF registry), so that I only run it when publishing, while Zenodo or my institution's repository takes care of keeping the DwC data source accessible online 24x7?
I encourage GBIF staff (@timrobertson100 @MattBlissett ...?) to give their opinions about this approach.

I suppose the main concern would be checking for a valid DwC-A file structure before uploading it to a public URL and registering it. But perhaps python-dwca-reader might do the trick (see the sketch below).
@sylvain-morin did you use any particular approach for that?
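For what it's worth, a minimal structural check with python-dwca-reader might look like this (just a sketch of the idea):

from dwca.read import DwCAReader

def looks_like_valid_dwca(path):
    """Very rough structural check: the archive opens and exposes EML metadata."""
    try:
        with DwCAReader(path) as dwca:
            # metadata is the parsed EML document (None if the archive has no metadata)
            return dwca.metadata is not None
    except Exception:
        return False

if looks_like_valid_dwca("dwca12345.zip"):
    print("archive looks OK, safe to upload and register")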

Of course, creating valid DwC-A files on your own might not be trivial (especially the metadata part)... but that is a different question.

@abubelinha
Contributor

I think I missed the "HTTP installation" role of APT in my previous message.

I guess both APT and IPT are expected to be accessible online, so they constitute a kind of index page for the datasets they serve.
So I would slightly change my question: could this "HTTP installation" be a simple HTML index page of those datasets?

In other words, can we just store (and keep updated) both the "installation" and its datasets on a static website?
(So we could use any repository that can keep them available online at permanent URLs.)
