DOI downloader doesn't work for figshare "collections" #274

rabernat · 2021-11-02T01:19:25Z

Thanks again for this excellent package.

I am documenting this issue I encountered, which may or may not be a bug.

I am trying to use the DOI downloader following these instructions. The example code there works for me. However, it doesn't seem to work on the following repository:

https://figshare.com/collections/xmitgcm_test_data/4362224

for which the figshare DOI is

https://doi.org/10.6084/m9.figshare.c.4362224.v1

Code to reproduce the issue

import pooch

POOCH = pooch.create(
    path=pooch.os_cache("mitgcm-test-data"),
    base_url="doi:10.6084/m9.figshare.c.4362224.v1/",
    registry={
        "global_oce_latlon.tar.gz": None
    }
)

f = POOCH.fetch('global_oce_latlon.tar.gz')

IndexError

Downloading file 'global_oce_latlon.tar.gz' from 'doi:10.6084/m9.figshare.c.4362224.v1/global_oce_latlon.tar.gz' to '/home/jovyan/.cache/mitgcm-test-data'.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_354/1742425038.py in <module>
      9 )
     10 
---> 11 f = POOCH.fetch('global_oce_latlon.tar.gz')

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/core.py in fetch(self, fname, processor, downloader)
    537                 downloader = choose_downloader(url)
    538 
--> 539             stream_download(
    540                 url,
    541                 full_path,

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/core.py in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    722             # hash before overwriting the original.
    723             with temporary_file(path=str(fname.parent)) as tmp:
--> 724                 downloader(url, tmp, pooch)
    725                 hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    726                 shutil.move(tmp, str(fname))

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/downloaders.py in __call__(self, url, output_file, pooch)
    566                 "please open an issue at https://github.com/fatiando/pooch/issues"
    567             )
--> 568         download_url = converters[repository](
    569             archive_url=archive_url,
    570             file_name=parsed_url["path"].split("/")[-1],

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/downloaders.py in figshare_download_url(archive_url, file_name, doi)
    653     """
    654     # Use the figshare API to find the article ID from the DOI
--> 655     article = requests.get(f"https://api.figshare.com/v2/articles?doi={doi}").json()[0]
    656     article_id = article["id"]
    657     # With the ID, we can get a list of files and their download links

IndexError: list index out of range

Figshare datasets vs. collections

I think that the core issue is the following:

Example DOI from Pooch docs:

$  curl "https://api.figshare.com/v2/articles?doi=10.6084/m9.figshare.14763051.v1"
[{"id": 14763051, "title": "Test data for the Pooch library", "doi": "10.6084/m9.figshare.14763051.v1", "handle": "", "url": "https://api.figshare.com/v2/articles/14763051", "published_date": "2021-06-10T14:45:37Z", "thumb": "", "defined_type": 3, "defined_type_name": "dataset", "group_id": null, "url_private_api": "https://api.figshare.com/v2/account/articles/14763051", "url_public_api": "https://api.figshare.com/v2/articles/14763051", "url_private_html": "https://figshare.com/account/articles/14763051", "url_public_html": "https://figshare.com/articles/dataset/Test_data_for_the_Pooch_library/14763051", "timeline": {"posted": "2021-06-10T14:45:37", "firstOnline": "2021-06-10T14:45:37", "revision": "2021-06-10T14:45:37"}, "resource_title": "", "resource_doi": ""}]

My DOI:

$ curl "https://api.figshare.com/v2/articles?doi=10.6084/m9.figshare.c.4362224.v1"
[]

I think the core problem is that my DOI points to a figshare collection, not a dataset, so the articles endpoint returns an empty list for it. I didn't even realize this distinction existed until I decided to write up this issue.


leouieda · 2021-11-02T06:00:22Z

Thanks for reporting @rabernat! I had no idea collections existed, to be honest. It's strange because they don't resolve to a particular dataset but to a collection of other datasets, each with its own DOI. I suspect the API endpoint is different for collections and there may be no way around specifying which dataset you want (not just the file name).

Testing this out, I can get a list of the contents of a collection:

$ curl https://api.figshare.com/v2/collections/4362224/articles
[
  {
    "id": 7571852,
    "title": "internal_wave.tar.gz",
    "doi": "10.6084/m9.figshare.7571852.v1",
    "handle": "",
    "url": "https://api.figshare.com/v2/articles/7571852",
    "published_date": "2019-01-10T17:19:56Z",
    "thumb": "",
    "defined_type": 3,
    "defined_type_name": "dataset",
    "group_id": null,
    "url_private_api": "https://api.figshare.com/v2/account/articles/7571852",
    "url_public_api": "https://api.figshare.com/v2/articles/7571852",
    "url_private_html": "https://figshare.com/account/articles/7571852",
    "url_public_html": "https://figshare.com/articles/dataset/internal_wave_tar_gz/7571852",
    "timeline": {
      "posted": "2019-01-10T17:19:56",
      "firstOnline": "2019-01-10T17:19:56",
      "revision": "2019-01-10T17:19:56"
    },
    "resource_title": null,
    "resource_doi": null
  },
  ...
]

The main difficulty is that the title of each dataset doesn't have to match the name of the file you want to download (in this case it does, but that's not required). So you'd really have to specify the collection DOI, the dataset name/ID, and the file name in the dataset. At this point, it's probably much easier to just specify the dataset DOI instead of the collection DOI.
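As a rough sketch of how the dataset DOIs could be looked up from a collection, the snippet below queries the collections endpoint shown in the curl call above and maps each dataset title to its DOI. The function names `list_collection_articles` and `dataset_dois_by_title` are hypothetical and not part of Pooch, and the sketch assumes the response has the shape of the JSON above.

```python
import json
import urllib.request


def list_collection_articles(collection_id):
    """Fetch the datasets ("articles") that belong to a figshare collection."""
    # Endpoint taken from the curl example above.
    url = f"https://api.figshare.com/v2/collections/{collection_id}/articles"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def dataset_dois_by_title(articles):
    """Map each dataset title to its DOI.

    Titles are not guaranteed to match the file names inside the dataset,
    so this is only a convenience for finding the right dataset DOI.
    """
    return {article["title"]: article["doi"] for article in articles}
```

Calling `dataset_dois_by_title(list_collection_articles(4362224))` would give the dataset DOIs to use as `base_url` in `pooch.create`, one Pooch instance per dataset.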

I'm not sure there is a good way for us to resolve this. It's not hard to figure out if a DOI is a collection (they seem to always contain figshare.c.). But getting to the file download URL is tricky.
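Detecting a collection DOI could look something like the sketch below. It relies only on the observation that collection DOIs seem to contain `figshare.c.` followed by the collection ID, which is an assumption based on the examples in this issue rather than a documented guarantee. The helper name `figshare_collection_id` is hypothetical.

```python
import re

# Assumption: collection DOIs look like "10.6084/m9.figshare.c.<id>.v<n>".
_COLLECTION_PATTERN = re.compile(r"figshare\.c\.(\d+)")


def figshare_collection_id(doi):
    """Return the collection ID if the DOI looks like a figshare collection, else None."""
    match = _COLLECTION_PATTERN.search(doi)
    return int(match.group(1)) if match else None
```

For the DOI in this issue the helper returns the same ID that appears in the collections API URL, while a regular dataset DOI yields `None`.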

Any suggestions from anyone on how to resolve this? We should at least document that it won't work for collection DOIs (PR welcome 🙂).

rabernat · 2021-11-02T12:47:43Z

I think the only feasible solution is to document what we have learned about the difference between figshare collections and datasets. PR forthcoming.

leouieda · 2021-11-08T16:16:32Z

Now that I think about it, it would be great if we actually handled that error better. It comes from the figshare API returning zero matching datasets for the DOI; we then try to index the resulting empty list.

A better implementation would check whether the list is empty and, if so, raise an exception saying that no datasets were found for the DOI. The message should also explain that collection DOIs aren't supported and that the dataset DOI should be used instead.
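A minimal sketch of that check, assuming `articles` is the parsed JSON list returned by the `https://api.figshare.com/v2/articles?doi=...` query seen in the traceback. The function name `first_article` and the choice of `ValueError` are illustrative assumptions, not what Pooch actually implements.

```python
def first_article(articles, doi):
    """Return the first article in a figshare API response, or fail with a clear message."""
    if not articles:
        # An empty list means the API found no dataset for this DOI,
        # which is what happens for collection DOIs.
        raise ValueError(
            f"No figshare datasets found for DOI '{doi}'. "
            "If this DOI points to a figshare collection, note that collections "
            "are not supported; use the DOI of the individual dataset instead."
        )
    return articles[0]
```

With this in place, the collection DOI from this issue would produce a descriptive `ValueError` instead of `IndexError: list index out of range`.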

leouieda added the bug and question labels Nov 2, 2021

rabernat mentioned this issue Nov 2, 2021

Mention Fighshare Collections vs. Datasets in protocols.rst #275

Merged

