DOI downloader doesn't work for figshare "collections" #274

Open
rabernat opened this issue Nov 2, 2021 · 3 comments
Labels: bug (Report a problem that needs to be fixed), question (Further information is requested)

Comments

@rabernat
Contributor

rabernat commented Nov 2, 2021

Thanks again for this excellent package.

I am documenting this issue I encountered, which may or may not be a bug.

I am trying to use the DOI downloader following these instructions. The example code there works for me. However, it doesn't seem to work on the following repository:

https://figshare.com/collections/xmitgcm_test_data/4362224

for which the figshare DOI is

https://doi.org/10.6084/m9.figshare.c.4362224.v1

Code to reproduce the issue

import pooch

POOCH = pooch.create(
    path=pooch.os_cache("mitgcm-test-data"),
    base_url="doi:10.6084/m9.figshare.c.4362224.v1/",
    registry={
        "global_oce_latlon.tar.gz": None
    }
)

f = POOCH.fetch('global_oce_latlon.tar.gz')
IndexError
Downloading file 'global_oce_latlon.tar.gz' from 'doi:10.6084/m9.figshare.c.4362224.v1/global_oce_latlon.tar.gz' to '/home/jovyan/.cache/mitgcm-test-data'.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_354/1742425038.py in <module>
      9 )
     10 
---> 11 f = POOCH.fetch('global_oce_latlon.tar.gz')

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/core.py in fetch(self, fname, processor, downloader)
    537                 downloader = choose_downloader(url)
    538 
--> 539             stream_download(
    540                 url,
    541                 full_path,

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/core.py in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    722             # hash before overwriting the original.
    723             with temporary_file(path=str(fname.parent)) as tmp:
--> 724                 downloader(url, tmp, pooch)
    725                 hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    726                 shutil.move(tmp, str(fname))

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/downloaders.py in __call__(self, url, output_file, pooch)
    566                 "please open an issue at https://github.com/fatiando/pooch/issues"
    567             )
--> 568         download_url = converters[repository](
    569             archive_url=archive_url,
    570             file_name=parsed_url["path"].split("/")[-1],

/srv/conda/envs/notebook/lib/python3.8/site-packages/pooch/downloaders.py in figshare_download_url(archive_url, file_name, doi)
    653     """
    654     # Use the figshare API to find the article ID from the DOI
--> 655     article = requests.get(f"https://api.figshare.com/v2/articles?doi={doi}").json()[0]
    656     article_id = article["id"]
    657     # With the ID, we can get a list of files and their download links

IndexError: list index out of range

Figshare datasets vs. collections

I think that the core issue is the following:

Example DOI from Pooch docs:

$  curl "https://api.figshare.com/v2/articles?doi=10.6084/m9.figshare.14763051.v1"
[{"id": 14763051, "title": "Test data for the Pooch library", "doi": "10.6084/m9.figshare.14763051.v1", "handle": "", "url": "https://api.figshare.com/v2/articles/14763051", "published_date": "2021-06-10T14:45:37Z", "thumb": "", "defined_type": 3, "defined_type_name": "dataset", "group_id": null, "url_private_api": "https://api.figshare.com/v2/account/articles/14763051", "url_public_api": "https://api.figshare.com/v2/articles/14763051", "url_private_html": "https://figshare.com/account/articles/14763051", "url_public_html": "https://figshare.com/articles/dataset/Test_data_for_the_Pooch_library/14763051", "timeline": {"posted": "2021-06-10T14:45:37", "firstOnline": "2021-06-10T14:45:37", "revision": "2021-06-10T14:45:37"}, "resource_title": "", "resource_doi": ""}]

My DOI:

$ curl "https://api.figshare.com/v2/articles?doi=10.6084/m9.figshare.c.4362224.v1"
[]

I think the core problem is that my DOI points to a figshare collection, not a dataset, so the articles endpoint finds nothing for it. I didn't even realize this distinction existed until I decided to write up this issue.

@leouieda
Member

leouieda commented Nov 2, 2021

Thanks for reporting @rabernat! I had no idea collections existed, to be honest. It's strange because they don't resolve to a particular dataset but to a collection of other datasets, each with its own DOI. I suspect the API endpoint is different for collections and there may be no way around specifying which dataset you want (not just the file name).

Testing this out, I can get a list of the contents of a collection:

$ curl https://api.figshare.com/v2/collections/4362224/articles
[
  {
    "id": 7571852,
    "title": "internal_wave.tar.gz",
    "doi": "10.6084/m9.figshare.7571852.v1",
    "handle": "",
    "url": "https://api.figshare.com/v2/articles/7571852",
    "published_date": "2019-01-10T17:19:56Z",
    "thumb": "",
    "defined_type": 3,
    "defined_type_name": "dataset",
    "group_id": null,
    "url_private_api": "https://api.figshare.com/v2/account/articles/7571852",
    "url_public_api": "https://api.figshare.com/v2/articles/7571852",
    "url_private_html": "https://figshare.com/account/articles/7571852",
    "url_public_html": "https://figshare.com/articles/dataset/internal_wave_tar_gz/7571852",
    "timeline": {
      "posted": "2019-01-10T17:19:56",
      "firstOnline": "2019-01-10T17:19:56",
      "revision": "2019-01-10T17:19:56"
    },
    "resource_title": null,
    "resource_doi": null
  },
  ...
]

The main difficulty is that the title of each dataset doesn't have to be the name of the file that you want to download (here the titles happen to match the file names, but that's not required). So you'd really have to specify the collection DOI, the dataset name/ID, and the file name in the dataset. At this point, it's probably much easier to just specify the dataset DOI instead of the collection.
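For anyone who wants to find the right dataset DOI by hand, here is a rough sketch of that lookup: list the datasets in the collection, list the files in each dataset, and return the DOI of the dataset that holds the file. The `/collections/{id}/articles` endpoint is the one queried above. The `/articles/{id}/files` endpoint and the `name` field of its entries are assumptions based on how the Pooch downloader uses an article ID. `find_dataset_doi` and `fetch_json` are hypothetical names, not part of Pooch, and the sketch ignores any pagination of the listing.

```python
import json
from urllib.request import urlopen

API = "https://api.figshare.com/v2"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urlopen(url) as response:
        return json.load(response)


def find_dataset_doi(collection_id, file_name, fetch=fetch_json):
    """Return the DOI of the dataset in a collection that holds file_name.

    Hypothetical helper, not part of Pooch. Returns None if no dataset
    in the collection lists a file with that name.
    """
    articles = fetch(f"{API}/collections/{collection_id}/articles")
    for article in articles:
        # Assumed endpoint: one entry per file, each with a "name" field.
        files = fetch(f"{API}/articles/{article['id']}/files")
        if any(item["name"] == file_name for item in files):
            return article["doi"]
    return None
```

Calling `find_dataset_doi(4362224, "global_oce_latlon.tar.gz")` should then give a dataset DOI that can go into `base_url` as `doi:<dataset DOI>/`. The `fetch` argument is only there so the lookup can be tried against canned responses without hitting the network.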

I'm not sure there is a good way for us to resolve this. It's not hard to figure out if a DOI is a collection (they seem to always contain figshare.c.). But getting to the file download URL is tricky.
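A minimal sketch of that check, assuming the pattern seen in this thread holds in general (collection DOIs contain `figshare.c.` followed by a numeric ID and an optional version suffix). `figshare_collection_id` is a hypothetical helper, not something Pooch provides.

```python
import re

# Assumption from this thread: collection DOIs look like
# 10.6084/m9.figshare.c.<id>[.v<version>]; dataset DOIs lack the "c.".
COLLECTION_DOI = re.compile(r"figshare\.c\.(\d+)(?:\.v\d+)?$")


def figshare_collection_id(doi):
    """Return the collection ID as an int, or None if doi is not a collection DOI."""
    match = COLLECTION_DOI.search(doi.rstrip("/"))
    return int(match.group(1)) if match else None
```

With the two DOIs from this issue, the collection DOI `10.6084/m9.figshare.c.4362224.v1` yields an ID that can be passed to the `/collections/{id}/articles` endpoint, while the dataset DOI from the docs yields `None`.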

Any suggestions from anyone on how to resolve this? We should at least document that it won't work for collection DOIs (PR welcome 🙂).

leouieda added the bug and question labels on Nov 2, 2021
@rabernat
Contributor Author

rabernat commented Nov 2, 2021

I think the only feasible solution is to document what we have learned about the difference between figshare collections and datasets. PR forthcoming.

@leouieda
Member

leouieda commented Nov 8, 2021

Now that I think about it, it would be great if we actually handled that error better. The figshare API returns 0 matching datasets for the DOI, and we then try to index the resulting empty list, which raises the IndexError.

A better implementation would check whether the list is empty and, if so, raise an exception saying that no datasets were found for the DOI. The message should also explain that collections aren't supported and that the dataset DOI should be used instead.
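Something along these lines might do it. This is only a sketch of the check, not the actual Pooch code: the real downloader uses `requests`, while this version uses the standard library plus an injectable `fetch` so it can be tried offline. The function name `figshare_article_id`, the choice of `ValueError`, and the message wording are all assumptions.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urlopen(url) as response:
        return json.load(response)


def figshare_article_id(doi, fetch=fetch_json):
    """Look up the figshare article ID for a DOI, failing with a clear message."""
    articles = fetch(f"https://api.figshare.com/v2/articles?doi={doi}")
    if not articles:
        # An empty list means the API found no dataset for this DOI.
        message = f"No figshare dataset found for DOI '{doi}'."
        if "figshare.c." in doi:
            # Heuristic from this thread: collection DOIs contain "figshare.c."
            message += (
                " This looks like a figshare collection, which is not "
                "supported. Use the DOI of a dataset in the collection instead."
            )
        raise ValueError(message)
    return articles[0]["id"]
```

With this in place, the collection DOI from the original report would fail with a message pointing to the dataset DOI rather than a bare `IndexError: list index out of range`.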
