download a zenodo archive #31

Closed
hansvancalster opened this issue May 25, 2020 · 17 comments

@hansvancalster

We recently wrote a function to download a Zenodo archive. See this discussion. The function is available through a package that bundles useful utilities for our institution, but we feel it would be better placed in a focused package like zen4R.
We have discussed this here, where we also refer to the zenodo package, which has a similar goal to this one but no longer seems actively maintained (though it might still be a good idea to join forces?).

Our question is whether you think the inborutils::download_zenodo() could be a useful addition to your package.

@eblondel
Owner

eblondel commented Jun 1, 2020

Thanks @hansvancalster i'll have a look ASAP

@hansvancalster
Author

Thanks for looking into this. Note that downloading files might also be possible via the API, but I haven't seen examples yet.

As a side-note, I noticed that the github LICENSE file should be changed to be in accordance with the license that is mentioned in the description of the package (and also the license mentioned on your deposit of the package on zenodo):
License: MIT + file LICENSE

@eblondel
Owner

eblondel commented Jun 4, 2020

For the LICENSE: it has been set up this way to fulfil CRAN expectations. The full license text is not included in the LICENSE file because CRAN splits this into 2 separate things: the original license text and the specific license details for the software, which in the case of MIT is the copyright holder. See https://cran.r-project.org/package=zen4R, where you see both MIT and LICENSE links. All R CRAN packages are handled in this way, see e.g. https://cran.r-project.org/package=rdflib

@hansvancalster
Author

Thanks for the explanation. Didn't know that...

@adamhsparks

Just sticking my nose in here to say that I found this package by looking for exactly this functionality. I'd like to be able to download the data archive in R and use the data directly. I can write something to do this, but proper functionality that builds on a package like zen4R would be just what I'm looking for.

@eblondel
Owner

Thanks @adamhsparks, I'm very busy currently, but ASAP I will look into adding a function to download data archives from Zenodo records. Indeed, I have colleagues who are also interested in this feature.

@adamhsparks

No worries. I was just voicing community support for the idea.

@eblondel eblondel self-assigned this Jul 20, 2020
@eblondel eblondel added this to the 0.4 milestone Aug 5, 2020
eblondel added a commit that referenced this issue Aug 6, 2020
eblondel added a commit that referenced this issue Aug 6, 2020
@eblondel
Owner

eblondel commented Aug 6, 2020

Hi all, I've added 2 new functions associated with a ZenodoRecord:

  • listFiles: by default gives you the list of files as a data.frame; set pretty = FALSE to return a raw list
  • downloadFiles: quite similar in logic to the one you mentioned, with changes to adapt it to zen4R and with more flexibility for parallel handling.

By default, downloads are done sequentially and files are saved into the target directory (the default is the current working directory). Since you were handling parallel downloads, I've added some code to do this, with more flexibility in the way parallelism is handled. By default it uses the standard parallel::mclapply, whose arguments can be passed as arguments of the download function.
To make parallel downloads work on platforms that do not implement mclapply (e.g. Windows), or to use other strategies (e.g. a cluster), a custom parallel_handler can be specified. For the cluster approach, you specify the handler, e.g. parLapply, together with the cl cluster object, which you have to create beforehand. This cannot be handled within the function, as the cluster characteristics depend on the user/machine.
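
As a rough illustration of the pluggable-handler idea described above (this is not the zen4R source; the function name dispatch_downloads and its arguments are hypothetical), a dispatcher could forward to whatever handler the user supplies. The sketch assumes that cluster-style handlers such as parallel::parLapply take the cluster as their first argument, while fork-style handlers such as parallel::mclapply take the items first:

```r
# Hypothetical sketch: run `fun` over `items` sequentially or through a
# user-supplied parallel handler. Not the actual zen4R implementation.
dispatch_downloads <- function(items, fun, parallel = FALSE,
                               parallel_handler = NULL, cl = NULL, ...) {
  if (!parallel) {
    # sequential mode: plain lapply
    return(lapply(items, fun))
  }
  if (is.null(parallel_handler)) {
    stop("parallel = TRUE requires a 'parallel_handler'")
  }
  if (!is.null(cl)) {
    # cluster-style handlers (e.g. parallel::parLapply) expect the cluster first
    parallel_handler(cl, items, fun, ...)
  } else {
    # fork-style handlers (e.g. parallel::mclapply) expect the items first;
    # extra arguments such as mc.cores are passed through `...`
    parallel_handler(items, fun, ...)
  }
}

# sequential demo with a stand-in for the per-file download function
sizes <- dispatch_downloads(c("a.zip", "bb.zip"), nchar)
```

With a cluster, the call would look like dispatch_downloads(files, fun, parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl), where cl comes from parallel::makeCluster() and is released afterwards with parallel::stopCluster(cl).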

To reuse zen4R within your package and download files from a particular record, you proceed in 2 steps: first reach the record, then download its files. You can reach the record by ID or DOI (or by concept ID / concept DOI):

library(zen4R)
library(parallel) # provides makeCluster(), parLapply() and stopCluster()

# instantiate the zen4R client
zenodo <- ZenodoManager$new(
   token = "<your_token>",
   logger = "INFO" # use "DEBUG" to see detailed API operation logs, use NULL if you don't want logs at all
)

# reach your record by ID or DOI, or by concept ID / concept DOI (pick one)
rec <- zenodo$getRecordById("<your id>")
rec <- zenodo$getRecordByDOI("<your doi>")
rec <- zenodo$getRecordByConceptId("<your concept id>")
rec <- zenodo$getRecordByConceptDOI("<your concept doi>")

# list files
rec$listFiles()

# download files sequentially
rec$downloadFiles(path = "<my target dir>")

# download files in parallel (standard mclapply on Unix)
rec$downloadFiles(path = "<my target dir>", parallel = TRUE, mc.cores = 4)

# download files in parallel (using a cluster, compatible with Windows)
cl <- makeCluster(4)
rec$downloadFiles(path = "<my target dir>", parallel = TRUE, parallel_handler = parLapply, cl = cl)
stopCluster(cl) # release the workers when done

Last but not least, I did a round of other improvements in zen4R and am planning a CRAN release soon, so let me know if you have comments / suggestions on the new functions.

@adamhsparks

Good timing. I have a workshop coming up at the end of the month where I want to download data from Zenodo. I'll install this and give it a go and let you know if I have any feedback for your CRAN submission.

@florisvdh
Contributor

Thanks for the efforts @eblondel, this seems promising! (I didn't test it yet.) Thanks also for providing sample code!

My colleague @hansvancalster will be back online in the week of 24 Aug. Meanwhile I'll try to have a closer look at this.

At first sight (note that I'm not yet familiar with zen4R) it seems a bit more complex for a basic R user who just wants to download the contents of a published record, compared to the token-less download_zenodo(). But perhaps some steps can still be wrapped in one function to that aim.

  • E.g. a download-function that takes as input (Concept)Id/Doi and downloads the files. But I can understand it if you feel that your modular approach better fits the zen4R approach.
  • Needing a token implies having a Zenodo account, while we have several use-cases which should not need those, i.e. for published records which are downloadable as such. IMO token-less download especially makes sense with regard to the (general) reproducibility of a data-science R script. Perhaps that usecase could be distinguished with another download-function in zen4R? (If for this aim it would be better to integrate download_zenodo() as a separate function, then you're welcome to do so). Do you see other ways?

From your code, I see you also return helpful feedback messages 👍 . In download_zenodo() we also tested the checksums to verify file integrity; would that be an interesting addition according to you?
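
To make the checksum idea concrete, here is a minimal sketch of such an integrity check. It is neither the inborutils nor the zen4R code; the helper name verify_md5 is hypothetical, and it assumes the expected MD5 hash is available as a plain hex string. It relies only on tools::md5sum() from base R, which returns NA for a file that does not exist, so that case is treated as a failed check rather than an error:

```r
# Hypothetical helper: TRUE only if the file exists and its MD5 hash
# matches the expected hex string.
verify_md5 <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path)) # NA when the file is missing
  !is.na(actual) && identical(actual, tolower(expected_md5))
}

# demo on a temporary file with known content ("hello" plus a newline),
# written as raw bytes so the content is identical on every platform
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello\n"), tmp)
ok <- verify_md5(tmp, "b1946ac92492d2347c6235b8d2611184")      # TRUE
missing_ok <- verify_md5(tempfile(), "b1946ac92492d2347c6235b8d2611184") # FALSE
```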

@eblondel
Owner

eblondel commented Aug 7, 2020

Thanks @florisvdh for your feedback,

  • on tokens: zen4R can handle token-less methods (those that do not require a token). The example I showed is a generic one; a token is not required for methods that do not involve CRUD operations. Look at this example:
zenodo = ZenodoManager$new()
rec = zenodo$getRecordByConceptDOI("10.5281/zenodo.2547036")
rec$downloadFiles()
  • on the 'wrapper': zen4R is not download-focused; it provides an R client for all operations supported by Zenodo, with a primary focus on depositing records. The R client is handled by the ZenodoManager, which provides all methods found in the Zenodo API. Of course, if the above code is not short enough, we can provide a function that wraps it.

  • for the checksums I suppose it can be added sure

eblondel added a commit that referenced this issue Aug 7, 2020
@eblondel
Owner

eblondel commented Aug 7, 2020

@florisvdh I've just added a download_zenodo wrapper. Example:

download_zenodo("10.5281/zenodo.2547036")

@eblondel
Owner

eblondel commented Aug 7, 2020

@florisvdh I just added the missing md5sum integrity check.
@florisvdh @hansvancalster If you allow me, I can add you as contributors in the package description. If you have any objection to that, let me know (and if you have an ORCID to add, let me know as well).

eblondel added a commit that referenced this issue Aug 7, 2020
eblondel added a commit that referenced this issue Aug 7, 2020
@florisvdh
Contributor

florisvdh commented Aug 13, 2020

Hi @eblondel, I've taken a closer look. It's great that you added a download_zenodo() wrapper! And it works without a token - nice. You're welcome to add us as contributors (@hansvancalster is absent but I expect the same for him); see here and here for the requested info.

Some further tweaks and fixes are proposed in PR #35.

Further aspects to be discussed / solved IMO are below. Some points have to do with our wish to drop our 'miscellaneous' function inborutils::download_zenodo() in favour of zen4R::download_zenodo(), which explains why I'm inclined to compare both and be as demanding as we were to ourselves 😉 , before doing so. That being said, I should stress that zen4R::download_zenodo() generally works well already!

  • I'm a bit concerned about the use of parallel::mclapply() as a default. While I'm not familiar with this function, from the documentation it seems not advisable to use it on Windows (which many of my colleagues use, as do many users elsewhere):

    It relies on forking and hence is not available on Windows unless mc.cores = 1.

    I hope a default approach can be used which will work out of the box, cross-platform. I used clusterMap() in inborutils::download_zenodo() - the idea came from here - it was tested successfully on at least Linux and Windows. In this thread, another possibility is given. There may be more options.

  • messages:

    • in the referred PR, I added an option quiet (default: FALSE) to suppress all informative messages. I believe that better suits automated tasks, the use inside other functions etc., i.e. where the user does not directly interact with download_zenodo().
    • regarding the messages for human interaction, they can still be made more informative / human-readable I think.
      • in the inborutils function we converted the number of bytes to a human-readable format. You're welcome to recycle the human_filesize() function (here) in zen4R - you could make it internal with @keywords internal (and dropping @export).
      • I really like your addition of providing the absolute download path BTW 👍 Note, in the PR I removed some redundancy here (relevant for multi-file records).
      • beside giving ID / DOI, I'd suggest to provide (the human-readable) record title and version name in the initial message ('Will download x files ...'). It will assist a user in verifying it is indeed the correct record and version.
    • At package level, I wondered why cat() is used instead of message()? The colouring of messages (in e.g. RStudio) makes them well separated from returned values.
  • speed. The following may matter most for very large downloads: zen4R::download_zenodo() appeared to be a little slower than inborutils::download_zenodo(). Besides the somewhat delayed start of the download (about 1-2 s later), the observed download speed was a few percent lower (my case: roughly 6050 vs 6200 kB/s), maybe due to download.file() vs. curl_download(). Anyway, for a 37.5 MiB file, typical timings in my case are below. Perhaps you could compare download speeds through the use of curl_download()?

Code, output, session info
> system.time(
+   inborutils::download_zenodo("10.5281/zenodo.2682323") #doi
+ )
Will download 1 file (total size: 37.5 MiB) from https://doi.org/10.5281/zenodo.2682323 (GRTS master sample for habitat monitoring in Flanders; version: 2)

 [100%] Downloaded 39306606 bytes...

Verifying file integrity...

GRTSmaster_habitats.tif was downloaded and its integrity verified (md5sum: 20de76e1abfbafd6edcc00e1a9cf87a0)
   user  system elapsed 
  1.278   1.959   7.751 
> system.time(
+   download_zenodo("10.5281/zenodo.2682323") #doi
+ )
[zen4R][INFO] ZenodoRecord - Download in sequential mode 
[zen4R][INFO] ZenodoRecord - Will download 1 file from record '2682323' (doi: '10.5281/zenodo.2682323') - total size: 39306606 
[zen4R][INFO] Downloading file 'GRTSmaster_habitats.tif' from record '2682323' (doi: '10.5281/zenodo.2682323') - size: 39306606
trying URL 'https://zenodo.org/api/files/ca78f68d-9753-4223-8115-4b8717760e96/GRTSmaster_habitats.tif'
Content type 'image/tiff' length 39306606 bytes (37.5 MB)
==================================================
downloaded 37.5 MB

[zen4R][INFO] File downloaded at '/media/floris/DATA/git_repositories/zen4R'.
[zen4R][INFO] ZenodoRecord - Verifying file integrity... 
[zen4R][INFO] File 'GRTSmaster_habitats.tif': integrity verified (md5sum: 20de76e1abfbafd6edcc00e1a9cf87a0)
[zen4R][INFO] ZenodoRecord - End of download 
   user  system elapsed 
  0.881   1.398   9.418 
Warning messages:
1: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
Session info
Session info ────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Linux Mint 18.1             
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language (EN)                        
 collate  nl_BE.UTF-8                 
 ctype    nl_BE.UTF-8                 
 tz       Europe/Brussels             
 date     2020-08-13

Packages ────────────────────────────────────────────────────────────────────────
 ! package     * version    date       lib source                          
   assertable    0.2.7      2019-09-21 [1] CRAN (R 3.6.1)                  
   assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                  
   backports     1.1.8      2020-06-17 [1] CRAN (R 3.6.3)                  
   bit           4.0.4      2020-08-04 [1] CRAN (R 3.6.3)                  
   bit64         4.0.2      2020-07-30 [1] CRAN (R 3.6.3)                  
   blob          1.2.1      2020-01-20 [1] CRAN (R 3.6.2)                  
   callr         3.4.3      2020-03-28 [1] CRAN (R 3.6.3)                  
   class         7.3-17     2020-04-26 [4] CRAN (R 3.6.3)                  
   classInt      0.4-3      2020-04-07 [1] CRAN (R 3.6.3)                  
   cli           2.0.2      2020-02-28 [1] CRAN (R 3.6.3)                  
   colorspace    1.4-1      2019-03-18 [1] CRAN (R 3.6.0)                  
   conditionz    0.1.0      2019-04-24 [1] CRAN (R 3.6.3)                  
   crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                  
   crosstalk     1.1.0.1    2020-03-13 [1] CRAN (R 3.6.3)                  
   curl          4.3        2019-12-02 [1] CRAN (R 3.6.2)                  
   data.table    1.13.0     2020-07-24 [1] CRAN (R 3.6.3)                  
   DBI           1.1.0      2019-12-15 [1] CRAN (R 3.6.2)                  
   desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                  
   devtools      2.3.1      2020-07-21 [1] CRAN (R 3.6.3)                  
   digest        0.6.25     2020-02-23 [1] CRAN (R 3.6.3)                  
   dplyr         1.0.1      2020-07-31 [1] CRAN (R 3.6.3)                  
   drat          0.1.8      2020-07-18 [1] CRAN (R 3.6.3)                  
   e1071         1.7-3      2019-11-26 [1] CRAN (R 3.6.2)                  
   ellipsis      0.3.1      2020-05-15 [1] CRAN (R 3.6.3)                  
   evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.1)                  
   fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.2)                  
   fs            1.5.0      2020-07-31 [1] CRAN (R 3.6.3)                  
   generics      0.0.2      2018-11-29 [1] CRAN (R 3.6.0)                  
   geoaxe        0.1.0      2016-02-19 [1] CRAN (R 3.6.0)                  
   ggplot2       3.3.2      2020-06-19 [1] CRAN (R 3.6.3)                  
   glue          1.4.1      2020-05-13 [1] CRAN (R 3.6.3)                  
   gtable        0.3.0      2019-03-25 [1] CRAN (R 3.6.0)                  
   hms           0.5.3      2020-01-08 [1] CRAN (R 3.6.2)                  
   htmltools     0.5.0      2020-06-16 [1] CRAN (R 3.6.3)                  
   htmlwidgets   1.5.1      2019-10-08 [1] CRAN (R 3.6.1)                  
   httr          1.4.2      2020-07-20 [1] CRAN (R 3.6.3)                  
   inborutils    0.1.0.9086 2020-07-10 [1] Github (inbo/inborutils@e07eec1)
   iterators     1.0.12     2019-07-26 [1] CRAN (R 3.6.1)                  
   jsonlite      1.7.0      2020-06-25 [1] CRAN (R 3.6.3)                  
   KernSmooth    2.23-17    2020-04-26 [4] CRAN (R 3.6.3)                  
   keyring       1.1.0      2018-07-16 [1] CRAN (R 3.6.3)                  
   knitr         1.29       2020-06-23 [1] CRAN (R 3.6.3)                  
   lattice       0.20-41    2020-04-02 [4] CRAN (R 3.6.3)                  
   lazyeval      0.2.2      2019-03-15 [1] CRAN (R 3.6.3)                  
   leaflet       2.0.3      2019-11-16 [1] CRAN (R 3.6.2)                  
   lifecycle     0.2.0      2020-03-06 [1] CRAN (R 3.6.3)                  
   lubridate     1.7.9      2020-06-08 [1] CRAN (R 3.6.3)                  
   magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)                  
   memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                  
   munsell       0.5.0      2018-06-12 [1] CRAN (R 3.6.0)                  
   oai           0.3.0      2019-09-07 [1] CRAN (R 3.6.1)                  
   odbc          1.2.3      2020-06-18 [1] CRAN (R 3.6.3)                  
   packrat       0.5.0      2018-11-14 [1] CRAN (R 3.6.0)                  
   pillar        1.4.6      2020-07-10 [1] CRAN (R 3.6.3)                  
   pkgbuild      1.1.0      2020-07-13 [1] CRAN (R 3.6.3)                  
   pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 3.6.1)                  
   pkgload       1.1.0      2020-05-29 [1] CRAN (R 3.6.3)                  
   plyr          1.8.6      2020-03-03 [1] CRAN (R 3.6.3)                  
   prettyunits   1.1.1      2020-01-24 [1] CRAN (R 3.6.3)                  
   processx      3.4.3      2020-07-05 [1] CRAN (R 3.6.3)                  
   ps            1.3.3      2020-05-08 [1] CRAN (R 3.6.3)                  
   purrr         0.3.4      2020-04-17 [1] CRAN (R 3.6.3)                  
   R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.2)                  
   Rcpp          1.0.5      2020-07-06 [1] CRAN (R 3.6.3)                  
   readr         1.3.1      2018-12-21 [1] CRAN (R 3.6.2)                  
   remotes       2.2.0      2020-07-21 [1] CRAN (R 3.6.3)                  
   rgbif         3.2.0      2020-07-23 [1] CRAN (R 3.6.3)                  
   rgeos         0.5-3      2020-05-08 [1] CRAN (R 3.6.3)                  
   rlang         0.4.7      2020-07-09 [1] CRAN (R 3.6.3)                  
   rmarkdown     2.3        2020-06-18 [1] CRAN (R 3.6.3)                  
   rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                  
   RSQLite       2.2.0      2020-01-07 [1] CRAN (R 3.6.2)                  
   rstudioapi    0.11       2020-02-07 [1] CRAN (R 3.6.3)                  
   scales        1.1.1      2020-05-11 [1] CRAN (R 3.6.3)                  
   sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                  
   sf            0.9-5      2020-07-14 [1] CRAN (R 3.6.3)                  
   sp            1.4-2      2020-05-20 [1] CRAN (R 3.6.3)                  
   stringi       1.4.6      2020-02-17 [1] CRAN (R 3.6.3)                  
   stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                  
   testthat    * 2.3.2      2020-03-02 [1] CRAN (R 3.6.3)                  
   tibble        3.0.3      2020-07-10 [1] CRAN (R 3.6.3)                  
   tidyr         1.1.1      2020-07-31 [1] CRAN (R 3.6.3)                  
   tidyselect    1.1.0      2020-05-11 [1] CRAN (R 3.6.3)                  
   units         0.6-7      2020-06-13 [1] CRAN (R 3.6.3)                  
   usethis       1.6.1      2020-04-29 [1] CRAN (R 3.6.3)                  
   uuid          0.1-4      2020-02-26 [1] CRAN (R 3.6.3)                  
   vctrs         0.3.2      2020-07-15 [1] CRAN (R 3.6.3)                  
   whisker       0.4        2019-08-28 [1] CRAN (R 3.6.1)                  
   withr         2.2.0      2020-04-20 [1] CRAN (R 3.6.3)                  
   xfun          0.16       2020-07-24 [1] CRAN (R 3.6.3)                  
   xml2          1.3.2      2020-04-23 [1] CRAN (R 3.6.3)                  
   yaml          2.2.1      2020-02-01 [1] CRAN (R 3.6.2)                  
 P zen4R       * 0.4        2020-08-11 [?] local                           

[1] /home/floris/lib/R/library
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

 P ── Loaded and on-disk path mismatch.

The difference in elapsed time is especially noticeable for small downloads, e.g. the 56.9 KiB (2 files) from "10.5281/zenodo.3378733" (zen4R 0.3) took about 1.5 s and 3.5 s respectively, for the whole function to execute.

  • warnings: as can be seen from the above output, the below warning occurs multiple times:
Warning message:
In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables

The warning appears to come from the keyring package, and maybe it could be solved at system configuration level (i.e. in the user's system). Anyhow, for a token-less approach (download_zenodo()) and from a 'simple' user's perspective this seems unwanted / unneeded so it would be nice if you could prevent the warnings from happening. (A user just using download_zenodo() shouldn't be bothered about backends and secrets.) When done step by step, the warnings seem to come from ZenodoManager$new() and zenodo$getRecordByConceptDOI(), not from rec$downloadFiles():

> zenodo = ZenodoManager$new()
Warning message:
In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
> rec = zenodo$getRecordByConceptDOI("10.5281/zenodo.2547036")
Warning messages:
1: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables

BTW get_versions() is also an interesting function! It too throws many warnings however (the last is about a locale), which probably originate from elsewhere:

Warnings by get_versions()
> get_versions("10.5281/zenodo.2547036")
        date  version                    doi
1 2019-01-22 0.1-beta 10.5281/zenodo.2547037
2 2019-06-03      0.1 10.5281/zenodo.3238351
3 2019-08-02      0.2 10.5281/zenodo.3358590
4 2019-08-27      0.3 10.5281/zenodo.3378733
Warning messages:
1: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
4: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
5: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
6: In Sys.setlocale("LC_TIME", "us_US") :
  OS reports request to set locale to "us_US" cannot be honored
  • finally, some oddities that I noticed with specific Zenodo records:
    • Zenodo record of zen4R:
      • download_zenodo("10.5281/zenodo.2547036"): this is the concept DOI --> however it resolves to version 0.2 and not to the latest (inborutils::download_zenodo() however does resolve to the latest (currently 0.3))
    • Zenodo record of another package: it appears that the GitHub-Zenodo webhook by default makes the item (file) names on the Zenodo website appear as 'organisation/reponame-version.zip' (see here for an example), while the downloaded file is just named 'reponame-version.zip'. The slash appears not to trouble inborutils::download_zenodo() but an error occurs in zen4R::download_zenodo().
      • example below. From the output it appears as if the file is downloaded, but that is not the case.
Example
> download_zenodo("10.5281/zenodo.3630532") # files with '/' in their label at Zenodo fail
[zen4R][INFO] ZenodoRecord - Download in sequential mode 
[zen4R][INFO] ZenodoRecord - Will download 1 file from record '3836625' (doi: '10.5281/zenodo.3836625') - total size: 99960 
[zen4R][INFO] Downloading file 'inbo/watina-v0.3.0.zip' - size: 99960
[zen4R][INFO] File downloaded at '/media/floris/DATA/git_repositories/zen4R'.
[zen4R][INFO] ZenodoRecord - Verifying file integrity... 
Error in if (target_file_md5sum == file$checksum) { : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
4: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
5: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
6: In download.file(url = file$links$download, destfile = target_file,  :
  URL https://zenodo.org/api/files/28df5d2b-40f5-43d6-a2f0-822ec2270733/inbo/watina-v0.3.0.zip: cannot open destfile './inbo/watina-v0.3.0.zip', reason 'Bestand of map bestaat niet'
7: In download.file(url = file$links$download, destfile = target_file,  :
  download had nonzero exit status
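
For reference, the byte-to-human-readable conversion mentioned under the messages item could look roughly like the sketch below. The function name format_bytes is hypothetical; this is not the inborutils human_filesize() code, only a minimal version assuming binary (1024-based) units and one decimal place:

```r
# Hypothetical sketch: format a byte count with binary (1024-based) units.
format_bytes <- function(bytes) {
  units <- c("B", "KiB", "MiB", "GiB", "TiB")
  value <- bytes
  idx <- 1
  # divide by 1024 until the value fits the current unit (or units run out)
  while (value >= 1024 && idx < length(units)) {
    value <- value / 1024
    idx <- idx + 1
  }
  if (idx == 1) {
    sprintf("%d %s", as.integer(value), units[idx]) # whole bytes, no decimals
  } else {
    sprintf("%.1f %s", value, units[idx])
  }
}

format_bytes(39306606) # "37.5 MiB", the GRTSmaster_habitats.tif size above
format_bytes(99960)    # "97.6 KiB"
```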

@eblondel
Owner

eblondel commented Sep 1, 2020

@florisvdh I've carefully read your notes/requirements and updated the R code accordingly:

  • parallel: I've removed the default handler. The user is responsible for supplying a parallel handler, cluster-based or not. I've added examples in download_zenodo for the 3 approaches (sequential, parallel/cluster, parallel/mclapply).

  • messages:

    • cat is used because its output can be captured in log files. AFAIK that is not the case for message.
    • I merged your pull request with your suggestions to improve messaging
    • human_filesize added & recycled
  • speed: in zen4R, we first search for the record based on the DOI; the small delay at start-up is due to this. Some improvement may come in a later release/revision if some code optimization turns out to be possible.

  • warnings: suppressed on ZenodoManager in case of download_zenodo and get_versions

  • about:

download_zenodo("10.5281/zenodo.2547036"): this is the concept DOI --> however it resolves to version 0.2 and not to the latest (inborutils::download_zenodo() however does resolve to the latest (currently 0.3))

Fixed in #39 (will see if there is room for improvement based on Zenodo API in next release)

  • about:
  • Zenodo record of another package: it appears that the GitHub-Zenodo webhook by default makes the item (file) names on the Zenodo website appear as 'organisation/reponame-version.zip' (see here for an example), while the downloaded file is just named 'reponame-version.zip'. The slash appears not to trouble inborutils::download_zenodo() but an error occurs in zen4R::download_zenodo().

Fixed here in #31
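
On the warnings item, one generic way to silence only a known, harmless warning while letting every other warning through is a calling handler that muffles by message pattern. The sketch below is not what zen4R does internally (the thread does not show that code); the helper name muffle_matching_warnings is hypothetical, and the demo function only simulates the keyring message:

```r
# Hypothetical helper: evaluate `expr`, muffling only warnings whose
# message matches `pattern`; all other warnings propagate as usual.
muffle_matching_warnings <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# stand-in for a call that emits the backend warning and returns a value
noisy_call <- function() {
  warning("Selecting 'env' backend. Secrets are stored in environment variables")
  42
}

result <- muffle_matching_warnings(noisy_call(), "Selecting .env. backend")
```

Compared with a blanket suppressWarnings(), this keeps unrelated warnings (such as the locale one) visible to the user.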

Since many changes were made in the current milestone, I'm going to initiate a CRAN release for 0.4.
With the new download_zenodo, you should get most of the features you had in your initial function.

Best

@eblondel eblondel closed this as completed Sep 1, 2020
@florisvdh
Contributor

florisvdh commented Sep 2, 2020

Thanks for the follow-up @eblondel ! 👍 Thank you for providing a solution to the parallel download.

I prepared a small PR (#40) for you to get rid of the extra dependencies. Looking forward to the CRAN release!

Below is current behaviour, which works well indeed.

Some things to track later, if you like:

  • note that the warnings aren't gone
  • no feedback messages at all are returned when using a parallel approach; I'm not sure whether that was the case before.
  • the 'slashed' cases (e.g. inbo/watina from "10.5281/zenodo.3630532") would best drop the prefix in the filename (watina-version.zip rather than inbo_watina-version.zip), because that's what Zenodo does when downloading the file from the website.
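
The two naming options for such 'slashed' file labels can be compared with base R alone. This is only an illustration of the string handling, with hypothetical helper names, not the zen4R code:

```r
# Hypothetical helpers: two ways to derive a local file name from a
# Zenodo file label that contains a slash.

# drop the 'organisation/' prefix, as the Zenodo website download does
drop_prefix <- function(label) basename(label)

# keep the prefix but replace the slash, as in the output below
flatten_label <- function(label) gsub("/", "_", label, fixed = TRUE)

drop_prefix("inbo/watina-v0.3.0.zip")   # "watina-v0.3.0.zip"
flatten_label("inbo/watina-v0.3.0.zip") # "inbo_watina-v0.3.0.zip"
```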
Code and output
> download_zenodo("10.5281/zenodo.2547036")
[zen4R][INFO] ZenodoRecord - Download in sequential mode 
[zen4R][INFO] ZenodoRecord - Will download 2 files from record '3378733' (doi: '10.5281/zenodo.3378733') - total size: 56.9 KiB 
[zen4R][INFO] Downloading file 'zen4R-0.3.tar.gz' - size: 24.8 KiB
trying URL 'https://zenodo.org/api/files/c8a4b50b-27ce-4a03-85aa-27c631219b98/zen4R-0.3.tar.gz'
Content type 'application/octet-stream' length 25350 bytes (24 KB)
==================================================
downloaded 24 KB

[zen4R][INFO] Downloading file 'zen4R-0.3.zip' - size: 32.2 KiB
trying URL 'https://zenodo.org/api/files/c8a4b50b-27ce-4a03-85aa-27c631219b98/zen4R-0.3.zip'
Content type 'application/octet-stream' length 32957 bytes (32 KB)
==================================================
downloaded 32 KB

[zen4R][INFO] Files downloaded at '/media/floris/DATA/git_repositories/zen4R'.
[zen4R][INFO] ZenodoRecord - Verifying file integrity... 
[zen4R][INFO] File 'zen4R-0.3.tar.gz': integrity verified (md5sum: 66c585a0398d81b741c19029292c7e3f)
[zen4R][INFO] File 'zen4R-0.3.zip': integrity verified (md5sum: be1ce3a0e52f83ab1c42fa058d6b5451)
[zen4R][INFO] ZenodoRecord - End of download 
Warning messages:
1: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
4: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
5: In Sys.setlocale("LC_TIME", "us_US") :
  OS reports request to set locale to "us_US" cannot be honored
6: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
7: In default_backend_auto() :
  Selecting ‘env’ backend. Secrets are stored in environment variables
> download_zenodo("10.5281/zenodo.3630532")
[zen4R][INFO] ZenodoRecord - Download in sequential mode 
[zen4R][INFO] ZenodoRecord - Will download 1 file from record '3836625' (doi: '10.5281/zenodo.3836625') - total size: 97.6 KiB 
[zen4R][INFO] Downloading file 'inbo_watina-v0.3.0.zip' - size: 97.6 KiB
trying URL 'https://zenodo.org/api/files/28df5d2b-40f5-43d6-a2f0-822ec2270733/inbo/watina-v0.3.0.zip'
Content type 'application/octet-stream' length 99960 bytes (97 KB)
==================================================
downloaded 97 KB

[zen4R][INFO] File downloaded at '/media/floris/DATA/git_repositories/zen4R'.
[zen4R][INFO] ZenodoRecord - Verifying file integrity... 
[zen4R][INFO] File 'inbo_watina-v0.3.0.zip': integrity verified (md5sum: 4c0f952cbd1e70195f957688428af960)
[zen4R][INFO] ZenodoRecord - End of download 
Warning messages:
1: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
4: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
5: In Sys.setlocale("LC_TIME", "us_US") :
  OS reports request to set locale to "us_US" cannot be honored
6: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
7: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
> download_zenodo("10.5281/zenodo.2547036", 
+                 parallel = TRUE, parallel_handler = parLapply, cl = makeCluster(2))
[zen4R][INFO] ZenodoRecord - Download in parallel mode 
Error in rec$downloadFiles(path = path, quiet = quiet, ...) : 
  object 'parLapply' not found
In addition: Warning messages:
1: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
2: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
3: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
4: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
5: In Sys.setlocale("LC_TIME", "us_US") :
  OS reports request to set locale to "us_US" cannot be honored
6: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
7: In default_backend_auto() :
  Selecting 'env' backend. Secrets are stored in environment variables
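
A note on the `object 'parLapply' not found` error above: the session info below lists only testthat and zen4R as attached, so the most likely cause is that the `parallel` package was not attached, which leaves the bare names `parLapply` and `makeCluster` unresolved. A minimal sketch of the workaround follows. `run_with_handler` is a hypothetical stand-in that only mimics passing a handler and a cluster through to a worker function; it is not zen4R's implementation.

```r
# Hypothetical stand-in for a function that accepts a parallel handler,
# mirroring the arguments used in the call above (not zen4R's actual code).
run_with_handler <- function(items, parallel_handler, cl) {
  parallel_handler(cl, items, function(x) paste0("fetched-", x))
}

cl <- parallel::makeCluster(2)

# Qualify the handler with its namespace so that it is found even when
# the 'parallel' package has not been attached with library(parallel).
result <- run_with_handler(
  c("a", "b"),
  parallel_handler = parallel::parLapply,
  cl = cl
)

# Always release the worker processes when done.
parallel::stopCluster(cl)

print(unlist(result))
```

Under the same assumption, the call from the transcript would become `download_zenodo("10.5281/zenodo.2547036", parallel = TRUE, parallel_handler = parallel::parLapply, cl = parallel::makeCluster(2))`; calling `library(parallel)` beforehand should work equally well.
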
Session info
Session info ──────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Linux Mint 20               
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language nl_BE:nl                    
 collate  nl_BE.UTF-8                 
 ctype    nl_BE.UTF-8                 
 tz       Europe/Brussels             
 date     2020-09-02                  

Packages ──────────────────────────────────────────────────────────────────────────────────────────
 ! package     * version date       lib source        
   assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
   backports     1.1.8   2020-06-17 [1] CRAN (R 4.0.2)
   callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.2)
   cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.2)
   crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
   desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
   devtools      2.3.1   2020-07-21 [1] CRAN (R 4.0.2)
   digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.2)
   ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
   fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
   fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
   glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.2)
   httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
   jsonlite      1.7.0   2020-06-25 [1] CRAN (R 4.0.2)
   keyring       1.1.0   2018-07-16 [1] CRAN (R 4.0.2)
   magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.2)
   memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
   pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
   pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
   prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
   processx      3.4.3   2020-07-05 [1] CRAN (R 4.0.2)
   ps            1.3.4   2020-08-11 [1] CRAN (R 4.0.2)
   R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.2)
   remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
   rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
   rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.2)
   rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.2)
   sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
   testthat    * 2.3.2   2020-03-02 [1] CRAN (R 4.0.2)
   usethis       1.6.1   2020-04-29 [1] CRAN (R 4.0.2)
   withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.2)
   xml2          1.3.2   2020-04-23 [1] CRAN (R 4.0.2)
 R zen4R       * 0.4     <NA>       [?] <NA>          

[1] /home/floris/lib/R/library
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

 R ── Package was removed from disk.

@eblondel
Owner

eblondel commented Sep 2, 2020

Thanks, I've made a slight change regarding the 'slashed' case.

Warnings: I'll check later what's happening there. They don't show up on my side.
Parallel and messages: there's no cross-platform solution, and I didn't find any way to make it work. See https://stackoverflow.com/questions/16717461/how-can-i-print-or-cat-when-using-parallel

zen4R 0.4 has just been submitted to the CRAN team for review.
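
On the point above about messages in parallel mode: one commonly suggested approach is the `outfile` argument of `parallel::makeCluster()`, where `outfile = ""` asks the workers not to redirect their stdout/stderr. The sketch below is illustrative only and does not use zen4R. Whether the worker messages actually appear depends on the platform and front end (they are typically visible in a terminal session but not in every GUI), which is consistent with there being no cross-platform solution.

```r
# outfile = "" asks the workers not to redirect their output;
# by default it is sent to the null device and therefore lost.
cl <- parallel::makeCluster(2, outfile = "")

res <- parallel::parLapply(cl, 1:2, function(i) {
  # This message is emitted in the worker process, not in the main session.
  message("worker handling item ", i)
  i
})

# Always release the worker processes when done.
parallel::stopCluster(cl)

print(unlist(res))
```

The return values come back to the main session regardless of whether the messages are shown, so a fallback is to collect any status text in the worker's return value and print it afterwards.
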
