Script download and install of data sets #221

bmcfee · 2016-10-26T00:44:53Z

It would be useful to have a way in the config file to automate downloading data sets or other resources without including them in the rpz file.

remram44 · 2016-10-27T14:43:06Z

Should this be in the RPZ file or in a "wrapper" that indicates how to combine RPZ file and data files? The same RPZ file and associated data might live in different storages and each of them might want to link to their own copy of the data.

bmcfee · 2016-10-27T15:15:44Z

I don't have strong opinions about this, but here's one way you might think about it.

Python's setuptools lets you specify optional "extra" dependencies through the extras_require directive. This is often used for things not core to the package's functionality, such as testing or documentation-building, where you might still want to have explicit dependencies in place. For example, I have the following in one of my setup.py scripts:

    extras_require={
        'docs': ['numpydoc', 'sphinx!=1.3.1', 'sphinx_rtd_theme',
                 'matplotlib >= 1.5'],
        'numba': ['numba >= 0.25'],
        'display': ['matplotlib >= 1.5'],
    }

You then install them by saying pip install packagename[docs,numba] (or whatever you want to name them).

I'm not sure how this would play out in RPZ land. I could imagine a simple interface where you install a bare-bones rpz using the normal install procedure, but if you want the heavy-weight optional dependencies (e.g., datasets hosted on s3 or something), you install those with an extra flag, like pip does. These would be treated as external dependencies and not bundled within the rpz, so some part of the config would have to specify how to collect them.
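To make the analogy concrete, here is a minimal sketch, assuming a hypothetical mapping of named groups to external resources. Neither `EXTERNAL_DATA`, `select_extras`, nor the bucket and URL below exist in ReproZip; they only illustrate how a pip-style `name[extra1,extra2]` spec could select which optional data to collect.

```python
# Hypothetical sketch: none of these names are part of ReproZip.
# Named groups of external resources, analogous to extras_require.
EXTERNAL_DATA = {
    "audio": ["s3://example-bucket/audio.tar.gz"],
    "annotations": ["https://example.org/annotations.zip"],
}


def select_extras(spec, available):
    """Parse 'name[extra1,extra2]' and return (name, list of resources)."""
    name, _, rest = spec.partition("[")
    if not rest:
        # No bracket: bare-bones install, no external data requested.
        return name, []
    if not rest.endswith("]"):
        raise ValueError("missing closing bracket in %r" % spec)
    resources = []
    for extra in rest[:-1].split(","):
        extra = extra.strip()
        if not extra:
            continue
        if extra not in available:
            raise KeyError("unknown extra: %s" % extra)
        resources.extend(available[extra])
    return name, resources
```

Calling `select_extras("experiment.rpz[audio]", EXTERNAL_DATA)` would return the package name plus the one location listed under `audio`, while a plain `experiment.rpz` requests nothing extra.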

remram44 · 2016-10-27T16:11:13Z

I actually have a reprounzip[all] for plugins 😉

The difference here is that setup.py lists dependencies by name and not location, and that is my issue here. Optionally the RPZ file could identify these missing input files by hash, but whether the location belongs in the RPZ package (which is more or less meant to be immutable) is up for discussion.
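Identifying an input by hash rather than by location could look roughly like the sketch below. It assumes SHA-256 as the digest (the thread does not name one), and the helper names are hypothetical; only `hashlib` from the standard library is used.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256):
    """True if the file at `path` is the input the package expects."""
    return sha256_of(path) == expected_sha256.lower()
```

With a check like this, any copy of the data, wherever it is stored, can be accepted as long as its digest matches the one recorded at packing time.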

@VickySteeves comments?

fchirigati · 2016-10-31T15:19:49Z

@bmcfee I like the idea of adding "external dependencies" to ReproZip, and this is particularly useful for big datasets that do not need to be packed. I think the specification of these datasets/resources could come in the RPZ file (whoever packed the application informs ReproZip how to obtain these datasets/resources), but it should also be possible for the user, while unpacking, to associate these resources with their own copy of the data. The unpackers could then automatically download the datasets while setting up the environment, if users choose to do so via a flag.
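One way to read this proposal, as a sketch only: the packer records a location for each resource, the user may substitute their own copy, and nothing is fetched unless a flag is set. The function and parameter names below are hypothetical and do not correspond to any existing unpacker option.

```python
def plan_downloads(packed, overrides=None, download=False):
    """Decide where each external resource comes from.

    packed:    {name: location recorded by whoever packed the experiment}
    overrides: {name: the user's own copy}, takes precedence
    download:  opt-in flag; without it nothing is fetched
    Returns {name: location}, with None for resources left unresolved.
    """
    overrides = overrides or {}
    plan = {}
    for name, location in packed.items():
        if name in overrides:
            # The user's own copy always wins over the packed location.
            plan[name] = overrides[name]
        elif download:
            plan[name] = location
        else:
            plan[name] = None
    return plan
```

A resource mapped to `None` is one the unpacker would have to report as missing rather than fetch silently.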

remram44 · 2017-06-30T20:35:51Z

Also related: #220

nuest · 2019-12-20T11:54:04Z

Have you discussed the definition of remote file resources further?

ResearchObjects do this, AFAIK, but I wonder if there is a useful shared "specification" here, e.g. a YAML file that tells a platform "I need access to these remote sources", and then the platform can say if it can manage that (in a performant way). IMO this would be useful for o2r's reproducibility service and Binder's, too. I've thought about this because with remote sensing data, you won't just put 1PB of data into a research compendium (see also this poster).
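A sketch of what such a check might involve, under stated assumptions: the manifest layout is invented here (shown as the dict a YAML loader might produce), and "can the platform manage it" is reduced to whether the platform supports each URL scheme. No existing specification is implied.

```python
from urllib.parse import urlsplit

# Hypothetical manifest, shown as the dict a YAML loader might produce.
MANIFEST = {
    "remote_sources": [
        {"name": "scenes", "url": "s3://example-bucket/scenes/"},
        {"name": "labels", "url": "https://example.org/labels.csv"},
    ]
}


def unsupported_sources(manifest, supported_schemes):
    """Return the names of sources whose URL scheme the platform lacks."""
    return [
        source["name"]
        for source in manifest["remote_sources"]
        if urlsplit(source["url"]).scheme not in supported_schemes
    ]
```

A platform that only speaks HTTPS would get `scenes` back from `unsupported_sources(MANIFEST, {"https"})` and could decline or ask for an alternative source.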

If you're interested in a discussion, I can dig up more. IIRC this was also discussed in the Open Science Infrastructure Working Group calls.

remram44 added the T-enhancement (Type: An enhancement to existing code, or a new feature), C-unpacker (Component: New unpacker or common unpacker functionalities), and good first issue (Newcomers welcome! This should be easy and self-contained enough for someone new to the codebase) labels Oct 27, 2016

remram44 added the A-question (Attention: There is an open question for which more input is requested) label Oct 27, 2016

remram44 removed the good first issue (Newcomers welcome! This should be easy and self-contained enough for someone new to the codebase) label Aug 29, 2018

remram44 assigned VickyRampin Aug 29, 2018

remram44 mentioned this issue Apr 15, 2021

Access/mount data from repositories VIDA-NYU/reproserver#39

Open
