Script download and install of data sets #221

Open
bmcfee opened this issue Oct 26, 2016 · 6 comments
Labels
A-question Attention: There is an open question for which more input is requested C-unpacker Component: New unpacker or common unpacker functionalities T-enhancement Type: An enhancement to existing code, or a new feature

Comments

@bmcfee

bmcfee commented Oct 26, 2016

It would be useful to have a way in the config file to automate downloading data sets or other resources without including them in the rpz file.

@remram44 remram44 added T-enhancement Type: An enhancement to existing code, or a new feature C-unpacker Component: New unpacker or common unpacker functionalities good first issue Newcomers welcome! This should be easy and self-contained enough for someone new to the codebase labels Oct 27, 2016
@remram44
Member

Should this be in the RPZ file or in a "wrapper" that indicates how to combine the RPZ file and the data files? The same RPZ file and associated data might live in different storage locations, and each of them might want to link to its own copy of the data.

@bmcfee
Author

bmcfee commented Oct 27, 2016

I don't have strong opinions about this, but here's one way you might think about it.

Python's setuptools lets you specify optional "extra" dependencies through the extras_require directive. This is often used for things not core to the package's functionality, such as testing or documentation-building, where you might still want to have explicit dependencies in place. For example, I have the following in one of my setup.py scripts:

    extras_require={
        'docs': ['numpydoc', 'sphinx!=1.3.1', 'sphinx_rtd_theme',
                 'matplotlib >= 1.5'],
        'numba': ['numba >= 0.25'],
        'display': ['matplotlib >= 1.5'],
    }

You then install them by saying pip install packagename[docs,numba] (or whatever you want to name them).

I'm not sure how this would play out in RPZ land. I could imagine a simple interface where you can install a bare-bones rpz using the normal install procedure, but if you want the heavy-weight optional dependencies (e.g., datasets hosted on s3 or something), you can install those with an extra flag, as pip does. These would be treated as external dependencies, and not bundled within the rpz, so you'd need some part of the config that specifies how to collect the external dependencies.
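
To make the idea concrete, here is a minimal sketch of what selecting optional external dependencies by name could look like. Everything in it is hypothetical: the `EXTERNAL_DEPENDENCIES` layout, the group names, the example.org URLs, and the `file.rpz[group]` syntax are assumptions for illustration and are not part of ReproZip's config format or command line.

```python
# Hypothetical layout: none of these keys exist in ReproZip's config format.
EXTERNAL_DEPENDENCIES = {
    "data": [
        {"path": "inputs/train.h5", "url": "https://example.org/train.h5"},
    ],
    "models": [
        {"path": "models/weights.bin", "url": "https://example.org/weights.bin"},
    ],
}


def parse_extras(target):
    """Split 'experiment.rpz[data,models]' into the file name and a list of group names."""
    if target.endswith("]") and "[" in target:
        base, _, extras = target[:-1].partition("[")
        return base, [e.strip() for e in extras.split(",") if e.strip()]
    return target, []


def select_extras(spec, requested):
    """Return the resources belonging to the requested groups, in request order."""
    unknown = [name for name in requested if name not in spec]
    if unknown:
        raise KeyError("unknown extras: " + ", ".join(sorted(unknown)))
    selected = []
    for name in requested:
        selected.extend(spec[name])
    return selected
```

With this shape, a plain `experiment.rpz` selects nothing extra, while `experiment.rpz[data]` would select only the resources listed under `data`.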

@remram44
Member

I actually have a reprounzip[all] for plugins 😉

The difference here is that setup.py lists dependencies by name and not location, and that is my issue here. Optionally the RPZ file could identify these missing input files by hash, but putting the location into the RPZ package (more or less meant to be immutable) can be discussed.
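
As a rough sketch of the hash-based option, a file supplied at unpack time could be checked against a digest recorded in the package, wherever the file came from. The choice of SHA-256 and the function names below are assumptions for illustration, not something ReproZip defines.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256):
    """Return True if the local file has the recorded digest (case-insensitive hex)."""
    return sha256_of(path) == expected_sha256.lower()
```

Identifying inputs this way keeps the package itself free of locations, so any mirror holding a file with the right digest would do.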

@VickySteeves comments?

@remram44 remram44 added the A-question Attention: There is an open question for which more input is requested label Oct 27, 2016
@fchirigati
Member

@bmcfee I like the idea of adding "external dependencies" to ReproZip, and this is particularly useful for big datasets that do not need to be packed. I think the specification of these datasets/resources could come in the RPZ file (whoever packed the application informs ReproZip how to obtain these datasets/resources), but it should also be possible for the user, while unpacking, to associate these resources with their own copy of the data. The unpackers could then automatically download the datasets while setting up the environment, if users choose to do so via a flag.
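
The precedence described here (a user-supplied copy first, then an opt-in download, otherwise nothing) could be sketched as follows. The function name, argument names, and the `"user"` / `"download"` / `"missing"` labels are hypothetical and do not correspond to any existing unpacker option.

```python
def plan_resources(declared, overrides=None, download=False):
    """Decide, per resource, where it should come from.

    declared:  {name: url recorded by whoever packed the experiment}
    overrides: {name: location supplied by the user while unpacking}
    download:  whether the user opted in to automatic downloads
    """
    overrides = overrides or {}
    plan = {}
    for name, url in declared.items():
        if name in overrides:
            # The user's own copy always wins over the packaged location.
            plan[name] = ("user", overrides[name])
        elif download:
            plan[name] = ("download", url)
        else:
            # Nothing is fetched unless the user asked for it.
            plan[name] = ("missing", None)
    return plan
```

Keeping the decision separate from the actual transfer means an unpacker could report what it would fetch before touching the network.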

@remram44
Member

Also related: #220

@remram44 remram44 removed the good first issue Newcomers welcome! This should be easy and self-contained enough for someone new to the codebase label Aug 29, 2018
@nuest

nuest commented Dec 20, 2019

Have you discussed the definition of remote file resources further?

ResearchObjects do this, AFAIK, but I wonder if there is a useful shared "specification" here, e.g. a YAML file that tells a platform "I need access to these remote sources", and then the platform can say if it can manage that (in a performant way). IMO this would be useful for o2r's reproducibility service and Binder's, too. I've thought about this because with remote sensing data, you won't just put 1PB of data into a research compendium (see also this poster).
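
A minimal sketch of that exchange, assuming the YAML file has already been parsed into a dictionary: the `remote_sources`, `url`, and `size_bytes` keys are made up for illustration and are not taken from ResearchObjects, o2r, Binder, or any existing specification.

```python
from urllib.parse import urlsplit


def unsupported_sources(manifest, supported_schemes, max_bytes):
    """Return (url, reason) pairs for the sources a platform cannot manage."""
    problems = []
    for source in manifest["remote_sources"]:
        url = source["url"]
        scheme = urlsplit(url).scheme
        if scheme not in supported_schemes:
            problems.append((url, "unsupported scheme: " + scheme))
        elif source.get("size_bytes", 0) > max_bytes:
            problems.append((url, "too large"))
    return problems
```

An empty result would mean the platform can serve every declared source; anything else tells the author which sources need another arrangement.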

If you're interested in a discussion, I can dig up more. IIRC this was also discussed in the Open Science Infrastructure Working Group calls.


5 participants